# Homework Draft: CalEnviroScreen

---

In this homework, students will gain experience with fundamental Exploratory Data Analysis and model exploration using the CalEnviroScreen data. This homework will build on methods introduced in lab. It will also serve as an application of data science in the field of social sciences and **environmental justice**. According to state law, environmental justice refers to the "fair treatment of people of all races, cultures, and incomes with respect to the development, adoption, implementation and enforcement of environmental laws, regulations, and policies." 

By the end of this homework, students will be able to:
- Perform basic tabular analysis using pandas and interpret results
- Visualize and analyze CalEnviroScreen data
- Identify how data-driven decision making can guide policy and resource allocation

## Table of Contents

1. [Introduction](#introduction)
2. [A Closer Look at Census Tracts and Regional Data](#a-closer-look-at-census-tracts-and-regional-data)
3. [Visualizing the Data](#visualizing-the-data)
4. [Data-Driven Decision Making](#data-driven-decision-making)

### Import Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd

---

## 1. Introduction


The [California Communities Environmental Health Screening Tool](https://oehha.ca.gov/calenviroscreen) (CalEnviroScreen) provides accessible demographic and environmental information to identify communities that are susceptible to certain types of pollution. This tool utilizes environmental, health, and socioeconomic information to produce scores for every census tract in California, allowing us to compare qualities of different communities. 

### 1.1 Reading in CalEnviroScreen Data

To begin exploring CalEnviroScreen, run the following cell to read in the data.

In [3]:
# Read in the data

ces = pd.read_csv('enviro.csv')
ces.head()

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Approximate Location,Longitude,Latitude,CES 4.0 Score,CES 4.0 Percentile,CES 4.0 Percentile Range,...,Linguistic Isolation Pctl,Poverty,Poverty Pctl,Unemployment,Unemployment Pctl,Housing Burden,Housing Burden Pctl,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl
0,6019001100,2780,Fresno,93706,Fresno,-119.781696,36.709695,93.18,100.0,95-100% (highest scores),...,79.37,76.0,98.92,12.8,93.83,30.3,91.04,93.16,9.66,99.72
1,6077000700,4680,San Joaquin,95206,Stockton,-121.287873,37.943173,86.65,99.99,95-100% (highest scores),...,95.53,73.2,98.39,19.8,99.21,31.2,92.28,93.17,9.66,99.74
2,6037204920,2751,Los Angeles,90023,Los Angeles,-118.197497,34.0175,82.39,99.97,95-100% (highest scores),...,81.55,62.6,93.39,6.4,61.53,20.3,63.97,83.75,8.69,95.79
3,6019000700,3664,Fresno,93706,Fresno,-119.827707,36.734535,81.33,99.96,95-100% (highest scores),...,78.71,65.7,95.35,15.7,97.35,35.4,96.41,94.64,9.82,99.89
4,6019000200,2689,Fresno,93706,Fresno,-119.805504,36.735491,80.75,99.95,95-100% (highest scores),...,86.56,72.7,98.3,13.7,95.29,32.7,94.16,95.4,9.9,99.95


Before we begin any tabular analysis, let's familiarize ourselves with this DataFrame by looking at its dimensions and the data types of its columns. Run the following code cell to print the dimensions of `ces`.

In [6]:
# Run this cell

print(ces.shape)

(2310, 59)


Now, let's take a look at some of the columns of our DataFrame. Run the code cell below:

In [8]:
# Run this cell

print(ces.columns)

Index(['Census Tract', 'Total Population', 'California County', 'ZIP',
       'Approximate Location', 'Longitude', 'Latitude', 'CES 4.0 Score',
       ' CES 4.0 Percentile', 'CES 4.0 Percentile Range', 'DAC category',
       'Ozone', 'Ozone Pctl', 'PM2.5', 'PM2.5 Pctl', 'Diesel PM',
       'Diesel PM Pctl', 'Drinking Water', 'Drinking Water Pctl', 'Lead',
       'Lead Pctl', 'Pesticides', 'Pesticides Pctl', 'Tox. Release',
       'Tox. Release Pctl', 'Traffic', 'Traffic Pctl', 'Cleanup Sites',
       'Cleanup Sites Pctl', 'Groundwater Threats', 'Groundwater Threats Pctl',
       'Haz. Waste', 'Haz. Waste Pctl', 'Imp. Water Bodies',
       'Imp. Water Bodies Pctl', 'Solid Waste', 'Solid Waste Pctl',
       'Pollution Burden', 'Pollution Burden Score', 'Pollution Burden Pctl',
       'Asthma', 'Asthma Pctl', 'Low Birth Weight', 'Low Birth Weight Pctl',
       'Cardiovascular Disease', 'Cardiovascular Disease Pctl', 'Education',
       'Education Pctl', 'Linguistic Isolation', 'Linguistic

Let's also examine the data types of some of our columns:

In [12]:
# Run this cell to return the data type of the first 10 columns of the DataFrame

ces.dtypes[:10]

Census Tract                  int64
Total Population              int64
California County            object
ZIP                           int64
Approximate Location         object
Longitude                   float64
Latitude                    float64
CES 4.0 Score               float64
 CES 4.0 Percentile         float64
CES 4.0 Percentile Range     object
dtype: object
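Notice that ` CES 4.0 Percentile` appears with a leading space in the output above. Column names must match exactly when indexing, so it can help to check for stray whitespace. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and its values are illustrative assumptions and are not part of `ces`.

```python
import pandas as pd

# Toy DataFrame whose second column name has a leading space (values are made up)
toy = pd.DataFrame({"Score": [10.0, 20.0], " Percentile": [50.0, 90.0]})

# List any column names that differ from their whitespace-stripped version
padded = [name for name in toy.columns if name != name.strip()]
print(padded)

# Indexing needs the exact name, including the space
print(toy[" Percentile"].tolist())

# "Percentile" (no space) is not a column in this toy DataFrame
print("Percentile" in toy.columns)
```

The later cells in this notebook refer to the column by its original name, so the sketch only detects the whitespace and does not rename anything.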

**Question 1.1a:** Looking at this result, we can see that we will be primarily working with numerical data. Now that we've familiarized ourselves with the data, describe the granularity of `ces`. In other words, what does each row of the DataFrame represent? Refer back to the [CalEnviroScreen](https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-40) website as needed.

*YOUR ANSWER HERE...*

**Question 1.1b:** Using the [CalEnviroScreen](https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-40) website as a reference, describe **in your own words** what the `Pollution Burden Pctl` column in `ces` represents. What is the difference between a high score and a low score?

*YOUR ANSWER HERE...*

### 1.2 Reading in Community College Data

Let's also read in data containing information about the locations of community colleges in California. We will familiarize ourselves with this data the same way we did for `ces`.

In [17]:
# Read in the data

colleges = pd.read_excel("College_codes_EVDtype.xlsx")
colleges.head()

Unnamed: 0,OPEID,College,City,State,Zip,yrs,EVDCode
0,111100,ALLAN HANCOCK COLLEGE,SANTA MARIA,CA,93454,2,1
1,111300,ANTELOPE VALLEY COLLEGE,LANCASTER,CA,93534,2,1
2,111500,ARMSTRONG UNIVERSITY,BERKELEY,CA,94704,4,4
3,111600,ART CENTER COLLEGE OF DES,PASADENA,CA,91103,4,4
4,111700,AZUSA PACIFIC UNIVERSITY,AZUSA,CA,91702,4,4


**Question 1.2a:** What are the dimensions of the `colleges` dataset? Fill in the code cell with the necessary code and print your answer.

In [None]:
# TODO: Fill in the ellipses

shape = ... 
print(shape)

**Question 1.2b:** What are the columns of the `colleges` dataset?

In [None]:
# TODO: Write code to print the columns of the colleges dataset

print(...)

**Question 1.2c:** What is the granularity of the `colleges` dataset? What does each row represent? 

*YOUR ANSWER HERE...*

### 1.3 Merging `ces` and `colleges`

Now that we have an understanding of both of our DataFrames, we will merge these two DataFrames to analyze the socioeconomic conditions of various community colleges in California using the [`pd.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function in pandas.

To use `pd.merge()`, we pass the following arguments to the function:
- The first DataFrame being merged
- The second DataFrame being merged
- **how** indicates the type of merge to be used
- **left_on** and **right_on** parameters are assigned to the string names of the columns to be used when performing the join. These two on parameters tell pandas what values should act as pairing keys to determine which rows to merge across the DataFrames. We’ll talk more about this idea of a pairing key next lecture.

We will be using an *inner* merge, which uses the intersection of keys from both DataFrames (similar to a SQL inner join) and preserves the order of the left keys.

In [19]:
# Keep only colleges whose EVDCode is not 4, then merge ces with them on ZIP code

filtered_colleges = colleges[colleges['EVDCode'] != 4]
ces_cc = pd.merge(ces, filtered_colleges, how='inner', left_on='ZIP', right_on='Zip')
ces_cc.shape

(208, 66)

Notice that after joining both of our DataFrames, we are only left with 208 rows. It is important to keep track of the characteristics of our data as we perform manipulations. 
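To see why an inner merge changes the number of rows, here is a minimal sketch on made-up data. The DataFrames `tracts` and `schools` and all of their values are hypothetical stand-ins for `ces` and `colleges`; only the `pd.merge()` call mirrors the cell above.

```python
import pandas as pd

# Toy stand-ins for `ces` and `colleges` (all values are made up)
tracts = pd.DataFrame({
    "Census Tract": [1, 2, 3, 4],
    "ZIP": [90001, 90001, 90002, 90003],
})
schools = pd.DataFrame({
    "College": ["College A", "College B"],
    "Zip": [90001, 90004],
})

# Inner merge: keep only rows whose key appears in BOTH DataFrames
merged = pd.merge(tracts, schools, how="inner", left_on="ZIP", right_on="Zip")

print(tracts.shape, schools.shape, merged.shape)
print(merged)
```

In this toy case, College A appears twice because two tracts share its ZIP code, while College B and tracts 3 and 4 drop out because their keys have no match. Both key columns (`ZIP` and `Zip`) are kept because their names differ.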

---

## 2. A Closer Look at Census Tracts and Regional Data

**Question 4:** Find the most polluted ZIP codes and show the colleges located there using the code cell below.

In [None]:
# TODO: Write code to find the most polluted zip codes and display the colleges

ces_cc.sort_values(by='CES 4.0 Score', ascending=False).head(10)

**Question 5:** Find the least polluted ZIP codes and show the colleges located there using the code cell below.

In [None]:
# TODO: Write code to find the least polluted zip codes and display the colleges

ces_cc.sort_values(by='CES 4.0 Score', ascending=True).head(10)

Now that we have become more familiar with the data, let's take a closer look at the census tract for El Camino College. The relevant tract number is **6037603702**.

**Question 6:** Filter the dataset for this tract number using the code cell below.

In [None]:
# TODO: Fill in the ellipses

ecc = ces[...]
ecc

**Question 7:** Based on this filtered data, let's examine three new measures of environmental health and interpret the scores for El Camino College. Before writing any code let's establish what these measures are. Refer back to the CalEnviroScreen website for context on these three health measures in the data and write definitions for each. 

- `PM2.5 Pctl`: *YOUR ANSWER HERE...*
- `Asthma Pctl`: *YOUR ANSWER HERE...*
- `CES 4.0 Percentile`: *YOUR ANSWER HERE...*

Now, write code in the cell below to obtain the values for these metrics, as well as `Pollution Burden Pctl`, for the **El Camino College** census tract. 

In [None]:
# TODO: Fill in the ellipses

pm25_pctl = ...
asthma_pctl = ...
ces4_pctl = ...
plltn_burden_pctl = ...

Briefly describe the real-world implications of at least two of these scores.

*YOUR ANSWER HERE...*

Let's compare these four measures across other census tracts in **Los Angeles**. 

**Question 8:** Write code to filter the `ces_cc` dataset to contain only Los Angeles county data. Additionally, make sure only the following columns are included in this new dataset: `Approximate Location`, `PM2.5 Pctl`, ` CES 4.0 Percentile`, `Asthma Pctl`, and `Pollution Burden Pctl`.

Use the exact column names, including the leading space in ` CES 4.0 Percentile`.
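As a reminder of the general pattern, the sketch below filters rows with a boolean mask and keeps only a list of columns. It runs on a made-up DataFrame; the name `toy`, its column names, and its values are all hypothetical and do not come from the CalEnviroScreen data.

```python
import pandas as pd

# Toy DataFrame (column names and values are made up for illustration)
toy = pd.DataFrame({
    "County": ["Alpha", "Beta", "Alpha", "Gamma"],
    "City": ["A1", "B1", "A2", "G1"],
    "Metric Pctl": [10.0, 55.5, 80.2, 33.3],
    "Other": [1, 2, 3, 4],
})

# Boolean mask: True for rows where County equals "Alpha"
is_alpha = toy["County"] == "Alpha"

# .loc[rows, columns] keeps the masked rows and only the listed columns
subset = toy.loc[is_alpha, ["City", "Metric Pctl"]]
print(subset)
```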

In [None]:
# TODO: Filter for Los Angeles data

la = ...
la.head()

[TODO:] Give a demo of `groupby`, including different aggregation functions (for example, group by county and find the max PM2.5).
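One possible shape for such a demo is sketched below on made-up data. The DataFrame `toy`, its county names, and its PM2.5 values are hypothetical; only the `groupby` and `.agg()` calls are standard pandas.

```python
import pandas as pd

# Toy data (county names and PM2.5 values are made up)
toy = pd.DataFrame({
    "County": ["Alpha", "Alpha", "Beta", "Beta", "Beta"],
    "PM2.5": [8.0, 12.0, 5.0, 7.0, 9.0],
})

# One aggregation: the maximum PM2.5 within each county
max_pm25 = toy.groupby("County")["PM2.5"].max()
print(max_pm25)

# Several aggregations at once with .agg()
summary = toy.groupby("County")["PM2.5"].agg(["min", "mean", "max", "count"])
print(summary)
```

Each group collapses to a single row, so the result is indexed by the grouping column (`County` here) rather than by the original row labels.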

**Question 9:** With this new dataset, find the mean value for each of these four metrics based on the unique `Approximate Location`. Assign this new dataset to a new DataFrame. 

*Hint: you might find the `.groupby()` method useful*

In [None]:
# TODO: Write code to find the mean of PM2.5, CES4.0, Asthma, and Pollution Burden percentiles by approximate location

city_means = ...

---

## 3. Visualizing the Data

[TODO:] 
- Add histograms looking at distributions of diff pollution measures
- Add seaborn pairplots after assessing heatmap 
- Violinplot of multiple features side by side to compare distributions of multiple variables

CalEnviroScreen is also a mapping tool that allows us to visualize the socioeconomic disparities across California. We will be using a package called `geopandas` to facilitate this mapping. 

The following code cell reads in a shapefile (.shp), which contains the information we need to map the data using geopandas. A shapefile is a popular geospatial vector data format that stores the geometry of the features in our data. Simply run the cell below to read in the shapefile associated with CalEnviroScreen.

In [None]:
# Provide the path to your GIS file (e.g., .shp file)
file_path = 'calenviroscreen40shpf2021shp/CES4 Final Shapefile.shp'

# Load the shapefile
gdf = gpd.read_file(file_path)

# Display the first few rows of the GeoDataFrame
gdf.head()

We can call `.plot()` to map the data.

In [None]:
# map the data
gdf.plot()

**Question:** As we did with `ces`, let's merge `gdf` with the college data in `filtered_colleges`. Following the format from before, merge the two datasets on ZIP code.

In [None]:
# TODO: Fill in the ellipses

enviro_cc_shp = pd.merge(..., ..., how=..., left_on=..., right_on=...)
enviro_cc_shp.head()

We can use geopandas to visualize the spread of our metrics, like `Pollution Burden`. Run the code cell below. The left plot displays the pollution burden score across California, and the right plot displays poverty across California.

In [None]:
# Create a figure with subplots to compare pollution burden and population vulnerability
fig, ax = plt.subplots(1, 2, figsize=(15, 7))

# Plot Pollution Burden on the left
gdf.plot(column='PolBurdSc', ax=ax[0], legend=True, cmap='OrRd')
ax[0].set_title('Pollution Burden Score')

# Plot Population Vulnerability on the right (example with 'Poverty' column)
gdf.plot(column='Poverty', ax=ax[1], legend=True, cmap='PuBu')
ax[1].set_title('Population Vulnerability (Poverty)')

plt.show()

**Question:** List an observation from each plot and how it relates to environmental justice.  

*YOUR ANSWER HERE...*

We can also perform some simple but powerful analyses by observing the correlation between variables in our data. The code cell below computes the correlation coefficient between several socioeconomic, health, and pollution variables in our data and plots a **heatmap** of these coefficients. A heatmap is a visualization that represents the magnitude of values using color. It allows us to easily compare the strength of correlation across pairs of variables.

Run the code cell below and interpret the heatmap.

In [None]:
# Select columns for correlation matrix
corr_columns = ['Poverty', 'Unempl', 'HousBurd', 'PolBurdSc', 'Asthma', 'Cardiovas']

# Create a correlation matrix
corr_matrix = gdf[corr_columns].corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Between Population Characteristics and Pollution Burden')
plt.show()
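To make the numbers in the heatmap concrete, the sketch below shows what `.corr()` returns on a small made-up DataFrame. The name `toy` and its values are hypothetical. By default, `.corr()` computes pairwise Pearson correlation coefficients, which range from -1 to 1.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up): y rises with x, z falls as x rises
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Pairwise Pearson correlation coefficients (the default for .corr())
corr = toy.corr()
print(corr.round(2))

# The same x-y coefficient computed directly with NumPy
r_xy = np.corrcoef(toy["x"], toy["y"])[0, 1]
print(round(r_xy, 2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. A strong correlation does not by itself show that one variable causes the other.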

**Question:** Which variables are strongly correlated with each other, in either the positive or negative direction? Why might these variables be strongly correlated with one another?

*YOUR ANSWER HERE...*

---

## 4. Data-Driven Decision Making

Throughout this notebook, you gained valuable experience manipulating data to analyze socioeconomic disparities across Los Angeles and California. Tools like CalEnviroScreen are essential for evaluating environmental justice because they help identify and prioritize communities that face higher environmental and health risks due to the disproportionate effects of pollution.

[TODO:] tie this in with an article related to CalEnviroScreen data...

**Question:** Can you think of how an analysis of two other variables in the data not used in this notebook may be helpful for identifying marginalized communities? You may use the optional code cell as scratch work.

*YOUR ANSWER HERE...*

In [None]:
# Optional code cell for scratch work

**Question:** In 3-4 sentences, how might access to CalEnviroScreen data help guide resource allocation and inform policymaking?

*YOUR ANSWER HERE...*

---

# Congratulations, you are finished with the notebook!