# Homework Draft: CalEnviroScreen

---

In this homework, students will gain experience with fundamental exploratory data analysis (EDA) using the CalEnviroScreen data. This homework builds on methods introduced in lab and serves as an application of data science to the social sciences and **environmental justice**. According to California state law, environmental justice refers to the "fair treatment of people of all races, cultures, and incomes with respect to the development, adoption, implementation and enforcement of environmental laws, regulations, and policies."

By the end of this homework, students will be able to:
- Read in CalEnviroScreen data
- ...
- ...

## Table of Contents

1. [Introduction](#introduction)
2. [A Closer Look at Census Tracts and Regional Data](#a-closer-look-at-census-tracts-and-regional-data)
3. [Visualizing the Data](#visualizing-the-data)

### Import Modules

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd

---

## Introduction


The [California Communities Environmental Health Screening Tool](https://oehha.ca.gov/calenviroscreen) (CalEnviroScreen) provides accessible demographic and environmental information to identify communities that are susceptible to certain types of pollution. The tool combines environmental, health, and socioeconomic information to produce a score for every census tract in California, allowing us to compare conditions across communities.

### An Initial Glance at the Data

To begin exploring CalEnviroScreen, run the following cell to read in the data.

In [8]:
# Read in the data

enviro = pd.read_csv('enviro.csv')
enviro.head()

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Approximate Location,Longitude,Latitude,CES 4.0 Score,CES 4.0 Percentile,CES 4.0 Percentile Range,...,Linguistic Isolation Pctl,Poverty,Poverty Pctl,Unemployment,Unemployment Pctl,Housing Burden,Housing Burden Pctl,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl
0,6019001100,2780,Fresno,93706,Fresno,-119.781696,36.709695,93.18,100.0,95-100% (highest scores),...,79.37,76.0,98.92,12.8,93.83,30.3,91.04,93.16,9.66,99.72
1,6077000700,4680,San Joaquin,95206,Stockton,-121.287873,37.943173,86.65,99.99,95-100% (highest scores),...,95.53,73.2,98.39,19.8,99.21,31.2,92.28,93.17,9.66,99.74
2,6037204920,2751,Los Angeles,90023,Los Angeles,-118.197497,34.0175,82.39,99.97,95-100% (highest scores),...,81.55,62.6,93.39,6.4,61.53,20.3,63.97,83.75,8.69,95.79
3,6019000700,3664,Fresno,93706,Fresno,-119.827707,36.734535,81.33,99.96,95-100% (highest scores),...,78.71,65.7,95.35,15.7,97.35,35.4,96.41,94.64,9.82,99.89
4,6019000200,2689,Fresno,93706,Fresno,-119.805504,36.735491,80.75,99.95,95-100% (highest scores),...,86.56,72.7,98.3,13.7,95.29,32.7,94.16,95.4,9.9,99.95
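
The `Unnamed: 0` column in the output above usually appears when a saved row index is read back in as ordinary data. The following is a minimal sketch, assuming that is what happened here, of two `pd.read_csv` options that can help: `index_col=0` to reuse that column as the index, and `dtype` to keep identifier columns as strings. It uses a small in-memory CSV with illustrative values rather than `enviro.csv` itself.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for enviro.csv (values are illustrative only)
csv_text = """,Census Tract,ZIP,CES 4.0 Score
0,6019001100,93706,93.18
1,6077000700,95206,86.65
"""

# Default read: the unnamed first column is kept as data and named "Unnamed: 0"
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index instead;
# dtype keeps identifier columns as strings rather than integers
clean_df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype={"Census Tract": str, "ZIP": str},
)

print(default_df.columns.tolist())
print(clean_df.columns.tolist())
print(clean_df.dtypes)
```

If identifiers are read as strings, the key columns of both tables need the same type before they can be merged.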


Let's also read in data on the locations of colleges and universities in California. We will merge the two datasets to analyze the environmental and socioeconomic conditions in the areas around these colleges.

In [16]:
# Read in the data

collegecodes = pd.read_excel("College_codes_EVDtype.xlsx")
collegecodes.head()

Unnamed: 0,OPEID,College,City,State,Zip,yrs,EVDCode
0,111100,ALLAN HANCOCK COLLEGE,SANTA MARIA,CA,93454,2,1
1,111300,ANTELOPE VALLEY COLLEGE,LANCASTER,CA,93534,2,1
2,111500,ARMSTRONG UNIVERSITY,BERKELEY,CA,94704,4,4
3,111600,ART CENTER COLLEGE OF DES,PASADENA,CA,91103,4,4
4,111700,AZUSA PACIFIC UNIVERSITY,AZUSA,CA,91702,4,4


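Let's merge these two datasets on ZIP code. The cell below first drops colleges with an `EVDCode` of 4, then keeps only the census tracts whose ZIP code matches one of the remaining colleges.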

In [18]:
# Merge enviro and collegecodes

CollegeCodes_Public = collegecodes[collegecodes['EVDCode'] != 4]
CES4_Public = pd.merge(enviro, CollegeCodes_Public, how='inner', left_on='ZIP', right_on='Zip')
CES4_Public.head()

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Approximate Location,Longitude,Latitude,CES 4.0 Score,CES 4.0 Percentile,CES 4.0 Percentile Range,...,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl,OPEID,College,City,State,Zip,yrs,EVDCode
0,6037542402,3306,Los Angeles,90221,Compton,-118.212413,33.881969,80.71,99.94,95-100% (highest scores),...,83.37,8.65,95.46,118800,COMPTON CMTY COLLEGE,COMPTON,CA,90221,2,1
1,6037542200,7155,Los Angeles,90221,Compton,-118.197151,33.886893,73.92,99.57,95-100% (highest scores),...,84.83,8.8,96.52,118800,COMPTON CMTY COLLEGE,COMPTON,CA,90221,2,1
2,6037542401,4735,Los Angeles,90221,Compton,-118.210904,33.892362,73.16,99.48,95-100% (highest scores),...,88.46,9.18,98.68,118800,COMPTON CMTY COLLEGE,COMPTON,CA,90221,2,1
3,6037541604,5917,Los Angeles,90221,Compton,-118.211896,33.907866,65.81,98.05,95-100% (highest scores),...,83.08,8.62,95.32,118800,COMPTON CMTY COLLEGE,COMPTON,CA,90221,2,1
4,6037542106,3523,Los Angeles,90221,East Rancho Dominguez,-118.189077,33.893404,65.67,98.0,95-100% (highest scores),...,85.51,8.87,96.97,118800,COMPTON CMTY COLLEGE,COMPTON,CA,90221,2,1
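
Notice that a single college appears on several rows: an inner merge on ZIP code produces one row per matching census tract. The sketch below reproduces this behavior on two tiny, made-up tables (the tract numbers and college names are hypothetical), and uses the `indicator=True` option of `pd.merge` on an outer join to show which rows had no match.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values are illustrative only)
tracts = pd.DataFrame({
    "Census Tract": [1001, 1002, 1003, 1004],
    "ZIP": [90221, 90221, 93706, 94704],
})
colleges = pd.DataFrame({
    "College": ["COLLEGE A", "COLLEGE B", "COLLEGE C"],
    "Zip": [90221, 94704, 99999],
    "EVDCode": [1, 4, 1],
})

# Same filter as in the cell above: drop rows whose EVDCode is 4
kept = colleges[colleges["EVDCode"] != 4]

# Inner join: only ZIP codes present in both tables survive,
# and COLLEGE A is repeated once per matching tract
merged = pd.merge(tracts, kept, how="inner", left_on="ZIP", right_on="Zip")
print(merged)

# Outer join with indicator=True labels each row as
# "both", "left_only", or "right_only"
audit = pd.merge(
    tracts, kept, how="outer", left_on="ZIP", right_on="Zip", indicator=True
)
print(audit["_merge"].value_counts())
```

Checking the unmatched rows this way is one option for confirming that the merge did not silently drop colleges or tracts you expected to keep.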


Now, let's familiarize ourselves with the data before performing any manipulations. 

**Question 1:** What are the dimensions of the `enviro` dataset? Fill in the code cell with the necessary code and print your answer.

In [None]:
# TODO: Fill in the ellipses

shape = ... 
print(shape)

**Question 2:** What is the granularity of the `enviro` dataset? What does each row represent? What about the `collegecodes` dataset?

*YOUR ANSWER HERE...*
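
As a hint for reasoning about granularity, one common check is whether a candidate identifier column has exactly one row per value. The sketch below runs that check on a made-up table (the tract numbers are hypothetical), using `Series.is_unique` and `Series.value_counts`.

```python
import pandas as pd

# Hypothetical table (values are illustrative only)
toy = pd.DataFrame({
    "Census Tract": [1001, 1002, 1003],
    "ZIP": [90221, 90221, 93706],
})

# True when every value in the column appears exactly once
print(toy["Census Tract"].is_unique)
print(toy["ZIP"].is_unique)

# value_counts shows how many rows share each value
print(toy["ZIP"].value_counts())
```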

**Question 3:** Navigate to the [CalEnviroScreen](https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-40) website and, **in your own words**, describe the `Pollution_Burden_Pctl` category in the `enviro` dataset and what it represents. What is the difference between a high score and a low score?

*YOUR ANSWER HERE...*

---

## A Closer Look at Census Tracts and Regional Data

**Question 4:** Find the most polluted ZIP codes (those with the highest CES 4.0 scores) and show the colleges located there using the code cell below.

In [19]:
# TODO: Write code to find the most polluted zip codes and display the colleges

CES4_Public.sort_values(by='CES 4.0 Score', ascending=False).head(10)

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Approximate Location,Longitude,Latitude,CES 4.0 Score,CES 4.0 Percentile,CES 4.0 Percentile Range,...,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl,OPEID,College,City,State,Zip,yrs,EVDCode
0,6037542402,3306,Los Angeles,90221,Compton,-118.212413,33.881969,80.71,99.94,95-100% (highest scores),...,83.37,8.65,95.46,118800,COMPTON CMTY COLLEGE,COMPTON,CA,90221,2,1
11,6099003700,4669,Stanislaus,95380,Unincorporated Stanislaus County area,-120.883606,37.464793,75.31,99.7,95-100% (highest scores),...,81.37,8.44,93.87,115700,CALIF ST UNIV STANISLAUS,TURLOCK,CA,95380,4,2
1,6037542200,7155,Los Angeles,90221,Compton,-118.197151,33.886893,73.92,99.57,95-100% (highest scores),...,84.83,8.8,96.52,118800,COMPTON CMTY COLLEGE,COMPTON,CA,90221,2,1
2,6037542401,4735,Los Angeles,90221,Compton,-118.210904,33.892362,73.16,99.48,95-100% (highest scores),...,88.46,9.18,98.68,118800,COMPTON CMTY COLLEGE,COMPTON,CA,90221,2,1
19,6037294701,3099,Los Angeles,90744,Los Angeles,-118.254908,33.778016,71.29,99.27,95-100% (highest scores),...,81.46,8.45,93.99,122400,LOS ANGELES HARBOR COLLEG,WILMINGTON,CA,90744,2,1
12,6099003802,5339,Stanislaus,95380,Turlock,-120.857699,37.486126,66.71,98.35,95-100% (highest scores),...,80.89,8.39,93.38,115700,CALIF ST UNIV STANISLAUS,TURLOCK,CA,95380,4,2
34,6071004900,7113,San Bernardino,92410,San Bernardino,-117.316879,34.100314,66.66,98.31,95-100% (highest scores),...,91.41,9.48,99.43,127200,SAN BERNARDINO VALLEY COL,SN BERNRDNO,CA,92410,2,1
43,6047001302,2873,Merced,95340,Merced,-120.478191,37.304035,66.5,98.25,95-100% (highest scores),...,96.4,10.0,100.0,123700,MERCED COLLEGE,MERCED,CA,95340,2,1
44,6047001301,2662,Merced,95340,Merced,-120.491786,37.307192,66.41,98.21,95-100% (highest scores),...,87.78,9.11,98.32,123700,MERCED COLLEGE,MERCED,CA,95340,2,1
48,6019006802,3339,Fresno,93654,Unincorporated Fresno County area,-119.497435,36.599505,66.35,98.17,95-100% (highest scores),...,90.41,9.38,99.26,130800,REEDLEY COLLEGE,REEDLEY,CA,93654,2,1
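
Each row in the output above is a census tract, so the same ZIP code and college can appear more than once. If a single value per ZIP code is wanted, one option is to average the tract scores within each ZIP code before sorting. The sketch below does this on made-up rows shaped like `CES4_Public`; the unweighted mean is an assumption chosen for simplicity, not a method prescribed by CalEnviroScreen.

```python
import pandas as pd

# Hypothetical tract-level rows shaped like CES4_Public (values are illustrative only)
toy = pd.DataFrame({
    "ZIP": [90221, 90221, 95380, 95380, 92101],
    "College": ["COLLEGE A", "COLLEGE A", "COLLEGE B", "COLLEGE B", "COLLEGE C"],
    "CES 4.0 Score": [80.0, 70.0, 75.0, 35.0, 33.0],
})

# Collapse tracts to one row per (ZIP, College) using the mean score,
# then rank from highest to lowest
by_zip = (
    toy.groupby(["ZIP", "College"], as_index=False)["CES 4.0 Score"]
    .mean()
    .sort_values("CES 4.0 Score", ascending=False)
)
print(by_zip)
```

A population-weighted mean or the maximum tract score would be equally reasonable summaries, and they can rank ZIP codes differently.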


**Question 5:** Find the least polluted ZIP codes (those with the lowest CES 4.0 scores) and show the colleges located there using the code cell below.

In [20]:
# TODO: Write code to find the least polluted zip codes and display the colleges

CES4_Public.sort_values(by='CES 4.0 Score', ascending=True).head(10)

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Approximate Location,Longitude,Latitude,CES 4.0 Score,CES 4.0 Percentile,CES 4.0 Percentile Range,...,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl,OPEID,College,City,State,Zip,yrs,EVDCode
157,6099001002,3284,Stanislaus,95350,Modesto,-120.978323,37.654742,26.51,52.02,50-55%,...,46.76,4.85,45.37,124000,MODESTO JR COLLEGE,MODESTO,CA,95350,2,1
90,6071010802,3820,San Bernardino,92407,Unincorporated San Bernardino County area,-117.352081,34.270082,26.77,52.58,50-55%,...,47.01,4.88,45.79,114200,CALIF ST UNIV SAN BERNARD,SN BERNRDNO,CA,92407,4,2
18,6099003908,2428,Stanislaus,95380,Turlock,-120.854306,37.506439,29.72,58.16,55-60%,...,59.88,6.21,65.48,115700,CALIF ST UNIV STANISLAUS,TURLOCK,CA,95380,4,2
122,6037554600,4173,Los Angeles,90650,Norwalk,-118.089292,33.883645,31.87,62.14,60-65%,...,49.56,5.14,49.61,116100,CERRITOS COLLEGE,NORWALK,CA,90650,2,1
156,6099000909,5437,Stanislaus,95350,Modesto,-120.984276,37.673238,32.39,63.17,60-65%,...,55.92,5.8,59.38,124000,MODESTO JR COLLEGE,MODESTO,CA,95350,2,1
54,6019006300,7507,Fresno,93654,Unincorporated Fresno County area,-119.435827,36.663503,33.05,64.12,60-65%,...,48.05,4.98,47.44,130800,REEDLEY COLLEGE,REEDLEY,CA,93654,2,1
203,6073005200,7087,San Diego,92101,San Diego,-117.15211,32.715528,33.15,64.33,60-65%,...,48.92,5.07,48.7,127300,SAN DIEGO CITY COLLEGE,SAN DIEGO,CA,92101,2,1
155,6099000805,6582,Stanislaus,95350,Modesto,-121.021662,37.676408,33.21,64.44,60-65%,...,65.51,6.8,73.76,124000,MODESTO JR COLLEGE,MODESTO,CA,95350,2,1
42,6071004403,5054,San Bernardino,92410,San Bernardino,-117.340961,34.092831,33.44,64.88,60-65%,...,66.03,6.85,74.46,127200,SAN BERNARDINO VALLEY COL,SN BERNRDNO,CA,92410,2,1
184,6013364002,5531,Contra Costa,94806,Tara Hills,-122.317385,37.993894,33.52,65.03,65-70%,...,61.24,6.35,67.52,119000,CONTRA COSTA COLLEGE,SAN PABLO,CA,94806,2,1


Now that we have become more familiar with the data, let's take a closer look at the census tract for El Camino College. The relevant tract number is **6037603702**.

**Question 6:** Filter the dataset for this tract number using the code cell below.

In [None]:
# TODO: Fill in the ellipses

ecc = enviro[...]
ecc

**Question 7:** Based on this filtered data, examine three measures of environmental health and describe what they mean. Feel free to refer back to the CalEnviroScreen website for more context on the health measures in the data.

*YOUR ANSWER HERE...*

In [10]:
ecc = enviro[enviro['Census Tract'] == 6037603702]
ecc

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Approximate Location,Longitude,Latitude,CES 4.0 Score,CES 4.0 Percentile,CES 4.0 Percentile Range,...,Linguistic Isolation Pctl,Poverty,Poverty Pctl,Unemployment,Unemployment Pctl,Housing Burden,Housing Burden Pctl,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl
2087,6037603702,4777,Los Angeles,90506,Alondra Park,-118.339382,33.882996,38.32,72.66,70-75%,...,40.9,21.9,38.13,6.0,57.25,8.4,8.04,57.25,5.94,61.3


Let's compare these values with other census tracts in **Los Angeles County**.

**Question 8:** Write code to filter the dataset for only Los Angeles County data.

In [23]:
# TODO: Filter for Los Angeles data

la = ...
la.head()
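
One possible way to compare a single tract with the rest of its county is to attach the county median of a measure to every row and then look at the difference. The sketch below uses made-up numbers, and the column names `County Median Poverty` and `Diff From County Median` are hypothetical; the real comparison may call for different measures or summary statistics.

```python
import pandas as pd

# Hypothetical tracts (values are illustrative only)
toy = pd.DataFrame({
    "Census Tract": [1, 2, 3, 4, 5],
    "California County": [
        "Los Angeles", "Los Angeles", "Los Angeles", "Fresno", "Los Angeles",
    ],
    "Poverty": [10.0, 20.0, 30.0, 50.0, 40.0],
})

# transform("median") returns one value per row: the median of that row's county
toy["County Median Poverty"] = (
    toy.groupby("California County")["Poverty"].transform("median")
)

# Positive values mean the tract is above its county median
toy["Diff From County Median"] = toy["Poverty"] - toy["County Median Poverty"]

# Look up one tract of interest
print(toy[toy["Census Tract"] == 2])
```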

---

## Visualizing the Data

geopandas used here...