In [None]:
import warnings
warnings.filterwarnings('ignore')

from datascience import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('fivethirtyeight')

# Pollution Maps

In this report we introduce maps! Some data has geospatial features such as latitude and longitude, giving us the opportunity to understand how instances of our data are spread across different locations. 

## The Data Science Life Cycle - Table of Contents

<a href='#section 0'>Background Knowledge</a>

<a href='#subsection 1a'>Formulating a question or problem</a> 

<a href='#subsection 1b'>Acquiring and preparing data</a>

<a href='#subsection 1c'>Conducting exploratory data analysis</a>

<a href='#subsection 1d'>Using prediction and inference to draw conclusions</a>
<br><br>

## Background<a id='section 0'></a>

In 2004, California introduced the Environmental Justice Action Plan to study the impact of multiple pollution sources on California communities. 

While California has some of the strictest regulatory controls in the country, many communities in California struggle with a disproportionate share of environmental pollution. 
Some parts of the state are burdened by old industrial and agricultural practices, while others are close to
trade corridors and suffer from high levels of air pollution. 
 
The California Environmental Protection Agency (CalEPA) and the California Office of Environmental Health Hazard Assessment (OEHHA), together with the public, developed the CalEnviroScreen project to help assess the impact of pollution on California communities. We'll be using its data to draw maps that give us more insight into the impacts of pollution. 



## Formulating a question or problem <a id='subsection 1a'></a>

It is important to ask questions that will be informative and that will avoid misleading results.

*Insert Answer Here*

*Insert Answer Here*

## Acquiring and preparing data <a id='subsection 1b'></a>

The California Environmental Screening project, CalEnviroScreen (CES for short), provides data on air, water, and other pollution indicators (e.g., ozone levels, airborne particulate matter, pesticides, traffic congestion) and combines this information into a pollution score. You can find the raw data [here](https://oehha.ca.gov/calenviroscreen).

The data provided here includes environmental and population information across different census tracts in California. Census tracts from the U.S. Census Bureau (2010 census) represent small, relatively permanent subdivisions of a county, and are uniquely numbered in each county with a numeric code. Census tracts average about 4,000 people, with a minimum of 1,200 and a maximum of 8,000. 

In [None]:
ces_raw = Table.read_table("pollution_data/ces_data_v2.csv")

Here are some of the important fields in the dataset that you will focus on:

|Column Name   | Description |
|:---|:---|
|Total.Population | Census tract population |
|California.County | County name |
|Latitude | Measurement of location north or south of the equator | 
|Longitude | Measurement of location east or west of the prime meridian |
|Hispanic....| Percent of population that is Hispanic |
|White.... | Percent of population that is non-Hispanic White |
|African.American.... | Percent of population that is African American |
|Ozone| Average daily maximum ozone concentration (ppm) from May to October |
|PM2.5 | Average concentration of fine particulate matter (micrograms per cubic meter) |
|Pollution.Burden | Pollution Burden scores for each census tract derived from exposure indicators (ozone and PM2.5 concentrations, diesel PM emissions, drinking water contaminants, pesticide use, toxic releases from facilities, and traffic density) and environmental effects indicators (cleanup sites, impaired water bodies, groundwater threats, hazardous waste facilities and generators, and solid waste sites and facilities) |
|Asthma | Emergency department visits for asthma per 10,000 people |
|Low.Birth.Weight | Percent of babies born weighing under 5.5 pounds |
|Poverty | Percent of the population living below two times the federal poverty level |
|Unemployment | Percent of the population over 16 that is unemployed and eligible for the labor force |

In [None]:
ces_raw

Part of **cleaning data** includes **renaming columns**, **reducing the table size to include only the columns of interest**, and **removing missing values**.  

For our purposes, we will not be using the columns:
- ZIP
- CES.3.0.Score
- Diesel.PM
- Drinking.Water
- Pesticides
- Tox..Release
- Traffic
- Cleanup.Sites
- Groundwater.Threats
- Haz..Waste
- Imp..Water.Bodies
- Solid.Waste

Keep the column `Census.Tract` because it uniquely identifies the census tract.  Also keep `Pollution.Burden`, race columns, health indicators, and a few pollution indicators that you identified earlier.  

<div class="alert alert-info">
<b>Question:</b> Fill the array "cols_to_drop" with the names of the columns we seek to remove from our dataset.</div> 

In [None]:
cols_to_drop = make_array(...)

ces_data = ces_raw.drop(...)
ces_data

In [None]:
#KEY
# make_array takes the labels as separate arguments, not as a single list
cols_to_drop = make_array("ZIP", "CES.3.0.Score", "Diesel.PM", "Drinking.Water", "Pesticides", 
                "Tox..Release", "Traffic", "Cleanup.Sites", "Groundwater.Threats", "Haz..Waste", 
                "Imp..Water.Bodies", "Solid.Waste")

ces_data = ces_raw.drop(cols_to_drop)
ces_data

Let's give some of the remaining columns simpler, more meaningful names.

In [None]:
old_names = make_array('Census.Tract', 'Total.Population', 'California.County', 'Hispanic....', "White....", 
            'African.American....', 'Native.American....', "Asian.American....", 'Other....', 'Pollution.Burden',
            'Low.Birth.Weight')
new_names = make_array('Tract', 'Population', 'County', 'Hispanic', "White", 'Black', 'Native', "Asian", 'Other', 'Pollution_Burden',
            'Low_Birthweight')

In [None]:
ces_data = ces_data.relabel(old_names, new_names)
ces_data
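The cleaning steps listed earlier include removing missing values, which the cells above do not show. As a minimal sketch in plain Python (not the `datascience` API), assuming missing entries appear as `None` or `NaN` and using invented rows, one way to drop incomplete rows might look like this:

```python
import math

# Toy rows standing in for census tracts; all values are invented for illustration.
rows = [
    {"Tract": 1, "Ozone": 0.03, "Poverty": 21.5},
    {"Tract": 2, "Ozone": float("nan"), "Poverty": 40.2},
    {"Tract": 3, "Ozone": 0.05, "Poverty": None},
    {"Tract": 4, "Ozone": 0.04, "Poverty": 12.0},
]

def is_missing(value):
    """Return True for None or a floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_missing(rows, columns):
    """Keep only the rows that have a value present in every listed column."""
    return [r for r in rows if not any(is_missing(r[c]) for c in columns)]

clean_rows = drop_missing(rows, ["Ozone", "Poverty"])
print([r["Tract"] for r in clean_rows])  # [1, 4]
```

How missing values are encoded in `ces_data_v2.csv` is not stated here, so it is worth inspecting the table before applying a filter like this.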

## Conducting exploratory data analysis <a id='subsection 1c'></a>

We are interested in studying the impact of pollution on different communities in California. With maps, we can get a spatial understanding of how levels of pollution vary across different geographical regions.


We will be using two different map types to give us insight: a **dot map** and a **size map**. 

### Dot map

A dot map is a simple map with a dot at each (lat, long) pair from our data. 

The next cell creates a function called <b>dot_map</b> which we will use to create a dot map. 

In [None]:
def dot_map(tbl):
    """Create a map with dots to represent a unique location.
    
    Parameters:
        tbl (datascience.Table): The Table containing the data needed to plot our map. Note the table
        must have a "Latitude" and "Longitude" column for this function to work.
    Returns:
        (datascience.Map): A map with a dot at each unique (lat, long) pair.
    """
    reduced = tbl.select("Latitude", "Longitude")
    return Circle.map_table(reduced, area=10, fill_opacity=1)

<div class="alert alert-danger" role="alert">
    <b>Example:</b> Let's start with a dot map that displays all of our census tracts. 

To do so, we pass our <code>ces_data</code> table to <code>dot_map</code>. The function takes a single table, named <code>tbl</code>, so <code>ces_data</code> is our argument.
</div>

In [None]:
dot_map(ces_data)

Wow! We've now generated a map with dots that represent each of the census tracts. Next, we would like to create dot maps for census tracts that fit certain criteria. For example, ...

<div class="alert alert-info"> 
<b>Question:</b> Make a simple dot map for only San Francisco county census tracts using <i>ces_data</i>.
</div>

In [None]:
dot_sf = ces_data.where("", are.equal_to(""))

dot_map(dot_sf)

In [None]:
# KEY

dot_sf = ces_data.where("County", are.equal_to("San Francisco"))

dot_map(dot_sf)

<div class="alert alert-info"> 
<b>Question:</b> Using <i>ces_data</i>, make a simple dot map of the census tracts where pollution burden is greater than 80%.
</div>

In [None]:
dot_80 = ces_data.where("", are.above())

dot_map(dot_80)

In [None]:
# KEY

dot_80 = ces_data.where("Pollution_Burden", are.above(80))

dot_map(dot_80)
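`are.above(80)` keeps values strictly greater than 80, and criteria can be combined by filtering more than once. The plain-Python sketch below mirrors that logic on invented rows. The `where` function here is a toy stand-in written for illustration, not the `datascience` method.

```python
# Invented rows; real values come from ces_data.
tracts = [
    {"Tract": 101, "County": "San Francisco", "Pollution_Burden": 80.0},
    {"Tract": 102, "County": "San Francisco", "Pollution_Burden": 91.3},
    {"Tract": 201, "County": "Fresno", "Pollution_Burden": 95.0},
    {"Tract": 202, "County": "Fresno", "Pollution_Burden": 42.7},
]

def where(rows, column, predicate):
    """Keep the rows whose value in `column` satisfies `predicate`."""
    return [row for row in rows if predicate(row[column])]

# Strictly above 80: the tract at exactly 80.0 is excluded.
high_burden = where(tracts, "Pollution_Burden", lambda v: v > 80)

# Chain a second filter to narrow the result to one county.
high_burden_sf = where(high_burden, "County", lambda v: v == "San Francisco")

print([r["Tract"] for r in high_burden])     # [102, 201]
print([r["Tract"] for r in high_burden_sf])  # [102]
```

Chaining filters this way corresponds to calling `.where(...)` twice on a table before passing the result to `dot_map`.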

### Size map

Size maps are detail-oriented maps, using color and size data to add more visual information to our map. 

The next cell creates a function called <b>size_map</b> which we will use to create a size map.

In [None]:
def size_map(tbl):
    """Plots a geographical map where each dot represents a coordinate pair, scaled by a given column.
    
    Parameters:
        tbl: The input Table containing the following columns, in order:
            Col 0: latitude
            Col 1: longitude
            Col 2: type of location
            Col 3: color (MUST be labeled "colors")
            Col 4: area (MUST be labeled "areas")
    Returns:
        (datascience.Map): A map with a dot at each (lat, long),
                        colored according to Col 3, with area as in Col 4.
    """
    return Circle.map_table(tbl, fill_opacity=1)

Load the following cells to get some helpful functions needed for `size_map`.

#### Helper Functions for `size_map`

In [1]:
#Use this function in order to get our Col 3: color for size_map
def get_colors_from_column(tbl, col, include_outliers=True):
    """Assigns each row of the input table to a color based on the value of its percentage column."""
    vmin = min(tbl.column(col))
    vmax = max(tbl.column(col))

    if include_outliers:
        outlier_min_bound = vmin
        outlier_max_bound = vmax
    else:
        q1 = np.percentile(tbl.column(col), 25)
        q3 = np.percentile(tbl.column(col), 75)
        IQR = q3 - q1
        outlier_min_bound = max(vmin, q1 - 1.5 * IQR)
        outlier_max_bound = min(vmax, q3 + 1.5 * IQR)
        
    colorbar_scale = list(np.linspace(outlier_min_bound, outlier_max_bound, 10))
    scale_colors = ['#006100', '#3c8000', '#6ba100', '#a3c400', '#dfeb00', '#ffea00', '#ffbb00', '#ff9100', '#ff6200', '#ff2200']
    
    def assign_color(colors, cutoffs, datapoint):
        """Assigns a color to the input percent based on the data's distribution."""
        for i, cutoff in enumerate(cutoffs):
            if cutoff >= datapoint:
                return colors[i - 1] if i > 0 else colors[0]
        return colors[-1]
    
    colors = [""] * tbl.num_rows
    for i, datapoint in enumerate(tbl.column(col)): 
        colors[i] = assign_color(scale_colors, colorbar_scale, datapoint)
        
    return colors
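To see how the binning rule in `get_colors_from_column` behaves, here is a small worked example in plain Python. It copies `assign_color` and the color list from the helper above. The `linspace` function is a simplified stand-in for `np.linspace`, and the data values are invented.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (stand-in for np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

scale_colors = ['#006100', '#3c8000', '#6ba100', '#a3c400', '#dfeb00',
                '#ffea00', '#ffbb00', '#ff9100', '#ff6200', '#ff2200']

def assign_color(colors, cutoffs, datapoint):
    """Same rule as the helper: find the first cutoff at or above the value."""
    for i, cutoff in enumerate(cutoffs):
        if cutoff >= datapoint:
            return colors[i - 1] if i > 0 else colors[0]
    return colors[-1]

# Invented values spanning 0 to 90, so the ten cutoffs are 0, 10, ..., 90.
values = [0, 5, 10, 45, 90]
cutoffs = linspace(min(values), max(values), 10)

for v in values:
    print(v, assign_color(scale_colors, cutoffs, v))

# A value above the top cutoff falls through the loop and gets the last color.
print(100, assign_color(scale_colors, cutoffs, 100))
```

In this example the values 0, 5, and 10 share the first color, 45 maps to `#dfeb00`, and the maximum value 90 maps to `#ff6200`. The final color, `#ff2200`, is returned only for values above the top cutoff, which in the helper can happen only when `include_outliers=False` trims the upper bound.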

In [2]:
# Use this function in order to get our Col 4: size for size_map
def get_areas_from_column(tbl, label):
    """Gets the array values corresponding to the column label in the input table."""
    return tbl.column(label)
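Because `get_areas_from_column` returns the column unchanged, the areas passed to `size_map` are in the column's own units: populations in the thousands, percentages under 100. If the resulting dots turn out too large or too small, one possible adjustment is to rescale the values first. The sketch below uses a hypothetical helper, `rescale_areas`, which is not part of this notebook or the `datascience` library. The target range of 5 to 50 is an arbitrary assumption.

```python
def rescale_areas(values, min_area=5.0, max_area=50.0):
    """Linearly map values onto [min_area, max_area].

    Hypothetical helper for illustration; the default range is arbitrary.
    """
    values = list(values)
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal: give every dot the midpoint area.
        return [(min_area + max_area) / 2] * len(values)
    scale = (max_area - min_area) / (hi - lo)
    return [min_area + (v - lo) * scale for v in values]

# Invented tract populations.
populations = [1200, 4000, 8000]
print(rescale_areas(populations))
```

The smallest value maps to `min_area` and the largest to `max_area`, so relative differences between tracts stay visible while the overall dot size remains under your control.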

<div class="alert alert-danger"> 
<b>Example:</b> Let's start with creating a population size map.
    
To do so, let's use the helper functions above to get the population areas and store them in a variable called <code>pop_areas</code>.
</div>

In [3]:
pop_areas = get_areas_from_column(ces_data, "Population")
pop_areas


Since we know that there are no percentages associated with population, we can use a constant hexadecimal color value **#1E90FF** and a constant type `Population`.

In [None]:
ces_pop_tbl = ces_data.select("Latitude", "Longitude").with_columns("type", "Population",
                                                                    "colors", '#1E90FF',
                                                                    "areas", pop_areas)
ces_pop_tbl

In [None]:
size_map(ces_pop_tbl)

<div class="alert alert-info"> 
<b>Question:</b> Create a size map that uses color to map the <b>Pollution Burden</b> values.
    
Recall, <b>Pollution_Burden</b> represents the potential exposures to pollutants and the adverse environmental conditions caused by pollution in a given census tract.
</div>

In [None]:
# pollution_areas: pollution burden percents from ces_data
pollution_areas = get_areas_from_column(ces_data, "Pollution_Burden")
# pollution_colors: color code pollution burden percents from ces_data
pollution_colors = get_colors_from_column(ces_data, "Pollution_Burden")
pollution_tbl = ces_data.select("Latitude", "Longitude").with_columns("type", "Pollution_Burden",
                                                                    "colors", pollution_colors,
                                                                    "areas", pollution_areas)

In [None]:
pollution_tbl

In [None]:
size_map(pollution_tbl)

<div class="alert alert-info"> 
    <b>Question:</b> Now, create a size map that uses color to map the <b>Poverty</b> values.

Recall, <b>Poverty</b> is the percent of the population living below two times the federal poverty level in a given census tract.
</div>

In [None]:
# KEY

pov_colors = get_colors_from_column(ces_data, "Poverty")
pov_areas = get_areas_from_column(ces_data, "Poverty")

poverty_tbl = ces_data.select("Latitude", "Longitude").with_columns("type", "Poverty",
                                                                    "colors", pov_colors,
                                                                    "areas", pov_areas)

In [None]:
poverty_tbl

In [None]:
size_map(poverty_tbl)

## Using prediction and inference to draw conclusions <a id='subsection 1d'></a>

<div class="alert alert-info"> 
<b>Question:</b> 

</div>

*Insert answer here*

<div class="alert alert-info"> 
<b>Question:</b> 

</div>

*Insert answer here*

<div class="alert alert-info"> 
<b>Question:</b> 

</div>

*Insert answer here*

<div class="alert alert-success" role="alert">
  <h2 class="alert-heading">Well done!</h2>
    <p>In this report you ...</p>
</div>