In [None]:
import warnings
warnings.filterwarnings('ignore')

from datascience import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('fivethirtyeight')

# Pollution Maps

In this report we introduce maps! Some data has geospatial features such as latitude and longitude, giving us the opportunity to understand how instances of our data are spread across different locations. 

<img src="longitude-and-latitude.png">

## The Data Science Life Cycle - Table of Contents

<a href='#section 0'>Background Knowledge</a>

<a href='#subsection 1a'>Formulating a question or problem</a> 

<a href='#subsection 1b'>Acquiring and preparing data</a>

<a href='#subsection 1c'>Conducting exploratory data analysis</a>

<a href='#subsection 1d'>Using prediction and inference to draw conclusions</a>
<br><br>

## Background<a id='section 0'></a>

In 2004, California introduced the Environmental Justice Action Plan to study the impact of multiple pollution sources on California communities. 

While California has some of the strictest regulatory controls in the country, many communities in California struggle with a disproportionate share of environmental pollution. 
Some parts of the state are burdened by old industrial and agricultural practices, while others are close to
trade corridors and suffer from high levels of air pollution. 
 
The California Environmental Protection Agency (CalEPA), the California Office of Environmental Health Hazard Assessment (OEHHA),
and the public developed the CalEnviroScreen project to help assess the impact of pollution on California communities. 



## Formulating a question or problem <a id='subsection 1a'></a>

It is important to ask questions that will be informative and that will avoid misleading results. 

There are many different questions we could ask about the impact of pollution on California communities. For example, researchers use maps to draw relationships between land type, environmental pollution, and population demographics. To help shape your question(s), take a look below at the important fields in the dataset that you will focus on.

<div class="alert alert-info">
<b>Question:</b> Take some time to formulate questions you have and the data you would need to answer the questions.
</div> 

<b>Your questions:</b> *Insert answer here*

<b>Data you would need:</b> *Insert answer here*

## Acquiring and preparing data <a id='subsection 1b'></a>

The California Environmental Screening project, CalEnviroScreen, provides data on air, water, and pollution (e.g., ozone levels, airborne particulate matter, pesticides, traffic congestion). CalEnviroScreen (CES for short) combines this information into a single pollution score. You can find the raw data [here](https://oehha.ca.gov/calenviroscreen).

The data provided here includes environmental and population information across different census tracts in California. Census tracts from the U.S. Census Bureau (2010 census) represent small, relatively permanent subdivisions of a county, and are uniquely numbered in each county with a numeric code. Census tracts average about 4,000 people, with a minimum of 1,200 and a maximum of 8,000. 

In [None]:
ces_raw = Table.read_table("pollution_data/ces_data_v2.csv")

Here are some of the important fields in the dataset that you will focus on:

|Column Name   | Description |
|:---|:---|
|Total.Population | Census tract population |
|California.County | County name |
|Latitude | Measurement of location north or south of the equator | 
|Longitude | Measurement of location east or west of the prime meridian |
|Hispanic....| Percent of population that is Hispanic |
|White.... | Percent of population that is non-Hispanic White |
|African.American.... | Percent of population that is African American |
|Ozone| Average daily maximum ozone concentration (ppm) from May to October |
|PM2.5 | Average concentration of fine particulate matter (micrograms per cubic meter) |
|Pollution.Burden | Pollution Burden scores for each census tract derived from exposure indicators (ozone and PM2.5 concentrations, diesel PM emissions, drinking water contaminants, pesticide use, toxic releases from facilities, and traffic density) and environmental effects indicators (cleanup sites, impaired water bodies, groundwater threats, hazardous waste facilities and generators, and solid waste sites and facilities) |
|Asthma | Emergency department visits for asthma per 10,000 people |
|Low.Birth.Weight | Percent of babies born weighing under 5.5 pounds |
|Poverty | Percent of the population living below two times the federal poverty level |
|Unemployment | Percent of the population over 16 that is unemployed and eligible for the labor force |

In [None]:
ces_raw.show(10)

Part of **cleaning data** includes **renaming columns**, **reducing the table size to include only the columns of interest**, and **removing missing values**.  
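
As a rough illustration of these three steps, here is a minimal sketch in plain Python, independent of the `datascience` library used in this notebook. The toy rows and their values are made up, and the helper name `clean` is hypothetical.

```python
# Toy rows standing in for a raw table; all values are made up for illustration.
raw_rows = [
    {"Census.Tract": 1, "Total.Population": 3200, "ZIP": 90001, "Asthma": 52.1},
    {"Census.Tract": 2, "Total.Population": 4100, "ZIP": 90002, "Asthma": None},
    {"Census.Tract": 3, "Total.Population": 2800, "ZIP": 90003, "Asthma": 31.4},
]

cols_to_drop = {"ZIP"}                      # columns we are not interested in
renames = {"Census.Tract": "Tract",         # old name -> simpler new name
           "Total.Population": "Population"}


def clean(rows, drop, rename):
    """Drop unwanted columns, rename the rest, and skip rows with missing values."""
    cleaned = []
    for row in rows:
        # Keep only the wanted columns, applying the new name where one is given.
        kept = {rename.get(name, name): value
                for name, value in row.items() if name not in drop}
        # A row survives only if none of its remaining values is missing.
        if all(value is not None for value in kept.values()):
            cleaned.append(kept)
    return cleaned


clean_rows = clean(raw_rows, cols_to_drop, renames)
for row in clean_rows:
    print(row)
```

In this toy example the second row is skipped because its `Asthma` value is missing, and the remaining rows carry the shorter column names.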

For our purposes, we will not be using the columns:
- ZIP
- CES.3.0.Score
- Diesel.PM
- Drinking.Water
- Pesticides
- Tox..Release
- Traffic
- Cleanup.Sites
- Groundwater.Threats
- Haz..Waste
- Imp..Water.Bodies
- Solid.Waste

Keep the column `Census.Tract` because it uniquely identifies the census tract.  Also keep `Pollution.Burden`, race columns, health indicators, and a few pollution indicators that you identified earlier.  

<div class="alert alert-info">
<b>Question:</b> Fill the array "cols_to_drop" with the names of the columns we seek to remove from our dataset.</div> 

In [None]:
cols_to_drop = make_array("...", "...", "...", "...", "...", 
                          "...", "...", "...", "...", "...", 
                          "...", "...")

ces_data = ces_raw.drop(cols_to_drop)
ces_data

Let's give some of the remaining columns simpler, more meaningful names.

In [None]:
old_names = make_array('Census.Tract', 'Total.Population', 'California.County', 'Hispanic....', "White....", 
            'African.American....', 'Native.American....', "Asian.American....", 'Other....', 'Pollution.Burden',
            'Low.Birth.Weight')
new_names = make_array('Tract', 'Population', 'County', 'Hispanic', "White", 'Black', 'Native', "Asian", 'Other', 'Pollution_Burden',
            'Low_Birthweight')

In [None]:
ces_data = ces_data.relabel(old_names, new_names)
ces_data

## Conducting exploratory data analysis <a id='subsection 1c'></a>

We are interested in studying the impact of pollution on different communities in California. With maps, we can get a spatial understanding of how levels of pollution vary across different geographical regions.


We will be using two different map types to give us insight: a **dot map** and a **size map**. 

### Dot map

A dot map is a simple map with a dot at each (lat, long) pair in our data. 

The next cell creates a function called <b>dot_map</b> which we will use to create a dot map. 

In [None]:
def dot_map(tbl):
    """Create a map with dots to represent a unique location.
    
    Parameters:
        tbl (datascience.Table): The Table containing the data needed to plot our map. Note the table
        must have a "Latitude" and "Longitude" column for this function to work.
    Returns:
        (datascience.Map): A map with a dot at each unique (lat, long) pair.
    """
    reduced = tbl.select("Latitude", "Longitude")
    return Circle.map_table(reduced, area=10, fill_opacity=1)

<div class="alert alert-danger" role="alert">
    <b>Example:</b> Let's start with a dot map that displays all of our census tracts. To do so, we can pass our table <code>ces_data</code> into <b>dot_map</b>.
</div>

In [None]:
dot_map(ces_data)

<b>Next, we would like to create dot maps for census tracts that fit certain criteria. For example, let's focus on a specific county, then visualize different important fields.</b>
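
To show the idea of narrowing a table step by step before mapping it, here is a minimal plain-Python sketch. It does not use the `datascience` library: the hypothetical helper `where` only loosely mimics filtering with an equality test and a strict "greater than" test, and the rows, coordinates, and poverty values are invented.

```python
# Toy rows; the coordinates and poverty values are invented for illustration.
rows = [
    {"County": "Alameda", "Latitude": 37.8, "Longitude": -122.3, "Poverty": 62.0},
    {"County": "Alameda", "Latitude": 37.7, "Longitude": -122.1, "Poverty": 18.5},
    {"County": "Fresno",  "Latitude": 36.7, "Longitude": -119.8, "Poverty": 71.0},
]


def where(rows, column, predicate):
    """Keep only the rows whose value in `column` satisfies `predicate`."""
    return [row for row in rows if predicate(row[column])]


# First narrow to one county, then to tracts strictly above a threshold.
alameda = where(rows, "County", lambda county: county == "Alameda")
alameda_high_poverty = where(alameda, "Poverty", lambda pct: pct > 50)

# A dot map only needs the (lat, long) pair of each remaining row.
pairs = [(row["Latitude"], row["Longitude"]) for row in alameda_high_poverty]
print(pairs)
```

Each filter works on the result of the previous one, so the final list contains only rows that satisfy both conditions.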

<div class="alert alert-info"> 
<b>Question:</b> Make a simple dot map for only Los Angeles county census tracts using the table <code>ces_data</code>.
</div>

In [None]:
dot_la = ces_data.where("...", are.equal_to("..."))

In [None]:
dot_map(dot_la)

<div class="alert alert-info"> 
<b>Question:</b> Make a simple dot map of the census tracts in the table <code>dot_la</code> where pollution burden is greater than 70.
</div>

In [None]:
dot_la_70 = dot_la.where("...", are.above(...))

In [None]:
dot_map(dot_la_70)

<div class="alert alert-info"> 
<b>Question:</b> Make a simple dot map where the percent of the population living below two times the federal poverty level is above 75% using the table <code>dot_la</code>.
</div>

In [None]:
dot_la_poverty = dot_la.where("...", are.above(...))

In [None]:
dot_map(dot_la_poverty)

<div class="alert alert-info"> 
<b>Question:</b> What inference can we draw from the pollution burden and poverty dot maps? What are some important considerations for this inference?
</div>

*Insert answer here*

### Size map

Size maps are detail-oriented maps, using color and size data to add more visual information to our map. 

The next cell creates a function called <b>size_map</b> which we will use to create a size map.

In [None]:
def size_map(tbl):
    """Plots a geographical map where each dot represents a coordinate pair, scaled by a given column.
    
    Parameters:
        tbl: The input Table containing the following columns, in order:
            Col 0: latitude
            Col 1: longitude
            Col 2: type of location
            Col 3: color (MUST be labeled "colors")
            Col 4: area (MUST be labeled "areas")
    Returns:
        (datascience.Map): A map with a dot at each (lat, long),
                        colored according to Col 3 and sized according to Col 4.
    """
    return Circle.map_table(tbl, fill_opacity=0.7)

Unlike <b>dot_map</b>, this function requires a table in a specific format:

| Latitude | Longitude | type | colors | areas |
|:---|:---|:---|:---|:---|
|...|...|...|...|...|

The next two cells create functions <b>get_colors_from_column</b> and <b>get_areas_from_column</b> which will help us create Col 3: colors and Col 4: areas! 

Don't worry about the code. We'll explain how to use them in the example.

In [None]:
# Use this function to get our Col 3: colors for size_map
def get_colors_from_column(tbl, col, include_outliers=False):
    """Assigns each row of the input table to a color based on the value of its percentage column."""
    vmin = min(tbl.column(col))
    vmax = max(tbl.column(col))

    if include_outliers:
        outlier_min_bound = vmin
        outlier_max_bound = vmax
    else:
        q1 = np.percentile(tbl.column(col), 25)
        q3 = np.percentile(tbl.column(col), 75)
        IQR = q3 - q1
        outlier_min_bound = max(vmin, q1 - 1.5 * IQR)
        outlier_max_bound = min(vmax, q3 + 1.5 * IQR)
        
    colorbar_scale = list(np.linspace(outlier_min_bound, outlier_max_bound, 10))
    scale_colors = ['#006100', '#3c8000', '#6ba100', '#a3c400', '#dfeb00', '#ffea00', '#ffbb00', '#ff9100', '#ff6200', '#ff2200']
    
    def assign_color(colors, cutoffs, datapoint):
        """Assigns a color to the input percent based on the data's distribution."""
        for i, cutoff in enumerate(cutoffs):
            if cutoff >= datapoint:
                return colors[i - 1] if i > 0 else colors[0]
        return colors[-1]
    
    colors = [""] * tbl.num_rows
    for i, datapoint in enumerate(tbl.column(col)): 
        colors[i] = assign_color(scale_colors, colorbar_scale, datapoint)
        
    return colors

In [None]:
# Use this function to get our Col 4: areas for size_map
def get_areas_from_column(tbl, label):
    """Gets the array values corresponding to the column label in the input table."""
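    # Copy as floats so the table's own column is left unchanged and NaN can be stored
    areas = tbl.column(label).astype(float)
    # Replace zeros with NaN so zero-valued rows do not get a zero-area circle
    areas[areas == 0] = np.nan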
    return areas

<b>For size maps, let's continue with our data focusing on Los Angeles county using the</b> <code>size_la</code> <b>table.</b>

In [None]:
size_la = ces_data.where("County", are.equal_to("Los Angeles"))
size_la.show(10)

<div class="alert alert-danger"> 
<b>Example:</b> Let's start with creating a population size map. To do so, we will:
      <ol>
        <li>Pass our table and the column we wish to work with as arguments to the function <b>get_colors_from_column</b>. It will return a list of strings that represent colors in hexadecimal format. Colors run from green for smaller values, through yellow and orange, to red for larger values.</li>
        <li>Pass our table and the column we wish to work with as arguments to the function <b>get_areas_from_column</b>. It will return an array just like .column does. Larger values will result in larger circles by area in the map.</li>
        <li>Create a new table selecting "Latitude" and "Longitude", then adding in the columns "type", "colors", and "areas".</li>
    </ol>
</div>
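
To make step 1 more concrete, here is a minimal standard-library sketch of the binning that <b>get_colors_from_column</b> performs with its default `include_outliers=False`. It is an approximation for illustration only: `statistics.quantiles(..., method="inclusive")` stands in for `np.percentile`, the helpers `outlier_bounds` and `evenly_spaced` are hypothetical names, and the values are toy numbers rather than CES data.

```python
import statistics

# The ten-color scale used by get_colors_from_column, from green (low) to red (high).
scale_colors = ['#006100', '#3c8000', '#6ba100', '#a3c400', '#dfeb00',
                '#ffea00', '#ffbb00', '#ff9100', '#ff6200', '#ff2200']


def outlier_bounds(values):
    """Clip the color scale to 1.5 * IQR beyond the quartiles, within the data range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return max(min(values), q1 - 1.5 * iqr), min(max(values), q3 + 1.5 * iqr)


def evenly_spaced(lo, hi, count):
    """Return `count` evenly spaced cutoffs from lo to hi (a stand-in for np.linspace)."""
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]


def assign_color(colors, cutoffs, datapoint):
    """Same rule as in the notebook: find the first cutoff at or above the datapoint."""
    for i, cutoff in enumerate(cutoffs):
        if cutoff >= datapoint:
            return colors[i - 1] if i > 0 else colors[0]
    return colors[-1]


values = [10, 12, 14, 15, 16, 18, 20, 95]   # toy data with one extreme value
lo, hi = outlier_bounds(values)
cutoffs = evenly_spaced(lo, hi, len(scale_colors))
colors = [assign_color(scale_colors, cutoffs, v) for v in values]

for value, color in zip(values, colors):
    print(value, color)
```

In this toy data the extreme value 95 lies above the upper bound, so it simply receives the last (red) color instead of stretching the scale and pushing every other value into the green bins.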

In [None]:
# Step 1: Use function get_colors_from_column (arguments: size_la, "Population")
la_pop_colors = get_colors_from_column(size_la, "Population")

# Step 2: Use function get_areas_from_column (arguments: size_la, "Population")
la_pop_areas = get_areas_from_column(size_la, "Population") * 0.10 # Reduce area size by 90%

In [None]:
# Step 3: Create a new table
la_pop_tbl = size_la.select("Latitude", "Longitude").with_columns("type", "Population",
                                                                    "colors", la_pop_colors,
                                                                    "areas", la_pop_areas)

In [None]:
size_map(la_pop_tbl)

<i>Note: The area of every dot is reduced by 90% for visual purposes.</i>
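
As a quick arithmetic check on scaling factors, the sketch below converts a multiplier on the areas into a percent change. It also shows the matching change in radius under the assumption that a circle's drawn area is proportional to its "areas" value, so the radius scales with the square root. Both helper names are hypothetical.

```python
import math


def area_change_percent(factor):
    """Percent change in area when every area is multiplied by `factor`."""
    return (factor - 1) * 100


def radius_factor(factor):
    """Factor by which a circle's radius changes when its area is multiplied by `factor`."""
    return math.sqrt(factor)


for factor in (0.10, 3):
    print(f"x{factor}: area {area_change_percent(factor):+.0f}%, "
          f"radius x{radius_factor(factor):.2f}")
```

Multiplying by 0.10 is a 90% reduction in area, while multiplying by a factor such as 3 is a 200% increase. In both cases the visible radius changes less sharply than the area does.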

<div class="alert alert-info"> 
<b>Question:</b> Create a size map to map the <b>Pollution Burden</b> values for LA County census tracts.
    
Recall, <b>Pollution_Burden</b> represents the potential exposures to pollutants and the adverse environmental conditions caused by pollution in a given census tract.
</div>

In [None]:
pollution_colors = get_colors_from_column(size_la, "...") 
pollution_areas = get_areas_from_column(size_la, "...") * 3 # Increase area size by 200%

pollution_tbl = size_la.select("Latitude", "Longitude").with_columns("type", "...",
                                                                      "colors", pollution_colors,
                                                                      "areas", pollution_areas)

In [None]:
pollution_tbl.show(10)

In [None]:
size_map(pollution_tbl)

<i>Note: The area of every dot is increased by 200% (multiplied by 3) for visual purposes.</i>

<div class="alert alert-info"> 
    <b>Question:</b> Now, create a size map to map the <b>Poverty</b> values for LA County census tracts.

Recall, <b>Poverty</b> is the percent of the population living below two times the federal poverty level in a given census tract.
</div>

In [None]:
pov_colors = get_colors_from_column(..., "...")
pov_areas = get_areas_from_column(..., "...") * 3 # Increase area size by 200%

poverty_tbl = size_la.select("Latitude", "Longitude").with_columns("type", "...",
                                                                    "colors", pov_colors,
                                                                    "areas", pov_areas)

In [None]:
poverty_tbl.show(10)

In [None]:
size_map(poverty_tbl)

<i>Note: The area of every dot is increased by 200% (multiplied by 3) for visual purposes.</i>

## Using prediction and inference to draw conclusions <a id='subsection 1d'></a>

<div class="alert alert-info"> 
<b>Question:</b> After seeing these map visualizations, tell us something interesting about this data. What detail were you able to uncover?

</div>

*Insert answer here*

<div class="alert alert-info"> 
<b>Question:</b> Visualizing race and ethnicity distributions with different measures of pollution and demographics using maps provides one line of insight into the impacts of systemic racism and environmental injustice. For example, communities of color and the poor often experience disproportionate exposure to pollution as a result of unequal protections through laws, regulations, and more. Let's take a look into this for LA County using Census data.
<br>
    
Create a size map to map either "Hispanic", "White", "Black", "Asian", or "Other" for all LA census tracts. Feel free to explore all options.
<hr>
    
<b>Note:</b> For each census tract, the percentages of Hispanic, White, Black, Asian, and Other add up to 100%
</div>
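
Before comparing groups, it can be worth checking the note above against your own table. The following is a minimal plain-Python sketch of such a check. The rows are invented, the helper `shares_sum_to_100` is hypothetical, and the tolerance of 0.5 is an arbitrary allowance for rounding.

```python
import math

# Toy percentages for two tracts; the numbers are invented for illustration.
tracts = [
    {"Tract": "A", "Hispanic": 48.0, "White": 27.5, "Black": 8.5, "Asian": 13.0, "Other": 3.0},
    {"Tract": "B", "Hispanic": 20.0, "White": 55.0, "Black": 5.0, "Asian": 12.0, "Other": 4.0},
]
groups = ["Hispanic", "White", "Black", "Asian", "Other"]


def shares_sum_to_100(row, groups, tolerance=0.5):
    """Return True if the listed group percentages add up to roughly 100."""
    total = sum(row[group] for group in groups)
    return math.isclose(total, 100.0, abs_tol=tolerance)


# Map each tract to whether its percentages pass the check.
flags = {row["Tract"]: shares_sum_to_100(row, groups) for row in tracts}
print(flags)
```

A tract that fails the check may have missing values, rounding differences, or a group that is not included in the list.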

In [None]:
final_colors = get_colors_from_column(..., "...")
final_areas = get_areas_from_column(..., "...") 

final_tbl = size_la.select("Latitude", "Longitude").with_columns("type", "...",
                                                                    "colors", final_colors,
                                                                    "areas", final_areas)

In [None]:
final_tbl

In [None]:
size_map(final_tbl)

<div class="alert alert-info"> 
<b>Question:</b> What connections are you able to make between the race and ethnicity maps and the pollution burden and poverty maps? What are your thoughts?

</div>

*Insert answer here*

<div class="alert alert-success" role="alert">
  <h2 class="alert-heading">Well done!</h2>
    <p>In this report you used real-world data from CalEPA to draw maps that give you more insight into relationships between pollution and demographics.
    <hr>
    <p> Notebook created for Berkeley Unboxing Data Science 2021 
    <p> Adapted from Project: Pollution by Carlos Ortiz with the support of Ani Adhikari, Deb Nolan, and Will Furtado
</div>