# Air Quality and Social Justice

In this notebook we explore how air quality (pm25 and ozone) relates to income and elevation. This lets us ask questions related to social justice, such as whether air pollution exposure is equally distributed across different socio-economic groups.

#### Tell our plotting routines to draw the graphics in our web browser

In [None]:
%matplotlib inline

### Import packages that we are going to use
#### These include

* [Pandas](http://pandas.pydata.org/index.html): A package for reading and manipulating tabular data.
* [Seaborn](http://seaborn.pydata.org/index.html): A plotting package that provides "nice" pre-defined plots for a variety of common graph types.
* [statsmodels](http://www.statsmodels.org/stable/): This package provides a number of common statistical functions for analysing data.


In [None]:
import pandas as pd
import seaborn as sns
import statsmodels
import statsmodels.formula.api as smf
import patsy
import os
import matplotlib.pyplot as plt

#### Tell our program where our data are located

In [None]:
DATADIR = "/home/jovyan/DATA/AirQuality"

In [None]:
os.listdir(DATADIR)

### What are these data

In our data directory we have two **csv** files. csv stands for **C**omma **S**eparated **V**alues. For each region (zip code) in the Salt Lake Valley, we have computed the average:

1. Income
1. Elevation
1. pm25 levels for March 8, 2016
1. ozone levels at 10:00 AM and 3:00 PM in August 2016.

These files have been created by Dr. Daniel L. Mendoza at the University of Utah.

## Use Pandas to read in the data

Pandas reads the data into a Pandas **dataframe**, which we assign to the variable ``pm25``. Dataframes have two methods for taking a quick look at the data: ``head()`` shows the first rows and ``tail()`` shows the last rows.

In [None]:
pm25 = pd.read_csv(os.path.join(DATADIR, "Class_PM25_Data.csv"))
pm25.head(), pm25.tail()

#### Notice the ``NaN`` value for Income in our last data row

Pandas uses ``NaN`` (Not a Number) to represent missing values. That is, our data file did not have an income value for the last row. There are a variety of approaches for dealing with missing data, but we are just going to drop that row using the Pandas dataframe ``dropna`` method.
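
Before dropping rows it can help to count how many values are actually missing. The following is a minimal sketch on a tiny made-up dataframe (the ``Region`` column and all of the numbers are hypothetical, not the class data); ``isna()`` and ``dropna()`` are standard Pandas dataframe methods.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; these are not the class data.
toy = pd.DataFrame({
    "Region": ["A", "B", "C", "D"],
    "Income": [52000.0, 61000.0, 48000.0, np.nan],
    "PM25_MAR_8": [9.1, 7.4, 10.2, 8.8],
})

# Count the missing values in each column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() returns a new dataframe without any row that contains a NaN.
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Because ``dropna()`` returns a new dataframe rather than changing the old one, the result has to be assigned to a variable to be kept.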



In [None]:
pm25 = pm25.dropna()
pm25.tail()

### Now let's repeat this with the ozone data

In [None]:
ozone = pd.read_csv(os.path.join(DATADIR, "Class_Ozone_Data.csv"))
ozone.tail()

In [None]:
ozone = ozone.dropna()

## Let's plot some of our data

In our introduction to Python we plotted using Pandas directly. When we explored the relationship of wind and air pollution, we used [matplotlib](https://matplotlib.org/) to plot. In this notebook we are going to explore using [Seaborn](http://seaborn.pydata.org/index.html) to plot data.

Our first plot is going to be to look at a [histogram](https://en.wikipedia.org/wiki/Histogram) of our particle measurements.

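Seaborn has several options for plotting histograms. We are going to use the ``distplot()`` function. (In Seaborn 0.11 and later, ``distplot()`` is deprecated in favor of ``histplot()`` and ``displot()``, so newer installations will print a warning.)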
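
A histogram can show either raw counts per bin or a density, where the areas of the bars sum to 1. The sketch below uses NumPy's ``np.histogram`` on synthetic numbers (not the class data) to show the difference. As far as we know, ``distplot()`` switches its y-axis to density whenever a KDE curve is drawn, which is worth keeping in mind when labelling the axis.

```python
import numpy as np

# Synthetic values for illustration only; these are not the class data.
rng = np.random.default_rng(0)
values = rng.normal(loc=8.0, scale=2.0, size=200)

# Raw counts per bin, and the same bins normalised to a density.
counts, edges = np.histogram(values, bins=10)
density, _ = np.histogram(values, bins=10, density=True)
widths = np.diff(edges)

print("sum of counts:", counts.sum())
print("area under density bars:", (density * widths).sum())
```

The counts add up to the number of observations, while the density bars enclose a total area of 1 (up to floating point rounding).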

* What happens when you change ```kde``` from True to False?

In [None]:
fig1, ax1 = plt.subplots(1)
sns.distplot(pm25["PM25_MAR_8"], ax=ax1, kde=True)
ax1.set_xlabel("PM25")
ax1.set_ylabel("Density")
ax1.set_title("Histogram of PM25 for March 8, 2016")


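Next we look at how two variables vary together. Seaborn's ``jointplot`` draws the relationship between two variables along with the distribution of each one. ``jointplot`` has a number of options that determine what kind of joint plot to generate. The default is "scatter" but you can use any of the following: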

* "scatter"
* "reg"
* "resid"
* "kde"
* "hex"

You can also change which color you want to plot.

In [None]:
sns.jointplot(x="Income", y="PM25_MAR_8", data=pm25, kind="reg",
             color='purple');

### We can see that there is a reasonably linear relationship between pollution and income
#### Let's quantify this

We will use ``statsmodels`` to do [ordinary least squares regression](https://en.wikipedia.org/wiki/Ordinary_least_squares).

```Python
mod = smf.ols(formula='PM25_MAR_8 ~ Income', data=pm25)
```

* Use Patsy formula syntax to specify what our regression is. This formula states a linear relationship between ``PM25_MAR_8`` and ``Income``.

```Python
'PM25_MAR_8 ~ Income'
```
```Python
data=pm25
```

* Use the ``pm25`` dataframe
* ``res = mod.fit()`` fits the model to the data and stores the fitted results in ``res``
* ```print(res.summary())``` provides a detailed report on how well the model fits the data.
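
To make the fitting step less of a black box, here is a minimal NumPy sketch of what an ordinary least squares fit of one variable on another computes. The numbers are made up for illustration (they are not the class data), and this is only a sketch of the idea, not how ``statsmodels`` is implemented internally. Note that a formula such as ``'PM25_MAR_8 ~ Income'`` includes an intercept automatically, which is why the design matrix below has a column of ones.

```python
import numpy as np

# Made-up (income, pollution) pairs for illustration; not the class data.
income = np.array([30000.0, 45000.0, 60000.0, 75000.0, 90000.0])
pollution = np.array([10.1, 9.4, 9.0, 8.2, 7.9])

# Design matrix: a column of ones (the intercept) and the predictor.
X = np.column_stack([np.ones_like(income), income])

# Least squares picks the coefficients that minimise the sum of squared residuals.
coefs, *_ = np.linalg.lstsq(X, pollution, rcond=None)
intercept, slope = coefs

# Cross-check against the textbook formula: slope = cov(x, y) / var(x).
slope_check = np.cov(income, pollution, ddof=1)[0, 1] / np.var(income, ddof=1)

print("intercept:", intercept)
print("slope:", slope)
print("slope from cov/var:", slope_check)
```

Both routes give the same slope, which is the number reported in the ``coef`` column of the summary.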

In [None]:
mod = smf.ols(formula='PM25_MAR_8 ~ Income', data=pm25)
res = mod.fit()
print(res.summary())

### This is a lot of information. What are the key points?

#### Let's start by looking at our overall model.

* **Prob (F-statistic):**           8.48e-10 ($8.48 \times 10^{-10}$). 
    * This is the probability of seeing a linear fit at least this strong purely by chance if there were no real linear relationship between our variables.
* **R-squared:**                       0.458
    * This is the proportion of the variability in our data that is explained by our model.
* **Cond. No.**                     1.22e+05 ($1.22 \times 10^{5}$)
    * A large condition number indicates possible numerical problems with our model/data and means the results may be less reliable.
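
The two quantities above can be computed by hand. The sketch below uses made-up numbers (not the class data) to compute R-squared from the residuals and to show one common reason for a large condition number: a predictor measured in large units. Expressing income in thousands of dollars lowers the condition number of the design matrix. Treat this as an illustration of the idea; we are assuming, not asserting, that unit scale is what drives the value reported for the class data.

```python
import numpy as np

# Made-up values for illustration; not the class data.
income = np.array([30000.0, 45000.0, 60000.0, 75000.0, 90000.0])
pollution = np.array([10.1, 9.4, 9.0, 8.2, 7.9])


def design(x):
    """Build a design matrix: an intercept column plus one predictor column."""
    return np.column_stack([np.ones_like(x), x])


# R-squared: 1 - (residual sum of squares) / (total sum of squares).
X = design(income)
coefs, *_ = np.linalg.lstsq(X, pollution, rcond=None)
residuals = pollution - X @ coefs
total_ss = ((pollution - pollution.mean()) ** 2).sum()
r_squared = 1.0 - (residuals ** 2).sum() / total_ss

# Condition number of the design matrix, in dollars and in thousands of dollars.
cond_dollars = np.linalg.cond(design(income))
cond_thousands = np.linalg.cond(design(income / 1000.0))

print("R-squared:", r_squared)
print("condition number (dollars):", cond_dollars)
print("condition number (thousands):", cond_thousands)
```

Rescaling a predictor changes the size of its coefficient but not the fitted line, so R-squared is unaffected by the change of units.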

#### Now let's look at our ``Income`` variable

* **coef**=-3.74e-05 ($-3.74 \times 10^{-5}$). This is the slope of the line. The slope is negative, meaning as income **increases** air pollution **decreases.**
* **P>|t|**=0.000. This is the "p-value": the probability of seeing a slope at least this far from zero purely by chance if there were no real linear relationship. A value shown as 0.000 has been rounded; it is very small, not exactly zero.
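
A slope this small is easier to read after multiplying it by a realistic difference in income. The sketch below does that arithmetic with the coefficient reported above. The income differences are arbitrary illustrative values, and the sketch assumes that ``Income`` is recorded in dollars, which the data description does not state explicitly.

```python
# Slope reported in the regression summary (pollution units per unit of income).
slope = -3.74e-05

# Hypothetical income differences, chosen only for illustration.
income_differences = (1_000, 10_000, 50_000)

predicted_changes = {}
for difference in income_differences:
    # A linear model predicts a change of slope * (difference in the predictor).
    predicted_changes[difference] = slope * difference
    print(f"Income +{difference:>6}: predicted PM25 change {predicted_changes[difference]:+.3f}")
```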


## Let's repeat this for the relationship between ``Income`` and ``Elevation``

In [None]:
sns.jointplot(y="Income", x="Elevation", data=pm25, kind="reg",
             color='purple');

### The relationship between ``Income`` and ``Elevation`` is strong
### Now let us use ``Elevation`` to predict ``PM25_MAR_8``

In [None]:
mod = smf.ols(formula='PM25_MAR_8 ~ Elevation', data=pm25)
res = mod.fit()
print(res.summary())

### Can we build a model that uses *both* ``Elevation`` and ``Income`` to predict ``PM25_MAR_8``?

In [None]:
mod = smf.ols(formula='PM25_MAR_8 ~ Elevation + Income + Elevation:Income', data=pm25)
res = mod.fit()
print(res.summary())

### What happens when we include both predictors?
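
The term ``Elevation:Income`` in the formula is an interaction: a column holding the product of the two predictors. The sketch below builds the corresponding design matrix by hand with NumPy, using made-up numbers (not the class data). It is meant to show what the formula asks for, not how ``statsmodels`` builds it internally. In Patsy formula syntax, ``Elevation * Income`` is a shorthand for the same three terms.

```python
import numpy as np

# Made-up values for illustration; not the class data.
elevation = np.array([1290.0, 1310.0, 1350.0, 1400.0, 1450.0, 1500.0])
income = np.array([38000.0, 52000.0, 47000.0, 70000.0, 66000.0, 90000.0])
pollution = np.array([10.3, 9.8, 9.6, 8.7, 8.5, 7.6])

# Columns implied by 'Elevation + Income + Elevation:Income' (plus the intercept).
X = np.column_stack([
    np.ones_like(elevation),  # Intercept
    elevation,                # Elevation
    income,                   # Income
    elevation * income,       # Elevation:Income interaction
])

coefs, *_ = np.linalg.lstsq(X, pollution, rcond=None)
names = ["Intercept", "Elevation", "Income", "Elevation:Income"]
for name, value in zip(names, coefs):
    print(f"{name:>17}: {value: .3e}")

# Correlated predictors make the individual coefficients harder to pin down.
predictor_corr = np.corrcoef(elevation, income)[0, 1]
print("corr(Elevation, Income):", predictor_corr)
```

When two predictors are strongly correlated with each other, their separate coefficients and p-values become unstable even if the model as a whole fits well, which is something to look for when comparing this summary with the single-predictor ones.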

# Now let's look at Ozone

![Ozone](https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Ozone_cycle.svg/1280px-Ozone_cycle.svg.png)

## We have ozone measurements at two times: 10:00 AM and 3:00 PM (15:00)

#### How are these measurements different? (use ``distplot()`` to explore the histograms)
#### What do you think might account for these differences?

In [None]:
fig2, ax2 = plt.subplots(1)
sns.distplot(ozone["Ozone_AUG_10"], ax=ax2, label="10:00")
sns.distplot(ozone["Ozone_AUG_15"], ax=ax2, color='r', label="15:00")
ax2.set_xlabel("Ozone Density")
ax2.legend()
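
Overlaid histograms give a visual comparison; summary statistics give a numerical one. The sketch below uses a tiny made-up dataframe with the same column names as the ozone data. The values are invented and say nothing about the real measurements. ``describe()`` is a standard Pandas method.

```python
import pandas as pd

# Made-up ozone readings for illustration; not the class data.
toy_ozone = pd.DataFrame({
    "Ozone_AUG_10": [31.0, 33.5, 35.0, 36.2, 38.1],
    "Ozone_AUG_15": [52.4, 55.0, 57.3, 58.8, 61.0],
})

# describe() reports count, mean, spread, and quartiles for each column.
summary = toy_ozone[["Ozone_AUG_10", "Ozone_AUG_15"]].describe()
print(summary)

# Change between the two times of day, computed row by row (one row per region).
afternoon_minus_morning = toy_ozone["Ozone_AUG_15"] - toy_ozone["Ozone_AUG_10"]
print("mean change:", afternoon_minus_morning.mean())
```

Running the same two steps on the real ``ozone`` dataframe gives numbers to set beside the histograms.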

## Ozone vs pm25

#### Using the same plotting and statistical functions we used above, explore the relationship between ozone and income/elevation
#### How does the relationship differ from the one we found for pm25?

In [None]:
sns.jointplot(x="Income", y="Ozone_AUG_15", data=ozone, kind="reg",
             color='purple');

In [None]:
mod = smf.ols(formula='Ozone_AUG_15 ~ Elevation', data=ozone)
res = mod.fit()
print(res.summary())

In [None]:
mod = smf.ols(formula='Ozone_AUG_15 ~ Income', data=ozone)
res = mod.fit()
print(res.summary())