## 1 Correlation

This notebook gives you an opportunity to calculate correlation coefficients and practice interpreting them. We're going to look at the Enviroscreen data, which includes census-tract-level data on a range of environmental hazards, health outcomes (e.g., asthma), and socioeconomic and demographic variables such as race and poverty. I've provided an extract of the complete dataset below.

In [None]:
#Call our libraries

import numpy as np
import pandas as pd
from scipy import stats
import math
from datascience import *

pd.options.display.float_format = '{:.4f}'.format

In [None]:
#Read in our data, forcing the FIPS code to come in as a string
dtype_dic= {'FIPS':str}
ej_df=pd.read_csv('Enviroscreen_Extract.csv', delimiter = ',', dtype=dtype_dic)

In [None]:
#Take a look at the dataset
ej_df.head()

### 1.1 Let's Plot the Data First

Correlations are easy to visualize: we just set one variable as our x axis and one variable as our y axis. We are going to call in the matplotlib library and use `%matplotlib inline` so that plots display directly in the notebook.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import scipy 
%matplotlib inline

In [None]:
plt.scatter(ej_df["Poverty"], ej_df["EnviroScore"])
plt.xlabel("Poverty")
plt.ylabel("EnviroScore")

It does look like there is a relationship, but it's also a lot of dots!  Let's create a new dataframe that only includes the census tracts for Alameda County.

In [None]:
ej_df.drop(ej_df[ej_df['County']!="Alameda"].index, inplace = True)  
ej_df

What happened?  Why didn't it keep the Alameda rows?

Unfortunately, sometimes when you import data from a previously formatted dataset like Enviroscreen, the .csv preserves blank spaces before or after a string variable. Python requires the match to be exact, so it doesn't recognize "  Alameda  " as "Alameda". Because the drop above was done in place, the dataframe is now empty. Run the data import statement above again to re-import the raw data, and then run the code below to try again.

In [None]:
# This code removes the blank spaces before and after each value in the county variable
ej_df['County'] = ej_df['County'].str.strip()

#Now, I can select my Alameda rows
ej_df.drop(ej_df[ej_df['County']!="Alameda"].index, inplace = True)  
ej_df
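
If you want to check for stray spaces before filtering, here is a minimal sketch of one way to do it. It uses a small made-up dataframe called `toy` (not the Enviroscreen file), and it keeps the matching rows with a boolean mask instead of an in-place drop, so the original data stays intact and doesn't need to be re-imported.

```python
import pandas as pd

# Toy data standing in for the County column (values are made up for illustration)
toy = pd.DataFrame({
    "County": ["  Alameda  ", "Alameda", " Fresno", "Alameda "],
    "Poverty": [10.0, 20.0, 30.0, 40.0],
})

# Printing the values as a plain Python list shows the quotes, so padding is easy to spot
print(toy["County"].unique().tolist())

# An exact match only finds the one value with no padding
exact_matches = int((toy["County"] == "Alameda").sum())

# Strip leading/trailing whitespace, then keep the matching rows in a new dataframe
toy["County"] = toy["County"].str.strip()
alameda = toy[toy["County"] == "Alameda"].copy()

print(exact_matches, len(alameda), len(toy))
```

In the notebook itself, the same pattern would be `ej_df[ej_df['County'] == "Alameda"].copy()` saved to a new name, which leaves `ej_df` untouched.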

In [None]:
#Make a scatterplot for the data for Alameda county here

Eyeballing patterns is not the same as a statistical measure of correlation; to support what you think you see, you need to quantify it. Looking at tables is not a very rigorous way to judge how strongly a variable relates to an outcome. What does it mean for a variable such as "income" to match 7 out of the top 15 social disorder points? Does that pattern hold for the rest of the results? How strong is the relationship?

### 1.2 The correlation coefficient - *r*

> The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables. ~Wikipedia

*r* = 1: the scatter diagram is a perfect straight line sloping upwards.

*r* = -1: the scatter diagram is a perfect straight line sloping downwards.

*r* = 0: there is no linear relationship between the two variables.
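
To make these cases concrete, here is a small sketch that computes *r* straight from its definition (the sum of the products of deviations from the mean, divided by the square root of the product of the summed squared deviations). The helper name `pearson_r` and the toy numbers are my own for illustration; they are not part of the Enviroscreen data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_up = pearson_r(x, 2 * x + 1)       # points exactly on an upward-sloping line
r_down = pearson_r(x, -3 * x + 10)   # points exactly on a downward-sloping line
r_noisy = pearson_r(x, np.array([2.0, 1.0, 4.0, 3.0, 5.0]))  # upward trend with scatter

# Cross-check the hand calculation against numpy's built-in correlation matrix
r_numpy = np.corrcoef(x, np.array([2.0, 1.0, 4.0, 3.0, 5.0]))[0, 1]

print(r_up, r_down, r_noisy, r_numpy)
```

The two perfect lines give values at the ends of the range, while the scattered points land in between.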

Let's calculate the correlation coefficient between poverty and the Enviroscreen score.

In [None]:
ej_df['EnviroScore'].corr(ej_df["Poverty"], method='pearson')

In [None]:
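#numeric_only=True skips text columns such as FIPS and County; newer versions of pandas raise an error on them otherwise
ej_df.corr(numeric_only=True)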

In [None]:
#Let's fit our line - np.polyfit can't handle NaNs, so we're going to drop census tracts that are missing either variable.

drop_na = ej_df.dropna(subset=["Poverty", "EnviroScore"])  # only drop rows missing the two columns we plot
x = drop_na["Poverty"]
y = drop_na["EnviroScore"]

plt.scatter(x, y)
plt.xlabel("Poverty", fontsize=18)
plt.ylabel("EnviroScore", fontsize=18)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color="r") #calculate line of best fit
plt.show()
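
The line of best fit and the correlation coefficient are closely related: for a straight-line least-squares fit, the slope equals *r* multiplied by the ratio of the standard deviation of y to the standard deviation of x. Here is a quick sketch that checks this on made-up toy numbers (not the Enviroscreen data).

```python
import numpy as np

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.0, 2.0, 8.0, 6.0, 10.0])

# np.polyfit with degree 1 returns the coefficients highest power first: slope, then intercept
slope, intercept = np.polyfit(x, y, 1)

# Correlation coefficient from numpy's correlation matrix
r = np.corrcoef(x, y)[0, 1]

# The least-squares slope rebuilt from r and the two standard deviations
slope_from_r = r * y.std() / x.std()

print(slope, intercept, r, slope_from_r)
```

So a steeper line does not mean a stronger correlation: the slope also depends on the units and spread of each variable, while *r* does not.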

### 1.3 Getting P Values for Correlation

It turns out that the pandas correlation function doesn't produce p-values automatically, so you would have to code them yourself. Instead, I found another cool library, pingouin, that has a super helpful correlation function. Just like with researchpy, we need to install the library.

Note that we learned that you can't put a # comment in the same cell as pip install.

In [None]:
%pip install pingouin

In [None]:
import pingouin as pg

In [None]:
pg.corr(x=ej_df['EnviroScore'], y=ej_df["Poverty"])
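
As a cross-check, `scipy.stats` (imported at the top of this notebook as `stats`) also has a `pearsonr` function that returns both *r* and a two-sided p-value. Here is a minimal sketch on made-up toy numbers; on the real data you would need to drop missing values first, since `pearsonr` does not remove NaNs for you.

```python
import numpy as np
from scipy import stats

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(x, y)

print(f"r = {r:.4f}, p = {p_value:.4f}")
```

The p-value is the probability of seeing a correlation at least this strong if the two variables were actually unrelated, so a small p-value is evidence against "no linear relationship".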

In [None]:
#If we want to get the p values for the whole matrix, the code is:
ej_df.rcorr(stars=False)

In [None]:
#What happens when you take out the stars=False code?