# Notebook 6 - Correlations

In this notebook, we're going to explore correlations.  We'll start with some new variables in the CHIS data.  Most CHIS variables are categorical, so with CHIS you're more likely to need a t-test or a Chi-Square test, but we can still use a few of its numeric variables to demonstrate how correlations work.

Then, we'll look at the merged eviction and ACS data, to test whether neighborhoods with higher shares of renters with cost burdens have higher eviction rates.

As with everything in Python, there are lots of different ways to do the same thing, so we're providing some basic code so you have what you need for Assignment 4.  But you may find that when you work with your own data, you'll need to explore the web for other code.

## 1.0 Setup and Reading in our Libraries

As a reminder, setup should *always* be the first step in your notebook, and you need to run these cells first whenever you open the file, before running any other cells.

In [None]:
#again, we are going to use the correlation function in a new library called pingouin, so I'm going to install that library
!pip install pingouin

In [None]:
!pip install researchpy

In [None]:
# Bring the libraries into the notebook
import pingouin as pg
import researchpy as rp
import numpy as np
import pandas as pd
import math
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.float_format = '{:.4f}'.format

In [None]:
#Show our plots in the Jupyter notebook
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings("ignore")

## 2.0 Importing and Prepping our Data

## 2.1 Importing and Cleaning CHIS Data


In [None]:
#Read in our data 

chis_df = pd.read_csv('chis_extract_2022_weights.csv')

In [None]:
chis_df.rename(columns={"SRAGE_P1": "age", "AE_VEGI":"ate_veg", 'AE_FRUIT': "ate_fruit",
                        "SRSEX": "sex",
                        "OMBSRR_P1": "race_ethnicity",
                        "POVLL" : "pov_cat",
                       "AK22_P1" : "hh_inc",
                       "AM184": "housing_worry",
                       "CV7_1":"covid_lostjob"}, inplace=True)

In [None]:
#here I'm just going to keep my numeric variables 
chis_df=(chis_df[['age','ate_veg', 'ate_fruit']])

### Let's take a minute to look at the age codebook
<img src="AgeCodebook.png" width=800 height=400>

In [None]:
chis_df = chis_df[chis_df['ate_veg'] < 71] 

In [None]:
chis_df = chis_df[chis_df['ate_fruit'] < 71] 

In [None]:
#Because of the nature of the age data, and because I actually think the relationship between age and healthy eating
#isn't linear, I'm going to create a dummy for people between 18 and 29
chis_df['under30_dv']=np.where((chis_df['age']<30),1,0)
chis_df

## 2.2 Importing and Cleaning Eviction Lab data
First, we're going to read in our Eviction Lab data. It is publicly available here: https://data-downloads.evictionlab.org/#data-for-analysis/

In [None]:
# Here is my code for reading in the complete evictions data and filtering it
# If you are using an area outside Alameda County for your assignment, 
# you'll want to run the lines below and alter the file location and county filter.

# data =  pd.read_csv('C:/Users/katea/Downloads/tract_proprietary_valid_2000_2018.csv')
# ac_data = data[data['county']=='Alameda County']
# ac_data.to_csv('C:/Users/katea/Downloads/AC_tract_proprietary_valid_2000_2018.csv', index = False)

#today we're just going to work with the extract
evictions00_18 = pd.read_csv('AC_tract_proprietary_valid_2000_2018.csv')

### Eviction Lab Codebook
You can also find this information in the Excel sheet in DataHub, which I downloaded from the same link above.

|variable_name  |variable_type|description                                                               |
|---------------|-------------|--------------------------------------------------------------------------|
|fips           |numeric      |tract fips                                                                |
|cofips         |numeric      |county fips                                                               |
|tract          |string       |tract name                                                                |
|county         |string       |county name                                                               |
|state          |string       |state name                                                                |
|year           |numeric      |year                                                                      |
|type           |string       |OBSERVED                                                                  |
|filings        |numeric      |number of filings observed in proprietary data                            |
|filing_rate    |numeric      |number of filings per 100 renting households                              |
|threatened     |numeric      |number of households threatened with eviction observed in proprietary data|
|threatened_rate|numeric      |number of households threatened per 100 renting households                |
|judgements     |numeric      |number of judgements observed in proprietary data                         |
|judgement_rate |numeric      |number of judgements per 100 renting households                           |


### 2.2.1 Filtering for one year (2016)

In [None]:
# note: the year is an int64 type, so don't put quotes around 2016!
# .copy() gives us an independent DataFrame, so adding or changing columns later is safe
evictions16 = evictions00_18[evictions00_18['year'] == 2016].copy()
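Here is a small sketch of why the quotes matter. The `toy` DataFrame below is made up for illustration (it is not the Eviction Lab data): comparing an integer column to the text `"2016"` doesn't raise an error, it just quietly matches nothing.

```python
import pandas as pd

# Hypothetical toy data standing in for the eviction extract
toy = pd.DataFrame({"year": [2015, 2016, 2016],
                    "filing_rate": [1.2, 0.8, 2.5]})

print(toy["year"].dtype)                 # an integer dtype

rows_int = toy[toy["year"] == 2016]      # number compared with number
rows_str = toy[toy["year"] == "2016"]    # number compared with text: no matches

print(len(rows_int), len(rows_str))
```

If a filter ever hands you back an empty DataFrame, checking the column's dtype is a good first step.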

## 2.3 Importing and Cleaning ACS data

In [None]:
# Let's read in the ACS data, skipping the second header row 
# (remember, Python numbering starts at 0) when you read in the data
acs16 = pd.read_csv("ACSDT5Y2016.B25070-Data.csv", skiprows = [1])
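To see what `skiprows = [1]` is doing, here is a minimal sketch on a made-up two-header file built in memory with `io.StringIO`. The rows and values are invented for illustration; only the layout (codes on the first line, descriptions on the second) mimics the ACS download.

```python
import io
import pandas as pd

# Hypothetical two-header CSV: line 0 holds column codes, line 1 holds descriptions
raw = io.StringIO(
    "GEO_ID,B25070_001E\n"
    "Geography,Estimate!!Total\n"
    "1400000US06001400100,120\n"
    "1400000US06001400200,340\n"
)

# skiprows=[1] drops only the second line of the file (index 1)
demo = pd.read_csv(raw, skiprows=[1])

print(demo)
print(demo.dtypes)   # the estimate column is read as numbers, not text
```

Without `skiprows`, the description row would be read as data and the estimate column would come in as text, which breaks any arithmetic later on.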

In [None]:
acs16 = acs16[['GEO_ID', 'NAME', 'B25070_001E', 'B25070_002E', 'B25070_003E',
       'B25070_004E', 'B25070_005E', 'B25070_006E', 'B25070_007E',
       'B25070_008E', 'B25070_009E', 'B25070_010E', 'B25070_011E']]

In [None]:
acs16 = acs16.rename(columns = {'B25070_001E':'total', 
                        'B25070_002E':'rb_less10',
                        'B25070_003E':'rb_10_15',
                        'B25070_004E':'rb_15_20',
                        'B25070_005E':'rb_20_25',
                        'B25070_006E':'rb_25_30',
                        'B25070_007E':'rb_30_35',
                        'B25070_008E':'rb_35_40',
                        'B25070_009E':'rb_40_50',
                        'B25070_010E':'rb_50plus',
                        'B25070_011E':'na'})

In [None]:
# Create rent-burdened variable as a share of total (a proportion between 0 and 1, not 0-100)
acs16['pct_rb30plus'] = (acs16['rb_30_35'] + acs16['rb_35_40'] + acs16['rb_40_50'] + acs16['rb_50plus'])/acs16['total']
acs16.describe()

## 2.4 Merging Datasets

In [None]:
# fixing up a fips column with the right number of characters using string subsetting
# note: Python counts from 0, so [10:] starts at index 10 (the 11th character) and runs to the END of the string
acs16['fips'] = acs16['GEO_ID'].str[10:]
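Here is a quick sketch of the slicing on a single string. I'm assuming the `GEO_ID` values look like `1400000US` followed by an 11-digit tract code; the value below is just an illustration, so check a few rows of your own data before relying on the index.

```python
geo_id = "1400000US06001400100"  # illustrative GEO_ID-style string (assumed format)

print(len("1400000US"))  # the prefix takes up indexes 0 through 8
print(geo_id[9:])        # everything after the prefix, leading zero kept
print(geo_id[10:])       # starts one character later, leading zero dropped
```

Under that assumption, slicing from index 10 drops the leading zero, which is what you want when the other dataset stores fips as a number (numbers can't keep a leading zero).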

In [None]:
#align fips datatypes
evictions16['fips'] = evictions16['fips'].astype(str)

In [None]:
# merge data
evict_df = acs16.merge(evictions16, on='fips', how='left', indicator=True) 
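Because we set `indicator=True`, pandas adds a `_merge` column that tells us where each row came from. Here is a minimal sketch on two made-up DataFrames (the fips codes and values are invented) showing how to use it to check a merge.

```python
import pandas as pd

# Hypothetical toy tables standing in for the ACS and eviction data
left = pd.DataFrame({"fips": ["1", "2", "3"],
                     "pct_rb30plus": [0.4, 0.5, 0.6]})
right = pd.DataFrame({"fips": ["1", "3"],
                      "filing_rate": [1.1, 2.2]})

merged = left.merge(right, on="fips", how="left", indicator=True)

# 'both' = matched; 'left_only' = no match found in the right table
print(merged["_merge"].value_counts())

# Keep only the matched rows if you need complete cases
matched = merged[merged["_merge"] == "both"]
```

Running the same `value_counts()` on `evict_df['_merge']` is a quick way to see how many tracts found a match in the eviction data.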

In [None]:
evict_df.columns

## 3 Correlation

The correlation coefficient (sometimes referred to as Pearson's correlation coefficient, Pearson's product-moment correlation, or simply r) measures the strength of the linear relationship between two variables. 

The correlation coefficient is directly linked to the beta coefficient in a linear regression (the slope of the best-fit line), but it has the advantage of being standardized between -1 and 1: the former means a perfect negative linear relationship, and the latter a perfect positive linear relationship. In other words, no matter what the original units of the two variables are, the correlation coefficient will always be in the range of -1 to 1, which makes it very easy to work with.

How to read the correlation coefficient *r*:

> The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear relationship between the variables. 
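If you want to see where r comes from, here is a minimal sketch on simulated data (not the CHIS or eviction data). It computes r by hand as the covariance of x and y divided by the product of their standard deviations, checks it against NumPy's built-in `np.corrcoef`, and then recovers the same number from the regression slope.

```python
import numpy as np

# Simulated data for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# r by hand: sum of cross-deviations over the square root of the
# product of the two sums of squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# r from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

# r from the best-fit slope: slope * (sd of x / sd of y)
slope = np.polyfit(x, y, 1)[0]
r_from_slope = slope * x.std() / y.std()

print(r_manual, r_numpy, r_from_slope)
```

All three numbers should agree, which is the sense in which r is a standardized version of the slope.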

<img src="py-corr-1.webp" width=800 height=400 />

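In hypothesis testing, you want to find not only the correlation coefficient (the r value) but also the p-value: the probability of seeing a correlation at least this strong in your sample if the true correlation were zero.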
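As a rough sketch of how r and the p-value travel together, here is an example using `scipy.stats.pearsonr` on simulated data. The variables are made up: one y is built to depend on x, the other is pure noise.

```python
import numpy as np
from scipy import stats

# Simulated data for illustration only
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y_related = 0.5 * x + rng.normal(size=100)   # built to depend on x
y_unrelated = rng.normal(size=100)           # independent of x

# pearsonr returns the correlation coefficient and a two-sided p-value
r1, p1 = stats.pearsonr(x, y_related)
r2, p2 = stats.pearsonr(x, y_unrelated)

print(f"related:   r = {r1:.3f}, p = {p1:.4f}")
print(f"unrelated: r = {r2:.3f}, p = {p2:.4f}")
```

Compare each p-value to your significance level (for example 0.05) to decide whether you can reject the null hypothesis of no linear relationship.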

In [None]:
#Let's start by plotting the two numeric variables
#(recent versions of seaborn expect x= and y= to be passed as keywords)
sns.regplot(x=chis_df["ate_fruit"], y=chis_df["ate_veg"], ci=None, scatter_kws={"color": "black"}, line_kws={"color": "red"})

In [None]:
#Let's run our correlation test
pg.corr(x=chis_df['ate_fruit'], y=chis_df["ate_veg"])

In [None]:
#how about for age?
pg.corr(x=chis_df['age'], y=chis_df["ate_veg"])

In [None]:
#Let's look at what we find if we use the dummy instead?  What's going on?
rp.ttest((chis_df[chis_df['under30_dv']==0].ate_veg), (chis_df[chis_df['under30_dv']==1].ate_veg))

### Let's test our hypothesis that higher rent burdens are associated with higher eviction filing rates, and see whether the relationship is statistically significant

In [None]:
evict_df.describe()

In [None]:
# Plot eviction filing rates against the share of rent-burdened households
sns.regplot(x=evict_df["filing_rate"], y=evict_df["pct_rb30plus"], ci=None, scatter_kws={"color": "black"}, line_kws={"color": "red"})

In [None]:
pg.corr(x=evict_df['filing_rate'], y=evict_df["pct_rb30plus"])

In [None]:
#You can actually get the correlations for all the numeric variables in your dataset,
#known as a correlation matrix, at the same time
#(numeric_only=True skips text columns such as NAME and county)
corr=evict_df.corr(numeric_only=True)
corr
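Here is a small sketch of a correlation matrix on a made-up DataFrame that mixes a text column with numeric ones. The tract names and values are invented, and `numeric_only=True` assumes a reasonably recent pandas (1.5 or later).

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "NAME": ["Tract A", "Tract B", "Tract C", "Tract D"],
    "pct_rb30plus": [0.35, 0.50, 0.42, 0.61],
    "filing_rate": [0.9, 2.1, 1.4, 2.8],
})

# Text columns are left out; every numeric pair gets a Pearson r
corr_demo = toy.corr(numeric_only=True)
print(corr_demo)
```

The diagonal is always 1 (each variable is perfectly correlated with itself), and the matrix is symmetric, which is why a heatmap often shows only one triangle.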

In [None]:
#and create a correlation heatmap
# Set up the matplotlib plot configuration
f, ax = plt.subplots(figsize=(12, 10))

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Configure a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap
sns.heatmap(corr, annot=True, mask = mask, cmap=cmap)

## 4.0  That's it!  

You now have all the tools you need to be able to complete Assignment 4!  It's time to practice and dedicate time to pulling everything from the semester together into your final case study!