## Title: 


Contributors: Nicole Bills and Allison Lee

### Table of Contents
1. <a href='#goal'>Problem Statement</a>
2. <a href='#datasources'>Data Sources</a>
3. <a href='#collection'>Data Collection</a>
4. <a href='#testing'>Hypothesis Test I</a>
5. <a href='#test2'>Hypothesis Test II</a>

In [143]:
# Import libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

<a id='goal'></a>
### Problem Statement

The goal of this analysis is to gain a better understanding of eviction rates in Ward 8, Washington, D.C. We aim to test two hypotheses:
1. There is a significant difference between eviction rates in census tracts with Planned Unit Developments (PUDs) and those without. 
2. Census tracts where the poverty rate is above 40 percent have higher eviction rates than those where the poverty rate is below 40 percent. 

### Approach

<a id='datasources'></a>
### Data Sources

The Eviction Lab: https://evictionlab.org/

This research uses data from The Eviction Lab at Princeton University, a project directed by Matthew Desmond and designed by Ashley Gromis, Lavar Edmonds, James Hendrickson, Katie Krywokulski, Lillian Leung, and Adam Porton. The Eviction Lab is funded by the JPB, Gates, and Ford Foundations as well as the Chan Zuckerberg Initiative. More information is found at evictionlab.org.


Open Data DC: https://opendata.dc.gov/
 - Planned Unit Developments: https://opendata.dc.gov/datasets/1b3e77aaa6154d1285af639323b0504f_14/data

<a id='collection'></a>
### Data Cleaning

<a id='testing'></a>
### Hypothesis Test I

We selected a significance level of alpha = 0.05. 

We set our null and alternative hypotheses as follows:

**H0:** Mean eviction rates are the same in census tracts with PUDs and in tracts without PUDs.

**Ha:** Mean eviction rates differ between census tracts with PUDs and tracts without PUDs.

To test these claims, we compared the means of two samples: one sample of tracts in DC with PUDs, and one sample of tracts without PUDs. We assume the samples are independent and drawn from normally distributed populations, and we use a two-tailed t-test to assess whether the difference in means can be attributed to random chance. We selected Welch's t-test because our sample sizes are small (fewer than thirty) and we do not assume equal population variances.
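
To make the choice of test concrete, here is a minimal sketch of how Welch's t statistic and its approximate degrees of freedom can be computed by hand and compared with `scipy.stats.ttest_ind(..., equal_var=False)`. The two arrays are hypothetical values invented for illustration, not the Ward 8 data, and `welch_t` is a helper name introduced here.

```python
import numpy as np
from scipy import stats

# Hypothetical eviction rates for two small groups of tracts (NOT the Ward 8 data).
group_a = np.array([4.1, 5.3, 6.0, 3.8, 7.2, 5.5])
group_b = np.array([3.9, 4.4, 8.1, 2.7, 6.6])


def welch_t(a, b):
    """Return Welch's t statistic, degrees of freedom, and two-sided p-value."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = a.var(ddof=1) / len(a)
    se2_b = b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    # Two-sided p-value from the t distribution's survival function.
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p


t_manual, dof_manual, p_manual = welch_t(group_a, group_b)
scipy_result = stats.ttest_ind(group_a, group_b, equal_var=False)

print(t_manual, scipy_result.statistic)
print(p_manual, scipy_result.pvalue)
```

Because the variances are not pooled, the degrees of freedom are generally not an integer and are never larger than `len(a) + len(b) - 2`.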

In [114]:
# TO DO ->  prove assumptions have been met
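
One possible way to check the normality assumption is a Shapiro-Wilk test on each sample. The sketch below is an assumption about how that check might look, not the authors' method; `check_normality` is a hypothetical helper and the data are randomly generated stand-ins. Independence of the samples cannot be tested this way and has to be argued from how the tracts were selected.

```python
import numpy as np
from scipy import stats


def check_normality(sample, alpha=0.05):
    """Run a Shapiro-Wilk test; a p-value below alpha suggests non-normality."""
    statistic, p_value = stats.shapiro(sample)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "looks_normal": bool(p_value >= alpha),
    }


# Hypothetical stand-in for one sample of tract-level eviction rates.
rng = np.random.default_rng(0)
example_rates = rng.normal(loc=5.0, scale=1.5, size=20)

normality = check_normality(example_rates)
print(normality)
```

In the notebook, the same helper could be applied to the `eviction-rate` column of each sample. With samples this small the test has limited power, so a histogram or Q-Q plot is a useful complement.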

In [144]:
df = pd.read_csv('../../data/ward8.csv')

In [145]:
# Create a binary indicator: 1 if the tract has a PUD (PUD_NAME is present), 0 if it has none.
df['PUD'] = df['PUD_NAME'].notna().astype(int)

In [152]:
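# Split the tracts into the two samples.
# PUD == 1 marks tracts with a PUD; PUD == 0 marks tracts with none.
with_PUDS = df[df['PUD'] == 1]

no_PUDS = df[df['PUD'] == 0]

In [153]:
# Welch's t-test (equal_var=False). The no-PUD sample is passed first,
# so a positive statistic means the no-PUD tracts have the higher mean eviction rate.
h1_result = stats.ttest_ind(no_PUDS['eviction-rate'], with_PUDS['eviction-rate'], equal_var=False)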

In [154]:
h1_result

Ttest_indResult(statistic=0.31466609718334176, pvalue=0.7592466663700481)

In [None]:
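# Interpretation: the p-value (about 0.76) is far above alpha = 0.05, so we fail to reject H0.
# The data give no evidence that mean eviction rates differ between tracts with and without PUDs.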

<a id='test2'></a>
### Hypothesis Test II

We selected a significance level of alpha = 0.05.

We set our null and alternative hypotheses as follows:

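**H0:** Census tracts with poverty rates higher than 40 percent have the same or lower mean eviction rates than tracts with poverty rates below 40 percent.

**Ha:** Census tracts with poverty rates higher than 40 percent have higher mean eviction rates than tracts with poverty rates below 40 percent.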

To test our alternative hypothesis, we used a one-tailed Welch's t-test. This test is appropriate because the sample sizes are small and the population variances are unknown and not assumed to be equal.

In [115]:
# TO DO - prove assumptions have been met

In [155]:
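# Split tracts at the 40 percent poverty-rate threshold
# (a tract at exactly 40 percent falls in the high-poverty group).
poverty = df[df['poverty-rate'] >= 40.0]
low = df[df['poverty-rate'] < 40.0]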

In [158]:
h2_test = stats.ttest_ind(poverty['eviction-rate'], low['eviction-rate'], equal_var = False)

In [159]:
h2_test

Ttest_indResult(statistic=-2.5966254090573613, pvalue=0.015633962244444684)

In [162]:
output_pvalue = h2_test.pvalue

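`scipy.stats.ttest_ind` returns a two-sided p-value by default. Halving it gives the one-tailed p-value only when the sign of the test statistic matches the direction of Ha. Here the statistic is negative (about -2.60) with the high-poverty sample passed first, which means the high-poverty tracts have the lower mean eviction rate in this sample. The halved value computed below is therefore the p-value for the opposite alternative (lower eviction rates in high-poverty tracts). For our stated Ha, the one-tailed p-value is one minus the halved value, about 0.99, so we fail to reject H0 at alpha = 0.05.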

In [164]:
p_value = output_pvalue/2
p_value

0.007816981122222342
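
As a cross-check on the one-tailed calculation, `scipy.stats.ttest_ind` accepts an `alternative` argument (SciPy 1.6 and later) that returns the one-sided p-value directly. The sketch below uses hypothetical arrays, not the Ward 8 data, in which the first group has the lower mean. It shows that halving the two-sided p-value matches `alternative='greater'` only when the statistic is positive. `one_tailed_greater` is a helper name introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical eviction rates (NOT the Ward 8 data); the first group has the lower mean.
high_poverty = np.array([3.1, 2.8, 4.0, 3.5, 2.9, 3.3])
low_poverty = np.array([5.2, 6.1, 4.8, 7.0, 5.5, 6.3, 4.9])

# Two-sided Welch's t-test (the default alternative).
two_sided = stats.ttest_ind(high_poverty, low_poverty, equal_var=False)

# One-sided test of Ha: mean(high_poverty) > mean(low_poverty).
one_sided = stats.ttest_ind(
    high_poverty, low_poverty, equal_var=False, alternative='greater'
)


def one_tailed_greater(statistic, two_sided_p):
    """Convert a two-sided p-value to the one-sided p-value for Ha: first mean > second mean."""
    half = two_sided_p / 2
    # Halving is valid only when the statistic points in the direction of Ha.
    return half if statistic > 0 else 1 - half


manual = one_tailed_greater(two_sided.statistic, two_sided.pvalue)
print(two_sided.statistic, manual, one_sided.pvalue)
```

Passing `alternative='greater'` avoids the manual sign check entirely, at the cost of requiring a recent SciPy version.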