<h1> Project: Housing Evictions and Fair Market Rents in New York City</h1> <a id=7></a>
<h3> Amelia Ingram, Josh Megnauth, Rameasa Arna, and Abby Stricklan</h3>


## Table of Contents 

<div class = "alert alert-info">

1. [Introduction](#1)<br>
2. [Research Question](#2)<br>
    2.1 [Assumptions](#2.1)<br>
3. [Data](#3)<br>
    3.1 [Univariate Analysis](#3.1)<br>
    3.2 [Mapping Analysis](#3.2)<br>
4. [Data Analysis](#4)<br>
    4.1 [Bivariate Analysis](#4.1)<br>
    4.2 [Multivariate Analysis](#4.2)<br>
5. [Summary Conclusions](#5)<br>
6. [References](#6)<br>
    
</div>
<hr>

## Introduction <a id=1></a>
In the wake of the COVID-19 pandemic, New Yorkers have feared a rise in housing evictions across the city as landlords attempt to recover their losses and forcibly evict tenants.  Now that housing courts have reopened and the eviction deferment period has long passed, evictions are rising across the city.  According to a New York Times article, the housing chaos began this spring:<br> 
> "The roughly 2,000 eviction cases filed by landlords every week since March are roughly 40 percent more than the number filed in mid-January [2022], after the state’s eviction moratorium expired. Tenants have been thrown out of homes in more than 500 cases since February, according to city data, about double the number in all of the 20 months prior." (Zaveri, 2022)

In this project we will test the relationship between housing evictions and a variety of neighborhood demographics, including changes in fair market rents (FMR), through exploratory data analysis (EDA) using data selected from the <b>Eviction Lab</b>, <b>ACS</b>, and <b>Housing and Urban Development</b> (HUD) datasets. 

## Data <a id=3></a>

For this project, we use two main datasets (evictionlab.org and HUD) that both include zip code geolocators, which lets us both map and analyze the data.  To connect the geographic areas, we also need a "crosswalk" dataset from HUD.gov.  Additionally, we will use ACS data for demographic information within each area.

Our first task is to read in and inspect the <b>Eviction Lab</b> dataset.  The eviction dataset contains 8,729 observations and 7 variables.  

In [1]:
import pandas as pd         #quick stats       
import numpy as np      #numerical functions
import matplotlib.pyplot as plt    #visualization library
import seaborn as sns    #visualization and stats

In [4]:
%matplotlib inline

### Initial Data Transformation
In order to make these datasets usable, we need to do some initial data transformation...

In [5]:
#load Evictions dataset
path = 'https://evictionlab.org/uploads/newyork_monthly_2020_2021.csv'

df = pd.read_csv(path, header=0)            # read eviction data from online


In [6]:
df.shape                                # returns (# of rows/obs, # of columns/variables)
df.head

<bound method NDFrame.head of           type   GEOID racial_majority    month  filings_2020  filings_avg  \
0     Zip Code   10001           White  01/2020            51    55.000000   
1     Zip Code   10001           White  02/2020            23    49.333333   
2     Zip Code   10001           White  03/2020            20    48.333333   
3     Zip Code   10001           White  04/2020             0    41.666667   
4     Zip Code   10001           White  05/2020             0    42.000000   
...        ...     ...             ...      ...           ...          ...   
8724  Zip Code  sealed             NaN  01/2022             1     3.333333   
8725  Zip Code  sealed             NaN  02/2022             1    19.666667   
8726  Zip Code  sealed             NaN  03/2022             3     3.666667   
8727  Zip Code  sealed             NaN  04/2022             3     3.000000   
8728  Zip Code  sealed             NaN  05/2022             0     3.000000   

     last_updated  
0      2022-0

There are two main issues we see above.  First, the `type` variable is (according to the data dictionary for the Eviction Tracking System), "Either Census Tract or Zip Code, depending on the site. Unfortunately, address-level data is unavailable for some sites (Austin, New York City, Pittsburgh, Richmond) - in these cases, we list aggregate data based on zip code, as it is the smallest geographic grouping available." Not a big deal, but this means that mapping of this data is more generalized for New York City than it might be for other municipalities.<br>

Secondly, most `GEOID` values are zip codes, but a number of them are listed as "sealed".  According to the data dictionary for the Eviction Tracking System, "A modest portion of filings are reported to us with missing, incorrect, or out-of-bounds addresses. In these cases, we do not assign a Census Tract or Zip code to the case." The sealed cases are still included in the overall counts for a given site, but we cannot map them, so we convert them to null values (for mapping purposes).<br>

The `month` variable includes both month and year data, and `last_updated` is just the date the data was pulled from the site.  I will hold onto `month` and deselect the `last_updated` column.  Now let's check the null counts. After the conversion below, `GEOID` and `racial_majority` each have 29 null entries.  Hopefully this won't be a problem, since that number is relatively low.

In [8]:
#Convert 'GEOID' sealed entries into NaN
df['GEOID'] = df.GEOID.replace('sealed', np.nan)

df.isnull().sum()                           # returns the number of missing values for the df

type                0
GEOID              29
racial_majority    29
month               0
filings_2020        0
filings_avg         0
last_updated        0
dtype: int64

Next, I subset the variables from the Eviction dataset into a working dataframe labelled `df_evict`.  

In [12]:
#Working dataframe
#using loc to silence the SettingWithCopyWarning (source: Josh consultation)
df_evict = df.loc[:, ['GEOID',   # zip code (geographic key)
          'filings_2020', 'filings_avg',    # eviction filings (dv)
          'racial_majority', 'month',  # controls
        ]]

df_evict.info()    #confirm variables are saved into df_evict

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8729 entries, 0 to 8728
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   GEOID            8700 non-null   object 
 1   filings_2020     8729 non-null   int64  
 2   filings_avg      8729 non-null   float64
 3   racial_majority  8700 non-null   object 
 4   month            8729 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 341.1+ KB


In [13]:
#convert the GEOID variable to int64 
#used this page to learn how to force the conversion on NAN rows  https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int
df_evict['GEOID'] = np.floor(pd.to_numeric(df_evict['GEOID'],
                                           errors='coerce')).astype('Int64')
df_evict.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8729 entries, 0 to 8728
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   GEOID            8700 non-null   Int64  
 1   filings_2020     8729 non-null   int64  
 2   filings_avg      8729 non-null   float64
 3   racial_majority  8700 non-null   object 
 4   month            8729 non-null   object 
dtypes: Int64(1), float64(1), int64(1), object(2)
memory usage: 349.6+ KB


In [21]:
#filter out NaN rows on GEOID
filtered_df_evict = df_evict[df_evict['GEOID'].notnull()]
filtered_df_evict.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8700 entries, 0 to 8699
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   GEOID            8700 non-null   Int64  
 1   filings_2020     8700 non-null   int64  
 2   filings_avg      8700 non-null   float64
 3   racial_majority  8700 non-null   object 
 4   month            8700 non-null   object 
dtypes: Int64(1), float64(1), int64(1), object(2)
memory usage: 416.3+ KB


Next let's load the <b>FMR data</b> from HUD.  It includes the fair market rent by zip code for the years 2020-2022.  The data is split into one file per year, so the files need to be merged before use. Merging them also gives us the option of a time series analysis.

In [15]:
#load FMR data from local files...these will be merged and then exported as a single dataset for use
#path2= FMR 2022
#path2a= FMR 2021
#path2b= FMR 2020
path2 = '/Users/ameliaingram/Documents/My_GitHub+Repository/eviction-rent/data/fy2022_saFMR_revised.xlsx'

df2 = pd.read_excel(path2, sheet_name=0)          # read FMR 2022 data from xl file

In [16]:
df2.shape                                # returns (# of rows/obs, # of columns/variables)
df2.head

<bound method NDFrame.head of        ZIP\nCode     HUD Area Code  \
0          76437  METRO10180M10180   
1          76443  METRO10180M10180   
2          76464  METRO10180M10180   
3          76469  METRO10180M10180   
4          79501  METRO10180M10180   
...          ...               ...   
27317      23867  METRO51175N51175   
27318      23874  METRO51175N51175   
27319      23878  METRO51175N51175   
27320      23888  METRO51175N51175   
27321      23898  METRO51175N51175   

                    HUD Metro Fair Market Rent Area Name  SAFMR22\n2BR  \
0                                        Abilene, TX MSA           860   
1                                        Abilene, TX MSA           860   
2                                        Abilene, TX MSA           860   
3                                        Abilene, TX MSA           860   
4                                        Abilene, TX MSA           860   
...                                                  ...           ..

In [17]:
#load FMR 2021 data from local files...these will be merged and then exported as a single dataset for use
#path2a= FMR 2021
#path2b= FMR 2020
path2a = '/Users/ameliaingram/Documents/My_GitHub+Repository/eviction-rent/data/fy2021_saFMR_revised.xlsx'

df2a = pd.read_excel(path2a, sheet_name=0)          # read FMR 2021 data from xl file

df2a.shape                                # returns (# of rows/obs, # of columns/variables)
df2a.head

<bound method NDFrame.head of        ZIP\nCode     HUD Area Code HUD Metro Fair Market Rent Area Name  \
0          76437  METRO10180M10180                      Abilene, TX MSA   
1          76443  METRO10180M10180                      Abilene, TX MSA   
2          76464  METRO10180M10180                      Abilene, TX MSA   
3          76469  METRO10180M10180                      Abilene, TX MSA   
4          79501  METRO10180M10180                      Abilene, TX MSA   
...          ...               ...                                  ...   
27139      85356  METRO49740M49740                         Yuma, AZ MSA   
27140      85364  METRO49740M49740                         Yuma, AZ MSA   
27141      85365  METRO49740M49740                         Yuma, AZ MSA   
27142      85366  METRO49740M49740                         Yuma, AZ MSA   
27143      85367  METRO49740M49740                         Yuma, AZ MSA   

       2021 SAFMR\n2BR  SAFMR21\n2BR -\n90%\nPayment\nStandard  \
0  

In [38]:
#load FMR 2020 data from local files...these will be merged and then exported as a single dataset for use
#path2b= FMR 2020
path2b = '/Users/ameliaingram/Documents/My_GitHub+Repository/eviction-rent/data/fy2020_saFMR_rev.xlsx'

df2b = pd.read_excel(path2b, sheet_name=0)          # read FMR 2020 data from xl file

df2b.shape                                # returns (# of rows/obs, # of columns/variables)
df2b.head

<bound method NDFrame.head of         zcta         CBSASub20       Areaname20  SAFMR 2020 2br  \
0      76437  METRO10180M10180  Abilene, TX MSA             740   
1      76443  METRO10180M10180  Abilene, TX MSA             740   
2      76464  METRO10180M10180  Abilene, TX MSA             740   
3      76469  METRO10180M10180  Abilene, TX MSA             760   
4      79501  METRO10180M10180  Abilene, TX MSA             910   
...      ...               ...              ...             ...   
26085  85356  METRO49740M49740     Yuma, AZ MSA             780   
26086  85364  METRO49740M49740     Yuma, AZ MSA             860   
26087  85365  METRO49740M49740     Yuma, AZ MSA             940   
26088  85366  METRO49740M49740     Yuma, AZ MSA             820   
26089  85367  METRO49740M49740     Yuma, AZ MSA            1060   

       SAFMR20 2br 90pct pay_std  safmr20 2br 110pct pay_std  
0                            666                         814  
1                            666       

In [29]:
#merge the FMR 2022 data (df2) into the filtered eviction data using GEOID and ZIP code
df_evict_merge = filtered_df_evict.merge(df2, left_on = "GEOID", right_on = "ZIP\nCode")
#print to check results
print(df_evict_merge.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8961 entries, 0 to 8960
Data columns (total 11 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   GEOID                                 8961 non-null   Int64  
 1   filings_2020                          8961 non-null   int64  
 2   filings_avg                           8961 non-null   float64
 3   racial_majority                       8961 non-null   object 
 4   month                                 8961 non-null   object 
 5   ZIP
Code                              8961 non-null   int64  
 6   HUD Area Code                         8961 non-null   object 
 7   HUD Metro Fair Market Rent Area Name  8961 non-null   object 
 8   SAFMR22
2BR                           8961 non-null   int64  
 9   SAFMR22
2BR -
90%
Payment
Standard    8961 non-null   int64  
 10  SAFMR22
2BR -
110%
Payment
Standard   8961 non-null   int64  
dtypes: Int64(1), floa
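
The merged frame above has 8,961 rows, while the filtered eviction data had 8,700. One plausible explanation (an assumption, not something confirmed here) is that some ZIP codes appear in more than one row of the FMR file, so each matching eviction row is repeated. A minimal sketch of how this could be checked, using made-up toy frames and one possible resolution (keeping the first FMR row per ZIP):

```python
import pandas as pd

# Toy stand-ins for the eviction and FMR tables (all values are made up).
evict = pd.DataFrame({"GEOID": [10001, 10001, 10002],
                      "filings_2020": [51, 23, 7]})
fmr = pd.DataFrame({"ZIP": [10001, 10002, 10002],
                    "SAFMR22 2BR": [3000, 2500, 2400]})

# ZIP codes listed more than once in the right-hand table will duplicate
# every matching eviction row in an ordinary merge.
dup_zips = fmr.loc[fmr["ZIP"].duplicated(keep=False), "ZIP"].unique()
print("ZIP codes with several FMR rows:", dup_zips.tolist())

# One possible resolution (an assumption, not the authors' choice):
# keep the first FMR row per ZIP, then let pandas verify the relationship.
fmr_unique = fmr.drop_duplicates(subset="ZIP", keep="first")
merged = evict.merge(fmr_unique, left_on="GEOID", right_on="ZIP",
                     how="left", validate="many_to_one")
print(len(evict), len(merged))
```

With `validate="many_to_one"`, pandas raises an error if the right-hand keys are not unique, which makes silent row growth harder to miss.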

In [33]:
#rename columns here to enable the second part of the merge
df_evict_merge = df_evict_merge.rename(columns={'ZIP\nCode':'ZIP', 'SAFMR22\n2BR':'SAFMR22 2BR','HUD Metro Fair Market Rent Area Name':'HUD Metro FMR Area Name', 'SAFMR22\n2BR -\n90%\nPayment\nStandard':'SAFMR22 2BR 90% Pmt', 'SAFMR22\n2BR -\n110%\nPayment\nStandard': 'SAFMR22 2BR 110% Pmt'})

df_evict_merge.columns

Index(['GEOID', 'filings_2020', 'filings_avg', 'racial_majority', 'month',
       'ZIP', 'HUD Area Code', 'HUD Metro FMR Area Name', 'SAFMR22 2BR',
       'SAFMR22 2BR 90% Pmt', 'SAFMR22 2BR 110% Pmt'],
      dtype='object')

In [40]:
#rename columns here to enable the second part of the merge
#rename columns from df2a
df2a = df2a.rename(columns={'ZIP\nCode':'ZIP', 'HUD Metro Fair Market Rent Area Name':'HUD Metro FMR Area Name', '2021 SAFMR\n2BR':'SAFMR21 2BR', 'SAFMR21\n2BR -\n90%\nPayment\nStandard':'SAFMR21 2BR 90% Pmt',
       'SAFMR21\n2BR -\n110%\nPayment\nStandard':'SAFMR21 2BR 110% Pmt'})

df2a.columns

Index(['ZIP', 'HUD Area Code', 'HUD Metro FMR Area Name', 'SAFMR21 2BR',
       'SAFMR21 2BR 90% Pmt', 'SAFMR21 2BR 110% Pmt'],
      dtype='object')

In [41]:
#rename columns here to enable the second part of the merge
#rename columns from df2b
df2b = df2b.rename(columns={'zcta':'ZIP', 'Areaname20':'HUD Metro FMR Area Name20', 'SAFMR 2020 2br':'SAFMR20 2BR',
       'SAFMR20 2br 90pct pay_std':'SAFMR20 2BR 90% Pmt', 'safmr20 2br 110pct pay_std':'SAFMR20 2BR 110% Pmt'})

df2b.columns

Index(['ZIP', 'CBSASub20', 'HUD Metro FMR Area Name20', 'SAFMR20 2BR',
       'SAFMR20 2BR 90% Pmt', 'SAFMR20 2BR 110% Pmt'],
      dtype='object')

In [42]:
#merge the FMR 2021 data (df2a) into the merged dataset using ZIP
df_evict_merge2 = df_evict_merge.merge(df2a, on = "ZIP")
#print to check results
print(df_evict_merge2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9715 entries, 0 to 9714
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   GEOID                      9715 non-null   Int64  
 1   filings_2020               9715 non-null   int64  
 2   filings_avg                9715 non-null   float64
 3   racial_majority            9715 non-null   object 
 4   month                      9715 non-null   object 
 5   ZIP                        9715 non-null   int64  
 6   HUD Area Code_x            9715 non-null   object 
 7   HUD Metro FMR Area Name_x  9715 non-null   object 
 8   SAFMR22 2BR                9715 non-null   int64  
 9   SAFMR22 2BR 90% Pmt        9715 non-null   int64  
 10  SAFMR22 2BR 110% Pmt       9715 non-null   int64  
 11  HUD Area Code_y            9715 non-null   object 
 12  HUD Metro FMR Area Name_y  9715 non-null   object 
 13  SAFMR21 2BR                9715 non-null   int64

In [43]:
#merge the FMR 2020 data (df2b) into the merged dataset using ZIP
df_evict_merge3 = df_evict_merge2.merge(df2b, on = "ZIP")
#print to check results
print(df_evict_merge3.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9744 entries, 0 to 9743
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   GEOID                      9744 non-null   Int64  
 1   filings_2020               9744 non-null   int64  
 2   filings_avg                9744 non-null   float64
 3   racial_majority            9744 non-null   object 
 4   month                      9744 non-null   object 
 5   ZIP                        9744 non-null   int64  
 6   HUD Area Code_x            9744 non-null   object 
 7   HUD Metro FMR Area Name_x  9744 non-null   object 
 8   SAFMR22 2BR                9744 non-null   int64  
 9   SAFMR22 2BR 90% Pmt        9744 non-null   int64  
 10  SAFMR22 2BR 110% Pmt       9744 non-null   int64  
 11  HUD Area Code_y            9744 non-null   object 
 12  HUD Metro FMR Area Name_y  9744 non-null   object 
 13  SAFMR21 2BR                9744 non-null   int64

In [44]:
#export newly merged dataset to csv
# (run once)
df_evict_merge3.to_csv(r'/Users/ameliaingram/Documents/My_GitHub+Repository/eviction-rent/data/Evict_FMR_merged.csv', index = False)
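
For the time series option mentioned earlier, the three yearly rent columns may be easier to work with in long format (one row per ZIP and year). The sketch below is only an illustration on a made-up toy table. The column names follow the renamed columns above, while `fmr_long`, `source_col`, and `safmr_2br` are hypothetical names introduced here.

```python
import pandas as pd

# Toy wide table mimicking the merged rent columns (rent values are made up).
wide = pd.DataFrame({
    "ZIP": [10001, 10002],
    "SAFMR20 2BR": [2800, 2300],
    "SAFMR21 2BR": [2900, 2400],
    "SAFMR22 2BR": [3000, 2500],
})

# Melt the three yearly columns into (ZIP, source_col, rent) rows.
fmr_long = wide.melt(id_vars="ZIP", var_name="source_col", value_name="safmr_2br")

# Pull the two-digit year out of names such as 'SAFMR21 2BR'.
two_digit_year = fmr_long["source_col"].str.extract(r"SAFMR(\d{2})", expand=False)
fmr_long["year"] = 2000 + two_digit_year.astype(int)

fmr_long = (fmr_long.drop(columns="source_col")
                    .sort_values(["ZIP", "year"])
                    .reset_index(drop=True))
print(fmr_long)
```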

In order to combine a subset of the Eviction dataset with the FMR dataset, we need a <b>crosswalk file</b> (courtesy of HUD, originally from USPS) that allows us to match zip codes to FIPS codes.  The plan is to join the datasets in pandas using the `fips2010` column as the key. <br>
Now, let's load the crosswalk dataset to see what the files have in common...

In [10]:
#load crosswalk dataset
path3 = '/Users/ameliaingram/Documents/My_GitHub+Repository/eviction-rent/data/ZIP_CBSA_Crosswalk_122021.xlsx'

df3 = pd.read_excel(path3, sheet_name=0)

In [11]:
#to figure out the key column to merge...looks like the zip into df_evict, tract to merge the FMR
df3.info()
df3.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47484 entries, 0 to 47483
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   zip                  47484 non-null  int64  
 1   cbsa                 47484 non-null  int64  
 2   usps_zip_pref_city   47484 non-null  object 
 3   usps_zip_pref_state  47484 non-null  object 
 4   res_ratio            47484 non-null  float64
 5   bus_ratio            47484 non-null  float64
 6   oth_ratio            47484 non-null  float64
 7   tot_ratio            47484 non-null  float64
dtypes: float64(4), int64(2), object(2)
memory usage: 2.9+ MB


Unnamed: 0,zip,cbsa,usps_zip_pref_city,usps_zip_pref_state,res_ratio,bus_ratio,oth_ratio,tot_ratio
0,683,41900,SAN GERMAN,PR,0.999842,1.0,1.0,0.999855
1,683,32420,SAN GERMAN,PR,0.000158,0.0,0.0,0.000145
2,923,41980,SAN JUAN,PR,1.0,1.0,1.0,1.0
3,1010,44140,BRIMFIELD,MA,0.976896,1.0,1.0,0.977816
4,1010,49340,BRIMFIELD,MA,0.023104,0.0,0.0,0.022184


In [16]:
#test to see if fips2010 from df2 is the same as tract from df3 (crosswalk file)
print(df2.fips2010==72023830102)

0       False
1       False
2       False
3       False
4       False
        ...  
4760    False
4761    False
4762    False
4763    False
4764    False
Name: fips2010, Length: 4765, dtype: bool


In [17]:
#check datatypes in FMR data
df2.info(verbose=True)                  # info() prints its report directly and returns None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4765 entries, 0 to 4764
Data columns (total 130 columns):
 #    Column         Dtype  
---   ------         -----  
 0    fips2010       int64  
 1    fips2000       float64
 2    areaname22     object 
 3    name           object 
 4    msa22          object 
 5    fmr22_2        int64  
 6    fmr22          int64  
 7    msa21          object 
 8    fmr21_2        int64  
 9    fmr21          int64  
 10   msa20          object 
 11   fmr20_2        int64  
 12   fmr20          int64  
 13   msa19          object 
 14   fmr19_2        int64  
 15   fmr19          int64  
 16   msa18          object 
 17   fmr18_2        int64  
 18   fmr18          int64  
 19   msa17          object 
 20   fmr17_2        int64  
 21   fmr17          int64  
 22   msa16          object 
 23   fmr16_2        float64
 24   fmr16          float64
 25   msa15          object 
 26   fmr15_2        float64
 27   fmr15          float64
 28   msa14          o

In [18]:
#merge FMR (df2) into df_evict_crosswalk dataset using cbsa and fips2010 ***NOT WORKING: the result is empty because no cbsa value matches any fips2010 value
df_evict_fmr_merged = df_evict_crosswalk.merge(df2, left_on = "cbsa", right_on = "fips2010")
#print to check results
print(df_evict_fmr_merged.head())

Empty DataFrame
Columns: [GEOID, filings_2020, filings_avg, racial_majority, month, zip, cbsa, usps_zip_pref_city, usps_zip_pref_state, res_ratio, bus_ratio, oth_ratio, tot_ratio, fips2010, fips2000, areaname22, name, msa22, fmr22_2, fmr22, msa21, fmr21_2, fmr21, msa20, fmr20_2, fmr20, msa19, fmr19_2, fmr19, msa18, fmr18_2, fmr18, msa17, fmr17_2, fmr17, msa16, fmr16_2, fmr16, msa15, fmr15_2, fmr15, msa14, fmr14_2, fmr14, msa13, fmr13_2, fmr13, msa12, fmr12_2, fmr12, msa11, fmr11_2, fmr11, msa10, fmr10_2, fmr10, msa09, fmr09_2, fmr09, msa08, fmr08_2, fmr08, msa07, fmr07_2, fmr07, msa06, fmr06_2, fmr06, msa05, fmr05_2, fmr05, msa04, fmr04_2, fmr04, msa03, fmr03_2, fmr03, msa02, fmr02_2, fmr02, msa01, fmr01_2, fmr01, msa00, fmr00_2, fmr00, msa99, fmr99_2, fmr99, msa98, fmr98_2, fmr98, msa97, fmr97_2, fmr97, msa96, fmr96_2, fmr96, msa95, fmr95_2, ...]
Index: []

[0 rows x 143 columns]
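
An empty inner merge like the one above means the two key columns share no values. A small sketch of how key overlap could be checked before merging is shown below. The frames are toy data with made-up code values; only the column names `cbsa` and `fips2010` come from the notebook.

```python
import pandas as pd

# Toy frames; the values are invented and only illustrate non-overlapping keys.
left = pd.DataFrame({"cbsa": [35620, 41900]})
right = pd.DataFrame({"fips2010": [3600199999, 7202399999]})

# Count key values present in both tables before attempting the merge.
shared = set(left["cbsa"]) & set(right["fips2010"])
print(f"{len(shared)} shared key values")

# An outer merge with indicator=True labels each row as
# 'left_only', 'right_only', or 'both'.
check = left.merge(right, left_on="cbsa", right_on="fips2010",
                   how="outer", indicator=True)
print(check["_merge"].value_counts())
```

If the count of shared values is zero, the two columns most likely encode different kinds of geography, and a different key (or an intermediate crosswalk) is needed.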


^Working on code above ^
=============
The `df_evict` subset contains three object variables (`GEOID`, `racial_majority`, and `month`) and only two numeric variables (`filings_2020` and `filings_avg`).  We may need to investigate the `month` variable further to ensure it is encoded as a proper date type.

In [None]:
#initial peek under the hood of the df_evict set
df_evict.head()
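
One way the `month` strings could be turned into real dates is sketched below on a toy frame. The `MM/YYYY` format matches the sample rows shown earlier, while `month_dt` and `monthly_totals` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Toy rows in the same 'MM/YYYY' string format as the `month` column.
demo = pd.DataFrame({"month": ["01/2020", "02/2020", "01/2020"],
                     "filings_2020": [51, 23, 10]})

# Parse the strings into timestamps (the first day of each month).
demo["month_dt"] = pd.to_datetime(demo["month"], format="%m/%Y")

# Total filings per calendar month, in chronological order.
monthly_totals = demo.groupby("month_dt")["filings_2020"].sum().sort_index()
print(monthly_totals)
```

Sorting on a datetime column keeps the months in calendar order, which plain strings such as `'01/2021'` and `'02/2020'` would not.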

### Combine datasets

In [None]:
#load the crosswalk file
path3 = '/Users/ameliaingram/Documents/My_GitHub+Repository/eviction-rent/ZIP_TRACT_122021.xlsx'

df3 = pd.read_excel(path3, sheet_name=0)    # read crosswalk data from the first sheet of the xlsx file

print(df3.head())       

### Check for missing values
Before I do any initial analysis, I need to check for missing values in the dataset. In the code below, I found that the control `racial_majority` and the zip code variable `GEOID` each had 29 missing values (the "sealed" cases), and the rest are fine.  In the context of this large dataset, these amounts are small enough to continue using every variable.  

The remaining variables, including our backup control `month`, have no missing values.  These are good.  

In [None]:
df_evict.isnull().sum()                           # returns the number of missing values for the df
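
Raw null counts can be complemented by shares, which show how much of the data is affected. The sketch below uses a made-up toy frame; `missing_share` and `share_unmapped` are hypothetical names for this illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with one unmappable (missing GEOID) row; all values are made up.
demo = pd.DataFrame({"GEOID": [10001, np.nan, 10002, 10003],
                     "filings_2020": [51, 3, 7, 0]})

# Fraction of missing values per column.
missing_share = demo.isnull().mean()
print(missing_share)

# Share of all filings that sit in rows without a usable GEOID.
unmapped_filings = demo.loc[demo["GEOID"].isnull(), "filings_2020"].sum()
share_unmapped = unmapped_filings / demo["filings_2020"].sum()
print(f"{share_unmapped:.1%} of filings cannot be mapped")
```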

I will first rename `racial_majority` to `race` in order to ease analysis.  Then, I will also add a `county` column by assigning zip codes to counties and applying that assignment to the `GEOID` values, for readability in the analysis and plots.

In [None]:
#rename of race column here (cells that run after this one must refer to 'race', not 'racial_majority')
df_evict = df_evict.rename(columns={'racial_majority':'race'})

df_evict.columns
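
The county assignment described above is not implemented yet. A minimal sketch of one way it could work is shown below, assuming a zip-to-county lookup is available. `ZIP_TO_COUNTY` is a hypothetical, deliberately tiny dictionary holding one example zip code per borough; a complete table would have to come from an external source.

```python
import pandas as pd

# Hypothetical, deliberately tiny lookup (one example zip code per borough).
# A full zip-to-county table would need to come from an external source.
ZIP_TO_COUNTY = {
    10001: "New York",   # Manhattan
    10301: "Richmond",   # Staten Island
    10451: "Bronx",
    11101: "Queens",
    11201: "Kings",      # Brooklyn
}

demo = pd.DataFrame({"GEOID": [10001, 11201, 99999]})

# Zip codes missing from the lookup are left as NaN rather than guessed.
demo["county"] = demo["GEOID"].map(ZIP_TO_COUNTY)
print(demo)
```

Leaving unmatched zip codes as missing makes gaps in the lookup visible in a later `isnull().sum()` check.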

Now the variables are ready for an initial analysis.  

## Research Questions <a id=2></a>

For this project, we wish to investigate the following questions:

> <b>Q1:</b>  Is there a correlation between eviction rates and fair market rents in New York City? <br>
> <b>Q2:</b>  Is there a correlation between eviction rates and any other demographic features of New York City neighborhoods that might indicate a further association to gentrification? <br>


The main variables used to answer these questions are:
> <b>DV:</b>  eviction rates<br>
> <b>IV:</b>  average fair market rent (FMR) for a 2 bedroom<br>

In order to better understand the relationship between these issues, we will experiment with several demographic control variables, including:
> - race
> - zip code
> - county
> - length of time


## Assumptions <a id=2.1></a>
Recent evidence from several housing research centers predicts a strong increase in housing evictions in the New York City metropolitan area, despite the best efforts of both city and state agencies to prevent a massive rise in homelessness.  We assume that there could be a relationship between areas where eviction rates are high and increases in fair market rent (FMR), which could displace lower-income communities and lead to further gentrification (or else signal future gentrification). In this case, we assume that a rise in FMR indicates displacement of low-income populations from previously affordable subsidized housing. 

According to the U.S. Housing and Urban Development (HUD) Office of Policy Development and Research (see Summary), the FMR was implemented in 1974 to help low-income households find affordable housing. This is generally known as the Section 8 voucher system.  According to their Summary page, the FMR is defined as "the 40th percentile of gross rents for typical, non-substandard rental units occupied by recent movers in a local housing market." (Further information on this rate, and on problems in its calculation, is available on HUD's Summary page.)
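
To make the "40th percentile" idea concrete, here is a toy calculation on made-up rents. It only illustrates the percentile itself; HUD's actual procedure involves further steps described on its Summary page, and the variable names are hypothetical.

```python
import numpy as np

# Made-up gross rents for six recent-mover units in one local market.
rents = np.array([1000, 1500, 2000, 2500, 3000, 3500])

# 40 percent of the sampled rents fall at or below this value.
fmr_like = np.percentile(rents, 40)
print(fmr_like)
```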

It is assumed that race also plays a crucial role in both eviction rates and rental rates.  Government and local experts have historically suggested this as a trend.  Recently, a report was published by the U.S. Commission on Civil Rights which outlines the recent issues of racial discrimination in evictions (2022).

Finally, for the purposes of this research we assume that the housing market in New York City is fixed (this is a reasonable assumption given the pandemic slowdown of new building construction).
  

## Univariate Analysis of Main Variables <a id=3.1></a>

### Variable:  Evictions (filings_2020)
`filings_2020` is the dependent variable in this study.  The Eviction Lab data reports both `filings_2020`, the number of filings recorded for a zip code in a given month, and `filings_avg`, an average monthly figure for the same zip code.  We are exploring both versions in this project.

In [None]:
df_evict.filings_2020.describe()                          

According to the preliminary descriptive statistics, there were on average 17.075 eviction filings per zip code per month, with a minimum of zero and a maximum of 550.  The interquartile range ran from 0 to 15 filings, covering the middle 50% of observations.  

In [None]:
with plt.style.context('bmh'):      #temporary use of style sheet--source Matplotlib reference
    df_evict.filings_2020.plot(kind='hist', bins=50)
plt.title('Eviction Filings per Zip Code and Month (Eviction Lab, 2020-2022)')
plt.xlabel('# Filings')
plt.ylabel('Frequency')

After viewing the histogram, it is apparent that the vast majority of observations fall at the low end of the range, with a long, sparse tail of very high filing counts.  This leads to a heavily right-skewed plot.

In order to refine the evictions into a recognizable pattern, I will divide them into six categorical levels (0, 1-9, 10-29, 30-59, 60-99, and 100 or more). This gives more detailed attention to the extreme ranges of evictions and isolates those groups from the lower rates.

In [None]:
def evict_b(y):                                 
    '''
    INPUT: 
    y: int, from 0 to 550, the values of the int variable `filings_2020`
    
    OUTPUT:
    0 recoded to '0'
    1-9 recoded to '1-9'
    10-29 recoded to '10-29'
    30-59 recoded to '30-59'
    60-99 recoded to '60-99'
    100 and above recoded to '>100'
    anything else (negative or missing) recoded to NaN
    '''
    if y == 0:
        return '0'
    elif 0 < y < 10:
        return '1-9'
    elif 10 <= y < 30:
        return '10-29'
    elif 30 <= y < 60:
        return '30-59'
    elif 60 <= y < 100:
        return '60-99'
    elif y >= 100:
        return '>100'
    else:
        return np.nan                        # missing is coded as nan 

# apply the function to `filings_2020`

df_evict['filings_cat'] = df_evict.filings_2020.apply(evict_b)

In [None]:
# double check whether the transformation is successful:

df_evict[['filings_cat']]
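
As an alternative to a hand-written recoding function, `pd.cut` can produce the same groups as an ordered categorical. The sketch below is one possible formulation on toy counts, not the approach used in this notebook; the bin edges mirror the groups described above.

```python
import numpy as np
import pandas as pd

# Toy filing counts covering every group boundary.
counts = pd.Series([0, 5, 10, 29, 30, 99, 100, 550])

# Left-closed bins: [0,1), [1,10), [10,30), [30,60), [60,100), [100, inf)
bins = [0, 1, 10, 30, 60, 100, np.inf]
labels = ["0", "1-9", "10-29", "30-59", "60-99", ">100"]
cats = pd.cut(counts, bins=bins, labels=labels, right=False)
print(cats.value_counts(sort=False))
```

Because the result is an ordered categorical, group-wise tables and bar charts follow the bin order rather than alphabetical order of the labels.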

Now that we have grouped `filings_2020` into categories, let's see the resulting distribution. 

In [None]:
with plt.style.context('fast'):
    df_evict.groupby('filings_cat').size().plot(kind='bar')   #bar graph in order
plt.title('Observations by Eviction Filing Group (Eviction Lab 2020-2022)')
plt.xlabel('Eviction by Groups')
plt.ylabel('# Observations (zip code and month)')

## 4. Data Analysis <a id=4></a>


## 4.1 Bivariate Analysis<a id=4.1></a>

After inspecting each variable, I examined some simple bivariate distributions of the numerical variable (`filings_2020`) across different grouping variables (race, month, political party).

- ### Eviction Rates and Race <br>

In [None]:
df_evict.groupby('racial_majority')['filings_2020'].agg(['mean', 'median', 'max', 'min'])        # avg filings_2020 groupby race        

In order to better understand the distribution, it is also useful to visualize evictions and race in a boxplot.

In [None]:
#boxplot of evictions IQR and mean grouped by race
#Changing the outlier markers
red_circle = dict(markerfacecolor='red', marker='o')
df_evict.boxplot(column='filings_2020', by='racial_majority', vert=False, showmeans=True, flierprops=red_circle)    # by: grouping variable, column: plotted values
#plt.xscale('log')    #playing around with log on boxplot to see if it helps...it reduces outliers but makes the filing counts hard to read
plt.title('Eviction Filings by Race (Eviction Lab, 2020-2022)')
plt.xlabel('Evictions')
plt.ylabel('Race')
plt.suptitle('')

In [None]:
pd.crosstab(index=df_evict['racial_majority'],    #prop table/contingency table visualization
            columns=df_evict['filings_cat'],      #binned filings created by evict_b above
            normalize='index').plot(kind='barh', 
                                   figsize=(8, 6), alpha=1,
                                   stacked=False)
plt.title('Eviction Filing Groups % by Race (Eviction Lab, 2020-2022)')
plt.xlabel('Proportion of observations')
plt.ylabel('Race')
plt.suptitle('')

- ### Evictions and Month
Evictions are assumed to be a year-round activity.  Let us see how the dates confirm or deny this assumption.


In [None]:
pd.crosstab(index=df_evict['filings_cat'],    #prop table/contingency table
            columns=df_evict['month'],
            normalize='index')    # takes True, 'columns' (each col 100%), 'index' (each row 100%)

In [None]:
pd.crosstab(index=df_evict['filings_cat'],    #prop table/contingency table visualization
            columns=df_evict['month'],
            normalize='index').plot(kind='barh', 
                                   figsize=(8, 6), alpha=0.7,
                                   stacked=True)
plt.title('Evictions % by Month (Eviction Lab, 2020-2022)')
plt.xlabel('Proportion')
plt.ylabel('Filing group')
plt.suptitle('')

- ### Evictions and Political Party

The relationship between evictions and political party in American society is a contentious one.  With the support of the Republican party, police unions and the  military, a link is usually made between poverty and liberal politics.  Let us see how the data reflected these viewpoints.

In [None]:
pd.crosstab(values=df_evict['id'],    #prop table/contingency table
            index=df_evict['filings_2020'],
            columns=df_evict['party'],
            aggfunc='count',
            normalize='index')    # takes True, 'columns' (each col 100%), 'index' (each row 100%)

## 4.2 Multivariate Analysis <a id=4.2></a>

For the final portion of exploratory analysis, I will explore the strength of relationships between multiple variables.  

- <b> Evictions + County + Race</b><br>
First, I will look at the relationship of eviction filings grouped by county and race.  

In [None]:
import scipy.stats as stats               # a statistical analysis library

In [None]:
df_evict.groupby(['county', 'race'])['filings_2020'].agg(['mean', 'median', 'max', 'min'])         # filings_2020 summary stats grouped by county and race
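
`scipy.stats` is imported above but not yet used. One way it could serve Q1 is a rank correlation between filings and rent, sketched below on a made-up toy table. Spearman's rho is chosen here as an assumption because the filing counts are heavily skewed; the numbers carry no empirical meaning.

```python
import pandas as pd
from scipy import stats

# Toy zip-level table; all numbers are made up for illustration.
toy = pd.DataFrame({
    "filings_2020": [5, 12, 30, 44, 80],
    "SAFMR22 2BR": [3100, 2900, 2400, 2300, 2000],
})

# Spearman's rank correlation depends only on the ordering of the values,
# so a few extreme filing counts do not dominate the result.
rho, p_value = stats.spearmanr(toy["filings_2020"], toy["SAFMR22 2BR"])
print(f"rho = {rho:.2f}, p = {p_value:.3f}")
```

On the real data, the same call could be applied to zip-level totals once the eviction and FMR tables are merged without duplicated rows.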

## 6. References <a id=6></a>

### Programming References:
Matplotlib Style Sheets Reference.  https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html

Legend in Matplotlib.  https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html

Stats t-test in Scipy.  https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html


### Datasets:
Crosswalk Dataset of Zip to Tract. U.S. Department of Housing and Urban Development. Office of Policy Development and Research.  https://www.huduser.gov/portal/datasets/usps_crosswalk.html#data

"Fair Market Rents: 40th Percentile." U.S. Department of Housing and Urban Development. Office of Policy Development and Research. Datasets.  https://www.huduser.gov/portal/datasets/fmr.html#2022_data

Peter Hepburn, Renee Louis, and Matthew Desmond. Eviction Tracking System: Version 1.0. Princeton: Princeton University, 2020. www.evictionlab.org.

### General Reference
"Summary: Fair Market Rents." U.S. Department of Housing and Urban Development. Office of Policy Development and Research. Blog.  https://www.huduser.gov/periodicals/ushmc/winter98/summary-2.html 

U.S. Commission on Civil Rights. Racial Discrimination and Eviction Policies and Enforcement in New York. 10 Mar 2022.  https://www.usccr.gov/reports/2022/racial-discrimination-and-eviction-policies-and-enforcement-new-york

Zaveri, Mihir.  After a Two-Year Dip, Evictions Accelerate in New York. The New York Times. 2 May 2022. https://www.nytimes.com/2022/05/02/nyregion/new-york-evictions-cases.html


<div class = "alert alert-info">

[Back to top](#7)<br>
    
</div>
<hr>