<h1> Project: Housing Evictions and Fair Market Rents in New York City</h1> <a id=7></a>
<h3> Memo Draft</h3>


## Table of Contents 

<div class = "alert alert-info">

1. [Introduction](#1)<br>
2. [Research Question](#2)<br>
    2.1 [Assumptions](#2.1)<br>
3. [Data](#3)<br>
    3.1 [Univariate Analysis](#3.1)<br>
    3.2 [Mapping Analysis](#3.2)<br>
4. [Data Analysis](#4)<br>
    4.1 [Bivariate Analysis](#4.1)<br>
    4.2 [Multivariate Analysis](#4.2)<br>
5. [Summary Conclusions](#5)<br>
6. [References](#6)<br>
    
</div>
<hr>

## Introduction <a id=1></a>
In the wake of the COVID-19 pandemic, New Yorkers have feared a rise in housing evictions across the city as landlords attempt to recover their losses and forcibly evict tenants.  Now that housing courts have reopened and the eviction deferment period has long since passed, evictions are rising across the city.  According to a New York Times article, the housing chaos began in spring 2022:<br> 
> "The roughly 2,000 eviction cases filed by landlords every week since March are roughly 40 percent more than the number filed in mid-January [2022], after the state’s eviction moratorium expired. Tenants have been thrown out of homes in more than 500 cases since February, according to city data, about double the number in all of the 20 months prior." (Zaveri, 2022)

In this project we will attempt to test the relationship of housing evictions to a variety of neighborhood demographics including changes in the fair market housing rates (FMR) through exploratory data analysis (EDA) using data selected from the <b>Eviction Lab</b>, <b>ACS</b> and the <b>Housing and Urban Development</b> (HUD) datasets. 

## Data <a id=3></a>

For this project, we use data from two datasets (evictionlab.org and HUD) that both include zip code geolocators, which lets us both map and analyze the data.  To connect their geographic areas, we needed a "crosswalk" dataset from HUD.gov.  Additionally, we will use ACS data for demographic information within each area.

Our first task is to read in and inspect the merged <b>Eviction Lab</b> and <b>FMR</b> dataset.  In the eviction dataset, there are 8428 unique observations and 7 variables, and the merged dataset loaded below has 8729 observations across 15 columns (see the `df.info()` output).  

In [1]:
import pandas as pd         #quick stats       
import numpy as np      #numerical functions
import matplotlib.pyplot as plt    #visualization library
import pyarrow as pa
import pyarrow.parquet as pq
import seaborn as sns    #visualization and stats

In [2]:
%matplotlib inline

In [3]:
#load merged dataset--parquet
path = '/Users/ameliaingram/Documents/My_GitHub+Repository/eviction-rent/assets/data/evict_merged.parquet'

df = pd.read_parquet(path)            # read the merged eviction/FMR data from the local parquet file

In [4]:
df.info()                               # returns # rows/obs, columns/variables and types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8729 entries, 0 to 8728
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   city             8729 non-null   category      
 1   type             8729 non-null   category      
 2   geoid            8700 non-null   float64       
 3   racial_majority  8700 non-null   category      
 4   month            8729 non-null   datetime64[ns]
 5   filings_2020     8729 non-null   int64         
 6   filings_avg      8729 non-null   float64       
 7   fmr_2br          2861 non-null   float64       
 8   fmr_2br_90       2861 non-null   float64       
 9   fmr_2br_110      2861 non-null   float64       
 10  borough          5133 non-null   category      
 11  post_office      5133 non-null   category      
 12  neighborhood     5133 non-null   category      
 13  population       5133 non-null   float64       
 14  density          5133 non-null   float64

Based on the listed info, there are 8729 observations in this dataset, of which only 8700 include `geoid` (zip code) data.  

There are two main issues we see above. First, `GEOID` values are essentially zip codes, but a number of cases are listed as "sealed" instead.  According to the data dictionary from the Eviction Lab website, "A modest portion of filings are reported to us with missing, incorrect, or out-of-bounds addresses. In these cases, we do not assign a Census Tract or Zip code to the case." The 29 sealed cases were excluded for the purposes of merging and mapping.  <b>NOTE:</b>  If we wish to use the sealed sets, we can load the `f_evict` dataset.<br>

Second, the `HUD Metro Area Name` changed for certain zip codes from year to year.  We kept those columns in the data for reference, although it is not yet clear whether they matter for the analysis.

The `month` variable includes both month and year data; the `df.info()` output above shows it is stored as `datetime64[ns]`, the pandas datetime type.  In the Evictions dataset, converting the 'sealed' values in the `GEOID` column to NaN leaves 896 null entries.  Hopefully this won't be a problem, since the number is relatively low.

### Check for missing values
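Before any initial analysis, we need to check for missing values in each dataset. In the merged dataset, `geoid` and `racial_majority` each have 29 missing values, the three `fmr_2br` columns have 5868, and the five location columns (`borough` through `density`) have 3596.  In the Evictions dataset, the controls `GEOID` and `racial_majority` each have 896 missing values, and the rest are fine.  The missing identifiers are few enough that we can continue to use everything, though the FMR and location gaps in the merged dataset limit which rows can enter the rent analysis. 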

In [5]:
df.isnull().sum()        # confirm the number of NaN values for the df

city                  0
type                  0
geoid                29
racial_majority      29
month                 0
filings_2020          0
filings_avg           0
fmr_2br            5868
fmr_2br_90         5868
fmr_2br_110        5868
borough            3596
post_office        3596
neighborhood       3596
population         3596
density            3596
dtype: int64
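
Raw counts are easier to judge as shares of the dataset. For the merged data above, 5868 of 8729 rows is about 67%, 3596 is about 41%, and 29 is about 0.3%. The following is a minimal sketch of that calculation on a made-up toy frame (the values and the name `toy` are assumptions for illustration, not the project data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data (values are made up)
toy = pd.DataFrame({
    "geoid": [10001.0, np.nan, 10003.0, 10004.0],
    "filings_2020": [3, 0, 12, 7],
    "fmr_2br": [np.nan, np.nan, 2100.0, np.nan],
})

# isnull().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share.sort_values(ascending=False))
```

Running the same two lines on `df` would rank the real columns by how incomplete they are.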

In [20]:
#Convert 'GEOID' sealed entries into NaN
df2['GEOID'] = df2.GEOID.replace('sealed', np.nan)

df2.isnull().sum()        # returns the number of NaN values for the df

city                 0
type                 0
GEOID              896
racial_majority    896
month                0
filings_2020         0
filings_avg          0
last_updated         0
dtype: int64
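
As an alternative to replacing the 'sealed' label by hand, `pd.to_numeric` with `errors="coerce"` turns any non-numeric entry into NaN in one step and leaves a numeric column behind. This is only a sketch on a made-up series (`raw_geoid` is a hypothetical name), not the method used in this notebook:

```python
import pandas as pd

# Toy zip code column with one 'sealed' entry (values are made up)
raw_geoid = pd.Series(["10001", "sealed", "11201", "10451"])

# errors="coerce" converts anything that is not a number into NaN
geoid_numeric = pd.to_numeric(raw_geoid, errors="coerce")

n_missing = int(geoid_numeric.isna().sum())
print(geoid_numeric)
print("missing:", n_missing)
```

The trade-off is that any other unexpected label would also become NaN without warning, so it is worth counting the missing values before and after.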

I will first rename `racial_majority` to `race` to ease analysis.  Then, I will add `counties` to the dataset by assigning zip codes to counties and applying that assignment as a function to the `GEOID` values, for readability in the analysis and plots.

In [21]:
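#rename of race column in Eviction dataset here
#NOTE: the key 'race_majority' does not match the actual column name ('racial_majority'),
#so rename() silently leaves the columns unchanged, as the output below shows
df2 = df2.rename(columns={'race_majority':'Race'})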

df2.columns

Index(['city', 'type', 'GEOID', 'racial_majority', 'month', 'filings_2020',
       'filings_avg', 'last_updated'],
      dtype='object')
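
The two preparation steps described above can be sketched as follows. Passing `errors="raise"` makes `rename` fail loudly when the old column name is misspelled, and a small lookup assigns a county from the first three digits of the zip code. The toy frame, the `PREFIX_TO_COUNTY` table and the `zip_to_county` helper are all hypothetical; the prefix list is an approximation that should be checked against the HUD crosswalk before use:

```python
import pandas as pd

# Toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "GEOID": ["10001", "11201", "10451", "11354", "10301"],
    "racial_majority": ["White", "Black", "Latinx", "Other", "White"],
})

# errors="raise" raises a KeyError if the old name is not an existing column
toy = toy.rename(columns={"racial_majority": "race"}, errors="raise")

# Hypothetical, approximate prefix-to-county lookup (an assumption)
PREFIX_TO_COUNTY = {
    "100": "New York", "101": "New York", "102": "New York",
    "103": "Richmond",
    "104": "Bronx",
    "112": "Kings",
    "111": "Queens", "113": "Queens", "114": "Queens", "116": "Queens",
}

def zip_to_county(zip_code):
    """Return the county for a 5-digit zip string, or None if unknown."""
    if pd.isna(zip_code):
        return None
    return PREFIX_TO_COUNTY.get(str(zip_code)[:3])

toy["county"] = toy["GEOID"].map(zip_to_county)
print(toy)
```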

Now the variables are ready to perform an initial exploratory analysis.  

## Research Questions <a id=2></a>

For this project, we wish to investigate the following questions:

> <b>Q1:</b>  Is there a correlation between eviction rates and any other demographic features of New York City neighborhoods that might indicate a further association to gentrification? <br>
> <b>Q2:</b>  Is there a correlation between eviction rates and fair housing rents in New York City? <br>


The variables from the evictionlab.org and HUD datasets which will be used to answer these questions are:
> <b>DV:</b>  eviction rates<br>
> <b>DV2:</b>  average fair market rent (FMR) for a 2 bedroom<br>

Eviction rates are defined as the number of evictions registered by the housing court within a given zip code.  These may not include forcible evictions that happen in unofficial living arrangements (such as "cash only" housing in illegal sublets, frequented by poor and undocumented populations). Note that forced evictions are illegal in New York City, regardless of the type of housing arrangement; for more information, see the Tenants Bill of Rights under the NYC Comptroller's Office.

In order to better understand the relationship between these issues, we will experiment with several demographic control variables, including:
> <b>IV:</b> race <br>
> <b>IV:</b> zip code and/or neighborhood <br>
> <b>IV:</b> county <br>
> <b>IV:</b> length of time <br>

Others may be included as the project unfolds.

## Assumptions <a id=2.1></a>
Recent evidence from several housing research centers predicts a strong increase in housing evictions in the New York City metropolitan area, despite the best efforts of both city and state agencies to prevent a massive rise in homelessness.  We assume that there could be some evidence of a relationship between areas where eviction rates are high and increases in fair market rent (FMR) that could potentially displace lower income communities and lead to further gentrification (or else be signs of future gentrification). In this case, we assume that a rise in FMR will indicate displacement of low income populations from previously affordable subsidized housing. 

According to the U.S. Housing and Urban Development (HUD) Office of Policy Development and Research (see Summary), the FMR was implemented in 1974 to help low-income households find affordable housing. This is generally known as the Section 8 voucher system.  The Summary page defines the FMR as "the 40th percentile of gross rents for typical, non-substandard rental units occupied by recent movers in a local housing market." (Further information on problems in the calculation of this rate is discussed on HUD's Summary page.)
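
To make the 40th-percentile definition concrete, here is a small worked sketch on made-up gross rents. The rent values are invented, and reading the `fmr_2br_90` and `fmr_2br_110` columns as 90% and 110% of the FMR is an assumption based on their names, not something the source confirms:

```python
import pandas as pd

# Made-up gross rents for two-bedroom units in one hypothetical market
rents = pd.Series([1500, 1700, 1800, 2000, 2200, 2500,
                   3000, 3200, 3500, 4000, 4500])

# The 40th percentile: 40% of the sampled rents fall at or below this value
fmr = rents.quantile(0.40)

# Assumed meaning of the _90 and _110 columns: 90% and 110% of the FMR
fmr_90 = 0.90 * fmr
fmr_110 = 1.10 * fmr

print(fmr, fmr_90, fmr_110)
```

Because the FMR sits below the median, a majority of units in the sample rent for more than the FMR.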

It is assumed that race also plays a crucial role in both eviction rates and rental rates.  Government and local experts have historically suggested this as a trend.  Recently, a report was published by the U.S. Commission on Civil Rights which outlines the recent issues of racial discrimination in evictions (2022).

Finally, for the purposes of this research we assume that the housing market in New York City is fixed (this is a reasonable assumption given the pandemic slowdown of new building construction).
  

## Descriptive Analysis <a id=3.1></a>

### 1. Variable:  Evictions (filings_2020)
`filings_2020` is the dependent variable in this study.  The Eviction Lab data reports both `filings_2020`, which is a cumulative number since 2020, and `filings_avg`, which is the average per month.  We are exploring both versions in this project.

In [None]:
df2.filings_2020.describe()                          

According to the preliminary descriptive statistics, evictions were on average 17.075 per zipcode, with a minimum of zero and a maximum of 550.  The interquartile range varied from 0 to 15 for the middle 50% of zipcodes.  

In [None]:
with plt.style.context('bmh'):      #temporary use of style sheet--source Matplotlib reference
    df2.filings_2020.plot(kind='hist', bins=50)      #histogram of filings per observation
plt.title('Distribution of Eviction Filings (Eviction Lab, 2020-2022)')
plt.xlabel('# Evictions')
plt.ylabel('# Observations')

After viewing the histogram, it is apparent that the vast majority of observations have few evictions, while a small number reach far higher counts (up to the maximum of 550), with gaps in between.  This leads to a heavily right-skewed plot.
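
The direction of the skew can also be checked numerically. In the sketch below, run on made-up counts rather than the project data, a positive `skew()` and a mean above the median both point to a long right tail:

```python
import pandas as pd

# Made-up filing counts: mostly small values plus a few very large ones
toy_filings = pd.Series([0, 0, 0, 1, 2, 3, 5, 8, 15, 40, 120, 550])

skewness = toy_filings.skew()       # sample skewness; positive means right-skewed
mean_val = toy_filings.mean()
median_val = toy_filings.median()

print(f"skew={skewness:.2f}, mean={mean_val:.1f}, median={median_val:.1f}")
```

The same three calls on `df2.filings_2020` would confirm or contradict the reading of the histogram.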

To refine the evictions into a recognizable pattern, I will divide them into six categorical levels (0, 1-9, 10-29, 30-59, 60-99, and 100 or more). This gives more detailed attention to the extreme ranges of evictions and isolates these groups from the lower rates.

In [None]:
def evict_b(y):                                 
    '''
    Recode an eviction count into a categorical level.

    INPUT: 
    y: int, from 0 to 550, the inputs of the int variable `filings_2020`
    
    OUTPUT:
    0 recoded to '0'
    1-9 recoded to '1-9'
    10-29 recoded to '10-29'
    30-59 recoded to '30-59'
    60-99 recoded to '60-99'
    100 and above recoded to '>100'
    anything else (negative or missing) recoded to NaN
    '''
    if y == 0:
        return '0'
    elif 0 < y < 10:
        return '1-9'
    elif 10 <= y < 30:
        return '10-29'
    elif 30 <= y < 60:
        return '30-59'
    elif 60 <= y < 100:
        return '60-99'
    elif y >= 100:
        return '>100'
    else:
        return np.nan                        # negative or missing is coded as nan 

# apply the function to `filings_2020`

df2['filings_cat'] = df2.filings_2020.apply(evict_b)
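
The same recoding can be done without a hand-written function by using `pd.cut`. This is a sketch of an alternative, not the approach used above; the toy series is made up, and the bin edges are chosen to reproduce the categories of `evict_b`:

```python
import numpy as np
import pandas as pd

# Made-up counts that sit on the category boundaries
toy_filings = pd.Series([0, 1, 9, 10, 29, 30, 59, 60, 99, 100, 550])

# right=False makes each bin closed on the left: [0, 1), [1, 10), ...
bins = [0, 1, 10, 30, 60, 100, np.inf]
labels = ['0', '1-9', '10-29', '30-59', '60-99', '>100']
filings_cat = pd.cut(toy_filings, bins=bins, labels=labels, right=False)

print(filings_cat.tolist())
```

One advantage of this design is that `pd.cut` returns an ordered categorical, so later tables and bar plots follow the bin order and do not depend on how the label strings happen to sort.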

In [None]:
# double check whether the transformation is successful:

df2[['filings_cat']]

Now that we have grouped `filings_2020` into categories, let's see the resulting distribution. 

In [None]:
with plt.style.context('fast'):
    df2.groupby('filings_cat').size().plot(kind='bar')   #bar graph of the number of observations in each group
plt.title('Eviction Rates by Groups (Eviction Lab 2020-2022)')
plt.xlabel('Eviction by Groups')
plt.ylabel('# Observations')

## 4. Data Analysis <a id=4></a>


## 4.1 Bivariate Analysis<a id=4.1></a>

After inspecting each variable, I examined simple bivariate distributions of the numerical variable (`filings_2020`) across different grouping variables (race, month, political party).

- ### Eviction Rates and Race <br>

In [None]:
df2.groupby('racial_majority')['filings_2020'].agg(['mean', 'median', 'max', 'min'])        # avg filings_2020 groupby race        

In order to better understand the distribution, it is also useful to visualize evictions and race in a boxplot.

In [None]:
#boxplot of evictions IQR and mean grouped by race
#Changing the outlier markers
red_circle = dict(markerfacecolor='red', marker='o')
df2.boxplot(column='filings_2020', by='racial_majority', vert=False, showmeans=True, flierprops=red_circle)    # by: grouping variable (y axis, since vert=False), column: values (x axis)
#plt.xscale('log')    #a log scale compresses the outliers but makes the eviction counts hard to read
plt.title('Eviction Filings by Race (Eviction Lab, 2020-2022)')
plt.xlabel('Evictions')
plt.ylabel('Race')
plt.suptitle('')

In [None]:
pd.crosstab(index=df2['racial_majority'],    #prop table/contingency table visualization
            columns=df2['filings_cat'],       #grouped filings; crosstab counts rows by default
            normalize='index').plot(kind='barh', 
                                   figsize=(8, 6), alpha=1,
                                   stacked=False)
plt.title('Evictions % by Race (Eviction Lab, 2020-2022)')
plt.xlabel('% Evictions')
plt.ylabel('Race')
plt.suptitle('')

- ### Evictions and Month
Evictions are assumed to be a year-round activity.  Let us see how the dates confirm or deny this assumption.


In [None]:
pd.crosstab(index=df2['filings_cat'],    #prop table/contingency table; crosstab counts rows by default
            columns=df2['month'],
            normalize='index')    # takes True or 'all' (whole table 100%), 'columns' (each col 100%), 'index' (each row 100%)
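
To show what `normalize='index'` does in these tables, here is a minimal sketch on a made-up toy frame (the values are invented). Each row of the result sums to 1, so every cell is the share of that row's observations falling in the given column:

```python
import pandas as pd

# Toy data (made up): one row per observation
toy = pd.DataFrame({
    "racial_majority": ["Black", "Black", "White", "White", "White", "Latinx"],
    "filings_cat": ["0", "1-9", "0", "0", "1-9", "1-9"],
})

# With no values/aggfunc, crosstab counts rows; normalize="index" divides
# each count by its row total
row_shares = pd.crosstab(index=toy["racial_majority"],
                         columns=toy["filings_cat"],
                         normalize="index")

print(row_shares)
print(row_shares.sum(axis=1))   # every row total is 1.0
```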

In [None]:
pd.crosstab(index=df2['filings_cat'],    #prop table/contingency table visualization
            columns=df2['month'],
            normalize='index').plot(kind='barh', 
                                   figsize=(8, 6), alpha=0.7,
                                   stacked=True)    #default color cycle, since there are more than two months
plt.title('Evictions % by Month (Eviction Lab, 2020-2022)')
plt.xlabel('%')
plt.ylabel('Eviction group')
plt.suptitle('')
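
A more direct check of the year-round assumption is to total the filings per month. The sketch below uses made-up dates and counts; it groups once by calendar month (pooling all years) and once by year-month period:

```python
import pandas as pd

# Toy data (made up): filings recorded against the first day of each month
toy = pd.DataFrame({
    "month": pd.to_datetime(["2020-01-01", "2020-02-01", "2021-01-01",
                             "2021-02-01", "2021-03-01"]),
    "filings_2020": [10, 4, 6, 2, 8],
})

# Pool all years together: 1 = January, 2 = February, ...
by_calendar_month = toy.groupby(toy["month"].dt.month)["filings_2020"].sum()

# Keep years apart: one total per year-month period
by_period = toy.groupby(toy["month"].dt.to_period("M"))["filings_2020"].sum()

print(by_calendar_month)
print(by_period)
```

The pooled view hints at seasonality, while the year-month view shows whether a pattern is driven by a single unusual year.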

- ### Evictions and Political Party

The relationship between evictions and political party in American society is a contentious one.  A link is often drawn between poverty and liberal politics, a view associated with the Republican party, police unions and the military.  Let us see how the data reflect these viewpoints.

In [None]:
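pd.crosstab(index=df2['filings_cat'],    #prop table/contingency table; crosstab counts rows by default
            columns=df2['party'],         #assumes a 'party' column has been added to df2; it is not among the columns listed earlier
            normalize='index')    # takes True or 'all' (whole table 100%), 'columns' (each col 100%), 'index' (each row 100%)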

## 4.2 Multivariate Analysis <a id=4.2></a>

For the final portion of exploratory analysis, I will explore the strength of relationships between multiple variables.  

- <b> Evictions + County + Race</b><br>
First, I will look at the relationship of eviction filings grouped by county and race.  

In [None]:
import scipy.stats as stats               # a statistical analysis library

In [None]:
df.groupby(['county', 'race'])['filings_2020'].agg(['mean', 'median', 'max', 'min'])         # filings_2020 summary grouped by county and race (assumes both columns were added as described earlier)
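
The second research question asks about the relationship between filings and fair market rent. A first pass could look like the sketch below, which drops rows with a missing FMR and then computes both a Pearson and a rank-based (Spearman-style) correlation. The toy values are made up, and correlating ranks is just one simple choice, not a method prescribed by the sources:

```python
import numpy as np
import pandas as pd

# Toy data (made up): FMR is missing for some observations
toy = pd.DataFrame({
    "filings_2020": [0, 3, 12, 7, 25, 40],
    "fmr_2br": [1800.0, np.nan, 2100.0, 1950.0, 2300.0, np.nan],
})

# Keep only the rows where both variables are present
complete = toy.dropna(subset=["filings_2020", "fmr_2br"])

# Pearson measures linear association
pearson_r = complete["filings_2020"].corr(complete["fmr_2br"])

# Correlating the ranks gives a Spearman-style measure of monotonic association
spearman_r = complete["filings_2020"].rank().corr(complete["fmr_2br"].rank())

print(len(complete), round(pearson_r, 3), round(spearman_r, 3))
```

Because the filings are heavily skewed, the rank-based figure is less sensitive to the handful of very large counts than the Pearson figure.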

## 6. References <a id=6></a>

### Programming References:
Matplotlib Style Sheets Reference.  https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html

Legend in Matplotlib.  https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html

Stats t-test in Scipy.  https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html


### Datasets:
Crosswalk Dataset of Zip to Tract. U.S. Department of Housing and Urban Development. Office of Policy Development and Research.  https://www.huduser.gov/portal/datasets/usps_crosswalk.html#data

"Fair Market Rents: 40th Percentile." U.S. Department of Housing and Urban Development. Office of Policy Development and Research. Datasets.  https://www.huduser.gov/portal/datasets/fmr.html#2022_data

Peter Hepburn, Renee Louis, and Matthew Desmond. Eviction Tracking System: Version 1.0. Princeton: Princeton University, 2020. www.evictionlab.org.

### General Reference
"Summary: Fair Market Rents." U.S. Department of Housing and Urban Development. Office of Policy Development and Research. Blog.  https://www.huduser.gov/periodicals/ushmc/winter98/summary-2.html 

U.S. Commission on Civil Rights. Racial Discrimination and Eviction Policies and Enforcement in New York. 10 Mar 2022.  https://www.usccr.gov/reports/2022/racial-discrimination-and-eviction-policies-and-enforcement-new-york

Zaveri, Mihir.  After a Two-Year Dip, Evictions Accelerate in New York. The New York Times. 2 May 2022. https://www.nytimes.com/2022/05/02/nyregion/new-york-evictions-cases.html


<div class = "alert alert-info">

[Back to top](#7)<br>
    
</div>
<hr>