## Data Profiling and Bias Analysis

This notebook uses the [Pandas Profiling library](https://github.com/pandas-profiling/pandas-profiling) to create documentation summarizing the columns in the data. For each column, the output shows the number of null values and the distribution of values (or the value frequencies for categorical data). This lets us understand the data quickly and spot outliers or unexpected values.

The second part of this notebook explores our missing values to check whether data missing for particular boroughs creates a risk of bias.

We import the required libraries.

In [1]:
import pandas as pd
import numpy as np

In [2]:
from pandas_profiling import ProfileReport

To view all column names, we remove the limit on the number of columns displayed on screen.

In [3]:
pd.set_option('display.max_columns', None)

### Reading in Processed Data

We read in the processed NYC data.

In [4]:
df_nyc = pd.read_csv("processed/nyc_processed_2017_to_2019_20210225.csv")
df_nyc.head()

Unnamed: 0,census_tract_GEOID,total-households,total-renter-occupied-households,total-owner-occupied-households,total-owner-occupied-households-mortgage,median-gross-rent,median-household-income,median-property-value,median-monthly-housing-cost,pct-white,pct-af-am,pct-hispanic,pct-am-indian,pct-asian,pct-nh-pi,pct-multiple,pct-other,pct-below-poverty-level,households-children,single-parent-household,older-adult-alone,level-of-education,immigrant-status,english-fluency,drive-to-work,public-transport-to-work,vacant-properties,live-in-mobile-home,pct-renter-occupied,pct-owner-occupied,pct-owner-occupied-mortgage,pct-owner-occupied-without-mortgage,median-house-age,pct-non-white,pct-without-health-insurance,total-evictions,avg-evictions,total-evictions-2017,eviction-filings-2017,eviction-rate-2017,total-evictions-2018,eviction-filings-2018,eviction-rate-2018,total-evictions-2019,eviction-filings-2019,eviction-rate-2019,overall-city-eviction-rate,avg-eviction-rate,ratio-to-mean-eviction-rate,county_GEOID,county,state
0,36085013204,1790,422,1368,798,1411,84866,493400,2396,95.4,0.0,9.6,0.0,1.3,0.0,0.6,2.7,-888888888,382,27,260,3443,244,1172,1307,672,115,0,23.575419,76.424581,44.581006,31.843575,55,4.6,2.817209,9.0,3.0,2.0,3.555556,0.5625,6.0,10.666667,0.5625,1.0,1.777778,0.5625,0.844547,0.7109,0.841753,36085,Staten Island,New York
1,36085013800,2369,441,1928,1030,1185,82361,561100,2404,98.8,0.5,6.2,0.0,0.0,0.0,0.7,0.0,-888888888,585,67,313,4553,117,1388,1581,681,92,0,18.61545,81.38455,43.478261,37.90629,55,1.2,0.773281,2.0,2.0,0.0,0.0,0.0,2.0,6.0,0.333333,0.0,0.0,0.0,0.844547,0.453515,0.536992,36085,Staten Island,New York
2,36085014700,1341,241,1100,707,1220,84310,463800,2406,94.9,2.2,4.5,0.0,3.0,0.0,0.0,0.0,-888888888,366,30,155,2492,59,442,1023,435,163,0,17.971663,82.028337,52.721849,29.306488,63,5.1,4.076246,4.0,2.0,3.0,5.0,0.6,0.0,0.0,0.0,1.0,1.666667,0.6,0.844547,0.829876,0.982628,36085,Staten Island,New York
3,36085019700,712,99,613,366,1384,98167,449800,2297,95.4,1.0,8.3,0.0,1.3,0.0,0.0,2.4,-888888888,219,18,74,1283,49,158,622,162,18,0,13.904494,86.095506,51.404494,34.691011,82,4.6,3.532009,1.0,1.0,1.0,1.5,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.844547,1.010101,1.196027,36085,Staten Island,New York
4,36085020804,1988,179,1809,1270,1175,95417,602200,2820,92.6,0.0,3.4,0.0,4.5,0.0,1.0,2.0,-888888888,535,29,147,4142,79,1027,1754,556,67,8,9.004024,90.995976,63.8833,27.112676,42,7.4,2.940674,6.0,2.0,2.0,4.4,0.454545,3.0,6.6,0.454545,1.0,2.2,0.454545,0.844547,1.117318,1.322979,36085,Staten Island,New York


As well as the Hillsborough data.

In [5]:
df_hills = pd.read_csv("processed/hillsborough_fl_processed_2017_to_2019_20210225.csv")
df_hills.head()

Unnamed: 0,census_tract_GEOID,total-households,total-renter-occupied-households,total-owner-occupied-households,total-owner-occupied-households-mortgage,median-gross-rent,median-household-income,median-property-value,median-monthly-housing-cost,pct-white,pct-af-am,pct-hispanic,pct-am-indian,pct-asian,pct-nh-pi,pct-multiple,pct-other,pct-below-poverty-level,households-children,single-parent-household,older-adult-alone,level-of-education,immigrant-status,english-fluency,drive-to-work,public-transport-to-work,vacant-properties,live-in-mobile-home,pct-renter-occupied,pct-owner-occupied,pct-owner-occupied-mortgage,pct-owner-occupied-without-mortgage,median-house-age,pct-non-white,pct-without-health-insurance,total-evictions,avg-evictions,total-foreclosure-sales,avg-foreclosure-sales,total-lien-foreclosures,avg-lien-foreclosures,total-evictions-2017,eviction-filings-2017,eviction-rate-2017,total-evictions-2018,eviction-filings-2018,eviction-rate-2018,total-evictions-2019,eviction-filings-2019,eviction-rate-2019,foreclosure-sales-2017,foreclosure-sales-2018,foreclosure-sales-2019,lien-foreclosures-2017,lien-foreclosures-2018,lien-foreclosures-2019,avg-foreclosure-rate,foreclosure-rate-2017,foreclosure-rate-2018,foreclosure-rate-2019,avg-lien-foreclosure-rate,lien-foreclosure-rate-2017,lien-foreclosure-rate-2018,lien-foreclosure-rate-2019,avg-eviction-rate,ratio-to-mean-foreclosure-rate,ratio-to-mean-eviction-rate,avg-housing-loss-rate,evictions-pct-total-housing-loss,housing-loss-index,county_GEOID,county,state
0,12057010103,1454,283,1171,672,831,63611,153500,1226,92.1,1.6,11.8,0.0,2.0,0.0,0.0,4.3,-888888888,383,57,92,3033,104,523,1711,0,95,801,19.463549,80.536451,46.217331,34.31912,31,7.9,11.810261,17.181818,5.727273,,,,,5.090909,8.0,0.636364,5.090909,8.0,0.636364,7.0,11.0,0.636364,,,,,,,,,,,,,,,2.023771,,0.855968,,,,12057,Hillsborough County,Florida
1,12057011006,1861,244,1617,1072,1349,66815,194200,1423,93.0,1.4,8.9,0.0,1.0,0.0,3.4,1.2,-888888888,569,60,222,3598,65,622,2227,7,173,197,13.111231,86.888769,57.603439,29.28533,37,7.0,9.018718,18.36,6.12,27.0,9.0,6.0,2.0,7.02,13.0,0.54,7.02,13.0,0.54,4.32,8.0,0.54,10.0,7.0,10.0,1.0,3.0,2.0,0.839552,0.932836,0.652985,0.932836,0.123686,0.061843,0.185529,0.123686,2.508197,1.30756,1.060859,1.148936,0.404762,0.746079,12057,Hillsborough County,Florida
2,12057011108,681,41,640,262,497,49821,146000,1379,91.0,1.9,16.4,0.0,3.4,0.0,0.0,3.7,-888888888,79,14,205,1208,143,371,570,0,76,357,6.020558,93.979442,38.472834,55.506608,44,9.0,14.095372,4.8,1.6,3.0,1.5,,,1.2,3.0,0.4,1.6,4.0,0.4,2.0,5.0,0.4,0.0,2.0,1.0,,,,0.572519,0.0,0.763359,0.381679,,,,,3.902439,0.891669,1.650563,1.023102,0.516129,0.664367,12057,Hillsborough County,Florida
3,12057011203,1403,552,851,578,967,72716,235400,1484,81.3,10.3,21.7,0.0,0.6,2.9,1.3,3.6,-888888888,392,102,76,2683,267,808,1579,0,230,0,39.344262,60.655738,41.197434,19.458304,44,18.7,10.713294,25.253968,8.417989,3.0,1.5,8.0,4.0,6.460317,11.0,0.587302,5.873016,10.0,0.587302,12.920635,22.0,0.587302,2.0,0.0,1.0,5.0,3.0,0.0,0.259516,0.346021,0.0,0.17301,0.470035,0.587544,0.352526,0.0,1.524998,0.404182,0.645008,0.877698,0.84876,0.569947,12057,Hillsborough County,Florida
4,12057011206,1263,676,587,343,750,33329,132500,1168,78.9,15.5,16.2,0.0,1.6,0.0,1.7,2.3,-888888888,205,70,190,2057,282,744,823,90,148,361,53.523357,46.476643,27.157561,19.319082,46,21.1,21.570122,50.076923,16.692308,6.0,2.0,3.0,1.0,14.307692,30.0,0.476923,14.307692,30.0,0.476923,21.461538,45.0,0.476923,3.0,2.0,1.0,1.0,1.0,1.0,0.58309,0.874636,0.58309,0.291545,0.170358,0.170358,0.170358,0.170358,2.469276,0.908134,1.044397,1.834378,0.893004,1.191181,12057,Hillsborough County,Florida


### Creating Profiles

We create the profiles by passing our data and specifying the title of the document. Pandas profiling can provide the following:

- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
- File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information

We are disabling expensive computations (such as correlations and dynamic binning) by passing `minimal=True`.

In [6]:
profile = ProfileReport(df_nyc, title='NYC Processed Data', minimal=True)

We save our profile as an HTML file.

In [None]:
profile.to_file("NYC Processed Data.html")

We repeat for Hillsborough.

In [8]:
profile = ProfileReport(df_hills, title='Hillsborough Processed Data', minimal=True)

In [None]:
profile.to_file("Hillsborough Processed Data.html")

### Further Exploration

From the profiles, we can see that there are a few rows with null values and many placeholder values such as '666666666', '-666666666' and '-999999999'.
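One way to make these placeholder values visible to the usual null checks is to convert them to `NaN` before analysis. The sketch below is an illustration rather than part of the original pipeline: the toy frame and the `PLACEHOLDERS` list are assumptions, with the list combining the codes named above and `-888888888`, which appears in the `pct-below-poverty-level` column of the previews.

```python
import pandas as pd

# Assumed placeholder codes: those named in the text, plus -888888888,
# which shows up in `pct-below-poverty-level` in the data previews.
PLACEHOLDERS = [666666666, -666666666, -888888888, -999999999]

# Toy frame standing in for df_nyc (values are illustrative only).
toy = pd.DataFrame({
    "county": ["Queens", "Bronx", "Queens"],
    "median-house-age": [55.0, -666666666.0, 63.0],
    "pct-below-poverty-level": [-888888888.0, 12.5, -888888888.0],
})

# Only numeric columns can hold the placeholder codes.
numeric_cols = toy.select_dtypes(include="number").columns

# Replace every placeholder code with NaN, leaving other values untouched.
cleaned = toy.copy()
cleaned[numeric_cols] = toy[numeric_cols].mask(toy[numeric_cols].isin(PLACEHOLDERS))

# Number of placeholder values found in each numeric column.
placeholder_counts = cleaned[numeric_cols].isna().sum() - toy[numeric_cols].isna().sum()
print(placeholder_counts)
```

After this step, `isna()` treats placeholders and true nulls alike, so a single check covers both kinds of missing data.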

Focusing on NYC, we can quickly check whether any rows include null values and how these are distributed across the boroughs. According to [census.gov](https://www.census.gov/quickfacts/fact/table/newyorkcitynewyork,bronxcountybronxboroughnewyork,kingscountybrooklynboroughnewyork,newyorkcountymanhattanboroughnewyork,queenscountyqueensboroughnewyork,richmondcountystatenislandboroughnewyork/HSD410219#HSD410219), the proportion of households in each borough is as follows:
- Bronx: 16% (503,829)
- Brooklyn: 30% (958,567)
- Manhattan: 24% (759,460)
- Queens: 25% (778,932)
- Staten Island: 5% (166,246)

In [11]:
df_nyc[df_nyc.isna().any(axis=1)].groupby(['county']).count()[['census_tract_GEOID']]

county,census_tract_GEOID
Bronx,9
Brooklyn,27
Manhattan,14
Queens,47
Staten Island,4


Queens accounts for a disproportionate share of the rows containing null values (47% compared to its household percentage of 25%).
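The comparison can be reproduced with a few lines of arithmetic. This sketch uses the household counts quoted from census.gov and the per-borough counts from the output above. The column names and the `representation_ratio` measure are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Household counts per borough, as quoted from census.gov above.
households = pd.Series({
    "Bronx": 503_829, "Brooklyn": 958_567, "Manhattan": 759_460,
    "Queens": 778_932, "Staten Island": 166_246,
})

# Rows with at least one null value per borough, from the output above.
null_rows = pd.Series({
    "Bronx": 9, "Brooklyn": 27, "Manhattan": 14, "Queens": 47, "Staten Island": 4,
})

comparison = pd.DataFrame({
    "household_share_pct": 100 * households / households.sum(),
    "null_share_pct": 100 * null_rows / null_rows.sum(),
})

# A ratio above 1 means the borough is over-represented among affected rows.
comparison["representation_ratio"] = (
    comparison["null_share_pct"] / comparison["household_share_pct"]
)
print(comparison.round(1))
```

The same pattern applies to the other count tables in this section by swapping in their values for `null_rows`.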

Similarly, we can look at some of these placeholder values.

In [12]:
df_nyc[df_nyc['median-house-age'] > 666666666].groupby(['county']).count()[['census_tract_GEOID']]

county,census_tract_GEOID
Bronx,6
Brooklyn,13
Manhattan,9
Queens,25
Staten Island,3


In [13]:
df_nyc[df_nyc['pct-non-white'] > 666666666].groupby(['county']).count()[['census_tract_GEOID']]

county,census_tract_GEOID
Bronx,5
Brooklyn,11
Manhattan,8
Queens,21
Staten Island,2


Queens has a disproportionate number of placeholder values for both `median-house-age` and `pct-non-white` (45% compared to its household percentage of 25%).

Many rows also contain negative numbers (the negative placeholder values).

In [14]:
cols = list(df_nyc.columns)[:-2]

In [15]:
df_nyc[(df_nyc[cols] < 0).any(axis=1)].groupby(['county']).count()[['census_tract_GEOID']]

county,census_tract_GEOID
Bronx,339
Brooklyn,761
Manhattan,288
Queens,669
Staten Island,110


Brooklyn and Queens both have slightly elevated proportions for negative number placeholders (35% and 31% compared to their household percentages of 30% and 25% respectively). Meanwhile, Manhattan is substantially lower (13% compared to its household percentage of 24%).
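The check above picks its columns with a positional slice, which relies on `county` and `state` being the last two columns. As a hedged alternative, the sketch below selects numeric columns by dtype instead. The toy frame is an assumption standing in for `df_nyc`, and `.size()` is used in place of `.count()` so that rows are counted regardless of nulls in any one column.

```python
import pandas as pd

# Toy frame standing in for df_nyc (values are illustrative only).
toy = pd.DataFrame({
    "census_tract_GEOID": [36085013204, 36081000100, 36061000200],
    "pct-below-poverty-level": [-888888888.0, 10.2, 8.1],
    "median-house-age": [55.0, -666666666.0, 63.0],
    "county": ["Staten Island", "Queens", "Manhattan"],
    "state": ["New York", "New York", "New York"],
})

# Select numeric columns by dtype rather than by position.
numeric_cols = toy.select_dtypes(include="number").columns

# Flag rows where any numeric column holds a negative value.
has_negative = (toy[numeric_cols] < 0).any(axis=1)

# Count the flagged rows per county.
negative_rows_by_county = toy.loc[has_negative].groupby("county").size()
print(negative_rows_by_county)
```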

### Raw NYC Evictions Data

We read in the raw NYC evictions data and view the first few rows.

In [9]:
df_raw_nyc = pd.read_csv("raw/nyc_evictions_geocoded.csv")
df_raw_nyc.head()

Unnamed: 0,COURT_INDEX_NUMBER,DOCKET_NUMBER,EVICTION_ADDRESS,EVICTION_APT_NUM,EXECUTED_DATE,MARSHAL_FIRST_NAME,MARSHAL_LAST_NAME,RESIDENTIAL_COMMERCIAL_IND,BOROUGH,EVICTION_ZIP,Address CLEANED,STATE,input_address,match_indicator,match_type,matched_address,lon_lat,tiger_line_id,side,state_code,county_code,tract_code,block_code
0,11371/16,33621,454 EAST 105TH ST,05C,12/12/18,Bruce,Kemp,Residential,MANHATTAN,10029,454 EAST 105TH ST,NY,"454 EAST 105TH ST, MANHATTAN, NY, 10029",Match,Non_Exact,"454 E 105TH ST, NEW YORK, NY, 10029","-73.93865,40.788143",640857964.0,R,36.0,61.0,16200.0,4002.0
1,N77151/18,111600,601 WEST 189TH ST,4G,4/3/19,Darlene,Barone,Residential,MANHATTAN,10040,601 WEST 189TH ST,NY,"601 WEST 189TH ST, MANHATTAN, NY, 10040",Match,Non_Exact,"601 W 189TH ST, NEW YORK, NY, 10040","-73.930176,40.85436",59659038.0,R,36.0,61.0,27900.0,3000.0
2,68097/19,18347,2607 AVENUE O,5-B,8/26/19,George,"Essock, Jr.",Residential,BROOKLYN,11210,2607 AVENUE O,NY,"2607 AVENUE O, BROOKLYN, NY, 11210",Match,Exact,"2607 AVE O, BROOKLYN, NY, 11210","-73.94823,40.61415",59090174.0,L,36.0,47.0,75600.0,2003.0
3,64883/18,25398,726 WILLOUGHBY AVE,BASEMENT,7/18/18,Gary,Rose,Residential,BROOKLYN,11206,726 WILLOUGHBY AVE,NY,"726 WILLOUGHBY AVE, BROOKLYN, NY, 11206",Match,Exact,"726 WILLOUGHBY AVE, BROOKLYN, NY, 11206","-73.93993,40.69493",59079108.0,R,36.0,47.0,28300.0,3000.0
4,95253/18,94512,945 SARATOGA AVENUE,1,9/4/19,Henry,Daley,Residential,BROOKLYN,11212,945 SARATOGA AVENUE,NY,"945 SARATOGA AVENUE, BROOKLYN, NY, 11212",Match,Exact,"945 SARATOGA AVE, BROOKLYN, NY, 11212","-73.91456,40.65798",59084034.0,L,36.0,47.0,89600.0,3001.0


We could create another Pandas profiling document. Instead, we take a quick look at the descriptive statistics using `describe()`.

In [10]:
df_raw_nyc.describe()

Unnamed: 0,DOCKET_NUMBER,EVICTION_ZIP,tiger_line_id,state_code,county_code,tract_code,block_code
count,60788.0,60788.0,55459.0,55459.0,55459.0,55459.0,55459.0
mean,156367.932404,10801.659193,164121400.0,36.0,40.736346,39279.303522,2274.01511
std,149694.482603,511.222055,214606600.0,0.0,29.667331,31409.17927,1309.307262
min,103.0,0.0,59074650.0,36.0,5.0,100.0,1000.0
25%,61667.75,10456.0,59098860.0,36.0,5.0,18101.0,1003.0
50%,91827.0,10473.0,59869330.0,36.0,47.0,29300.0,2001.0
75%,293285.25,11229.0,80303100.0,36.0,61.0,47500.0,3001.0
max,496987.0,12221.0,653464200.0,36.0,85.0,162100.0,9001.0


Note that 5,329 of the 60,788 rows have null values for `tract_code` (~9%) and therefore will not be counted in our processed data. We wish to understand whether this reflects the household distribution or whether there is potential for bias in our processed data.
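The figures in this note follow from the `count` row of the `describe()` output. The sketch below shows the arithmetic, then the equivalent calculation made directly on a frame. The toy frame is an assumption standing in for `df_raw_nyc`, and the total row count is taken to equal the non-null `DOCKET_NUMBER` count.

```python
import pandas as pd

# Counts taken from the describe() output above.
total_rows = 60_788        # non-null DOCKET_NUMBER count, assumed to be all rows
rows_with_tract = 55_459   # non-null tract_code count

missing_rows = total_rows - rows_with_tract
missing_pct = 100 * missing_rows / total_rows
print(missing_rows, round(missing_pct, 1))

# The same figures computed directly from a frame (toy data, illustrative only).
toy = pd.DataFrame({"tract_code": [16200.0, None, 75600.0, None]})
toy_missing = toy["tract_code"].isna().sum()
toy_missing_pct = 100 * toy["tract_code"].isna().mean()
print(toy_missing, toy_missing_pct)
```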

In [12]:
df_raw_nyc[df_raw_nyc.tract_code.isna()].groupby(['BOROUGH']).count()[['COURT_INDEX_NUMBER']]

BOROUGH,COURT_INDEX_NUMBER
BRONX,898
BROOKLYN,893
MANHATTAN,1056
QUEENS,2385
STATEN ISLAND,97


Once again, Queens addresses are disproportionately impacted by the missing `tract_code` values (45% compared to its household percentage of 25%). Proportionally, Brooklyn is the least impacted (accounting for only 17% of the missing values whilst making up 30% of households).

Further exploration is required to understand whether the address format is incorrect for the API or whether the API is incomplete and another should be used.