# Philly Evictions Analysis
## Environment Setup

In [1]:
import pandas as pd
import numpy as np
import pipeline

pipeline.notebook.set_up()

## Load Data
Our evictions data has already been augmented with data from the American
Community Survey (ACS) and from Philadelphia's open data portal. We load the
final merged dataset.

In [2]:
df = pd.read_csv('data/final_merged_df.csv')
df.head()

Unnamed: 0,GEOID,year_evictions,eviction-filings,evictions,low-flag,imputed,subbed,evictions_t-1,evictions_t-2,evictions_t-5,...,units,occupied_units,vacant_units,for_rent_units,num_af_am_alone,num_hisp,black_alone_owner_occupied,num_with_high_school_degree,num_with_ged,num_unemployed
0,421010001001,2010,16.0,13.0,0,0,1,8.0,10.0,9.0,...,220.0,189.0,94.0,0.0,0.0,0.0,0.0,,,
1,421010001001,2011,17.0,7.0,0,0,1,13.0,8.0,3.0,...,1173.0,934.0,728.0,38.0,76.0,94.0,14.0,,,
2,421010001001,2012,14.0,7.0,0,0,1,7.0,13.0,13.0,...,1347.0,1024.0,790.0,124.0,104.0,111.0,12.0,,,0.0
3,421010001001,2013,15.0,9.0,0,0,1,7.0,7.0,10.0,...,1481.0,1094.0,826.0,139.0,88.0,122.0,12.0,52.0,0.0,0.0
4,421010001001,2014,13.0,4.0,0,0,1,9.0,7.0,8.0,...,1489.0,1126.0,823.0,155.0,89.0,57.0,12.0,53.0,0.0,0.0


## Split Data By Year
We have data for 2009 to 2016. We want to split this data into training set /
test set pairs using a temporal cross-validation approach.

In [3]:
splits = pipeline.split_by_year(df, colname='year_evictions')
pipeline.split_boundaries(splits, colname='year_evictions')

Unnamed: 0,train_start,train_end,test_start,test_end
0,2010,2010,2011,2011
1,2010,2011,2012,2012
2,2010,2012,2013,2013
3,2010,2013,2014,2014
4,2010,2014,2015,2015
5,2010,2015,2016,2016
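
The boundaries above follow an expanding-window pattern: each split trains on every year from 2010 up to a cutoff and tests on the year immediately after. As a rough illustration only, the sketch below reproduces that pattern on a toy frame. `expanding_year_splits` is a hypothetical helper written for this example; it is not the implementation of `pipeline.split_by_year`, which may differ.

```python
import pandas as pd


def expanding_year_splits(df, colname):
    """Yield (train_df, test_df) pairs with an expanding training window.

    Hypothetical helper: the training set holds every year up to a cutoff,
    and the test set holds the single year that follows it.
    """
    years = sorted(df[colname].unique())
    for cutoff, test_year in zip(years[:-1], years[1:]):
        train_df = df[df[colname] <= cutoff]
        test_df = df[df[colname] == test_year]
        yield train_df, test_df


# Toy data: one row per year, 2010 through 2016.
toy = pd.DataFrame({"year_evictions": range(2010, 2017), "evictions": range(7)})
toy_splits = list(expanding_year_splits(toy, "year_evictions"))

boundaries = [
    (int(train["year_evictions"].min()), int(train["year_evictions"].max()),
     int(test["year_evictions"].min()), int(test["year_evictions"].max()))
    for train, test in toy_splits
]
print(boundaries[0])   # (2010, 2010, 2011, 2011)
print(boundaries[-1])  # (2010, 2015, 2016, 2016)
```

Because the test year is always later than every training year, no split lets the model see the future.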


## Data Cleaning
We want to clean each of our training set / test set pairs. We use a function
called `clean_split()` that cleans both sets at once, making sure to clean the
test data using the same bins and categories applied to the training data.

In [4]:
%psource pipeline.clean_split

def clean_split(split):
    train_df, test_df = split
    train_df = clean_overall_data(train_df)
    test_df = clean_overall_data(test_df)

    features_generator = get_feature_generators(train_df)
    train_df, test_df = \
        clean_and_create_features(train_df, test_df, features_generator)

    return train_df, test_df



In [5]:
cleaned_splits = [pipeline.clean_split(split) for split in splits]

  return np.nanmean(a, axis, out=out, keepdims=keepdims)
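
The key point in `clean_split()` is that anything learned from the data, such as bin edges, comes from the training set only and is then reused on the test set. The sketch below shows that idea with `pd.cut` on toy values. `fit_bins` and `apply_bins` are hypothetical names for this example and are not part of the `pipeline` module, whose feature generators may work differently.

```python
import numpy as np
import pandas as pd


def fit_bins(train_values, n_bins=4):
    """Learn quantile bin edges from the training data only."""
    edges = np.quantile(train_values, np.linspace(0, 1, n_bins + 1))
    edges = np.unique(edges)                 # drop duplicate edges
    edges[0], edges[-1] = -np.inf, np.inf    # catch unseen test values
    return edges


def apply_bins(values, edges):
    """Bin values with previously learned edges; returns integer bin codes."""
    return pd.cut(values, bins=edges, labels=False)


train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
test = pd.Series([0.5, 4.0, 100.0])  # includes values outside the training range

edges = fit_bins(train)
train_binned = apply_bins(train, edges)
test_binned = apply_bins(test, edges)
print(test_binned.tolist())  # [0, 1, 3]
```

Opening the outermost edges to infinity is one design choice among several; it keeps test values outside the training range from turning into missing bins.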


## Data Labeling
We plan to use both regression-based models and binary classifiers. The
binary classifiers need a binary label for each row of our data.

Our binary label separates block groups into two classes: "high" and "low"
eviction rate block groups. The "high" eviction rate block groups are those
that we believe should be prioritized for intervention.

Any block group with more than 14 evictions is labeled "high". Roughly 16% of
Philadelphia block groups fall into this class. We picked this lower bound
because we know that Philadelphia can afford to target about 16% of block
groups for intervention.

In [6]:
labeled_splits = [pipeline.label(split, lower_bound=14, drop_column=True)
                  for split in cleaned_splits]
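
To make the labeling rule concrete, here is a minimal sketch of how a count threshold turns into a binary label. `label_high_eviction` is a hypothetical function, and its strict `>` comparison follows the prose above ("more than 14"); the actual `pipeline.label` may treat the boundary differently.

```python
import pandas as pd


def label_high_eviction(df, lower_bound=14, count_col="evictions"):
    """Return a copy of df with a 0/1 'label' column (hypothetical helper)."""
    out = df.copy()
    out["label"] = (out[count_col] > lower_bound).astype(int)
    return out


# Toy eviction counts, not real data.
toy = pd.DataFrame({"evictions": [3.0, 14.0, 15.0, 40.0, 0.0]})
labeled = label_high_eviction(toy)
print(labeled["label"].tolist())  # [0, 0, 1, 1, 0]
print(labeled["label"].mean())    # share of "high" rows: 0.4
```

Taking the mean of the label column is a quick way to check what share of rows ends up in the "high" class.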

## Model Generation
### Binary Classifiers
Our binary classifiers are given by the following list:

In [7]:
pipeline.clfs

{'LR': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='warn', n_jobs=None, penalty='l2',
                    random_state=1234, solver='liblinear', tol=0.0001, verbose=0,
                    warm_start=False),
 'KNN': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                      metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                      weights='uniform'),
 'DT': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                        max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort=False,
                        random_state=1234, splitter='best'),
 'SVM': SVC(C=1.0, cache_size=200, class_w

We plan to run a grid search using the following hyperparameters.

In [8]:
pipeline.clf_small_grid

{'LR': {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1]},
 'KNN': {'n_neighbors': [5, 10],
  'weights': ['uniform', 'distance'],
  'algorithm': ['auto', 'ball_tree', 'kd_tree']},
 'DT': {'criterion': ['gini', 'entropy'],
  'max_depth': [5, 50],
  'max_features': [None],
  'min_samples_split': [5, 10]},
 'SVM': {'C': [0.01, 0.1]},
 'RF': {'n_estimators': [100, 1000],
  'max_depth': [5, 50],
  'max_features': ['sqrt', 'log2'],
  'min_samples_split': [5, 10]},
 'GB': {'n_estimators': [100, 1000],
  'learning_rate': [0.01, 0.05],
  'subsample': [0.1, 0.5],
  'max_depth': [5, 10]},
 'AB': {'algorithm': ['SAMME', 'SAMME.R'], 'n_estimators': [100, 1000]},
 'NB': {},
 'ET': {'n_estimators': [100, 1000],
  'criterion': ['gini', 'entropy'],
  'max_depth': [5, 10],
  'max_features': ['sqrt', 'log2'],
  'min_samples_split': [5, 10]},
 'BC': {'n_estimators': [100, 1000]}}
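
A grid search fits one model per combination of hyperparameter values, so the cost of each grid is the product of its list lengths. The sketch below counts the combinations for three of the grids above using only the standard library. `expand_grid` is a hypothetical helper, not a `pipeline` function.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of values in param_grid as a list of dicts."""
    keys = list(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[k] for k in keys))]


# Three entries copied from the grid above.
small_grid = {
    "LR": {"penalty": ["l1", "l2"], "C": [0.01, 0.1]},
    "RF": {"n_estimators": [100, 1000], "max_depth": [5, 50],
           "max_features": ["sqrt", "log2"], "min_samples_split": [5, 10]},
    "NB": {},
}

for name, param_grid in small_grid.items():
    print(name, len(expand_grid(param_grid)))
# LR 4
# RF 16
# NB 1
```

An empty grid such as `'NB': {}` still yields one combination: the model with its default settings.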

We also want to evaluate our models at the following thresholds:

In [9]:
thresholds = [10]
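
One common reading of a threshold like `10` is "flag the top 10% of rows by predicted score and evaluate on that group". That interpretation is an assumption here: the notebook does not show how `run_clf_loop` uses `thresholds`. Under that assumption, precision at a percentage threshold could be sketched as follows, with `precision_at_top_percent` as a hypothetical helper and toy scores.

```python
import numpy as np


def precision_at_top_percent(y_true, scores, percent):
    """Precision among the `percent`% of rows with the highest scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_flagged = max(1, int(len(scores) * percent / 100))
    # Stable sort on negated scores: highest first, ties keep input order.
    top_idx = np.argsort(-scores, kind="stable")[:n_flagged]
    return y_true[top_idx].mean()


# Toy labels and scores, not real model output.
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
print(precision_at_top_percent(y_true, scores, 10))  # 1.0
print(precision_at_top_percent(y_true, scores, 20))  # 0.5
```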

We run our models for each of our splits.

In [10]:
models = pipeline.clfs
grid = pipeline.clf_small_grid

results_df = pd.DataFrame(columns=[
    'split',
    'classifier',
    'parameters',
    'threshold',
] + pipeline.evaluate.ClassifierEvaluator.metric_names())

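for i, (train_df, test_df) in enumerate(labeled_splits, start=1):
    results = pipeline.run_clf_loop(
        test_df, train_df, models, grid, 'label', thresholds
    )
    # Assign the row in place; DataFrame.append returns a new frame and would
    # leave results_df empty. Assumes `results` is a flat list of values for
    # the remaining columns.
    results_df.loc[len(results_df)] = [i] + results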

results_df.head()

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
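
This `ValueError` is the message scikit-learn raises when an input array contains missing or infinite values. A quick way to locate the offending columns is to count non-finite values per column before fitting. The sketch below is a diagnostic aid on toy data; `non_finite_counts` is a hypothetical helper and not part of `pipeline`.

```python
import numpy as np
import pandas as pd


def non_finite_counts(df):
    """Count NaN and +/-inf values in each numeric column, worst first."""
    numeric = df.select_dtypes(include=[np.number])
    values = numeric.to_numpy(dtype=float)
    counts = pd.Series((~np.isfinite(values)).sum(axis=0), index=numeric.columns)
    return counts[counts > 0].sort_values(ascending=False)


# Toy frame; column names borrowed from the dataset, values made up.
toy = pd.DataFrame({
    "evictions_t-1": [8.0, 13.0, 7.0],
    "num_with_ged": [np.nan, np.nan, 0.0],
    "num_unemployed": [1.0, 0.0, np.inf],
})
bad_columns = non_finite_counts(toy)
print(bad_columns)
```

Running a check like this on each cleaned training and test frame would show which features still need imputation or dropping.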

### Regression Models
Our regression models are given by the following list:

In [None]:
pipeline.regs

We plan to run a grid search using the following hyperparameters:

In [None]:
pipeline.reg_small_grid

We run our models for each of our splits.

In [None]:
models = pipeline.regs
grid = pipeline.reg_small_grid

results_df = pd.DataFrame(columns=[
    'split',
    'classifier',
    'parameters',
] + pipeline.evaluate.RegressionEvaluator.metric_names())

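for i, (train_df, test_df) in enumerate(cleaned_splits, start=1):
    results = pipeline.run_reg_loop(test_df, train_df, models, grid, 'evictions')
    # Assign the row in place; DataFrame.append returns a new frame and would
    # leave results_df empty. Assumes `results` is a flat list of values for
    # the remaining columns.
    results_df.loc[len(results_df)] = [i] + results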

results_df.head()
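
Growing a DataFrame row by row inside a loop is easy to get wrong: `DataFrame.append` returned a new frame rather than modifying the original, and it was removed in pandas 2.0. A pattern that works across versions is to collect plain dictionaries in a list and build the frame once at the end. The sketch below uses `fake_run_loop`, a made-up stand-in with placeholder scores, since the return shape of the real `pipeline` loops is not shown here.

```python
import pandas as pd


def fake_run_loop(split_number):
    """Stand-in for a model loop; returns one placeholder row per model."""
    return [
        {"model": "LR", "parameters": {"C": 0.1}, "score": 0.0},
        {"model": "DT", "parameters": {"max_depth": 5}, "score": 0.0},
    ]


rows = []
for split_number in range(1, 4):
    for result in fake_run_loop(split_number):
        # Tag each row with its split before storing it.
        rows.append({"split": split_number, **result})

# Build the frame once, after the loop has finished.
results_df = pd.DataFrame(rows)
print(results_df.shape)  # (6, 4)
```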