## Objective

Apply a date cutoff to the dataset and check whether $R^2$ improves on the previous 0.50.

In [1]:
from __future__ import division
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline

rcParams['figure.figsize'] = 20, 5
warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")

from helper_functions import dummify_cols_and_baselines, make_alphas

In [2]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

## Cutting off dataset

I had previously determined that there are more unresolved issues starting around June 2016, presumably because not enough time had passed to complete them. Let's remove the rows opened on or after 2016-06-01 and see if performance improves.

In [5]:
df = df_orig[df_orig.OPEN_DT < '2016-06-01']
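
The cutoff is a boolean mask on `OPEN_DT`. A minimal sketch on a toy frame shows which rows survive and how many are dropped. It assumes `OPEN_DT` is a datetime column; the dates and the `CASE_ID` column below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_orig; values are illustrative, not real data.
toy = pd.DataFrame({
    "OPEN_DT": pd.to_datetime(
        ["2016-04-15", "2016-05-31", "2016-06-01", "2016-07-20"]
    ),
    "CASE_ID": [1, 2, 3, 4],  # hypothetical identifier column
})

cutoff = pd.Timestamp("2016-06-01")
mask = toy["OPEN_DT"] < cutoff  # strict inequality: June 1 itself is excluded
kept = toy[mask]
n_dropped = len(toy) - len(kept)

print(kept)
print("dropped rows:", n_dropped)
```

In this notebook the row count goes from 516,406 in `df_orig` to 449,027 in `df_dummified`.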

## Removing same columns as last time

In [6]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'TYPE', 'SubmittedPhoto', 'Property_Type', 'Source', 'neighborhood_from_zip']
cols_census = ['race_white',
     'race_black',
     'race_asian',
     'race_hispanic',
     'race_other',
     'poverty_pop_below_poverty_level',
     'earned_income_per_capita',
     'poverty_pop_w_public_assistance',
     'poverty_pop_w_food_stamps',
     'poverty_pop_w_ssi',
     'school',
     'school_std_dev',
     'housing',
     'housing_std_dev',
     'bedroom',
     'bedroom_std_dev',
     'value',
     'value_std_dev',
     'rent',
     'rent_std_dev',
     'income',
     'income_std_dev']
cols_engineered = ['queue_wk', 'queue_wk_open', 'is_description']

In [7]:
df = df[cols_orig_dataset + cols_census + cols_engineered]

## Dummify

In [8]:
cols_to_dummify = df.dtypes[df.dtypes == object].index
cols_to_dummify

Index([u'TYPE', u'Property_Type', u'Source', u'neighborhood_from_zip',
       u'school', u'housing'],
      dtype='object')

In [9]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify)

Zoning is baseline 0 6
other is baseline 1 6
Twitter is baseline 2 6
West Roxbury is baseline 3 6
8_6th_grade is baseline 4 6
rent is baseline 5 6


In [10]:
df_dummified.shape

(449027, 252)
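
`dummify_cols_and_baselines` lives in `helper_functions` and is not shown here. As a rough sketch of the general technique only, not the helper's actual implementation, one-hot encoding with one dropped baseline level per column can be done with `pd.get_dummies`. The function name `dummify_with_baselines` and the choice of the last sorted level as the baseline are assumptions for illustration.

```python
import pandas as pd

def dummify_with_baselines(df, cols):
    """Hypothetical helper: one-hot encode `cols`, dropping one baseline level each."""
    out = df.drop(columns=list(cols))
    baselines = []
    for col in cols:
        dummies = pd.get_dummies(df[col], prefix=col)
        baseline = dummies.columns[-1]  # assumption: last sorted level is the baseline
        baselines.append(baseline)
        out = pd.concat([out, dummies.drop(columns=[baseline])], axis=1)
    return out, baselines

# Toy data; values are made up.
toy = pd.DataFrame({
    "Source": ["Twitter", "Call", "App", "Call"],
    "y": [1.0, 2.0, 3.0, 4.0],
})
encoded, baselines = dummify_with_baselines(toy, ["Source"])
print(encoded.columns.tolist())
print(baselines)
```

Dropping one level per categorical column avoids perfect collinearity between the dummy columns and the intercept, which matters for an unregularized linear regression.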

## Running model

In [11]:
# sklearn.cross_validation is deprecated; everything needed lives in model_selection.
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop('COMPLETION_HOURS_LOG_10', axis=1),
    df_dummified.COMPLETION_HOURS_LOG_10,
    test_size=0.2,
    random_state=300
)

In [13]:
pipe = make_pipeline(StandardScaler(), LinearRegression())

In [14]:
# A single 80/20 shuffle split of the training data serves as the validation fold.
cv = ShuffleSplit(n_splits=1, test_size=0.2, random_state=300)

In [15]:
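# Grid for a LassoCV pipeline step; unused here because `pipe` ends in LinearRegression.
# params = {'lassocv__alphas': make_alphas(-3, -6)}
params = {}  # nothing to tune for plain linear regression
model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=cv, verbose=True,
                     return_train_score=True)
model.fit(X_train, y_train);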

Fitting 1 folds for each of 1 candidates, totalling 1 fits


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   18.5s finished


In [16]:
pd.DataFrame(model.cv_results_).T

| | 0 |
|---|---|
| mean_fit_time | 11.6905 |
| mean_score_time | 1.06006 |
| mean_test_score | -9.82613e+17 |
| mean_train_score | 0.487667 |
| params | {} |
| rank_test_score | 1 |
| split0_test_score | -9.82613e+17 |
| split0_train_score | 0.487667 |
| std_fit_time | 0 |
| std_score_time | 0 |


In [17]:
model.score(X_test, y_test)

0.48864275375981969
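
The validation score of about -9.8e17 in `cv_results_` looks odd next to a test score of 0.489. Since $R^2 = 1 - SS_{res}/SS_{tot}$ has no lower bound, a single wildly wrong prediction is enough to drive it arbitrarily negative. One plausible cause, which this notebook does not verify, is a huge unregularized coefficient on a dummy column that is nearly constant in the training split. The numbers below are made up purely to show the arithmetic.

```python
import numpy as np

def r2_manual(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 1.9, 3.2, 3.8])  # reasonable predictions
y_bad = y_good.copy()
y_bad[0] = 1e9                            # one exploded prediction

r2_good = r2_manual(y_true, y_good)
r2_bad = r2_manual(y_true, y_bad)
print(r2_good)
print(r2_bad)
```

If that is what happened here, regularization (for example the Lasso grid left commented out) would be one way to keep such coefficients in check.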

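Cutting off the data slightly lowers the test $R^2$ (0.489 versus the earlier 0.50), so I will not apply the cutoff in subsequent iterations. This comparison rests on the held-out test score; the single validation split scored about $-9.8 \times 10^{17}$, which is not a usable estimate. I will still want to remove these rows for my final model, since there are more unresolved issues starting from June 2016 and the final model should reflect that.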