## Objective

Add my engineered features to our model and see whether $R^2$ improves on the previous 0.48.

In [1]:
from __future__ import division
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline

rcParams['figure.figsize'] = 20, 5
warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")

from helper_functions import dummify_cols_and_baselines, make_alphas

In [2]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

## Removing extra columns

In [3]:
df_orig.head(1)

Unnamed: 0,CASE_ENQUIRY_ID,OPEN_DT,CLOSED_DT,TYPE,SubmittedPhoto,LOCATION_ZIPCODE,Property_Type,LATITUDE,LONGITUDE,Source,...,housing,housing_std_dev,bedroom,bedroom_std_dev,value,value_std_dev,rent,rent_std_dev,income,income_std_dev
905425,101001983786,2017-01-07 10:51:37,2017-01-07 11:46:43,Request for Snow Plowing,True,2124.0,Address,42.2809,-71.068,Citizens Connect App,...,own,26.870058,3,61.512329,350000.0,20.979404,1750,19.162161,112500,28.61672


In [4]:
list(df_orig.columns)

['CASE_ENQUIRY_ID',
 'OPEN_DT',
 'CLOSED_DT',
 'TYPE',
 'SubmittedPhoto',
 'LOCATION_ZIPCODE',
 'Property_Type',
 'LATITUDE',
 'LONGITUDE',
 'Source',
 u'description',
 'COMPLETION_HOURS_LOG_10',
 'tract_and_block_group',
 'queue_wk',
 'queue_wk_open',
 'race_white',
 'race_black',
 'race_asian',
 'race_hispanic',
 'race_other',
 'poverty_pop_below_poverty_level',
 'earned_income_per_capita',
 'poverty_pop_w_public_assistance',
 'poverty_pop_w_food_stamps',
 'poverty_pop_w_ssi',
 'is_description',
 'zipcode',
 'neighborhood_from_zip',
 'school',
 'school_std_dev',
 'housing',
 'housing_std_dev',
 'bedroom',
 'bedroom_std_dev',
 'value',
 'value_std_dev',
 'rent',
 'rent_std_dev',
 'income',
 'income_std_dev']

In [3]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'TYPE', 'SubmittedPhoto', 'Property_Type', 'Source', 'neighborhood_from_zip']
cols_census = ['race_white',
     'race_black',
     'race_asian',
     'race_hispanic',
     'race_other',
     'poverty_pop_below_poverty_level',
     'earned_income_per_capita',
     'poverty_pop_w_public_assistance',
     'poverty_pop_w_food_stamps',
     'poverty_pop_w_ssi',
     'school',
     'school_std_dev',
     'housing',
     'housing_std_dev',
     'bedroom',
     'bedroom_std_dev',
     'value',
     'value_std_dev',
     'rent',
     'rent_std_dev',
     'income',
     'income_std_dev']
cols_engineered = ['queue_wk', 'queue_wk_open', 'is_description']

In [4]:
df = df_orig[cols_orig_dataset + cols_census + cols_engineered]

## Dummify

In [5]:
cols_to_dummify = df.dtypes[df.dtypes == object].index
cols_to_dummify

Index([u'TYPE', u'Property_Type', u'Source', u'neighborhood_from_zip',
       u'school', u'housing'],
      dtype='object')

In [6]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify)

Zoning is baseline 0 6
other is baseline 1 6
Twitter is baseline 2 6
West Roxbury is baseline 3 6
8_6th_grade is baseline 4 6
rent is baseline 5 6
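The source of `dummify_cols_and_baselines` is not shown in this notebook, so here is a minimal sketch of what such a helper might do, not the actual implementation. It one-hot encodes each categorical column and drops one "baseline" category per column, so that the dummies are not perfectly collinear with the intercept. The function name `dummify_with_baselines`, the toy frame, and the rule for picking the baseline (least frequent category) are all assumptions made for illustration.

```python
import pandas as pd


def dummify_with_baselines(df, cols):
    """Hypothetical stand-in for dummify_cols_and_baselines.

    One-hot encodes each column in `cols` and drops one baseline dummy per
    column. Assumption: the baseline is the least frequent category.
    """
    out = df.copy()
    baseline_cols = []
    for col in cols:
        dummies = pd.get_dummies(out[col], prefix=col)
        # Assumed rule: least frequent category becomes the baseline.
        baseline = "{}_{}".format(col, out[col].value_counts().idxmin())
        baseline_cols.append(baseline)
        # Replace the original column with its dummies minus the baseline.
        out = pd.concat(
            [out.drop(col, axis=1), dummies.drop(baseline, axis=1)], axis=1
        )
    return out, baseline_cols


# Toy data for illustration only (not taken from the 311 dataset).
toy = pd.DataFrame({
    "Source": ["App", "App", "App", "Call", "Call", "Twitter"],
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
toy_dummified, toy_baselines = dummify_with_baselines(toy, ["Source"])
print(toy_baselines)
print(list(toy_dummified.columns))
```

With a baseline dropped, each remaining dummy coefficient reads as the difference from the baseline category.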


In [10]:
df_dummified.shape

(516406, 229)

## Running the model

In [7]:
# sklearn.cross_validation is deprecated; everything needed lives in model_selection
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop('COMPLETION_HOURS_LOG_10', axis=1),
    df_dummified.COMPLETION_HOURS_LOG_10,
    test_size=0.2,
    random_state=300
)

In [9]:
pipe = make_pipeline(StandardScaler(), LinearRegression())

In [10]:
# 10 random 80/20 splits of the training set (model_selection API)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=300)

In [11]:
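# Leftover from a LassoCV experiment: the pipeline uses LinearRegression,
# so there is no 'lassocv' step to tune.
# params = {'lassocv__alphas': make_alphas(-3, -6)}
params = {}  # empty grid: GridSearchCV only cross-validates the single pipeline
model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=cv, verbose=True,
                     return_train_score=True)  # keep train scores in cv_results_
model.fit(X_train, y_train);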

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Done   4 out of  10 | elapsed:   28.2s remaining:   42.4s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   39.4s finished


In [12]:
pd.DataFrame(model.cv_results_).T

Unnamed: 0,0
mean_fit_time,15.1977
mean_score_time,1.33867
mean_test_score,-9.2195e+16
mean_train_score,0.501991
params,{}
rank_test_score,1
split0_test_score,-1.33846e+11
split0_train_score,0.501685
split1_test_score,0.499882
split1_train_score,0.502318
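The mean test score is dominated by a few extreme splits, so it helps to summarise the splits after setting those aside. The sketch below does that using only the two splits visible in the output above. The cutoff of -1 for what counts as an "abnormal" $R^2$ is an arbitrary assumption, and the variable names are illustrative.

```python
# Scores copied from the cv_results_ output above (only splits 0 and 1 are shown).
split_test_scores = [-1.33846e+11, 0.499882]
split_train_scores = [0.501685, 0.502318]

# Assumption: treat any R^2 below -1 as a degenerate split and set it aside.
ABNORMAL_R2_CUTOFF = -1.0
plausible_test = [s for s in split_test_scores if s > ABNORMAL_R2_CUTOFF]
n_abnormal = len(split_test_scores) - len(plausible_test)

mean_train = sum(split_train_scores) / len(split_train_scores)
mean_plausible_test = sum(plausible_test) / len(plausible_test)
gap = mean_train - mean_plausible_test

print("abnormal splits set aside:", n_abnormal)
print("mean train R^2: {:.4f}".format(mean_train))
print("mean plausible test R^2: {:.4f}".format(mean_plausible_test))
print("train - test gap: {:.4f}".format(gap))
```

On the full `cv_results_` frame the same filter could be applied to every `split*_test_score` column rather than to a hand-copied list.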


In [13]:
model.score(X_test, y_test)

0.49561001510175473

My engineered features, which capture how many open issues workers are already dealing with when a new issue arrives and whether the user wrote a description, add a bit of signal to the model: test $R^2$ rises from 0.48 to about 0.50.

## Are we in a high-bias or high-variance situation?

Looking at the cross-validation train and test scores, and ignoring the abnormal test scores (I think because some of the y-outliers land in those subsets), the two are very close: about 0.50 each. So we don't seem to be in either a high-bias or a high-variance situation.
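A learning curve is one common way to check this more directly: train and validation scores that converge at a low value point toward bias, while a persistent gap between them points toward variance. The sketch below uses scikit-learn's `learning_curve` on synthetic data from `make_regression`, which is only a stand-in for the 311 data. The names `X_demo`, `y_demo`, `pipe_demo` and `cv_demo` are illustrative, and in the notebook they would be replaced by `X_train`, `y_train`, `pipe` and `cv`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the 311 dataset).
X_demo, y_demo = make_regression(
    n_samples=2000, n_features=20, noise=50.0, random_state=300
)

pipe_demo = make_pipeline(StandardScaler(), LinearRegression())
cv_demo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=300)

# R^2 on the training and validation folds at five training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    pipe_demo, X_demo, y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=cv_demo, scoring="r2",
)

# Average over the CV splits and print the train/validation gap per size.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print("n={:5d}  train R^2={:.3f}  val R^2={:.3f}  gap={:.3f}".format(n, tr, va, tr - va))
```

On the real data this would refit the pipeline many times, so a subsample or fewer splits may be needed to keep the run time reasonable.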