## Objective

Fit a linear regression on `COMPLETION_HOURS_LOG_10` using all the relevant features from the original 311 dataset.

In [2]:
from __future__ import division
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline

rcParams['figure.figsize'] = 20, 5
warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")

from helper_functions import dummify_cols_and_baselines, make_alphas

In [3]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

## Removing extra columns

In [10]:
df_orig.head(1)

Unnamed: 0,CASE_ENQUIRY_ID,OPEN_DT,CLOSED_DT,TYPE,SubmittedPhoto,LOCATION_ZIPCODE,Property_Type,LATITUDE,LONGITUDE,Source,...,housing_std_dev,bedroom,bedroom_std_dev,value,value_std_dev,rent,rent_std_dev,income,income_std_dev,COMPLETION_HOURS_LOG_10
905425,101001983786,2017-01-07 10:51:37,2017-01-07 11:46:43,Request for Snow Plowing,True,2124.0,Address,42.2809,-71.068,Citizens Connect App,...,26.870058,3,61.512329,350000.0,20.979404,1750,19.162161,112500,28.61672,-0.037
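
The sample row above is consistent with `COMPLETION_HOURS_LOG_10` being the base-10 logarithm of the hours elapsed between `OPEN_DT` and `CLOSED_DT`. This is an inference from that single row, not something taken from the code that built the column, so treat the following as a sanity check under that assumption:

```python
import numpy as np
import pandas as pd

# Timestamps copied from the sample row shown above.
open_dt = pd.Timestamp("2017-01-07 10:51:37")
closed_dt = pd.Timestamp("2017-01-07 11:46:43")

# Assumption: the target is log10 of the elapsed time in hours.
completion_hours = (closed_dt - open_dt).total_seconds() / 3600.0
completion_hours_log_10 = np.log10(completion_hours)

print("hours: {:.3f}, log10(hours): {:.3f}".format(
    completion_hours, completion_hours_log_10))
```

Under that assumption, a prediction `p` from the model converts back to hours as `10 ** p`.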


In [5]:
cols_to_keep = ['COMPLETION_HOURS_LOG_10', 'TYPE', 'SubmittedPhoto', 'Property_Type', 'Source', 'neighborhood_from_zip']

In [6]:
df = df_orig[cols_to_keep]

## Dummify

In [7]:
cols_to_dummify = df.dtypes[df.dtypes == object].index
cols_to_dummify

Index([u'TYPE', u'Property_Type', u'Source', u'neighborhood_from_zip'], dtype='object')

In [8]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify)

Zoning is baseline 0 4
other is baseline 1 4
Twitter is baseline 2 4
West Roxbury is baseline 3 4


In [9]:
df_dummified.shape

(516406, 219)
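
The helper `dummify_cols_and_baselines` lives in `helper_functions` and is not shown in this notebook. As a rough sketch of the idea only, the hypothetical function below one-hot encodes each categorical column with `pd.get_dummies` and drops one level per column as the baseline. Picking the last level in sorted order is an assumption made for illustration, and `dummify_with_baselines` and the toy frame are invented names and data.

```python
import pandas as pd

def dummify_with_baselines(df, cols):
    """Hypothetical stand-in: one-hot encode `cols`, dropping one baseline level each."""
    out = df.copy()
    baselines = []
    for col in cols:
        dummies = pd.get_dummies(out[col], prefix=col)
        # Assumption: the last dummy column (sorted order) is the baseline.
        baseline = dummies.columns[-1]
        baselines.append(baseline)
        out = pd.concat(
            [out.drop(col, axis=1), dummies.drop(baseline, axis=1)], axis=1
        )
    return out, baselines

# Toy data for illustration only.
toy = pd.DataFrame({
    "COMPLETION_HOURS_LOG_10": [-0.037, 1.2, 0.5],
    "Source": ["Citizens Connect App", "Twitter", "Citizens Connect App"],
})
toy_dummified, toy_baselines = dummify_with_baselines(toy, ["Source"])
print(toy_baselines)
print(list(toy_dummified.columns))
```

Dropping a baseline means a column with k levels contributes k - 1 dummy columns, which keeps the dummies from summing to a constant and being perfectly collinear with the intercept.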

## Running the model

In [10]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection throughout.
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [11]:
# Hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop('COMPLETION_HOURS_LOG_10', axis=1),
    df_dummified.COMPLETION_HOURS_LOG_10,
    test_size=0.2,
    random_state=300
)

In [12]:
pipe = make_pipeline(StandardScaler(), LinearRegression())

In [14]:
# A single 80/20 shuffle split of the training rows serves as the validation fold.
cv = ShuffleSplit(n_splits=1, test_size=0.2, random_state=300)
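
To make the role of `cv` concrete, here is a small sketch on toy data using the `sklearn.model_selection` version of `ShuffleSplit`. With one split and `test_size=0.2`, the splitter yields a single (train, validation) pair of index arrays. `X_toy` is made up for the example.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Ten toy rows with two features each.
X_toy = np.arange(20).reshape(10, 2)

cv_toy = ShuffleSplit(n_splits=1, test_size=0.2, random_state=300)
splits = list(cv_toy.split(X_toy))

# One split: 8 rows for fitting, 2 rows held out for validation.
train_idx, val_idx = splits[0]
print(len(splits), len(train_idx), len(val_idx))
```

Because there is only one split, the `mean_*` and `split0_*` entries in `cv_results_` are identical and the `std_*` entries are 0.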

In [15]:
# Empty grid: GridSearchCV fits the pipeline once and scores it on the single validation split.
params = {}
model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=cv, verbose=True)
model.fit(X_train, y_train);

Fitting 1 folds for each of 1 candidates, totalling 1 fits


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   18.3s finished


In [16]:
model.score(X_test, y_test)

0.4815050011144506

In [17]:
pd.DataFrame(model.cv_results_).T

Unnamed: 0,0
mean_fit_time,11.4462
mean_score_time,1.04059
mean_test_score,-3.85918e+16
mean_train_score,0.488159
params,{}
rank_test_score,1
split0_test_score,-3.85918e+16
split0_train_score,0.488159
std_fit_time,0
std_score_time,0


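The test-set $R^2$ of about 0.48 (0.49 on the training rows) implies that adding the rest of the original dataset doesn't add extra signal to our model. This makes intuitive sense, as we only added whether the user submitted a photo, the property type, and the source of the issue. The validation-split score of -3.86e+16 in `cv_results_` is inconsistent with both of those numbers and probably reflects a numerical problem in the unregularized fit rather than real performance, so it is not used for this conclusion.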
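
Both `model.score` and the `cv_results_` scores here are $R^2$, since `GridSearchCV` falls back to the estimator's own `score` method when no `scoring` is given. $R^2$ is capped at 1 but has no lower bound, so a handful of extreme predictions can push it to very large negative values. The sketch below computes $R^2$ by hand on toy numbers to show this. The values are invented for illustration and are not taken from the 311 data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([0.0, 1.0, 2.0, 3.0])

r2_perfect = r_squared(y, y)                  # perfect predictions
r2_mean = r_squared(y, np.full(4, y.mean()))  # always predicting the mean
# One wildly wrong prediction dominates the residual sum of squares.
r2_blown = r_squared(y, y + np.array([0.0, 0.0, 0.0, 1e8]))

print(r2_perfect, r2_mean, r2_blown)
```

A score of this kind on one split, next to ordinary train and test scores, usually points at a few rows with exploding predictions. One plausible cause is very large coefficients on rare or nearly collinear dummy columns, though that has not been verified for this model.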