## Objective

Remove outliers in the target `y` and check whether $R^2$ improves on the previous 0.50.

In [93]:
from __future__ import division
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline

rcParams['figure.figsize'] = 20, 5
warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")

from helper_functions import dummify_cols_and_baselines, make_alphas

In [56]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

## Removing outliers

A standard procedure is to remove values more than 3 standard deviations from the mean. My data has many low values and a few very high ones. From what I have seen, the low values are very likely to be genuine, while the very high values are less trustworthy.

So I will remove values more than 3 SDs from the median instead, computed separately for each `TYPE`. The filter is symmetric: it drops values more than 3 SDs on either side of the type's median.

Ideally, the filter would also take the time dimension into account. I would like to add that given more time.
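
The by-type rule described above can be illustrated on a small example. This is a minimal sketch on made-up data, not the notebook's dataset; the frame `toy` and the helper `is_outlier` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): type "a" has one extreme value, type "b" has none.
toy = pd.DataFrame({
    "TYPE": ["a"] * 12 + ["b"] * 4,
    "y": [1.0] * 11 + [100.0] + [2.0, 2.5, 3.0, 3.5],
})

def is_outlier(s, n_sds=3):
    # True where a value is more than n_sds standard deviations
    # away from the median of its own group.
    return (s - s.median()).abs() > n_sds * s.std()

# transform() evaluates the rule within each TYPE and keeps the original index.
# astype(bool) guards against pandas versions that cast the result to float.
flags = toy.groupby("TYPE")["y"].transform(is_outlier).astype(bool)
toy_clean = toy[~flags]
```

Only the value 100.0 in type "a" is flagged here. Because the median and SD are computed per type, a value that is ordinary for one type can still be flagged in another.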

In [75]:
def replace(group, stds):
    # Set values more than `stds` standard deviations from the group median to NaN.
    # http://stackoverflow.com/questions/29740216/remove-outliers-3-std-and-replace-with-np-nan-in-python-pandas
    # Series.mask returns a new Series, so the group is not modified in place.
    return group.mask(np.abs(group - group.median()) > stds * group.std())

def remove_outliers_by_type(df, y_col, std_devs=3):
    # Apply the filter within each TYPE, then drop the rows that were set to NaN.
    group_column = 'TYPE'
    df = df.copy()
    df[y_col] = df.groupby(group_column)[y_col].transform(lambda g: replace(g, std_devs))
    return df.dropna(subset=[y_col])

In [73]:
df_outliers_removed = remove_outliers_by_type(df_orig, y_col='COMPLETION_HOURS_LOG_10')
df_outliers_removed.shape

(508653, 40)

I'm removing ~1.5% of my rows (7,753 of 516,406).
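
As a quick check of that percentage, the arithmetic can be reproduced from the two shapes printed above. The variable names are illustrative only.

```python
rows_before = 516406  # df_orig.shape[0]
rows_after = 508653   # df_outliers_removed.shape[0]

rows_removed = rows_before - rows_after
pct_removed = 100.0 * rows_removed / rows_before
print("%d rows removed (%.2f%%)" % (rows_removed, pct_removed))
```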

## Removing same columns as last time

In [77]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'TYPE', 'SubmittedPhoto', 'Property_Type', 'Source', 'neighborhood_from_zip']
cols_census = ['race_white',
     'race_black',
     'race_asian',
     'race_hispanic',
     'race_other',
     'poverty_pop_below_poverty_level',
     'earned_income_per_capita',
     'poverty_pop_w_public_assistance',
     'poverty_pop_w_food_stamps',
     'poverty_pop_w_ssi',
     'school',
     'school_std_dev',
     'housing',
     'housing_std_dev',
     'bedroom',
     'bedroom_std_dev',
     'value',
     'value_std_dev',
     'rent',
     'rent_std_dev',
     'income',
     'income_std_dev']
cols_engineered = ['queue_wk', 'queue_wk_open', 'is_description']

In [78]:
df = df_outliers_removed[cols_orig_dataset + cols_census + cols_engineered]

## Dummify

In [79]:
cols_to_dummify = df.dtypes[df.dtypes == object].index
cols_to_dummify

Index([u'TYPE', u'Property_Type', u'Source', u'neighborhood_from_zip',
       u'school', u'housing'],
      dtype='object')

In [80]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify)

Zoning is baseline 0 6
other is baseline 1 6
Twitter is baseline 2 6
West Roxbury is baseline 3 6
8_6th_grade is baseline 4 6
rent is baseline 5 6
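
The helper `dummify_cols_and_baselines` lives in `helper_functions` and is not shown in this notebook. As a rough sketch of the idea only, assuming it one-hot encodes each column and drops one level per column as the baseline, something like the following would behave similarly. The function name `dummify_with_baseline`, the choice of the last sorted level as the baseline, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

def dummify_with_baseline(df, cols):
    # Hypothetical stand-in for dummify_cols_and_baselines: one-hot encode each
    # column in `cols` and drop one level per column as the baseline.
    # Assumption: the baseline is the last level in sorted order.
    out = df.drop(columns=list(cols))
    baselines = []
    for col in cols:
        dummies = pd.get_dummies(df[col], prefix=col)
        baseline = dummies.columns[-1]
        baselines.append(baseline)
        out = out.join(dummies.drop(columns=[baseline]))
    return out, baselines

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "Source": ["Twitter", "Call", "App", "Call"],
    "y": [1.0, 2.0, 3.0, 4.0],
})
toy_dummified, toy_baselines = dummify_with_baseline(toy, ["Source"])
```

Dropping one level per column keeps the dummy columns from summing to a constant, which would otherwise be perfectly collinear with the intercept in a linear model.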


In [81]:
df_dummified.shape

(508653, 253)

## Running model

In [95]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection throughout.
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


In [83]:
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop('COMPLETION_HOURS_LOG_10', axis=1),
    df_dummified.COMPLETION_HOURS_LOG_10,
    test_size=0.2,
    random_state=300
)

In [84]:
pipe = make_pipeline(StandardScaler(), LinearRegression())

In [85]:
# model_selection.ShuffleSplit takes n_splits; the number of samples is inferred at fit time.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=300)

In [86]:
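# Nothing to search over: the pipeline uses plain LinearRegression.
# A grid such as {'lassocv__alphas': make_alphas(-3, -6)} would only apply to a LassoCV step.
params = {}
# return_train_score=True keeps the train scores in cv_results_ on newer scikit-learn versions.
model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=cv, verbose=True, return_train_score=True)
model.fit(X_train, y_train);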

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Done   4 out of  10 | elapsed:   27.9s remaining:   41.9s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   39.6s finished


In [87]:
pd.DataFrame(model.cv_results_).T

                     0
mean_fit_time        14.7322
mean_score_time      1.37415
mean_test_score      -2.04099e+18
mean_train_score     0.558793
params               {}
rank_test_score      1
split0_test_score    0.556459
split0_train_score   0.55938
split1_test_score    0.559988
split1_train_score   0.558477


In [88]:
model.score(X_test, y_test)

-3.688740028649031e+17

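The per-split CV test scores shown above (about 0.56) are better than in previous iterations of the model (0.50), which makes intuitive sense once the `y` outliers are removed. However, `mean_test_score` is about $-2 \times 10^{18}$, so at least one of the 10 splits must have produced wildly wrong predictions.

The same happens on the test set: a few predicted values are extremely far from the actual ones. This may be a symptom of overfitting, although the train and per-split test scores are nearly identical, which suggests a handful of unstable coefficients (for example on rare or near-collinear dummy columns) rather than a poor fit overall.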
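
One way to see which split drags the mean CV score down is to look at the per-split test scores individually. The sketch below uses a made-up dictionary in place of `model.cv_results_`; the helper `worst_split` and all values in `toy_cv_results` are hypothetical.

```python
import re
import numpy as np

# Hypothetical stand-in for model.cv_results_ (values are made up for illustration).
toy_cv_results = {
    "mean_test_score": -3.3e9,
    "split0_test_score": 0.55,
    "split1_test_score": 0.56,
    "split2_test_score": -1.0e10,
}

def worst_split(cv_results):
    # Collect the per-split test scores for the first parameter candidate
    # and return the (key, score) pair with the lowest score.
    split_scores = {
        key: float(np.ravel(value)[0])
        for key, value in cv_results.items()
        if re.match(r"split\d+_test_score$", key)
    }
    return min(split_scores.items(), key=lambda kv: kv[1])

worst_key, worst_score = worst_split(toy_cv_results)
```

With the real results, the returned key would identify the split whose held-out rows triggered the extreme predictions, which is a natural starting point for inspecting those rows.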

In [120]:
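y_pred = model.predict(X_test)
# Cap the upper tail at the 99.99th percentile. This does not touch the
# large negative predictions, which are the extreme values seen below.
upper = np.percentile(y_pred, 99.99)
y_pred[y_pred > upper] = upper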

In [124]:
y_test.describe()

count    101731.000000
mean          1.717758
std           1.119753
min          -2.857332
25%           1.353593
50%           1.865962
75%           2.375708
max           4.582340
Name: COMPLETION_HOURS_LOG_10, dtype: float64

In [122]:
pd.Series(y_pred).describe()

count    1.017310e+05
mean    -2.132221e+06
std      6.800788e+08
min     -2.169132e+11
25%      1.133769e+00
50%      1.940363e+00
75%      2.328280e+00
max      3.657871e+00
dtype: float64

In [123]:
r2_score(y_test, y_pred)

-3.688740028649031e+17
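
This $R^2$ is identical to the `model.score` result, since the cap only affects the upper tail while the extreme predictions are large negative values. A two-sided variant could be sketched as follows, clipping predictions to the range of the training target. This is an illustration on made-up numbers, not a result from this dataset; the `toy_*` arrays and the helper `r2` are hypothetical.

```python
import numpy as np

def r2(y_true, y_hat):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical): one prediction has blown up on the negative side.
toy_y_train = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
toy_y_test = np.array([1.0, 2.0, 3.0, 2.5])
toy_y_pred = np.array([1.1, 1.9, 3.2, -1.0e6])

# Clip predictions to the range of the training target, on both sides.
toy_y_clipped = np.clip(toy_y_pred, toy_y_train.min(), toy_y_train.max())

r2_raw = r2(toy_y_test, toy_y_pred)
r2_clipped = r2(toy_y_test, toy_y_clipped)
```

Clipping only limits the damage a single extreme prediction does to the score. It does not address whatever makes the model produce that prediction in the first place.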