# Testing out Models

This file contains several executions of our pipeline, each with a different combination of feature creation methods, feature selection methods, models, and so on. You can think of it as a kind of grid search cross validation in which we test not only model parameters but also different combinations of features.

#### Warning: Do NOT re-run this notebook end to end. The whole thing takes several hours to run.

If there is a particular pipeline you would like to retry, run it individually, and be aware of how long it will take before you do. The output of each pipeline run includes start and stop timestamps, so please consult those before beginning.

The best model of all of these is the second-to-last: all the basic features, plus deep feature synthesis on the season, plus recursive feature elimination for selection ($R^{2}$ of 0.8638 on the holdout).

In [1]:
# Outside modules
import warnings
warnings.simplefilter('ignore')  # suppress warnings
import dask.dataframe as dd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# Home-made modules
from utils import pipeline_casero
from preprocessing import *
from feature_creation import *
from feature_selection import *
from dim_reduction import *

In [2]:
data = dd.read_csv("https://gist.githubusercontent.com/catyselman/9353e4e480ddf2db44b44a79e14718b5/raw/ded23e586ca5db1b4a566b1e289acd12ebf69357/bikeshare_hourly.csv")

## Baselines

This run executes all the preprocessing steps with no feature creation or feature selection, to get an understanding of model performance before applying more advanced techniques to the data. The models tested are linear regression, random forest, and gradient boosting with their default parameters. Our best model, a Random Forest, gets an $R^{2}$ of 0.8325 on the holdout.
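As a rough illustration of what a baseline comparison like this involves, the sketch below fits the same three default models on synthetic data and scores each on a holdout split. It is not the code inside `pipeline_casero`; the data, the split size, and the `SEED` value here are all assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

SEED = 0  # hypothetical seed, not the project's value

# Synthetic stand-in for the preprocessed bike-share data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=SEED)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

baselines = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=SEED),
    "Gradient Boosting": GradientBoostingRegressor(random_state=SEED),
}

# Fit each default model and record its R^2 on the holdout
holdout_r2 = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    holdout_r2[name] = r2_score(y_hold, model.predict(X_hold))

best_name = max(holdout_r2, key=holdout_r2.get)
print(holdout_r2)
print("Best baseline:", best_name)
```

Which model wins on synthetic data says nothing about the bike-share result; the point is only the shape of the comparison.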

In [4]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category]


lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED]}
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": { "random_state": [SEED] }
      }

models = [lm, rfr, gbr]

Note: the call below raised errors originating in utils.py. We could not get `get_dummies` to work with Dask.

In [5]:
pipeline_casero(data, preprocessing=pp, models=models)

Beginning pipeline at 2019-05-22 14:07:48.598287

Performing preprocessing steps...
	Dropping the registered variable since we won't have this information
	Dropping the causual variable since we won't have this information
	Dropping the date variable since this information is encoded in other variables
	Dropping index variable
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-22 14:07:48.711984, performed 12 steps
New Shape of data: 13

Performing feature creation...
Feature Creation completed at 2019-05-22 14:07:48.715972, performed 0 steps
New Shape of data: 13

Dummifying...


NotImplementedError: `get_dummies` with unknown categories is not supported. Please use `column.cat.as_known()` or `df.categorize()` beforehand to ensure known categories

## Basic Features with Grid Search

The next thing we try is adding two basic features: a boolean field indicating whether the hour falls within commute hours, and a field containing the row's cluster from a clustering on the weather variables. To ensure that we are not fitting too much noise, we also apply a feature selection method that removes features contributing less than 0.1% of the decisions made by the trees in a simple random forest. We also run Grid Search Cross Validation over a small set of parameters for random forest and gradient boosting, namely the number of trees for both and the learning rate for gradient boosting. This increases our $R^{2}$ on the test set to 0.8625, and our new best model after parameter tuning is Gradient Boosting.
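The importance-threshold selection can be sketched with scikit-learn's `SelectFromModel`. This is an assumption about how a step like `tree_selection` might work, not its actual implementation. Random forest importances sum to 1, so a 0.1% cutoff corresponds to `threshold=0.001`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
# Only the first two columns drive the target; the rest are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

forest = RandomForestRegressor(n_estimators=50, random_state=0)

# Keep features whose importance is at least 0.001 (0.1% of the total)
selector = SelectFromModel(forest, threshold=0.001)
X_selected = selector.fit_transform(X, y)

kept = selector.get_support()  # boolean mask over the original columns
print(kept, X_selected.shape)
```

A threshold this low is lenient, so it mostly prunes features the forest almost never splits on.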

In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]

fc = [commute_hours, weather_cluster]

fs = [tree_selection]

lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(10, 110, 10))
                 }
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(500, 1050, 50)),
                  "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2]
                 }
      }

models = [lm, rfr, gbr]


pipeline_casero(PATH, preprocessing=pp, creation=fc, selection=fs, models=models)
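The `params` dictionaries above have the shape that scikit-learn's `GridSearchCV` accepts as a `param_grid`. The sketch below assumes that is how they are consumed, which the notebook does not show. It uses a much smaller grid and synthetic data so that it runs in seconds. In the cell above, the gradient boosting grid has 11 values of `n_estimators` times 5 learning rates, or 55 candidates per cross-validation fold, which goes a long way toward explaining the multi-hour runtime.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

SEED = 0  # hypothetical seed, not the project's value
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=SEED)

# Same dictionary layout as in the notebook, with a deliberately tiny grid
gbr_spec = {"name": "Gradient Boosting",
            "model": GradientBoostingRegressor(),
            "params": {"random_state": [SEED],
                       "n_estimators": [20, 40],
                       "learning_rate": [0.05, 0.1]}}

# Every combination in "params" is cross-validated and scored by R^2
search = GridSearchCV(gbr_spec["model"], gbr_spec["params"], scoring="r2", cv=3)
search.fit(X, y)

print(gbr_spec["name"], search.best_params_, search.best_score_)
```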

## Deep Feature Synthesis 

In the next trial, we add features generated by deep feature synthesis on top of the steps from the previous iteration. Unfortunately, our score on the holdout actually decreases to 0.860 as a result of adding this step.
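To give a feel for the kind of columns deep feature synthesis produces, here is a hand-rolled approximation in plain pandas: aggregates of numeric columns computed per season and broadcast back onto each row. It is not the `deep_features` step itself and does not use a feature synthesis library. The choice of `temp` and `hum`, the aggregation functions, and the column naming are all assumptions for illustration.

```python
import pandas as pd

# Toy rows; the values are invented
df = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 3],
    "temp": [0.2, 0.3, 0.5, 0.6, 0.8, 0.7],
    "hum": [0.8, 0.7, 0.6, 0.5, 0.4, 0.5],
})

# For each numeric column, attach its per-season mean and max to every row
for col in ["temp", "hum"]:
    grouped = df.groupby("season")[col]
    df[f"season.MEAN({col})"] = grouped.transform("mean")
    df[f"season.MAX({col})"] = grouped.transform("max")

print(df.columns.tolist())
```

Generated features of this kind multiply quickly, which is why a selection step follows feature creation in every trial.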

In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]


fc = [commute_hours, weather_cluster, deep_features]

fs = [tree_selection]

lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(10, 110, 10))
                 }
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(500, 1050, 50)),
                  "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2]
                 }
      }

models = [lm, rfr, gbr]


pipeline_casero(PATH, preprocessing=pp, creation=fc, selection=fs, models=models)

## Using Registered and Casual Predictions

The theory behind this approach is that, although we cannot use the true registered and casual rider counts directly when predicting the total number of users (we will not know them at prediction time), we can predict those values and feed the predictions to a final model as features. Doing this did not improve our score, probably for a few reasons.

First, the cross validation ended up overfitting tremendously, because data from the "blind" sets during cross validation was used to generate the values for those features. As a result, the correlations between these predictions and the target variable were extremely high. One consequence is that during feature selection every feature besides these predictions was ultimately dropped, since the predictions were by far the strongest separating features. It also led to Linear Regression being chosen as the ideal model during cross validation, because of the small feature space and the overfitted correlations between the features and the target. The result is still good, because these predictions explain the target decently (for good reason), but it is not better than previous iterations.
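The leak described here can be reproduced on synthetic data. The sketch below compares predictions made on rows the helper model was trained on with out-of-fold predictions from scikit-learn's `cross_val_predict`, which is one common way to build such features without touching the blind folds. This is not how `prediction_forecasts` is implemented; the data and the helper model are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# Synthetic stand-in for the "registered" rider counts
registered = 5.0 * X[:, 0] + rng.normal(size=300)

helper = RandomForestRegressor(n_estimators=30, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold: each row is predicted by a model that never saw that row
registered_oof = cross_val_predict(helper, X, registered, cv=folds)

# Leaky: the helper predicts the same rows it was fitted on
registered_leaky = helper.fit(X, registered).predict(X)

corr_oof = np.corrcoef(registered_oof, registered)[0, 1]
corr_leaky = np.corrcoef(registered_leaky, registered)[0, 1]
print(corr_oof, corr_leaky)

# The out-of-fold column is the one that is safe to append as a feature
X_stacked = np.column_stack([X, registered_oof])
```

The leaky predictions track the target more closely than any honest forecast could, which is the inflated correlation that pushed every other feature out during selection.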

In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]


fc = [commute_hours, weather_cluster, prediction_forecasts]

fs = [tree_selection]


lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(10, 110, 10))
                 }
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(500, 1050, 50)),
                  "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2]
                 }
      }

models = [lm, rfr, gbr]


pipeline_casero(PATH, preprocessing=pp, creation=fc, selection=fs, models=models)

## Deep Feature Synthesis with Stepwise


This iteration uses the features from the "Deep Feature Synthesis" iteration and replaces the simple tree-based feature importance selection with forward stepwise feature selection, in which features that improve the score are added one at a time until adding the next best feature actually worsens the score. This results in far fewer features, which is better in that the model is simpler, but ultimately the lost information hurt the model more than it helped, and the score on the holdout is still lower than the best seen so far.
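The stopping rule described above can be written as a short greedy loop. This is a minimal sketch assuming cross-validated $R^{2}$ as the score and a linear model as the estimator; the function name `forward_stepwise` is hypothetical and `r2_selection` may differ in its details.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_stepwise(X, y, estimator, cv=3):
    """Greedily add the feature that most improves CV R^2; stop when none does."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score every candidate when added to the features chosen so far
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y,
                               cv=cv, scoring="r2").mean()
            for j in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break  # the next best feature no longer improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Columns 0 and 2 carry the signal; the others are noise
y = 4.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

selected, score = forward_stepwise(X, y, LinearRegression())
print(selected, score)
```

Each round refits the model once per remaining feature, so the cost grows quickly with the number of candidates, and deep feature synthesis generates many of them.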

In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]


fc = [commute_hours, weather_cluster, deep_features]

fs = [r2_selection]


lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(10, 110, 10))
                 }
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(500, 1050, 50)),
                  "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2]
                 }
      }

models = [lm, rfr, gbr]


pipeline_casero(PATH, preprocessing=pp, creation=fc, selection=fs, models=models)

## Registered and Casual Forecast with Stepwise

In this trial we use the registered and casual forecasts as before, but this time we change the feature selection method to see whether other variables, when added to the data, still improve the $R^{2}$. Some features were still added after the forecast variables, but their contribution was very small, and this model also suffers from overfitting, so our score on the holdout still does not improve on our best model so far.

In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]


fc = [commute_hours, weather_cluster, prediction_forecasts, deep_features]

fs = [r2_selection]


lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(10, 110, 10))
                 }
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(500, 1050, 50)),
                  "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2]
                 }
      }

models = [lm, rfr, gbr]


pipeline_casero(PATH, preprocessing=pp, creation=fc, selection=fs, models=models)

## Registered and Casual Forecasts with PCA

Here we try principal component analysis rather than feature selection, to see whether implicitly folding all the other variables into the forecast variables via a linear transformation improves our result. The answer is no: this was a disaster, resulting in a negative $R^{2}$.
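A negative $R^{2}$ means the model did worse than simply predicting the mean of the target, since $R^{2} = 1 - SS_{res}/SS_{tot}$ and the mean predictor has $SS_{res} = SS_{tot}$. The toy numbers below are invented purely to show this with scikit-learn's `r2_score`.

```python
from sklearn.metrics import r2_score

y_true = [10, 20, 30, 40]

# Always predicting the mean (25) gives SS_res == SS_tot == 500
mean_pred = [25, 25, 25, 25]
r2_mean = r2_score(y_true, mean_pred)

# Predictions that run opposite to the truth give SS_res == 2000
bad_pred = [40, 30, 20, 10]
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean)  # 0.0
print(r2_bad)   # -3.0
```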

In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]


fc = [commute_hours, weather_cluster, prediction_forecasts, deep_features]

dr = [pca]


lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(10, 110, 10))
                 }
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(500, 1050, 50)),
                  "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2]
                 }
      }

models = [lm, rfr, gbr]


pipeline_casero(PATH, preprocessing=pp, creation=fc, reduction=dr, models=models)

## Using Registered and Casual only with Tree-Based Models

Since we suspected that linear regression was only being chosen with the forecast variables because of overfitting, we wanted to see whether excluding that model from consideration would help. Without a linear regression model, gradient boosting was again selected as the best model, but it did not perform better than the linear model trained with these variables and is therefore still worse than our best score so far.

In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]


fc = [commute_hours, weather_cluster, prediction_forecasts]

fs = [tree_selection]


lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(10, 110, 10))
                 }
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(500, 1050, 50)),
                  "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2]
                 }
      }

models = [rfr, gbr]


pipeline_casero(PATH, preprocessing=pp, creation=fc, selection=fs, models=models)

## Registered/Casual, Deep Features, Stepwise, Only Trees

Here we use the forecasted variables along with deep features and stepwise feature selection, again with only tree-based models, to see how this combination performs. There is still no improvement.

In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]


fc = [commute_hours, weather_cluster, prediction_forecasts, deep_features]

fs = [r2_selection]


lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(10, 110, 10))
                 }
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(500, 1050, 50)),
                  "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2]
                 }
      }

models = [rfr, gbr]


pipeline_casero(PATH, preprocessing=pp, creation=fc, selection=fs, models=models)

## Recursive Feature Elimination with Deep Features

This iteration is our best model in terms of performance on the holdout.

This is a variation of the "Deep Feature Synthesis" trial, except that instead of removing several features at once based on feature importance in a random forest, we remove them one at a time and refit recursively until the number of features is reduced to 50. Fewer features are used, but they have been removed more intelligently, and because we are using deep feature synthesis we have a large pool of features to choose from to explain the target variable.

The final $R^{2}$ on the holdout is 0.8638.
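Removing one feature at a time and refitting is what scikit-learn's `RFE` does with `step=1`. The sketch below assumes the notebook's `rfe` step behaves similarly, which is not shown here. It uses synthetic data and keeps 5 of 12 features rather than 50, so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

X, y = make_regression(n_samples=200, n_features=12, n_informative=4,
                       random_state=0)

# step=1 drops the single least important feature, then refits, and repeats
rfe_selector = RFE(RandomForestRegressor(n_estimators=20, random_state=0),
                   n_features_to_select=5, step=1)
rfe_selector.fit(X, y)

print(rfe_selector.support_)   # True for the 5 surviving features
print(rfe_selector.ranking_)   # 1 = kept; larger numbers were dropped earlier
```

Because importances are recomputed after every removal, a feature that only looked weak next to a correlated neighbour gets a chance to survive, which a single-pass threshold does not allow.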



In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]

fc = [commute_hours, weather_cluster, deep_features]

fs = [rfe]

models = [gbr]

pipeline_casero(PATH, preprocessing=pp, creation=fc, selection=fs, models=models)

## Recursive Feature Elimination with Basic Features


Since our best iteration before that had been basic features without deep feature synthesis, we wanted to try recursive feature elimination on just those features to see whether it would improve on the previous iteration's score. It did not, which demonstrates that some of the features produced by deep feature synthesis did turn out to be important for predicting the number of riders in an hour.


In [None]:
pp = [drop_registered, drop_casual, drop_date, drop_instant, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool, weather_sit_as_category,
      change_season, change_weather_sit]

fc = [commute_hours, weather_cluster]

fs = [rfe]

models = [gbr]

pipeline_casero(PATH, preprocessing=pp, creation=fc, selection=fs, models=models)