# Testing out Features and Models

This file contains several executions of our pipeline, each with a different combination of feature creation methods, feature selection methods, models, and so on. You can think of it as a kind of grid search cross-validation in which we test not only model parameters but also different combinations of features.
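To make the grid-search analogy concrete, the sketch below enumerates stage choices as a Cartesian product. It is illustrative only: the step names are plain strings standing in for the home-made functions, `enumerate_configs` is a hypothetical helper, and the notebook itself runs a hand-picked subset of combinations rather than the full product.

```python
from itertools import product

# Hypothetical option lists: each inner list is one choice for that stage.
creation_options = [[], ["commute_hours", "weather_cluster"], ["genetic_programming"]]
selection_options = [[], ["rfe"], ["tree_selection"]]


def enumerate_configs(creation_options, selection_options):
    """Return every (creation, selection) pairing as a list of dicts."""
    return [
        {"creation": fc, "selection": fs}
        for fc, fs in product(creation_options, selection_options)
    ]


configs = enumerate_configs(creation_options, selection_options)
for cfg in configs:
    print(cfg)
```

Each dictionary could then be unpacked into a call such as `pipeline_casero(data, preprocessing=pp, **cfg, models=models)`, assuming the step names were replaced by the actual functions.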

#### Warning: Re-running this notebook takes a while.

If you would like to retry a particular pipeline, run its cell individually, and be aware of how long it will take before doing so. Judging by the start timestamps in the recorded output, individual cells took from under a minute to nearly five minutes.

The best model of all of these is the one with no feature creation and only Recursive Feature Elimination applied as feature selection.
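For readers unfamiliar with Recursive Feature Elimination, here is a minimal sketch using scikit-learn's `RFE` on synthetic data. It is not the home-made `rfe` step used in this notebook, whose internals are not shown here; the data, the estimator, and `n_features_to_select=2` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Synthetic target: only columns 0 and 3 carry signal.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# RFE repeatedly fits the estimator and drops the weakest feature
# (smallest coefficient magnitude) until the requested number remain.
selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_).tolist()
print("Selected column indices:", selected)
print("Elimination ranking (1 = kept):", selector.ranking_.tolist())
```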

In [1]:
import warnings; warnings.simplefilter('ignore') # suppress warnings
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import joblib
import pandas as pd

import dask.dataframe as dd
from dask.distributed import Client  # Client() is used in the next cell

# Home-made modules
from utils import *
from preprocessing import *
from feature_creation import *
from feature_selection import *
from dim_reduction import *

In [2]:
client = Client()

In [3]:
data = dd.read_csv('https://gist.githubusercontent.com/catyselman/9353e4e480ddf2db44b44a79e14718b5/raw/ded23e586ca5db1b4a566b1e289acd12ebf69357/bikeshare_hourly.csv', blocksize=25e4)

In [4]:
data.dtypes

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object

In [5]:
# Scale the temp column by 41 and store the result as 'realtemp'
data['realtemp'] = data.temp * 41

In [6]:
pp = [drop_registered, drop_casual, drop_date, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool,
      weather_sit_as_category, categorize_columns]

lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED]}
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": { "random_state": [SEED] }
      }

models = [lm, rfr, gbr]
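The pipeline log further down mentions `GridSearchCV`, so one plausible way to consume these model dictionaries is to pass each `"model"` and `"params"` pair to a grid search. The sketch below is an assumption about how that could look, not the actual body of `pipeline_casero`; the `SEED` value, the synthetic data, the `r2` scoring, and `cv=3` are all chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

SEED = 42  # assumption: the real value comes from the home-made utils module

models = [
    {"name": "Linear Regression", "model": LinearRegression(), "params": {}},
    {"name": "Random Forest",
     "model": RandomForestRegressor(n_estimators=10),
     "params": {"random_state": [SEED]}},
]

# Small synthetic regression problem standing in for the bike-share data.
X, y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=SEED)

results = {}
for spec in models:
    # An empty params dict means a single candidate: the model's defaults.
    search = GridSearchCV(spec["model"], spec["params"], cv=3, scoring="r2")
    search.fit(X, y)
    results[spec["name"]] = search.best_score_
    print(f"{spec['name']}: {search.best_score_:.5f}")
```

Keeping `"params"` as a dictionary of lists means extra hyperparameter values can be added later without changing the loop.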


### Baseline - No feature engineering

In [7]:
best_model = pipeline_casero(data, preprocessing=pp, models=models)

Beginning pipeline at 2019-05-19 20:12:29.880424

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:12:30.040968, performed 12 steps
New Shape of data: 15

Performing feature creation...
Feature Creation completed at 2019-05-19 20:12:30.041641, performed 0 steps
New Shape of data: 15

Dummifying...
New Shape of data: 61

Index(['weathersit_1', 'weathersit_2', 'w
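The jump from 15 to 61 columns in the "Dummifying" step is consistent with one-hot encoding of the categorical variables. The following sketch shows the idea with `pandas.get_dummies` on a toy frame; whether the pipeline uses this exact call is an assumption, and the values are made up.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (values are invented).
df = pd.DataFrame({
    "weathersit": [1, 2, 3, 1],
    "temp": [0.24, 0.22, 0.30, 0.26],
})
df["weathersit"] = df["weathersit"].astype("category")

# Each category level becomes its own indicator column named <column>_<level>.
encoded = pd.get_dummies(df, columns=["weathersit"])
print(encoded.columns.tolist())
print("New shape of data:", encoded.shape[1])
```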

### Basic Feature Engineering

In [8]:
fc = [commute_hours, weather_cluster]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, models=models)

Beginning pipeline at 2019-05-19 20:13:13.187354

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:13:13.367958, performed 12 steps
New Shape of data: 15

Performing feature creation...
	Adding variable for Commute Hours, 1 for yes and 0 for false
	Adding clustering variable based on weather-related features...

Dummifying to create clusters...
New Shape of dat
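The log above describes the commute-hours feature as a 1/0 flag. A minimal sketch of such a flag follows; the specific hours and the restriction to working days are assumptions made for illustration, since the definition inside `commute_hours` is not shown in this notebook.

```python
import pandas as pd

# Assumption: commute windows are 7-9 and 17-19 on working days.
COMMUTE_HOURS = {7, 8, 9, 17, 18, 19}

df = pd.DataFrame({
    "hr": [3, 8, 12, 18],
    "workingday": [1, 1, 1, 0],
})

# Flag is 1 only when the hour is a commute hour AND it is a working day.
df["commute_hour"] = (
    df["hr"].isin(COMMUTE_HOURS) & (df["workingday"] == 1)
).astype(int)
print(df)
```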

### Predicting Casual and Registered Counts

In [9]:
fc = [prediction_forecasts]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, models=models)

Beginning pipeline at 2019-05-19 20:14:41.031636

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:14:41.240629, performed 12 steps
New Shape of data: 15

Performing feature creation...
	Adding casual_forecast variable...
Dummifying...
New Shape of data: 60

		Predictions for GridSearchCV on casual: 0.46363 +/- 0.13544
	Adding registered_forecast variable...
Du
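Since `casual` and `registered` are dropped because they are unavailable at prediction time, the forecast features above replace them with model predictions. One way to build such features without leaking the true values is to use out-of-fold predictions. The sketch below assumes that approach; it is not necessarily what `prediction_forecasts` does, and the data and column relationships are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"temp": rng.uniform(0, 1, n), "hum": rng.uniform(0, 1, n)})
# Synthetic sub-counts; cnt is their sum, as in the bike-share data.
df["casual"] = 50 * df["temp"] + rng.normal(0, 2, n)
df["registered"] = 120 * (1 - df["hum"]) + rng.normal(0, 5, n)
df["cnt"] = df["casual"] + df["registered"]

features = ["temp", "hum"]
for target in ["casual", "registered"]:
    # Each row's forecast comes from a model that never saw that row.
    df[f"{target}_forecast"] = cross_val_predict(
        LinearRegression(), df[features], df[target], cv=5
    )

# The true sub-counts are removed; only their forecasts remain as features.
model_input = df.drop(columns=["casual", "registered"])
print(model_input.columns.tolist())
```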

### Basic Feature Engineering + Predicting Casual and Registered Counts

In [10]:
fc = [commute_hours, weather_cluster, prediction_forecasts]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, models=models)

Beginning pipeline at 2019-05-19 20:15:45.507222

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:15:45.688369, performed 12 steps
New Shape of data: 15

Performing feature creation...
	Adding variable for Commute Hours, 1 for yes and 0 for false
	Adding clustering variable based on weather-related features...

Dummifying to create clusters...
New Shape of dat

### Genetic Programming

In [11]:
fc = [genetic_programming]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, models=models)

Beginning pipeline at 2019-05-19 20:17:57.102994

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:17:57.265797, performed 12 steps
New Shape of data: 15

Performing feature creation...
Creating features through genetic programming...
    |    Population Average   |             Best Individual              |
---- ------------------------- ----------------------

### Basic Features + Tree Selection

In [12]:
fc = [commute_hours, weather_cluster]
fs = [tree_selection]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, selection = fs, models=models)

Beginning pipeline at 2019-05-19 20:22:41.352536

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:22:41.531490, performed 12 steps
New Shape of data: 15

Performing feature creation...
	Adding variable for Commute Hours, 1 for yes and 0 for false
	Adding clustering variable based on weather-related features...

Dummifying to create clusters...
New Shape of dat

### Genetic Programming + Tree Selection

In [13]:
fc = [genetic_programming]
fs = [tree_selection]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, selection = fs, models=models)

Beginning pipeline at 2019-05-19 20:23:55.777961

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:23:55.974568, performed 12 steps
New Shape of data: 15

Performing feature creation...
Creating features through genetic programming...
    |    Population Average   |             Best Individual              |
---- ------------------------- ----------------------

### Commute Hours + Recursive Feature Elimination

In [14]:
fc = [commute_hours]
fs = [rfe]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, selection = fs, models=models)

Beginning pipeline at 2019-05-19 20:28:23.016797

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:28:23.180011, performed 12 steps
New Shape of data: 15

Performing feature creation...
	Adding variable for Commute Hours, 1 for yes and 0 for false
Feature Creation completed at 2019-05-19 20:28:23.194688, performed 1 steps
New Shape of data: 16

Dummifying...
Ne

### Recursive Feature Elimination

In [15]:
fc = []
fs = [rfe]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, selection = fs, models=models)

Beginning pipeline at 2019-05-19 20:29:41.351123

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:29:41.551222, performed 12 steps
New Shape of data: 15

Performing feature creation...
Feature Creation completed at 2019-05-19 20:29:41.553177, performed 0 steps
New Shape of data: 15

Dummifying...
New Shape of data: 61

Index(['weathersit_1', 'weathersit_2', 'w

### Tree Selection

In [16]:
fc = []
fs = [tree_selection]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, selection = fs, models=models)

Beginning pipeline at 2019-05-19 20:30:34.235942

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:30:34.406962, performed 12 steps
New Shape of data: 15

Performing feature creation...
Feature Creation completed at 2019-05-19 20:30:34.409127, performed 0 steps
New Shape of data: 15

Dummifying...
New Shape of data: 61

Index(['weathersit_1', 'weathersit_2', 'w

### Deep Feature Synthesis

This is the only feature engineering step that doesn't use dask. The `deep_features` function converts the dataset into a pandas DataFrame for processing. In this run the synthesized features are then passed through PCA as a dimensionality reduction step.

In [17]:
fc = [deep_features]
fs = []
dr = [pca]
best_model = pipeline_casero(data, preprocessing=pp, creation=fc, selection = fs, reduction = dr, models=models)

Beginning pipeline at 2019-05-19 20:31:08.069966

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the causual variable since we won't have this info
	Dropping the date variable since this information isencoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:31:08.262383, performed 12 steps
New Shape of data: 15

Performing feature creation...
	Performing Deep Feature Synthesis...
		Created feature seasons.STD(bikeshare_hourly.holiday)
		Created feature seasons.STD(bikeshare_hourly.workingday)
		Created feature season
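The feature names in the log read like per-group aggregations, for example the standard deviation of `holiday` within each season. Assuming that reading is correct, an equivalent feature can be sketched in plain pandas as below; this is not the `deep_features` implementation, and the toy values and the new column name are illustrative.

```python
import pandas as pd

# Toy rows: two seasons, with a holiday flag per row (values are invented).
df = pd.DataFrame({
    "season": [1, 1, 2, 2],
    "holiday": [0, 1, 0, 0],
})

# Broadcast each season's sample standard deviation of holiday to its rows.
df["seasons.STD(holiday)"] = df.groupby("season")["holiday"].transform("std")
print(df)
```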