# Final Model

This notebook takes a closer look at the best-performing model and fine-tunes its hyperparameters with grid search.
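As a point of reference, grid search evaluates every combination in a parameter grid with cross-validation and keeps the best-scoring one. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data. The data and grid values are illustrative assumptions, not the ones used by this notebook's pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: the notebook itself uses the bike-share set)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Small illustrative grid: 2 x 2 = 4 candidate combinations
param_grid = {
    "n_estimators": [10, 20],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,           # 3-fold cross-validation for every candidate
    scoring="r2",
)
search.fit(X, y)

best_params = search.best_params_  # one chosen value per grid key
best_score = search.best_score_    # mean cross-validated R^2 of the best candidate
```

The number of fits grows multiplicatively with each added parameter, which is why the grids further down are kept fairly small.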

In [1]:
import warnings
warnings.simplefilter('ignore')  # suppress warnings
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client  # explicit import for the Client() call below


# Home-made modules
from utils import *
from preprocessing import *
from feature_creation import *
from feature_selection import *
from dim_reduction import *

In [2]:
client = Client()

In [3]:
data = dd.read_csv('https://gist.githubusercontent.com/catyselman/9353e4e480ddf2db44b44a79e14718b5/raw/ded23e586ca5db1b4a566b1e289acd12ebf69357/bikeshare_hourly.csv', blocksize=25e4)

In [4]:
# Rescale the normalized temperature (assumes temp was divided by 41)
data['realtemp'] = data.temp * 41

In [5]:
pp = [drop_registered, drop_casual, drop_date, year_as_bool,
      season_as_category, month_as_category, weekday_as_category,
      hour_as_category, holiday_as_bool, working_day_as_bool,
      weather_sit_as_category, categorize_columns]

fc = []

fs = [rfe]

lm = {"name": "Linear Regression",
      "model": LinearRegression(),
      "params": {}
     }

rfr = {"name": "Random Forest",
       "model": RandomForestRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": list(range(10, 150, 10)),
                  # 'auto' was equivalent to None for regressors and was removed in scikit-learn 1.3
                  "max_features": [None, 5],
                  "min_samples_leaf": [1, 3]
                 }
      }

gbr = {"name": "Gradient Boosting",
       "model": GradientBoostingRegressor(),
       "params": {"random_state": [SEED],
                  "n_estimators": [50, 80],
                  "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2],
                  "min_samples_leaf": [1, 3]
                 }
      }

models = [lm, rfr, gbr]

best_model = pipeline_casero(data, preprocessing=pp, creation=fc, selection=fs, models=models)

Beginning pipeline at 2019-05-19 20:35:58.673010

Performing preprocessing steps...
	Dropping the registered variable since we won't have this info
	Dropping the casual variable since we won't have this info
	Dropping the date variable since this information is encoded in other variables
	Converting year to a boolean variable...
	Converting season to a categorical variable...
	Converting month to a categorical variable...
	Converting day of week to a categorical variable...
	Converting hour of day to a categorical variable...
	Converting holiday or not to a boolean variable...
	Converting holiday or not to a boolean variable...
	Converting weather situation to a categorical variable...
Preprocessing completed at 2019-05-19 20:35:58.844027, performed 12 steps
New Shape of data: 15

Performing feature creation...
Feature Creation completed at 2019-05-19 20:35:58.845180, performed 0 steps
New Shape of data: 15

Dummifying...
New Shape of data: 61

Index(['weekday_0', 'weekday_1', 'weekday

### Final model of RandomForestRegressor with score of 84.49% was found on holdout after fine-tuning the model.
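The internals of `pipeline_casero` are not shown here, so the following is only a sketch of how a list of model dictionaries like `lm`, `rfr`, and `gbr` could be tuned and then compared on a holdout split. `tune_and_select` is a hypothetical helper, not the pipeline itself, and the synthetic data, grid values, and 80/20 split are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split


def tune_and_select(X, y, models, seed=0):
    """Grid-search each model dict, then rank the tuned models on a holdout set.

    Hypothetical helper: it mirrors the {"name", "model", "params"} layout
    used in this notebook but is not the author's pipeline.
    """
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    results = []
    for spec in models:
        # An empty grid ({}) yields a single candidate with default parameters
        search = GridSearchCV(spec["model"], spec["params"], cv=3, scoring="r2")
        search.fit(X_train, y_train)
        holdout_r2 = search.best_estimator_.score(X_hold, y_hold)
        results.append((spec["name"], holdout_r2, search.best_params_))
    # Highest holdout R^2 first
    return sorted(results, key=lambda r: r[1], reverse=True)


# Synthetic stand-in data and a reduced model list (assumptions for illustration)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
models = [
    {"name": "Linear Regression", "model": LinearRegression(), "params": {}},
    {"name": "Random Forest", "model": RandomForestRegressor(),
     "params": {"random_state": [0], "n_estimators": [10, 20]}},
]

ranking = tune_and_select(X, y, models)
best_name, best_r2, best_params = ranking[0]
```

Scoring the tuned models on data that the grid search never saw keeps the final comparison from being biased by the cross-validation folds used for tuning.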