# Block 6 Exercise 2: finding the best parameters for predicting the fare of taxi rides
We return to our Random Forest Regression and want to automatically optimize all free parameters ...

In [1]:
import pandas as pd
import numpy as np
import folium

In [2]:
# we load the data we have saved after wrangling and pre-processing in block I
X=pd.read_csv('../../DATA/train_cleaned.csv')
drop_columns=[
    'Unnamed: 0','Unnamed: 0.1','Unnamed: 0.1.1','key',
    'pickup_datetime','pickup_date',
    'pickup_latitude_round3','pickup_longitude_round3',
    'dropoff_latitude_round3','dropoff_longitude_round3',
]
X=X.drop(drop_columns,axis=1)
X=pd.get_dummies(X)  # one-hot encode the categorical columns
# separate the label (fare_amount) from the features
y=X['fare_amount']
X=X.drop(['fare_amount'],axis=1)

### Scikit Optimize
Scikit Optimize (https://scikit-optimize.github.io/stable/index.html) is an AutoML toolbox built around Scikit-Learn. It lets us run state-of-the-art automatic hyper-parameter optimization on top of our learning algorithms.



In [None]:
# install 
#!pip install scikit-optimize

### E 2.1 Bayesian Optimization of a Random Forest Regression Model
Use Bayesian Optimization with Cross-Validation (https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html#skopt.BayesSearchCV) to find the best regression model. Compare
* linear regression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) 
* Random Forest regression (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
* and SVM regression (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)

NOTES: this can become quite compute intensive! Hence,
* use a smaller subset of the training data to run the experiments 
* think about the range of your parameters (e.g. a large number of trees in RF or high C-values in SVM will make models expensive to train)
* optimize only the following parameters per model type:
    * linear: no parameters to optimize
    * RF: #trees and depth
    * SVM: C and gamma (use RBF kernel)
* parallelize -> n_jobs
* use Colab to run the job for up to 12h
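The first note (use a smaller subset) can be sketched as follows. This is a minimal illustration on synthetic data, not the taxi data: the names `X_demo`, `y_demo`, `X_small`, `y_small` and the subset size `n_sub` are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature table and the fare labels (hypothetical data).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(10_000, 4)), columns=["f0", "f1", "f2", "f3"])
y_demo = (2.0 * X_demo["f0"] + rng.normal(scale=0.1, size=len(X_demo))).rename("fare_amount")

# Draw a random subset of rows; a fixed random_state keeps the draw reproducible.
n_sub = 2_000  # assumed size, pick whatever fits your compute budget
X_small = X_demo.sample(n=n_sub, random_state=42)

# Select the matching labels through the sampled index so rows stay aligned.
y_small = y_demo.loc[X_small.index]

print(X_small.shape, y_small.shape)
```

The hyper-parameter search would then be fitted on `X_small` and `y_small` instead of the full training set, which shortens every cross-validation fit.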


In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=42)

In [4]:
estimators = [('linear_reg', LinearRegression())]
pipe = Pipeline(estimators)
pipe[0].set_params(n_jobs=-1)
pipe.fit(X_train, y_train)
print('train score: {}'.format(pipe[0].score(X_train, y_train)))
print('test score: {}'.format(pipe[0].score(X_test, y_test)))

train score: 0.7354513730762149
test score: 0.7164914829663074


In [5]:
opt_rbf = BayesSearchCV(
        RandomForestRegressor(),
        {
            'n_estimators': Integer(100, 200),
            'max_depth': Integer(5, 25)
        },
        n_iter=32,
        random_state=42,
        n_jobs=-1
    )

_ = opt_rbf.fit(X_train, y_train)
opt_rbf.best_estimator_



RandomForestRegressor(max_depth=23, n_estimators=200)

In [6]:
print('Random Forest Regressor train score: {}'.format(opt_rbf.score(X_train, y_train)))
print('Random Forest Regressor test score: {}'.format(opt_rbf.score(X_test, y_test)))

Random Forest Regressor train score: 0.9679813070825207
Random Forest Regressor test score: 0.8085650912477428


**Result: The model generalizes poorly. The test score (0.81) is significantly lower than the training score (0.97), which indicates overfitting.**
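The gap between training and test score can be measured directly. The sketch below does this for two tree depths on synthetic data from `make_regression`; the data set, the depths and the variable names are assumptions for illustration and the numbers it prints are unrelated to the taxi results above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy regression problem (hypothetical stand-in for the taxi data).
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.5, random_state=42)

gaps = {}
for depth in (3, 25):
    rf = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=42)
    rf.fit(X_tr, y_tr)
    train_r2 = rf.score(X_tr, y_tr)  # R^2 on the data the forest was fitted on
    test_r2 = rf.score(X_te, y_te)   # R^2 on held-out data
    gaps[depth] = train_r2 - test_r2
    print('max_depth={}: train {:.3f}, test {:.3f}, gap {:.3f}'.format(
        depth, train_r2, test_r2, gaps[depth]))
```

A large positive gap is the same symptom as in the result above: the forest fits the training rows much better than unseen rows.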

In [None]:
opt_svm = BayesSearchCV(
        SVR(),
        {
            'C': Integer(190, 200),
            'gamma': Categorical(['scale', 'auto']),
            'kernel': Categorical(['rbf']),
        },
        n_iter=32,
        random_state=42,
        n_jobs=-1
    )

_ = opt_svm.fit(X_train, y_train)
opt_svm.best_estimator_

In [None]:
print('Support Vector Regression train score: {}'.format(opt_svm.score(X_train, y_train)))
print('Support Vector Regression test score: {}'.format(opt_svm.score(X_test, y_test)))