# Tuning XGBoost
***
In this notebook, let's build a gradient boosting regressor to predict the duration of a NYC taxi trip from the **cleaned data** of taxi trips from 2016. The original data can be found in [Kaggle's competition](https://www.kaggle.com/c/nyc-taxi-trip-duration/data). The data wrangling, exploratory data analysis, and data cleaning can be found in [this](https://github.com/emmpew/datascience/blob/master/capstone_project/nyc_trip_duration.ipynb) notebook.

### Load Data
***
Load the cleaned data that is ready for modeling.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

In [2]:
train = pd.read_csv('cleaned_train_data.csv')
test = pd.read_csv('cleaned_test_data.csv')

In [3]:
train.drop('Unnamed: 0',axis=1,inplace=True)
test.drop('Unnamed: 0',axis=1,inplace=True)

In [4]:
X_train = train.drop('trip_duration',axis=1)
y_train = train['trip_duration']

X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.30, random_state=102)

### Parameters
***

XGBoost has more than 20 parameters to tune. Tuning more parameters usually gives better results; however, it becomes computationally expensive and time-consuming. For this project, let's focus on the following 5 parameters:

- n_estimators
- max_depth
- min_child_weight
- subsample
- colsample_bytree

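To validate the model, let's use cross-validation so that the performance estimate is more reliable than one taken from a single split. The metric of interest is the [Root Mean Squared Logarithmic Error](https://www.kaggle.com/wiki/RootMeanSquaredLogarithmicError) (RMSLE), which is computed below on the held-out 30% of the data; the cross-validated "best score" that `GridSearchCV` prints is the regressor's default score (R²), because no `scoring` argument is passed. Due to the large amount of data (~1.4 million entries), let's split the data into only 5 folds. This means that cross-validation will be roughly 5x as expensive as fitting a single model.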
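
As a rough illustration of where the "5x as expensive" figure comes from, the sketch below splits a small index range into 5 folds by hand. It assumes contiguous, unshuffled folds, and `kfold_indices` is a hypothetical helper written for this example rather than something scikit-learn provides.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=20, n_folds=5))
for train_idx, val_idx in splits:
    # Every row is used for validation exactly once and for training 4 times.
    print(len(train_idx), len(val_idx))

# One fit per fold and per candidate value: 3 candidates x 5 folds.
n_candidates = 3
n_fits = n_candidates * len(splits)
print(n_fits)  # 15
```

Each grid search in this notebook tries 3 candidate values, so it needs 15 fits, which matches the "totalling 15 fits" line in the logs.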

In [5]:
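def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between true and predicted values."""
    assert len(y_true) == len(y_pred)
    # np.log1p(x) computes log(1 + x)
    return np.sqrt(np.mean(np.square(np.log1p(y_pred) - np.log1p(y_true))))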
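
To get a feel for what RMSLE measures, here is a small worked example. The trip durations are made up for illustration, and `rmsle_log1p` is just a standalone restatement of the metric so the snippet runs on its own.

```python
import numpy as np

def rmsle_log1p(y_true, y_pred):
    """RMSLE written with np.log1p, which computes log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Two hypothetical trips (in seconds); both predictions are twice the true value.
short_trip = rmsle_log1p([100], [200])
long_trip = rmsle_log1p([1000], [2000])
print(short_trip, long_trip)  # both are close to log(2), about 0.69
```

The absolute errors differ by a factor of ten (100 s versus 1000 s), yet the two scores are nearly identical: RMSLE penalizes relative error, which suits trip durations that span several orders of magnitude.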

In [6]:
def hyperparameter_tunning(params):
    """Grid-search `params` with 5-fold CV, then report RMSLE on the held-out split."""
    xgboost_cv = GridSearchCV(XGBRegressor(), params, cv=5, verbose=3)
    %time xgboost_cv.fit(X_train, y_train)

    # GridSearchCV refits the best candidate on all of X_train, so predict() uses that model
    y_pred = xgboost_cv.predict(X_test)
    metric = rmsle(y_test, y_pred)

    print("Best Parameters: {}".format(xgboost_cv.best_params_))
    print("Best score is {}".format(xgboost_cv.best_score_))
    print("RMSLE is {}".format(metric))
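
The function above calls `GridSearchCV` without a `scoring` argument, so candidates are ranked by the regressor's default score (R²) rather than by RMSLE. If one wanted the search itself to optimize RMSLE, a possible sketch looks like the following. It is not what this notebook ran: it uses a `DecisionTreeRegressor` as a lightweight stand-in for `XGBRegressor`, synthetic data, and a clipping step that is an assumption of this example.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor  # stand-in for XGBRegressor

def rmsle(y_true, y_pred):
    # Clip at zero so the logarithm stays defined (an assumption of this sketch).
    y_pred = np.clip(y_pred, 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# greater_is_better=False makes scikit-learn negate the metric,
# so that "higher score is better" still holds inside the search.
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# Synthetic, strictly positive target, purely for illustration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 50 + 30 * X_demo[:, 0] + rng.uniform(0, 5, size=200)

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {'max_depth': [2, 5]},
                      scoring=rmsle_scorer, cv=5)
search.fit(X_demo, y_demo)

# best_score_ holds the negated RMSLE, so flip the sign when reporting it.
print(search.best_params_, -search.best_score_)
```

With this setup the cross-validated score and the held-out metric would be on the same scale, which makes the two easier to compare.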

### n_estimators

This parameter sets the number of boosted trees to fit. Each weak learner (tree) is fit to correct the errors of the ones before it, so adding trees lets the model learn more. However, past a certain point additional trees stop improving the model and mostly fit noise. Also, the higher n_estimators is, the more computationally expensive training becomes. So for this case, let's try just 3 different values for the parameter.
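
To make the idea of each tree correcting the previous ones concrete, here is a minimal sketch of boosting with one-split trees (stumps) in NumPy. It is an illustration under simplifying assumptions, not XGBoost's actual algorithm: there is no regularization, the data are synthetic, and `fit_stump`, `predict_stump`, and `learning_rate = 0.3` are hypothetical choices made for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that minimizes the squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant prediction
train_mse = []
for _ in range(50):
    stump = fit_stump(x, y - prediction)  # each stump is fit to the current residuals
    prediction += learning_rate * predict_stump(stump, x)
    train_mse.append(np.mean((y - prediction) ** 2))

# Training error after 10, 30 and 50 rounds
print(train_mse[9], train_mse[29], train_mse[49])
```

The training error never increases as rounds are added, which is why it cannot be used to choose n_estimators; a cross-validated score is what reveals the point where extra trees stop helping.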

In [7]:
params = {'n_estimators': [10,30,50]}

hyperparameter_tunning(params)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] n_estimators=10 .................................................
[CV] ........ n_estimators=10, score=0.3727455669785603, total=   8.2s
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.6s remaining:    0.0s


[CV] ....... n_estimators=10, score=0.37256353522029295, total=   8.2s
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   17.2s remaining:    0.0s


[CV] ........ n_estimators=10, score=0.3680092828338972, total=   7.9s
[CV] n_estimators=10 .................................................
[CV] ....... n_estimators=10, score=0.37146510068102323, total=   8.2s
[CV] n_estimators=10 .................................................
[CV] ....... n_estimators=10, score=0.37314683753481626, total=   8.1s
[CV] n_estimators=30 .................................................
[CV] ........ n_estimators=30, score=0.6905727837114519, total=  22.3s
[CV] n_estimators=30 .................................................
[CV] ........ n_estimators=30, score=0.6942748936703185, total=  22.4s
[CV] n_estimators=30 .................................................
[CV] ........ n_estimators=30, score=0.6788522645289554, total=  22.8s
[CV] n_estimators=30 .................................................
[CV] ........ n_estimators=30, score=0.6921605492682087, total=  22.3s
[CV] n_estimators=30 .................................................
[CV] .

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  5.8min finished


CPU times: user 6min 31s, sys: 3.75 s, total: 6min 35s
Wall time: 6min 35s




Best Parameters: {'n_estimators': 50}
Best score is 0.7159462981176532
RMSLE is 0.44678453304116034


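The results show that, of the values tried, 50 n_estimators gives the best result. Since 50 is also the largest value tried and the score was still rising, larger values might do better at a higher computational cost.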

### max_depth

This parameter is the maximum depth of a tree, that is, the maximum number of splits on the path from the root to the farthest leaf. Deeper trees can model more complex relationships; however, at some point the extra splits become less relevant and the model starts to overfit. Let's use the best parameters found so far plus 3 different max_depth values to tune the regressor.
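
A quick piece of arithmetic shows why depth is such a strong lever: a binary tree of depth d can have at most 2^d leaves. The sketch below only counts that upper bound for the three values tried here; actual trees are often smaller, because splits that do not help are not made.

```python
# Upper bound on the number of leaves of a binary tree for each depth tried
max_leaves = {max_depth: 2 ** max_depth for max_depth in [2, 5, 8]}
print(max_leaves)  # {2: 4, 5: 32, 8: 256}
```

Going from depth 2 to depth 8 raises the ceiling from 4 to 256 distinct predictions per tree, which gives a sense of both the added flexibility and the added training time.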

In [8]:
params = {'max_depth': [2,5,8],
          'n_estimators': [50]}

hyperparameter_tunning(params)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] max_depth=2, n_estimators=50 ....................................
[CV]  max_depth=2, n_estimators=50, score=0.690850157154514, total=  24.8s
[CV] max_depth=2, n_estimators=50 ....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   25.6s remaining:    0.0s


[CV]  max_depth=2, n_estimators=50, score=0.6944858128369444, total=  24.6s
[CV] max_depth=2, n_estimators=50 ....................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   51.1s remaining:    0.0s


[CV]  max_depth=2, n_estimators=50, score=0.6790032257249099, total=  25.0s
[CV] max_depth=2, n_estimators=50 ....................................
[CV]  max_depth=2, n_estimators=50, score=0.6931945834340137, total=  24.9s
[CV] max_depth=2, n_estimators=50 ....................................
[CV]  max_depth=2, n_estimators=50, score=0.6907898362754823, total=  24.7s
[CV] max_depth=5, n_estimators=50 ....................................
[CV]  max_depth=5, n_estimators=50, score=0.7527919516031409, total= 1.0min
[CV] max_depth=5, n_estimators=50 ....................................
[CV]  max_depth=5, n_estimators=50, score=0.7570808032205429, total= 1.0min
[CV] max_depth=5, n_estimators=50 ....................................
[CV]  max_depth=5, n_estimators=50, score=0.7406796889876259, total= 1.0min
[CV] max_depth=5, n_estimators=50 ....................................
[CV]  max_depth=5, n_estimators=50, score=0.7551514344485716, total= 1.0min
[CV] max_depth=5, n_estimators=50 ........

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed: 16.6min finished


CPU times: user 18min 44s, sys: 4.39 s, total: 18min 49s
Wall time: 18min 49s
Best Parameters: {'max_depth': 8, 'n_estimators': 50}
Best score is 0.7875367106969244
RMSLE is 0.38458887312586937




The results show that with 50 n_estimators the best max_depth is 8. After tuning only two parameters, the RMSLE dropped from 0.4468 to 0.3845.

### min_child_weight

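This parameter sets the minimum sum of instance weights (hessian) required in a child node. For a squared-error objective with unweighted rows, this is the same as the minimum number of samples a child must contain. If the value is lower, the algorithm can create children that correspond to fewer samples, which makes the trees more prone to overfitting. Let's use the best parameters found so far plus 3 different min_child_weight values to tune the regressor.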
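
The check this parameter implies can be sketched in a few lines. The example assumes a squared-error loss, where every row contributes a hessian of 1, and uses a made-up split of 20 rows; `split_allowed` is a hypothetical helper, not an XGBoost function.

```python
import numpy as np

def split_allowed(hessians, goes_left, min_child_weight):
    """Return True when both children keep at least min_child_weight of hessian mass."""
    left_weight = hessians[goes_left].sum()
    right_weight = hessians[~goes_left].sum()
    return bool(left_weight >= min_child_weight and right_weight >= min_child_weight)

# With squared-error loss the hessian is 1 per row, so a child's weight is its row count.
hessians = np.ones(20)
goes_left = np.arange(20) < 7  # hypothetical split: 7 rows go left, 13 go right

decisions = {w: split_allowed(hessians, goes_left, w) for w in [5, 10, 15]}
print(decisions)  # {5: True, 10: False, 15: False}
```

Raising min_child_weight from 5 to 10 rejects this split because the left child would hold only 7 rows, which is how larger values lead to more conservative trees.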

In [10]:
params = {'min_child_weight': [5,10,15],
          'max_depth': [8],
          'n_estimators': [50]}

hyperparameter_tunning(params)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] max_depth=8, min_child_weight=5, n_estimators=50 ................
[CV]  max_depth=8, min_child_weight=5, n_estimators=50, score=0.7894004515262651, total= 1.9min
[CV] max_depth=8, min_child_weight=5, n_estimators=50 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.9min remaining:    0.0s


[CV]  max_depth=8, min_child_weight=5, n_estimators=50, score=0.7931386224969376, total= 1.9min
[CV] max_depth=8, min_child_weight=5, n_estimators=50 ................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.9min remaining:    0.0s


[CV]  max_depth=8, min_child_weight=5, n_estimators=50, score=0.7754018449843321, total= 1.9min
[CV] max_depth=8, min_child_weight=5, n_estimators=50 ................
[CV]  max_depth=8, min_child_weight=5, n_estimators=50, score=0.7937147333142706, total= 2.0min
[CV] max_depth=8, min_child_weight=5, n_estimators=50 ................
[CV]  max_depth=8, min_child_weight=5, n_estimators=50, score=0.787488377372252, total= 2.1min
[CV] max_depth=8, min_child_weight=10, n_estimators=50 ...............
[CV]  max_depth=8, min_child_weight=10, n_estimators=50, score=0.789561872871535, total= 1.9min
[CV] max_depth=8, min_child_weight=10, n_estimators=50 ...............
[CV]  max_depth=8, min_child_weight=10, n_estimators=50, score=0.7935840519014963, total= 2.0min
[CV] max_depth=8, min_child_weight=10, n_estimators=50 ...............
[CV]  max_depth=8, min_child_weight=10, n_estimators=50, score=0.7760214244924619, total= 2.1min
[CV] max_depth=8, min_child_weight=10, n_estimators=50 .............

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed: 31.4min finished


CPU times: user 33min, sys: 15.2 s, total: 33min 16s
Wall time: 33min 48s
Best Parameters: {'max_depth': 8, 'min_child_weight': 10, 'n_estimators': 50}
Best score is 0.7881835814256996
RMSLE is 0.38403234174560336




With 50 n_estimators and a max_depth of 8, the best min_child_weight is 10. The improvement in RMSLE was only about **0.0005**, bringing it to 0.3840.

### subsample

This parameter determines the fraction of entries (rows) sampled at each boosting step. Let's use the best parameters found so far plus 3 different subsample values to tune the regressor.
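
As a rough picture of what row subsampling means, the sketch below draws a random subset of row indices for each of the three values tried. The row count and the use of `rng.choice` are assumptions of this example, not a description of XGBoost's internal sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000  # hypothetical number of training rows

rows_per_tree = {}
for subsample in [0.4, 0.6, 0.8]:
    n_keep = round(subsample * n_rows)
    # Draw row indices without replacement; a fresh draw would be made for every tree.
    rows = rng.choice(n_rows, size=n_keep, replace=False)
    rows_per_tree[subsample] = len(rows)

print(rows_per_tree)  # {0.4: 400, 0.6: 600, 0.8: 800}
```

Each tree then sees only part of the data, which adds randomness that can reduce overfitting but also gives every tree less information to learn from.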

In [12]:
params = {'subsample': [0.4,0.6,0.8],
          'min_child_weight': [10],
          'max_depth': [8],
          'n_estimators': [50]}

hyperparameter_tunning(params)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4 
[CV]  max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4, score=0.7890842694634209, total= 1.8min
[CV] max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.9min remaining:    0.0s


[CV]  max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4, score=0.793275202285739, total= 1.8min
[CV] max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.7min remaining:    0.0s


[CV]  max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4, score=0.7740062049358123, total= 1.8min
[CV] max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4 
[CV]  max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4, score=0.7916775420192559, total= 1.8min
[CV] max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4 
[CV]  max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.4, score=0.787144919877181, total= 1.7min
[CV] max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.6 
[CV]  max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.6, score=0.7887579594180926, total= 1.7min
[CV] max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.6 
[CV]  max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.6, score=0.7926183299878203, total= 1.8min
[CV] max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.6 
[CV]  max_depth=8, min_child_weight=10, n_estimators=50, subsample=0.6, score=0.776834

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed: 28.4min finished


CPU times: user 30min 6s, sys: 15 s, total: 30min 21s
Wall time: 30min 44s
Best Parameters: {'max_depth': 8, 'min_child_weight': 10, 'n_estimators': 50, 'subsample': 0.8}
Best score is 0.7877378115851132
RMSLE is 0.38542680033216076




The RMSLE went up (0.3854 versus 0.3840), which means that the default value (1.0) for subsample remains the best option among the values tried.

### colsample_bytree

This parameter sets the fraction of features (columns) that is randomly sampled when building each tree. Remember that some features are more useful than others, so a tree that does not get to see a useful feature may be weaker.
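
The same idea applied to columns can be sketched as follows. The ten feature names are placeholders, since the real column names are not listed in this notebook, and `sample_columns` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(10)]  # placeholder column names

def sample_columns(colsample_bytree):
    """Pick the subset of columns that one tree would be allowed to split on."""
    n_keep = max(1, round(colsample_bytree * len(feature_names)))
    chosen = rng.choice(len(feature_names), size=n_keep, replace=False)
    return sorted(feature_names[i] for i in chosen)

columns_per_tree = {value: len(sample_columns(value)) for value in [0.8, 0.9, 1.0]}
print(columns_per_tree)  # {0.8: 8, 0.9: 9, 1.0: 10}
```

With a value of 1.0 every tree can use every column, so no useful feature is ever hidden from a tree.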

In [13]:
params = {'colsample_bytree': [0.8,0.9,1.0],
          'min_child_weight': [10],
          'max_depth': [8],
          'n_estimators': [50]}

hyperparameter_tunning(params)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50 
[CV]  colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50, score=0.7865058686107766, total= 1.4min
[CV] colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.5min remaining:    0.0s


[CV]  colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50, score=0.791446645043126, total= 1.4min
[CV] colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.0min remaining:    0.0s


[CV]  colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50, score=0.7739798142219648, total= 2.0min
[CV] colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50 
[CV]  colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50, score=0.7900780408808624, total= 1.5min
[CV] colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50 
[CV]  colsample_bytree=0.8, max_depth=8, min_child_weight=10, n_estimators=50, score=0.7855123433603144, total= 1.5min
[CV] colsample_bytree=0.9, max_depth=8, min_child_weight=10, n_estimators=50 
[CV]  colsample_bytree=0.9, max_depth=8, min_child_weight=10, n_estimators=50, score=0.78851208153541, total= 1.6min
[CV] colsample_bytree=0.9, max_depth=8, min_child_weight=10, n_estimators=50 
[CV]  colsample_bytree=0.9, max_depth=8, min_child_weight=10, n_estimators=50, score=0.7928138230903431, total= 1.6min
[CV] colsample_bytree=0.9, max_depth=8, min_child_weight=10, n_estimators=50 
[CV]  colsample_b

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed: 26.6min finished


CPU times: user 28min 18s, sys: 15.7 s, total: 28min 34s
Wall time: 29min 1s
Best Parameters: {'colsample_bytree': 1.0, 'max_depth': 8, 'min_child_weight': 10, 'n_estimators': 50}
Best score is 0.7881835814256996
RMSLE is 0.38403234174560336




The best value for colsample_bytree was the default 1.0. The RMSLE remained at **0.3840**, making it the best result obtained by tuning only 5 parameters with 5 folds.


## Results
***
After about 2 hours of tuning (the wall times of the five searches add up to roughly 119 minutes), only 5 hyperparameters were explored. XGBoost is a powerful ensemble model; however, it is computationally expensive. The best hyperparameter values found were the following:

- n_estimators: 50
- max_depth: 8
- min_child_weight: 10
- subsample: 1.0 (default value)
- colsample_bytree: 1.0 (default value)

The best RMSLE was **0.3840**. There is more that this powerful algorithm can do, but for this project this was a good result.
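
The one-parameter-at-a-time strategy used in this notebook trades thoroughness for speed. A small count, using the candidate values from the five searches above, shows how many model fits it needed compared with a single joint grid over all combinations. The joint grid is shown only for comparison and was not run here.

```python
from itertools import product

# Candidate values tried in each of the five searches
stages = {
    'n_estimators': [10, 30, 50],
    'max_depth': [2, 5, 8],
    'min_child_weight': [5, 10, 15],
    'subsample': [0.4, 0.6, 0.8],
    'colsample_bytree': [0.8, 0.9, 1.0],
}
cv_folds = 5

# One search per parameter: 3 candidates x 5 folds, five times over
sequential_fits = sum(len(values) * cv_folds for values in stages.values())

# One joint search: every combination of values x 5 folds
joint_fits = len(list(product(*stages.values()))) * cv_folds

print(sequential_fits, joint_fits)  # 75 1215
```

The sequential approach needs about 16 times fewer fits, at the price of possibly missing combinations where parameters interact, for example a lower subsample that only pays off with more trees.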