
#**GridSearchCV and RandomizedSearchCV**

This notebook uses the Boston house price dataset to evaluate a Random Forest regression model and to tune its hyperparameters, the number of estimators and the maximum depth of the trees, to find the best values.

First, load the Boston data and split it into training and test sets.

In [None]:
import numpy as np 
import pandas as pd


In [None]:
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

In [None]:
#splitting data into train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data,target,test_size=0.3)

Fit a Random Forest Regressor with n_estimators=5 and max_depth=3, then compare the training and test scores.

In [None]:
from sklearn import ensemble
dt=ensemble.RandomForestRegressor(n_estimators=5,max_depth=3)
dt.fit(x_train,y_train)
print('training score: ', dt.score(x_train,y_train))
print('test score: ',dt.score(x_test,y_test))

training score:  0.8676726734830894
test score:  0.8300173640130096
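
For a regressor, score returns the coefficient of determination (R^2), not a classification accuracy. The sketch below computes R^2 by hand to show what the numbers above measure. The function name r2_score_manual and the sample values are illustrative assumptions, not part of the notebook.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not taken from the Boston data
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

r2_example = r2_score_manual(y_true, y_pred)
r2_perfect = r2_score_manual(y_true, y_true)
print(r2_example)
print(r2_perfect)
```

A value of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the targets.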


Let's try with cross validation and see:

In [None]:
from sklearn.model_selection import cross_val_score
scores1 = cross_val_score(ensemble.RandomForestRegressor(n_estimators=5,max_depth=3),x_train,y_train,cv=10)
np.average(scores1)

0.760223694586679
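
With cv=10 and a regressor, cross_val_score splits the training data into 10 contiguous, unshuffled folds; each fold is held out once for validation while the model is fitted on the other nine. The sketch below reproduces only the index bookkeeping in plain Python. The helper kfold_indices is hypothetical, and the training-set size of 354 rows (70% of the 506 Boston rows) is an assumption.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


n_samples = 354  # assumed training-set size
folds = list(kfold_indices(n_samples, 10))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)
```

The reported cross-validation score is the average of the 10 validation scores, one per fold.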

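The mean cross-validated score (about 0.76) is noticeably lower than the score on the single test split (about 0.83). *Cross Validation* only gives a more reliable estimate of performance; it does not improve the model. To improve it, let's combine cross validation with *GridSearchCV* and tune the hyperparameters.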

For this, we pass a dictionary (or a list of dictionaries) that maps each hyperparameter name to a list of candidate values. The model is then evaluated with cross validation on every combination of those values to find the best one.
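
To make "every combination" concrete, the sketch below enumerates the candidates in plain Python, using the same value lists as the grid-search cell in this notebook. It only counts the work; it is not how scikit-learn implements the search internally.

```python
from itertools import product

# Same value lists as the grid-search cell in this notebook
param_grid = {
    'n_estimators': [20, 30, 40, 60, 100],
    'max_depth': [5, 10, 15, 20],
    'max_features': [2, 5, 8],
}

# Cartesian product of all value lists: one dict per candidate
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_folds = 10  # cv=10 in the notebook
total_fits = len(combinations) * n_folds
print(len(combinations), total_fits)
```

With 5 x 4 x 3 = 60 candidates and 10 folds, the search fits 600 models, plus one final refit of the best candidate on the whole training set.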



In [None]:
ensemble.RandomForestRegressor().get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [15]:
model=ensemble.RandomForestRegressor()
from sklearn.model_selection import GridSearchCV
parameters=[{'n_estimators':[20,30,40,60,100], 'max_depth': 
             [5,10,15,20],'max_features':[2,5,8]}]
             
grid_search = GridSearchCV(estimator=model,
                           param_grid=parameters,
                           cv=10,
                           n_jobs=-1)
                           
grid = grid_search.fit(x_train,y_train)
grid.best_score_

0.8562229683697149

In [None]:
grid.best_params_

{'max_depth': 10, 'max_features': 8, 'n_estimators': 20}

In [None]:
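#trying the hyperparameters suggested by GridSearchCV
#note: these values differ from the best_params_ output shown above;
#no random_state is set, so the search result varies between runs
#{'max_depth': 10, 'max_features': 5, 'n_estimators': 100}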

from sklearn import ensemble
dt_1=ensemble.RandomForestRegressor(n_estimators=100, max_depth=10, max_features = 5)
dt_1.fit(x_train,y_train)
print('training score: ', dt_1.score(x_train,y_train))
print('test score: ',dt_1.score(x_test,y_test))

training score:  0.979257582511003
test score:  0.9016876387931118
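
Retyping the values from best_params_ by hand is easy to get wrong. With the default refit=True, GridSearchCV also keeps a model refitted on the full training data in best_estimator_, which can be used directly. A minimal sketch, assuming synthetic data from make_regression as a stand-in for the Boston data and a deliberately small grid so that it runs quickly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption, not the Boston dataset)
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [10, 20], 'max_depth': [3, 5]},
    cv=3,
)
search.fit(X_train, y_train)

# Already refitted on all of X_train with the best hyperparameters
best_model = search.best_estimator_
print('best params:', search.best_params_)
print('test R^2:', best_model.score(X_test, y_test))
```

Fixing random_state in both the split and the estimator makes the selected hyperparameters reproducible from run to run.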


#Random Search CV

In *RandomizedSearchCV*, only a fixed number of randomly sampled combinations is evaluated (n_iter, 10 by default), so the best combination in the search space is not guaranteed to be found. The advantage is that we can cover a broader range of hyperparameter values within the same computation time as grid search.
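
The difference from grid search can be sketched in plain Python: instead of evaluating all candidates, draw a fixed number of them at random. The value lists match the notebook's grid; the seed and the use of random.sample are illustrative assumptions, not scikit-learn's internal implementation.

```python
import random
from itertools import product

param_grid = {
    'n_estimators': [20, 30, 40, 60, 100],
    'max_depth': [5, 10, 15, 20],
    'max_features': [2, 5, 8],
}

names = list(param_grid)
all_combinations = [dict(zip(names, values))
                    for values in product(*param_grid.values())]

n_iter = 10  # number of candidates to try (RandomizedSearchCV's default)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Sample without replacement, so no candidate is evaluated twice
sampled = rng.sample(all_combinations, n_iter)

for candidate in sampled:
    print(candidate)
print(f"evaluated {len(sampled)} of {len(all_combinations)} combinations")
```

Only 10 of the 60 candidates are evaluated, which is why the result can miss the combination that an exhaustive grid search would find.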

Below is the implementation of randomized search for the Boston house price example above.

In [16]:
from sklearn.model_selection import RandomizedSearchCV
model=ensemble.RandomForestRegressor()
param_grid=[{'n_estimators':[20,30,40,60,100], 'max_depth':[5,10,15,20] 
          },{'n_estimators':[20,30,40,60,100], 'max_depth':[5,10,15,20]
             ,'max_features':[2,5,8]}]

rnd_search = RandomizedSearchCV(model, param_grid, cv=10, 
                                          return_train_score=True)
rnd_search.fit(x_train,y_train)
rnd_search.best_score_

0.8527961988409496

In [None]:
rnd_search.best_params_

{'n_estimators': 100, 'max_features': 5, 'max_depth': 20}

In [None]:
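#trying the hyperparameters suggested by RandomizedSearchCV
#note: these values differ from the best_params_ output shown above;
#no random_state is set, so the search result varies between runs
#{'max_depth': 10, 'max_features': 8, 'n_estimators': 20}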

from sklearn import ensemble
dt=ensemble.RandomForestRegressor(n_estimators=20, max_depth=10, max_features = 8)
dt.fit(x_train,y_train)
print('training score: ', dt.score(x_train,y_train))
print('test score: ',dt.score(x_test,y_test))

training score:  0.9644230305537048
test score:  0.881733229588773


#**Save and Load Machine Learning Models**

In Machine Learning, when we finally arrive at the model that we are satisfied with, we can save and reuse it later.

In this section, we will look at the two most common ways of saving a model and reloading it for future predictions:

1. Finalize Your Model with pickle
2. Finalize Your Model with Joblib

#Finalize Your Model with pickle

Pickle is the standard way of serializing objects in Python. You can use the pickle module to serialize your trained model and save the serialized form to a file. Later, you can load this file to deserialize the model and use it to make new predictions. The example below demonstrates how to save a model to a file and load it to make predictions on the unseen test set. We will save the best model from the previous section (see code above).

In [17]:
import pickle

# save the model to disk; the with-block closes the file handle
with open('model.pkl', 'wb') as f:
    pickle.dump(dt_1, f)

# some time later...
# load the model from disk
with open('model.pkl', 'rb') as f:
    pickled_model = pickle.load(f)

Running the example saves the model to model.pkl in your local working directory. Loading the saved model and evaluating it on the test set provides an estimate of the model's performance on unseen data.
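
One way to check that a reloaded model behaves like the original is to compare their predictions and scores. A minimal sketch, assuming synthetic data from make_regression and an in-memory round trip with pickle.dumps and pickle.loads in place of the model.pkl file:

```python
import pickle

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption, not the Boston dataset)
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Serialize to bytes and back, the in-memory equivalent of dump/load
restored = pickle.loads(pickle.dumps(model))

same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print('identical predictions:', same_predictions)
print('R^2 of restored model:', restored.score(X, y))
```

Calling score on the reloaded model gives a single R^2 value, which is easier to compare than the raw array returned by predict.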

In [18]:
pickled_model.predict(x_test) 

array([13.18381377, 16.75036941, 26.16508854, 15.09399545, 19.7676893 ,
       13.70581901, 17.44155716, 19.63435763, 13.82730952, 13.89921688,
       14.30815742, 20.74550679, 19.44046885, 10.73719356, 23.52185118,
       29.93439654, 24.51996913, 30.07625864, 14.79489984, 20.80625405,
       20.87096238, 32.2564513 , 22.49688514, 23.15173559, 30.68041866,
       20.89820162, 13.83898047, 19.5883208 , 45.31      , 33.85058523,
       44.107     , 18.86286503, 23.34444551, 26.66165147, 20.58856005,
       24.16556849, 30.09678539,  7.29554679, 23.0336282 , 14.99766725,
       25.95037283, 40.057     , 27.32989216, 21.13984733, 20.16277227,
       31.51802794, 19.4120234 , 23.70092479, 22.00697911, 33.60031319,
       18.58805168, 20.40292212, 18.80858399, 33.19087323, 21.01011696,
       23.4038147 , 26.11975444, 34.8694125 , 31.74497273, 21.8425507 ,
       18.43068026, 16.09041883, 26.61116687, 20.92857005, 20.95343321,
       20.88336484, 19.2375718 , 15.66238684,  7.73877284, 46.62

#Finalize Your Model with Joblib

In [19]:
!pip install joblib



In [20]:
import joblib


joblib.dump(dt_1 , 'joblib_model')

['joblib_model']

In [21]:
loaded_model = joblib.load('joblib_model')

loaded_model.predict(x_test)


array([13.18381377, 16.75036941, 26.16508854, 15.09399545, 19.7676893 ,
       13.70581901, 17.44155716, 19.63435763, 13.82730952, 13.89921688,
       14.30815742, 20.74550679, 19.44046885, 10.73719356, 23.52185118,
       29.93439654, 24.51996913, 30.07625864, 14.79489984, 20.80625405,
       20.87096238, 32.2564513 , 22.49688514, 23.15173559, 30.68041866,
       20.89820162, 13.83898047, 19.5883208 , 45.31      , 33.85058523,
       44.107     , 18.86286503, 23.34444551, 26.66165147, 20.58856005,
       24.16556849, 30.09678539,  7.29554679, 23.0336282 , 14.99766725,
       25.95037283, 40.057     , 27.32989216, 21.13984733, 20.16277227,
       31.51802794, 19.4120234 , 23.70092479, 22.00697911, 33.60031319,
       18.58805168, 20.40292212, 18.80858399, 33.19087323, 21.01011696,
       23.4038147 , 26.11975444, 34.8694125 , 31.74497273, 21.8425507 ,
       18.43068026, 16.09041883, 26.61116687, 20.92857005, 20.95343321,
       20.88336484, 19.2375718 , 15.66238684,  7.73877284, 46.62