Finding the best model and hyperparameter tuning using GridSearchCV

In [1]:
""" 
In this notebook we will look at:
1) How to tune the hyperparameters of a machine learning model
2) How to choose the best model for a given machine learning problem

The process of choosing optimal hyperparameter values is called hyperparameter tuning.

We will start by comparing the traditional train_test_split approach with k-fold cross-validation. Then we will see how GridSearchCV runs k-fold cross-validation for us through a convenient API.

GridSearchCV helps find the parameter values that give the best cross-validated performance.

RandomizedSearchCV is another class in the sklearn library that serves the same purpose as GridSearchCV, but it samples a fixed number of parameter combinations instead of running an exhaustive search. This saves computation time and resources.

We will also see how to use GridSearchCV to find the best model among several classification algorithms.

"""


In [2]:
""" 
Problem:

For the iris flower dataset in the sklearn library, we are going to find the best model and the best hyperparameters using GridSearchCV.

"""


In [3]:
from sklearn import svm, datasets
iris = datasets.load_iris()

In [4]:
import pandas as pd 
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df["flower"] = iris.target
df["flower"] = df["flower"].apply(lambda x : iris.target_names[x])
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),flower
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


Approach 1: Use train_test_split and manually tune parameters by trial and error

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.3)

In [6]:
model = svm.SVC(kernel="rbf", C=30, gamma="auto")
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9333333333333333
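
The score above comes from a single random split, and since no random_state is passed to train_test_split, rerunning the cell may give a different number. The following is a minimal sketch of that effect, assuming the same SVC settings as above; the seeds 0 to 4 are arbitrary choices made only for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Repeat the hold-out evaluation with different (arbitrary) seeds.
split_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed
    )
    model = svm.SVC(kernel="rbf", C=30, gamma="auto")
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print(split_scores)
print("spread:", max(split_scores) - min(split_scores))
```

Any spread between these scores is caused by the split alone, which is the motivation for averaging over several folds in the next approach.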

Approach 2: Use k-fold cross-validation

In [7]:
import numpy as np
from sklearn.model_selection import cross_val_score

In [8]:
cross_val_score(svm.SVC(kernel = 'linear', C = 10, gamma = "auto"), iris.data, iris.target, cv = 5)

array([1.        , 1.        , 0.9       , 0.96666667, 1.        ])
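
To make clear what cross_val_score does with cv = 5, here is a hedged sketch that writes the folds out by hand. It assumes that, for a classifier and an integer cv, scikit-learn uses StratifiedKFold without shuffling; the names manual_scores and helper_scores are introduced only for this example.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
estimator = svm.SVC(kernel="linear", C=10, gamma="auto")

# Train on four folds, score on the remaining one, five times over.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(estimator)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

helper_scores = cross_val_score(estimator, X, y, cv=5)
print(np.array(manual_scores))
print(helper_scores)
```

Both arrays should agree, with one accuracy value per fold.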

In [9]:
# The approach above is tedious and very manual. A for loop can try several parameter combinations instead.

kernels = ["rbf", "linear"]
C = [1, 10, 20]
avg_scores = {}

for kval in kernels:
    for cval in C:
        cv_scores = cross_val_score(svm.SVC(kernel = kval, C = cval, gamma = "auto"), iris.data, iris.target, cv = 5)
        avg_scores[kval + "_" + str(cval)] = np.average(cv_scores)

avg_scores

{'rbf_1': 0.9800000000000001,
 'rbf_10': 0.9800000000000001,
 'rbf_20': 0.9666666666666668,
 'linear_1': 0.9800000000000001,
 'linear_10': 0.9733333333333334,
 'linear_20': 0.9666666666666666}

In [10]:
""" 
Conclusion: 
    From the results above, rbf with C=1 or C=10, or linear with C=1, gives the best performance (mean score of 0.98).
    
Problem : 
    With more parameters, the number of combinations to loop over multiplies, so nested loops become both computationally expensive and inconvenient to write.
"""

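
To see how quickly the work grows, the sketch below counts combinations and model fits for a slightly larger search space. The kernel and C lists match the loop above; the gamma values are hypothetical and added only to illustrate the growth.

```python
from itertools import product

# kernel and C match the loop above; the gamma values are hypothetical.
search_space = {
    "kernel": ["rbf", "linear"],
    "C": [1, 10, 20],
    "gamma": [0.01, 0.1, 1.0],
}
n_folds = 5

# Every combination of one value per parameter.
combinations = [
    dict(zip(search_space, values)) for values in product(*search_space.values())
]
n_fits = len(combinations) * n_folds  # each combination is fitted once per fold

print(len(combinations), "combinations,", n_fits, "model fits")
# 18 combinations, 90 model fits
```

Adding one parameter with three values triples the work, and it would also require a third nested loop in the manual version.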

Approach 3: Use GridSearchCV

In [11]:
""" 
GridSearchCV does the same thing as the for loop above, but in a single call.

Parameters:

    param_grid: 
        A dictionary, or a list of dictionaries, mapping hyperparameter names to the values to search over. Here we search over the values of C (the regularization parameter) and kernel.

        In short, supply all the parameter values here and GridSearchCV evaluates every combination of them using k-fold cross-validation.

    return_train_score: 
        Specifies whether to compute the training scores during the grid search. Setting it to False skips them, so they do not appear in the results.

        The training score only indicates how well the model fits the training data.
"""

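
The "list of dictionaries" form of param_grid is easiest to understand by expanding it. The sketch below uses scikit-learn's ParameterGrid helper to list the candidates a grid describes. The grid itself is hypothetical: it adds gamma values only for the rbf kernel, so they are not paired with the linear kernel, where gamma has no effect.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: gamma is listed only in the rbf sub-grid.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 20]},
    {"kernel": ["rbf"], "C": [1, 10, 20], "gamma": [0.01, 0.1]},
]

# Each dictionary is expanded on its own; the results are concatenated.
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)
print(len(candidates), "candidates")
```

A single dictionary with all three keys would instead produce 12 candidates, half of them linear-kernel runs that differ only in an unused gamma.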

In [12]:
from sklearn.model_selection import GridSearchCV 

clf = GridSearchCV(svm.SVC(gamma = "auto"), {
    'C' : [1, 10, 20],
    'kernel' : ['rbf', 'linear']
}, 
cv = 5, 
return_train_score = False
)

clf.fit(iris.data, iris.target)
clf.cv_results_

{'mean_fit_time': array([0.00104284, 0.00050154, 0.00150867, 0.00040126, 0.00100479,
        0.00102758]),
 'std_fit_time': array([0.00068851, 0.0004498 , 0.00148591, 0.00049145, 0.000634  ,
        0.00067271]),
 'mean_score_time': array([0.00083475, 0.00033455, 0.00098076, 0.00101528, 0.00088134,
        0.00094213]),
 'std_score_time': array([0.00042566, 0.00042338, 0.00054972, 0.00157948, 0.00132582,
        0.00099801]),
 'param_C': masked_array(data=[1, 1, 10, 10, 20, 20],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['rbf', 'linear', 'rbf', 'linear', 'rbf', 'linear'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1, 'kernel': 'rbf'},
  {'C': 1, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'rbf'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 20, 'kernel': 'rbf'},
  {'C': 20, 'kernel': 'linear'}],


In [13]:
df = pd.DataFrame(clf.cv_results_)
df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001043,0.000689,0.000835,0.000426,1,rbf,"{'C': 1, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
1,0.000502,0.00045,0.000335,0.000423,1,linear,"{'C': 1, 'kernel': 'linear'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
2,0.001509,0.001486,0.000981,0.00055,10,rbf,"{'C': 10, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
3,0.000401,0.000491,0.001015,0.001579,10,linear,"{'C': 10, 'kernel': 'linear'}",1.0,1.0,0.9,0.966667,1.0,0.973333,0.038873,4
4,0.001005,0.000634,0.000881,0.001326,20,rbf,"{'C': 20, 'kernel': 'rbf'}",0.966667,1.0,0.9,0.966667,1.0,0.966667,0.036515,5
5,0.001028,0.000673,0.000942,0.000998,20,linear,"{'C': 20, 'kernel': 'linear'}",1.0,1.0,0.9,0.933333,1.0,0.966667,0.042164,6


In [14]:
df[['param_C', 'param_kernel', 'mean_test_score']]

Unnamed: 0,param_C,param_kernel,mean_test_score
0,1,rbf,0.98
1,1,linear,0.98
2,10,rbf,0.98
3,10,linear,0.973333
4,20,rbf,0.966667
5,20,linear,0.966667


In [15]:
"Based on the table above, any of the first three parameter combinations gives the best performance (mean test score of 0.98)."


In [16]:
dir(clf)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_feature_names',
 '_check_n_features',
 '_check_refit_for_multimetric',
 '_estimator_type',
 '_format_results',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_required_parameters',
 '_run_search',
 '_select_best_index',
 '_validate_data',
 'best_estimator_',
 'best_index_',
 'best_params_',
 'best_score_',
 'classes_',
 'cv',
 'cv_results_',
 'decision_function',
 'error_score',
 'estimator',
 'fit',
 'get_params',
 'inverse_transform',
 'multimetric_',
 'n_features_in_',
 'n_jobs

In [17]:
clf.best_score_

0.9800000000000001

In [18]:
clf.best_estimator_

In [19]:
clf.best_params_

{'C': 1, 'kernel': 'rbf'}
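
Note that best_params_ reports a single winner even when several candidates tie, as the first three rows do in the table above. The sketch below shows how best_index_ links best_params_ and best_score_ back to cv_results_, and how the fitted search object can be used for prediction. It assumes the default refit=True; the variable name search is chosen here to avoid overwriting clf.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
search = GridSearchCV(
    svm.SVC(gamma="auto"),
    {"C": [1, 10, 20], "kernel": ["rbf", "linear"]},
    cv=5,
)
search.fit(iris.data, iris.target)

# best_index_ points at the winning row of cv_results_.
best_row = search.best_index_
print(search.cv_results_["params"][best_row])
print(search.cv_results_["mean_test_score"][best_row])

# With refit=True (the default), predict() uses best_estimator_, which has
# been refit on all of the data passed to fit().
sample = iris.data[:3]
predicted = search.predict(sample)
print(iris.target_names[predicted])
```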

RandomizedSearchCV

In [20]:
""" 
Randomized search is another hyperparameter tuning technique. It tries random combinations of hyperparameters drawn from specified distributions or lists of values, which is particularly useful when the search space is large.

Use RandomizedSearchCV to evaluate only a fixed number of randomly chosen combinations instead of all of them. This reduces the cost of computation.

n_iter: 
    The number of combinations to sample. Too small a value lowers the chance of finding the best combination; too large a value increases processing time. It therefore trades off run time against the quality of the solution.

"""

In [21]:
from sklearn.model_selection import RandomizedSearchCV

rs = RandomizedSearchCV(svm.SVC(gamma = "auto"), {
    'C' : [1, 10, 20],
    'kernel' : ['rbf', 'linear']
},
cv = 5, 
return_train_score = False,
n_iter=2
)

rs.fit(iris.data, iris.target)

pd.DataFrame(rs.cv_results_)[['param_C', 'param_kernel', 'mean_test_score']]

Unnamed: 0,param_C,param_kernel,mean_test_score
0,1,linear,0.98
1,20,rbf,0.966667
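
The run above samples from short lists, so it can only ever try the same six combinations as the grid search. RandomizedSearchCV also accepts distributions, which is where it differs most from GridSearchCV. The following is a hedged sketch: the loguniform range for C, n_iter=8, and random_state=0 are assumptions chosen for illustration, not tuned values.

```python
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

random_search = RandomizedSearchCV(
    svm.SVC(gamma="auto"),
    {
        "C": loguniform(1e-1, 1e2),   # hypothetical range, sampled on a log scale
        "kernel": ["rbf", "linear"],  # lists are sampled uniformly
    },
    n_iter=8,        # number of sampled combinations (arbitrary)
    cv=5,
    random_state=0,  # fixed seed so the sampled candidates are repeatable
)
random_search.fit(iris.data, iris.target)

print(random_search.best_params_)
print(random_search.best_score_)
```

Passing random_state also makes the earlier list-based search repeatable; without it, the two sampled rows change from run to run.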


How to choose the best model for an ML problem?

In [22]:
from sklearn.svm import SVC 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [23]:
# Parameter-grid
model_params = {
    'svm' : {
        'model' : SVC(gamma = 'auto'),
        'params' : {
            'C' : [1, 10, 20],
            'kernel' : ['rbf', 'linear']
        }
    },
    'random_forest' : {
        'model' : RandomForestClassifier(),
        'params' : {
            'n_estimators' : [1, 5, 10]
        }
    },
    'logistic_regression' : {
        'model' : LogisticRegression(solver = 'liblinear'),  # multi_class = 'auto' is the default, so it is omitted
        'params' : {
            'C' : [1, 5, 10],
        }
    }
}

In [24]:
for model_name, mp in model_params.items():
    print(type(model_name), type(mp))  
    print(model_name, mp)  

<class 'str'> <class 'dict'>
svm {'model': SVC(gamma='auto'), 'params': {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}}
<class 'str'> <class 'dict'>
random_forest {'model': RandomForestClassifier(), 'params': {'n_estimators': [1, 5, 10]}}
<class 'str'> <class 'dict'>
logistic_regression {'model': LogisticRegression(solver='liblinear'), 'params': {'C': [1, 5, 10]}}


In [25]:
scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv = 5, return_train_score = False) 
    clf.fit(iris.data, iris.target)
    scores.append({
        'model' : model_name,
        'best_score' : clf.best_score_,
        'best_params' : clf.best_params_
    })

In [26]:
df = pd.DataFrame(scores, columns = ['model', 'best_score', 'best_params'])
df

Unnamed: 0,model,best_score,best_params
0,svm,0.98,"{'C': 1, 'kernel': 'rbf'}"
1,random_forest,0.953333,{'n_estimators': 5}
2,logistic_regression,0.966667,{'C': 5}
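
One caveat about the best_score_ values in this table: each is the score of the winning candidate on the same folds that were used to pick it, so it can be slightly optimistic. A common check, sketched below under stated assumptions, is to hold out a test set, tune on the remaining data only, and score the tuned model once on the held-out part. The 70/30 split, stratify, and random_state=0 are choices made for this example.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

# Tune on the training part only; the test part stays unseen during the search.
search = GridSearchCV(
    SVC(gamma="auto"),
    {"C": [1, 10, 20], "kernel": ["rbf", "linear"]},
    cv=5,
)
search.fit(X_train, y_train)

held_out_score = search.score(X_test, y_test)
print("cross-validated score on training part:", search.best_score_)
print("score on held-out test part:", held_out_score)
```

With only 150 rows the two numbers will be noisy, so small gaps between models in the table should not be over-interpreted.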


In [27]:
""" 
Conclusion:
    Based on the results above, SVM with C=1 and kernel='rbf' is the best of the models and parameter values tried for iris flower classification (mean cross-validated score of 0.98).
"""