### Hyperparameter Tuning

Hyperparameter tuning is the process of finding the model settings (hyperparameters) that give the best results, for example the highest cross-validated accuracy.

We'll try hyperparameter tuning on the iris flower dataset.

In [2]:
import pandas as pd
from sklearn.datasets import load_iris

In [3]:
X, y = load_iris(return_X_y=True, as_frame=True)

In [4]:
X.shape, y.shape

((150, 4), (150,))

In [6]:
X

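| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |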


In [7]:
y

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 150, dtype: int64

### GridSearchCV

GridSearchCV fits the model with every combination of the parameter values listed in a parameter grid, scores each combination with cross-validation, and reports which combination performs best.
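To make "every combination" concrete, here is a minimal standard-library sketch of how a grid expands. The helper `expand_grid` is a hypothetical name used for illustration only; it is not part of scikit-learn, and it only mimics the enumeration, not the fitting or scoring.

```python
from itertools import product

# The same grid that is searched in this section.
param_grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10, 20]}

def expand_grid(grid):
    """Return every combination of the listed values as a list of dicts.

    Hypothetical helper for illustration; keys are sorted so the order is stable.
    """
    keys = sorted(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]

combinations = expand_grid(param_grid)
for combo in combinations:
    print(combo)

# With cv=5, each combination is fitted and scored once per fold.
n_cv_fits = len(combinations) * 5
print(len(combinations), "combinations,", n_cv_fits, "cross-validation fits")
```

Two kernels times three values of `C` give six combinations, so a 5-fold search performs 30 cross-validation fits before any final refit on the full data.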



In [30]:
from sklearn.model_selection import GridSearchCV

from sklearn.svm import SVC

# create a support vector classifier
svc = SVC()

# the model parameters that we can tweak
print(svc.get_params())

# search every combination in the parameter grid with 5-fold cross-validation
gs = GridSearchCV(estimator=svc,
                  param_grid={'kernel': ['linear', 'rbf'], 'C': [1, 10, 20]},
                  cv=5)

gs.fit(X, y)

{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}


In [32]:
df = pd.DataFrame(gs.cv_results_) # check the cross validation results
df

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.004751 | 0.000618 | 0.003398 | 0.000436 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 1 | 0.005551 | 0.000535 | 0.003951 | 0.000359 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 0.966667 | 0.966667 | 0.966667 | 0.933333 | 1.0 | 0.966667 | 0.021082 | 5 |
| 2 | 0.004298 | 0.000812 | 0.002993 | 0.000425 | 10 | linear | {'C': 10, 'kernel': 'linear'} | 1.0 | 1.0 | 0.9 | 0.966667 | 1.0 | 0.973333 | 0.038873 | 4 |
| 3 | 0.004427 | 0.000584 | 0.003043 | 0.000709 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 4 | 0.004165 | 0.001025 | 0.002625 | 0.000656 | 20 | linear | {'C': 20, 'kernel': 'linear'} | 1.0 | 1.0 | 0.9 | 0.933333 | 1.0 | 0.966667 | 0.042164 | 5 |
| 5 | 0.003747 | 0.001314 | 0.003414 | 0.001282 | 20 | rbf | {'C': 20, 'kernel': 'rbf'} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |


In [35]:
# keep only the columns we need to compare the combinations
df[['param_C', 'param_kernel', 'mean_test_score', 'rank_test_score']]

| | param_C | param_kernel | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 0 | 1 | linear | 0.98 | 1 |
| 1 | 1 | rbf | 0.966667 | 5 |
| 2 | 10 | linear | 0.973333 | 4 |
| 3 | 10 | rbf | 0.98 | 1 |
| 4 | 20 | linear | 0.966667 | 5 |
| 5 | 20 | rbf | 0.98 | 1 |


In [36]:
gs.best_params_

{'C': 1, 'kernel': 'linear'}

In [37]:
gs.best_score_

0.9800000000000001
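Three combinations share the top mean score of 0.98, yet `best_params_` reports only one of them (here the first rank-1 row). The sketch below lists all tied candidates so the tie is not overlooked. The `results` list is typed in by hand from the output above rather than read from `gs.cv_results_`, so treat it as an illustration only.

```python
import math

# (params, mean_test_score) pairs copied by hand from the results shown above.
results = [
    ({'C': 1, 'kernel': 'linear'}, 0.98),
    ({'C': 1, 'kernel': 'rbf'}, 0.966667),
    ({'C': 10, 'kernel': 'linear'}, 0.973333),
    ({'C': 10, 'kernel': 'rbf'}, 0.98),
    ({'C': 20, 'kernel': 'linear'}, 0.966667),
    ({'C': 20, 'kernel': 'rbf'}, 0.98),
]

best_score = max(score for _, score in results)

# Compare with a tolerance because the scores are floating-point means.
tied = [params for params, score in results if math.isclose(score, best_score)]

print("best mean score:", best_score)
for params in tied:
    print("tied candidate:", params)
```

When several candidates tie, a simpler model or a smaller `C` may be a reasonable tiebreaker, but that choice is a judgment call and not something the search decides for us.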

### RandomizedSearchCV

Trying every combination can be costly, especially when the grid or the dataset is large. Instead of GridSearchCV, which evaluates all possible combinations, we can use RandomizedSearchCV, which evaluates only a fixed number of randomly sampled combinations. This is useful when compute time is limited and an exhaustive search is not feasible.

From the docs:

> In contrast to GridSearchCV, not all the parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
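As a rough illustration of that sampling step, the standard-library sketch below draws `n_iter` combinations from a grid whose parameters are all given as lists. It only mimics the idea and is not scikit-learn's implementation; the names `all_combinations`, `rng`, and `sampled`, and the seed value, are assumptions made for this example.

```python
import random
from itertools import product

param_distributions = {'kernel': ['linear', 'rbf'], 'C': [1, 10, 20]}
n_iter = 2

# Enumerate the full grid first (6 combinations).
keys = sorted(param_distributions)
all_combinations = [
    dict(zip(keys, values))
    for values in product(*(param_distributions[key] for key in keys))
]

# Draw n_iter distinct combinations; the seed only makes this sketch repeatable.
rng = random.Random(0)
sampled = rng.sample(all_combinations, k=n_iter)

for combo in sampled:
    print(combo)
```

Only the sampled combinations are fitted and scored, so the cost scales with `n_iter` rather than with the size of the grid. Without a fixed seed, different runs can pick different combinations and therefore report different best parameters.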

In [43]:
from sklearn.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(estimator=SVC(), 
                        param_distributions={'kernel':['linear', 'rbf'], 'C':[1, 10, 20]}, 
                        cv=5, 
                        n_iter=2) # randomly try out 2 combinations from the 6 possible combinations

rs.fit(X, y)

In [39]:
df2 = pd.DataFrame(rs.cv_results_)
df2

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_kernel | param_C | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005885 | 0.001189 | 0.00449 | 0.000813 | rbf | 10 | {'kernel': 'rbf', 'C': 10} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 1 | 0.00415 | 0.000692 | 0.002917 | 0.000691 | linear | 1 | {'kernel': 'linear', 'C': 1} | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |


In [42]:
df2[['param_C', 'param_kernel', 'mean_test_score', 'rank_test_score']]

| | param_C | param_kernel | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 0 | 10 | rbf | 0.98 | 1 |
| 1 | 1 | linear | 0.98 | 1 |


#### Using GridSearchCV to compare different models (with different parameters)

We can use GridSearchCV and RandomizedSearchCV to compare not only different parameter values for one model but also different models against each other.


In [44]:
# import the models that we want to try
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [28]:
# these are the models with different parameters that we want to try out
models = {
    'svm': {
        'model': SVC(gamma='auto'),
        'params': {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear', multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
}

In [45]:
# try out all the models and store the results in a list
scores = []
for model_name, model_params in models.items():
    clf = GridSearchCV(estimator=model_params['model'], 
                       param_grid=model_params['params'], 
                       cv=5)
    
    clf.fit(X, y)
    
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

df3 = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df3

| | model | best_score | best_params |
|---|---|---|---|
| 0 | svm | 0.98 | {'C': 1, 'kernel': 'rbf'} |
| 1 | random_forest | 0.953333 | {'n_estimators': 10} |
| 2 | logistic_regression | 0.966667 | {'C': 5} |


From the above, the SVM with C=1 and kernel='rbf' achieves the highest mean cross-validation score (0.98), making it the best of the models we tried.
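The same conclusion can be reached programmatically by ranking the collected results. This is a minimal sketch that uses values typed in by hand from the table above; in the notebook, the `scores` list built in the loop could be ranked the same way. Note that the random forest score can change between runs because no `random_state` is set.

```python
# Results copied by hand from the comparison table above.
scores = [
    {'model': 'svm', 'best_score': 0.98, 'best_params': {'C': 1, 'kernel': 'rbf'}},
    {'model': 'random_forest', 'best_score': 0.953333, 'best_params': {'n_estimators': 10}},
    {'model': 'logistic_regression', 'best_score': 0.966667, 'best_params': {'C': 5}},
]

# Rank the models from highest to lowest mean cross-validation score.
ranked = sorted(scores, key=lambda row: row['best_score'], reverse=True)
best = ranked[0]

for row in ranked:
    print(row['model'], row['best_score'], row['best_params'])
print("selected:", best['model'], best['best_params'])
```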