# ML Tutorial Day 15

## Hyper-Parameter Tuning

We will see how to find the best model for our problem and perform hyper-parameter tuning to further refine performance.

In [84]:
# loading dataset
from sklearn.datasets import load_iris
iris = load_iris()

In [85]:
# preparing dataset
import pandas as pd
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['flower'] = iris.target
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])
df

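|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower    |
|-----|-------------------|------------------|-------------------|------------------|-----------|
| 0   | 5.1               | 3.5              | 1.4               | 0.2              | setosa    |
| 1   | 4.9               | 3.0              | 1.4               | 0.2              | setosa    |
| 2   | 4.7               | 3.2              | 1.3               | 0.2              | setosa    |
| 3   | 4.6               | 3.1              | 1.5               | 0.2              | setosa    |
| 4   | 5.0               | 3.6              | 1.4               | 0.2              | setosa    |
| ... | ...               | ...              | ...               | ...              | ...       |
| 145 | 6.7               | 3.0              | 5.2               | 2.3              | virginica |
| 146 | 6.3               | 2.5              | 5.0               | 1.9              | virginica |
| 147 | 6.5               | 3.0              | 5.2               | 2.0              | virginica |
| 148 | 6.2               | 3.4              | 5.4               | 2.3              | virginica |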


In [86]:
# preparing training and testing sets
from sklearn.model_selection import train_test_split as tts
X_train, X_test, y_train, y_test = tts(iris.data, iris.target, test_size = 0.2)

In [87]:
# now we will try different models
from sklearn.svm import SVC
model = SVC(kernel = 'rbf', C = 30, gamma = 'auto')
model.fit(X_train, y_train)

# accuracy of model
model.score(X_test, y_test)

0.9333333333333333

The model's score changes every time we run `train_test_split`, because each call shuffles the data into a different training set and test set. A score from a single split is therefore not a reliable estimate, so we use K-fold cross-validation and average the score over the folds.
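A quick way to see this variability is to score the same model on several different splits. The sketch below is illustrative only: the seeds 0 to 4 and the names `split_scores` and `svc` are arbitrary choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()

# Score the same SVC configuration on five different random splits.
# The seeds are arbitrary; they only make each split reproducible.
split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=seed
    )
    svc = SVC(kernel='rbf', C=30, gamma='auto')
    svc.fit(X_tr, y_tr)
    split_scores.append(svc.score(X_te, y_te))

print(split_scores)
print(f"min = {min(split_scores):.3f}, max = {max(split_scores):.3f}")
```

Any gap between the minimum and the maximum comes purely from which rows ended up in the test set, since the model settings are identical in every run.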

In [88]:
# using the K-fold cross validation to find average score
from sklearn.model_selection import cross_val_score
a = cross_val_score(SVC(kernel='linear', C=10, gamma='auto'), iris.data, iris.target, cv = 5)
b = cross_val_score(SVC(kernel='rbf', C=10, gamma='auto'), iris.data, iris.target, cv = 5)
c = cross_val_score(SVC(kernel='rbf', C=20, gamma='auto'), iris.data, iris.target, cv = 5)

print(f"Score for first combination: {a.mean():0.2f}\nScore for second combination: {b.mean():0.2f}\nScore for third combination: {c.mean():0.2f}")

Score for first combination: 0.97
Score for second combination: 0.98
Score for third combination: 0.97
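To make clearer what `cross_val_score` does with `cv = 5`, here is a minimal sketch that repeats the same procedure by hand. For a classifier and an integer `cv`, scikit-learn uses `StratifiedKFold` without shuffling, so the manual loop should give the same per-fold scores. The names `manual_scores`, `library_scores` and `fold_model` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
estimator = SVC(kernel='rbf', C=10, gamma='auto')

# Manual 5-fold cross-validation: train on four folds, score on the fifth.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(estimator)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# The library helper performs the same steps in one call.
library_scores = cross_val_score(estimator, X, y, cv=5)

print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

Every sample is used for testing exactly once, which is why the mean of these scores is more stable than the score from a single random split.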



We could keep trying hyperparameter combinations by hand as above, but that quickly becomes tedious and disorganised. `GridSearchCV` from `sklearn` does the same thing for us: it cross-validates every combination in a parameter grid and records the results.

In [89]:
from sklearn.model_selection import GridSearchCV as gs
clf = gs(SVC(gamma = 'auto'), {
    'C' : [1, 10, 20],
    'kernel' : ['rbf', 'linear']
}, cv = 5, return_train_score = False)
clf.fit(iris.data, iris.target)
df = pd.DataFrame(clf.cv_results_)
df = df[['param_C', 'param_kernel', 'mean_test_score']]
df

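|   | param_C | param_kernel | mean_test_score |
|---|---------|--------------|-----------------|
| 0 | 1       | rbf          | 0.98            |
| 1 | 1       | linear       | 0.98            |
| 2 | 10      | rbf          | 0.98            |
| 3 | 10      | linear       | 0.973333        |
| 4 | 20      | rbf          | 0.966667        |
| 5 | 20      | linear       | 0.966667        |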
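For comparison, the grid search above amounts roughly to the following manual loop over every combination of `C` and `kernel`. This is a sketch of the idea rather than how `GridSearchCV` is implemented internally; `manual_results` is an illustrative name.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()
C_values = [1, 10, 20]
kernels = ['rbf', 'linear']

# Cross-validate each (C, kernel) pair and keep the mean accuracy.
manual_results = {}
for C, kernel in product(C_values, kernels):
    fold_scores = cross_val_score(
        SVC(kernel=kernel, C=C, gamma='auto'), iris.data, iris.target, cv=5
    )
    manual_results[(C, kernel)] = fold_scores.mean()

for (C, kernel), mean_score in manual_results.items():
    print(f"C={C:<3} kernel={kernel:<7} mean accuracy={mean_score:.4f}")
```

With three values of `C` and two kernels this is six cross-validation runs. `GridSearchCV` does the same bookkeeping and also stores timings, ranks and the best combination.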


In [90]:
# we can find the best combination of parameters and the best score
print(clf.best_score_)
print(clf.best_params_)

0.9800000000000001
{'C': 1, 'kernel': 'rbf'}
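With the default `refit=True`, `GridSearchCV` also retrains a model with the best parameters on the full dataset and exposes it as `best_estimator_`. The sketch below shows how that fitted model could be used for prediction. The names `search`, `best_model` and `sample` are illustrative, and the three rows are taken from the training data only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
search = GridSearchCV(
    SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)
search.fit(iris.data, iris.target)

# The refitted model trained with the best parameters found by the search.
best_model = search.best_estimator_

# Predict on a few rows; calling predict on the search object delegates
# to best_estimator_, so both lines print the same labels.
sample = iris.data[:3]
print(iris.target_names[best_model.predict(sample)])
print(iris.target_names[search.predict(sample)])
```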


If we have many parameters and each parameter has several possible values, the number of combinations grows multiplicatively and an exhaustive grid search becomes too expensive. To counter that, we can use `RandomizedSearchCV`, which randomly samples only `n_iter` (supplied by the user) parameter combinations and reports the best one among them.

In [91]:
# using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV as rsCV
rs = rsCV(SVC(gamma = 'auto'), {
    'C' : [1,10,20],
    'kernel' : ['rbf', 'linear']
},
cv = 5,
return_train_score = False,
n_iter = 4)
rs.fit(iris.data, iris.target)
pd.DataFrame(rs.cv_results_)[['param_C', 'param_kernel', 'mean_test_score']]

|   | param_C | param_kernel | mean_test_score |
|---|---------|--------------|-----------------|
| 0 | 1       | linear       | 0.98            |
| 1 | 10      | linear       | 0.973333        |
| 2 | 10      | rbf          | 0.98            |
| 3 | 1       | rbf          | 0.98            |
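The run above samples from short lists of values, but `RandomizedSearchCV` also accepts distributions, which is where it tends to pay off. The sketch below draws `C` from a log-uniform distribution using `scipy.stats.loguniform`. The range 0.1 to 100, `n_iter=8` and `random_state=0` are assumptions chosen only for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

iris = load_iris()

# C is sampled from a continuous distribution; kernel from a list.
param_distributions = {
    'C': loguniform(1e-1, 1e2),  # illustrative range, not from the tutorial
    'kernel': ['rbf', 'linear'],
}

random_search = RandomizedSearchCV(
    SVC(gamma='auto'),
    param_distributions,
    n_iter=8,          # number of sampled combinations
    cv=5,
    random_state=0,    # fixed seed so the sampled values are reproducible
)
random_search.fit(iris.data, iris.target)

print(random_search.best_params_)
print(f"{random_search.best_score_:.3f}")
```

The cost is fixed at `n_iter` cross-validation runs no matter how wide the search space is, at the price of possibly missing the best combination.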


Now we will look at how to choose the best model for a given problem.

In [92]:
# importing various models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.linear_model import LogisticRegression as lr

# defining the model parameters
model_params = {
    'svm' : {
        'model' : SVC(gamma='auto'),
        'params' : {
            'C' : [1, 10, 20],
            'kernel' : ['rbf', 'linear']
        }
    },
    'random_forest' : {
        'model' : rfc(),
        'params' : {
            'n_estimators' : [1, 5, 10]
        }
    },
    'logistic_regression' : {
        'model' : lr(solver='liblinear', multi_class = 'auto'),
        'params' : {
            'C' : [1, 5, 10]
        }
    }
}

In [None]:
score = []

# using GridSearchCV to find the best model with the optimum parameters
for model_name, mp in model_params.items():
    clf = gs(mp['model'], mp['params'], cv = 5, return_train_score = False)
    clf.fit(iris.data, iris.target)
    score.append({
        'model' : model_name,
        'best_score' : clf.best_score_,
        'best_params' : clf.best_params_
    })



In [96]:
df = pd.DataFrame(score, columns=['model', 'best_score', 'best_params'])
df

|   | model               | best_score | best_params               |
|---|---------------------|------------|---------------------------|
| 0 | svm                 | 0.98       | {'C': 1, 'kernel': 'rbf'} |
| 1 | random_forest       | 0.96       | {'n_estimators': 1}       |
| 2 | logistic_regression | 0.966667   | {'C': 5}                  |
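One caveat about the table above: `best_score_` is measured on the same data that was used to pick the parameters, so it can be slightly optimistic. A possible check, sketched below under assumptions, is to run the search on a training split and score the chosen model on a held-out test split. The split size, `random_state=0`, `stratify` and the name `holdout_search` are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()

# Keep 20% of the data aside; the search never sees these rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

holdout_search = GridSearchCV(
    SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)
holdout_search.fit(X_tr, y_tr)

print(holdout_search.best_params_)
print(f"cross-validated score: {holdout_search.best_score_:.3f}")
print(f"held-out test score:   {holdout_search.score(X_te, y_te):.3f}")
```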
