## Hyperparameter Tuning
Hyperparameter tuning is the process of choosing a model's configuration parameters once you have decided which model to use (e.g., choosing the kernel ('rbf', 'linear', 'poly'), C (positive float), and gamma (float, or 'scale'/'auto') parameters for an SVM model).
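
As a small illustration of what counts as a hyperparameter, the sketch below lists the constructor arguments of an SVC with `get_params()`. The variable names (`default_model`, `hyperparams`) are illustrative only and are not used elsewhere in this notebook.

```python
from sklearn import svm

# Hyperparameters are constructor arguments, fixed before the model is fitted.
default_model = svm.SVC()
hyperparams = default_model.get_params()

# Print the three hyperparameters mentioned above.
for name in ("kernel", "C", "gamma"):
    print(name, "=", hyperparams[name])
```

Anything returned by `get_params()` can, in principle, be placed in a search grid.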

## Import and pre-process dataset.

In [16]:
from sklearn import svm, datasets

iris = datasets.load_iris()

In [17]:
import pandas as pd
import numpy as np

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["flower"] = iris.target
df["flower"] = df["flower"].apply(lambda x: iris.target_names[x])
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),flower
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## Split the data into training and testing sets.

In [7]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

## Train the model. The accuracy changes each time the training/testing split is redrawn, because the split is random.

In [8]:
model = svm.SVC(kernel="rbf", C=30, gamma="auto")
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.9555555555555556
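
To make the variation visible, the sketch below repeats the split with a handful of fixed seeds and records the score for each. The seed values and the names `seeds` and `split_scores` are assumptions made for illustration; the scores will typically differ between seeds, though the exact values depend on the installed scikit-learn version.

```python
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hypothetical seeds, chosen only so that each split is reproducible.
seeds = [0, 1, 2, 3, 4]
split_scores = []

for seed in seeds:
    x_tr, x_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed
    )
    model = svm.SVC(kernel="rbf", C=30, gamma="auto")
    model.fit(x_tr, y_tr)
    split_scores.append(model.score(x_te, y_te))

print(split_scores)
print("spread:", max(split_scores) - min(split_scores))
```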

## Use 'cross_val_score' to run k-fold cross-validation. With an integer cv and a classifier, the folds are stratified and not shuffled, so the scores no longer depend on a random split.

In [11]:
from sklearn.model_selection import cross_val_score
cross_val_score(svm.SVC(kernel="linear", C=10, gamma="auto"), iris.data, iris.target, cv=5)

array([1.        , 1.        , 0.9       , 0.96666667, 1.        ])

In [12]:
cross_val_score(svm.SVC(kernel="rbf", C=10, gamma="auto"), iris.data, iris.target, cv=5)

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [13]:
cross_val_score(svm.SVC(kernel="rbf", C=20, gamma="auto"), iris.data, iris.target, cv=5)

array([0.96666667, 1.        , 0.9       , 0.96666667, 1.        ])
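
For readers who want to see what happens inside 'cross_val_score', here is a minimal sketch that runs the folds by hand. It assumes the default behaviour for classifiers with an integer `cv`, which is an unshuffled `StratifiedKFold`; the names `manual_scores` and `library_scores` are illustrative.

```python
import numpy as np
from sklearn import svm, datasets
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
estimator = svm.SVC(kernel="rbf", C=10, gamma="auto")

# Split into 5 stratified folds without shuffling.
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in skf.split(iris.data, iris.target):
    # Fit a fresh copy of the estimator on each training fold.
    fold_model = clone(estimator)
    fold_model.fit(iris.data[train_idx], iris.target[train_idx])
    manual_scores.append(
        fold_model.score(iris.data[test_idx], iris.target[test_idx])
    )
manual_scores = np.array(manual_scores)

# The library call should produce the same per-fold accuracies.
library_scores = cross_val_score(estimator, iris.data, iris.target, cv=5)
print(manual_scores)
print(library_scores)
```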

## Calling 'cross_val_score' by hand for each parameter combination is tedious, so we can write a for loop that computes the cross-validation scores for every combination and compare the averages to find the best one.

In [18]:
kernels = ["rbf", "linear"]
C = [1,10,20]
avg_scores = {}

for kval in kernels:
    for cval in C:
        cv_scores = cross_val_score(svm.SVC(kernel=kval, C=cval, gamma="auto"), iris.data, iris.target, cv=5)
        avg_scores[kval + "_" + str(cval)] = np.average(cv_scores)
avg_scores

{'rbf_1': 0.9800000000000001,
 'rbf_10': 0.9800000000000001,
 'rbf_20': 0.9666666666666668,
 'linear_1': 0.9800000000000001,
 'linear_10': 0.9733333333333334,
 'linear_20': 0.9666666666666666}
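
The same loop can be written without nesting by letting scikit-learn enumerate the combinations. This is only a sketch of one possible variant: it uses `ParameterGrid` and picks the best entry with `max`; the names `results`, `best_params`, and `best_mean` are illustrative, and ties are resolved in favour of the first combination encountered.

```python
import numpy as np
from sklearn import svm, datasets
from sklearn.model_selection import ParameterGrid, cross_val_score

iris = datasets.load_iris()

# Every combination of the listed values is generated as a dict.
grid = ParameterGrid({"kernel": ["rbf", "linear"], "C": [1, 10, 20]})

results = []
for params in grid:
    cv_scores = cross_val_score(
        svm.SVC(gamma="auto", **params), iris.data, iris.target, cv=5
    )
    results.append((params, float(np.mean(cv_scores))))

# Select the combination with the highest mean accuracy.
best_params, best_mean = max(results, key=lambda item: item[1])
print(best_params, best_mean)
```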

## GridSearchCV is an sklearn class that performs the same search as our for loop above (it also uses k-fold cross-validation).

In [23]:
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(svm.SVC(gamma="auto"), {
    "C" : [1,10,20],
    "kernel" : ["rbf", "linear"]
}, cv=5, return_train_score=False)

clf.fit(iris.data, iris.target)
pd.DataFrame(clf.cv_results_)[["param_C", "param_kernel", "mean_test_score"]]

,param_C,param_kernel,mean_test_score
0,1,rbf,0.98
1,1,linear,0.98
2,10,rbf,0.98
3,10,linear,0.973333
4,20,rbf,0.966667
5,20,linear,0.966667


In [24]:
print(clf.best_score_)
print(clf.best_params_)

0.9800000000000001
{'C': 1, 'kernel': 'rbf'}
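
Once the search has finished, the winning model is available directly. The sketch below assumes the default `refit=True`, under which GridSearchCV retrains the best combination on the full dataset and exposes it as `best_estimator_`; the names `search`, `best_model`, and `sample_predictions` are illustrative.

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

search = GridSearchCV(
    svm.SVC(gamma="auto"),
    {"C": [1, 10, 20], "kernel": ["rbf", "linear"]},
    cv=5,
)
search.fit(iris.data, iris.target)

# With refit=True (the default), the best model is refitted on all the data.
best_model = search.best_estimator_

# Calling predict on the search object delegates to best_estimator_.
sample_predictions = search.predict(iris.data[:5])
print(best_model)
print(iris.target_names[sample_predictions])
```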


## The computational cost of an exhaustive grid search grows quickly, because every combination of parameter values is cross-validated.
## RandomizedSearchCV reduces the cost by evaluating only 'n_iter' randomly sampled combinations instead of all of them.

In [26]:
from sklearn.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(svm.SVC(gamma="auto"), {
    "C" : [1,10,20],
    "kernel" : ["rbf", "linear"]
}, cv=5, return_train_score=False, n_iter=2)

rs.fit(iris.data, iris.target)
pd.DataFrame(rs.cv_results_)[["param_C", "param_kernel", "mean_test_score"]]

,param_C,param_kernel,mean_test_score
0,1,linear,0.98
1,20,linear,0.966667
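
RandomizedSearchCV can also sample from continuous distributions rather than fixed lists, and a `random_state` makes the sampled combinations reproducible. The sketch below is one possible setup, not the notebook's original configuration: the `loguniform` range for C, `n_iter=5`, and `random_state=0` are assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn import svm, datasets
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

# C is drawn from a log-uniform distribution; kernel is drawn from a list.
param_distributions = {
    "C": loguniform(1e-1, 1e2),
    "kernel": ["rbf", "linear"],
}

random_search = RandomizedSearchCV(
    svm.SVC(gamma="auto"),
    param_distributions,
    n_iter=5,          # number of sampled combinations to evaluate
    cv=5,
    random_state=0,    # fixes which combinations are sampled
)
random_search.fit(iris.data, iris.target)
print(random_search.best_params_, random_search.best_score_)
```

Sampling C on a log scale is a common choice because useful values often span several orders of magnitude.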


## We can use GridSearchCV in combination with a dictionary of models/parameters to run a for loop returning the best score and parameters for each model.

In [27]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [39]:
model_params = {
    "svm" : {
        "model" : svm.SVC(gamma="auto"),
        "params" : {
            "C" : [1,10,20],
            "kernel" : ["rbf", "linear"]
        }
    },
    "random_forest" : {
        "model" : RandomForestClassifier(),
        "params" : {
            "n_estimators" : [1,5,10]
        }
    },
    "logistic_regression" : {
        "model" : LogisticRegression(solver="liblinear", multi_class="auto"),
        "params" : {
            "C" : [1,5,10]
        }
    }
}

In [41]:
scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp["model"], mp["params"], cv=5, return_train_score=False)
    clf.fit(iris.data, iris.target)
    scores.append({
        "model" : model_name,
        "best_score" : clf.best_score_,
        "best_params" : clf.best_params_
    })
pd.DataFrame(scores)

,model,best_score,best_params
0,svm,0.98,"{'C': 1, 'kernel': 'rbf'}"
1,random_forest,0.966667,{'n_estimators': 10}
2,logistic_regression,0.966667,{'C': 5}
