In [1]:
from sklearn import svm, datasets
import pandas as pd

In [11]:
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['flower'] = iris.target
# df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),flower
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['flower'],axis=1), df['flower'],
                                                    test_size=0.2, random_state=42)

# Approach 1: Use train_test_split and manually tune parameters by trial and error

We have now loaded the Iris flower dataset. The traditional way to approach this problem is a train/test split.

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

Let's say we try an SVM model first. We will look at hyperparameter tuning first and at how to choose the model afterwards, so for now just assume you are going to use an SVM.

In [14]:
model = svm.SVC(kernel='rbf',C=30,gamma='auto')
model.fit(X_train,y_train)
model.score(X_test, y_test)

0.9555555555555556

Above, we picked some parameter values arbitrarily, since we don't yet know which parameters are best.

The issue is that the score depends on which samples land in the train and test sets. Right now our score is about 96%, but this split has no fixed random_state, so if we execute it again the score will change. We can't rely on a single split when the score keeps changing with the sample, so for that reason we use K-fold cross-validation.
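The split-to-split variation described above can be seen directly by repeating the split with different seeds. This is a minimal sketch: the seed range and the names `split_scores`, `X_tr`, `X_te`, `y_tr`, `y_te` and `clf` are illustrative choices, not part of the original notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Refit the same SVC on ten different random splits and record each test score
split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed
    )
    clf = svm.SVC(kernel='rbf', C=30, gamma='auto')
    clf.fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

print(split_scores)
print("min:", min(split_scores), "max:", max(split_scores))
```

The spread between the minimum and maximum is the uncertainty you take on when you judge a model from one split.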

# Approach 2: Use K Fold Cross validation
Below we run cross_val_score with 5 folds and try it with different values of "kernel" and "C".

In [15]:
from sklearn.model_selection import cross_val_score
cross_val_score(svm.SVC(kernel='linear',C=10,gamma='auto'),iris.data, iris.target, cv=5)

array([1.        , 1.        , 0.9       , 0.96666667, 1.        ])

In [16]:
cross_val_score(svm.SVC(kernel='rbf',C=10,gamma='auto'),iris.data, iris.target, cv=5)

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [17]:
cross_val_score(svm.SVC(kernel='rbf',C=20,gamma='auto'),iris.data, iris.target, cv=5)

array([0.96666667, 1.        , 0.9       , 0.96666667, 1.        ])

As you can see above, each parameter setting gives 5 scores, one per fold. You can average these scores and use the averages to decide which parameter values work best.

But this method is very manual and repetitive: there are many possible combinations of values, so you would have to run a lot of cross-validations by hand to try them all.

So another approach is to simply run a for loop.
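Averaging the fold scores for one setting could look like the sketch below. The names `fold_scores` and `mean_score` are illustrative, and printing the standard deviation is an extra that the original notebook does not compute.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# One score per fold for a single parameter setting
fold_scores = cross_val_score(svm.SVC(kernel='rbf', C=10, gamma='auto'),
                              iris.data, iris.target, cv=5)

# Collapse the five fold scores into one number that can be compared across settings
mean_score = fold_scores.mean()
print(mean_score, fold_scores.std())
```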

In [18]:
import numpy as np
kernels = ['rbf', 'linear']
C = [1,10,20]
avg_scores = {}
for kval in kernels:
    for cval in C:
        # no trailing comma here, otherwise cv_scores becomes a one-element tuple
        cv_scores = cross_val_score(svm.SVC(kernel=kval,C=cval,gamma='auto'),iris.data, iris.target,
                                    cv=5)
        avg_scores[kval + '_' + str(cval)] = np.average(cv_scores)

avg_scores

{'rbf_1': 0.9800000000000001,
 'rbf_10': 0.9800000000000001,
 'rbf_20': 0.9666666666666668,
 'linear_1': 0.9800000000000001,
 'linear_10': 0.9733333333333334,
 'linear_20': 0.9666666666666666}

# Approach 3: Use GridSearchCV
As we can see above, this way we can also find the best score, but the approach has issues of its own: with 4 parameters we would need 4 nested loops, which means many iterations and code that is just not convenient.

Luckily, sklearn provides an API called GridSearchCV that does exactly the same thing as the for loop above, but in essentially a single line of code.

The first argument we pass to GridSearchCV is the model; in our example that is an SVM with gamma set to 'auto'. The second argument is very important: it is the parameter grid. In the parameter grid you say which values to try for each parameter, for example "C" with 1, 10 and 20, and "kernel" with "rbf" and "linear". (The cell below extends this grid with C = 50 and the "poly" kernel.)

GridSearchCV takes other parameters as well, for example "cv", passed here as a keyword argument, which sets how many cross-validation folds to run. GridSearchCV uses cross-validation internally; it simply packs the for loop above into one call.

In [22]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(svm.SVC(gamma='auto'), 
                   {
                       'kernel' : ['rbf', 'linear', 'poly'],
                       'C':[1,10,20,50]
                   }, cv=5, return_train_score=False)
grid.fit(X_train, y_train)
# View the GridSearchCV result in a nice DataFrame
gridresult = pd.DataFrame(grid.cv_results_)
# Show only important columns
gridresult[['param_C','param_kernel','mean_test_score', 'rank_test_score']]

Unnamed: 0,param_C,param_kernel,mean_test_score,rank_test_score
0,1,rbf,0.961905,3
1,1,linear,0.980952,1
2,1,poly,0.961905,3
3,10,rbf,0.971429,2
4,10,linear,0.961905,3
5,10,poly,0.952381,8
6,20,rbf,0.961905,3
7,20,linear,0.961905,3
8,20,poly,0.942857,9
9,50,rbf,0.942857,9


You can now have many, many parameters: all you have to do is supply them in the parameter grid passed to GridSearchCV, and it will take care of the rest and show you the scores in a nice DataFrame.

We can call dir on our grid object to see what other properties it has.

In [24]:
dir(grid)

['__abstractmethods__',
 '__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__sklearn_clone__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_build_request_for_signature',
 '_check_feature_names',
 '_check_n_features',
 '_check_refit_for_multimetric',
 '_estimator_type',
 '_format_results',
 '_get_default_requests',
 '_get_metadata_request',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_parameter_constraints',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_required_parameters',
 '_run_search',
 '_select_best_index',
 '_validate_data',
 '_validate_params',
 'best_estimator_',
 'best_index_',
 'best_params_',
 'best_score_',
 '

In [28]:
# dir(grid) lists every attribute of the fitted GridSearchCV object;
# the three most useful ones are printed below

print(grid.best_estimator_)
print(grid.best_score_)
print(grid.best_params_)

SVC(C=1, gamma='auto', kernel='linear')
0.980952380952381
{'C': 1, 'kernel': 'linear'}
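Note that best_score_ is a cross-validation average computed on the data passed to fit. A held-out test set gives an independent check of the selected parameters. The sketch below assumes a fresh split with a fixed seed; the names `search`, `X_tr`, `X_te`, `y_tr`, `y_te` and `test_accuracy` are illustrative and chosen so they do not overwrite the notebook's own variables.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

search = GridSearchCV(svm.SVC(gamma='auto'),
                      {'kernel': ['rbf', 'linear', 'poly'], 'C': [1, 10, 20, 50]},
                      cv=5)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ has already been retrained on all of X_tr
test_accuracy = search.best_estimator_.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```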


One issue with GridSearchCV is computation cost. Our dataset is very small, but imagine you have millions of data points and many candidate values for each parameter. Right now "C" only takes a handful of values, but if we wanted to try every value from 1 to 50, the cost would go up sharply, because GridSearchCV literally tries every combination of the values of all the parameters.

To tackle this computation problem, sklearn provides another class called RandomizedSearchCV.
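To put a number on that cost, the candidates in a grid can be counted before running anything. This is a small sketch using sklearn's ParameterGrid; the grid itself (C from 1 to 50 with three kernels) is a hypothetical example based on the range mentioned above.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical larger grid: every integer C from 1 to 50, three kernels
param_grid = {'C': list(range(1, 51)), 'kernel': ['rbf', 'linear', 'poly']}

n_candidates = len(ParameterGrid(param_grid))  # 50 * 3 = 150 combinations
n_fits = n_candidates * 5                      # each one is fit once per fold with cv=5
print(n_candidates, n_fits)                    # 150 750
```

Every extra parameter multiplies the count again, which is why the grid becomes expensive so quickly.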

# RandomizedSearchCV

RandomizedSearchCV does not try every single combination of parameters. Instead it tries random combinations of the parameter values, and you choose how many iterations it runs. Let's see how it works.

Let's say we want to try only 2 iterations, so we pass 2 as the **'n_iter'** parameter. It will randomly try only 2 combinations, and you go with whichever is best. This is useful when you have limited computation power.

In [29]:
from sklearn.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(svm.SVC(gamma='auto'), {
        'C': [1,10,20,30],
        'kernel': ['rbf','linear','poly']
    }, 
    cv=5, 
    return_train_score=False, 
    n_iter=2 # means try 2 random combo only
)
rs.fit(iris.data, iris.target) # train the data
pd.DataFrame(rs.cv_results_)[['param_C','param_kernel','mean_test_score']] # view as df

Unnamed: 0,param_C,param_kernel,mean_test_score
0,1,linear,0.98
1,1,rbf,0.98


As we can see above, it randomly picked values for C and the kernel; if we run it again, the picks will change. It then reports the best score among the combinations it tried.

This works well in practice: if you don't have much computation power, you can try a random sample of parameter values and go with whichever comes out best.
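Two optional variations are sketched below: fixing random_state so the sampled combinations are the same on every run, and sampling C from a continuous distribution instead of a fixed list. The use of scipy's loguniform, the range 0.1 to 100, n_iter=5 and the name `rs_dist` are illustrative assumptions, not part of the original notebook.

```python
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

rs_dist = RandomizedSearchCV(
    svm.SVC(gamma='auto'),
    {
        'C': loguniform(1e-1, 1e2),           # sample C on a log scale between 0.1 and 100
        'kernel': ['rbf', 'linear', 'poly'],  # list entries are sampled uniformly
    },
    n_iter=5,        # try 5 random combinations
    cv=5,
    random_state=0,  # fixes the sampled combinations so reruns match
)
rs_dist.fit(iris.data, iris.target)
print(rs_dist.best_params_, rs_dist.best_score_)
```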

# Choosing best Model
### How about different models with different hyperparameters?
Now that we have looked at parameter tuning, let's see how to choose the best model. For our iris dataset we are going to try SVM, Random Forest and Logistic Regression and figure out which one gives the best performance with its best parameters.

In [31]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20,30],
            'kernel': ['rbf','linear','poly']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10],
            'criterion' : ['gini', 'entropy']
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
}

Once we have initialized this dictionary, we can write a simple for loop. The loop goes through the dictionary entries and runs GridSearchCV for each one: the first argument is the model and the second is the corresponding parameter grid that we specified in the dictionary above, so each model is tried one by one.

Then we run the training and append the best score and best parameters to the scores list.

In [57]:
scores = []
for model, mp in model_params.items():
    grid = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    grid.fit(X_train, y_train)
    scores.append({
        'model' : model,
        'best_score': grid.best_score_,
        'best_params': grid.best_params_
    })
# make a df
resultdf = pd.DataFrame(scores,columns=['model','best_score','best_params'])
resultdf

Unnamed: 0,model,best_score,best_params
0,svm,0.980952,"{'C': 1, 'kernel': 'linear'}"
1,random_forest,0.942857,"{'criterion': 'gini', 'n_estimators': 5}"
2,logistic_regression,0.952381,{'C': 1}


Above we get a nice table view of each model with its best score and best parameters. The conclusion is that the best model for our dataset is SVM, which gives us a 98% score with C : 1 and kernel : "linear".

So we not only tuned the hyperparameters but also selected the best model. We used only 3 models here for demonstration, but you can use 100 models if you like. This is more of a trial-and-error approach, but in practice it works really well, and it is a common way to figure out the best model and the best parameters.
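With more than a few models, the winner can be read off the table programmatically. The sketch below rebuilds the table from the values printed above so it runs on its own; the names `results_copy` and `best_row` are illustrative.

```python
import pandas as pd

# Values copied from the result table above
results_copy = pd.DataFrame([
    {'model': 'svm', 'best_score': 0.980952,
     'best_params': {'C': 1, 'kernel': 'linear'}},
    {'model': 'random_forest', 'best_score': 0.942857,
     'best_params': {'criterion': 'gini', 'n_estimators': 5}},
    {'model': 'logistic_regression', 'best_score': 0.952381,
     'best_params': {'C': 1}},
])

# idxmax returns the index label of the row with the highest best_score
best_row = results_copy.loc[results_copy['best_score'].idxmax()]
print(best_row['model'], best_row['best_params'])
```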