# What is Hyperparameter Tuning?
#### What is a Hyperparameter?
- Hyperparameters are the settings of an ML algorithm that you choose before training, for example n_neighbors in KNeighborsClassifier. Unlike model parameters, they are not learned from the data, but they do affect the accuracy of the trained model.
- Tuning: Adjusting

- In short, hyperparameter tuning means adjusting these settings to find the values that give the highest accuracy.

# Loading the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Load the data

In [2]:
data = pd.read_csv("Health_Insurance.csv")
data.head()

| | Age | Purchased |
|---|---|---|
| 0 | 19 | 0 |
| 1 | 35 | 0 |
| 2 | 26 | 0 |
| 3 | 27 | 0 |
| 4 | 19 | 0 |


In [3]:
data.shape

(400, 2)

# Separate X and y

In [4]:
X = data.drop(columns = "Purchased")
y = data["Purchased"]              

# Split the data into train set and test set

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Apply KNeighborsClassifier Algorithm on the data

In [6]:
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()
knc

In [7]:
knc.fit(X_train, y_train)

# Perform Predictions

In [8]:
y_pred = knc.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1], dtype=int64)

# Check Accuracy

In [9]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.825
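The accuracy reported above is the fraction of test predictions that match the true labels. With a test set of 80 rows (20% of 400), a value of 0.825 corresponds to 66 correct predictions. The short sketch below illustrates the same calculation on a small pair of arrays; `y_true_demo` and `y_pred_demo` are made-up values for illustration, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, made up for illustration (not from Health_Insurance.csv).
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Accuracy is the share of positions where prediction and truth agree.
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, sklearn_accuracy)  # 0.75 0.75 (6 of 8 correct)
```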

# Note:
With its default settings, KNeighborsClassifier gives an accuracy of 82.5% on the test set. By default, n_neighbors is 5 and the distance used is the Euclidean distance (the Minkowski metric with p = 2).
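One way to confirm which defaults are in effect is to read them back from an untrained estimator with `get_params()`. This is a small sketch for inspection only; the selection of four parameters printed here is a choice made for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# get_params() returns the hyperparameters as a dict, before any training.
default_params = KNeighborsClassifier().get_params()

for name in ("n_neighbors", "metric", "p", "weights"):
    print(name, "=", default_params[name])
```

The Minkowski metric with p = 2 is the Euclidean distance, and p = 1 is the Manhattan distance.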

# Hyperparameter Tuning:

- Let's tune (adjust) the n_neighbors and observe how the accuracy changes with changes in n_neighbors

In [10]:
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()

## Hyperparameter optimization
parameters = {"n_neighbors" : [3, 5, 7, 9, 11, 13, 15, 17, 19 ,21]}
parameters

{'n_neighbors': [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

In [11]:
from sklearn.model_selection import GridSearchCV
gscv = GridSearchCV(knc, parameters)
gscv

In [12]:
gscv.fit(X_train, y_train)

In [13]:
gscv.cv_results_

{'mean_fit_time': array([0.0036943 , 0.00384154, 0.00361342, 0.00317302, 0.00241828,
        0.0039731 , 0.00324621, 0.00319858, 0.00280733, 0.00200334]),
 'std_fit_time': array([0.00078552, 0.00074832, 0.00047536, 0.00018488, 0.00051409,
        0.00012303, 0.00097253, 0.00102734, 0.00132576, 0.00063332]),
 'mean_score_time': array([0.00897102, 0.00918946, 0.01081781, 0.00802464, 0.00681047,
        0.01005063, 0.00823092, 0.00635633, 0.00685773, 0.00679193]),
 'std_score_time': array([0.00325322, 0.00257082, 0.00231094, 0.00334296, 0.00100255,
        0.00260524, 0.00208327, 0.0006727 , 0.00082626, 0.00143881]),
 'param_n_neighbors': masked_array(data=[3, 5, 7, 9, 11, 13, 15, 17, 19, 21],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_neighbors': 3},
  {'n_neighbors': 5},
  {'n_neighbors': 7},
  {'n_neighbors': 9},
  {'n_neighbors': 11},
  {'n_neighbors'

In [16]:
res = pd.DataFrame(gscv.cv_results_)
res[["params", "mean_test_score"]]

| | params | mean_test_score |
|---|---|---|
| 0 | {'n_neighbors': 3} | 0.753125 |
| 1 | {'n_neighbors': 5} | 0.803125 |
| 2 | {'n_neighbors': 7} | 0.8 |
| 3 | {'n_neighbors': 9} | 0.7875 |
| 4 | {'n_neighbors': 11} | 0.80625 |
| 5 | {'n_neighbors': 13} | 0.81875 |
| 6 | {'n_neighbors': 15} | 0.815625 |
| 7 | {'n_neighbors': 17} | 0.815625 |
| 8 | {'n_neighbors': 19} | 0.815625 |
| 9 | {'n_neighbors': 21} | 0.815625 |


# Notes:
- mean_test_score is the accuracy averaged over the cross-validation folds of the training set (GridSearchCV uses 5 folds by default), not the accuracy on X_test.
- The highest mean accuracy is at k = 13 (0.81875), so the optimized k value for the above data is 13.
- From k = 15 onward the accuracy is stable: it stays at 0.815625 for k = 15, 17, 19 and 21, only 0.003125 below the peak.
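The notes above read the best k off the results table by eye. The sketch below shows how the same information can be pulled from the fitted search object, and checks that each mean_test_score equals a plain 5-fold `cross_val_score` average. It runs on a synthetic single-feature dataset generated here as a stand-in (an assumption, since the CSV is not reproduced), so its numbers will differ from the notebook's; the names `X_demo`, `label` and `search` are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's data (an assumption, not the real CSV):
# one "age" feature, with higher ages more likely to have label 1.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
prob = 1 / (1 + np.exp(-(age - 40) / 5))
label = (rng.random(400) < prob).astype(int)
X_demo = age.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, label, test_size=0.2, random_state=0
)

grid = {"n_neighbors": [3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), grid)  # cv defaults to 5 folds
search.fit(X_tr, y_tr)

# Each mean_test_score is the average accuracy over the 5 validation folds.
manual_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
    for k in grid["n_neighbors"]
]
for k, auto, manual in zip(
    grid["n_neighbors"], search.cv_results_["mean_test_score"], manual_scores
):
    print(k, round(auto, 4), round(manual, 4))

print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
# refit=True by default, so `search` scores with the best model refit on all of X_tr.
print("held-out test accuracy:", search.score(X_te, y_te))
```

Reading `best_params_` avoids misreading the table, and scoring on the held-out test set gives an accuracy that was not used to pick k.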

# Check the distance metric along with the k value
- When the distance parameter is added to the search, does the best k found above still give the highest accuracy?
- 1. If yes, for which distance metric?
- 2. If not, which k and which distance metric give the highest accuracy?

In [17]:
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()

## Hyperparameter optimization
parameters = {"n_neighbors" : [3, 5, 7, 9, 11, 13, 15, 17, 19 ,21], "p" : [1, 2]}  # p = 1 is for Manhattan distance & p = 2 is for Euclidean distance
parameters

{'n_neighbors': [3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'p': [1, 2]}

In [18]:
from sklearn.model_selection import GridSearchCV
gscv = GridSearchCV(knc, parameters)
gscv

In [19]:
gscv.fit(X_train, y_train)

In [20]:
gscv.cv_results_

{'mean_fit_time': array([0.00590816, 0.00447388, 0.00476818, 0.00383024, 0.00422263,
        0.00328026, 0.00311923, 0.00325522, 0.00336695, 0.00309014,
        0.003159  , 0.00331163, 0.00405059, 0.00286565, 0.00283122,
        0.00288954, 0.00376768, 0.00390248, 0.00290799, 0.00313339]),
 'std_fit_time': array([0.00162129, 0.00168665, 0.0011025 , 0.00189504, 0.00076109,
        0.00052785, 0.00029593, 0.00121162, 0.00120925, 0.00062619,
        0.0006456 , 0.00092809, 0.00082525, 0.00078683, 0.00053923,
        0.00064298, 0.00073472, 0.0016404 , 0.00076388, 0.00147498]),
 'mean_score_time': array([0.01428528, 0.01156659, 0.01044044, 0.0096673 , 0.01084104,
        0.00874023, 0.00787129, 0.00802298, 0.00861816, 0.00854182,
        0.00898576, 0.00807366, 0.00874724, 0.00932803, 0.00907907,
        0.00864186, 0.00877743, 0.0095027 , 0.00891442, 0.01033778]),
 'std_score_time': array([0.00369976, 0.00266895, 0.00251438, 0.00152371, 0.00234352,
        0.00217292, 0.00124901, 0.001113

In [22]:
res = pd.DataFrame(gscv.cv_results_)
res[["params", "mean_test_score"]]

| | params | mean_test_score |
|---|---|---|
| 0 | {'n_neighbors': 3, 'p': 1} | 0.753125 |
| 1 | {'n_neighbors': 3, 'p': 2} | 0.753125 |
| 2 | {'n_neighbors': 5, 'p': 1} | 0.803125 |
| 3 | {'n_neighbors': 5, 'p': 2} | 0.803125 |
| 4 | {'n_neighbors': 7, 'p': 1} | 0.8 |
| 5 | {'n_neighbors': 7, 'p': 2} | 0.8 |
| 6 | {'n_neighbors': 9, 'p': 1} | 0.7875 |
| 7 | {'n_neighbors': 9, 'p': 2} | 0.7875 |
| 8 | {'n_neighbors': 11, 'p': 1} | 0.80625 |
| 9 | {'n_neighbors': 11, 'p': 2} | 0.80625 |


# Notes:
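- The distance metric does not affect the accuracy here: in the rows shown, every k gives the same score for Manhattan (p = 1) and Euclidean (p = 2) distance, and these scores match the earlier search over k alone.
- This is expected for this data. X has a single feature (Age), and with one feature both distances reduce to the absolute difference |a - b|, so the nearest neighbors are identical.
- Only the k value has an effect on the accuracy. With more than one feature, the choice of distance metric could matter.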
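The identical scores for p = 1 and p = 2 follow from the data having one feature. A minimal sketch of the Minkowski distance makes this concrete; the helper `minkowski` and the second feature column in the two-feature case are hypothetical, written here only for illustration.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between two points; p=1 is Manhattan, p=2 is Euclidean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# One feature (ages 19 and 35): both metrics reduce to |a - b|.
one_d_manhattan = minkowski([19], [35], p=1)
one_d_euclidean = minkowski([19], [35], p=2)
print(one_d_manhattan, one_d_euclidean)  # 16.0 16.0

# Two features (hypothetical second column): the metrics now disagree.
two_d_manhattan = minkowski([19, 3], [35, 15], p=1)
two_d_euclidean = minkowski([19, 3], [35, 15], p=2)
print(two_d_manhattan, two_d_euclidean)  # 28.0 20.0
```

Since the distances agree for every pair of one-feature points, the neighbor sets and therefore the predictions are the same for both values of p.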