- Parameters - The names listed in a function's definition, e.g. in sum(a, b), a and b are the parameters
- Hyperparameter - A setting of an ML model that is chosen before training rather than learned from the data; changing it can change the accuracy of the model
- e.g. KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

    1. Here, changing the value of n_neighbors, p, or metric can change the accuracy of KNeighborsClassifier, so we call them hyperparameters
    2. Tuning - Adjusting these values in small steps to find the ones that work best
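
To make the distinction concrete, here is a minimal pure-Python sketch of a nearest-neighbor vote on a single feature. It is not how scikit-learn implements KNN; the `knn_predict` helper and the toy ages are made up for illustration. The point is only that `k` is chosen by us before predicting, and that changing it can flip the prediction.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training points by absolute distance to the query (single feature).
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: ages and whether insurance was purchased (0 or 1).
ages = [20, 22, 25, 30, 41, 45, 50, 52]
purchased = [0, 0, 0, 1, 0, 1, 1, 1]

# Same data, same query, different hyperparameter value.
pred_k1 = knn_predict(ages, purchased, query=40, k=1)
pred_k5 = knn_predict(ages, purchased, query=40, k=5)
print(pred_k1, pred_k5)
```

With k=1 only the closest point (age 41) votes; with k=5 its four neighbors outvote it.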


## Loading the standard libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Loading the dataset

In [2]:
data = pd.read_csv('Health_Insurance.csv')
data.head()

Unnamed: 0,Age,Purchased
0,19,0
1,35,0
2,26,0
3,27,0
4,19,0


In [3]:
data.shape

(400, 2)

## Separate X and y

In [4]:
X = data[['Age']]
y = data['Purchased']

## Split the data into train and test sets

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

## Apply KNeighborsClassifier on the train data

In [6]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

In [7]:
knn.fit(X_train, y_train)

KNeighborsClassifier()

## Perform the prediction on X_test

In [8]:
y_pred = knn.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0], dtype=int64)

## Performance Check 

In [10]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.79
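
Accuracy is the fraction of predictions that match the true labels. The test set holds 25% of the 400 rows, i.e. 100 rows, so a score of 0.79 corresponds to 79 correct predictions. The sketch below does the same calculation by hand on short, made-up label lists; the `manual_accuracy` helper and the toy labels are illustrative assumptions, not part of the notebook.

```python
def manual_accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for illustration only.
toy_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
toy_pred = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]

toy_accuracy = manual_accuracy(toy_true, toy_pred)  # 8 of the 10 labels match
print(toy_accuracy)
```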

- Since we applied KNN with its default values, the number of nearest neighbors used by the algorithm is 5 (the default value of n_neighbors).

# Hyperparameter Tuning

## Q. For how many neighbors does KNN perform best?

- What happens to the accuracy of KNN when K is set to something other than the default, like k = 7, k = 11, or k = 15?
- At what value of K does the algorithm give the best accuracy?

In [11]:
knn = KNeighborsClassifier()

## Hyperparameter optimization
parameters = {'n_neighbors' : [7, 9, 11, 13, 15, 17, 19, 21]}

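## To apply KNN with different values of k, use GridSearchCV

- GridSearchCV(ML_model, parameters): tries every value in the parameters dictionary and scores each one with cross-validation (5 folds by default)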

In [12]:
from sklearn.model_selection import GridSearchCV
gscv = GridSearchCV(knn, parameters)

In [13]:
## Fit gscv on the train data only (GridSearchCV runs cross-validation on it internally)
gscv.fit(X_train, y_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [7, 9, 11, 13, 15, 17, 19, 21]})

In [14]:
gscv.cv_results_

{'mean_fit_time': array([0.00171452, 0.00214157, 0.00141969, 0.00161448, 0.00168486,
        0.00189257, 0.00120296, 0.00170975]),
 'std_fit_time': array([0.00038242, 0.0008151 , 0.00047672, 0.0005979 , 0.00038276,
        0.0001942 , 0.00039976, 0.0004047 ]),
 'mean_score_time': array([0.00326385, 0.00314922, 0.0023242 , 0.00235567, 0.00343447,
        0.00233326, 0.00256572, 0.00241904]),
 'std_score_time': array([0.00052642, 0.00123378, 0.00046367, 0.0005095 , 0.001147  ,
        0.00026441, 0.00032596, 0.00061035]),
 'param_n_neighbors': masked_array(data=[7, 9, 11, 13, 15, 17, 19, 21],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_neighbors': 7},
  {'n_neighbors': 9},
  {'n_neighbors': 11},
  {'n_neighbors': 13},
  {'n_neighbors': 15},
  {'n_neighbors': 17},
  {'n_neighbors': 19},
  {'n_neighbors': 21}],
 'split0_test_score': array([0.9       , 0.86666667, 0.86666667, 0.91666667, 0.9

In [16]:
## The raw results are hard to read, so convert them to a DataFrame
res = pd.DataFrame(gscv.cv_results_)
res

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001715,0.000382,0.003264,0.000526,7,{'n_neighbors': 7},0.9,0.816667,0.766667,0.75,0.816667,0.81,0.052281,7
1,0.002142,0.000815,0.003149,0.001234,9,{'n_neighbors': 9},0.866667,0.816667,0.733333,0.75,0.833333,0.8,0.050553,8
2,0.00142,0.000477,0.002324,0.000464,11,{'n_neighbors': 11},0.866667,0.833333,0.816667,0.75,0.833333,0.82,0.038586,6
3,0.001614,0.000598,0.002356,0.000509,13,{'n_neighbors': 13},0.916667,0.833333,0.816667,0.75,0.833333,0.83,0.053125,1
4,0.001685,0.000383,0.003434,0.001147,15,{'n_neighbors': 15},0.916667,0.833333,0.816667,0.75,0.833333,0.83,0.053125,1
5,0.001893,0.000194,0.002333,0.000264,17,{'n_neighbors': 17},0.916667,0.833333,0.816667,0.75,0.833333,0.83,0.053125,1
6,0.001203,0.0004,0.002566,0.000326,19,{'n_neighbors': 19},0.916667,0.833333,0.816667,0.75,0.833333,0.83,0.053125,1
7,0.00171,0.000405,0.002419,0.00061,21,{'n_neighbors': 21},0.916667,0.833333,0.816667,0.75,0.833333,0.83,0.053125,1


In [17]:
res[['param_n_neighbors', 'mean_test_score', 'rank_test_score']]

Unnamed: 0,param_n_neighbors,mean_test_score,rank_test_score
0,7,0.81,7
1,9,0.8,8
2,11,0.82,6
3,13,0.83,1
4,15,0.83,1
5,17,0.83,1
6,19,0.83,1
7,21,0.83,1
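
Instead of reading the winner off the table, a fitted GridSearchCV also exposes it through the `best_params_` and `best_score_` attributes. The sketch below shows this on synthetic data; the generated ages, the labelling rule, and the shorter grid are assumptions made so the example runs without `Health_Insurance.csv`, and the numbers it prints will not match the notebook's.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Age / Purchased columns (not the real dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=200).reshape(-1, 1)
# Older people are more likely to purchase; noise keeps the classes overlapping.
bought = (age.ravel() + rng.normal(0, 8, size=200) > 40).astype(int)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [7, 9, 11, 13]}, cv=5)
search.fit(age, bought)

# Best hyperparameter setting and its mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

In the notebook itself the equivalent calls would be `gscv.best_params_` and `gscv.best_score_`.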


- Conclusion:

1. The mean_test_score (the mean cross-validated accuracy over the 5 splits) does not increase once n_neighbors reaches 13
2. n_neighbors = 13 is the smallest value that reaches the highest accuracy (0.83) for the dataset; the values 15 to 21 tie with it
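
As a quick check of where mean_test_score and std_test_score come from, the sketch below recomputes both from the five per-split scores of the n_neighbors = 13 row. The use of the population standard deviation is an assumption that happens to reproduce the value in the table.

```python
from statistics import mean, pstdev

# Per-split accuracies for n_neighbors = 13, copied from the results table.
split_scores_k13 = [0.916667, 0.833333, 0.816667, 0.75, 0.833333]

mean_score = mean(split_scores_k13)    # corresponds to mean_test_score
std_score = pstdev(split_scores_k13)   # population std, corresponds to std_test_score

print(round(mean_score, 2), round(std_score, 4))
```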

### For which combination of n_neighbors and distance metric (Euclidean or Manhattan) is the accuracy highest?

In [18]:
knn = KNeighborsClassifier()

## Hyperparameter optimization
parameters = {'n_neighbors' : [7, 9, 11, 13, 15, 17, 19, 21], 'p' :[1, 2]}

In [19]:
from sklearn.model_selection import GridSearchCV
gscv = GridSearchCV(knn, parameters)

In [20]:
## Fit gscv on the train data only (GridSearchCV runs cross-validation on it internally)
gscv.fit(X_train, y_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [7, 9, 11, 13, 15, 17, 19, 21],
                         'p': [1, 2]})
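
With two hyperparameters in the grid, every combination of one n_neighbors value and one p value becomes a candidate. The sketch below enumerates the candidates with `itertools.product` to show how the amount of work grows; it only mirrors the counting and is not GridSearchCV's internal code. The fold count of 5 is the GridSearchCV default, which matches the split0 to split4 columns in the results.

```python
from itertools import product

param_grid = {'n_neighbors': [7, 9, 11, 13, 15, 17, 19, 21], 'p': [1, 2]}
n_folds = 5  # default number of cross-validation folds in GridSearchCV

# Every combination of one value per hyperparameter is a candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)     # 8 values of n_neighbors * 2 values of p
n_fits = n_candidates * n_folds    # one fit per candidate per fold

print(n_candidates, n_fits)
print(candidates[:2])
```

So this grid costs 16 candidates and 80 cross-validation fits, twice the work of the single-parameter search.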

In [22]:
res1 = pd.DataFrame(gscv.cv_results_)
res1

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,param_p,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001343,0.000428,0.003899,0.001733,7,1,"{'n_neighbors': 7, 'p': 1}",0.9,0.816667,0.766667,0.75,0.816667,0.81,0.052281,13
1,0.001901,0.000445,0.002968,0.000768,7,2,"{'n_neighbors': 7, 'p': 2}",0.9,0.816667,0.766667,0.75,0.816667,0.81,0.052281,13
2,0.00207,0.000564,0.00246,0.000447,9,1,"{'n_neighbors': 9, 'p': 1}",0.866667,0.816667,0.733333,0.75,0.833333,0.8,0.050553,15
3,0.001213,0.000247,0.003046,0.000646,9,2,"{'n_neighbors': 9, 'p': 2}",0.866667,0.816667,0.733333,0.75,0.833333,0.8,0.050553,15
4,0.00183,0.000524,0.003078,0.000896,11,1,"{'n_neighbors': 11, 'p': 1}",0.866667,0.833333,0.816667,0.75,0.833333,0.82,0.038586,11
5,0.001438,0.00052,0.002278,0.000337,11,2,"{'n_neighbors': 11, 'p': 2}",0.866667,0.833333,0.816667,0.75,0.833333,0.82,0.038586,11
6,0.001437,0.000379,0.003335,0.000993,13,1,"{'n_neighbors': 13, 'p': 1}",0.916667,0.833333,0.816667,0.75,0.833333,0.83,0.053125,1
7,0.001642,0.000381,0.002238,0.000493,13,2,"{'n_neighbors': 13, 'p': 2}",0.916667,0.833333,0.816667,0.75,0.833333,0.83,0.053125,1
8,0.001209,0.000399,0.002241,0.000239,15,1,"{'n_neighbors': 15, 'p': 1}",0.916667,0.833333,0.816667,0.75,0.833333,0.83,0.053125,1
9,0.001635,0.000371,0.002594,0.000344,15,2,"{'n_neighbors': 15, 'p': 2}",0.916667,0.833333,0.816667,0.75,0.833333,0.83,0.053125,1


In [23]:
res1[['param_n_neighbors', 'param_p', 'mean_test_score']]

Unnamed: 0,param_n_neighbors,param_p,mean_test_score
0,7,1,0.81
1,7,2,0.81
2,9,1,0.8
3,9,2,0.8
4,11,1,0.82
5,11,2,0.82
6,13,1,0.83
7,13,2,0.83
8,15,1,0.83
9,15,2,0.83


## Conclusion

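1. When n_neighbors is 13 and the distance metric is Manhattan distance (p=1), the accuracy is 83%
2. When n_neighbors is 13 and the distance metric is Euclidean distance (p=2), the accuracy is 83%

So the choice of distance does not affect the accuracy here: each value of n_neighbors scores the same for p=1 and p=2, and you can use either Manhattan or Euclidean distance. This is expected, because X has a single feature (Age) and with one feature both distances equal |a - b|. With more than one feature the two metrics can give different results.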
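
The identical scores for p=1 and p=2 can be checked directly from the distance formula. The sketch below is a plain-Python Minkowski distance; the `minkowski_distance` helper and the two-feature points are illustrative assumptions, not part of the notebook. With one feature the two metrics coincide, and with two features they no longer do.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length points; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# One feature (like Age): both metrics reduce to |a - b|.
one_d_manhattan = minkowski_distance([19], [35], p=1)
one_d_euclidean = minkowski_distance([19], [35], p=2)

# Two features (hypothetical second column): the metrics now differ.
two_d_manhattan = minkowski_distance([0, 0], [3, 4], p=1)
two_d_euclidean = minkowski_distance([0, 0], [3, 4], p=2)

print(one_d_manhattan, one_d_euclidean)
print(two_d_manhattan, two_d_euclidean)
```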