### Question.1. Write code to load in this .csv dataset. Start by splitting the dataset into a train and a test partition using the train_test_split method. Use the test partition as a validation set that is left out until a final validation step. Use a test size of 10%.

### Consider the following scenario: You are using the k-NN classifier model to predict diagnosis from the radius_mean, texture_mean, perimeter_mean and area_mean columns of the breast-cancer-data data set. You have performed a grid search experiment to determine which value of k optimizes the k-NN classifier. 

In [None]:
#for imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.metrics import accuracy_score

In [None]:
#load breast cancer dataset
df_bc = pd.read_csv("breast-cancer-data.csv")
df_bc

In [None]:
#Get the X and y values
X = df_bc[["radius_mean", "texture_mean", "perimeter_mean", "area_mean"]]
y = df_bc["diagnosis"]
print(X)
print(y)

In [None]:
#Partition data into train and test using test size of 10%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

#Creating a KNN classifier
knn = KNeighborsClassifier()

#Creating GridSearchCV

#Defining range
k_range = list(range(1, 31))
print(f"K range:{k_range}")

#Assigning range to test
param_grid = dict(n_neighbors=k_range)
print(f"Parameters:{param_grid}")

#Perform GridSearchCV on the training partition only, so the 10% hold-out stays unseen
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid.fit(X_train, y_train)

In [None]:
#Getting the best value of hyper parameter - k
print(f"Best value of k that optimizes the k-NN classifier: {grid.best_params_['n_neighbors']}")

### Question.2. Implement code to realize the scenario above, showing which optimal value of k is found that optimizes the model.  Use only the train part from the original 90-10% split performed above for this process.

In [None]:
#Fitting the classifier with the best k value on the training set

best_value_k = grid.best_params_['n_neighbors']
knn_classifier = KNeighborsClassifier(n_neighbors = best_value_k )
knn_classifier.fit(X_train, y_train)

#Predicting diagnosis
y_pred = knn_classifier.predict(X_test)

#Calculating the accuracy
acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy with k={best_value_k} is {acc_score:.4f}")


### Question.3. What issue in model selection does this paper address? How does that apply specifically to our scenario?
***

#### According to the paper, the model selection criterion itself has variance, which makes it possible to over-fit during model selection. Over-fitting in model selection in turn introduces an undesirable optimistic bias. In the paper's kernel ridge regression classifier example, run over one thousand independent realisations, the cross-validation estimate of the mean squared error, which forms the model selection criterion, kept decreasing. For a short while the model appeared to improve, but after approximately 30 to 40 iterations the test error began to climb. In conclusion, the paper argues that when over-fitting occurs in model selection (i.e. when selecting a model or its hyper-parameters), the chosen model performs exceptionally well on the training data but fails to generalize to test or unseen data. The paper also notes that high variance can lead to over-fitting in model selection and result in poor performance, even when the number of hyper-parameters is relatively small.

#### According to the paper, strategies for choosing the best model generally involve convex optimization problems. To avoid the issue, it suggests measures such as hyper-parameter tuning and k-fold cross-validation, among others.

#### In our scenario, we want to find the optimum value of our hyper-parameter with low variance and low bias. We therefore use GridSearchCV, which performs the hyper-parameter tuning with k-fold cross-validation.
***
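
The optimistic bias described above can be illustrated without any real data. The sketch below is a hypothetical simulation, not taken from the paper: it assumes every candidate model has the same true accuracy (`TRUE_ACC`) and that each cross-validation estimate is that accuracy plus Gaussian noise (`NOISE_SD`). Picking the candidate with the highest noisy estimate, as a grid search does, then reports a score above the true accuracy, and the gap grows with the number of candidates.

```python
import random
import statistics

TRUE_ACC = 0.90   # assumed true accuracy shared by every candidate (hypothetical)
NOISE_SD = 0.02   # assumed standard deviation of a CV estimate (hypothetical)

def selected_score(n_candidates, rng):
    """Return the best noisy CV estimate among n_candidates equally good models."""
    estimates = [rng.gauss(TRUE_ACC, NOISE_SD) for _ in range(n_candidates)]
    return max(estimates)

def mean_selected_score(n_candidates, n_trials=2000, seed=0):
    """Average the selected (maximum) estimate over many simulated experiments."""
    rng = random.Random(seed)
    return statistics.mean(selected_score(n_candidates, rng) for _ in range(n_trials))

for n in (1, 5, 30):
    bias = mean_selected_score(n) - TRUE_ACC
    print(f"{n:>2} candidates: mean selected score - true accuracy = {bias:+.4f}")
```

The 30-candidate case matches the number of k values in `k_range`. Real candidates are neither equally good nor independent, so the size of the effect on the actual data will differ.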

### Question.4. Based on your understanding of this issue and the recommendations of the paper, write code to implement a solution to the problem that likely affects our given scenario, according to the paper’s main thesis. In your code, compare the new training solution to the old one by testing on the left-out validation set above.

In [None]:
#Splitting the data into train, validation and test set
X_train_new, X_temp, y_train_new, y_temp = train_test_split(X, y, test_size = 0.20, random_state = 42)
X_val, X_test_new, y_val, y_test_new = train_test_split(X_temp, y_temp, test_size = 0.50, random_state = 42)

param_knn_4 = {
    'n_neighbors': range(1, 11),
    'leaf_size': (20, 40),
    'p': (1, 2),
    'weights': ('uniform', 'distance'),
    
}

param_knn_4

In [None]:
score_4 = []
for i in range(10):
    inner_cv = KFold(n_splits = 6, shuffle = True, random_state = i)
    grid_search = GridSearchCV(
        estimator = KNeighborsClassifier(),
        param_grid = param_knn_4,
        cv = inner_cv,
        n_jobs = 1
    )
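    #Outer 5-fold CV scores the whole tuning procedure, so each outer test fold stays out of model selection
    score = cross_val_score(grid_search, X_train, y = y_train, cv = 5)
    score_4.append(score.mean())
print("DONE")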

In [None]:
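#Getting the best hyperparameter values
#score_4 holds one nested-CV accuracy per repetition, so it estimates performance
#but cannot be used to index into param_knn_4. Fit a grid search on the training data instead.
final_search = GridSearchCV(
    estimator = KNeighborsClassifier(),
    param_grid = param_knn_4,
    cv = KFold(n_splits = 6, shuffle = True, random_state = 0),
    scoring = 'accuracy'
)
final_search.fit(X_train_new, y_train_new)
best_n_neighbors = final_search.best_params_['n_neighbors']
best_leaf_size = final_search.best_params_['leaf_size']
best_p = final_search.best_params_['p']
best_weights = final_search.best_params_['weights']

print(f"Mean nested CV accuracy over {len(score_4)} repetitions: {np.mean(score_4):.4f}")
print(f"The best k value with hyperparameter tuning with train-validation-test split is {best_n_neighbors}.")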

In [None]:
#Training KNN with the best hyperparameter values
best_knn_4 = KNeighborsClassifier(n_neighbors = best_n_neighbors, leaf_size = best_leaf_size, p = best_p, weights = best_weights)
best_knn_4.fit(X_train_new, y_train_new)

In [None]:
#Accuracy on the validation set
validation_accuracy = best_knn_4.score(X_val, y_val)
print(f"Validation accuracy with best hyperparameter values: {validation_accuracy:.4f}.")

In [None]:
#Accuracy on the test set
test_accuracy = best_knn_4.score(X_test_new, y_test_new)
print(f"Test accuracy with best hyperparameter values: {test_accuracy:.4f}.")

In [None]:
#Calculating the accuracy of the old model on the left-out set
acc_score = accuracy_score(y_test, y_pred)
print(f"(OLD) Test accuracy with k={best_value_k} before hyper parameter tuning is {acc_score:.4f}.")
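
One way to make the comparison direct is to report the grid search's own cross-validation score, a nested cross-validation score, and the accuracy on a single untouched hold-out set side by side. The sketch below is self-contained and uses synthetic data from `make_classification` as a stand-in for the breast cancer file, so its numbers say nothing about the real data set. The names `demo_grid`, `search`, and `nested_scores`, and the grid itself, are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the four-feature data set (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=400, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)
X_tr, X_hold, y_tr, y_hold = train_test_split(X_demo, y_demo, test_size=0.1,
                                              random_state=0, stratify=y_demo)

demo_grid = {"n_neighbors": list(range(1, 11)), "weights": ["uniform", "distance"]}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=inner_cv, scoring="accuracy")

# Nested CV: the outer test folds never take part in choosing the hyperparameters
nested_scores = cross_val_score(search, X_tr, y_tr, cv=outer_cv)

# Non-nested: the same folds both select the hyperparameters and produce the score
search.fit(X_tr, y_tr)

print(f"Best params: {search.best_params_}")
print(f"Non-nested CV accuracy: {search.best_score_:.4f}")
print(f"Nested CV accuracy:     {nested_scores.mean():.4f}")
print(f"Hold-out accuracy:      {search.score(X_hold, y_hold):.4f}")
```

Because the hold-out rows are split off before any tuning, the last figure is the one least affected by selection bias, at the cost of being computed on only 10% of the data.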