<h2><center>K-Nearest Neighbors and Cross Validation</center></h2>

## K-Nearest Neighbors Algorithm

The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning algorithm that can be used for both classification and regression problems. It is simple and easy to implement. A supervised machine learning algorithm learns the relationship between inputs and outputs from labelled examples, that is, inputs paired with their known outputs.

The KNN algorithm is built on the idea of similarity: it assumes that similar data points lie close to each other. It measures how close two points are by computing the distance between them. The Euclidean distance is one of the most familiar choices for measuring this proximity.
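
To make the distance idea concrete, here is a minimal sketch of the Euclidean distance between two points with the same number of features. The function name `euclidean_distance` and the sample points are illustrative assumptions, not part of the notebook.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    # Square the per-feature differences, add them up, take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative points, not taken from the dataset.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

A smaller value means the two points are more similar under this measure.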

#### The basic KNN algorithm includes the following steps:

1. Load the dataset.
2. Choose the number of neighbors, K.
3. For each example in the dataset:
    a. Calculate the distance between the query example and the current example.
    b. Add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest by distance.
5. Fetch the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels.
8. If classification, return the mode of the K labels.
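
The steps above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the implementation scikit-learn uses: the function name `knn_predict`, its `task` parameter, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k, task="classification"):
    """Predict a label for `query` from its k nearest training examples."""
    # Step 3: distance from the query to every example, stored with its index.
    distances = []
    for index, example in enumerate(train_X):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(example, query)))
        distances.append((dist, index))
    # Step 4: sort from smallest to largest distance.
    distances.sort(key=lambda pair: pair[0])
    # Steps 5 and 6: take the first k entries and look up their labels.
    labels = [train_y[index] for _, index in distances[:k]]
    # Step 7: regression returns the mean of the k labels.
    if task == "regression":
        return sum(labels) / len(labels)
    # Step 8: classification returns the mode (most common label).
    return Counter(labels).most_common(1)[0][0]

# Toy data (made up): two well-separated clusters.
train_X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
class_y = [0, 0, 0, 1, 1, 1]
value_y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(train_X, class_y, [2, 2], k=3))                     # 0
print(knn_predict(train_X, class_y, [7, 7], k=3))                     # 1
print(knn_predict(train_X, value_y, [1, 1], k=3, task="regression"))  # 2.0
```

Every prediction scans the whole training set, which is why KNN is cheap to "train" but can be slow to query on large datasets.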

### The Dataset

In [107]:
#Import necessary libraries
import numpy as np
import pandas as pd

#Load the dataset as a dataframe.
data = pd.read_csv('lung_cancer_examples.csv')

#Display the first 5 rows of the dataset.
data.head()

Unnamed: 0,Name,Surname,Age,Smokes,AreaQ,Alcohol,Result
0,John,Wick,35,3,5,4,1
1,John,Constantine,27,20,2,5,1
2,Camela,Anderson,30,0,5,2,0
3,Alex,Telles,28,0,8,1,0
4,Diego,Maradona,68,4,5,6,1


In [108]:
#Seeing the summary of the dataset using info() function
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 7 columns):
Name       59 non-null object
Surname    59 non-null object
Age        59 non-null int64
Smokes     59 non-null int64
AreaQ      59 non-null int64
Alcohol    59 non-null int64
Result     59 non-null int64
dtypes: int64(5), object(2)
memory usage: 3.3+ KB


In [109]:
#Checking the number of rows and columns using the shape attribute
data.shape

(59, 7)

After looking at the overall structure and summary of our dataframe, we can see that the columns 'Name' and 'Surname' contain string values. These cannot be meaningfully converted into integers or floats, so we exclude these columns.

Since the KNN algorithm works on the principles of supervised learning, we need both input and output. Looking at our dataset, we can take Age, Smokes, AreaQ and Alcohol as our input. The only column that we can keep as our output is 'Result'.

In [110]:
#Removing columns Name and Surname as they contain string values.
#Removing the Result column as it is our output variable.
Ip = data.drop(columns=['Name','Surname','Result'])

#Checking if our dataframe Ip contains only input variables.
Ip.head()

Unnamed: 0,Age,Smokes,AreaQ,Alcohol
0,35,3,5,4
1,27,20,2,5
2,30,0,5,2
3,28,0,8,1
4,68,4,5,6


In [111]:
#Storing the 'Result' column as a NumPy array Op.
Op = data['Result'].values

#Checking for the output variable.
Op[:10]

array([1, 1, 0, 0, 1, 0, 0, 0, 0, 1], dtype=int64)

We need to split the data into training data and testing data.
Here, we set test_size to 0.3, which means 30% of our data is testing data and the remaining 70% is training data.


In [112]:
#Importing necessary libraries
from sklearn.model_selection import train_test_split

#Splitting dataset into train and test data, stratified on the output labels
Ip_train, Ip_test, Op_train, Op_test = train_test_split(Ip, Op, test_size=0.3, random_state=1, stratify=Op)

In [113]:
#Importing necessary libraries
from sklearn.neighbors import KNeighborsClassifier

#Creating KNN classifier
k = KNeighborsClassifier(n_neighbors = 3)


#Using the fit() function to fit the classifier to the data
k.fit(Ip_train,Op_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

After creating the KNeighborsClassifier with n_neighbors set to 3 and fitting it to the training data, we get the model predictions using the predict() function. Here, we show the first 5 model predictions on the Ip_test data.

In [114]:
#Fetching first 5 model predictions on Ip_test data.
k.predict(Ip_test)[:5]

array([1, 0, 0, 1, 1], dtype=int64)

Here, we get 2 zeros and 3 ones. 0 denotes a patient predicted not to have cancer and 1 denotes a patient predicted to have cancer. This means the model predicts that two of these five patients do not have cancer.

In [115]:
#checking the accuracy of our model on the test data
k.score(Ip_test, Op_test)

0.8333333333333334

<b>Therefore, we have an 83.33% accurate model on the test data.</b>
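
For a classifier, score() reports accuracy: the fraction of test samples whose predicted label matches the true label. The sketch below shows that calculation by hand on made-up labels; `y_true` and `y_pred` are hypothetical and are not the notebook's actual test set.

```python
# Made-up labels for six hypothetical test samples.
y_true = [1, 0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0]

# Count the positions where the prediction equals the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(f"{correct} of {len(y_true)} correct, accuracy = {accuracy:.4f}")
```

With 59 rows and test_size=0.3, the test set should hold 18 samples, so a score of 0.8333 would correspond to 15 correct predictions.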

## Cross Validation

Cross validation is one of the simplest techniques used to evaluate machine learning algorithms. The dataset is split into k groups (folds). Each fold is used once as the test set while the remaining k-1 folds together form the training set, giving k accuracy scores.
Cross validation enables us to test on different splits of the data, so we get a better estimate of how the model performs on unseen data.
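
The splitting described above can be sketched as follows. This is a simplified illustration: the helper `k_fold_indices` is hypothetical, and shuffling with a fixed seed is an assumption of the sketch. scikit-learn's own splitters behave differently in detail (for example, an integer `cv` with a classifier uses stratified folds).

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into folds whose sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one goes into the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(n_samples=10, n_folds=5):
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Each sample appears in exactly one test fold, so every data point is used for testing once and for training k-1 times.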

In [116]:
#Importing necessary libraries
from sklearn.model_selection import cross_val_score

#create a new KNN model
kval = KNeighborsClassifier(n_neighbors=3)

#train model with cv of 7
kval_scores = cross_val_score(kval, Ip, Op, cv=7)

#print each cv score (accuracy) and average them
print(kval_scores)
print("Cross Validation Score(mean):{}".format(np.mean(kval_scores)))

[0.66666667 0.77777778 1.         1.         1.         1.
 0.875     ]
Cross Validation Score(mean):0.9027777777777778


Across the seven folds, the mean cross-validation accuracy is 90.28%. To find how many neighbours perform best, we can use GridSearchCV, which trains the model multiple times over a specified range of parameter values. This lets us test the model with each parameter value and find the one that gives the most accurate results.

In [117]:
#Importing necessary libraries
from sklearn.model_selection import GridSearchCV

#create a new knn model
knn = KNeighborsClassifier()

#create a dictionary of all values we want to test for n_neighbors
test_val = {"n_neighbors": np.arange(1, 15)}

#use gridsearch to test all values for n_neighbors
gs = GridSearchCV(knn, test_val, cv=7)

#fit model to data
gs.fit(Ip, Op)



GridSearchCV(cv=7, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [118]:
#check top performing n_neighbors value
gs.best_params_

{'n_neighbors': 1}

In [119]:
#check mean score for the top performing value of n_neighbors
gs.best_score_

0.9322033898305084
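
Conceptually, the grid search above is a loop: score every candidate value of n_neighbors with cross validation and keep the best one. The sketch below illustrates that idea in plain Python on made-up data. The helpers `knn_classify` and `cv_accuracy`, the unshuffled fold scheme, and the toy points are all assumptions of this sketch, not how GridSearchCV is implemented.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    def dist(p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, query)))
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds):
    """Mean accuracy of k-NN over n_folds folds (every n_folds-th sample, no shuffling)."""
    n = len(X)
    fold_scores = []
    for f in range(n_folds):
        test_idx = list(range(f, n, n_folds))
        train_idx = [i for i in range(n) if i % n_folds != f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(knn_classify(train_X, train_y, X[i], k) == y[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return sum(fold_scores) / n_folds

# Toy data (made up): two well-separated clusters of four points each.
X = [[1, 1], [8, 8], [1, 2], [8, 9], [2, 1], [9, 8], [2, 2], [9, 9]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Try each candidate value and keep the one with the highest mean accuracy.
candidates = [1, 3, 5]
scores = {k: cv_accuracy(X, y, k, n_folds=4) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On this toy data, k = 5 scores poorly because each training split holds only two points of the held-out class, so the other class always wins the vote. That is a reminder that the best K depends on the dataset.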

### CONCLUSION

Here, the optimal value for n_neighbors is 1, with a mean cross-validation accuracy of 93.22%. By using grid search to find the optimal parameter for the model, we improved the mean cross-validation accuracy by about 3 percentage points over the 90.28% obtained with n_neighbors = 3.