### Introduction to K Nearest Neighbors

This notebook provides a fundamental introduction to the KNeighborsClassifier from scikit-learn. It demonstrates building several versions of the model by varying the k values and assessing their performance. Additionally, it includes data preprocessing steps such as scaling to enhance the classifier's performance.

Furthermore, this template features detailed code explanations for more complex tasks.

#### Index

- [Assigning X and y](#-Problem-1)
- [Creating train/test split](#-Problem-2)
- [Column transformer](#-Problem-3)
- [Pipeline (5)](#-Problem-4)
- [Pipeline (50)](#-Problem-5)
- [False predictions](#-Problem-6)

In [7]:
#Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [4]:
#Connect to the data
default = pd.read_csv(r'C:\Users\agnek\OneDrive\Documents\Educational_Training Materials\default.csv')

In [4]:
#Quick overview of the data
default.head()

| Unnamed: 0.1 | Unnamed: 0 | default | student | balance | income |
|---|---|---|---|---|---|
| 0 | 1 | No | No | 729.526495 | 44361.625074 |
| 1 | 2 | No | Yes | 817.180407 | 12106.1347 |
| 2 | 3 | No | No | 1073.549164 | 31767.138947 |
| 3 | 4 | No | No | 529.250605 | 35704.493935 |
| 4 | 5 | No | No | 785.655883 | 38463.495879 |


[Back to top](#-Index)

### Assigning X and y

Define `X` as the three feature columns `student`, `balance`, and `income` (the leftover `Unnamed` index columns are excluded), and `y` as the target column `default`.

In [10]:
#Define X and y
X = default[['student', 'balance', 'income']]
y = default['default']

print(X.head())
print('==============')
print(y.head())

  student      balance        income
0      No   729.526495  44361.625074
1     Yes   817.180407  12106.134700
2      No  1073.549164  31767.138947
3      No   529.250605  35704.493935
4      No   785.655883  38463.495879
==============
0    No
1    No
2    No
3    No
4    No
Name: default, dtype: object


[Back to top](#-Index)

### Creating Train Test Split

Use the `train_test_split` function to split `X` and `y`, assigning 25% of the data to the test set. Set `random_state=42` so the split is reproducible.

In [11]:
#Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=42)

# Answer check
print(X_train.shape)
print(X_test.shape)

(7500, 3)
(2500, 3)
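As a small illustration of what `test_size` and `random_state` do, the sketch below splits a toy array twice with the same seed. The array and the names `X_toy`, `y_toy`, and `same_split` are made up for this example and are not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 rows, 2 features, alternating labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20) % 2

# Two splits with the same random_state select the same rows.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

same_split = np.array_equal(a_test, b_test)
print(a_train.shape, a_test.shape)  # (15, 2) (5, 2)
print(same_split)                   # True
```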


[Back to top](#-Index)

### Column Transformer

Use `make_column_transformer` to create a column transformer. Pass it an instance of scikit-learn's `OneHotEncoder` with `drop` set to `'if_binary'`, applied to the `student` column. For the `remainder` columns, apply a `StandardScaler()` transformation.

[Documentation for `make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)

In [12]:

# Create a column transformer
transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['student']), 
                                     remainder = StandardScaler())

# Answer check
print(transformer)

ColumnTransformer(remainder=StandardScaler(),
                  transformers=[('onehotencoder',
                                 OneHotEncoder(drop='if_binary'),
                                 ['student'])])


### Code Overview

This code creates a column transformer using the make_column_transformer function from the sklearn.compose module. This transformer applies different preprocessing steps to different columns of a DataFrame. Here’s a detailed explanation:

#### make_column_transformer:

This function is a convenience wrapper around ColumnTransformer, which constructs a transformer that can apply different preprocessing and transformation steps to different subsets of the features.

#### First Argument (`(OneHotEncoder(drop='if_binary'), ['student'])`):

This tuple pairs a transformer with the columns it should be applied to.

- `OneHotEncoder`: A transformer that converts categorical data into a one-hot encoded format (one 0/1 column per category).
- `drop='if_binary'`: Tells the encoder to drop one of the two columns when the feature is binary (i.e., has only two categories). The dropped column is fully determined by the one that is kept, so no information is lost and the redundant, perfectly collinear column is avoided.
- `['student']`: The column to which the `OneHotEncoder` is applied. In this case, it’s the `student` column.

#### Second Argument (remainder=StandardScaler()):

- `remainder`: Specifies what to do with the columns that were not explicitly assigned a transformer.
- `StandardScaler()`: Standardizes the remaining columns by removing the mean and scaling to unit variance. This matters for KNN because distances are computed from the feature values, so a feature on a large scale such as `income` would otherwise dominate one on a smaller scale such as `balance`.
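The standardization can be checked by hand. The sketch below uses a made-up `balance` column (the values are hypothetical) and compares `StandardScaler` with the formula (x - mean) / std, where the standard deviation is the population version (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical balance values, shaped as a single-feature column.
balance = np.array([[500.0], [700.0], [900.0], [1100.0]])

# Standardize with scikit-learn.
scaled = StandardScaler().fit_transform(balance)

# Standardize manually: subtract the mean, divide by the population std.
manual = (balance - balance.mean(axis=0)) / balance.std(axis=0)

print(np.allclose(scaled, manual))  # True
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```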

#### Breakdown

Transforming the `student` column:

- The `OneHotEncoder` is applied to the `student` column.
- Because `student` is a binary categorical column (`'Yes'` or `'No'`), one of the two categories is dropped to avoid redundancy.

Standardizing the remaining columns:

- All columns not listed explicitly (here `balance` and `income`) are standardized using the `StandardScaler`.
- When the transformer is fit, `StandardScaler` computes the mean and standard deviation of each feature on the training set. It then standardizes each feature by subtracting that mean and dividing by that standard deviation, and reuses the same training statistics when transforming the test set.
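The sketch below runs the same kind of transformer on a tiny made-up frame to show the output layout and the reuse of training statistics. The frames `toy_train` and `toy_test` and their values are hypothetical, chosen so the test row sits exactly at the training means.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows: mean balance is 800, mean income is 25000.
toy_train = pd.DataFrame({
    'student': ['No', 'Yes', 'No', 'Yes'],
    'balance': [500.0, 700.0, 900.0, 1100.0],
    'income': [40000.0, 12000.0, 30000.0, 18000.0],
})
# Hypothetical test row placed exactly at the training means.
toy_test = pd.DataFrame({'student': ['Yes'], 'balance': [800.0], 'income': [25000.0]})

toy_transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['student']),
    remainder=StandardScaler(),
)

# Statistics are learned from the training rows only.
train_out = toy_transformer.fit_transform(toy_train)
# The test row is transformed with those training statistics.
test_out = toy_transformer.transform(toy_test)

print(train_out.shape)  # (4, 3): encoded student column first, then the scaled remainder
print(test_out)         # student_Yes = 1, scaled balance and income both 0
```

The listed transformer's output comes first and the remainder columns follow, which is why the encoded `student` column appears in position 0.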

[Back to top](#-Index)

### Pipeline with KNN and n_neighbors = 5

Use the column `transformer` defined above to create a `Pipeline` named `fivepipe` with steps `transform` and `knn`, which transform the columns and then fit a KNN model using `KNeighborsClassifier()` with its default `n_neighbors=5`.

Use the `fit` method to fit the pipeline on the training data, then use the `.score` method of the fitted pipeline to compute the accuracy on the test data. Assign the result to `fivepipe_acc`.

In [13]:
# Create pipeline with n_neighbors = 5
fivepipe = Pipeline([('transform', transformer), ('knn', KNeighborsClassifier())])
fivepipe.fit(X_train, y_train)
fivepipe_acc = fivepipe.score(X_test, y_test)

# Answer check
print(fivepipe_acc)

0.968
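The pipeline above never sets `n_neighbors` explicitly; it relies on the `KNeighborsClassifier` default of 5. A minimal sketch for confirming this and for changing a step's parameter by name follows. The pipeline `demo_pipe` is hypothetical and uses a plain `StandardScaler` in place of the notebook's column transformer.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in pipeline with the same step names as the notebook.
demo_pipe = Pipeline([('transform', StandardScaler()), ('knn', KNeighborsClassifier())])

# Step parameters are addressed as '<step name>__<parameter>'.
k_default = demo_pipe.get_params()['knn__n_neighbors']
print(k_default)  # 5

# The same naming scheme updates a parameter in place.
demo_pipe.set_params(knn__n_neighbors=50)
k_updated = demo_pipe.named_steps['knn'].n_neighbors
print(k_updated)  # 50
```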




[Back to top](#-Index)

### Pipeline with n_neighbors = 50

Use the column `transformer` defined above to create a `Pipeline` named `fiftypipe` with steps `transform` and `knn`, which transform the columns and then fit a KNN model using `KNeighborsClassifier()` with `n_neighbors = 50`.

Use the `fit` method to fit the pipeline on the training data, then use the `.score` method of the fitted pipeline to compute the accuracy on the test data. Assign the result to `fiftypipe_acc`.


In [14]:
#Create pipeline with n_neighbors = 50
fiftypipe = Pipeline([('transform', transformer), ('knn', KNeighborsClassifier(n_neighbors=50))])
fiftypipe.fit(X_train, y_train)
fiftypipe_acc = fiftypipe.score(X_test, y_test)

# Answer check
print(fiftypipe_acc)

0.9712




[Back to top](#-Index)

### False Predictions

Finally, compare the two pipelines by their total number of errors on the test set, that is, false positives plus false negatives (FP + FN): every observation whose predicted `default` label differs from the true one. Assign these counts as integers to `five_fp` and `fifty_fp` respectively.


In [15]:
#Count the false predictions 
five_fp = sum(fivepipe.predict(X_test) != y_test)
fifty_fp = sum(fiftypipe.predict(X_test) != y_test)

# Answer check
print(f'Number of False Predictions with five neighbors: {five_fp}')
print(f'Number of False Predictions with fifty neighbors: {fifty_fp}')

Number of False Predictions with five neighbors: 80
Number of False Predictions with fifty neighbors: 72
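The counts above combine both kinds of error. If the split between false positives and false negatives is of interest, one option is `confusion_matrix` from `sklearn.metrics`. The sketch below uses made-up label arrays (`y_true` and `y_pred` are hypothetical, not the notebook's predictions) to show that FP + FN equals the mismatch count.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, with 'Yes' as the positive class.
y_true = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No'])
y_pred = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No'])

# With labels=['No', 'Yes'], ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=['No', 'Yes']).ravel()

# Total errors counted the same way as in the notebook cell.
errors = int((y_true != y_pred).sum())

print(fp, fn)   # 1 2
print(errors)   # 3
```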




### Summary Analysis
The two pipelines are compared by the number of false predictions each made on the test set:

- Five neighbors (k=5): 80
- Fifty neighbors (k=50): 72

### Analysis

#### Model Performance:

With five neighbors (k=5), the model made 80 false predictions on the 2,500 test observations (accuracy 0.968).
Increasing the number of neighbors to fifty (k=50) reduced the number of false predictions to 72 (accuracy 0.9712).

#### Impact of k on Model Accuracy:

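The reduction in false predictions when increasing k from 5 to 50 suggests that a higher k value may improve the model's accuracy on this dataset, although the difference is small: 8 of 2,500 test observations, or 0.32 percentage points, on a single train/test split.
This improvement is likely because a higher k value takes the vote over more neighboring data points, leading to more stable and generalized decision boundaries.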

#### Bias-Variance Tradeoff:

A smaller k value (e.g., k=5) typically results in a more flexible model that can capture finer details in the data but may also lead to higher variance and overfitting.
A larger k value (e.g., k=50) smooths out the decision boundaries, reducing variance and potentially improving generalization, as indicated by the fewer false predictions.
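One way to see this tradeoff is to compare training and test accuracy across several k values. The sketch below does so on synthetic, noisy data from `make_classification`, which is an assumption made for illustration and is not the notebook's dataset. With k=1 every training point is its own nearest neighbor, so training accuracy is perfect even though the labels are noisy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with 10% label noise (hypothetical stand-in data).
X_syn, y_syn = make_classification(n_samples=600, n_features=5, n_informative=3,
                                   n_redundant=0, flip_y=0.1, random_state=0)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42)

train_acc, test_acc = {}, {}
for k in (1, 5, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_syn_train, y_syn_train)
    # A large gap between training and test accuracy points to high variance.
    train_acc[k] = knn.score(X_syn_train, y_syn_train)
    test_acc[k] = knn.score(X_syn_test, y_syn_test)
    print(f'k={k:>2}  train={train_acc[k]:.3f}  test={test_acc[k]:.3f}')
```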

#### Optimal k Value:

While increasing k to 50 reduced false predictions here, increasing k further will eventually lead to underfitting, where the model becomes too generalized and fails to capture important patterns in the data.
Finding the optimal k value involves balancing the bias-variance tradeoff and is usually done with cross-validation on the training data, followed by a final evaluation on held-out data.
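A common way to run that search is `GridSearchCV` over the pipeline's `knn__n_neighbors` parameter. The sketch below is one possible setup under stated assumptions: it uses synthetic data and a plain `StandardScaler` in place of the notebook's column transformer, and the candidate list `k_grid` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the default dataset).
X_cv, y_cv = make_classification(n_samples=500, n_features=4, n_informative=3,
                                 n_redundant=0, random_state=0)
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
    X_cv, y_cv, test_size=0.25, random_state=42)

cv_pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])

# Candidate k values (an arbitrary grid chosen for illustration).
k_grid = [1, 5, 15, 25, 50]
search = GridSearchCV(cv_pipe, param_grid={'knn__n_neighbors': k_grid},
                      cv=5, scoring='accuracy')

# Cross-validation uses only the training portion.
search.fit(X_cv_train, y_cv_train)
best_k = search.best_params_['knn__n_neighbors']

print(best_k)
# The test set is scored once, with the selected k.
print(search.score(X_cv_test, y_cv_test))
```

Keeping the scaler inside the pipeline means it is refit within each fold, so the validation folds do not leak into the scaling statistics.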

### Conclusion
Increasing the number of neighbors in the KNeighborsClassifier from 5 to 50 has resulted in fewer false predictions, suggesting an improvement in model accuracy. However, it is crucial to continue exploring and validating different k values to find the optimal balance for the specific dataset and problem at hand. This process ensures that the model remains both accurate and generalizable.