# K Nearest Neighbors (KNN)

KNN assigns an example to the majority class among its `k` nearest neighbors.

![example](https://www.researchgate.net/publication/359786522/figure/fig3/AS:1142065312346112@1649300997168/Visualization-of-k-Nearest-Neighbors-with-two-classes-blue-circles-and-red-triangles.ppm)

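The KNN algorithm learns no parameters from the data, so it has no real `.fit` step: fitting only stores the training set. The number of neighbors `k` is a hyperparameter chosen by the user. It works as follows: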

1. Get data points and their classes.
2. Receive a new point for classification.
3. Calculate the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between that point and every point from step 1.
4. Take the classes of the closest `k` points.
5. Return the majority class.
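
The five steps above can be sketched for a single point using only the standard library. This is a minimal illustration, not the implementation requested later in the notebook: the function name `knn_predict_one`, the toy points, and the labels are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict_one(train_points, train_labels, new_point, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Step 3: Euclidean distance from the new point to every stored point.
    distances = [math.dist(p, new_point) for p in train_points]
    # Step 4: indices of the k smallest distances, then the classes at those indices.
    nearest = sorted(range(len(train_points)), key=distances.__getitem__)[:k]
    nearest_labels = [train_labels[i] for i in nearest]
    # Step 5: the most common class among the k neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


# Step 1: toy data points and their classes (hypothetical values).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["blue", "blue", "blue", "red", "red", "red"]

# Step 2: a new point to classify.
print(knn_predict_one(points, labels, (1.1, 1.0), k=3))  # prints "blue"
```

When two classes are tied, this sketch returns the class that appears first among the nearest neighbors. Choosing an odd `k` avoids ties in two-class problems.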

## Imports

In [147]:
# Add magic command to delete all saved variables

In [148]:
# Imports and constant values here

## Load data

In [149]:
# Load and display the `iris.csv` file from the GitHub repo.

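|     | sepal_length | sepal_width | petal_length | petal_width | species   |
| --- | ------------ | ----------- | ------------ | ----------- | --------- |
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | setosa    |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | setosa    |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | setosa    |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | setosa    |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | setosa    |
| ... | ...          | ...         | ...          | ...         | ...       |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | virginica |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | virginica |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | virginica |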


In [150]:
# Display a summary of the features.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [151]:
# Display summary statistics.

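|              | count | mean     | std      | min | 25% | 50%  | 75% | max |
| ------------ | ----- | -------- | -------- | --- | --- | ---- | --- | --- |
| sepal_length | 150.0 | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.8  | 6.4 | 7.9 |
| sepal_width  | 150.0 | 3.054    | 0.433594 | 2.0 | 2.8 | 3.0  | 3.3 | 4.4 |
| petal_length | 150.0 | 3.758667 | 1.76442  | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| petal_width  | 150.0 | 1.198667 | 0.763161 | 0.1 | 0.3 | 1.3  | 1.8 | 2.5 |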


In [152]:
# Check the distribution of the target feature.

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

## Data preparation

In [153]:
# Split the data for training and testing.

X_train.shape=(120, 4)
X_test.shape=(30, 4)
y_train.shape=(120,)
y_test.shape=(30,)
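
The shapes above correspond to an 80/20 split of the 150 rows. The notebook does not show how the split was made, so the following is only a sketch of one way to get the same shapes with NumPy. The helper name `shuffle_split`, the seed, and the placeholder feature matrix are assumptions. scikit-learn's `train_test_split` offers the same functionality.

```python
import numpy as np


def shuffle_split(X, y, test_size=0.2, seed=0):
    """Shuffle the rows and split them into training and testing parts."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    # A random permutation of the row indices keeps X and y aligned.
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Placeholder data with the same shape as the iris features (values are not real).
X = np.zeros((150, 4))
y = np.repeat(["setosa", "versicolor", "virginica"], 50)

X_train, X_test, y_train, y_test = shuffle_split(X, y)
print(f"{X_train.shape=}")
print(f"{X_test.shape=}")
print(f"{y_train.shape=}")
print(f"{y_test.shape=}")
```

A plain shuffle does not guarantee equal class counts in the test set, which is consistent with supports such as 10, 9, and 11 in a 30-row test set.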


## The `KNN` algorithm

Implement `KNN`.

It should have the following methods:

- `constructor`: accept an `n_neighbors` parameter;
- `fit`: accept an `X` matrix and a `y` vector and save them;
- `predict`: accept an `X` matrix and return, for each row, the majority class among its `n_neighbors` nearest training points.

In [154]:
class KNN:
  pass
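
One possible way to fill in the stub is sketched below with NumPy. It is not the only valid solution: the default `n_neighbors=5`, the attribute names `X_` and `y_`, and the tie-breaking rule are assumptions of this sketch, and the toy data is made up for the usage example.

```python
import numpy as np


class KNN:
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # KNN learns nothing here: it only stores the training data.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise Euclidean distances via broadcasting, shape (n_queries, n_train).
        diffs = X[:, np.newaxis, :] - self.X_[np.newaxis, :, :]
        distances = np.sqrt((diffs ** 2).sum(axis=2))
        # Indices of the n_neighbors closest training points for each query row.
        nearest = np.argsort(distances, axis=1)[:, : self.n_neighbors]
        predictions = []
        for neighbor_labels in self.y_[nearest]:
            # Majority vote; on a tie, np.unique's sorted order picks the smallest label.
            classes, counts = np.unique(neighbor_labels, return_counts=True)
            predictions.append(classes[np.argmax(counts)])
        return np.array(predictions)


# Usage on toy data (hypothetical values).
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])
model = KNN(n_neighbors=3).fit(X_toy, y_toy)
print(model.predict(np.array([[1.0, 0.9], [5.1, 5.0]])))  # ['a' 'b']
```

The broadcasting step builds an array of shape `(n_queries, n_train, n_features)`, which is fine for a dataset of this size but can use a lot of memory on large ones.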

## Model performance

### Before scaling

In [155]:
# Fit your model on the training data. Display performance metrics on the testing data.

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
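
To make the columns of the report above concrete, here is a small sketch of how per-class precision, recall, F1, and support can be computed from true and predicted labels. The function name `per_class_metrics` and the toy labels are assumptions for illustration. The report itself has the layout produced by scikit-learn's `classification_report`.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1, support)} for two label sequences."""
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
        fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
        fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Support is the number of true examples of this class.
        metrics[cls] = (precision, recall, f1, tp + fn)
    return metrics


# Toy labels (hypothetical, not the notebook's predictions).
y_true_toy = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred_toy = ["setosa", "setosa", "virginica", "virginica", "virginica"]

for cls, (precision, recall, f1, support) in per_class_metrics(y_true_toy, y_pred_toy).items():
    print(f"{cls:>12} {precision:.2f} {recall:.2f} {f1:.2f} {support}")
```

A report of all `1.00` values, as above, means every test example was classified correctly.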



In [156]:
# Fit the built-in KNeighborsClassifier on the training data. Display performance metrics on the testing data.

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



### After scaling

In [157]:
# Scale and display the testing data.

array([[ 0.35451684, -0.57925837,  0.5576453 ,  0.02332414],
       [-0.13307079,  1.67028869, -1.16259727, -1.17620281],
       [ 2.30486738, -1.02916778,  1.81915651,  1.48941263],
       [ 0.23261993, -0.35430366,  0.44296246,  0.42316645],
       [ 1.2077952 , -0.57925837,  0.61498672,  0.28988568],
       [-0.49876152,  0.77046987, -1.27728011, -1.04292204],
       [-0.2549677 , -0.35430366, -0.07311031,  0.15660491],
       [ 1.32969211,  0.09560575,  0.78701097,  1.48941263],
       [ 0.47641375, -1.9289866 ,  0.44296246,  0.42316645],
       [-0.01117388, -0.80421307,  0.09891395,  0.02332414],
       [ 0.84210448,  0.32056046,  0.78701097,  1.08957031],
       [-1.23014297, -0.12934896, -1.33462153, -1.44276436],
       [-0.37686461,  0.99542457, -1.39196294, -1.30948358],
       [-1.10824606,  0.09560575, -1.27728011, -1.44276436],
       [-0.86445224,  1.67028869, -1.27728011, -1.17620281],
       [ 0.59831066,  0.54551516,  0.5576453 ,  0.55644722],
       [ 0.84210448, -0.
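
The values above are centered around zero, which is what standardization (subtracting the mean and dividing by the standard deviation of each feature) produces. The notebook does not show which scaler was used, so the following is a sketch under that assumption. The helper name `standardize` and the toy arrays are hypothetical. The statistics are computed on the training data only and then reused for the testing data, as scikit-learn's `StandardScaler` does when fitted on the training set.

```python
import numpy as np


def standardize(X_train, X_test):
    """Scale both sets using the per-feature mean and std of the training set."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # population standard deviation (ddof=0)
    std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero
    return (X_train - mean) / std, (X_test - mean) / std


# Toy data with features on very different scales (hypothetical values).
X_train_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_test_toy = np.array([[2.0, 200.0]])

X_train_scaled, X_test_scaled = standardize(X_train_toy, X_test_toy)
print(X_train_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_train_scaled.std(axis=0))   # 1 for every feature
print(X_test_scaled)
```

Scaling matters for KNN because Euclidean distance is dominated by features with large numeric ranges. On the iris data all four features are in centimeters, which may explain why the metrics are the same before and after scaling.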

In [158]:
# Fit your model on the scaled training data. Display performance metrics on the scaled testing data.

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



In [159]:
# Fit the built-in model on the scaled training data. Display performance metrics on the scaled testing data.

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



## Apply KNN and Logistic Regression on the Titanic dataset

You are free to design the machine learning pipeline yourself.