## Notes

- K-NN (k-nearest neighbours) is a popular machine learning technique for supervised learning problems.
- It is mainly used for classification: it predicts the label of a new instance from the neighbouring instances.
- In classification, we assign a class label to a new data point based on its *k* closest neighbours (where *k* is a positive integer).
    - We classify based on the majority class among those neighbours.

**KNN for Classification**: 
1. We calculate the distance between every training point and the new instance (Euclidean distance; Manhattan distance can also be used).
2. We sort the points by increasing distance; the top 5 (k = 5) are the closest neighbours.
3. We find the majority class label among those neighbours and assign it to the new instance.
![](https://miro.medium.com/max/1374/0*q0Xqkta3uKCkzV6o.png)
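
The three classification steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `knn_classify` and the toy points are made up for this example, and ties in the vote are resolved by whichever label `Counter` encountered first.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=5):
    """Predict a label for `query` by majority vote among its k nearest neighbours."""
    # Step 1: Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: sort by increasing distance and keep the k closest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the labels of those neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration only
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_classify(points, labels, query=(2.0, 2.0), k=3)
print(prediction)  # the three closest points all carry label "A"
```

Swapping `math.dist` for a sum of absolute coordinate differences would give the Manhattan variant mentioned in step 1.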


**KNN for Regression**: 

Since regression is about continuous values, we predict a continuous value instead of a class label.

1. We calculate the distance between every training point and the new instance.
2. We sort the points by increasing distance; the top 5 (for example, k = 5) are the closest neighbours.
3. We calculate the average of the neighbours' `Y` values and assign it to the new instance.
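
The regression variant differs only in the last step: the neighbours' target values are averaged instead of voted on. A minimal sketch, assuming a hypothetical helper `knn_regress` and toy numbers chosen for this example:

```python
import math

def knn_regress(train_x, train_y, query, k=5):
    """Predict a continuous value as the mean target of the k nearest neighbours."""
    # Steps 1 and 2: sort training samples by distance to the query, keep the k closest
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    # Step 3: average the targets of those neighbours
    return sum(y for _, y in nearest) / len(nearest)

# Toy one-feature data, invented for illustration only
toy_x = [(1.0,), (2.0,), (3.0,), (4.0,), (10.0,)]
toy_y = [10.0, 20.0, 30.0, 40.0, 100.0]

estimate = knn_regress(toy_x, toy_y, query=(2.5,), k=2)
print(estimate)  # mean of the targets at x = 2.0 and x = 3.0
```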

- A heuristically optimal number *k* of nearest neighbours is chosen according to a relevant accuracy metric, using grid search and cross-validation.
    - The final predictions are then made with the chosen `k`.



- How does it compare to Linear Regression?
    - K-NN makes no assumption about the shape of the relationship, so it can capture non-linearity that a straight line cannot.
    - However, if the true relationship is close to linear, linear regression will usually perform better.
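
The comparison can be made concrete with a small experiment on synthetic data. The sketch below is only an illustration under stated assumptions: the helpers `fit_line`, `knn_predict`, `mse` and `compare` are hypothetical, the targets are noise-free, and the two target functions were picked to show each model at its best.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def knn_predict(xs, ys, query, k=2):
    """Average the targets of the k training points closest to `query`."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

train_x = [float(i) for i in range(-10, 11)]
test_x = [-7.3, -2.3, 2.3, 7.3]

def compare(target):
    """Return (linear regression MSE, K-NN MSE) on the test points."""
    train_y = [target(x) for x in train_x]
    test_y = [target(x) for x in test_x]
    slope, intercept = fit_line(train_x, train_y)
    linear_pred = [slope * x + intercept for x in test_x]
    knn_pred = [knn_predict(train_x, train_y, x) for x in test_x]
    return mse(test_y, linear_pred), mse(test_y, knn_pred)

quad_linear_mse, quad_knn_mse = compare(lambda x: x ** 2)      # non-linear target
line_linear_mse, line_knn_mse = compare(lambda x: 3 * x + 1)   # linear target

print("quadratic target:", quad_linear_mse, quad_knn_mse)
print("linear target:   ", line_linear_mse, line_knn_mse)
```

On the quadratic target K-NN has the lower error, while on the linear target the fitted line does; with noisy real data the gap would depend on the noise level and on `k`.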

---

## Code

### 1 - Reading the dataset

In [1]:
# Importing required libraries
import pandas as pd

data = pd.read_csv('data/Profit.csv')
data.head()

| | Marketing Spend | Profit |
|---|---|---|
| 0 | 471784.1 | 192261.83 |
| 1 | 443898.53 | 191792.06 |
| 2 | 407934.54 | 191050.39 |
| 3 | 383199.62 | 182901.99 |
| 4 | 366168.42 | 166187.94 |


In [3]:
# Checking the shape of the data
print('Shape of the dataset (No. of rows, No. of columns):', data.shape)


Shape of the dataset (No. of rows, No. of columns): (200, 2)


### 2 - Defining the input-output features

In [5]:
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

In [6]:
# Checking the shape of input and output features
print('Shape of the input features:', X.shape)
print('Shape of the output features:', y.shape)


Shape of the input features: (200, 1)
Shape of the output features: (200,)


### 3 - Defining the training-test features

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.1, random_state=17)

In [8]:
# Checking the shape of the training and test sets
print('Shape of the training input data:', X_train.shape)
print('Shape of the training output data:', y_train.shape)
print('Shape of the test input data:', X_test.shape)
print('Shape of the test output data:', y_test.shape)


Shape of the training input data: (180, 1)
Shape of the training output data: (180,)
Shape of the test input data: (20, 1)
Shape of the test output data: (20,)


### 4 - Defining and training a K-NN regression model

#### 4.1 - Initializing a K-NN Regression model

In [22]:
from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor()


#### 4.2 -  Hyperparameter tuning

To define the model, we need to know how many neighbours give the best results.

For this purpose, we use grid search with 10-fold cross-validation for hyperparameter tuning.



In [25]:
# Finding the optimal value of K
from sklearn.model_selection import GridSearchCV

k_range = list(range(1, 21))
param_grid = dict(n_neighbors=k_range)

grid = GridSearchCV(regressor, param_grid, cv=10, scoring='r2', return_train_score=False,verbose=0)

grid.fit(X_train, y_train)
print(grid.best_params_)


{'n_neighbors': 3}


As the output shows, the optimal number of neighbours is 3.
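
To make the search less of a black box, here is a rough sketch of what a grid search with cross-validation does: for each candidate `k`, score the model on held-out folds and keep the `k` with the best average score. It is a simplification, not scikit-learn's implementation: the helpers `knn_predict`, `r2` and `cross_val_r2` are hypothetical, the folds are interleaved rather than contiguous, 5 folds are used instead of 10, and the data is synthetic.

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to `query` (1-D inputs)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def cross_val_r2(xs, ys, k, n_folds=5):
    """Mean R-squared of a k-neighbour model over n_folds held-out folds."""
    scores = []
    for fold in range(n_folds):
        # Every n_folds-th sample, starting at index `fold`, is held out
        held_out = set(range(fold, len(xs), n_folds))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        val_idx = sorted(held_out)
        predicted = [knn_predict(train_x, train_y, xs[i], k) for i in val_idx]
        actual = [ys[i] for i in val_idx]
        scores.append(r2(actual, predicted))
    return sum(scores) / n_folds

# Synthetic data, invented for illustration only
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 10.0 * math.sin(x) for x in xs]

candidate_ks = range(1, 11)
cv_scores = {k: cross_val_r2(xs, ys, k) for k in candidate_ks}
best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "mean CV R-squared:", cv_scores[best_k])
```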

#### 4.3 - Defining and training the K-NN Regression model

In [26]:
# Defining the KNN regressor with optimal value of K

regressor = KNeighborsRegressor(n_neighbors=3)
regressor.fit(X_train, y_train)


KNeighborsRegressor(n_neighbors=3)

### 5 - Predicting and evaluating the predictions

#### 5.1 - Predicting part

In [27]:
# Making predictions on the test data
y_pred = regressor.predict(X_test)


In [28]:
# Comparing the predicted profits with actual profits
pd.DataFrame(data={'Predicted Profit': y_pred, 'Actual Profit': y_test})


| | Predicted Profit | Actual Profit |
|---|---|---|
| 0 | 192245.163333 | 192261.83 |
| 1 | 49474.083333 | 49530.75 |
| 2 | 152211.77 | 152161.77 |
| 3 | 106273.806667 | 105683.54 |
| 4 | 93709.043333 | 42509.73 |
| 5 | 97424.506667 | 97427.84 |
| 6 | 107401.006667 | 107404.34 |
| 7 | 96476.176667 | 96479.51 |
| 8 | 99934.256667 | 99937.59 |
| 9 | 155752.6 | 155702.6 |


#### 5.2 - Evaluating part

In [29]:
# Mean Squared Error (MSE)
from sklearn.metrics import mean_squared_error
MSE=mean_squared_error(y_test, y_pred)
print('Mean Squared Error is:', MSE)


Mean Squared Error is: 262014167.49203998


- The MSE looks very large, but it is expressed in squared units of profit, so it is hard to interpret directly. We check the RMSE next.

In [30]:
# Root Mean Squared Error (RMSE)
import math
RMSE= math.sqrt(MSE)
print('Root Mean Squared Error is:', RMSE)


Root Mean Squared Error is: 16186.851685613234


- The RMSE is expressed in the same units as the profit, which makes it easier to read: the typical (root-mean-square) error between actual and predicted profits on the test set is about 16186.85.
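
For reference, the two error metrics computed above are defined as follows, writing $y_i$ for the actual profit, $\hat{y}_i$ for the predicted profit and $n$ for the number of test samples (this notation is introduced here and is not part of the original notebook):

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$

With the value printed above, $\sqrt{262014167.49} \approx 16186.85$, which matches the RMSE output.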


To see how well the K-NN regression model fits the data, we compute R-squared, a measure of goodness of fit for regression models.
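
As a reminder of what this score measures, R-squared compares the model's squared errors with those of a baseline that always predicts the mean of the actual values $\bar{y}$ (the symbols $y_i$, $\hat{y}_i$ and $\bar{y}$ are notation introduced here):

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

A perfect model scores 1, a model no better than predicting the mean scores 0, and a model that does worse than the mean gets a negative score.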



In [31]:
# R-Squared
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print('R-Squared is:', r2)


R-Squared is: 0.8525144611428654


**Results** :

- The R-squared of about 0.85 on the test set is close to 1, the best possible score, so the model fits the data reasonably well and its predictions should be satisfactory.

