## K Neighbour Regressor

The idea is to predict a value as the average of the target values of its k nearest neighbours <b>(other aggregations, such as a distance-weighted average, are also possible)</b>.

The best value of k can be selected with k-fold cross-validation.
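As a rough illustration of selecting k with k-fold cross-validation, the sketch below runs a grid search over k = 1..30. The synthetic sine data, the 5 folds, and the names `X_demo`, `y_demo`, and `search` are all assumptions made for this example, not part of the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1-D data (assumed for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

# Score every k from 1 to 30 with 5-fold cross-validation.
# The default score of a regressor is R2.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

The selected k depends on the data and on how the folds are split, so it should be treated as a starting point rather than a fixed answer.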

In some examples, the adjusted R2 of the K neighbour regressor turns out much better than that of Linear Regression.

Unlike linear regression, it does not fit regression coefficients (slopes). Instead, it uses the k nearest neighbours of each point to build the prediction line (for k=1, the line passes through every training point) and predicts the outcome from it.


The major issue here is the <b>Curse of Dimensionality</b> 

As the number of features increases, the points spread out over a much larger space, so it becomes harder to find neighbours that are truly close and to capture the structure of the data.
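One way to see this effect is to compare the nearest and the farthest distance from a query point as the number of features grows. The sketch below uses uniformly random points, which is an assumption for illustration only; the helper name `nearest_to_farthest_ratio` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_features, n_points=500):
    """Ratio of the nearest to the farthest distance from one random query point."""
    points = rng.uniform(size=(n_points, n_features))
    query = rng.uniform(size=n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

# A ratio near 1 means the "nearest" neighbour is almost as far away as the farthest point
for d in (2, 10, 100, 1000):
    print(d, round(nearest_to_farthest_ratio(d), 3))
```

As the ratio approaches 1, the neighbours stop being meaningfully "near", which is what hurts a distance-based model.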

$$y = f(x) + e$$

$$f(x) = \frac{1}{K} \sum_{x_i \in N_K(x)} y_i$$

where $N_K(x)$ is the set of the K training points nearest to x, and e is the error term.
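The formula can be sketched in a few lines of NumPy. This is a minimal illustration using Euclidean distance, not the scikit-learn implementation; the function name `knn_predict` and the toy values are assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Average the targets of the k training points closest to x_query."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy values, assumed for illustration
X_toy = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_toy = [2, 4, 5, 4, 5]

print(knn_predict(X_toy, y_toy, x_query=[2.6], k=1))  # target of the single closest point
print(knn_predict(X_toy, y_toy, x_query=[2.6], k=3))  # mean of the three closest targets
```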

![image.png](attachment:image.png)

![image.png](attachment:image.png)

#### The value of K

The choice of K is a major factor, because it decides what the K neighbour regressor is going to predict.
![image.png](attachment:image.png)

For K=1, the line touches every point in the training data, so the training error is very low (low bias). But on new data the predictions have much higher variance. Hence the model overfits the training data.
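The K=1 behaviour can be checked on a small example. The sketch below assumes synthetic noisy sine data (not the notebook's dataset) and compares the R2 on the training split with the R2 on a held-out split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, assumed for illustration
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

knn_1 = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)

# Each training point is its own nearest neighbour, so the training fit is exact
train_r2 = knn_1.score(X_tr, y_tr)
# On unseen points the noise memorised from the training set hurts the score
test_r2 = knn_1.score(X_te, y_te)
print(train_r2, test_r2)
```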


![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)

As K increases, the line becomes smoother, because more data points are averaged to produce each prediction.

![image-5.png](attachment:image-5.png)
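The smoothing effect has a simple extreme case: when K equals the number of training points, every prediction is the mean of all training targets, so the line is flat. A small sketch on synthetic data (the data and the name `spread_by_k` are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, assumed for illustration
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=100)

# Predict on an evenly spaced grid and measure how much the predictions vary
grid = np.linspace(0, 10, 50).reshape(-1, 1)
spread_by_k = {}
for k in (1, 10, 100):
    preds = KNeighborsRegressor(n_neighbors=k).fit(X_demo, y_demo).predict(grid)
    spread_by_k[k] = preds.std()
    print(k, round(preds.std(), 3))
```

With K=100 (all training points) the spread of the predictions collapses to zero, which is the fully smoothed, high-bias end of the trade-off.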




##### R2

In [1]:
y = [2,4,5,4,5]
y_hat = [2.8, 3.4, 4, 4.6, 5.2]

In [2]:
from sklearn.metrics import r2_score

In [3]:
r2_score(y, y_hat)

0.6000000000000001

In [7]:
import numpy as np

In [8]:
y = np.array(y)
y_hat = np.array(y_hat)

In [17]:
1 - (sum((y-y_hat)**2) / sum( (y - np.mean(y))**2 ))

0.6000000000000001

In [15]:
sum( (y - np.mean(y))**2 )

6.0

In [19]:
(sum((y_hat-np.mean(y))**2) / sum( (y - np.mean(y))**2 ))

0.6000000000000001

<b>For a linear model evaluated on its training data, the r2 score never decreases when the number of coefficients increases</b>, because the sum of squared residuals, i.e. sum((y_i - y_hat_i)**2), can only stay the same or decrease.
This makes the numerator smaller. A smaller numerator divided by the denominator (which is constant for a given y) gives a smaller fraction. Subtracting this smaller fraction from 1 gives a higher value, hence a higher r2.
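This behaviour can be reproduced by adding features that have nothing to do with the target to an ordinary least squares fit. The sketch below assumes synthetic data with one real signal feature and ten pure-noise features; all names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, assumed for illustration
rng = np.random.default_rng(3)
n_samples = 50
x_signal = rng.normal(size=(n_samples, 1))
y_demo = 2.0 * x_signal[:, 0] + rng.normal(size=n_samples)
noise_features = rng.normal(size=(n_samples, 10))  # unrelated to y_demo

# Add the noise features one at a time and record the training r2 each time
train_r2_scores = []
for n_noise in range(0, 11):
    X_demo = np.hstack([x_signal, noise_features[:, :n_noise]])
    r2 = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)
    train_r2_scores.append(r2)

print(np.round(train_r2_scores, 3))
```

The training r2 creeps upward even though the added columns carry no information about the target.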


#### Adjusted r2

The adjusted r2 score was introduced to address the way r2 rises as features are added. r2 does not account for the number of features used to predict, so it reports a better fit even when the extra features add little. Adjusted r2 corrects for this by taking both the number of features and the number of samples into account.

The formula for adjusted r2 is:

r2_adjusted = 1 - ( (1-r2)(N-1)/(N-P-1) )

r2 = r2 score

P = number of independent features

N = number of samples in the dataset that the r2 score is computed on

![image.png](attachment:image.png)

#### How does adjusted r2 help fix the issue of r2 & an increasing number of parameters?

<b>When the features are correlated</b> with the target, the r2 score is high. Hence the (1-r2) part of the equation is very small, and multiplying it by the rest of the equation, i.e. (N-1)/(N-P-1), still gives a small fraction. Subtracting this small fraction from 1 gives a large r2_adjusted.

Hence there is a slight difference between r2 and r2_adjusted in this case.

<b>When the features are not correlated</b> with the target, the r2 score can still be moderate or slightly higher, simply because the larger number of features P gives the model more coefficients to fit. But subtracting P in the denominator makes the fraction (N-1)/(N-P-1) larger, so the term subtracted from 1 is larger and r2_adjusted is small.

Hence there is a huge difference between r2 & r2 Adjusted.


In all cases, r2_adjusted will be less than or equal to the r2 value.
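The relationship between r2 and r2_adjusted can be checked directly from the formula. A short derivation, writing r2 as $R^2$ and r2_adjusted as $R^2_{adj}$, and assuming $N > P + 1$ and $R^2 \le 1$:

$$
\begin{aligned}
R^2 - R^2_{adj} &= R^2 - 1 + \frac{(1 - R^2)(N - 1)}{N - P - 1} \\
&= (1 - R^2)\left(\frac{N - 1}{N - P - 1} - 1\right) \\
&= (1 - R^2)\,\frac{P}{N - P - 1} \;\ge\; 0
\end{aligned}
$$

So the gap between the two scores grows with P and with (1 - r2), and shrinks as N grows. For example, with r2 = 0.6, N = 5 and P = 1 the gap is 0.4 * 1/3 ≈ 0.133.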


<b>With the above in mind, r2_adjusted will not be a good metric for evaluating the K Neighbour Regressor.</b>

In [21]:
# Calculating r2_adjusted for above calculation:
R2 = r2_score(y, y_hat)

In [22]:
y

array([2, 4, 5, 4, 5])

In [28]:
N = len(y)
P = 1 # number of independent features (let's say 1 for now)

R2_Adjusted = 1 - (1-R2)*(N-1)/(N-P-1)

In [29]:
R2_Adjusted

0.4666666666666668

In [25]:
R2

0.6000000000000001

In [58]:
N = 100000000000000000000000000
P = 1 # number of independent features (let's say 1 for now)

R2_Adjusted = 1 - (1-R2)*(N-1)/(N-P-1)

In [59]:
R2_Adjusted

0.6000000000000001

In [66]:
N = 5
P = 1 # number of independent features (let's say 1 for now)

R2_Adjusted = 1 - (1-R2)*(N-1)/(N-P-1)

In [67]:
R2_Adjusted

0.4666666666666668

In [69]:
N = 5
P = 2  # number of independent features (now 2)

R2_Adjusted = 1 - (1-R2)*(N-1)/(N-P-1)
R2_Adjusted

0.20000000000000018

In [72]:
N = 5
P = 3  # number of independent features (now 3)

R2_Adjusted = 1 - (1-R2)*(N-1)/(N-P-1)
R2_Adjusted

-0.5999999999999996

Increasing N increases R2_adjusted (holding the rest constant), bringing it closer to R2.

Increasing P (the number of independent features) makes R2_adjusted drop sharply when N is small.

R2_adjusted is not a good fit for the K Neighbours Regressor, because increasing the number of dimensions (independent features) causes R2_adjusted to decrease.

Another reason could be that the K Neighbours Regressor is not a coefficient-based model, i.e. it doesn't assign a slope to each feature column, and hence r2 will not increase merely because the number of coefficients (number of independent features) increases.

### K Neighbour Regression Implementation

In [77]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [73]:
from sklearn.neighbors import KNeighborsRegressor

In [74]:
model = KNeighborsRegressor(n_neighbors=1)

In [75]:
model

KNeighborsRegressor(n_neighbors=1)

In [78]:
df = pd.read_csv('Filtered Laptop Data-Copy1.csv')

In [80]:
df.drop('Unnamed: 0', axis=1, inplace=True)

In [82]:
df.head()

| | Company | TypeName | Inches | ScreenResolution | Cpu | Ram | Memory | Gpu | OpSys | Weight | ... | Cpu Type | Cpu Power | SSD | HDD | Flash Storage | Hybrid | IPS | Touchscreen | Processor | Processor Power |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | 1.37 | ... | Intel Core i5 | 2.3GHz | 128.0 | | | | True | False | Intel | Core i5 |
| 1 | Apple | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | 1.34 | ... | Intel Core i5 | 1.8GHz | | | 128.0 | | | | Intel | Core i5 |
| 2 | HP | Notebook | 15.6 | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | 1.86 | ... | Intel Core i5 | 2.5GHz | 256.0 | | | | False | False | Intel | Core i5 |
| 3 | Apple | Ultrabook | 15.4 | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16GB | 512GB SSD | AMD Radeon Pro 455 | macOS | 1.83 | ... | Intel Core i7 | 2.7GHz | 512.0 | | | | True | False | Intel | Core i7 |
| 4 | Apple | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8GB | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | 1.37 | ... | Intel Core i5 | 3.1GHz | 256.0 | | | | True | False | Intel | Core i5 |


In [87]:
df.drop(columns=['Company', 'TypeName', 'ScreenResolution', 'Cpu',
       'Memory', 'Gpu', 'OpSys', 'DisplayType',
       'DisplayResolution', 'Cpu Type', 'Cpu Power', 
       'Flash Storage', 'Hybrid','Processor', 'Processor Power'], inplace=True)

In [100]:
df['IPS']

0        True
1           0
2       False
3        True
4        True
        ...  
1298     True
1299     True
1300        0
1301        0
1302        0
Name: IPS, Length: 1303, dtype: object

In [91]:
df.fillna(0, inplace=True)

In [92]:
X = df.drop(columns=['Price']).copy()
y = df['Price']

In [93]:
from sklearn.model_selection import train_test_split

In [94]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=36)

In [95]:
X_train

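| | Inches | Ram | Weight | SSD | HDD | IPS | Touchscreen |
|---|---|---|---|---|---|---|---|
| 100 | 15.6 | 8GB | 1.910 | 256.0 | 0.0 | False | False |
| 384 | 13.3 | 16GB | 1.100 | 512.0 | 0.0 | False | True |
| 1111 | 15.6 | 4GB | 2.240 | 0.0 | 500.0 | 0 | 0 |
| 348 | 11.6 | 4GB | 1.500 | 0.0 | 0.0 | False | True |
| 80 | 15.6 | 8GB | 1.880 | 256.0 | 0.0 | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1185 | 15.6 | 8GB | 2.591 | 256.0 | 1000.0 | True | False |
| 936 | 15.6 | 4GB | 2.180 | 0.0 | 1000.0 | 0 | 0 |
| 926 | 12.5 | 8GB | 1.360 | 256.0 | 0.0 | True | False |
| 610 | 15.6 | 32GB | 2.500 | 1000.0 | 0.0 | True | False |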
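Note that `Ram` still holds strings such as `8GB`, and `IPS` / `Touchscreen` mix booleans with the 0 introduced by `fillna`, so these columns would need to be numeric before a distance-based model can use them. The sketch below shows one possible conversion on a few rows copied from the `X_train` preview. The prices in `y_small` are made-up placeholders (not values from the dataset), and the scaling step is a suggestion, since the notebook does not show how the model is eventually fitted.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A few rows in the shape of the X_train preview
X_small = pd.DataFrame({
    "Inches": [15.6, 13.3, 15.6, 11.6, 15.6],
    "Ram": ["8GB", "16GB", "4GB", "4GB", "8GB"],
    "Weight": [1.91, 1.10, 2.24, 1.50, 1.88],
    "SSD": [256.0, 512.0, 0.0, 0.0, 256.0],
    "HDD": [0.0, 0.0, 500.0, 0.0, 0.0],
    "IPS": [False, False, 0, False, True],
    "Touchscreen": [False, True, 0, True, False],
})
# Placeholder prices, made up purely so the example can be fitted
y_small = pd.Series([900.0, 1500.0, 400.0, 350.0, 950.0])

X_num = X_small.copy()
# "8GB" -> 8
X_num["Ram"] = X_num["Ram"].str.replace("GB", "", regex=False).astype(int)
# True / False / 0 -> 1 / 0 / 0
X_num["IPS"] = X_num["IPS"].astype(int)
X_num["Touchscreen"] = X_num["Touchscreen"].astype(int)

# Scale first so that large-valued columns (SSD, HDD) do not dominate the distances
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=1))
knn.fit(X_num, y_small)
print(knn.predict(X_num.iloc[[0]]))
```

Scaling is a design choice worth considering for any neighbour-based model, because a feature measured in hundreds (storage in GB) would otherwise outweigh one measured in single digits (weight in kg).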
