### The purpose of this note
The purpose is to find the value of k that gives the best-performing KNN model, based on the test MSE.

### KNN Algorithm Steps:
1. Load the data and train the model
2. Initialize K to your chosen number of neighbors
3. For each observation in the train data:
   * Calculate the distance between the query (test) observation and the current observation from the train data.
   * Add the distance and the index of the observation to an ordered collection (memory table)
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If regression, return the mean of the K labels
8. If classification, return the mode of the K labels
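
The steps above can be sketched in plain Python without any library. This is only an illustrative sketch: the function name `knn_predict`, its `task` parameter, and the toy data are assumptions made for the example and are not part of the notebook or of scikit-learn.

```python
import math
from statistics import mean, mode

def knn_predict(train_X, train_y, query, k, task="regression"):
    """Predict the label of `query` from its k nearest training rows."""
    # Step 3: distance from the query to every training observation,
    # stored together with that observation's index.
    distances = [(math.dist(row, query), i) for i, row in enumerate(train_X)]
    # Step 4: sort ascending by distance.
    distances.sort()
    # Steps 5-6: keep the first k entries and look up their labels.
    k_labels = [train_y[i] for _, i in distances[:k]]
    # Steps 7-8: mean for regression, mode for classification.
    return mean(k_labels) if task == "regression" else mode(k_labels)

# Hypothetical toy data: one feature per observation and a numeric label.
train_X = [(10.0,), (20.0,), (30.0,), (40.0,), (50.0,)]
train_y = [2.0, 4.0, 6.0, 8.0, 10.0]

# The three nearest rows to 24.0 are 20.0, 30.0 and 10.0.
prediction = knn_predict(train_X, train_y, query=(24.0,), k=3)
print(prediction)
```

Note that "training" in this sketch is nothing more than keeping `train_X` and `train_y` around; all the work happens at prediction time.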


## Instructions:
- Read the Advertisement data and view the top rows of the dataframe to get an understanding of the data and the columns.
- Select the first 7 observations and the columns `TV` and `Sales` to make a new data frame.
- Create a scatter plot of the new data frame `TV` budget vs `Sales`.


## Reading data using pandas

**Pandas:** popular Python library for data exploration, manipulation, and analysis

- Anaconda users: pandas is already installed
- Other users: [installation instructions](http://pandas.pydata.org/pandas-docs/stable/install.html)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Read the data
What are the features?
- **TV:** advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- **Radio:** advertising dollars spent on Radio
- **Newspaper:** advertising dollars spent on Newspaper

What is the response?
- **Sales:** sales of a single product in a given market (in thousands of items)

What else do we know?
- Because the response variable is continuous, this is a **regression** problem.
- There are 200 **observations** (represented by the rows), and each observation is a single market.

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html" target="_blank">pd.read_csv(filename)</a>
Returns a pandas dataframe containing the data and labels from the file data

<a href="" target="_blank">data.head()</a>
Returns the first 5 rows of the dataframe with the column names

In [None]:
# read data into a DataFrame
data = pd.read_csv(r'Advertising.csv', index_col=0,header=0)
data.head()

|   | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


## Preparing X and y using pandas

- scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
- However, pandas is built on top of NumPy.
- Thus, X can be a pandas DataFrame and y can be a pandas Series!

In [None]:
X = data[['TV', 'radio', 'newspaper']]
Y = data['sales']
print(np.shape(data))

(200, 4)


### Splitting data into 80:20 train:test ratio

In [None]:
from sklearn.model_selection import train_test_split
#Split the data into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,random_state=10)

In [None]:
print(np.shape(X_train))

(160, 3)


In [None]:
print(np.shape(Y_train))

(160,)


In [None]:
print(np.shape(X_test))

(40, 3)


In [None]:
print(np.shape(Y_test))

(40,)


### Predicting using the KNeighborsRegressor

In [None]:
# Sample value passed for k; k can also be chosen by trial and error, as shown for the classifier

from sklearn.neighbors import KNeighborsRegressor
model_KNN=KNeighborsRegressor(n_neighbors=10, metric='euclidean')

#Printing all the parameters of KNN
print(model_KNN)

#fit the model on the data and predict the values
model_KNN.fit(X_train,Y_train)
Y_pred=model_KNN.predict(X_test)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                    weights='uniform')


### R-squared and Root Mean Squared Error of the model

In [None]:
#Metrics Evaluation
from sklearn.metrics import r2_score,mean_squared_error
r2score=r2_score(Y_test,Y_pred)
print("RSquare = ",r2score)
rmse=np.sqrt(mean_squared_error(Y_test,Y_pred))
print("RMSE = ", rmse)

RSquare =  0.8608042002876257
RMSE =  2.3801706871567005
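
As a rough illustration of what `r2_score` reports, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the numbers below are made up for this sketch and are not taken from the Advertising data.

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared gaps between actual and predicted values.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared gaps between actual values and their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted values.
y_true = [10, 12, 14, 16]
y_pred = [11, 12, 13, 17]

r2 = r_squared(y_true, y_pred)  # ss_res = 3, ss_tot = 20
print(r2)
```

A value of 1 means the predictions match the actual values exactly, while 0 means the model does no better than always predicting the mean of the actual values.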


### Accuracy of the model on testing data

In [None]:
# Accuracy on the testing data, taken as 100 minus the mean absolute percentage error (MAPE)
print('Accuracy',100- (np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100))

Accuracy 80.22229002154205
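
The "accuracy" printed here is 100 minus the mean absolute percentage error. A small sketch with made-up numbers shows the arithmetic step by step; the helper name `accuracy_from_mape` is hypothetical, and the calculation assumes that no actual value is zero.

```python
def accuracy_from_mape(y_true, y_pred):
    """Return 100 minus the mean absolute percentage error (MAPE)."""
    # Absolute error of each prediction as a fraction of the actual value.
    abs_pct_errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    mape = 100 * sum(abs_pct_errors) / len(abs_pct_errors)
    return 100 - mape

# Hypothetical actual and predicted values: errors of 10%, 10% and 0%.
y_true = [10, 20, 40]
y_pred = [9, 22, 40]

accuracy = accuracy_from_mape(y_true, y_pred)
print(round(accuracy, 2))
```

Because each error is divided by the actual value, small actual values can dominate this measure.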


### Accuracy of the model

In [None]:
# R-squared of the test predictions, expressed as a percentage
X_pred=model_KNN.predict(X_test)
print('Accuracy from the model {:.2f}'.format(
      r2_score(Y_test,X_pred)*100))

Accuracy from the model 86.08


Compared to linear regression (LR), R² is higher but RMSE is also higher, which is not good.
**So we use trial and error to find the optimum k value.**

In [None]:
np.sqrt(len(X_train))

12.649110640673518

In [None]:
from sklearn.metrics import r2_score
for K in range(1,16):
    model_KNN = KNeighborsRegressor(K,metric="euclidean")
    model_KNN.fit(X_train, Y_train)
    Y_pred = model_KNN.predict(X_test)
    print ("R-square is ", r2_score(Y_test,Y_pred), "for K-Value:",K)

R-square is  0.947505618529669 for K-Value: 1
R-square is  0.9293159406535845 for K-Value: 2
R-square is  0.9411330460488659 for K-Value: 3
R-square is  0.9394530965066451 for K-Value: 4
R-square is  0.9284857726620428 for K-Value: 5
R-square is  0.9171381300471962 for K-Value: 6
R-square is  0.8986986579101233 for K-Value: 7
R-square is  0.8941504138170911 for K-Value: 8
R-square is  0.8798578264845459 for K-Value: 9
R-square is  0.8608042002876257 for K-Value: 10
R-square is  0.8470466691144289 for K-Value: 11
R-square is  0.8483613805522731 for K-Value: 12
R-square is  0.8510763597780449 for K-Value: 13
R-square is  0.8490397921317565 for K-Value: 14
R-square is  0.8496233693296317 for K-Value: 15


**k=1 gives the highest R², but such a small k makes each prediction depend on a single neighbor, so we look further and find that k=3 is the optimum.**
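
The same trial-and-error search can also be driven by test RMSE instead of R². The sketch below is library-free and runs on a tiny made-up data set, so its numbers say nothing about the Advertising data; `knn_predict`, `rmse`, and `rmse_by_k` are hypothetical names introduced only for this illustration.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Mean label of the k training rows nearest to `query` (Euclidean)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return sum(label for _, label in nearest) / k

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical toy data: one feature, labels roughly equal to the feature.
train_X = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]
train_y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
test_X = [(1.4,), (3.6,), (5.2,)]
test_y = [1.4, 3.6, 5.2]

# Evaluate every candidate k on the test set and record its RMSE.
rmse_by_k = {}
for k in range(1, len(train_X) + 1):
    preds = [knn_predict(train_X, train_y, q, k) for q in test_X]
    rmse_by_k[k] = rmse(test_y, preds)

# The best k is the one with the lowest test RMSE.
best_k = min(rmse_by_k, key=rmse_by_k.get)
print(best_k, round(rmse_by_k[best_k], 4))
```

Picking k on the same test set that is later used to report the final score tends to give an optimistic estimate; a separate validation set or cross-validation is the usual remedy.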

In [None]:
from sklearn.neighbors import KNeighborsRegressor
model_KNN=KNeighborsRegressor(n_neighbors=3, metric='euclidean')
model_KNN.fit(X_train,Y_train)
Y_pred=model_KNN.predict(X_test)

We need an **evaluation metric** in order to compare our predictions with the actual values!

## Model evaluation metrics for regression

Evaluation metrics for classification problems, such as **accuracy**, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

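Three common evaluation metrics for regression problems are:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$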

In [None]:
#Metrics Evaluation

from sklearn.metrics import r2_score,mean_squared_error
r2score=r2_score(Y_test,Y_pred)
print(r2score)
rmse=np.sqrt(mean_squared_error(Y_test,Y_pred))
print(rmse)

0.9411330460488659
1.5478569414229182


Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average error.
- **MSE** is more popular than MAE, because MSE "punishes" larger errors.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.
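
To make the comparison concrete, the three metrics can be computed by hand on a few example values. The numbers below are invented for illustration and are unrelated to the model results above.

```python
import math

# Hypothetical actual and predicted values.
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]

# Per-observation errors: 10, 0, -20, -10.
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # back in the units of y

print(mae, mse, round(rmse, 4))
```

The single error of 20 contributes 400 of the 600 total squared error, which is why RMSE comes out larger than MAE here.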

Comparing LR and KNN, KNN with k=3 gives the better model on this small data set. It also avoids the more laborious LR steps. Hence we can opt for KNN instead of LR.

### Accuracy of the model on testing data

In [None]:
# Accuracy on the testing data, taken as 100 minus the mean absolute percentage error (MAPE)
print('Accuracy',100- (np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100))

Accuracy 86.55901704884404


### Accuracy of the model

In [None]:
# R-squared of the test predictions, expressed as a percentage
X_pred=model_KNN.predict(X_test)
print('Accuracy from the model {:.2f}'.format(
      r2_score(Y_test,X_pred)*100))

Accuracy from the model 94.11
