# Exercise E5-3: Non-Linear Regression by KNN

## Step 0: Introduction to the Theory

![image-2.png](attachment:image-2.png)

In [None]:
# Embed the introductory video
from IPython.display import YouTubeVideo

# YouTube video ID (named video_id to avoid shadowing the built-in id())
video_id = '3lp5CmSwrHI'

YouTubeVideo(id=video_id, width=600, height=300)

### Task
Abalones are small sea snails that look a bit like mussels. <br>
The age of an abalone corresponds to the number of rings inside its shell. By training and applying a model, we can estimate the age without cutting the shell and killing the abalone.<br>
Your task is to create a supervised machine learning model that predicts the age of an abalone from other measurable parameters, applying various regression methods, including the KNN method.


![image-5.png](attachment:image-5.png) ![image-6.png](attachment:image-6.png)

Images from<br>
up: By Sharktopus - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=14082271 <br>
down: CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=203608

## Step 1: Development Environment

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics as sm

from math import sqrt

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Get acquainted with this huge data repository: https://archive.ics.uci.edu/ml/datasets.php. <br>
From there find, read about, and download the data source Abalone https://archive.ics.uci.edu/ml/datasets/Abalone.<br>
Save a local copy of the data file. The example below reads it from ../../data/abalone.data, so adjust the path or file name (for example, abalone.csv) to match your copy.

In [None]:
# load data (the file has no header row, so the column names are supplied here)
abalone = pd.read_csv("../../data/abalone.data", header=None,
                      names=['Sex', 'Length', 'Diam', 'Height', 'Whole', 'Shucked', 'Viscera', 'Shell', 'Rings'])

In [None]:
abalone.shape

## Step 2: Data Exploration and Preparation

In [None]:
abalone.head()

In [None]:
abalone.describe()

In [None]:
# see the data types
abalone.info()

In [None]:
# to check null values in data
abalone.isnull().sum()

In [None]:
# drop the categorical column "Sex" (axis=1): KNN distances need numeric features
abalone = abalone.drop("Sex", axis=1)
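
Dropping "Sex" is the simplest option. An alternative, sketched here on a small made-up sample rather than the real file, is to one-hot encode the column so that the information is kept as numeric 0/1 features. The DataFrame `sample` and its values are illustrative assumptions; only the category codes M, F and I follow the dataset.

```python
import pandas as pd

# small made-up sample using the same 'Sex' codes as the abalone file (M, F, I)
sample = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.455, 0.530, 0.330, 0.440],
    "Rings": [15, 9, 7, 10],
})

# replace 'Sex' with one 0/1 indicator column per category
encoded = pd.get_dummies(sample, columns=["Sex"], dtype=int)
print(encoded.columns.tolist())
```

Whether the extra columns improve a KNN model on this data is something to check experimentally, since the 0/1 indicators also take part in the distance computation.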

In [None]:
# see the correlation between the features
corr_matrix = abalone.corr()
corr_matrix

In [None]:
# plot the matrix as a heat map
plt.subplots(figsize = (8, 6))
sns.heatmap(corr_matrix, annot=True)

In [None]:
corr_matrix["Rings"]

In [None]:
abalone["Rings"].hist(bins=15)
plt.show()

In [None]:
X = abalone.drop("Rings", axis=1)
X = X.values
y = abalone["Rings"]
y = y.values

In [None]:
X.shape

In [None]:
y.shape

In [None]:
# plot two features against each other, coloured by the number of rings
cmap = sns.cubehelix_palette(as_cmap=True)
f, ax = plt.subplots()
ax.set_xlabel('Shell weight')
ax.set_ylabel('Diameter')

# column 6 of X is 'Shell', column 1 is 'Diam'; c=y maps Rings onto the colormap
points = ax.scatter(X[:, 6], X[:, 1], c=y, cmap=cmap)
f.colorbar(points, label='Rings')

plt.show()


## Step 3: Train a Model

In [None]:
k = 10

### Use numpy

In [None]:
# add a new observation
new_data_point = np.array([0.569552, 0.446407, 0.154437, 1.016849, 0.439051, 0.222526, 0.291208])

In [None]:
# find the Euclidean distances from it to all data points
distances = np.linalg.norm(X - new_data_point, axis=1)
distances

In [None]:
# sort the distances and take the indices of the k nearest points
nearest_k = distances.argsort()[:k]
nearest_k

In [None]:
# get their rings
nearest_rings = y[nearest_k]
nearest_rings

In [None]:
# take their mean as a prediction for the rings of the new data point
prediction = nearest_rings.mean()
prediction
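
The NumPy steps above can be collected into one small function. This is a minimal sketch on toy data invented for illustration. The name `knn_regress` and the arrays `X_toy` and `y_toy` are assumptions, not part of the exercise.

```python
import numpy as np

def knn_regress(X_train, y_train, query, k=3):
    """Predict a target for `query` as the mean target of its k nearest rows."""
    # Euclidean distance from the query to every training row
    distances = np.linalg.norm(X_train - query, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # unweighted mean of the neighbours' targets
    return y_train[nearest].mean()

# toy data (made up, not taken from the abalone file)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_toy = np.array([1.0, 2.0, 3.0, 10.0])

# the three nearest rows to (0.1, 0.1) are the first three, so the mean is 2.0
toy_prediction = knn_regress(X_toy, y_toy, np.array([0.1, 0.1]), k=3)
print(toy_prediction)
```

With the notebook's variables, the equivalent call would be `knn_regress(X, y, new_data_point, k)`.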

### Use scikit-learn

In [None]:
# split the data into train (80%) and test (20%) sets
# random_state=1 fixes the shuffle, so the same split is reproduced on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
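
The effect of `random_state` can be checked directly. The sketch below uses a tiny made-up array (`X_demo` and `y_demo` are illustrative names) and shows that two calls with the same seed return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# two independent calls with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(split_a[1].shape)  # test part: 20% of 10 samples = 2 rows
```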

In [None]:
X_train

In [None]:
X_train.shape

In [None]:
y_train

In [None]:
# create an instance of the KNN regression model for our experiment
knn_model = KNeighborsRegressor(n_neighbors=k)

In [None]:
# fit the model to our train data
knn_model.fit(X_train, y_train)
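
KNN predictions depend on distances, so a feature with a larger numeric range contributes more to the result. One common remedy is to standardise the features before fitting. The following is a hedged sketch on synthetic data, not on the abalone file. The names `X_demo`, `y_demo` and `scaled_knn` are assumptions, and whether scaling helps on the abalone features is something to verify rather than assume.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic stand-in: two features on very different scales
X_demo = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
# the target depends only on the small-scale feature, plus a little noise
y_demo = 10 * X_demo[:, 0] + rng.normal(0, 0.1, 200)

# the scaler is fitted inside the pipeline, then KNN runs on the scaled features
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
scaled_knn.fit(X_demo, y_demo)

demo_pred = scaled_knn.predict(X_demo[:5])
print(demo_pred)
```

In the notebook, the same pipeline could be fitted with `X_train, y_train` in place of the synthetic arrays and compared against `knn_model` on the test set.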

In [None]:
# plot the train set: two features, coloured by the number of rings
plt.xlabel('Shell weight')
plt.ylabel('Diameter')

plt.scatter(X_train[:, 6], X_train[:, 1], c=y_train)
plt.colorbar(label='Rings')

plt.show()

In [None]:
# plot the test set: two features, coloured by the number of rings
plt.xlabel('Shell weight')
plt.ylabel('Diameter')

plt.scatter(X_test[:, 6], X_test[:, 1], c=y_test)
plt.colorbar(label='Rings')

plt.show()

## Step 4: Test the Model

In [None]:
# test it with the test data
y_predicted = knn_model.predict(X_test)

### Estimate the Errors in Prediction

In [None]:
# Mean Absolute Error (MAE) - the mean of the absolute value of the errors
mae = sm.mean_absolute_error(y_test, y_predicted)
mae

In [None]:
# Mean Squared Error (MSE) - the mean of the squared errors
mse = sm.mean_squared_error(y_test, y_predicted)
mse

In [None]:
# Root Mean Squared Error (RMSE) - the square root of the mean of the squared errors
rmse = sqrt(mse)
rmse
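
To make the three error measures concrete, they can be computed by hand with NumPy. The values below are invented for illustration and are not results from the abalone model.

```python
import numpy as np

# made-up true and predicted values
y_true = np.array([10.0, 8.0, 12.0, 9.0])
y_hat = np.array([9.0, 8.0, 14.0, 10.0])

errors = y_true - y_hat                # [1, 0, -2, -1]

mae_manual = np.abs(errors).mean()     # (1 + 0 + 2 + 1) / 4 = 1.0
mse_manual = (errors ** 2).mean()      # (1 + 0 + 4 + 1) / 4 = 1.5
rmse_manual = np.sqrt(mse_manual)      # square root of the MSE

print(mae_manual, mse_manual, rmse_manual)
```

RMSE is expressed in the same unit as the target (here, rings), which makes it easier to interpret than MSE.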

In [None]:
# Explained variance score: 1 is perfect prediction
evs = sm.explained_variance_score(y_test, y_predicted)
evs

In [None]:
# R-squared
R2 = sm.r2_score(y_test, y_predicted)
R2
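
The value k = 10 used above is one choice among many. A possible way to compare several values is a cross-validated grid search, sketched here on synthetic data. The arrays `X_demo` and `y_demo` and the candidate list of k values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)

# synthetic non-linear regression problem with 3 features
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = np.sin(4 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(0, 0.1, 300)

# try several neighbourhood sizes with 5-fold cross-validation
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 3, 5, 10, 20, 40]},
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the negative sign
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_k, best_rmse)
```

In the notebook, the search would be fitted on `X_train, y_train` only, keeping the test set for the final evaluation.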

## Step 5: Validation with New Data

In [None]:
# add a new observation, reshaped to one row because predict() expects a 2D array
new_data_point = np.array([0.569552, 0.446407, 0.154437, 1.016849, 0.439051, 0.222526, 0.291208]).reshape(1, -1)

In [None]:
# predict Rings for it
new_rings = knn_model.predict(new_data_point)
new_rings
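
The model predicts a number of rings, while the task asks for an age. According to the UCI description of the Abalone dataset, adding 1.5 to the ring count gives the age in years. A minimal sketch of that conversion follows, using hypothetical predictions in place of the real model output.

```python
import numpy as np

# hypothetical ring predictions, standing in for the output of knn_model.predict
predicted_rings = np.array([9.4, 11.0, 7.5])

# per the UCI dataset description: age in years = rings + 1.5
predicted_age_years = predicted_rings + 1.5
print(predicted_age_years)
```

In the notebook, the same offset could be applied to `new_rings`.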

## Reference

https://scikit-learn.org/stable/auto_examples/neighbors/plot_regression.html<br>
https://www.saedsayad.com/k_nearest_neighbors_reg.htm <br>
https://realpython.com/knn-python/<br>
https://www.youtube.com/watch?v=3lp5CmSwrHI
   