# Exploring K-Nearest Neighbors

To illustrate how to implement KNN for classification, we will use the data set from the previous modules. It has already been standardized, because this type of model requires features on a comparable scale.
<h3 style="font-family: Comic Sans MS; color: #68FF33">KNN for classification</h3>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
total_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_iris.csv')

X = total_data.drop(columns=['specie'])
y = total_data['specie']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 22 | -1.506521 | 1.249201 | -1.567576 | -1.315444 |
| 15 | -0.173674 | 3.090775 | -1.283389 | -1.05218 |
| 65 | 1.038005 | 0.098217 | 0.364896 | 0.264142 |
| 11 | -1.264185 | 0.788808 | -1.226552 | -1.315444 |
| 42 | -1.748856 | 0.328414 | -1.397064 | -1.315444 |


KNN makes its predictions from distances between observations, so the features must be scaled before fitting. If they are not on the same scale, those with larger magnitudes dominate the distance and distort the result of the algorithm.

For example, suppose we have two features: age (with values between 0 and 100) and annual income (with values between 0 and 100,000). Because of the difference in scale, annual income would have a disproportionate impact on the distance, and age would be practically ignored.
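The age and income example can be made concrete with a small sketch. The three people below are hypothetical values chosen only for illustration, and the scaling step simply divides by the ranges quoted above (a Min-Max scaling whose minimum is 0).

```python
import numpy as np

# Hypothetical people described as (age in years, annual income).
a = np.array([25.0, 50_000.0])
b = np.array([65.0, 50_500.0])  # 40 years older than a, almost the same income
c = np.array([26.0, 52_000.0])  # 1 year older than a, income differs by 2,000


def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# Raw features: the income difference swamps the age difference.
d_ab_raw = euclidean(a, b)
d_ac_raw = euclidean(a, c)

# Min-Max scaling with the ranges quoted in the text:
# age in [0, 100], income in [0, 100000].
ranges = np.array([100.0, 100_000.0])
d_ab_scaled = euclidean(a / ranges, b / ranges)
d_ac_scaled = euclidean(a / ranges, c / ranges)

print(f"raw:    d(a, b) = {d_ab_raw:.1f}, d(a, c) = {d_ac_raw:.1f}")
print(f"scaled: d(a, b) = {d_ab_scaled:.3f}, d(a, c) = {d_ac_scaled:.3f}")
```

On the raw values, `b` looks closer to `a` than `c` does, even though `b` is 40 years older. After scaling, `c` becomes the nearer neighbor because age now counts as well.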

Scaling the data lets all features contribute comparably to the distance, which can improve the performance of the KNN algorithm. Two common options are z-score standardization (each feature is centered at zero with unit variance) and Min-Max scaling (each feature is mapped to a fixed range such as [0, 1]). The choice between them depends on the behavior of the variables and on how each option affects model performance. As a rule of thumb, Min-Max scaling is a common choice when the features have very different scales and ranges, while standardization is usually adequate when they already have the same or a similar scale.
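Both scaling options are available in scikit-learn as `StandardScaler` and `MinMaxScaler`. The following is a minimal sketch on a made-up age/income array (`X_demo` is hypothetical, not the course data) that shows what each transformation guarantees.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical (age, annual income) rows, used only for illustration.
X_demo = np.array(
    [[20, 15_000], [35, 40_000], [50, 65_000], [65, 90_000]], dtype=float
)

# Z-score standardization: each column ends up with mean 0 and unit variance.
X_std = StandardScaler().fit_transform(X_demo)

# Min-Max scaling: each column is mapped onto the interval [0, 1].
X_minmax = MinMaxScaler().fit_transform(X_demo)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

In a real workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into the model.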

In [5]:
import plotly.express as px

# Map the numeric class codes to species names so the plot legend is readable
total_data['specie'] = total_data['specie'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

fig = px.scatter_3d(total_data, x = "petal width (cm)", y = "petal length (cm)", 
                    z = "sepal width (cm)", color = "specie", width = 1000, height = 500,
                    size = total_data["petal length (cm)"].abs(),
                    color_discrete_sequence = ["#E58139", "#39E581", "#8139E5"])
camera = dict(
    up=dict(x=1, y=3.5, z=0),
    eye=dict(x=2, y=0, z=0)
)

fig.update_layout(scene_camera=camera)
fig.show()

In [8]:
# The pairwise scatter plots below would normally be produced during the EDA
import matplotlib.pyplot as plt
import seaborn as sns

# fig, axis = plt.subplots(2, 3, figsize = (12, 6))

#palette = ["#E58139", "#39E581", "#8139E5"]
# sns.scatterplot(ax = axis[0, 0], data = total_data, x = "sepal length (cm)", y = "sepal width (cm)", hue = "specie")
# sns.scatterplot(ax = axis[0, 1], data = total_data, x = "sepal length (cm)", y = "petal length (cm)", hue = "specie")
# sns.scatterplot(ax = axis[0, 2], data = total_data, x = "sepal length (cm)", y = "petal width (cm)", hue = "specie")
# sns.scatterplot(ax = axis[1, 0], data = total_data, x = "sepal width (cm)", y = "petal length (cm)", hue = "specie")
# sns.scatterplot(ax = axis[1, 1], data = total_data, x = "sepal width (cm)", y = "petal width (cm)", hue = "specie")
# sns.scatterplot(ax = axis[1, 2], data = total_data, x = "petal length (cm)", y = "petal width (cm)", hue = "specie")

# plt.tight_layout()
# plt.show()

<img src="./plt2.png" width="1000" height="600" />
<br>
<br>Comparing the predictors pair by pair (to make it more graphical and explicit) shows the separation between classes more clearly. Points of the same class cluster together, so a KNN model is well suited to this problem.

In [9]:
from sklearn.neighbors import KNeighborsClassifier

# Default settings: n_neighbors=5, uniform weights, Euclidean distance
model = KNeighborsClassifier()
model.fit(X_train, y_train)

In [10]:
y_pred = model.predict(X_test)
y_pred 

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0], dtype=int64)

In [11]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

1.0
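To make clearer what `predict` and `accuracy_score` do in the cells above, here is a minimal sketch that repeats the majority vote by hand. It assumes scikit-learn's bundled `load_iris` data as a stand-in for `clean_iris.csv` and standardizes it with `StandardScaler`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for clean_iris.csv.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr_s, y_tr)

# Indices (into the training set) of the 5 nearest neighbors of each test point.
neighbor_idx = knn.kneighbors(X_te_s, return_distance=False)

# Majority vote: count the class labels among the neighbors, keep the most frequent.
neighbor_labels = y_tr[neighbor_idx]  # shape (n_test, 5)
manual_pred = np.array(
    [np.bincount(row, minlength=3).argmax() for row in neighbor_labels]
)

# Accuracy is the fraction of test points whose predicted label is correct.
manual_accuracy = float((manual_pred == y_te).mean())
print(manual_accuracy)
print(bool((manual_pred == knn.predict(X_te_s)).all()))
```

With uniform weights, each of the five neighbors casts one vote, and the class with the most votes is the prediction.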

<h3 style="font-family: Comic Sans MS; color: #68FF33">KNN for regression</h3>
<br>To illustrate KNN for regression, we will generate a synthetic data set that meets our needs.

In [12]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=4, noise=1, random_state=42)
X = pd.DataFrame(X, columns=['Var1', 'Var2', 'Var3', 'Var4'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

| | Var1 | Var2 | Var3 | Var4 |
|---|---|---|---|---|
| 29 | -0.51827 | 0.357113 | 1.477894 | -0.219672 |
| 535 | 0.457687 | -2.1207 | -0.606865 | -2.238231 |
| 695 | -0.224633 | 0.940771 | -0.982487 | -0.989628 |
| 557 | 0.360648 | -0.320298 | 1.643378 | -2.077812 |
| 836 | -0.307962 | -0.144519 | -0.79242 | -0.675178 |
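Because KNN relies on distances, it can be worth checking the scale of the generated features before fitting. The sketch below regenerates the same synthetic data and prints the per-feature mean and standard deviation. `make_regression` draws its features from a standard normal distribution, so the values should be close to 0 and 1, although only approximately for a finite sample.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same parameters as the cell above; X_check is an illustrative name.
X_check, _ = make_regression(n_samples=1000, n_features=4, noise=1, random_state=42)

# Per-feature mean and standard deviation.
feature_means = X_check.mean(axis=0)
feature_stds = X_check.std(axis=0)

print(np.round(feature_means, 2))
print(np.round(feature_stds, 2))
```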


Regression also requires the features to be on a comparable scale. In this case they already are, because `make_regression` draws every feature from a standard normal distribution. As before, we will draw the 3D plot and the pairwise relationships between the features of the artificially generated data set:

In [13]:
total_data = X.copy()
total_data["target"] = y

fig = px.scatter_3d(total_data, x = "Var1", y = "Var2", z = "Var3", color = "target", width = 1000, height = 500,
                    size = total_data["Var4"].abs())
camera = dict(
    up = dict(x = 1, y = 3.5, z = 0),
    eye = dict(x = 2, y = 0, z = 0)
)

fig.update_layout(scene_camera = camera)
fig.show()

In [None]:
# fig, axis = plt.subplots(2, 3, figsize = (15, 7))

# palette = sns.color_palette("gnuplot2_r", as_cmap=True)
# sns.scatterplot(ax = axis[0, 0], data = total_data, x = "Var1", y = "Var2", hue = "target", palette = palette)
# sns.scatterplot(ax = axis[0, 1], data = total_data, x = "Var1", y = "Var3", hue = "target", palette = palette)
# sns.scatterplot(ax = axis[0, 2], data = total_data, x = "Var1", y = "Var4", hue = "target", palette = palette)
# sns.scatterplot(ax = axis[1, 0], data = total_data, x = "Var2", y = "Var3", hue = "target", palette = palette)
# sns.scatterplot(ax = axis[1, 1], data = total_data, x = "Var2", y = "Var4", hue = "target", palette = palette)
# sns.scatterplot(ax = axis[1, 2], data = total_data, x = "Var3", y = "Var4", hue = "target", palette = palette)

# plt.tight_layout()

# plt.show()

<img src="plt3.png" width="1000" height="600" />
<br>
<br>For most pairs of variables the target values follow a visible pattern, which suggests that regression can yield good results.

In [14]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor()
model.fit(X_train, y_train)

In [15]:
y_pred = model.predict(X_test)

In [16]:
from sklearn.metrics import mean_squared_error, r2_score

print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
print(f"R2 Score: {r2_score(y_test, y_pred)}")

Mean Squared Error: 564.8367646867368
R2 Score: 0.9547914514799766
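With its default uniform weights, `KNeighborsRegressor` predicts the mean target of the nearest training points. The following sketch, which regenerates the same synthetic data under illustrative variable names, reproduces the predictions by hand and also reports the root of the MSE, which is expressed in the same units as the target and is often easier to read.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X_reg, y_reg = make_regression(n_samples=1000, n_features=4, noise=1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)

# Indices (into the training set) of the 5 nearest neighbors of each test point.
neighbor_idx = reg.kneighbors(X_te, return_distance=False)

# Uniform weights: the prediction is the plain average of the neighbors' targets.
manual_pred = y_tr[neighbor_idx].mean(axis=1)
matches_sklearn = bool(np.allclose(manual_pred, reg.predict(X_te)))

# RMSE is the square root of the MSE, in the units of the target.
rmse = float(np.sqrt(mean_squared_error(y_te, reg.predict(X_te))))

print(matches_sklearn)
print(rmse)
```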
