## What is the K-Nearest-Neighbor algorithm?
(https://realpython.com/knn-python/)


KNN (k-nearest neighbors) is a supervised learning algorithm that can be used for classification and regression.
The prediction for a new data point is derived from the targets of its k nearest neighbors.
The number of neighbors is set by the parameter k.
Neighbors are determined by the distance between data points.
The distance used here is the Euclidean distance.
The Euclidean distance between two points (x1, y1) and (x2, y2) is: sqrt((x1 - x2)^2 + (y1 - y2)^2)
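
The two-point formula extends to any number of features, which matters here because each abalone record is compared on eight of them. As a sketch, writing p and q for two points with n coordinates (symbols introduced for illustration only):

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

For the points (4, 7) and (3, 5) this gives

$$
d = \sqrt{(4 - 3)^2 + (7 - 5)^2} = \sqrt{1 + 4} = \sqrt{5} \approx 2.236
$$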

In [5]:
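import numpy as np

# Two example points in 2D
data1 = np.array([4, 7])
data2 = np.array([3, 5])

# Euclidean distance written out by hand ...
distance = np.sqrt(np.sum((data1 - data2) ** 2))
# ... and with NumPy's built-in vector norm
distance2 = np.linalg.norm(data1 - data2)
print(distance)
print(distance2)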




2.23606797749979
2.23606797749979


In [1]:
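import pandas as pd

# Note: abalone.data has no header row, so read_csv treats the first record
# as column names here (see the output below); header=None would keep it as data.
df = pd.read_csv("abalone.data")
df.head()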

Unnamed: 0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
0,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


    Name            Data Type   Meas.   Description
    ----            ---------   -----   -----------
    Sex             nominal             M, F, and I (infant)
    Length          continuous  mm      Longest shell measurement
    Diameter        continuous  mm      perpendicular to length
    Height          continuous  mm      with meat in shell
    Whole weight    continuous  grams   whole abalone
    Shucked weight  continuous  grams   weight of meat
    Viscera weight  continuous  grams   gut weight (after bleeding)
    Shell weight    continuous  grams   after being dried
    Rings           integer             +1.5 gives the age in years

In [2]:
columns= ["sex","length","diameter","height","whole_weight","shucked_weight","viscera_weight","shell_weight","rings"]
df.columns=columns
df.head()

Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings
0,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


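The target is rings (the ring count plus 1.5 gives the age of the abalone in years).
sex is encoded as 0 for M and 1 otherwise, so F and I share the same value.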

In [8]:
X = df.drop("rings", axis=1)
y = df["rings"]
X["sex"]=X["sex"].apply(lambda x: 0 if x=="M" else 1)
X.head()

Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight
0,0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07
1,1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21
2,0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155
3,1,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055
4,1,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12
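
The 0/1 mapping above puts F and I into the same group. One possible alternative, shown only as a sketch on a small made-up frame (`toy` and its values are hypothetical), is one-hot encoding with `pd.get_dummies`, which gives every category its own column:

```python
import pandas as pd

# Hypothetical mini data set containing all three sex categories
toy = pd.DataFrame({
    "sex": ["M", "F", "I", "M"],
    "length": [0.35, 0.53, 0.33, 0.44],
})

# One indicator column per category; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)
print(encoded)
```

The rest of this notebook keeps the simpler 0/1 column, so the outputs shown here are based on that encoding.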


In [9]:
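# New data point in the same column order as X (sex, length, ..., shell_weight)
new_data_point = np.array([0, 0.5, 0.4, 0.1, 0.2, 0.1, 0.1, 0.1])

# Euclidean distance from the new point to every row of X
distances = np.linalg.norm(X - new_data_point, axis=1)
print(distances)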


[0.21219743 1.12621867 0.3490949  ... 1.11296743 1.44068673 2.02033407]


distances now holds the Euclidean distance from the new data point to every data point in X.
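
To make the row-wise computation easier to follow, here is a small sketch with made-up 2D points (`points` and `query` are hypothetical, not part of the abalone data). Subtracting the query broadcasts it across all rows, and `axis=1` takes one norm per row:

```python
import numpy as np

# Hypothetical data: three points and one query point in 2D
points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
query = np.array([0.0, 0.0])

# Broadcasting subtracts the query from every row
diffs = points - query

# axis=1 computes one Euclidean norm per row, i.e. one distance per point
toy_distances = np.linalg.norm(diffs, axis=1)
print(toy_distances)
```

The result has one entry per row of `points`, in the same order as the rows.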

In [10]:
# argsort: Returns the indices that would sort this array.
k = 5
nearest_neighbors = distances.argsort()[:k]
print(nearest_neighbors)

[2325  541  118 3346  528]
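
The printed values are row positions, not distances. A minimal sketch with a made-up distance array (`toy_distances` is hypothetical) shows what `argsort` followed by slicing returns:

```python
import numpy as np

# Hypothetical distances from a query point to four data points
toy_distances = np.array([0.9, 0.1, 0.5, 0.3])

# Positions of the entries, ordered from smallest to largest distance
order = np.argsort(toy_distances)

# Keep the positions of the k closest points
k = 2
nearest = order[:k]
print(order)
print(nearest)
```

Position 1 comes first because 0.1 is the smallest distance, followed by position 3 with 0.3.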


In [11]:
y[nearest_neighbors]

2325    11
541     15
118      8
3346    11
528      9
Name: rings, dtype: int64

In [12]:
# regression: mean of the targets of the k nearest neighbors
# classification: most common class among the k nearest neighbors

age_regression= y[nearest_neighbors].mean()

# mode(): Compute the mode along the specified axis. Mode is the most common value in the array.
age_classification= y[nearest_neighbors].mode()[0]

print("regression: ", age_regression)
print("classification: ", age_classification)

regression:  10.8
classification:  11
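
The manual steps so far (distances, `argsort`, then mean or mode) can be bundled into one function. The following is a sketch under assumptions, not a reference implementation: `knn_predict` and its `mode` parameter are names made up for this example, and the toy data is chosen so that the five nearest targets match the ring counts shown above.

```python
import numpy as np

def knn_predict(X, y, query, k=5, mode="regression"):
    """Hypothetical helper: predict a target for `query` from its k nearest rows in X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Euclidean distance from the query to every row of X
    distances = np.linalg.norm(X - np.asarray(query, dtype=float), axis=1)
    # Positions of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y[nearest]
    if mode == "regression":
        # Regression: average the neighbors' targets
        return neighbor_targets.mean()
    # Classification: most frequent target; ties resolve to the smallest value
    values, counts = np.unique(neighbor_targets, return_counts=True)
    return values[np.argmax(counts)]

# Toy data: one feature, the last row is far away from the query
X_toy = [[0.0], [1.0], [2.0], [3.0], [4.0], [100.0]]
y_toy = [11, 15, 8, 11, 9, 30]

reg_result = knn_predict(X_toy, y_toy, [2.0], k=5, mode="regression")
clf_result = knn_predict(X_toy, y_toy, [2.0], k=5, mode="classification")
print(reg_result, clf_result)
```

With k=5 the distant sixth row is ignored, so the function works on the same five targets as the manual example.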


# KNN with scikit-learn

In [13]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [16]:
knn_reg.predict([new_data_point])

array([10.4])

In [17]:
knn_clf.predict([new_data_point])

array([9], dtype=int64)

In [19]:
# print root mean squared error for the regression model
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(y_test, knn_reg.predict(X_test)))
print(rms)

2.2626359696381737


In [21]:
# this means: predictions are typically off by about 2.3 rings (RMSE weights large errors more heavily than a plain average)
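
To show what the root mean squared error measures, here is a sketch that computes it by hand on made-up values (`y_true` and `y_pred` are hypothetical, not the notebook's test set) and compares it with the mean absolute error:

```python
import numpy as np

# Hypothetical true and predicted ring counts
y_true = np.array([10, 8, 12, 9])
y_pred = np.array([11, 8, 9, 9])

errors = y_true - y_pred

# RMSE: square the errors, average them, take the square root
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average of the absolute errors
mae = np.mean(np.abs(errors))

print(rmse, mae)
```

The single error of 3 rings pulls the RMSE above the MAE, which is why an RMSE is better read as a typical error size than as a literal average.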

In [20]:
# print accuracy for the classification model
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, knn_clf.predict(X_test)))

0.21770334928229665


In [22]:
# this means: the exact ring count was predicted for only about 22% of the test samples
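
Accuracy only counts exact matches, which is strict when the classes are ordered ring counts. The sketch below computes it by hand on made-up values and, purely as an illustration, adds a "within one ring" rate; that tolerance measure and the names `exact_accuracy` and `within_one` are assumptions of this example, not something the notebook evaluates.

```python
import numpy as np

# Hypothetical true and predicted ring counts
y_true = np.array([10, 8, 12, 9, 7])
y_pred = np.array([11, 8, 9, 9, 7])

# Accuracy: share of predictions that match exactly
exact_accuracy = np.mean(y_true == y_pred)

# Illustrative tolerance measure: share of predictions at most one ring off
within_one = np.mean(np.abs(y_true - y_pred) <= 1)

print(exact_accuracy, within_one)
```

A prediction of 11 for a true value of 10 counts as wrong for accuracy but as close for the tolerance measure.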