# K nearest neighbors

We will implement the algorithm covered in the theory session, using scikit-learn.

Scikit-learn provides an implementation of the KNN classifier: [KNeighborsClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

And another one for the regressor: [KNeighborsRegressor documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)




In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We will work with the penguins dataset, which can be loaded from seaborn.

The goal is to train a KNN to classify penguins (that is, to predict the species variable).

In [3]:
df = sns.load_dataset("penguins")

In [4]:
df.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


Are there any null values?

In [5]:
df.isna().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

If there are any, we will drop them for simplicity.

Drop the nulls:

In [6]:
df = df.dropna()

In [7]:
df.isna().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

Split into X and y:

In [8]:
X = df.drop("species", axis=1)
y = df["species"].copy()

In [9]:
X.head()

,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Torgersen,39.3,20.6,190.0,3650.0,Male


In [10]:
y.head()

0    Adelie
1    Adelie
2    Adelie
4    Adelie
5    Adelie
Name: species, dtype: object

How many penguins of each species do we have?



In [11]:
y.value_counts()

species
Adelie       146
Gentoo       119
Chinstrap     68
Name: count, dtype: int64

And as proportions?

In [12]:
y.value_counts(normalize=True)

species
Adelie       0.438438
Gentoo       0.357357
Chinstrap    0.204204
Name: proportion, dtype: float64

## Baseline

How would you define a baseline for this case?

There is no single correct way; it just has to be a simple model.



In [13]:
# TODO: complete

The model we develop has to beat this baseline. What accuracy_score does the baseline achieve?

In [14]:
from sklearn.metrics import accuracy_score

# TODO: measure the baseline's accuracy_score
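
One possible baseline, offered only as a sketch since the exercise is open-ended, is to always predict the most frequent class. The snippet below rebuilds the labels from the value_counts() output shown earlier (the names `y_demo`, `X_demo` and `baseline` are illustrative and not part of the notebook) and uses scikit-learn's DummyClassifier.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels rebuilt from the value_counts() output above (assumption: only the
# class counts matter for this baseline, not the features).
y_demo = np.array(["Adelie"] * 146 + ["Gentoo"] * 119 + ["Chinstrap"] * 68)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; the baseline ignores them

# strategy="most_frequent" always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, baseline_pred)
print(baseline_pred[0], round(baseline_accuracy, 6))
```

With this strategy the accuracy equals the proportion of the majority class, so on the full dataset it matches the Adelie share reported by value_counts(normalize=True).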

## Train - test split

As we saw in the previous class, it is important to set aside a test set to evaluate the model.

We will do a train-test split using scikit-learn.

First, import train_test_split from sklearn:

In [15]:
from sklearn.model_selection import train_test_split

Apply the function to obtain X_train, X_test, y_train and y_test.

We will hold out 15% of the data for the test set. Since the classes are not balanced, it is a good idea to use the stratify option that sklearn provides (covered in the previous class's notebook).

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify=y, random_state=42)
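
To see what stratify does, the following sketch splits a synthetic label vector with the same class counts as our dataset and compares class proportions in train and test. The arrays `y_demo` and `X_demo` and the helper `proportions` are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the penguins data after dropna().
y_demo = np.array(["Adelie"] * 146 + ["Gentoo"] * 119 + ["Chinstrap"] * 68)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=42
)

def proportions(labels):
    """Return the share of each class in a label array."""
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))

train_props = proportions(y_tr)
test_props = proportions(y_te)
print(train_props)
print(test_props)
```

With stratification, both splits keep class shares close to those of the full dataset, which matters most for the smallest class (Chinstrap).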

In [17]:
X_train = X_train.copy()

In [18]:
X_train.shape

(283, 6)

In [19]:
X_test.shape

(50, 6)

In [20]:
y_train.shape

(283,)

In [21]:
y_test.shape

(50,)

## Preprocesamiento de datos

We saw that in KNN it is very important for the features to be on the same scale.

What range of values do the numeric variables of the dataset take?

In [22]:
df.describe()

,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,333.0,333.0,333.0,333.0
mean,43.992793,17.164865,200.966967,4207.057057
std,5.468668,1.969235,14.015765,805.215802
min,32.1,13.1,172.0,2700.0
25%,39.5,15.6,190.0,3550.0
50%,44.5,17.3,197.0,4050.0
75%,48.6,18.7,213.0,4775.0
max,59.6,21.5,231.0,6300.0


We need to bring everything to the same scale. For this we will use scikit-learn's StandardScaler.
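
A small sketch of why this matters: the code below takes two penguins from df.head() and measures how much of their squared Euclidean distance comes from body_mass_g, before and after dividing by the standard deviations from df.describe(). Using those standard deviations is an approximation for illustration only (StandardScaler is fitted on train and uses the population standard deviation); the mean cancels out when taking differences.

```python
import numpy as np

# Rows 0 and 2 of df.head():
# bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
penguin_a = np.array([39.1, 18.7, 181.0, 3750.0])
penguin_b = np.array([40.3, 18.0, 195.0, 3250.0])

# Standard deviations copied from df.describe() (illustrative approximation).
stds = np.array([5.468668, 1.969235, 14.015765, 805.215802])

def mass_share(diff):
    """Fraction of the squared Euclidean distance contributed by body_mass_g."""
    squared = diff ** 2
    return squared[3] / squared.sum()

raw_share = mass_share(penguin_a - penguin_b)
scaled_share = mass_share((penguin_a - penguin_b) / stds)
print(f"raw: {raw_share:.4f}, scaled: {scaled_share:.4f}")
```

On the raw values the distance is almost entirely determined by body mass, simply because it is measured in grams; after scaling, the four features contribute on comparable terms.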

Import StandardScaler:

In [23]:
from sklearn.preprocessing import StandardScaler

Create an instance of StandardScaler:

In [24]:
scaler = StandardScaler()

As always in sklearn, we have to fit the object on our training data.

Fit the scaler on the NUMERIC columns of the training set:

In [25]:
columnas_numericas = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

In [26]:
scaler.fit(X_train[columnas_numericas])

Now, with the fitted scaler, we can transform both the train and the test data.

Transform the numeric columns of train (apply the scaler):

In [27]:
X_train[columnas_numericas] = scaler.transform(X_train[columnas_numericas])
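
As a minimal illustration of what fit and transform do, the sketch below fits a StandardScaler on a toy training column and checks that transform applies (x - mean) / std with the statistics learned from train. The arrays `train` and `test` and the name `demo_scaler` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy "body mass" column: three training values and one unseen test value.
train = np.array([[3000.0], [4000.0], [5000.0]])
test = np.array([[6000.0]])

# fit() stores the mean and standard deviation of the training data only.
demo_scaler = StandardScaler().fit(train)

# transform() reuses those stored statistics, even on new data.
test_scaled = demo_scaler.transform(test)
manual = (test - demo_scaler.mean_) / demo_scaler.scale_

print(demo_scaler.mean_, demo_scaler.scale_, test_scaled)
```

Because the test value is scaled with the training mean and standard deviation, no information from the test set leaks into the preprocessing.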

In [28]:
X_train.head()

,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
253,Biscoe,2.871447,-0.10202,2.077011,2.270111,Male
79,Torgersen,-0.350822,0.965417,-0.440463,-0.264906,Male
283,Biscoe,1.89556,-0.762814,2.148939,1.775474,Male
206,Dream,-0.27717,0.050471,-1.015885,-1.068692,Female
328,Biscoe,-0.129867,-1.626929,0.494599,0.446135,Female


Two categorical variables remain, so we will apply a one-hot encoder.

Remember that fit is done on the training data, and on the test data we only apply transform.

Import OneHotEncoder:

In [29]:
from sklearn.preprocessing import OneHotEncoder

Instantiate a one-hot encoder for each categorical variable:

In [30]:
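# sparse_output=False returns a dense array (the parameter was named sparse before scikit-learn 1.2)
ohe_island = OneHotEncoder(sparse_output=False)
ohe_sex = OneHotEncoder(sparse_output=False)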

Fit both encoders on the training data:

In [31]:
ohe_island.fit(X_train[["island"]])
ohe_sex.fit(X_train[["sex"]])



Get the one-hot encoded columns for both categorical variables:

In [33]:
islands = ohe_island.transform(X_train[["island"]])
encoded_island_df = pd.DataFrame(data=islands, columns= ohe_island.get_feature_names_out())

sex = ohe_sex.transform(X_train[["sex"]])
encoded_sex_df = pd.DataFrame(data=sex, columns= ohe_sex.get_feature_names_out())

Concatenate with X_train:

In [34]:
X_train.head()

,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
253,Biscoe,2.871447,-0.10202,2.077011,2.270111,Male
79,Torgersen,-0.350822,0.965417,-0.440463,-0.264906,Male
283,Biscoe,1.89556,-0.762814,2.148939,1.775474,Male
206,Dream,-0.27717,0.050471,-1.015885,-1.068692,Female
328,Biscoe,-0.129867,-1.626929,0.494599,0.446135,Female


In [None]:
X_train = pd.concat([X_train.reset_index(drop=True), encoded_island_df, encoded_sex_df], axis=1)
X_train.head()

,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,island_Biscoe,island_Dream,island_Torgersen,sex_Female,sex_Male
0,Biscoe,2.871447,-0.10202,2.077011,2.270111,Male,1.0,0.0,0.0,0.0,1.0
1,Torgersen,-0.350822,0.965417,-0.440463,-0.264906,Male,0.0,0.0,1.0,0.0,1.0
2,Biscoe,1.89556,-0.762814,2.148939,1.775474,Male,1.0,0.0,0.0,0.0,1.0
3,Dream,-0.27717,0.050471,-1.015885,-1.068692,Female,0.0,1.0,0.0,1.0,0.0
4,Biscoe,-0.129867,-1.626929,0.494599,0.446135,Female,1.0,0.0,0.0,1.0,0.0
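
The reset_index(drop=True) in the concat matters because pd.concat with axis=1 aligns rows by index label, not by position. A minimal sketch with made-up frames (`features` and `encoded` are hypothetical names):

```python
import pandas as pd

# After train_test_split the rows keep their original (shuffled) index labels...
features = pd.DataFrame({"body_mass_g": [2.27, -0.26]}, index=[253, 79])
# ...while a DataFrame built from the encoder output gets a fresh 0..n-1 index.
encoded = pd.DataFrame({"sex_Male": [1.0, 1.0]})

# Without resetting, rows are matched by label: no labels overlap, so we get
# four rows padded with NaN instead of two complete rows.
misaligned = pd.concat([features, encoded], axis=1)

# Resetting the index makes both frames line up by position.
aligned = pd.concat([features.reset_index(drop=True), encoded], axis=1)

print(misaligned)
print(aligned)
```

Note that after the reset X_train no longer shares index labels with y_train; this is fine for fitting, since scikit-learn pairs rows by position.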


Drop the original columns:

In [None]:
X_train = X_train.drop(["island", "sex"], axis=1)
X_train.head()

,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_Biscoe,island_Dream,island_Torgersen,sex_Female,sex_Male
0,2.871447,-0.10202,2.077011,2.270111,1.0,0.0,0.0,0.0,1.0
1,-0.350822,0.965417,-0.440463,-0.264906,0.0,0.0,1.0,0.0,1.0
2,1.89556,-0.762814,2.148939,1.775474,1.0,0.0,0.0,0.0,1.0
3,-0.27717,0.050471,-1.015885,-1.068692,0.0,1.0,0.0,1.0,0.0
4,-0.129867,-1.626929,0.494599,0.446135,1.0,0.0,0.0,1.0,0.0


## KNN

Now, with our dataset clean, let's train a KNN classifier.

First, import the KNN classifier from sklearn:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Instantiate a KNN with n_neighbors=5 and weights="uniform".

TO INVESTIGATE: what does weights="uniform" mean?

In [None]:
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
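
As a hint for the question above, this sketch compares weights="uniform" (every neighbor gets one vote) with weights="distance" (votes are weighted by the inverse of the distance) on a one-dimensional toy problem. The data and variable names are made up for the illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One point of class A very close to the query, two points of class B far away.
X_demo = np.array([[0.0], [1.0], [1.1]])
y_demo = np.array(["A", "B", "B"])
query = np.array([[0.1]])

# uniform: each of the 3 neighbors counts the same, so B wins 2 votes to 1.
uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_demo, y_demo)
uniform_pred = uniform_knn.predict(query)[0]

# distance: votes are weighted by 1 / distance, so the very close A point dominates.
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_demo, y_demo)
distance_pred = distance_knn.predict(query)[0]

print(uniform_pred, distance_pred)
```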

Train the model on the training data:

In [None]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

Generate predictions for train and test. Keep in mind that to generate the test predictions we first have to apply the same preprocessing to the test data (scaler and OHE), using the objects already fitted on train.

In [None]:
# Preprocess the test set with the scaler and encoders fitted on train:
X_test = X_test.copy()  # work on a copy to avoid SettingWithCopyWarning
X_test[columnas_numericas] = scaler.transform(X_test[columnas_numericas])
islands = ohe_island.transform(X_test[["island"]])
encoded_island_df = pd.DataFrame(data=islands, columns=ohe_island.get_feature_names_out())

sex = ohe_sex.transform(X_test[["sex"]])
encoded_sex_df = pd.DataFrame(data=sex, columns=ohe_sex.get_feature_names_out())

# Column names must match the ones used in train, hence get_feature_names_out() here too
X_test = pd.concat([X_test.reset_index(drop=True), encoded_island_df, encoded_sex_df], axis=1)

X_test = X_test.drop(["island", "sex"], axis=1)


In [None]:
pred_train = knn.predict(X_train)
pred_test = knn.predict(X_test)
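
The manual steps used so far (scale, encode, concatenate, predict) can also be bundled into a single object. The following is only a sketch of that alternative, using scikit-learn's ColumnTransformer and Pipeline on a tiny made-up sample; the data values and the names `train_demo`, `test_demo`, `preprocess` and `model` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample with a subset of the penguins columns (illustrative values).
train_demo = pd.DataFrame({
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe", "Dream", "Dream"],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "bill_length_mm": [39.1, 39.5, 50.0, 46.1, 49.2, 46.5],
    "body_mass_g": [3750.0, 3800.0, 5700.0, 4500.0, 3800.0, 3500.0],
})
labels_demo = ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"]
test_demo = pd.DataFrame({
    "island": ["Biscoe"], "sex": ["Male"],
    "bill_length_mm": [49.5], "body_mass_g": [5600.0],
})

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["bill_length_mm", "body_mass_g"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["island", "sex"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=1)),  # k=1 only because the sample is tiny
])

# fit() fits the preprocessing on train; predict() only transforms the new data.
model.fit(train_demo, labels_demo)
demo_pred = model.predict(test_demo)
print(demo_pred)
```

The design choice here is that the pipeline remembers which statistics and categories came from train, so it is harder to fit something on the test set by accident.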

Measure accuracy_score for train and test:

In [None]:
accuracy_score(y_train, pred_train)

0.9929328621908127

In [None]:
accuracy_score(y_test, pred_test)

1.0
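
To interpret these numbers, recall that accuracy is just the fraction of predictions that match the true label. The sketch below checks this on a small made-up example and relates the train score above to counts; the arrays `y_true` and `y_pred` are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 3 of 4 predictions are correct.
y_true = np.array(["Adelie", "Gentoo", "Chinstrap", "Adelie"])
y_pred = np.array(["Adelie", "Gentoo", "Adelie", "Adelie"])

manual_accuracy = np.mean(y_true == y_pred)
sklearn_accuracy = accuracy_score(y_true, y_pred)

# The train score reported above is consistent with 281 correct out of 283 rows.
train_accuracy = 281 / 283

print(manual_accuracy, sklearn_accuracy, train_accuracy)
```

Since the test set has only 50 penguins, a score of 1.0 means 50 correct predictions out of 50; with a sample this small the estimate can vary noticeably with the random split.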