# INDEPENDENT PRACTICE: KNN

## Introduction

Using the `wdbc` dataset, you will build a KNN-based classifier. The dataset contains a set of features computed from digitized images of breast tissue samples. The features describe characteristics of the cell nuclei present in the images.

The fields are as follows:

* ID number 
* Diagnosis (M = malignant, B = benign) 

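Ten real-valued features are computed for each cell nucleus. For each image, the mean, the standard error, and the "worst" value (mean of the three largest values) of each feature are recorded, which gives 30 feature columns: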

* radius (mean of distances from center to points on the perimeter) 
* texture (standard deviation of gray-scale values) 
* perimeter 
* area 
* smoothness (local variation in radius lengths) 
* compactness (perimeter^2 / area - 1.0) 
* concavity (severity of concave portions of the contour) 
* concave points (number of concave portions of the contour) 
* symmetry 
* fractal dimension ("coastline approximation" - 1)
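
The file has no header row, so the 32 columns end up labeled with integers. According to the dataset documentation, the 30 feature columns hold the mean, standard error, and "worst" value of each of the ten measurements, in that order, after the ID and diagnosis. A minimal sketch of how descriptive column names could be generated; the naming scheme (for example `radius_mean`) is an assumption made for illustration and is not part of the file:

```python
# Base measurements, in the order listed above.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Each measurement is summarized by three statistics per image.
stats = ["mean", "se", "worst"]

# Hypothetical naming scheme: all means first, then standard errors, then worst values.
feature_names = [f"{feat}_{stat}" for stat in stats for feat in base_features]
column_names = ["id", "diagnosis"] + feature_names

print(len(feature_names))   # 30
print(len(column_names))    # 32
print(column_names[:4])     # ['id', 'diagnosis', 'radius_mean', 'texture_mean']
```

A list like this could be passed to `pd.read_csv` through its `names` parameter instead of relying on integer column labels.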

#### Exercise 1. Import the required libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler


#### Exercise 2. Read the dataset from a URL

In [2]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header=None)

In [3]:
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [4]:
df.shape

(569, 32)

#### Exercise 3: Separate the features from the target variable

In [5]:
data = df[list(df.columns[2:])]

In [6]:
target = df[1]

#### Exercise 4. Split the data into training and test sets, then standardize

In [7]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size = 0.10)

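# Use sklearn to standardize the feature matrix.
# The scaler is fitted on the training set only; the test set is
# transformed with the training mean and standard deviation.

scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)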
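
A common convention in this step is to learn the scaling statistics from the training rows only and then reuse them on the test rows, so that no information from the test set leaks into preprocessing. The pure-Python sketch below illustrates that idea together with a seeded split. `split_indices`, `fit_standardizer`, and `standardize` are hypothetical helpers written for this illustration (they are not scikit-learn functions), and the data is made up.

```python
import math
import random
import statistics

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

def fit_standardizer(column):
    """Return the mean and population standard deviation of one feature column."""
    return statistics.fmean(column), statistics.pstdev(column)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std using statistics computed elsewhere."""
    return [(x - mean) / std for x in column]

# Toy single-feature data (made up for illustration).
values = [float(v) for v in range(1, 21)]
train_idx, test_idx = split_indices(len(values), test_size=0.10, seed=0)
train = [values[i] for i in train_idx]
test = [values[i] for i in test_idx]

mean, std = fit_standardizer(train)          # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # test reuses the train statistics

print(len(train), len(test))                 # 18 2
```

In the notebook itself, passing `random_state` to `train_test_split` would make the split, and therefore every score that follows, reproducible from run to run.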

#### Exercise 5: Set the parameter K to 3

In [8]:
neigh = KNeighborsClassifier(n_neighbors=3)

In [9]:
neigh.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
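
The printed parameters (`metric='minkowski'`, `p=2`, `weights='uniform'`) correspond to Euclidean distance with an unweighted majority vote. To make that concrete, here is a minimal from-scratch sketch of the prediction rule. `knn_predict` is a hypothetical helper and the points are made up; scikit-learn's implementation is more elaborate (for example, it can use tree structures to speed up the neighbor search).

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D points: one cluster near the origin, another near (5, 5).
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.8), (5.0, 5.0), (5.2, 4.7), (4.8, 5.3)]
labels = ["B", "B", "B", "M", "M", "M"]

print(knn_predict(points, labels, (0.3, 0.3)))  # B
print(knn_predict(points, labels, (4.9, 5.1)))  # M
```

Because every feature contributes to the distance on its own scale, standardizing the features beforehand keeps large-valued columns such as area from dominating the vote.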

#### Exercise 6: Evaluate the model's performance in terms of score (accuracy on the training set)

In [10]:
neigh.score(X_train, y_train)

0.978515625

#### Exercise 7: Predict on the test set

In [11]:
predict = neigh.predict(X_test) 

# true class
actual = list(y_test)
print(predict)
print(actual)

['B' 'B' 'B' 'B' 'M' 'B' 'B' 'M' 'M' 'B' 'B' 'M' 'B' 'B' 'M' 'B' 'M' 'B'
 'B' 'B' 'B' 'M' 'B' 'M' 'B' 'B' 'B' 'M' 'M' 'M' 'M' 'M' 'M' 'B' 'B' 'M'
 'B' 'B' 'B' 'B' 'M' 'M' 'B' 'B' 'M' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'B' 'B'
 'M' 'M' 'B']
['B', 'B', 'M', 'B', 'M', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B']


#### Exercise 8: Use a confusion matrix to compare correct and incorrect classifications

In [12]:
cm = confusion_matrix(actual, predict, labels=['M', 'B'])
print(cm)

[[19  2]
 [ 2 34]]
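
With `labels=['M', 'B']`, rows are the true classes and columns are the predicted classes, both in that order. The first row therefore says that 19 malignant samples were predicted as malignant and 2 as benign; the second row says that 2 benign samples were predicted as malignant and 34 as benign. A small sketch of how such a matrix is counted, using a hypothetical helper `confusion_counts` and made-up labels rather than the notebook's test set:

```python
def confusion_counts(actual, predicted, labels):
    """Count (true label, predicted label) pairs; rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(actual, predicted):
        matrix[index[truth]][index[guess]] += 1
    return matrix

# Small made-up example.
actual_toy = ["M", "M", "B", "B", "B", "M"]
predicted_toy = ["M", "B", "B", "B", "M", "M"]

cm_toy = confusion_counts(actual_toy, predicted_toy, labels=["M", "B"])
print(cm_toy)  # [[2, 1], [1, 2]]
```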


In [13]:
from sklearn import metrics

# Fraction of test samples that were classified correctly
metrics.accuracy_score(actual, predict)

0.9298245614035088
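
The accuracy can be cross-checked by hand from the confusion matrix of Exercise 8, and the same four counts give per-class measures that accuracy alone hides. A short sketch of the arithmetic; the variable names are chosen here for illustration:

```python
# Confusion matrix from Exercise 8, with rows = true class and
# columns = predicted class, both ordered as ['M', 'B'].
cm_values = [[19, 2],
             [2, 34]]

tp, fn = cm_values[0]   # true malignant: predicted M, predicted B
fp, tn = cm_values[1]   # true benign:    predicted M, predicted B

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall_m = tp / (tp + fn)        # share of malignant cases that were caught
precision_m = tp / (tp + fp)     # share of malignant predictions that were right

print(round(accuracy, 4))        # 0.9298  (53 correct out of 57)
print(round(recall_m, 4))        # 0.9048
print(round(precision_m, 4))     # 0.9048
```

In a diagnostic setting the recall for the malignant class is often the figure of most interest, since each false negative is a missed malignant case.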

#### BONUS Exercise 9: Find the best K

We import the `cross_val_score` function from the `model_selection` module of `sklearn`.

In [14]:
import numpy as np
from sklearn.model_selection import cross_val_score

We run cross-validation to choose the parameter $k$ (the number of nearest neighbors) of the algorithm. For each candidate value we compute the classification error ($1 - \text{accuracy}$, averaged over the folds), store it in a dictionary, and then look for the key with the smallest value (that is, the minimum classification error).

In [15]:
error_rates = {}
for i in range(2, 10):
    nei = KNeighborsClassifier(n_neighbors=i)
    # Mean 10-fold accuracy on the training set, converted to an error rate
    error = 1 - np.mean(cross_val_score(nei, X_train, y_train, cv=10))
    error_rates[i] = error

In [16]:
error_rates

{2: 0.04088050314465408,
 3: 0.033037365889752146,
 4: 0.03695893451720311,
 5: 0.03499815020347763,
 6: 0.03884572697003319,
 7: 0.033037365889752146,
 8: 0.03499815020347763,
 9: 0.03485016648168704}
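
To show what `cv=10` does with the 512 training rows (569 minus the 57 test rows), the sketch below splits row indices into ten folds and turns per-fold accuracies into an error rate. `k_fold_indices` and `cross_val_error` are hypothetical helpers. Note one simplification: for classifiers, `cross_val_score` with an integer `cv` uses stratified folds that preserve the class proportions, whereas this sketch uses plain consecutive folds.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for consecutive, unshuffled folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

def cross_val_error(fold_accuracies):
    """Turn per-fold accuracies into a single error rate (1 - mean accuracy)."""
    return 1 - sum(fold_accuracies) / len(fold_accuracies)

folds = list(k_fold_indices(n_samples=512, n_folds=10))
print([len(validation) for _, validation in folds])
# [52, 52, 51, 51, 51, 51, 51, 51, 51, 51]

# Made-up fold accuracies, only to show the arithmetic.
print(round(cross_val_error([1.0, 0.9, 0.95, 1.0]), 4))  # 0.0375
```

Each row is used for validation exactly once and for training in the other nine folds, so the averaged error uses all of the training data without ever scoring a model on rows it was fitted on.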

In [17]:
# Key (value of K) with the lowest cross-validated error
min(error_rates, key=error_rates.get)

3
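
In the output above, K = 3 and K = 7 have exactly the same error, and `min` simply returns the first key it encounters with the lowest value. A short sketch that lists every tied candidate, using the error rates copied from that output (`tied_ks` is a name chosen for this illustration):

```python
# Error rates copied from the output above.
error_rates = {
    2: 0.04088050314465408,
    3: 0.033037365889752146,
    4: 0.03695893451720311,
    5: 0.03499815020347763,
    6: 0.03884572697003319,
    7: 0.033037365889752146,
    8: 0.03499815020347763,
    9: 0.03485016648168704,
}

lowest = min(error_rates.values())
tied_ks = [k for k, err in error_rates.items() if err == lowest]

print(tied_ks)                                  # [3, 7]
print(min(error_rates, key=error_rates.get))    # 3
```

Which of the tied values to prefer is a judgment call. The differences between all candidates are also small, and because the split is not seeded they may change from one run to the next.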