# Práctica Independiente: KNN

### A partir del dataset `Breast Cancer Wisconsin` construir un clasficador basado en KNN

En esta práctica vamos a construir un clasificador para predecir el diagnóstico.

Este dataset proviene de observaciones sobre las células de distintos pacientes. 
La información que contiene es la siguiente:

Attribute Information:

1) ID number

2) Diagnosis (M = malignant, B = benign)) 

Ten real-valued features are computed for each of the three cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 

b) texture (standard deviation of gray-scale values) 

c) perimeter 

d) area 

e) smoothness (local variation in radius lengths) 

f) compactness (perimeter^2 / area - 1.0) 

g) concavity (severity of concave portions of the contour) 

h) concave points (number of concave portions of the contour) 

i) symmetry 

j) fractal dimension ("coastline approximation" - 1)








#### Ejercicio 1. Importar las librerias necesarias

In [33]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score

#### Ejercicio 2. Leemos el dataset desde una url

In [2]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header=None)

In [3]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [0]:
df.shape

(569, 32)

#### Ejercicio 3: Hacer el split entre features y variable target

In [9]:
X = df.drop([1], axis=1)
y = pd.Series(df[1])
pd.Series(y).value_counts(normalize=True)

B    0.627417
M    0.372583
Name: 1, dtype: float64

#### Ejercicio 4. Hacer el split entre test y training

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=12)

pandas.core.frame.DataFrame

#### Ejericio 5: Establecer en 3 el parametro K

In [11]:
k=3

#### Ejercicio 6:  Evaluar la perfomance del modelo en términos de score

In [28]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train) 
model.score

<bound method ClassifierMixin.score of KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')>

#### Ejercicio 7: Hacer la prediccion sobre el conjunto de testing

In [30]:
scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
y_pred = model.predict(X_test)
y_pred

array(['M', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B',
       'B', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B',
       'M', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B',
       'M', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B',
       'M', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B',
       'M', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'B',
       'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'M',
       'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'M',
       'M', 'M', 'M', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M',
       'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B',
       'M', 'M', 'M', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'B', 'B'],
      dtype=object)

#### Ejercicio 8: Emplear una matriz de confusion para observar los aciertos vs errores en la clasifiacion

In [35]:
print(confusion_matrix(y_test, y_pred))
accuracy_score(y_test, y_pred)

[[87  3]
 [ 6 47]]


0.9370629370629371

#### BONUS Ejercicio 9: encontrar cuál es el mejor K