# MAT281 - Laboratory No. 10



<a id='p1'></a>
## I.- Problem 01


<img src="https://www.goodnewsnetwork.org/wp-content/uploads/2019/07/immunotherapy-vaccine-attacks-cancer-cells-immune-blood-Fotolia_purchased.jpg" width="360" height="360" align="center"/>


**Breast cancer** is a malignant proliferation of the epithelial cells lining the ducts or lobules of the breast. It is a clonal disease: a single cell, as the result of a series of somatic or germline mutations, acquires the ability to divide without control or order and multiplies until it forms a tumor. The resulting tumor begins as a mild abnormality, becomes severe, invades neighboring tissue and finally spreads to other parts of the body.

The dataset is named `BC.csv`. It contains information on patients with tumors (benign or malignant) and some characteristics of each tumor.


The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.
Details can be found in [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].


The first step is to load the dataset:

In [34]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


%matplotlib inline
sns.set_palette("deep", desat=.6)
sns.set(rc={'figure.figsize':(11.7,8.27)})

In [35]:
# load data
df = pd.read_csv(os.path.join("data","BC.csv"), sep=",")
df['diagnosis'] = df['diagnosis'].replace({'M':1,'B':0}) # target: 1 = malignant (M), 0 = benign (B)
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Based on the information presented, answer the following:

1. Perform an exploratory analysis of the dataset.
2. Normalize the numeric variables with **StandardScaler**.
3. Apply a dimensionality reduction method covered in class.
4. Apply at least three different classification models. For each model, optimize its hyperparameters and compute the corresponding metrics. Draw conclusions.




# Exploratory data analysis

We check the DataFrame for null values:

In [36]:
df.isna().sum()

id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

We check the data types:

In [37]:
df.dtypes

id                           int64
diagnosis                    int64
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst     
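
Beyond null counts and dtypes, an exploratory pass would typically also look at the class balance and at how the features differ between classes. The following is a minimal sketch on a small hand-made frame; the values in `toy` are invented for illustration and are not taken from `BC.csv`. With the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (all values are made up)
toy = pd.DataFrame({
    "diagnosis":    [1, 1, 0, 0, 0, 1, 0, 0],
    "radius_mean":  [18.0, 20.6, 12.1, 11.4, 13.0, 19.7, 12.5, 11.9],
    "texture_mean": [10.4, 17.8, 18.2, 15.1, 16.9, 21.3, 14.8, 17.0],
})

# Class balance: share of benign (0) and malignant (1) cases
class_share = toy["diagnosis"].value_counts(normalize=True).sort_index()

# Per-class feature means: a quick view of which features separate the classes
class_means = toy.groupby("diagnosis").mean()

# Correlation of every feature with the target
target_corr = toy.corr()["diagnosis"].drop("diagnosis")

print(class_share)
print(class_means)
print(target_corr)
```

A strongly unbalanced `class_share` would suggest stratified splits and metrics other than plain accuracy.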

# Variable normalization

In [38]:
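from sklearn.preprocessing import StandardScaler

# Features: every column except the identifier and the target
x = df.drop(['id','diagnosis'],axis=1)
# Identifier and target are set aside and left unscaled
y = df.loc[:,['id','diagnosis']]

# Standardize each feature to zero mean and unit variance
scaler=StandardScaler()
x_scaler=scaler.fit_transform(x)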

df_norm = pd.DataFrame(data=x_scaler, columns=x.columns)
df_norm

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,1.886690,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.243890,0.281190
2,1.579888,0.456187,1.566503,1.558884,0.942210,1.052926,1.363478,2.037231,0.939685,-0.398008,...,1.511870,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955000,1.152255,0.201391
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.935010
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.539340,1.371011,1.428493,-0.009560,-0.562450,...,1.298575,-1.466770,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.397100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,2.110995,0.721473,2.060786,2.343856,1.041842,0.219060,1.947285,2.320965,-0.312589,-0.931027,...,1.901185,0.117700,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,-1.058611,...,1.536720,2.047399,1.421940,1.494959,-0.691230,-0.394820,0.236573,0.733827,-0.531855,-0.973978
566,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.038680,0.046588,0.105777,-0.809117,-0.895587,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,1.043695,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635
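
`StandardScaler` replaces each value $x$ of a column with $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are that column's mean and population standard deviation. Below is a small sketch of this formula with NumPy, using the five `radius_mean` values printed by `df.head()`. Because only five rows are used, the resulting z-scores differ from those in the table, which are computed over the full dataset.

```python
import numpy as np

# radius_mean for the first five rows shown by df.head()
radius_mean = np.array([17.99, 20.57, 19.69, 11.42, 20.29])

mu = radius_mean.mean()
sigma = radius_mean.std()  # population standard deviation (ddof=0), as StandardScaler uses

# Standardized values: zero mean and unit variance by construction
z = (radius_mean - mu) / sigma

print(z)
print(z.mean(), z.std())
```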


# Dimensionality reduction

In [39]:
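# Data: synthetic classification set generated with make_classification.
# This overwrites y and df; the BC.csv column names are reused only as labels.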
from sklearn.datasets import make_classification
X, y = make_classification(10000, 32, n_informative=3, n_classes=2,
                          random_state=1982)

df = pd.DataFrame(X, columns=df.columns)
df['y']=y
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,y
0,-1.40051,-0.048109,-0.046545,-0.297896,0.626766,0.383418,-1.307782,-0.070762,-0.880527,-1.647776,...,0.694137,2.602183,1.315259,-0.372328,-1.759982,0.051139,0.739998,1.677039,-1.165731,1
1,-2.09416,0.116172,0.515839,0.260673,-0.283651,0.198088,1.346089,0.147816,-0.041026,0.419761,...,-0.47618,0.192077,-0.562488,1.210318,0.616367,-0.659969,0.278842,0.551827,-0.166156,0
2,0.225637,-0.22439,2.275619,1.265242,-2.21767,1.359177,-0.234818,-0.546524,-0.573629,-2.348732,...,-1.073508,-0.047243,-0.500618,-1.862424,-0.083906,2.57371,-1.523601,1.50368,-0.071947,1
3,-0.991203,1.5392,-1.492905,2.276662,-1.130815,-0.26595,0.170307,-0.507494,-1.540893,0.480088,...,2.983794,0.045577,1.900004,0.658359,0.703957,2.328194,1.416037,0.070543,0.514729,0
4,-0.941911,0.508593,-0.245676,0.155026,2.178331,0.654727,0.282052,0.236341,-0.483726,-0.753545,...,0.361916,1.381499,-1.289404,-1.12562,-0.636572,-0.866533,1.391476,0.890062,-0.354026,1


In [40]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

In [41]:
# Separate the target column from the features
x_training = df.drop(['y',], axis=1)
y_training = df['y']

# Apply univariate feature selection based on the ANOVA F-test
k = 15  # number of features to select
columnas = list(x_training.columns.values)
seleccionadas = SelectKBest(f_classif, k=k).fit(x_training, y_training)

In [42]:
catrib = seleccionadas.get_support()
atributos = [columnas[i] for i in list(catrib.nonzero()[0])]
atributos

['id',
 'diagnosis',
 'perimeter_mean',
 'concave points_mean',
 'perimeter_se',
 'area_se',
 'concave points_se',
 'fractal_dimension_se',
 'radius_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst']
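
`SelectKBest` keeps a subset of the original columns. `PCA`, imported at the top of the notebook, is an alternative that projects the standardized features onto a few orthogonal directions of maximal variance. The following is a hedged sketch on a small synthetic set; the sample size, `n_components=2` and the names `X_demo`, `X_std` and `X_pca` are assumptions chosen for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small synthetic set, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=32, n_informative=3,
                                     n_classes=2, random_state=1982)

# PCA is sensitive to feature scale, so standardize first
X_std = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# Shape of the projected data and fraction of variance kept by each component
print(X_pca.shape)
print(pca.explained_variance_ratio_)
```

Unlike feature selection, the resulting components are linear combinations of all input columns, so they no longer carry the original column names.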

# Models

# K-Nearest Neighbors

In [58]:
from metrics_classification import *  # local helper module; summary_metrics is assumed to come from here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

In [60]:
# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=14)

print('Training set size: ', len(X_train))
print('Test set size: ', len(X_test))

Training set size:  7500
Test set size:  2500


In [62]:
kn = KNeighborsClassifier()
# Hyperparameter grid explored by the cross-validated grid search
param = {'n_neighbors' : [3, 5, 8],
         "leaf_size":[10,20], 
         'weights' : ['uniform', 'distance'],
         "p":[1,2,3],
          }
kn_grid = GridSearchCV(estimator = kn, param_grid = param)
kn_grid.fit(X_train, y_train)
print('Best parameters: ', kn_grid.best_params_)
# best_score_ is the mean cross-validated accuracy of the best configuration
print('Best cross-validated accuracy: ', kn_grid.best_score_)

Best parameters:  {'leaf_size': 10, 'n_neighbors': 8, 'p': 1, 'weights': 'distance'}
Best cross-validated accuracy:  0.9129333333333334
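
K-nearest neighbors relies on distances, so feature scale matters. One common way to combine scaling with a grid search is to place both steps in a `Pipeline`, so that the scaler is refit on each training fold only. This is a sketch under assumptions: the step names `scaler` and `knn`, the reduced grid and the small synthetic set are illustrative, not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic set, for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=14, stratify=y_demo)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
grid = {"knn__n_neighbors": [3, 5, 8], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, param_grid=grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy on the held-out test split
```

Since `GridSearchCV` refits the best configuration on the whole training split by default, `search` can be used directly for prediction.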


In [63]:
kn_final = kn_grid.best_estimator_
kn_final.fit(X_train, y_train)
y_pred = kn_final.predict(X_test)

print('\nConfusion matrix:\n ')
print(confusion_matrix(y_test,y_pred))


df_temp = pd.DataFrame(
    {
        'y':y_test,
        'yhat':y_pred
        }
)

df_metrics1 = summary_metrics(df_temp)
print("\nMetrics: ")
print("")
print(df_metrics1)


Confusion matrix:
 
[[1149  106]
 [ 116 1129]]

Metrics: 

   accuracy  recall  precision  fscore
0    0.9112  0.9112     0.9112  0.9112
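
The reported metrics can be recovered from the confusion matrix by hand. The sketch below assumes that `summary_metrics` macro-averages recall, precision and F-score over the two classes. That helper's source is not shown here, but the values it prints are consistent with this assumption.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes
cm = np.array([[1149, 106],
               [116, 1129]])

true_pos = np.diag(cm)
recall_per_class = true_pos / cm.sum(axis=1)      # correct / actual members of each class
precision_per_class = true_pos / cm.sum(axis=0)   # correct / predicted members of each class
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))

accuracy = true_pos.sum() / cm.sum()
macro_recall = recall_per_class.mean()            # unweighted mean over the two classes
macro_precision = precision_per_class.mean()
macro_f1 = f1_per_class.mean()

print(round(accuracy, 4), round(macro_recall, 4),
      round(macro_precision, 4), round(macro_f1, 4))
# -> 0.9112 0.9112 0.9112 0.9112
```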


# Logistic Regression

In [68]:
log = LogisticRegression()
param = {'penalty': ['l1', 'l2'],'random_state': [0,200], 'C':[1,2,3], 'solver':[ 'liblinear']}
log_grid = GridSearchCV(estimator = log, param_grid = param)
log_grid.fit(X_train, y_train)
print('Best parameters: ', log_grid.best_params_)
print('Best cross-validated accuracy: ', log_grid.best_score_)

Best parameters:  {'C': 1, 'penalty': 'l2', 'random_state': 0, 'solver': 'liblinear'}
Best cross-validated accuracy:  0.8882666666666668


In [69]:
log_final = log_grid.best_estimator_
log_final.fit(X_train, y_train)
y_pred = log_final.predict(X_test)

print('\nConfusion matrix:\n ')
print(confusion_matrix(y_test,y_pred))


df_temp = pd.DataFrame(
    {
        'y':y_test,
        'yhat':y_pred
        }
)

df_metrics2 = summary_metrics(df_temp)

print("\nMetrics: ")
print("")
print(df_metrics2)


Confusion matrix:
 
[[1084  171]
 [ 129 1116]]

Metrics: 

   accuracy  recall  precision  fscore
0      0.88  0.8801     0.8804    0.88


# SVM

In [70]:
from sklearn.svm import SVC

In [71]:
# Instantiate the support vector classifier
svm_model = SVC()
params = {'kernel' :('linear','poly','rbf', 'sigmoid') , 'C':range(1,5)}
svm_grid = GridSearchCV(estimator = svm_model, param_grid = params)
svm_grid.fit(X_train, y_train)
print('Best parameters: ', svm_grid.best_params_)
print('Best cross-validated accuracy: ', svm_grid.best_score_)

Best parameters:  {'C': 4, 'kernel': 'rbf'}
Best cross-validated accuracy:  0.916


In [72]:
svm_final = svm_grid.best_estimator_
svm_final.fit(X_train, y_train)
y_pred = svm_final.predict(X_test)

print('\nConfusion matrix:\n ')
print(confusion_matrix(y_test,y_pred))


df_temp = pd.DataFrame(
    {
        'y':y_test,
        'yhat':y_pred
        }
)

df_metrics3 = summary_metrics(df_temp)
print("\nMetrics: ")
print("")
print(df_metrics3)


Confusion matrix:
 
[[1130  125]
 [  94 1151]]

Metrics: 

   accuracy  recall  precision  fscore
0    0.9124  0.9124     0.9126  0.9124


In [73]:
comparacion = []
comparacion.append(df_metrics1)
comparacion.append(df_metrics2)
comparacion.append(df_metrics3)
comparacion = pd.concat(comparacion)
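# Labels follow the order in which the metrics were appended: KNN, logistic regression, SVM
comparacion['Modelo'] = ['K-Nearest','Logistic Regression','SVM']
comparacion

Unnamed: 0,accuracy,recall,precision,fscore,Modelo
0,0.9112,0.9112,0.9112,0.9112,K-Nearest
0,0.88,0.8801,0.8804,0.88,Logistic Regression
0,0.9124,0.9124,0.9126,0.9124,SVM


Based on the metrics above, SVM is the best model: it has the highest accuracy, recall, precision and F-score (about 0.912). Its margin over K-Nearest Neighbors (about 0.911) is small, while logistic regression trails both (about 0.88).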