## Exercise: sklearn breast cancer

1. Load the [breast_cancer dataset from `sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html)

In [34]:
from sklearn import datasets
cancer = datasets.load_breast_cancer()

from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.c_[cancer.data, cancer.target],
                 columns = list(cancer.feature_names) + ['target'])

y_col = 'target'
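# Every column except the target is a feature
# (`col != y_col` compares names; `not in` would do a substring test on the string)
X_cols = [col for col in df.columns if col != y_col]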

X = df[X_cols].fillna(0)
y = df[y_col]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size = 0.2,
                                                   random_state=42)

We are going to try all the classification methods seen so far with GridSearchCV, working step by step through the approaches we have just covered.

In this case the goal is to predict whether a tumor is malignant or benign (that is, whether the patient has cancer or not), so we will use classification algorithms to predict the "target" variable.
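
Before modelling, it can help to confirm what the two values of "target" mean. The following is a minimal sketch, assuming the copy of the dataset bundled with scikit-learn, where label `0` stands for malignant and label `1` for benign. The variable names `class_names` and `class_counts` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names is indexed by the integer label: 0 -> malignant, 1 -> benign
class_names = [str(name) for name in cancer.target_names]

# Count how many samples fall in each class
class_counts = np.bincount(cancer.target)

for name, count in zip(class_names, class_counts):
    print(f"{name}: {count} samples")
print(f"Features per sample: {cancer.data.shape[1]}")
```

The two classes are not perfectly balanced, which is worth keeping in mind when reading the accuracy figures later on.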

2. Use the simple method: pick an algorithm you like and try several parameter combinations to obtain the best prediction:

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Create the model:
model = LogisticRegression()

# Define the parameter grid.
# Note: of the solvers listed, only 'liblinear' supports the 'l1' penalty.
# The other 'l1' combinations fail to fit, and GridSearchCV scores them
# as NaN (with a warning) instead of stopping the search.
parameters = {
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag']
}

# Create the cross-validated grid search:
grid = GridSearchCV(estimator = model,
                   param_grid = parameters,
                   n_jobs = -1,
                   scoring = 'accuracy',
                   cv = 10)

# Fit on the training set:
grid.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
                         'penalty': ['l1', 'l2'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag']},
             scoring='accuracy')

In [45]:
print(f"Train: {grid.score(X_train, y_train)}")
print(f"Test: {grid.score(X_test, y_test)}")

Train: 0.9846153846153847
Test: 0.956140350877193
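
Several combinations in the grid above are invalid (for example `penalty='l1'` with `solver='lbfgs'`), so they only produce failed fits. One way to avoid them is to pass `param_grid` as a list of dictionaries, each pairing a solver with the penalties it supports. The sketch below assumes a smaller, illustrative grid; the name `valid_grid` and the `max_iter=5000` setting (chosen arbitrarily to reduce convergence warnings on unscaled data) are not part of the original exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Each dictionary is expanded on its own, so only valid pairs are tried
valid_grid = [
    # liblinear supports both l1 and l2
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
    # lbfgs supports l2 but not l1
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.01, 0.1, 1, 10]},
]

grid = GridSearchCV(estimator=LogisticRegression(max_iter=5000),
                    param_grid=valid_grid,
                    scoring="accuracy",
                    cv=5)
grid.fit(X_train, y_train)

print(f"Candidates evaluated: {len(grid.cv_results_['params'])}")
print(f"Best parameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.3f}")
```

Because every candidate is valid, none of the entries in `cv_results_['mean_test_score']` should be NaN, and no compute time is spent on fits that cannot succeed.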


3. Now choose two classification algorithms and use the "pro method" to find the best possible combination of model and parameters. Is it better than the one you obtained before?

In [37]:
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline(steps=[("modelo", LogisticRegression())])


pipe_lg_params = {
    'modelo': [LogisticRegression()],
    'modelo__penalty': ['l1', 'l2'],
    'modelo__C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
    'modelo__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag']
}

pipe_dt_params = {
    'modelo': [DecisionTreeClassifier()],
    'modelo__max_depth': np.arange(1, 10),
    'modelo__min_samples_split': [2, 5, 10]
}

search_space = [pipe_lg_params, pipe_dt_params]

# Create the cross-validated grid search:
grid2 = GridSearchCV(
    estimator = pipe,
    param_grid = search_space,
    n_jobs = -1,
    scoring = 'accuracy',
    cv = 10)

grid2.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('modelo', LogisticRegression())]),
             n_jobs=-1,
             param_grid=[{'modelo': [LogisticRegression(C=100,
                                                        solver='newton-cg')],
                          'modelo__C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
                          'modelo__penalty': ['l1', 'l2'],
                          'modelo__solver': ['newton-cg', 'lbfgs', 'liblinear',
                                             'sag']},
                         {'modelo': [DecisionTreeClassifier()],
                          'modelo__max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
                          'modelo__min_samples_split': [2, 5, 10]}],
             scoring='accuracy')

In [48]:
grid2.cv_results_

{'mean_fit_time': array([0.00290098, 0.00250058, 0.01290047, 0.00219977, 0.15769393,
        0.07571802, 0.01080089, 0.03657613, 0.00200007, 0.00170002,
        0.02060242, 0.00199995, 0.17341979, 0.07856021, 0.01199992,
        0.0393038 , 0.00210071, 0.00220032, 0.17052305, 0.00195255,
        0.15861964, 0.06030872, 0.01310432, 0.04660254, 0.00530131,
        0.00220103, 0.31097641, 0.0032001 , 0.20077534, 0.05460715,
        0.01545315, 0.03640213, 0.0025002 , 0.00200078, 0.34039662,
        0.00200133, 0.2321142 , 0.05865872, 0.01360228, 0.03390517,
        0.00240099, 0.00290065, 0.37366421, 0.00220053, 0.18968029,
        0.05870738, 0.01400063, 0.03315377, 0.00160007, 0.00220058,
        0.4116607 , 0.00200045, 0.20299335, 0.06071713, 0.01550167,
        0.04450941, 0.00230463, 0.00160096, 0.36280684, 0.0017004 ,
        0.28173332, 0.06030884, 0.01932118, 0.04373443, 0.00725222,
        0.00913067, 0.00800397, 0.01110876, 0.01120918, 0.01100042,
        0.01389952, 0.01299989,
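
The raw `cv_results_` dictionary is hard to read as printed above. A common approach, sketched here on a small stand-in search rather than on `grid2` itself, is to load it into a pandas DataFrame and sort by `rank_test_score`. The grid values and the name `cv_table` are assumptions made only for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Small stand-in search so the example runs quickly
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6], "min_samples_split": [2, 10]},
    scoring="accuracy",
    cv=5)
search.fit(X, y)

# cv_results_ is a dict of equal-length arrays, one entry per candidate
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = pd.DataFrame(search.cv_results_)[columns]
cv_table = cv_table.sort_values("rank_test_score").reset_index(drop=True)

print(cv_table.head())
```

The same two lines that build `cv_table` should work on any fitted `GridSearchCV` object, including the ones in this exercise.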

In [47]:
print(f"Train: {grid2.best_estimator_.score(X_train, y_train)}")
print(f"Test: {grid2.best_estimator_.score(X_test, y_test)}")

Train: 0.9846153846153847
Test: 0.956140350877193
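
To answer the question in step 3, it helps to see which model the search actually chose. When the estimator itself is one of the searched parameters, the winner is stored under that step's name in `best_params_`. The sketch below uses a reduced search space and an arbitrary `max_iter=5000`; both are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The "modelo" step is a placeholder that the search space overrides
pipe = Pipeline(steps=[("modelo", LogisticRegression(max_iter=5000))])

search_space = [
    {"modelo": [LogisticRegression(max_iter=5000)],
     "modelo__C": [0.1, 1, 10]},
    {"modelo": [DecisionTreeClassifier(random_state=0)],
     "modelo__max_depth": [2, 4, 6]},
]

search = GridSearchCV(pipe, search_space, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# best_params_["modelo"] holds the estimator of the winning candidate
winner = search.best_params_["modelo"]
print(f"Winning model: {type(winner).__name__}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```

If the winner is the same model with the same parameters as in step 2, identical train and test scores are expected.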


4. Use what we have called "Next level" (a made-up name for this style of hyperparameter search) to find the best version of each classification algorithm we have seen so far, with its best parameter combination, including extra data preprocessing stages such as scaling (that is, the pipeline stages):

In [55]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, SelectFromModel

dt = DecisionTreeClassifier(max_depth=5)

pipe_dt = Pipeline(steps=[("scaler", StandardScaler()),
                       ("selector", SelectKBest()),
                       ("modelo", DecisionTreeClassifier())])

pipe_lr = Pipeline(steps=[("scaler", MinMaxScaler()),
                       ("selector", SelectFromModel(estimator=dt)),
                       ("modelo", LogisticRegression())])

pipe_dt_params = {
    'selector__k': np.arange(1, len(X.columns)),
    'modelo': [DecisionTreeClassifier()],
    'modelo__max_depth': np.arange(1, 10),
    'modelo__min_samples_split': [2, 5, 10]
}

pipe_lg_params = {
    'modelo': [LogisticRegression()],
    'modelo__penalty': ['l1', 'l2'],
    'modelo__C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
    'modelo__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag']
}


# Create the cross-validated grid searches:
grid_dt = GridSearchCV(
    estimator = pipe_dt,
    param_grid = pipe_dt_params,
    n_jobs = -1,
    scoring = 'accuracy',
    cv = 10)

grid_lr = GridSearchCV(
    estimator = pipe_lr,
    param_grid = pipe_lg_params,
    n_jobs = -1,
    scoring = 'accuracy',
    cv = 10)

grids3 = {
    'gs_dt': grid_dt,
    'gs_lr': grid_lr
}

In [56]:
for nombre, grid_search in grids3.items():
    print(f"Calculando {nombre}")
    grid_search.fit(X_train, y_train)

Calculando gs_dt
Calculando gs_lr


In [61]:
print(f"DT: {grids3['gs_dt'].best_estimator_.score(X_test, y_test)}")
print(f"LR: {grids3['gs_lr'].best_estimator_.score(X_test, y_test)}")

DT: 0.9298245614035088
LR: 0.9649122807017544
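
Once a pipeline search has been fitted, the individual stages of the best pipeline can be inspected through `named_steps`. The sketch below, which assumes a reduced grid (`small_grid`) so that it runs quickly, shows how to recover which features the `SelectKBest` stage kept.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("selector", SelectKBest()),
                       ("modelo", DecisionTreeClassifier(random_state=0))])

# Illustrative grid, much smaller than the one used in the exercise
small_grid = {"selector__k": [5, 10, 15], "modelo__max_depth": [3, 5]}

search = GridSearchCV(pipe, small_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# best_estimator_ is the whole pipeline, refitted on the training set
best_pipe = search.best_estimator_

# get_support() returns a boolean mask over the input features
mask = best_pipe.named_steps["selector"].get_support()
selected = cancer.feature_names[mask]

print(search.best_params_)
print(list(selected))
```

Since the scaler and selector live inside the pipeline, they are refitted on each training fold during cross-validation, so the held-out fold never influences the preprocessing.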


In [62]:
# Best options:

In [63]:
from sklearn.feature_selection import SelectFromModel
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Start by defining the main pipelines:
reg_log = Pipeline(steps = [
    ("scaler", StandardScaler()),
    ("selecModel",SelectFromModel(DecisionTreeClassifier(max_depth=10, random_state=17))),
    ("reglog", LogisticRegression())
])
svc = Pipeline(steps =[
    ("scaler", StandardScaler()),
    ("selectkbest", SelectKBest()),
    ("svc", svm.SVC())
])
decision_tree = Pipeline(steps =[
    ("scaler", StandardScaler()),
    ("selectkbest", SelectKBest()),
    ('decision_tree', DecisionTreeClassifier()),
     ])
K_nearest = Pipeline(steps =[
    ("scaler", StandardScaler()),
    ("selectkbest", SelectKBest()),
    ('KNeighbor',KNeighborsClassifier())
    ])
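# And define their parameters.
# Note: C must be strictly positive, so the C=0 candidates fail to fit, and
# 'l1' is not supported by the default 'lbfgs' solver. GridSearchCV scores
# those candidates as NaN.
re_log_param = {
    "reglog__penalty": ["l1", "l2"],
    "reglog__C": np.arange(0, 4, 0.5)
}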
svc_param = {
    "selectkbest__k": list(range(8,25)),
    "svc__C": np.arange(0.1, 0.9, 0.1),
    "svc__kernel": ['linear', 'poly', 'rbf']
}
decision_tree_params = {
    "selectkbest__k": list(range(2,10)),
    'decision_tree__max_depth': [10, 100, 500, 1000],
    'decision_tree__criterion': ['gini', 'entropy']
}
knn_params = {
    "selectkbest__k": [2, 3, 4],
    'KNeighbor__n_neighbors': list(range(4,10)),
}
# Create a grid search for each one:
gs_reg_log = GridSearchCV(reg_log,
                         re_log_param,
                         cv = 10,
                         scoring='accuracy',
                         n_jobs=-1,
                         verbose=1)
gs_svm = GridSearchCV(svc,
                         svc_param,
                         cv = 10,
                         scoring='accuracy',
                         n_jobs=-1,
                         verbose=1)
gs_decision_tree = GridSearchCV(decision_tree,
                         decision_tree_params,
                         cv = 10,
                         scoring='accuracy',
                         n_jobs=-1,
                         verbose=1)
gs_K_nearest = GridSearchCV(K_nearest,
                         knn_params,
                         cv = 10,
                         scoring='accuracy',
                         n_jobs=-1,
                         verbose=1)
grids = {
    "gs_reg_log": gs_reg_log,
    "gs_svm": gs_svm,
    # Despite the key name, this entry holds the decision tree search
    "gs_rand_forest": gs_decision_tree,
    'gs_K_nearest': gs_K_nearest}

In [64]:
# Train:
for nombre, grid_search in grids.items():
    grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 16 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:    0.7s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 10 folds for each of 408 candidates, totalling 4080 fits


[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 1200 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done 3200 tasks      | elapsed:   12.6s
[Parallel(n_jobs=-1)]: Done 4080 out of 4080 | elapsed:   15.7s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 10 folds for each of 64 candidates, totalling 640 fits


[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 640 out of 640 | elapsed:    2.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 10 folds for each of 18 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    0.6s finished


In [67]:
# Collect the cross-validation and test scores of each grid in the same row,
# so that sorting cannot misalign the two columns:
results = pd.DataFrame(
    [(name, gs.best_score_, gs.score(X_test, y_test)) for name, gs in grids.items()],
    columns = ['Grid', 'Best score', 'Test Score'])
results = results.sort_values(by='Best score', ascending=False)
results

Unnamed: 0,Grid,Best score,Test Score
1,gs_svm,0.978019,0.982456
0,gs_reg_log,0.953865,0.973684
3,gs_K_nearest,0.947198,0.95614
2,gs_rand_forest,0.938454,0.921053


Example in which we run all the combinations as a single search over a common pipeline, without distinguishing between LogisticRegression and DecisionTreeClassifier:

In [49]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, SelectFromModel

dt = DecisionTreeClassifier(max_depth=5)

pipe_dt = Pipeline(steps=[("scaler", StandardScaler()),
                       ("selector", SelectKBest()),
                       ("modelo", dt)])

pipe_dt_params = {
    'selector__k': np.arange(1, len(X.columns)),
    'modelo': [DecisionTreeClassifier()],
    'modelo__max_depth': np.arange(1, 10),
    'modelo__min_samples_split': [2, 5, 10]
}

pipe_lg_params = {
    'selector__k': np.arange(1, len(X.columns)),
    'modelo': [LogisticRegression()],
    'modelo__penalty': ['l1', 'l2'],
    'modelo__C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
    'modelo__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag']
}


search_space = [pipe_lg_params, pipe_dt_params]

# Create the cross-validated grid search:
grid3 = GridSearchCV(
    estimator = pipe_dt,
    param_grid = search_space,
    n_jobs = -1,
    scoring = 'accuracy',
    cv = 10)

grid3.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('selector', SelectKBest()),
                                       ('modelo',
                                        DecisionTreeClassifier(max_depth=5))]),
             n_jobs=-1,
             param_grid=[{'modelo': [LogisticRegression(C=0.1,
                                                        solver='newton-cg')],
                          'modelo__C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
                          'modelo__penalty': ['l1', 'l2'],
                          'modelo__solver': ['newton-cg', 'lbfgs', 'liblinear',
                                             'sag'],
                          'selector__k': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])},
                         {'modelo': [DecisionTreeClassifier()],
                          'modelo__max_

In [50]:
grid3.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()), ('selector', SelectKBest(k=28)),
                ('modelo', LogisticRegression(C=0.1, solver='newton-cg'))])

In [51]:
grid3.best_estimator_.score(X_test, y_test)

0.9824561403508771
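
Combining both models in a single search makes the number of candidates grow quickly. A way to estimate the cost before fitting, sketched below, is to count the combinations with `ParameterGrid`. The search space mirrors the one in this last example; `n_features = 30` is hard-coded here as an assumption so that the snippet does not need the DataFrame `X`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier

n_features = 30  # number of feature columns in the breast cancer dataset
k_values = list(range(1, n_features))  # same values as np.arange(1, len(X.columns))

search_space = [
    {"selector__k": k_values,
     "modelo": [LogisticRegression()],
     "modelo__penalty": ["l1", "l2"],
     "modelo__C": [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
     "modelo__solver": ["newton-cg", "lbfgs", "liblinear", "sag"]},
    {"selector__k": k_values,
     "modelo": [DecisionTreeClassifier()],
     "modelo__max_depth": list(range(1, 10)),
     "modelo__min_samples_split": [2, 5, 10]},
]

# ParameterGrid expands each dictionary and concatenates the results
n_candidates = len(ParameterGrid(search_space))
n_folds = 10
n_fits = n_candidates * n_folds

print(f"Candidates: {n_candidates}")
print(f"Total fits with cv={n_folds}: {n_fits}")
```

The count includes the invalid penalty and solver pairs, since `ParameterGrid` only enumerates combinations and does not check them against the estimator.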