<style type="text/css">
    ol { list-style-type: upper-alpha; }
    p { text-align: center; font-weight: bold; }
</style>

<center>
  <img src=https://i.imgur.com/0TSSaqL.png width="550">
</center>
<center>
  <h3>
    <b>CAPSTONE</b><br/>
    <b>Predict H1N1 and Seasonal Flu Vaccines</b><br/>
    <b>GROUP 3 - Modeling</b>
  </h3>
</center>

# Context

The goal is to predict whether a person received the H1N1 vaccine or the seasonal flu vaccine, based on information they shared about their background, opinions, and health behaviors.

After the EDA (Exploratory Data Analysis) phase, we kept 32 of the 36 initial features:

For all binary variables: 0 = No; 1 = Yes.

*   `h1n1_concern` - Level of concern about the H1N1 flu.
  *   0 = Not at all concerned; 1 = Not very concerned; 2 = Somewhat concerned; 3 = Very concerned.
*   `h1n1_knowledge` - Level of knowledge about H1N1 flu.
  *   0 = No knowledge; 1 = A little knowledge; 2 = A lot of knowledge.
*   `behavioral_antiviral_meds` - Has taken antiviral medications. (binary)
*   `behavioral_avoidance` - Has avoided close contact with others with flu-like symptoms. (binary)
*   `behavioral_face_mask` - Has bought a face mask. (binary)
*   `behavioral_wash_hands` - Has frequently washed hands or used hand sanitizer. (binary)
*   `behavioral_large_gatherings` - Has reduced time at large gatherings. (binary)
*   `behavioral_outside_home` - Has reduced contact with people outside of own household. (binary)
*   `behavioral_touch_face` - Has avoided touching eyes, nose, or mouth. (binary)
*   `doctor_recc_h1n1` - H1N1 flu vaccine was recommended by doctor. (binary)
*   `doctor_recc_seasonal` - Seasonal flu vaccine was recommended by doctor. (binary)
*   `chronic_med_condition` - Has any of the following chronic medical conditions: asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. (binary)
*   `child_under_6_months` - Has regular close contact with a child under the age of six months. (binary)
*   `health_worker` - Is a healthcare worker. (binary)
*   `health_insurance` - Has health insurance. (binary)
*   `opinion_h1n1_vacc_effective` - Respondent's opinion about H1N1 vaccine effectiveness.
  *   1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.
*   `opinion_h1n1_risk` - Respondent's opinion about risk of getting sick with H1N1 flu without vaccine.
  *   1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.
*   `opinion_h1n1_sick_from_vacc` - Respondent's worry of getting sick from taking H1N1 vaccine.
  *   1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.
*   `opinion_seas_vacc_effective` - Respondent's opinion about seasonal flu vaccine effectiveness.
  *   1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.
*   `opinion_seas_risk` - Respondent's opinion about risk of getting sick with seasonal flu without vaccine.
  *   1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.
*   `opinion_seas_sick_from_vacc` - Respondent's worry of getting sick from taking seasonal flu vaccine.
  *   1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.
*   `age_group` - Age group of respondent.
*   `education` - Self-reported education level.
*   `race` - Race of respondent.
*   `sex` - Sex of respondent.
*   `income_poverty` - Household annual income of respondent with respect to 2008 Census poverty thresholds.
*   `marital_status` - Marital status of respondent.
*   `rent_or_own` - Housing situation of respondent.
*   `employment_status` - Employment status of respondent.
*   `census_msa` - Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census.
*   `household_adults` - Number of other adults in household, top-coded to 3.
*   `household_children` - Number of children in household, top-coded to 3.

# 1. Preparation

In this phase we load and prepare the dataset produced by the EDA (Exploratory Data Analysis) phase.

First, we import the required libraries.

In [203]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
from sklearn.metrics import roc_curve, roc_auc_score

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

RANDOM_SEED = 42    # Set a random seed for reproducibility!

In [175]:
def print_metrics(y_test, y_pred_result):
  print("Accuracy:", "{:10.4f}".format(accuracy_score(y_test, y_pred_result, normalize=True)))
  print("Precision:", "{:10.4f}".format(precision_score(y_test, y_pred_result)))
  print("Recall:", "{:10.4f}".format(recall_score(y_test, y_pred_result)))
  print("\n", classification_report(y_test, y_pred_result))

In [176]:
def plot_roc(y_true, y_score, label_name, ax):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    ax.plot(fpr, tpr)
    ax.plot([0, 1], [0, 1], color='grey', linestyle='--')
    ax.set_ylabel('TPR')
    ax.set_xlabel('FPR')
    ax.set_title(
        f"{label_name}: AUC = {roc_auc_score(y_true, y_score):.4f}"
    )

We load the dataset and explore its structure and contents.

In [177]:
features_df = pd.read_csv('training_set_features_eda_notnulls.csv', index_col="respondent_id")
features_df.head()

respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,age_group,education,race,sex,marital_status,rent_or_own,employment_status,census_msa,household_adults,household_children
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,3,1,3,0,1,0,1,2,0.0,0.0
1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,1,0,3,1,1,1,0,0,0.0,0.0
2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,2,3,1,1,0,0,0,2.0,0.0
3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,4,0,3,0,1,1,1,1,0.0,0.0
4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,...,2,3,3,0,0,0,0,0,1.0,0.0


The dataset has 26707 rows and 31 columns.

In [178]:
features_df.shape

(26707, 31)

Data types per column:

In [179]:
features_df.dtypes

h1n1_concern                   float64
h1n1_knowledge                 float64
behavioral_antiviral_meds      float64
behavioral_avoidance           float64
behavioral_face_mask           float64
behavioral_wash_hands          float64
behavioral_large_gatherings    float64
behavioral_outside_home        float64
behavioral_touch_face          float64
doctor_recc_h1n1               float64
doctor_recc_seasonal           float64
chronic_med_condition          float64
child_under_6_months           float64
health_worker                  float64
health_insurance               float64
opinion_h1n1_vacc_effective    float64
opinion_h1n1_risk              float64
opinion_h1n1_sick_from_vacc    float64
opinion_seas_vacc_effective    float64
opinion_seas_risk              float64
opinion_seas_sick_from_vacc    float64
age_group                        int64
education                        int64
race                             int64
sex                              int64
marital_status           

In [180]:
features_df.describe()

Unnamed: 0,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,age_group,education,race,sex,marital_status,rent_or_own,employment_status,census_msa,household_adults,household_children
count,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0,...,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0,26707.0
mean,1.6198,1.261392,0.048714,0.727749,0.068933,0.825888,0.357472,0.336279,0.678811,0.202494,...,2.186131,1.741117,2.5703,0.406223,0.439735,0.222002,0.491894,0.833489,0.887558,0.529599
std,0.909016,0.617047,0.215273,0.445127,0.253345,0.379213,0.479264,0.472444,0.466942,0.401866,...,1.45732,1.073989,0.923226,0.491136,0.496364,0.4156,0.598964,0.823313,0.74998,0.925264
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,2.0,2.0,3.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
75%,2.0,2.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,4.0,3.0,3.0,1.0,1.0,0.0,1.0,2.0,1.0,1.0
max,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,4.0,3.0,3.0,1.0,1.0,1.0,2.0,2.0,3.0,3.0


In [181]:
labels_df = pd.read_csv("training_set_labels.csv", index_col="respondent_id")
labels_df.head()

respondent_id,h1n1_vaccine,seasonal_vaccine
0,0,0
1,0,1
2,0,0
3,0,1
4,0,0


# Split training and evaluation sets

In [182]:
X_train, X_eval, y_train, y_eval = train_test_split(
    features_df,
    labels_df,
    test_size=0.33,
    shuffle=True,
    stratify=labels_df,
    random_state=RANDOM_SEED
)

y_train_h1n1 = y_train[['h1n1_vaccine']].copy()
y_train_seasonal = y_train[['seasonal_vaccine']].copy()
y_eval_h1n1 = y_eval[['h1n1_vaccine']].copy()
y_eval_seasonal = y_eval[['seasonal_vaccine']].copy()
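Because the split stratifies on both label columns at once, the joint distribution of the two targets should be almost identical in the training and evaluation partitions. The following is a minimal sketch of how that could be checked. It runs on synthetic labels (the class frequencies, `toy_labels`, and `joint_distribution` are assumptions made for illustration, not values from the real dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for labels_df (the class frequencies are made up).
toy_labels = pd.DataFrame({
    "h1n1_vaccine": rng.binomial(1, 0.2, size=3000),
    "seasonal_vaccine": rng.binomial(1, 0.45, size=3000),
})
toy_features = pd.DataFrame({"x": rng.normal(size=3000)})

# Same call shape as the split above: stratify on both label columns.
X_tr, X_ev, y_tr, y_ev = train_test_split(
    toy_features, toy_labels,
    test_size=0.33, shuffle=True, stratify=toy_labels, random_state=42,
)

def joint_distribution(labels):
    """Share of rows in each (h1n1_vaccine, seasonal_vaccine) combination."""
    return labels.value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "train": joint_distribution(y_tr),
    "eval": joint_distribution(y_ev),
})
max_gap = (comparison["train"] - comparison["eval"]).abs().max()
print(comparison)
print(f"largest gap between partitions: {max_gap:.4f}")
```

Running the same comparison on `y_train` and `y_eval` would show whether the real split preserved the label proportions.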

# 2. Hyperparameter tuning with GridSearchCV

## Naive Bayes

### H1N1

In [183]:
param_grid_nb = {
    'var_smoothing': np.logspace(0,-9, num=100)
}

naive_h1n1_grid = GridSearchCV(estimator=GaussianNB(), param_grid=param_grid_nb, verbose=1, cv=10, n_jobs=-1)
naive_h1n1_grid.fit(X_train, y_train_h1n1.values.ravel())

print(naive_h1n1_grid.best_estimator_)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
GaussianNB(var_smoothing=0.657933224657568)


In [184]:
naive_params_result = pd.DataFrame({'params':naive_h1n1_grid.best_params_,
                           'score':naive_h1n1_grid.best_score_})
naive_params_result

Unnamed: 0,params,score
var_smoothing,0.657933,0.811435


### Seasonal

In [185]:
param_grid_nb = {
    'var_smoothing': np.logspace(0,-9, num=100)
}

naive_seasonal_grid = GridSearchCV(estimator=GaussianNB(), param_grid=param_grid_nb, verbose=1, cv=10, n_jobs=-1)
naive_seasonal_grid.fit(X_train, y_train_seasonal.values.ravel())

print(naive_seasonal_grid.best_estimator_)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
GaussianNB(var_smoothing=0.12328467394420659)


In [186]:
naive_params_result = pd.DataFrame({'params':naive_seasonal_grid.best_params_,
                           'score':naive_seasonal_grid.best_score_})
naive_params_result

Unnamed: 0,params,score
var_smoothing,0.123285,0.742972


## Logistic Regression

### H1N1

In [187]:
param_grid = {
    'penalty': ['l1', 'l2','elasticnet'],
    'C': [0.1, 0.2, 0.5, 1, 2, 5, 10, 100],
    'solver': ['lbfgs','saga'],
    'multi_class': ['auto', 'ovr', 'multinomial']
}

lr = LogisticRegression()
lr_h1n1_grid = GridSearchCV(lr, param_grid, cv=10)
lr_h1n1_grid.fit(X_train, y_train_h1n1.values.ravel())

print(lr_h1n1_grid.best_estimator_)

LogisticRegression(C=0.1, solver='saga')


In [188]:
lr_params_result = pd.DataFrame({'params':lr_h1n1_grid.best_params_,
                           'score':lr_h1n1_grid.best_score_})
lr_params_result

Unnamed: 0,params,score
C,0.1,0.850612
multi_class,auto,0.850612
penalty,l2,0.850612
solver,saga,0.850612


### Seasonal

In [192]:
param_grid = {
    'penalty': ['l1', 'l2','elasticnet'],
    'C': [0.1, 0.2, 0.5, 1, 2, 5, 10, 100],
    'solver': ['lbfgs','saga'],
    'multi_class': ['auto', 'ovr', 'multinomial']
}

lr = LogisticRegression()
lr_seasonal_grid = GridSearchCV(lr, param_grid, cv=10)
lr_seasonal_grid.fit(X_train, y_train_seasonal.values.ravel())

print(lr_seasonal_grid.best_estimator_)

LogisticRegression(C=0.1)


In [193]:
lr_params_result = pd.DataFrame({'params':lr_seasonal_grid.best_params_,
                           'score':lr_seasonal_grid.best_score_})
lr_params_result

Unnamed: 0,params,score
C,0.1,0.771699
multi_class,auto,0.771699
penalty,l2,0.771699
solver,lbfgs,0.771699


## SVM

### H1N1

In [194]:
# defining parameter range
param_grid = {'C': [0.1, 1, 10],
              'gamma': [1, 0.1, 0.01],
              'kernel': ['rbf']}

svm_h1n1_grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3)

# fitting the model for grid search
svm_h1n1_grid.fit(X_train, y_train_h1n1.values.ravel())

# print best parameter after tuning
print(svm_h1n1_grid.best_params_)

# print how our model looks after hyper-parameter tuning
print(svm_h1n1_grid.best_estimator_)

# print best_score after hyper-parameter tuning
print(svm_h1n1_grid.best_score_)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.787 total time=  42.6s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.787 total time=  41.7s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.788 total time=  42.2s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.788 total time=  42.3s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.788 total time=  42.3s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.830 total time=   9.9s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.828 total time=   9.7s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.832 total time=  10.0s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.824 total time=   9.9s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.823 total time=   9.9s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.837 total time=   8.5s
[CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf;,

In [195]:
svm_params_result = pd.DataFrame({'params':svm_h1n1_grid.best_params_,
                           'score':svm_h1n1_grid.best_score_})
svm_params_result

Unnamed: 0,params,score
C,10,0.852121
gamma,0.01,0.852121
kernel,rbf,0.852121


### Seasonal

In [214]:
# defining parameter range
param_grid = {'C': [0.1, 1, 10],
              'gamma': [1, 0.1, 0.01],
              'kernel': ['rbf']}

svm_seasonal_grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3)

# fitting the model for grid search
svm_seasonal_grid.fit(X_train, y_train_seasonal.values.ravel())

# print best parameter after tuning
print(svm_seasonal_grid.best_params_)

# print how our model looks after hyper-parameter tuning
print(svm_seasonal_grid.best_estimator_)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.534 total time=  33.7s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.535 total time=  33.8s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.535 total time=  33.5s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.534 total time=  33.8s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.534 total time=  33.9s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.771 total time=  13.8s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.761 total time=  13.6s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.756 total time=  13.6s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.770 total time=  13.7s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.771 total time=  13.7s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.775 total time=  12.5s
[CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf;,

In [215]:
svm_params_result = pd.DataFrame({'params':svm_seasonal_grid.best_params_,
                           'score':svm_seasonal_grid.best_score_})
svm_params_result

Unnamed: 0,params,score
C,10,0.77494
gamma,0.01,0.77494
kernel,rbf,0.77494
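The SVM searches above run on the unscaled features, whereas the MultiOutputClassifier pipeline later in the notebook applies `StandardScaler` first. An RBF kernel depends on distances between samples, so the best `gamma` found on unscaled data may not carry over to scaled data. One way to keep the two consistent, sketched here on synthetic data, is to tune the scaler and the SVC together as one pipeline. The step names `scaler` and `svc` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train_h1n1.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)

# The scaler is refit inside every CV fold, so no statistics leak
# from the validation fold into training.
svm_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid_svm = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [1, 0.1, 0.01],
}

svm_search = GridSearchCV(svm_pipeline, param_grid_svm, cv=5)
svm_search.fit(X_toy, y_toy)
print(svm_search.best_params_)
print(round(svm_search.best_score_, 4))
```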


## DecisionTreeClassifier

### H1N1

In [196]:
param_grid = {
    'max_features': ['auto', 'sqrt', 'log2'],
    'ccp_alpha': [0.1, .01, .001],
    'max_depth' : [5, 6, 7, 8, 9],
    'criterion' :['gini', 'entropy']
}

dt = DecisionTreeClassifier(random_state=RANDOM_SEED)
dt_h1n1_grid = GridSearchCV(dt, param_grid, cv=5, verbose=1, n_jobs=-1)
dt_h1n1_grid.fit(X_train, y_train_h1n1.values.ravel())

print(dt_h1n1_grid.best_estimator_)

Fitting 5 folds for each of 90 candidates, totalling 450 fits
DecisionTreeClassifier(ccp_alpha=0.001, criterion='entropy', max_depth=7,
                       max_features='auto', random_state=42)


In [197]:
df_params_result = pd.DataFrame({'params':dt_h1n1_grid.best_params_,
                           'score':dt_h1n1_grid.best_score_})
df_params_result

Unnamed: 0,params,score
ccp_alpha,0.001,0.828592
criterion,entropy,0.828592
max_depth,7,0.828592
max_features,auto,0.828592


### Seasonal

In [198]:
param_grid = {
    'max_features': ['auto', 'sqrt', 'log2'],
    'ccp_alpha': [0.1, .01, .001],
    'max_depth' : [5, 6, 7, 8, 9],
    'criterion' :['gini', 'entropy']
}

dt = DecisionTreeClassifier(random_state=RANDOM_SEED)
dt_seasonal_grid = GridSearchCV(dt, param_grid, cv=5, verbose=1, n_jobs=-1)
dt_seasonal_grid.fit(X_train, y_train_seasonal.values.ravel())

print(dt_seasonal_grid.best_estimator_)

Fitting 5 folds for each of 90 candidates, totalling 450 fits
DecisionTreeClassifier(ccp_alpha=0.001, criterion='entropy', max_depth=8,
                       max_features='log2', random_state=42)


In [199]:
df_params_result = pd.DataFrame({'params':dt_seasonal_grid.best_params_,
                           'score':dt_seasonal_grid.best_score_})
df_params_result

Unnamed: 0,params,score
ccp_alpha,0.001,0.72151
criterion,entropy,0.72151
max_depth,8,0.72151
max_features,log2,0.72151
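For tree classifiers, `max_features='auto'` was an alias of `'sqrt'` in the scikit-learn releases that accepted it, so the grids above evaluate that configuration twice, and recent releases reject `'auto'` altogether. A sketch of an equivalent grid without the duplicate follows. It runs on synthetic data, and using `None` (all features) as the third option is a suggestion for this example, not something the original search tried.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train_h1n1.
X_toy, y_toy = make_classification(n_samples=400, n_features=12, random_state=42)

param_grid_dt = {
    "max_features": ["sqrt", "log2", None],  # None = consider every feature
    "ccp_alpha": [0.01, 0.001],
    "max_depth": [5, 7, 9],
    "criterion": ["gini", "entropy"],
}

dt_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid_dt, cv=5
)
dt_search.fit(X_toy, y_toy)

# 3 * 2 * 3 * 2 distinct candidates, none of them duplicated.
n_dt_candidates = len(dt_search.cv_results_["params"])
print(n_dt_candidates)
print(dt_search.best_params_)
```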


## KNeighbors

### H1N1

In [204]:
param_grid = {
    'n_neighbors': list(range(1, 31))
}

kn = KNeighborsClassifier()
kn_h1n1_grid = GridSearchCV(kn, param_grid, cv=5, verbose=1, n_jobs=-1)
kn_h1n1_grid.fit(X_train, y_train_h1n1.values.ravel())

print(kn_h1n1_grid.best_estimator_)

Fitting 5 folds for each of 30 candidates, totalling 150 fits



KNeighborsClassifier(n_neighbors=29)


In [205]:
kn_params_result = pd.DataFrame({'params':kn_h1n1_grid.best_params_,
                           'score':kn_h1n1_grid.best_score_})
kn_params_result

Unnamed: 0,params,score
n_neighbors,29,0.827921


### Seasonal

In [206]:
param_grid = {
    'n_neighbors': list(range(1, 31))
}

kn = KNeighborsClassifier()
kn_seasonal_grid = GridSearchCV(kn, param_grid, cv=5, verbose=1, n_jobs=-1)
kn_seasonal_grid.fit(X_train, y_train_seasonal.values.ravel())

print(kn_seasonal_grid.best_estimator_)

Fitting 5 folds for each of 30 candidates, totalling 150 fits



KNeighborsClassifier(n_neighbors=28)




In [207]:
kn_params_result = pd.DataFrame({'params':kn_seasonal_grid.best_params_,
                           'score':kn_seasonal_grid.best_score_})
kn_params_result

Unnamed: 0,params,score
n_neighbors,28,0.75482
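Both searches select a value of `n_neighbors` (29 and 28) close to the upper end of the tested range of 1 to 30, which suggests the accuracy may still be improving beyond it. Looking at the mean cross-validation score for every `k`, rather than only the best one, helps decide whether to widen the range. The sketch below does this on synthetic data; the two-step margin used to flag "near the edge" is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train_h1n1.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)

k_values = list(range(1, 31))
knn_search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": k_values}, cv=5)
knn_search.fit(X_toy, y_toy)

# With a single parameter, cv_results_ follows the order of the grid.
scores_by_k = pd.Series(
    knn_search.cv_results_["mean_test_score"],
    index=k_values,
    name="mean_cv_accuracy",
)

best_k = knn_search.best_params_["n_neighbors"]
at_upper_edge = best_k >= max(k_values) - 2  # arbitrary two-step margin
print(scores_by_k.round(4).tail(10))
print("best k:", best_k, "| near upper edge of the grid:", at_upper_edge)
```

Plotting `scores_by_k` built from `kn_h1n1_grid` would show whether the curve has flattened by `k = 30` or is still rising.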


## RandomForest

### H1N1

In [209]:
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}

rf = RandomForestClassifier(random_state=RANDOM_SEED)
rf_h1n1_grid = GridSearchCV(rf, param_grid, cv=5, verbose=1, n_jobs=-1)
rf_h1n1_grid.fit(X_train, y_train_h1n1.values.ravel())

print(rf_h1n1_grid.best_estimator_)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
RandomForestClassifier(criterion='entropy', max_depth=8, n_estimators=500,
                       random_state=42)


In [210]:
rf_params_result = pd.DataFrame({'params':rf_h1n1_grid.best_params_,
                           'score':rf_h1n1_grid.best_score_})
rf_params_result

Unnamed: 0,params,score
criterion,entropy,0.847314
max_depth,8,0.847314
max_features,auto,0.847314
n_estimators,500,0.847314


### Seasonal

In [211]:
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}

rf = RandomForestClassifier(random_state=RANDOM_SEED)
rf_seasonal_grid = GridSearchCV(rf, param_grid, cv=5, verbose=1, n_jobs=-1)
rf_seasonal_grid.fit(X_train, y_train_seasonal.values.ravel())

print(rf_seasonal_grid.best_estimator_)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
RandomForestClassifier(max_depth=8, n_estimators=500, random_state=42)


In [212]:
rf_params_result = pd.DataFrame({'params':rf_seasonal_grid.best_params_,
                           'score':rf_seasonal_grid.best_score_})
rf_params_result

Unnamed: 0,params,score
criterion,gini,0.775052
max_depth,8,0.775052
max_features,auto,0.775052
n_estimators,500,0.775052


# 3. Training models

**Best algorithms for H1N1:**

SVM:
* {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
* SVC(C=10, gamma=0.01)
* Score: 0.852121

LogisticRegression:
* LogisticRegression(C=0.1, solver='saga')
* Score: 0.850612

**Best algorithms for Seasonal:**

RandomForest:
* RandomForestClassifier(max_depth=8, n_estimators=500, random_state=42)
* Score: 0.775052

SVM:
* {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
* SVC(C=10, gamma=0.01)
* Score: 0.77494
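To make the ranking easier to verify, the best cross-validation scores reported in section 2 can be gathered into one table. The numbers below are copied from the outputs above; the row labels are shorthand chosen for this sketch. Note that Naive Bayes and Logistic Regression were tuned with 10 folds and the other models with 5, so the scores are only roughly comparable.

```python
import pandas as pd

# Best mean CV accuracy per model, as reported by each grid search above.
cv_scores = pd.DataFrame(
    {
        "h1n1_vaccine": [0.811435, 0.850612, 0.852121, 0.828592, 0.827921, 0.847314],
        "seasonal_vaccine": [0.742972, 0.771699, 0.774940, 0.721510, 0.754820, 0.775052],
    },
    index=[
        "GaussianNB", "LogisticRegression", "SVC",
        "DecisionTree", "KNeighbors", "RandomForest",
    ],
)

# Two highest-scoring models for each target.
top_two = {
    target: cv_scores[target].nlargest(2).index.tolist()
    for target in cv_scores.columns
}
print(cv_scores.sort_values("h1n1_vaccine", ascending=False))
print(top_two)
```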

## H1N1

### Algorithm 1: SVM

### Algorithm 2: LogisticRegression

## Seasonal

### Algorithm 1: RandomForest

### Algorithm 2: SVM

## MultiOutputClassifier

### Algorithm 1: Logistic Regression

In [None]:
numeric_cols = features_df.columns[features_df.dtypes != "object"].values
print(numeric_cols)

In [None]:
# chain preprocessing into a Pipeline object
# each step is a tuple of (name you chose, sklearn transformer)
numeric_preprocessing_steps = Pipeline([
    ('standard_scaler', StandardScaler()),
    ('simple_imputer', SimpleImputer(strategy='median'))
])

# create the preprocessor stage of final pipeline
# each entry in the transformer list is a tuple of
# (name you choose, sklearn transformer, list of columns)
preprocessor = ColumnTransformer(
    transformers = [
        ("numeric", numeric_preprocessing_steps, numeric_cols)
    ],
    remainder = "drop"
)

In [None]:
estimators = MultiOutputClassifier(
    estimator=LogisticRegression(penalty="l2", C=0.1)
)

full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("estimators", estimators),
])

full_pipeline

In [None]:
# Train model
full_pipeline.fit(X_train, y_train)

# Predict on evaluation set
preds_lg = full_pipeline.predict_proba(X_eval)
preds_lg

In [None]:
preds_lg[1][:, 1]

In [None]:
# Round the probabilities to hard 0/1 labels: classification metrics can't handle a mix of binary and continuous targets
print_metrics(y_eval_h1n1, preds_lg[0][:, 1].round())

In [None]:
print_metrics(y_eval_seasonal, preds_lg[1][:, 1].round())
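The indexing used in these cells follows from the shape of what `MultiOutputClassifier.predict_proba` returns: a list with one `(n_samples, n_classes)` array per target, in the column order of the labels, so index `0` is the first target and index `1` the second. A small sketch on synthetic data illustrates the structure. Here `Y_toy` and its two columns are made-up stand-ins for `y_train`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data with two binary targets (stand-ins for the two vaccines).
X_toy, y_first = make_classification(n_samples=200, n_features=6, random_state=42)
y_second = (X_toy[:, 0] > np.median(X_toy[:, 0])).astype(int)
Y_toy = np.column_stack([y_first, y_second])

model = MultiOutputClassifier(LogisticRegression(C=0.1)).fit(X_toy, Y_toy)
probas = model.predict_proba(X_toy)

# One array per target; column 1 of each array is P(class == 1).
print(type(probas).__name__, len(probas), probas[0].shape)

# Thresholding column 1 at 0.5 gives hard labels for each target.
hard_labels = np.column_stack([(p[:, 1] > 0.5).astype(int) for p in probas])
agreement = (hard_labels == model.predict(X_toy)).mean()
print("agreement with predict():", agreement)
```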

In [None]:
y_preds_lg = pd.DataFrame(
    {
        "h1n1_vaccine": preds_lg[0][:, 1],
        "seasonal_vaccine": preds_lg[1][:, 1],
    },
    index = y_eval.index
)
print("y_preds.shape:", y_preds_lg.shape)
y_preds_lg.head()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(7, 3.5))

plot_roc(
    y_eval['h1n1_vaccine'],
    y_preds_lg['h1n1_vaccine'],
    'h1n1_vaccine',
    ax=ax[0]
)
plot_roc(
    y_eval['seasonal_vaccine'],
    y_preds_lg['seasonal_vaccine'],
    'seasonal_vaccine',
    ax=ax[1]
)
fig.tight_layout()
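The plots report one AUC per target. If a single number is wanted to compare the three MultiOutputClassifier models, one option is the unweighted mean of the two AUCs, which `roc_auc_score` computes directly when given two-column inputs and `average="macro"`. The sketch below uses synthetic labels and scores (the noise model is invented purely to produce plausible values) and confirms that the macro average equals the mean of the per-target AUCs.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins for y_eval and y_preds_lg.
y_true = pd.DataFrame({
    "h1n1_vaccine": rng.integers(0, 2, size=n),
    "seasonal_vaccine": rng.integers(0, 2, size=n),
})
# Noisy scores that are correlated with the labels but not perfectly separable.
y_score = 0.3 * y_true + 0.7 * rng.random((n, 2))

per_target = {
    col: roc_auc_score(y_true[col], y_score[col]) for col in y_true.columns
}
macro_auc = roc_auc_score(y_true, y_score, average="macro")

print({col: round(auc, 4) for col, auc in per_target.items()})
print("macro AUC:", round(macro_auc, 4))
```

With the notebook's own objects the call would be `roc_auc_score(y_eval, y_preds_lg, average="macro")`, assuming the columns of both frames are in the same order.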

### Algorithm 2: Naive Bayes

In [None]:
estimators = MultiOutputClassifier(
    estimator=GaussianNB(var_smoothing=0.15199110829529336)
)

full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("estimators", estimators),
])

full_pipeline

In [None]:
# Train model
full_pipeline.fit(X_train, y_train)

# Predict on evaluation set
preds_naive = full_pipeline.predict_proba(X_eval)
preds_naive

In [None]:
print_metrics(y_eval_h1n1, preds_naive[0][:, 1].round())

In [None]:
print_metrics(y_eval_seasonal, preds_naive[1][:, 1].round())

In [None]:
y_preds_naive = pd.DataFrame(
    {
        "h1n1_vaccine": preds_naive[0][:, 1],
        "seasonal_vaccine": preds_naive[1][:, 1],
    },
    index = y_eval.index
)
print("y_preds.shape:", y_preds_naive.shape)
y_preds_naive.head()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(7, 3.5))

plot_roc(
    y_eval['h1n1_vaccine'],
    y_preds_naive['h1n1_vaccine'],
    'h1n1_vaccine',
    ax=ax[0]
)
plot_roc(
    y_eval['seasonal_vaccine'],
    y_preds_naive['seasonal_vaccine'],
    'seasonal_vaccine',
    ax=ax[1]
)
fig.tight_layout()

### Algorithm 3: SVM

In [None]:
estimators = MultiOutputClassifier(
    estimator=SVC(C=10, kernel='rbf', gamma=0.01, probability=True)  # best parameters from the grid search above
)

full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("estimators", estimators),
])

full_pipeline

In [None]:
# Train model
full_pipeline.fit(X_train, y_train)

# Predict on evaluation set
preds_svm = full_pipeline.predict_proba(X_eval)
preds_svm

In [None]:
print_metrics(y_eval_h1n1, preds_svm[0][:, 1].round())

In [None]:
print_metrics(y_eval_seasonal, preds_svm[1][:, 1].round())

In [None]:
y_preds_svm = pd.DataFrame(
    {
        "h1n1_vaccine": preds_svm[0][:, 1],
        "seasonal_vaccine": preds_svm[1][:, 1],
    },
    index = y_eval.index
)
print("y_preds.shape:", y_preds_svm.shape)
y_preds_svm.head()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(7, 3.5))

plot_roc(
    y_eval['h1n1_vaccine'],
    y_preds_svm['h1n1_vaccine'],
    'h1n1_vaccine',
    ax=ax[0]
)
plot_roc(
    y_eval['seasonal_vaccine'],
    y_preds_svm['seasonal_vaccine'],
    'seasonal_vaccine',
    ax=ax[1]
)
fig.tight_layout()