# Naive Bayes (Gaussian)

In [1]:
%load_ext autoreload
%autoreload 2
from preprocessing import *
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SequentialFeatureSelector
import numpy as np
import pandas as pd

In [2]:
SCORINGS = ["f1", "roc_auc", "accuracy", "recall", "precision"]
METRIC = "roc_auc"

def tabla(grid):
    """Summarize a fitted GridSearchCV: one row per candidate, best METRIC first."""
    results = pd.DataFrame(grid.cv_results_)
    # Show var_smoothing in scientific notation so nearby values are distinguishable
    results['param_nb__var_smoothing'] = results['param_nb__var_smoothing'].map('{:.3e}'.format)
    results = results.sort_values("rank_test_" + METRIC).reset_index(drop = True)
    cols = ["param_nb__var_smoothing"] + ["mean_test_" + x for x in SCORINGS]
    return results[cols]

## Initial model

One thing to keep in mind is that Gaussian Naive Bayes is designed mainly for continuous variables. For this first model we keep the numeric features that are discrete (cloud cover, `nubosidad`), but we leave out the `barrio` (neighborhood) feature, which is categorical.
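
To make the continuity assumption concrete, here is a minimal from-scratch sketch of what a Gaussian Naive Bayes classifier computes: for each class it fits one mean and one variance per feature, then scores a sample by adding the log prior to the per-feature Gaussian log densities. The function names (`fit_gaussian_nb`, `predict_gaussian_nb`), the small variance floor, and the toy data are illustrative assumptions, not the notebook's code or scikit-learn's implementation.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate, per class, the prior and each feature's mean and variance."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # tiny floor avoids division by zero
            for col, m in zip(zip(*rows), means)
        ]
        params[label] = (n / len(X), means, variances)
    return params

def predict_gaussian_nb(params, row):
    """Return the class maximizing log prior + sum of per-feature Gaussian log densities."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(params, key=lambda label: log_posterior(*params[label]))

# Toy data (assumption): two continuous features, two arbitrary classes 0 and 1.
X = [[30.0, 40.0], [28.0, 35.0], [31.0, 45.0], [18.0, 90.0], [20.0, 85.0], [17.0, 95.0]]
y = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_nb(X, y)
prediction_a = predict_gaussian_nb(params, [29.0, 42.0])  # close to the class-0 samples
prediction_b = predict_gaussian_nb(params, [19.0, 88.0])  # close to the class-1 samples
print(prediction_a, prediction_b)
```

Because every feature is summarized by a bell curve, a feature that only takes a handful of integer values, or a category encoded as numbers, is described only approximately by this model.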

In [3]:
initialize_dataset()
df_features = pd.read_csv("datasets/df_features.csv", low_memory = False, index_col = "id")
df_target = pd.read_csv("datasets/df_target.csv", low_memory = False, index_col = "id")

# We do not standardize, since NB estimates a mean and variance per feature and handles scale itself
initialize_dataset()
common(df_features, df_target)
viento_trigonometrico(df_features)
drop_categoricas(df_features)
pipe = iterative_imputer()

In [4]:
pipe.steps.append(['nb', GaussianNB()])

grid = GridSearchCV(pipe, param_grid = {'nb__var_smoothing': np.logspace(-15, 1, num=16)},
                    verbose = 1, n_jobs = -1, cv = StratifiedKFold(3), scoring = SCORINGS, refit = METRIC)

grid.fit(df_features, df_target.values.ravel())

grid.best_score_

Fitting 3 folds for each of 16 candidates, totalling 48 fits


0.8330730231196685

In [5]:
tabla(grid)

,param_nb__var_smoothing,mean_test_f1,mean_test_roc_auc,mean_test_accuracy,mean_test_recall,mean_test_precision
0,2.929e-08,0.577055,0.833073,0.827181,0.52658,0.638248
1,2.512e-09,0.59013,0.832361,0.805478,0.62548,0.558564
2,2.154e-10,0.589552,0.831417,0.798902,0.645077,0.542829
3,1.848e-11,0.589047,0.831247,0.797895,0.646954,0.540655
4,1.585e-12,0.589051,0.831226,0.797817,0.647216,0.540479
5,1.359e-13,0.588983,0.831225,0.797788,0.647128,0.540425
6,1.166e-14,0.588983,0.831225,0.797788,0.647128,0.540425
7,1e-15,0.588983,0.831225,0.797788,0.647128,0.540425
8,3.415e-07,0.429661,0.829211,0.824562,0.29526,0.789044
9,3.981e-06,0.000349,0.819984,0.776154,0.000175,1.0


The best model is the one with smoothing on the order of $10^{-8}$. Even so, it did not perform very well. Its accuracy of 82.7% looks acceptable, but that is a small gain considering that always answering "no" would reach about 77%, as happened with the last candidate in the table, whose recall is practically zero. Neither recall (0.53) nor precision (0.64) is very good.
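
The roughly 77% reference point can be reproduced with a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels whose class balance (776 negatives, 224 positives) is an assumption chosen to match the accuracy of the last row of the table. It is not the notebook's dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (assumption): about 77.6% negatives, as the table suggests.
y = np.array([0] * 776 + [1] * 224)
X = np.zeros((len(y), 1))  # this baseline ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)  # always predicts the majority class (0)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)
print(baseline_accuracy, baseline_recall)
```

Any model is worth keeping only if it beats this accuracy while also achieving a useful recall, which is why the notebook ranks candidates by ROC AUC rather than by accuracy.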

For the next models we search values closer to that order of smoothing.
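
As background on why the score is so sensitive to this parameter: scikit-learn documents `var_smoothing` as the portion of the largest feature variance that is added to every variance, and the fitted model exposes the resulting value as `epsilon_`. The sketch below illustrates this on synthetic data (an assumption): a wide-scale uninformative column sets the size of the additive term, which then swamps the variance of a narrow-scale informative column.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 500
y = np.repeat([0, 1], n // 2)
# Synthetic features (assumption): one wide-scale column and one narrow-scale column.
wide = rng.normal(loc=1000.0, scale=100.0, size=n)   # uninformative, large variance
narrow = rng.normal(loc=y * 0.5, scale=0.1, size=n)  # informative, small variance
X = np.column_stack([wide, narrow])

epsilons, accuracies = {}, {}
for smoothing in (1e-9, 1e-3, 1.0):
    model = GaussianNB(var_smoothing=smoothing).fit(X, y)
    epsilons[smoothing] = model.epsilon_          # absolute amount added to every variance
    accuracies[smoothing] = model.score(X, y)     # training accuracy, for illustration only
    print(f"var_smoothing={smoothing:.0e}  epsilon={epsilons[smoothing]:.3e}  "
          f"accuracy={accuracies[smoothing]:.3f}")
```

When features live on very different scales, even a small `var_smoothing` can flatten the Gaussians of the narrow features, which is one plausible explanation for the collapse in recall at the larger values in the table.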

## Dropping correlated features

The dataset contains strongly correlated variables, which goes against the independence assumption that Naive Bayes makes. We try dropping some of the variables that had the highest correlation with others and train again. Even so, this may give fairly poor results, since some of these relationships were useful at least in the baseline.
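
The notebook's `drop_correlacionadas` helper lives in `preprocessing` and its implementation is not shown here. As a hedged illustration of the idea, the sketch below drops one column from every pair whose absolute Pearson correlation exceeds a threshold. The function name `drop_highly_correlated`, the 0.9 threshold, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop the later column of every pair whose |Pearson correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
temp_max = rng.normal(25, 5, 200)
toy = pd.DataFrame({
    "temp_max": temp_max,
    "temperatura_tarde": temp_max - 1 + rng.normal(0, 0.5, 200),  # nearly a copy of temp_max
    "humedad_tarde": rng.normal(60, 10, 200),                     # independent of the others
})
reduced, dropped = drop_highly_correlated(toy, threshold=0.9)
print(dropped)
```

Which member of a pair is dropped depends only on column order in this sketch. A more careful version could keep the column that is individually more predictive of the target.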

In [6]:
initialize_dataset()
df_features = pd.read_csv("datasets/df_features.csv", low_memory = False, index_col = "id")
df_target = pd.read_csv("datasets/df_target.csv", low_memory = False, index_col = "id")

# We do not standardize, since NB estimates a mean and variance per feature and handles scale itself
initialize_dataset()
common(df_features, df_target)
viento_trigonometrico(df_features)
drop_correlacionadas(df_features)
drop_categoricas(df_features)
pipe = iterative_imputer()

In [7]:
pipe.steps.append(['nb', GaussianNB()])

grid2 = GridSearchCV(pipe, param_grid = {'nb__var_smoothing': np.logspace(-9, -7, num=8)},
                    verbose = 1, n_jobs = -1, cv = StratifiedKFold(3), scoring = SCORINGS, refit = METRIC)

grid2.fit(df_features, df_target.values.ravel())

grid2.best_score_

Fitting 3 folds for each of 8 candidates, totalling 24 fits


0.8345761011390066

In [8]:
tabla(grid2)

,param_nb__var_smoothing,mean_test_f1,mean_test_roc_auc,mean_test_accuracy,mean_test_recall,mean_test_precision
0,1.389e-08,0.58481,0.834576,0.823692,0.5546,0.618525
1,7.197e-09,0.589713,0.834233,0.816754,0.588207,0.591229
2,2.683e-08,0.573784,0.834118,0.828773,0.514796,0.648053
3,3.728e-09,0.590781,0.833492,0.810373,0.611383,0.571522
4,5.179e-08,0.555444,0.832906,0.831607,0.469885,0.679104
5,1.931e-09,0.589791,0.832815,0.805488,0.624564,0.558689
6,1e-09,0.59037,0.832394,0.803064,0.63386,0.552467
7,1e-07,0.526734,0.831241,0.831627,0.418514,0.710441


This came out very slightly better (ROC AUC 0.8346 versus 0.8331), which is somewhat surprising given that the model lost information. We can try to do better by transforming the features we dropped instead of removing them.

## Transforming correlated features

One thing we can try is merging each pair of highly correlated features into a single feature (afternoon and maximum temperature, early and minimum temperature, morning and afternoon pressure). We apply a linear PCA to each of these pairs.
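
The `pca` helper used below is defined in `preprocessing`, so only its call signature is visible in this notebook. A minimal sketch of what such a helper might do, assuming the columns contain no missing values (scikit-learn's `PCA` rejects NaN), is to replace the group of columns with its first principal component. The name `merge_with_pca` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def merge_with_pca(df, columns, new_name):
    """Replace `columns` in place with their first principal component, stored as `new_name`."""
    component = PCA(n_components=1).fit_transform(df[columns])
    df.drop(columns=columns, inplace=True)
    df[new_name] = component[:, 0]
    return df

rng = np.random.default_rng(0)
temp_max = rng.normal(25, 5, 200)
toy = pd.DataFrame({
    "temp_max": temp_max,
    "temperatura_tarde": temp_max - 1 + rng.normal(0, 0.5, 200),
    "humedad_tarde": rng.normal(60, 10, 200),
})
original_temp_max = toy["temp_max"].copy()
merge_with_pca(toy, ["temp_max", "temperatura_tarde"], "temp_altas")

# The merged feature should track the originals closely (the sign of a component is arbitrary).
merged_corr = abs(np.corrcoef(toy["temp_altas"], original_temp_max)[0, 1])
print(list(toy.columns), round(merged_corr, 3))
```

Note that merging each pair separately removes the correlation inside the pair, but not the correlation between the merged features themselves.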

In [9]:
initialize_dataset()
df_features = pd.read_csv("datasets/df_features.csv", low_memory = False, index_col = "id")
df_target = pd.read_csv("datasets/df_target.csv", low_memory = False, index_col = "id")

# We do not standardize, since NB estimates a mean and variance per feature and handles scale itself
initialize_dataset()
common(df_features, df_target)
viento_trigonometrico(df_features)
pca(df_features, ["temp_max", "temperatura_tarde"], "temp_altas")
pca(df_features, ["temperatura_temprano", "temp_min"], "temp_bajas")
pca(df_features, ["presion_atmosferica_tarde", "presion_atmosferica_temprano"], "presiones")
drop_categoricas(df_features)
pipe = iterative_imputer()

In [10]:
pares_ord_cov = df_features.corr().abs().unstack().sort_values(ascending=False).drop_duplicates()
pares_ord_cov = pares_ord_cov[pares_ord_cov < 1]
display(pares_ord_cov.to_frame("|Correlación|").head(5))

,,|Correlación|
temp_altas,temp_bajas,0.819398
horas_de_sol,nubosidad_tarde,0.702506
sin_viento_tarde,sin_rafaga_viento_max_direccion,0.699735
rafaga_viento_max_velocidad,velocidad_viento_tarde,0.685499
nubosidad_temprano,horas_de_sol,0.674629


Even after that, the features are still fairly correlated: the two merged temperature features have a correlation of 0.82 with each other.

In [11]:
pipe.steps.append(['nb', GaussianNB()])

grid3 = GridSearchCV(pipe, param_grid = {'nb__var_smoothing': np.logspace(-9, -7, num=8)},
                    verbose = 1, n_jobs = -1, cv = StratifiedKFold(3), scoring = SCORINGS, refit = METRIC)

grid3.fit(df_features, df_target.values.ravel())

grid3.best_score_

Fitting 3 folds for each of 8 candidates, totalling 24 fits


0.8320783491851048

In [12]:
tabla(grid3)

,param_nb__var_smoothing,mean_test_f1,mean_test_roc_auc,mean_test_accuracy,mean_test_recall,mean_test_precision
0,1.389e-08,0.582697,0.832078,0.822432,0.553727,0.614882
1,2.683e-08,0.572612,0.831895,0.827747,0.515407,0.644116
2,7.197e-09,0.587626,0.831694,0.815689,0.586549,0.588709
3,5.179e-08,0.556043,0.831331,0.830698,0.473551,0.673342
4,3.728e-09,0.588114,0.831002,0.809093,0.608764,0.568819
5,1e-07,0.531043,0.830661,0.831441,0.426283,0.704074
6,1.931e-09,0.588353,0.83038,0.804618,0.623647,0.556844
7,1e-09,0.588504,0.829995,0.801911,0.632682,0.550097


This came out slightly worse (ROC AUC 0.8321). Next we try merging all the temperatures into a single feature.

In [13]:
initialize_dataset()
df_features = pd.read_csv("datasets/df_features.csv", low_memory = False, index_col = "id")
df_target = pd.read_csv("datasets/df_target.csv", low_memory = False, index_col = "id")

# We do not standardize, since NB estimates a mean and variance per feature and handles scale itself
initialize_dataset()
common(df_features, df_target)
viento_trigonometrico(df_features)
pca(df_features, ["temp_max", "temperatura_tarde", "temperatura_temprano", "temp_min"], "temps")
pca(df_features, ["presion_atmosferica_tarde", "presion_atmosferica_temprano"], "presiones")
drop_categoricas(df_features)
pipe = iterative_imputer()

In [14]:
pares_ord_cov = df_features.corr().abs().unstack().sort_values(ascending=False).drop_duplicates()
pares_ord_cov = pares_ord_cov[pares_ord_cov < 1]
display(pares_ord_cov.to_frame("|Correlación|").head(5))

,,|Correlación|
nubosidad_tarde,horas_de_sol,0.702506
sin_rafaga_viento_max_direccion,sin_viento_tarde,0.699735
rafaga_viento_max_velocidad,velocidad_viento_tarde,0.685499
nubosidad_temprano,horas_de_sol,0.674629
humedad_tarde,humedad_temprano,0.667982


In [15]:
pipe.steps.append(['nb', GaussianNB()])

grid4 = GridSearchCV(pipe, param_grid = {'nb__var_smoothing': np.logspace(-9, -7, num=8)},
                    verbose = 1, n_jobs = -1, cv = StratifiedKFold(3), scoring = SCORINGS, refit = METRIC)

grid4.fit(df_features, df_target.values.ravel())

grid4.best_score_

Fitting 3 folds for each of 8 candidates, totalling 24 fits


0.833071794742604

In [16]:
tabla(grid4)

,param_nb__var_smoothing,mean_test_f1,mean_test_roc_auc,mean_test_accuracy,mean_test_recall,mean_test_precision
0,1.389e-08,0.581193,0.833072,0.824259,0.544649,0.623012
1,2.683e-08,0.568719,0.832825,0.828529,0.504975,0.650887
2,7.197e-09,0.587003,0.832592,0.817898,0.578038,0.596254
3,5.179e-08,0.547618,0.832011,0.830112,0.459279,0.678041
4,3.728e-09,0.58796,0.831754,0.811272,0.601432,0.575078
5,1.931e-09,0.588407,0.831007,0.806807,0.616795,0.56252
6,1e-07,0.520787,0.830936,0.830376,0.411705,0.708521
7,1e-09,0.588027,0.830541,0.803778,0.62548,0.554809


These transformations do not seem to have had much effect: the best ROC AUC (0.8331) is practically the same as that of the initial model.

## Without discrete features

We can also try removing the discrete features: the Gaussian variant models continuous features better, so the discrete ones may be hurting the model.
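
The `drop_discretas` helper comes from `preprocessing` and is not shown in this notebook. As an illustration only, one way to detect candidate discrete columns is to look for numeric columns that hold whole numbers with few distinct values. The function name `find_discrete_columns`, the `max_unique=10` cutoff, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def find_discrete_columns(df, max_unique=10):
    """Return numeric columns holding only whole numbers with at most max_unique distinct values."""
    discrete = []
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        if len(values) == 0:
            continue  # nothing to judge in an all-missing column
        if values.nunique() <= max_unique and np.all(values == values.round()):
            discrete.append(col)
    return discrete

rng = np.random.default_rng(0)
cloud = rng.integers(0, 9, 100).astype(float)
cloud[::10] = np.nan  # missing values are ignored by the check
toy = pd.DataFrame({
    "nubosidad_tarde": cloud,                  # whole numbers in a small range
    "humedad_tarde": rng.normal(60, 10, 100),  # continuous
})
discrete_cols = find_discrete_columns(toy)
print(discrete_cols)
```

The returned list could then be passed to `DataFrame.drop(columns=...)` before fitting.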

In [17]:
initialize_dataset()
df_features = pd.read_csv("datasets/df_features.csv", low_memory = False, index_col = "id")
df_target = pd.read_csv("datasets/df_target.csv", low_memory = False, index_col = "id")

# We do not standardize, since NB estimates a mean and variance per feature and handles scale itself
initialize_dataset()
common(df_features, df_target)
viento_trigonometrico(df_features)
drop_discretas(df_features)
drop_categoricas(df_features)
pipe = iterative_imputer()

In [18]:
pipe.steps.append(['nb', GaussianNB()])

grid5 = GridSearchCV(pipe, param_grid = {'nb__var_smoothing': np.logspace(-9, -7, num=8)},
                    verbose = 1, n_jobs = -1, cv = StratifiedKFold(3), scoring = SCORINGS, refit = METRIC)

grid5.fit(df_features, df_target.values.ravel())

grid5.best_score_

Fitting 3 folds for each of 8 candidates, totalling 24 fits


0.8301420096890225

In [19]:
tabla(grid5)

,param_nb__var_smoothing,mean_test_f1,mean_test_roc_auc,mean_test_accuracy,mean_test_recall,mean_test_precision
0,5.179e-08,0.550521,0.830142,0.829604,0.466088,0.672313
1,2.683e-08,0.56476,0.830128,0.827728,0.499214,0.650125
2,1.389e-08,0.574973,0.830109,0.825685,0.526624,0.633105
3,7.197e-09,0.58138,0.830048,0.823692,0.546831,0.620598
4,1e-07,0.528581,0.830024,0.830991,0.423228,0.703783
5,3.728e-09,0.583924,0.829953,0.821572,0.559227,0.610908
6,1.931e-09,0.585875,0.829872,0.819989,0.568741,0.604073
7,1e-09,0.587564,0.829787,0.819168,0.575332,0.600328


This does not seem to have helped much either (ROC AUC 0.8301).

## With feature selection

Since choosing features by hand did not go very well, we can try different feature combinations and see which ones work best. We run forward selection keeping half of the features, with a smoothing of $10^{-8}$. In this case the selector did not accept pipelines as estimators, so we apply the imputer beforehand for the selection step.
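
To show what forward selection does, here is a hand-written sketch of the greedy loop: start with no features and repeatedly add the one that gives the best mean cross-validated score. It is a simplified illustration on synthetic data, not scikit-learn's `SequentialFeatureSelector` implementation. The function name `forward_select` and the dataset are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_select(X, y, n_features, estimator, cv, scoring="roc_auc"):
    """Greedily add the feature that yields the best mean cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv, scoring=scoring).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)  # candidate with the highest mean score
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (assumption): 8 features, of which 3 are informative and 2 redundant.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)
chosen = forward_select(X, y, n_features=4,
                        estimator=GaussianNB(var_smoothing=1e-8), cv=StratifiedKFold(3))
print(sorted(chosen))
```

Because each step evaluates every remaining feature with cross-validation, the cost grows with the number of features times the number selected, which is why the notebook runs the selector with `n_jobs = -1`.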

In [20]:
initialize_dataset()
df_features = pd.read_csv("datasets/df_features.csv", low_memory = False, index_col = "id")
df_target = pd.read_csv("datasets/df_target.csv", low_memory = False, index_col = "id")

# We do not standardize, since NB estimates a mean and variance per feature and handles scale itself
initialize_dataset()
common(df_features, df_target)
viento_trigonometrico(df_features)
drop_categoricas(df_features)
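# Explicit imports, in case `preprocessing` does not re-export IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(random_state = 123)
df = imputer.fit_transform(df_features)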

In [21]:
seleccion = SequentialFeatureSelector(GaussianNB(var_smoothing = 1e-8), n_features_to_select = 0.5, direction = "forward",
                                   scoring = METRIC, cv = StratifiedKFold(3), n_jobs = -1)

seleccion.fit_transform(df, df_target.values.ravel())

None

In [22]:
df_features.columns[seleccion.support_]

Index(['dia', 'horas_de_sol', 'humedad_tarde', 'presion_atmosferica_tarde',
       'rafaga_viento_max_velocidad', 'temp_min', 'cos_viento_tarde',
       'sin_viento_tarde', 'sin_viento_temprano',
       'cos_rafaga_viento_max_direccion', 'sin_rafaga_viento_max_direccion'],
      dtype='object')

In [23]:
initialize_dataset()
df_features = pd.read_csv("datasets/df_features.csv", low_memory = False, index_col = "id")
df_target = pd.read_csv("datasets/df_target.csv", low_memory = False, index_col = "id")

# We do not standardize, since NB estimates a mean and variance per feature and handles scale itself
initialize_dataset()
common(df_features, df_target)
viento_trigonometrico(df_features)
drop_categoricas(df_features)
pipe = iterative_imputer()

# Keep only the features chosen by the selector

df_features = df_features[df_features.columns[seleccion.support_]]

In [24]:
pipe.steps.append(['nb', GaussianNB()])

# We use GridSearchCV with a single parameter value because it simplifies k-fold and multi-metric scoring
grid6 = GridSearchCV(pipe, param_grid = {'nb__var_smoothing': [1e-8]},
                    verbose = 1, n_jobs = -1, cv = StratifiedKFold(3), scoring = SCORINGS, refit = METRIC)

grid6.fit(df_features, df_target.values.ravel())

grid6.best_score_

Fitting 3 folds for each of 1 candidates, totalling 3 fits


0.8503238795973561

In [26]:
tabla(grid6)

,param_nb__var_smoothing,mean_test_f1,mean_test_roc_auc,mean_test_accuracy,mean_test_recall,mean_test_precision
0,1e-08,0.58564,0.850324,0.834793,0.521473,0.667816


Feature selection improved the result somewhat more than the previous changes did (ROC AUC 0.850 versus at most 0.835), although still not by much. Recall remains fairly poor (0.52), but even so this is the best Naive Bayes model we obtained.