<center> <span style="color:indigo">Machine Learning and Bayesian Inference</span> </center> 

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/5/5e/Logo-cucea.png" alt="Drawing" style="width: 600px;"/>
</center>
    
<center> <span style="color:DarkBlue">  Topic 10: Probability, Naive Bayes, Titanic data </span>  </center>
<center> <span style="color:Blue"> M. en C. Iván A. Toledano Juárez </span>  </center>

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns

%matplotlib inline

## Importing Data

In [2]:
df_titanic = pd.read_csv('../data/titanic/titanic.csv', usecols=["survived", "pclass", "sex", "age", "sibsp", "parch"]).dropna()

df_titanic.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch
0,1,1,female,29.0,0,0
1,1,1,male,0.92,1,2
2,1,0,female,2.0,1,2
3,1,0,male,30.0,1,2
4,1,0,female,25.0,1,2


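After dropping rows with missing values, the dataset loaded above contains information on 1046 passengers of the Titanic, with the following variables:

* **Survived** (`survived`): whether the passenger survived (1 = yes, 0 = no)
* **Passenger Class** (`pclass`): 1, 2 or 3
* **Gender** (`sex`): `female` or `male`
* **Age** (`age`): age in years
* **Number of siblings/spouses aboard** (`sibsp`)
* **Number of parents/children aboard** (`parch`)

The original description also lists **Fare**, which is not among the columns loaded here.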

In [3]:
## Check the unique values of each variable
for var in df_titanic.columns:
    df_t = df_titanic.groupby(var).size()
    print(df_t,"\n")

pclass
1    284
2    261
3    501
dtype: int64 

survived
0    619
1    427
dtype: int64 

sex
female    388
male      658
dtype: int64 

age
0.17     1
0.33     1
0.42     1
0.67     1
0.75     3
        ..
70.50    1
71.00    2
74.00    1
76.00    1
80.00    1
Length: 98, dtype: int64 

sibsp
0    685
1    280
2     36
3     16
4     22
5      6
8      1
dtype: int64 

parch
0    768
1    160
2     97
3      8
4      5
5      6
6      2
dtype: int64 



We see that there is little data for passengers with 5 or 8 siblings/spouses aboard (`sibsp`), and also for those with 3, 4, 5 or 6 parents/children aboard (`parch`). Probabilities estimated from so few observations are unreliable, so these values must be treated with care.

We now process the categorical variables, converting each one into binary (0/1) dummy variables.
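To make the risk concrete: the frequency table further down reports that 0 of the 427 survivors had `sibsp` = 5, so the raw estimate of that conditional probability is exactly zero and wipes out the whole Naive Bayes product. A minimal sketch of additive smoothing follows, assuming a binary feature; the helper name `smoothed_prob` is hypothetical, and `alpha` plays the same role as the smoothing parameter passed to scikit-learn later in this notebook.

```python
def smoothed_prob(count, n_class, alpha=1.0, n_values=2):
    """Additive (Laplace/Lidstone) estimate of P(feature = 1 | class).

    count    : rows of the class where the feature equals 1
    n_class  : total rows of the class
    n_values : number of values the feature can take (2 for a dummy)
    """
    return (count + alpha) / (n_class + alpha * n_values)


n_survived = 427           # survivors in the cleaned dataset
count_sibsp5_survived = 0  # survivors with sibsp = 5

raw = count_sibsp5_survived / n_survived
smoothed = smoothed_prob(count_sibsp5_survived, n_survived, alpha=1.0)

print("raw estimate:     ", raw)
print("smoothed estimate:", smoothed)
```

With `alpha=1.0` the estimate becomes 1/429 instead of 0, so a single unseen combination no longer forces the class probability to zero.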

In [4]:
# Generate binary (0/1) dummy variables for each of these categories (all columns except survived and age).

categories = ['pclass', 'sex', 'sibsp', 'parch']
df_titanic_dum = df_titanic.copy()

for var in categories:
    df_dummy = pd.get_dummies(df_titanic[var], prefix = var, dtype=int ) # dataframe with the dummy variables
    df_titanic_dum = df_titanic_dum.join(df_dummy) # append the dummy dataframe to the original dataframe
    df_titanic_dum = df_titanic_dum.drop(var,axis=1) # drop the now-redundant column (axis=1 means columns)

df_titanic_dum

Unnamed: 0,survived,age,pclass_1,pclass_2,pclass_3,sex_female,sex_male,sibsp_0,sibsp_1,sibsp_2,...,sibsp_4,sibsp_5,sibsp_8,parch_0,parch_1,parch_2,parch_3,parch_4,parch_5,parch_6
0,1,29.00,1,0,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
1,1,0.92,1,0,0,0,1,0,1,0,...,0,0,0,0,0,1,0,0,0,0
2,0,2.00,1,0,0,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,0,30.00,1,0,0,0,1,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,0,25.00,1,0,0,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,0,45.50,0,0,1,0,1,1,0,0,...,0,0,0,1,0,0,0,0,0,0
1304,0,14.50,0,0,1,1,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
1306,0,26.50,0,0,1,0,1,1,0,0,...,0,0,0,1,0,0,0,0,0,0
1307,0,27.00,0,0,1,0,1,1,0,0,...,0,0,0,1,0,0,0,0,0,0
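The loop above can also be written as a single call, since `pd.get_dummies` accepts a `columns` argument and then encodes and drops those columns itself. The sketch below uses a tiny made-up frame (`toy`, not the Titanic data) that only borrows the notebook's column names.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the notebook
toy = pd.DataFrame({
    "survived": [1, 0, 1],
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "female"],
    "age": [29.0, 30.0, 2.0],
})

# Encode the listed columns in one step; the untouched columns come first,
# followed by the dummies, and the original categorical columns are removed.
toy_dum = pd.get_dummies(toy, columns=["pclass", "sex"], dtype=int)

print(toy_dum.columns.tolist())
```

The resulting layout matches what the explicit join/drop loop produces, so either form can feed the rest of the analysis.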


In [5]:
# Frequency table

df_freq = df_titanic_dum.groupby('survived').sum() # group by the dependent variable and sum the dummies to get frequencies
df_freq

survived,age,pclass_1,pclass_2,pclass_3,sex_female,sex_male,sibsp_0,sibsp_1,sibsp_2,sibsp_3,sibsp_4,sibsp_5,sibsp_8,parch_0,parch_1,parch_2,parch_3,parch_4,parch_5,parch_6
0,18907.58,103,146,370,96,523,430,133,20,10,19,6,1,498,65,42,3,4,5,2
1,12348.09,181,115,131,292,135,255,147,16,6,3,0,0,270,95,55,5,1,1,0


In [6]:
# Table of row counts per value of "survived"

df_count = df_titanic_dum.groupby('survived').count()
df_count

survived,age,pclass_1,pclass_2,pclass_3,sex_female,sex_male,sibsp_0,sibsp_1,sibsp_2,sibsp_3,sibsp_4,sibsp_5,sibsp_8,parch_0,parch_1,parch_2,parch_3,parch_4,parch_5,parch_6
0,619,619,619,619,619,619,619,619,619,619,619,619,619,619,619,619,619,619,619,619
1,427,427,427,427,427,427,427,427,427,427,427,427,427,427,427,427,427,427,427,427


In [7]:
# Probability table
df_prob = df_freq.div(df_count) # divide the two previous dataframes entry by entry
df_prob

survived,age,pclass_1,pclass_2,pclass_3,sex_female,sex_male,sibsp_0,sibsp_1,sibsp_2,sibsp_3,sibsp_4,sibsp_5,sibsp_8,parch_0,parch_1,parch_2,parch_3,parch_4,parch_5,parch_6
0,30.545363,0.166397,0.235864,0.597738,0.155089,0.844911,0.694669,0.214863,0.03231,0.016155,0.030695,0.009693,0.001616,0.804523,0.105008,0.067851,0.004847,0.006462,0.008078,0.003231
1,28.918244,0.423888,0.269321,0.306792,0.683841,0.316159,0.59719,0.344262,0.037471,0.014052,0.007026,0.0,0.0,0.632319,0.222482,0.128806,0.01171,0.002342,0.002342,0.0


In [8]:
# Probability table
# Or simply compute the mean of each column per category

df_prob = df_titanic_dum.groupby('survived').mean()
df_prob

survived,age,pclass_1,pclass_2,pclass_3,sex_female,sex_male,sibsp_0,sibsp_1,sibsp_2,sibsp_3,sibsp_4,sibsp_5,sibsp_8,parch_0,parch_1,parch_2,parch_3,parch_4,parch_5,parch_6
0,30.545363,0.166397,0.235864,0.597738,0.155089,0.844911,0.694669,0.214863,0.03231,0.016155,0.030695,0.009693,0.001616,0.804523,0.105008,0.067851,0.004847,0.006462,0.008078,0.003231
1,28.918244,0.423888,0.269321,0.306792,0.683841,0.316159,0.59719,0.344262,0.037471,0.014052,0.007026,0.0,0.0,0.632319,0.222482,0.128806,0.01171,0.002342,0.002342,0.0


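Each of these entries is a conditional probability given the class, for example $\rm P(Pclass_1|survived)$, $\rm P(Pclass_2|survived)$, and so on. The exception is the `age` column, which holds the mean age of each group rather than a probability.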

The idea of the Naive Bayes algorithm is to use Bayes' theorem to compute the inverse conditional probabilities, for example,

\begin{equation}
\scriptsize
\rm P(survived=1|Pclass_1=1, Pclass_2=0, Pclass_3=0, Sex_0=1, Sex_1=0, \ldots) = \frac{P(Pclass_1=1, Pclass_2=0, Pclass_3=0, Sex_0=1, Sex_1=0, \ldots|survived=1) \times P(survived=1)}{P(Pclass_1=1, Pclass_2=0, Pclass_3=0, Sex_0=1, Sex_1=0, \ldots)}
\end{equation}

\begin{equation}
\scriptsize
\rm P(survived=0|Pclass_1=1, Pclass_2=0, Pclass_3=0, Sex_0=1, Sex_1=0, \ldots) = \frac{P(Pclass_1=1, Pclass_2=0, Pclass_3=0, Sex_0=1, Sex_1=0, \ldots|survived=0) \times P(survived=0)}{P(Pclass_1=1, Pclass_2=0, Pclass_3=0, Sex_0=1, Sex_1=0, \ldots)}
\end{equation}

<br/><br/>


where the input variables are assumed to be independent of one another given the class, so that,

\begin{equation}
\scriptsize
\rm P(survived=1|Pclass_1=1, Pclass_2=0, Pclass_3=0, \ldots) = \frac{P(Pclass_1=1|survived=1) \times P(Pclass_2=0|survived=1) \times P(Pclass_3=0|survived=1) \times \ldots \times P(survived=1)}{P(Pclass_1=1, Pclass_2=0, Pclass_3=0,\ldots)}
\end{equation}

\begin{equation}
\scriptsize
\rm P(survived=0|Pclass_1=1, Pclass_2=0, Pclass_3=0, \ldots) = \frac{P(Pclass_1=1|survived=0) \times P(Pclass_2=0|survived=0) \times P(Pclass_3=0|survived=0) \times \ldots \times P(survived=0)}{P(Pclass_1=1, Pclass_2=0, Pclass_3=0,\ldots)}
\end{equation}

<br/><br/>

Since the denominator is the same in both cases, comparing these probabilities only requires computing the numerators,


\begin{equation}
\scriptsize
\rm P(survived=1|Pclass_1=1, Pclass_2=0, Pclass_3=0, \ldots) \propto P(Pclass_1=1|survived=1) \times P(Pclass_2=0|survived=1) \times P(Pclass_3=0|survived=1) \times \ldots \times P(survived=1)
\end{equation}

\begin{equation}
\scriptsize
\rm P(survived=0|Pclass_1=1, Pclass_2=0, Pclass_3=0, \ldots) \propto P(Pclass_1=1|survived=0) \times P(Pclass_2=0|survived=0) \times P(Pclass_3=0|survived=0) \times \ldots \times P(survived=0)
\end{equation}

<br/><br/>

In the end, we only have to compare the two probabilities $\rm P(survived=1|Pclass_1=1, Pclass_2=0, Pclass_3=0, \ldots)$ and $\rm P(survived=0|Pclass_1=1, Pclass_2=0, Pclass_3=0, \ldots)$, and keep the label with the higher value.
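As a worked instance of this comparison, the sketch below scores one hypothetical passenger (first class, female) with the entries of the probability table shown earlier. For brevity it assumes only the `pclass` and `sex` dummies are used, whereas the notebook multiplies over every dummy column. It also sums logarithms instead of multiplying, a common way to avoid underflow that the original cells do not use; the names `prior`, `cond`, `passenger` and `log_score` are illustrative.

```python
import math

# Priors from the class counts (619 non-survivors, 427 survivors)
prior = {0: 619 / 1046, 1: 427 / 1046}

# P(dummy = 1 | survived), copied from the probability table
cond = {
    0: {"pclass_1": 0.166397, "pclass_2": 0.235864, "pclass_3": 0.597738,
        "sex_female": 0.155089, "sex_male": 0.844911},
    1: {"pclass_1": 0.423888, "pclass_2": 0.269321, "pclass_3": 0.306792,
        "sex_female": 0.683841, "sex_male": 0.316159},
}

# Hypothetical passenger: first class, female
passenger = {"pclass_1": 1, "pclass_2": 0, "pclass_3": 0,
             "sex_female": 1, "sex_male": 0}


def log_score(label):
    """Log of the numerator: log P(label) + sum of log P(x_i | label)."""
    total = math.log(prior[label])
    for var, value in passenger.items():
        p = cond[label][var]
        total += math.log(p if value == 1 else 1.0 - p)
    return total


log_scores = {label: log_score(label) for label in (0, 1)}
prediction = max(log_scores, key=log_scores.get)

# Dividing by the sum of both numerators recovers the shared denominator
shift = max(log_scores.values())
unnorm = {k: math.exp(v - shift) for k, v in log_scores.items()}
posterior = {k: v / sum(unnorm.values()) for k, v in unnorm.items()}

print("prediction:", prediction)
print("posterior: ", posterior)
```

None of the probabilities used here is exactly 0 or 1, so every logarithm is defined; a table entry of 0.0 would need smoothing first.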

In [9]:
## Compute the conditional probability for every record of the dummy dataframe

# All binary dummy variables: drop the target and also the continuous "age" column,
# which is not a 0/1 indicator (its df_prob entry is a mean age, not a probability)
categories = df_titanic_dum.columns.drop(["survived", "age"])

pred_list=[] # list of predictions

# Prior probabilities P(survived=0) and P(survived=1), estimated from the dataset
p_target_0 = df_titanic_dum.groupby('survived').size()[0]/df_titanic_dum.shape[0]
p_target_1 = df_titanic_dum.groupby('survived').size()[1]/df_titanic_dum.shape[0]

# iterate over all records (rows) with the iterrows() method
for index, row in df_titanic_dum.iterrows():
    
    p_0 = p_target_0 # initialize the running product for survived=0 with its prior
    p_1 = p_target_1 # initialize the running product for survived=1 with its prior
    
    # iterate over the variables, updating both cases (survived=0 and survived=1)
    for var in categories:
        if row[var]==1:
            p_0 *= df_prob[var].loc[0] # multiply by the matching entry of the probability table
            p_1 *= df_prob[var].loc[1]
        elif row[var]==0:
            p_0 *= 1.0-df_prob[var].loc[0] # multiply by the complement of the matching entry
            p_1 *= 1.0-df_prob[var].loc[1]
    
    pred = np.where(p_1>p_0,1,0) # compare the two products and pick the label of the larger one
    pred_list.append(int(pred)) # append it to the list of predictions
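The row-by-row loop is easy to follow but slow on large tables. Below is a vectorised sketch of the same Bernoulli-style scoring on a made-up binary matrix, not the Titanic data. It works in log space and clips the probabilities with a hypothetical constant `eps` so that entries equal to 0 or 1 do not produce `log(0)`; both choices are assumptions of this sketch rather than part of the cell above.

```python
import numpy as np

# Hypothetical binary features (rows = records) and labels
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1]])
y = np.array([1, 0, 1, 0])

classes = np.unique(y)
eps = 1e-9  # clipping constant, an assumption of this sketch

# log P(class) and P(feature = 1 | class), one row per class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
theta = np.array([X[y == c].mean(axis=0) for c in classes])
theta = np.clip(theta, eps, 1 - eps)

# log P(class) + sum_i [ x_i log(theta_i) + (1 - x_i) log(1 - theta_i) ]
log_joint = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + log_prior

# Pick the class with the larger log score for every row at once
preds = classes[np.argmax(log_joint, axis=1)]
print(preds)
```

The matrix products replace the inner loop over variables, and `argmax` replaces the comparison between `p_1` and `p_0`.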

In [10]:
# Create a dataframe holding the original labels
# and add a column with the predicted labels

df_survived = pd.DataFrame(
{
    'Actual':df_titanic['survived'],
    'Preds':pred_list
})
df_survived

Unnamed: 0,Actual,Preds
0,1,1
1,1,1
2,0,1
3,0,1
4,0,1
...,...,...
1301,0,0
1304,0,1
1306,0,0
1307,0,0


In [11]:
# Build a confusion matrix comparing these two variables
conf_mat = pd.crosstab(df_survived.Preds,df_survived.Actual)
accuracy = (conf_mat[0][0]+conf_mat[1][1])/ \
           (conf_mat[0][0]+conf_mat[1][0]+conf_mat[0][1]+conf_mat[1][1])
print('accuracy=',round(accuracy*100,2),'%')
conf_mat

accuracy= 77.72 %


Preds \ Actual,0,1
0,502,116
1,117,311
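The accuracy printed above can be rechecked by hand from the four cells of the crosstab, and the same counts give precision and recall for the `survived = 1` class. The sketch below copies the counts from the table; the variable names are illustrative.

```python
# Counts copied from the crosstab above (rows = Preds, columns = Actual)
tn, fn = 502, 116  # predicted 0: actual 0, actual 1
fp, tp = 117, 311  # predicted 1: actual 0, actual 1

total = tn + fn + fp + tp
accuracy = (tn + tp) / total   # fraction of correct predictions
precision = tp / (tp + fp)     # of those predicted to survive, how many did
recall = tp / (tp + fn)        # of those who survived, how many were found

print("accuracy =", round(accuracy * 100, 2), "%")
print("precision =", round(precision * 100, 2), "%")
print("recall =", round(recall * 100, 2), "%")
```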


## Naive Bayes con Scikit-learn

In [12]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, CategoricalNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split 

In [14]:
# Multinomial NB

alpha=0.001 # sklearn's additive (Laplace/Lidstone) smoothing parameter for naive Bayes

# independent (features) and dependent (target) variables
features = df_titanic_dum.columns.drop("survived")
target = 'survived'

# Initialize a model with its parameters
model = MultinomialNB(alpha=alpha) 

# Train the model on the given data
model.fit(X=df_titanic_dum[features],
          y=df_titanic_dum[target])

# Store the predictions of the trained model
preds = model.predict(X=df_titanic_dum[features])

# Compute a metric of interest
accuracy = accuracy_score(y_true=df_titanic_dum[target],
                          y_pred=preds)
print('accuracy=',round(accuracy*100,2),'%')

# Store and display the confusion matrix
conf_mat = confusion_matrix(y_true=df_titanic_dum[target],
                                    y_pred=preds)
pd.DataFrame(data = conf_mat, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])


accuracy= 79.06 %


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,519,100
Actual 1,119,308


In [16]:
# Bernoulli NB

alpha=0.001 # sklearn's additive (Laplace/Lidstone) smoothing parameter for naive Bayes

# independent (features) and dependent (target) variables
features = df_titanic_dum.columns.drop("survived")
target = 'survived'

# Initialize a model with its parameters
model = BernoulliNB(alpha=alpha) 

# Train the model on the given data
model.fit(X=df_titanic_dum[features],
          y=df_titanic_dum[target])

# Store the predictions of the trained model
preds = model.predict(X=df_titanic_dum[features])

# Compute a metric of interest
accuracy = accuracy_score(y_true=df_titanic_dum[target],
                          y_pred=preds)
print('accuracy=',round(accuracy*100,2),'%')

# Store and display the confusion matrix
conf_mat = confusion_matrix(y_true=df_titanic_dum[target],
                                    y_pred=preds)
pd.DataFrame(data = conf_mat, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])


accuracy= 77.72 %


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,502,117
Actual 1,116,311
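The Bernoulli variant gives the same 77.72 % as the manual calculation, which is what one would expect if both estimate P(feature = 1 | class) from per-class relative frequencies. The sketch below checks that reading on a made-up binary matrix: the probabilities stored by `BernoulliNB` in `feature_log_prob_` are compared against (count + alpha) / (class size + 2·alpha) computed by hand. Note also that `BernoulliNB` binarizes its input at a threshold of 0.0 by default, so a positive continuous column such as `age` becomes a constant 1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features and labels
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 0]])
y = np.array([1, 0, 1, 0, 1])

alpha = 1.0
model = BernoulliNB(alpha=alpha).fit(X, y)

# Smoothed relative frequencies computed by hand, one row per class
expected = np.array([
    (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
    for c in model.classes_
])

# Probabilities learned by scikit-learn (stored as logarithms)
learned = np.exp(model.feature_log_prob_)

print(np.allclose(learned, expected))
```

As `alpha` shrinks toward 0 these smoothed values approach the plain group means used in the manual probability table.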


In [17]:
# Categorical NB

alpha=0.001 # sklearn's additive (Laplace/Lidstone) smoothing parameter for naive Bayes

# independent (features) and dependent (target) variables
features = df_titanic_dum.columns.drop("survived")
target = 'survived'

# Initialize a model with its parameters
model = CategoricalNB(alpha=alpha) 

# Train the model on the given data
model.fit(X=df_titanic_dum[features],
          y=df_titanic_dum[target])

# Store the predictions of the trained model
preds = model.predict(X=df_titanic_dum[features])

# Compute a metric of interest
accuracy = accuracy_score(y_true=df_titanic_dum[target],
                          y_pred=preds)
print('accuracy=',round(accuracy*100,2),'%')

# Store and display the confusion matrix
conf_mat = confusion_matrix(y_true=df_titanic_dum[target],
                                    y_pred=preds)
pd.DataFrame(data = conf_mat, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])


accuracy= 79.45 %


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,513,106
Actual 1,109,318
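All accuracies above are measured on the same rows used for fitting, so they are likely optimistic. `train_test_split` is imported earlier but never called; the sketch below shows one way it could be used to score a model on held-out rows. The data are synthetic (generated with a fixed seed), and the 25 % test size, `random_state` and `alpha` values are assumptions of this sketch, not settings from the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary data: feature 0 agrees with the label about 80% of the
# time, feature 1 is pure noise
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
f0 = np.where(rng.random(n) < 0.8, y, 1 - y)
f1 = rng.integers(0, 2, size=n)
X = np.column_stack([f0, f1])

# Hold out a quarter of the rows, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training part only, then score on the unseen part
model = BernoulliNB(alpha=1.0).fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))

print("held-out accuracy =", round(test_acc * 100, 2), "%")
```

Applied to `df_titanic_dum[features]` and `df_titanic_dum[target]`, the same pattern would give an accuracy estimate on passengers the model has not seen.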
