# Logistic Regression Example
<img src="https://raw.githubusercontent.com/fhernanb/fhernanb.github.io/master/docs/logo_unal_color.png" alt="drawing" width="200"/>

# Objective
The goal of this example is to build a classifier that predicts whether a person survived the sinking of the Titanic.

<img src="https://raw.githubusercontent.com/fhernanb/Python-para-estadistica/master/imagenes/titanic.png" alt="drawing" width="900">

In [1]:
import pandas as pd  # Library providing the read_table and read_csv functions
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

The example data is loaded as follows:

In [2]:
dt = pd.read_csv("titanic.csv", sep=",")
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [3]:
pclass = pd.get_dummies(dt['Pclass'])
dt = pd.concat([dt, pclass], axis=1)
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,1,2,3
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,0,0,1
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,1,0,0
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,0,0,1
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,1,0,0
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,0,0,1


In [4]:
sex = pd.get_dummies(dt['Sex'])
dt = pd.concat([dt, sex], axis=1)
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,1,2,3,female,male
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,0,0,1,0,1
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,1,0,0,1,0
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,0,0,1,1,0
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,1,0,0,1,0
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,0,0,1,0,1
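
The two cells above create one indicator column per category. In a regression, one level of each variable is usually left out as the reference, which is why the model further down uses only `2`, `3` and `male`. A minimal sketch of producing that encoding in one step, on a made-up three-row frame (the `toy` frame and the resulting `Pclass_`/`Sex_` column names are assumptions for illustration, not part of this notebook):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the encoding
toy = pd.DataFrame({"Pclass": [3, 1, 2], "Sex": ["male", "female", "female"]})

# drop_first=True omits the first level of each variable (the reference level)
dummies = pd.get_dummies(toy, columns=["Pclass", "Sex"], drop_first=True, dtype=int)
print(dummies)
```

Using string column names such as `Pclass_2` also avoids mixing integer and string labels in the same design matrix.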


In [27]:
dt['clase'] = dt['Pclass']
dt['clase'] = dt['clase'].astype(str)  # Store the class as text so it is treated as categorical
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,1,2,3,female,male,clase
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,0,0,1,0,1,3
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,1,0,0,1,0,1
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,0,0,1,1,0,3
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,1,0,0,1,0,1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,0,0,1,0,1,3


## Creating the training (train) and validation (test) data
The original data is partitioned with the `train_test_split` function; for details on its parameters, see this [link](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [6]:
datos = dt[[2, 3, 'male', 'Age', 'Fare']]
respuesta = dt['Survived']
X_train, X_test, y_train, y_test = train_test_split(datos, respuesta, test_size=0.25)

print(X_train.head())
print()
print(y_train.head())

     2  3  male   Age      Fare
635  0  1     0  41.0   39.6875
697  0  0     0  18.0  227.5250
72   0  1     1  26.0   14.4542
40   0  1     0  40.0    9.4750
847  0  1     1  74.0    7.7750

635    0
697    1
72     0
40     0
847    0
Name: Survived, dtype: int64
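
The split above is random, so the rows selected (and every number that follows) change from run to run. A minimal sketch of a reproducible split that also keeps the class proportions similar in both parts, on made-up data (the toy `X` and `y`, and the choice of `random_state=0` and `stratify=y`, are assumptions for illustration and were not used in this notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 40% of them in the positive class
X = pd.DataFrame({"x": np.arange(100)})
y = pd.Series([1] * 40 + [0] * 60, name="y")

# random_state fixes the shuffle; stratify keeps the share of each class
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```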


# Logistic regression using `sklearn`

In [7]:
mod1 = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)

In [8]:
mod1.predict_proba(X_train) # P(Y=0) and P(Y=1) for the training data

array([[0.53402564, 0.46597436],
       [0.18449043, 0.81550957],
       [0.76087624, 0.23912376],
       ...,
       [0.77960191, 0.22039809],
       [0.60505502, 0.39494498],
       [0.77120258, 0.22879742]])

In [9]:
mod1.predict_proba(X_test) # P(Y=0) and P(Y=1) for the test data

array([[0.20715007, 0.79284993],
       [0.70802784, 0.29197216],
       [0.8120662 , 0.1879338 ],
       [0.73820124, 0.26179876],
       [0.4333677 , 0.5666323 ],
       [0.62147045, 0.37852955],
       [0.50577729, 0.49422271],
       [0.71687726, 0.28312274],
       [0.76581306, 0.23418694],
       [0.74110784, 0.25889216],
       [0.50962805, 0.49037195],
       [0.85585848, 0.14414152],
       [0.8035856 , 0.1964144 ],
       [0.62855456, 0.37144544],
       [0.22173348, 0.77826652],
       [0.41934346, 0.58065654],
       [0.82070685, 0.17929315],
       [0.80612798, 0.19387202],
       [0.77257386, 0.22742614],
       [0.30009192, 0.69990808],
       [0.31773219, 0.68226781],
       [0.48726548, 0.51273452],
       [0.63209361, 0.36790639],
       [0.54086854, 0.45913146],
       [0.78335517, 0.21664483],
       [0.49884128, 0.50115872],
       [0.48542438, 0.51457562],
       [0.7308513 , 0.2691487 ],
       [0.49747874, 0.50252126],
       [0.50279458, 0.49720542],
       [0.

In [10]:
mod1.predict(X_train)[0:14] # Predictions for the first 14 observations

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=int64)

In [11]:
mod1.predict(X_test)[0:14] # Predictions for the first 14 observations

array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
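
For a binary problem, the labels returned by `predict` correspond to choosing the class with the larger estimated probability, that is, class 1 whenever P(Y=1) exceeds 0.5. A minimal sketch checking this on synthetic data (the data, the seed and the default `LogisticRegression()` settings are assumptions for illustration, not the model fitted above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data; only the first feature drives the response
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)  # columns follow clf.classes_, here [0, 1]

# Threshold P(Y=1) at 0.5 and compare with predict
labels_from_proba = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(labels_from_proba, clf.predict(X)))
```

Replacing 0.5 with another cutoff gives a classifier with a different balance between false positives and false negatives.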

In [12]:
mod1.score(X_train, y_train)

0.7969924812030075

In [13]:
mod1.score(X_test, y_test)

0.8063063063063063
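
For a classifier, `score` returns the mean accuracy, i.e. the fraction of correctly classified observations. A small worked example with hand-written labels shows how that number relates to the confusion matrix (the `y_true` and `y_pred` vectors are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)   # fraction of matching labels
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(acc)
print(cm)
# Accuracy is the diagonal of the confusion matrix divided by the total
print(np.trace(cm) / cm.sum())
```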

# Logistic regression using `statsmodels`

In [14]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np



## Using the `X_train` and `y_train` matrices

In [15]:
logit_model = sm.Logit(y_train, X_train)
mod2 = logit_model.fit()
print(mod2.summary2())

Optimization terminated successfully.
         Current function value: 0.491899
         Iterations 6
                        Results: Logit
Model:              Logit            No. Iterations:   6.0000  
Dependent Variable: Survived         Pseudo R-squared: 0.269   
Date:               2018-10-04 08:39 AIC:              664.2258
No. Observations:   665              BIC:              686.7247
Df Model:           4                Log-Likelihood:   -327.11 
Df Residuals:       660              LL-Null:          -447.55 
Converged:          1.0000           Scale:            1.0000  
-----------------------------------------------------------------
        Coef.    Std.Err.      z       P>|z|     [0.025    0.975]
-----------------------------------------------------------------
2       0.6977     0.2345     2.9756   0.0029    0.2382    1.1573
3      -0.3914     0.1756    -2.2284   0.0259   -0.7356   -0.0472
male   -2.1238     0.1997   -10.6373   0.0000   -2.5151   -1.7325
Age     0.0120 

## Using formulas

The documentation for the `fit` method of the model created by `smf.logit` can be found [here](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit.html).

In [31]:
mod3 = smf.logit(formula='Survived ~ Age + Fare + Sex + clase', data=dt).fit(method='newton')
print(mod3.params)

Optimization terminated successfully.
         Current function value: 0.451835
         Iterations 6
Intercept      3.593388
Sex[T.male]   -2.585109
clase[T.2]    -1.173931
clase[T.3]    -2.426139
Age           -0.034085
Fare           0.000414
dtype: float64
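
The coefficients above are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to read: a value below 1 means the odds of survival decrease with that variable, and a value above 1 means they increase. A minimal sketch using the values printed above (they are copied by hand here so the snippet runs on its own; with the fitted model available, `np.exp(mod3.params)` would be the equivalent call):

```python
import numpy as np
import pandas as pd

# Coefficients copied from the mod3.params output
params = pd.Series({
    "Intercept": 3.593388,
    "Sex[T.male]": -2.585109,
    "clase[T.2]": -1.173931,
    "clase[T.3]": -2.426139,
    "Age": -0.034085,
    "Fare": 0.000414,
})

# Odds ratio = exp(coefficient)
odds_ratios = np.exp(params)
print(odds_ratios.round(3))
```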


In [32]:
mod3.summary2()

Model:,Logit,No. Iterations:,6.0
Dependent Variable:,Survived,Pseudo R-squared:,0.322
Date:,2018-10-04 08:46,AIC:,813.5552
No. Observations:,887,BIC:,842.2823
Df Model:,5,Log-Likelihood:,-400.78
Df Residuals:,881,LL-Null:,-591.38
Converged:,1.0000,Scale:,1.0

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,3.5934,0.4261,8.4328,0.0000,2.7582,4.4286
Sex[T.male],-2.5851,0.1879,-13.7580,0.0000,-2.9534,-2.2168
clase[T.2],-1.1739,0.2912,-4.0319,0.0001,-1.7446,-0.6033
clase[T.3],-2.4261,0.2937,-8.2615,0.0000,-3.0017,-1.8506
Age,-0.0341,0.0072,-4.7184,0.0000,-0.0482,-0.0199
Fare,0.0004,0.0021,0.1966,0.8441,-0.0037,0.0045
