# Ejemplo de Regresión Logística
<img src="https://raw.githubusercontent.com/fhernanb/fhernanb.github.io/master/docs/logo_unal_color.png" alt="drawing" width="200"/>

# Objetivo
En este ejemplo se busca crear un clasificador para saber si una persona sobrevive o no en el naufragio del Titanic.

<img src="https://raw.githubusercontent.com/fhernanb/Python-para-estadistica/master/imagenes/titanic.png" alt="drawing" width="900">

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

Para cargar los datos del ejemplo se hace lo siguiente:

In [2]:
dt = pd.read_csv("titanic.csv", sep=",")
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


Vamos a crear la dummy para `Pclass`.

In [3]:
pclass = pd.get_dummies(dt['Pclass'])
dt = pd.concat([dt, pclass], axis=1)
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,1,2,3
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,0,0,1
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,1,0,0
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,0,0,1
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,1,0,0
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,0,0,1


Vamos a crear la dummy para `Sex`.

In [4]:
sex = pd.get_dummies(dt['Sex'])
dt = pd.concat([dt, sex], axis=1)
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,1,2,3,female,male
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,0,0,1,0,1
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,1,0,0,1,0
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,0,0,1,1,0
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,1,0,0,1,0
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,0,0,1,0,1


Vamos a convertir `Pclass` numérica en la variable `class` que es __cualitativa__.

In [5]:
dt['clase'] = dt['Pclass']
dt['clase'] = dt['clase'].astype(str)

## Creación de los datos de entrenamiento (train) y de validación (test)
Para particionar los datos originales se usa la función `train_test_split`, para mayores detalles se recomienda consultar los parámetros de la función en este [enlace](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [6]:
datos = dt[[2, 3, 'male', 'Age', 'Fare']]
respuesta = dt['Survived']
X_train, X_test, y_train, y_test = train_test_split(datos, respuesta, test_size=0.25)

print(X_train.head())
print()
print(y_train.head())

     2  3  male   Age     Fare
536  0  0     0  22.0  49.5000
539  0  1     0  11.0  31.2750
635  0  1     0  41.0  39.6875
220  1  0     1  27.0  13.0000
723  0  1     0  28.0   7.7375

536    1
539    0
635    0
220    0
723    1
Name: Survived, dtype: int64


Vamos a usar otra forma para crear los objetos de entrenamiento y validación, una forma más sencilla pero útil cuando queremos tener control de las observaciones de cada conjunto.

In [7]:
import numpy as np

p = 0.75
n = dt.shape[0]
indices = np.random.randint(low=0, high=n-1, size=round(n*p))
dt_train = dt.iloc[indices]
dt_test  = dt.drop(indices)

X_train = dt_train[[2, 3, 'male', 'Age', 'Fare']]
X_test  = dt_test[[2, 3, 'male', 'Age', 'Fare']]

y_train = dt_train['Survived']
y_test  = dt_test['Survived']

# Regresión logística usando `sklearn`

In [8]:
mod1 = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)

In [9]:
mod1.predict_proba(X_train) # Para ver las P(Y=0) y P(Y=1) en train

array([[0.53967217, 0.46032783],
       [0.33654968, 0.66345032],
       [0.75002451, 0.24997549],
       ...,
       [0.62042542, 0.37957458],
       [0.37047003, 0.62952997],
       [0.46832841, 0.53167159]])

In [10]:
mod1.predict_proba(X_test) # Para ver las P(Y=0) y P(Y=1) en test

array([[0.74324457, 0.25675543],
       [0.44961721, 0.55038279],
       [0.78497079, 0.21502921],
       [0.58553696, 0.41446304],
       [0.66793266, 0.33206734],
       [0.45366518, 0.54633482],
       [0.28757457, 0.71242543],
       [0.73628989, 0.26371011],
       [0.43908741, 0.56091259],
       [0.66708154, 0.33291846],
       [0.47062122, 0.52937878],
       [0.43207821, 0.56792179],
       [0.67462841, 0.32537159],
       [0.50027117, 0.49972883],
       [0.75665113, 0.24334887],
       [0.40598171, 0.59401829],
       [0.74658231, 0.25341769],
       [0.52667973, 0.47332027],
       [0.41456624, 0.58543376],
       [0.78425714, 0.21574286],
       [0.72936791, 0.27063209],
       [0.73974561, 0.26025439],
       [0.41338164, 0.58661836],
       [0.51181531, 0.48818469],
       [0.32132261, 0.67867739],
       [0.7356002 , 0.2643998 ],
       [0.73976854, 0.26023146],
       [0.25124991, 0.74875009],
       [0.32864153, 0.67135847],
       [0.29944589, 0.70055411],
       [0.

In [11]:
mod1.predict(X_train)[0:14] # Para ver las predicciones de las primeras 15 obs

array([0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1], dtype=int64)

In [12]:
mod1.predict(X_test)[0:14] # Para ver las predicciones de las primeras 15 obs

array([0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0], dtype=int64)

In [13]:
mod1.score(X_train, y_train)

0.8135338345864662

In [14]:
mod1.score(X_test, y_test)

0.7897196261682243

# Regresión logística usando `statsmodels`

In [15]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

  from pandas.core import datetools


## Usando las matrices `X_train` y `y_train`
Los detalles de la función `sm.Logit` puede ser consultada [aquí](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html)

In [16]:
logit_model = sm.Logit(endog=y_train, exog=X_train)
mod2 = logit_model.fit()
print(mod2.summary2())

Optimization terminated successfully.
         Current function value: 0.486373
         Iterations 6
                        Results: Logit
Model:              Logit            No. Iterations:   6.0000  
Dependent Variable: Survived         Pseudo R-squared: 0.283   
Date:               2018-10-04 14:06 AIC:              656.8758
No. Observations:   665              BIC:              679.3748
Df Model:           4                Log-Likelihood:   -323.44 
Df Residuals:       660              LL-Null:          -450.95 
Converged:          1.0000           Scale:            1.0000  
-----------------------------------------------------------------
        Coef.    Std.Err.      z       P>|z|     [0.025    0.975]
-----------------------------------------------------------------
2       0.5012     0.2303     2.1761   0.0295    0.0498    0.9527
3      -0.1106     0.1757    -0.6298   0.5288   -0.4549    0.2337
male   -2.3498     0.2052   -11.4489   0.0000   -2.7520   -1.9475
Age     0.0086 

## Usando fórmulas

La documentación para `smf.logit` puede ser consultada [aquí](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit.html).

In [17]:
mod3 = smf.logit(formula='Survived ~ Age + Fare + Sex + clase', data=dt).fit(method='newton')

Optimization terminated successfully.
         Current function value: 0.451835
         Iterations 6


In [18]:
mod3.summary2()

0,1,2,3
Model:,Logit,No. Iterations:,6.0
Dependent Variable:,Survived,Pseudo R-squared:,0.322
Date:,2018-10-04 14:06,AIC:,813.5552
No. Observations:,887,BIC:,842.2823
Df Model:,5,Log-Likelihood:,-400.78
Df Residuals:,881,LL-Null:,-591.38
Converged:,1.0000,Scale:,1.0

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,3.5934,0.4261,8.4328,0.0000,2.7582,4.4286
Sex[T.male],-2.5851,0.1879,-13.7580,0.0000,-2.9534,-2.2168
clase[T.2],-1.1739,0.2912,-4.0319,0.0001,-1.7446,-0.6033
clase[T.3],-2.4261,0.2937,-8.2615,0.0000,-3.0017,-1.8506
Age,-0.0341,0.0072,-4.7184,0.0000,-0.0482,-0.0199
Fare,0.0004,0.0021,0.1966,0.8441,-0.0037,0.0045
