# Ejemplo de Regresión Logística

<img src="https://raw.githubusercontent.com/fhernanb/fhernanb.github.io/master/my_docs/logo_unal_color.png" alt="drawing" width="200"/>

# Objetivo
En este ejemplo se busca crear un clasificador para saber si una persona sobrevive o no en el naufragio del Titanic.

<img src="https://raw.githubusercontent.com/fhernanb/Python-para-estadistica/master/imagenes/titanic.png" alt="drawing" width="600">

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

Para cargar los datos del ejemplo se hace lo siguiente:

In [2]:
dt = pd.read_csv("titanic.csv", sep=",")
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


Vamos a crear las variables dummy para `Pclass`.

In [3]:
pclass = pd.get_dummies(dt['Pclass'], prefix='class')
dt = pd.concat([dt, pclass], axis=1)
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,class_1,class_2,class_3
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,False,False,True
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,True,False,False
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,False,False,True
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,True,False,False
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,False,False,True


Vamos a crear las variables dummy para `Sex`.

In [4]:
sex = pd.get_dummies(dt['Sex'])
dt = pd.concat([dt, sex], axis=1)
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,class_1,class_2,class_3,female,male
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,False,False,True,False,True
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,True,False,False,True,False
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,False,False,True,True,False
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,True,False,False,True,False
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,False,False,True,False,True


Vamos a convertir `Pclass` numérica en la variable `clase` que es __cualitativa__.

In [5]:
dt['clase'] = dt['Pclass']
dt['clase'] = dt['clase'].astype(str)
dt.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,class_1,class_2,class_3,female,male,clase
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,False,False,True,False,True,3
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,True,False,False,True,False,1
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,False,False,True,True,False,3
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,True,False,False,True,False,1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,False,False,True,False,True,3


## Creación de los datos de entrenamiento (train) y de validación (test)
Para particionar los datos originales se usa la función `train_test_split`, para mayores detalles se recomienda consultar los parámetros de la función en este [enlace](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [6]:
datos = dt[['class_2', 'class_3', 'male', 'Age', 'Fare']]
respuesta = dt['Survived']
X_train, X_test, y_train, y_test = train_test_split(datos, respuesta, test_size=0.25)

print('Matriz X')
print(X_train.head())

print('\nVector y')
print(y_train.head())

Matriz X
     class_2  class_3  male   Age      Fare
695    False    False  True  49.0  110.8833
829    False     True  True  23.0    7.8542
634     True    False  True  31.0   26.2500
663     True    False  True  25.0   13.0000
477    False     True  True   9.0   46.9000

Vector y
695    0
829    0
634    0
663    0
477    0
Name: Survived, dtype: int64


Vamos a usar otra forma para crear los objetos de entrenamiento y validación, una forma más sencilla pero útil cuando queremos tener control de las observaciones de cada conjunto.

In [7]:
import numpy as np

p = 0.75
n = dt.shape[0]
indices = np.random.randint(low=0, high=n-1, size=round(n*p))
dt_train = dt.iloc[indices]
dt_test  = dt.drop(indices)

X_train = dt_train[['class_2', 'class_3', 'male', 'Age', 'Fare']]
X_test  = dt_test[['class_2', 'class_3', 'male', 'Age', 'Fare']]

y_train = dt_train['Survived']
y_test  = dt_test['Survived']

# Regresión logística usando `sklearn`

In [8]:
mod1 = LogisticRegression(random_state=0, solver='lbfgs')
mod1.fit(X_train, y_train)

In [9]:
mod1.predict_proba(X_train) # Para ver las P(Y=0) y P(Y=1) en train

array([[0.42348808, 0.57651192],
       [0.92036557, 0.07963443],
       [0.90880494, 0.09119506],
       ...,
       [0.90374238, 0.09625762],
       [0.05797335, 0.94202665],
       [0.89003842, 0.10996158]])

In [10]:
mod1.predict_proba(X_test) # Para ver las P(Y=0) y P(Y=1) en test

array([[0.10976325, 0.89023675],
       [0.49212601, 0.50787399],
       [0.93059848, 0.06940152],
       [0.91353436, 0.08646564],
       [0.89568141, 0.10431859],
       [0.9358972 , 0.0641028 ],
       [0.5255885 , 0.4744115 ],
       [0.4627302 , 0.5372698 ],
       [0.75374301, 0.24625699],
       [0.35765333, 0.64234667],
       [0.57224237, 0.42775763],
       [0.35973475, 0.64026525],
       [0.12949918, 0.87050082],
       [0.43315372, 0.56684628],
       [0.8872051 , 0.1127949 ],
       [0.89011253, 0.10988747],
       [0.42949727, 0.57050273],
       [0.40300751, 0.59699249],
       [0.59443065, 0.40556935],
       [0.18664899, 0.81335101],
       [0.89845893, 0.10154107],
       [0.14502405, 0.85497595],
       [0.19469813, 0.80530187],
       [0.86130403, 0.13869597],
       [0.90121221, 0.09878779],
       [0.61236608, 0.38763392],
       [0.83831765, 0.16168235],
       [0.75082455, 0.24917545],
       [0.85237155, 0.14762845],
       [0.19825692, 0.80174308],
       [0.

In [11]:
mod1.predict(X_train)[0:14] # Para ver las predicciones de las primeras 15 obs

array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int64)

In [12]:
mod1.predict(X_test)[0:14] # Para ver las predicciones de las primeras 15 obs

array([1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=int64)

In [13]:
mod1.score(X_train, y_train)

0.8030075187969925

In [14]:
mod1.score(X_test, y_test)

0.808252427184466

# Regresión logística usando `statsmodels`

In [15]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

## Usando las matrices `X_train` y `y_train`
Los detalles de la función `sm.Logit` puede ser consultada [aquí](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html)

In [17]:
Xtrain = dt[['Age', 'Fare']]
ytrain = dt['Survived']


logit_model = sm.Logit(endog=ytrain, exog=Xtrain)
mod2 = logit_model.fit()
print(mod2.summary2())

Optimization terminated successfully.
         Current function value: 0.628371
         Iterations 6
                         Results: Logit
Model:              Logit            Method:           MLE       
Dependent Variable: Survived         Pseudo R-squared: 0.058     
Date:               2024-12-29 08:51 AIC:              1118.7302 
No. Observations:   887              BIC:              1128.3059 
Df Model:           1                Log-Likelihood:   -557.37   
Df Residuals:       885              LL-Null:          -591.38   
Converged:          1.0000           LLR p-value:      1.6023e-16
No. Iterations:     6.0000           Scale:            1.0000    
-------------------------------------------------------------------
           Coef.    Std.Err.      z      P>|z|     [0.025    0.975]
-------------------------------------------------------------------
Age       -0.0288     0.0030   -9.5706   0.0000   -0.0347   -0.0229
Fare       0.0148     0.0022    6.7117   0.0000    0.0105 

## Usando fórmulas

La documentación para `smf.logit` puede ser consultada [aquí](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit.html).

In [18]:
mod3 = smf.logit(formula='Survived ~ Age + Fare + Sex + clase', data=dt).fit(method='newton')

Optimization terminated successfully.
         Current function value: 0.451835
         Iterations 6


In [19]:
mod3.summary2()

0,1,2,3
Model:,Logit,Method:,MLE
Dependent Variable:,Survived,Pseudo R-squared:,0.322
Date:,2024-12-29 08:51,AIC:,813.5552
No. Observations:,887,BIC:,842.2823
Df Model:,5,Log-Likelihood:,-400.78
Df Residuals:,881,LL-Null:,-591.38
Converged:,1.0000,LLR p-value:,3.3133e-80
No. Iterations:,6.0000,Scale:,1.0000

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,3.5934,0.4261,8.4328,0.0000,2.7582,4.4286
Sex[T.male],-2.5851,0.1879,-13.7580,0.0000,-2.9534,-2.2168
clase[T.2],-1.1739,0.2912,-4.0319,0.0001,-1.7446,-0.6033
clase[T.3],-2.4261,0.2937,-8.2615,0.0000,-3.0017,-1.8506
Age,-0.0341,0.0072,-4.7184,0.0000,-0.0482,-0.0199
Fare,0.0004,0.0021,0.1966,0.8441,-0.0037,0.0045
