# Challenge - Classification from an Econometric Perspective
__Description__
In this session we will work with the South African Heart dataset, which contains the following variables:
- sbp : Systolic blood pressure.
- tobacco : Average tobacco consumed per day.
- ldl : Low-density lipoprotein.
- adiposity : Adiposity.
- famhist : Family history of heart disease. (Binary)
- typea : Type A personality.
- obesity : Obesity.
- alcohol : Current alcohol consumption.
- age : Age.
- chd : Coronary heart disease. (Dummy)

## Challenge 1: Prepare the working environment


In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve

In [2]:
df = pd.read_csv('southafricanheart.csv')

In [3]:
df.head()

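|   | Unnamed: 0 | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | chd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 160 | 12.0 | 5.73 | 23.11 | Present | 49 | 25.3 | 97.2 | 52 | 1 |
| 1 | 2 | 144 | 0.01 | 4.41 | 28.61 | Absent | 55 | 28.87 | 2.06 | 63 | 1 |
| 2 | 3 | 118 | 0.08 | 3.48 | 32.28 | Present | 52 | 29.14 | 3.81 | 46 | 0 |
| 3 | 4 | 170 | 7.5 | 6.41 | 38.03 | Present | 51 | 31.99 | 24.26 | 58 | 1 |
| 4 | 5 | 134 | 13.6 | 3.5 | 27.78 | Present | 60 | 25.99 | 57.34 | 49 | 1 |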


In [4]:
df.drop(columns = ['Unnamed: 0'], inplace = True)

In [5]:
df.columns

Index(['sbp', 'tobacco', 'ldl', 'adiposity', 'famhist', 'typea', 'obesity',
       'alcohol', 'age', 'chd'],
      dtype='object')

In [6]:
df.describe(include='all')

|   | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | chd |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 462.0 | 462.0 | 462.0 | 462.0 | 462 | 462.0 | 462.0 | 462.0 | 462.0 | 462.0 |
| unique |  |  |  |  | 2 |  |  |  |  |  |
| top |  |  |  |  | Absent |  |  |  |  |  |
| freq |  |  |  |  | 270 |  |  |  |  |  |
| mean | 138.32684 | 3.635649 | 4.740325 | 25.406732 |  | 53.103896 | 26.044113 | 17.044394 | 42.816017 | 0.34632 |
| std | 20.496317 | 4.593024 | 2.070909 | 7.780699 |  | 9.817534 | 4.21368 | 24.481059 | 14.608956 | 0.476313 |
| min | 101.0 | 0.0 | 0.98 | 6.74 |  | 13.0 | 14.7 | 0.0 | 15.0 | 0.0 |
| 25% | 124.0 | 0.0525 | 3.2825 | 19.775 |  | 47.0 | 22.985 | 0.51 | 31.0 | 0.0 |
| 50% | 134.0 | 2.0 | 4.34 | 26.115 |  | 53.0 | 25.805 | 7.51 | 45.0 | 0.0 |
| 75% | 148.0 | 5.5 | 5.79 | 31.2275 |  | 60.0 | 28.4975 | 23.8925 | 55.0 | 1.0 |


## Challenge 2
1. Recode famhist as a dummy, assigning 1 to the minority category.
2. Use smf.logit to estimate the model.
3. Implement an inverse_logit function that maps log-odds to probabilities.
4. With the estimated model, answer the following:
    - What is the probability that an individual with a family history has coronary heart disease?
    - What is the probability that an individual without a family history has coronary heart disease?
    - What is the difference in probability between an individual with a family history and one without?
    - Replicate the model with smf.ols and comment on the similarities between the estimated coefficients.
Tip: Use β/4
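
The β/4 tip follows from the slope of the logistic curve. As a short derivation, writing $p(x)$ for the fitted probability (notation introduced here for illustration, it does not appear in the notebook):

$$
p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}, \qquad
\frac{dp}{dx} = \beta_1 \, p(x)\,\bigl(1 - p(x)\bigr), \qquad
\left|\frac{dp}{dx}\right| \le \frac{|\beta_1|}{4}
$$

The bound holds because $p(1-p)$ peaks at $p = 1/2$, where it equals $1/4$. So $\beta_1/4$ is the largest change in probability that a one-unit change in $x$ can produce, which is what makes it comparable to an OLS slope.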

In [7]:
df['famhist'].value_counts()

Absent     270
Present    192
Name: famhist, dtype: int64

In [8]:
df['famhist_recod'] = np.where(df['famhist']=='Present', 1, 0)

In [9]:
df.head()

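|   | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | chd | famhist_recod |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 160 | 12.0 | 5.73 | 23.11 | Present | 49 | 25.3 | 97.2 | 52 | 1 | 1 |
| 1 | 144 | 0.01 | 4.41 | 28.61 | Absent | 55 | 28.87 | 2.06 | 63 | 1 | 0 |
| 2 | 118 | 0.08 | 3.48 | 32.28 | Present | 52 | 29.14 | 3.81 | 46 | 0 | 1 |
| 3 | 170 | 7.5 | 6.41 | 38.03 | Present | 51 | 31.99 | 24.26 | 58 | 1 | 1 |
| 4 | 134 | 13.6 | 3.5 | 27.78 | Present | 60 | 25.99 | 57.34 | 49 | 1 | 1 |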


In [10]:
modelo_logit = smf.logit('chd ~ famhist_recod', df).fit()

modelo_logit.summary()

Optimization terminated successfully.
         Current function value: 0.608111
         Iterations 5


|   |   |   |   |
|:--|:--|:--|:--|
| Dep. Variable: | chd | No. Observations: | 462 |
| Model: | Logit | Df Residuals: | 460 |
| Method: | MLE | Df Model: | 1 |
| Date: | Fri, 20 Nov 2020 | Pseudo R-squ.: | 0.0574 |
| Time: | 22:07:23 | Log-Likelihood: | -280.95 |
| converged: | True | LL-Null: | -298.05 |
| Covariance Type: | nonrobust | LLR p-value: | 4.937e-09 |

|   | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|:--|---|---|---|---|---|---|
| Intercept | -1.1690 | 0.143 | -8.169 | 0.000 | -1.449 | -0.889 |
| famhist_recod | 1.1690 | 0.203 | 5.751 | 0.000 | 0.771 | 1.567 |


In [11]:
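# Sample mean of the recoded dummy, i.e. the share of individuals with a family history
media_famhist = df['famhist_recod'].dropna().mean()

# Log-odds predicted by the model at that mean value
estimate_y = modelo_logit.params['Intercept'] + (modelo_logit.params['famhist_recod'] * media_famhist)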

In [12]:
def inverse_logit(estimado):
    return 1 / (1+np.exp(-estimado))

In [13]:
inverse_logit(estimate_y)

0.3355524250930946
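
Because the inverse logit is nonlinear, the probability evaluated at the mean of the regressor is not the same as the average of the individual probabilities. The following is a minimal sketch of that gap, assuming the rounded coefficients printed in the summary above and the 192/462 share from `value_counts()`; it only uses the standard library, so `inverse_logit` is redefined here with `math.exp`.

```python
import math

def inverse_logit(log_odds):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Rounded coefficients as printed in the logit summary (assumed inputs).
b0, b1 = -1.1690, 1.1690
# Share of individuals with famhist == 'Present' (192 of 462).
share_present = 192 / 462

# Probability evaluated at the mean of the regressor.
p_at_mean = inverse_logit(b0 + b1 * share_present)

# Average of the two group probabilities, weighted by group size.
p_present = inverse_logit(b0 + b1)
p_absent = inverse_logit(b0)
mean_of_p = share_present * p_present + (1 - share_present) * p_absent

print(round(p_at_mean, 4), round(mean_of_p, 4))
```

The first value reproduces the cell output above up to rounding of the coefficients, while the second lands on the sample mean of `chd` reported by `describe()` (about 0.346).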

##### What is the probability that an individual with a family history has coronary heart disease?

In [14]:
y_antecedentes = modelo_logit.params['Intercept']+(modelo_logit.params['famhist_recod'] * 1)
p_antecedentes = round(inverse_logit(y_antecedentes),2)
p_antecedentes

0.5

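__A:__ An individual with a family history has an estimated probability of about 0.50 of having coronary heart disease.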

##### What is the probability that an individual without a family history has coronary heart disease?

In [15]:
y_sin_antecedentes = modelo_logit.params['Intercept']+(modelo_logit.params['famhist_recod'] * 0)
p_sin_antecedentes = round(inverse_logit(y_sin_antecedentes),2)
p_sin_antecedentes

0.24

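__A:__ An individual without a family history has an estimated probability of about 0.24 of having coronary heart disease.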

##### What is the difference in probability between an individual with a family history and one without?

In [16]:
diff = p_antecedentes - p_sin_antecedentes
diff

0.26

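__A:__ The estimated probability is about 0.26 (26 percentage points) higher for an individual with a family history.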

Replicate the model with smf.ols and comment on the similarities between the estimated coefficients. Tip: Use β/4

In [17]:
modelo_ols = smf.ols('chd ~ famhist_recod', df).fit()

In [18]:
modelo_ols.summary()

|   |   |   |   |
|:--|:--|:--|:--|
| Dep. Variable: | chd | R-squared: | 0.074 |
| Model: | OLS | Adj. R-squared: | 0.072 |
| Method: | Least Squares | F-statistic: | 36.86 |
| Date: | Fri, 20 Nov 2020 | Prob (F-statistic): | 2.66e-09 |
| Time: | 22:07:33 | Log-Likelihood: | -294.59 |
| No. Observations: | 462 | AIC: | 593.2 |
| Df Residuals: | 460 | BIC: | 601.4 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|---|---|---|---|---|---|
| Intercept | 0.2370 | 0.028 | 8.489 | 0.000 | 0.182 | 0.292 |
| famhist_recod | 0.2630 | 0.043 | 6.071 | 0.000 | 0.178 | 0.348 |

|   |   |   |   |
|:--|:--|:--|:--|
| Omnibus: | 768.898 | Durbin-Watson: | 1.961 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 58.778 |
| Skew: | 0.579 | Prob(JB): | 1.72e-13 |
| Kurtosis: | 1.692 | Cond. No. | 2.47 |


In [19]:
print("\nOLS\n",'B0:', round(modelo_ols.params['Intercept'],3), '\n','B1:', round(modelo_ols.params['famhist_recod'],3) )
print("\nLogit\n",'B0:', round(modelo_logit.params['Intercept'],3), '\n','B1:', round(modelo_logit.params['famhist_recod'],3) )


OLS
 B0: 0.237 
 B1: 0.263

Logit
 B0: -1.169 
 B1: 1.169
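
Although the two sets of coefficients look very different, they imply the same fitted probabilities. With a single binary regressor, each model has one parameter per group, so both reproduce the observed proportion of `chd` in each group. A minimal check, assuming the rounded values printed above:

```python
import math

def inverse_logit(log_odds):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Rounded coefficients as printed above (assumed inputs).
ols_b0, ols_b1 = 0.237, 0.263
logit_b0, logit_b1 = -1.169, 1.169

# Fitted probabilities for famhist_recod = 0 (absent) and 1 (present).
ols_absent, ols_present = ols_b0, ols_b0 + ols_b1
logit_absent = inverse_logit(logit_b0)
logit_present = inverse_logit(logit_b0 + logit_b1)

print("absent :", round(ols_absent, 3), round(logit_absent, 3))
print("present:", round(ols_present, 3), round(logit_present, 3))
```

Read this way, the OLS slope is the difference in probabilities between the two groups, while the logit slope is the same difference expressed on the log-odds scale.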


We divide the logistic regression coefficient B1 by 4:

In [20]:
modelo_logit.params['famhist_recod']/4

0.2922482713574772

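Since B1/4 ≈ 0.29, the divide-by-4 rule gives a reasonable approximation of the slope estimated by the linear probability model (LPM), 0.263. The rule is an upper bound on the change in probability per unit change in the regressor, so it slightly overstates the OLS slope here.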
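
To see where the gap between 0.29 and 0.263 comes from, the slope of the logistic curve can be evaluated at a specific point instead of at its maximum. The sketch below assumes the rounded logit coefficients and the 192/462 share of `famhist_recod == 1`, and compares B1/4 with the slope B1·p·(1−p) at the sample mean of the regressor.

```python
import math

def inverse_logit(log_odds):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

b0, b1 = -1.1690, 1.1690      # rounded logit coefficients (assumed inputs)
share_present = 192 / 462     # sample mean of famhist_recod

# Divide-by-4 rule: slope of the logistic curve at p = 0.5, its maximum.
upper_bound = b1 / 4

# Slope of the logistic curve at the mean of the regressor.
p_mean = inverse_logit(b0 + b1 * share_present)
slope_at_mean = b1 * p_mean * (1 - p_mean)

print(round(upper_bound, 3), round(slope_at_mean, 3))
```

The slope at the mean comes out below B1/4 and closer to the OLS coefficient, consistent with reading B1/4 as a ceiling and not as an exact marginal effect.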

## Challenge 3: Full estimation

- Refine the model, keeping only the variables that are statistically significant at the 95% confidence level.
- Compare the goodness-of-fit statistics of the two models.
- Briefly report the effect of the variables on the log-odds of having coronary heart disease.
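
When reporting effects on the log-odds, it can help to exponentiate a coefficient and read it as an odds ratio. As a small illustration, assuming the rounded coefficients of the single-regressor logit estimated earlier, the odds ratio for `famhist_recod` can be cross-checked against the odds implied by the two fitted probabilities:

```python
import math

b0, b1 = -1.1690, 1.1690   # rounded single-regressor logit coefficients (assumed inputs)

# A coefficient on the log-odds scale becomes an odds ratio after exponentiation.
odds_ratio = math.exp(b1)

# Cross-check: ratio of the odds implied by each group's fitted probability.
p_absent = 1 / (1 + math.exp(-b0))
p_present = 1 / (1 + math.exp(-(b0 + b1)))
odds_absent = p_absent / (1 - p_absent)
odds_present = p_present / (1 - p_present)

print(round(odds_ratio, 2), round(odds_present / odds_absent, 2))
```

In a model with several regressors the same reading applies to each coefficient, holding the other variables fixed.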

In [21]:
modelo_logit_complete = smf.logit('chd ~ sbp + tobacco + ldl + adiposity + famhist_recod + typea + obesity + alcohol + age', df).fit()
modelo_logit_complete.summary()

Optimization terminated successfully.
         Current function value: 0.510974
         Iterations 6


|   |   |   |   |
|:--|:--|:--|:--|
| Dep. Variable: | chd | No. Observations: | 462 |
| Model: | Logit | Df Residuals: | 452 |
| Method: | MLE | Df Model: | 9 |
| Date: | Fri, 20 Nov 2020 | Pseudo R-squ.: | 0.208 |
| Time: | 22:07:35 | Log-Likelihood: | -236.07 |
| converged: | True | LL-Null: | -298.05 |
| Covariance Type: | nonrobust | LLR p-value: | 2.055e-22 |

|   | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|:--|---|---|---|---|---|---|
| Intercept | -6.1507 | 1.308 | -4.701 | 0.000 | -8.715 | -3.587 |
| sbp | 0.0065 | 0.006 | 1.135 | 0.256 | -0.005 | 0.018 |
| tobacco | 0.0794 | 0.027 | 2.984 | 0.003 | 0.027 | 0.132 |
| ldl | 0.1739 | 0.060 | 2.915 | 0.004 | 0.057 | 0.291 |
| adiposity | 0.0186 | 0.029 | 0.635 | 0.526 | -0.039 | 0.076 |
| famhist_recod | 0.9254 | 0.228 | 4.061 | 0.000 | 0.479 | 1.372 |
| typea | 0.0396 | 0.012 | 3.214 | 0.001 | 0.015 | 0.064 |
| obesity | -0.0629 | 0.044 | -1.422 | 0.155 | -0.150 | 0.024 |
| alcohol | 0.0001 | 0.004 | 0.027 | 0.978 | -0.009 | 0.009 |


Refining the model to keep only the variables that are statistically significant at the 95% confidence level (p < 0.05) drops the following variables:
sbp, adiposity, obesity, alcohol
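
The selection rule can be written down explicitly. The sketch below hard-codes the p-values as displayed in the summary above (rounded to three decimals) and splits the variables at a 0.05 threshold. The displayed table has no row for `age`, so it is left out here. With a fitted statsmodels result the same values should be available through its `pvalues` attribute.

```python
# p-values as displayed in the full-model summary (rounded to 3 decimals).
# `age` has no row in the displayed table, so it is not included.
p_values = {
    "sbp": 0.256, "tobacco": 0.003, "ldl": 0.004, "adiposity": 0.526,
    "famhist_recod": 0.000, "typea": 0.001, "obesity": 0.155, "alcohol": 0.978,
}
alpha = 0.05  # significance threshold matching the 95% confidence level

kept = [name for name, p in p_values.items() if p < alpha]
dropped = [name for name, p in p_values.items() if p >= alpha]

print("kept:   ", kept)
print("dropped:", dropped)
```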

In [22]:
modelo_logit_complete_2 = smf.logit('chd ~ tobacco + ldl + famhist_recod + typea + age', df).fit()
modelo_logit_complete_2.summary()

Optimization terminated successfully.
         Current function value: 0.514811
         Iterations 6


|   |   |   |   |
|:--|:--|:--|:--|
| Dep. Variable: | chd | No. Observations: | 462 |
| Model: | Logit | Df Residuals: | 456 |
| Method: | MLE | Df Model: | 5 |
| Date: | Fri, 20 Nov 2020 | Pseudo R-squ.: | 0.202 |
| Time: | 22:07:36 | Log-Likelihood: | -237.84 |
| converged: | True | LL-Null: | -298.05 |
| Covariance Type: | nonrobust | LLR p-value: | 2.554e-24 |

|   | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|:--|---|---|---|---|---|---|
| Intercept | -6.4464 | 0.921 | -7.000 | 0.000 | -8.251 | -4.642 |
| tobacco | 0.0804 | 0.026 | 3.106 | 0.002 | 0.030 | 0.131 |
| ldl | 0.1620 | 0.055 | 2.947 | 0.003 | 0.054 | 0.270 |
| famhist_recod | 0.9082 | 0.226 | 4.023 | 0.000 | 0.466 | 1.351 |
| typea | 0.0371 | 0.012 | 3.051 | 0.002 | 0.013 | 0.061 |
| age | 0.0505 | 0.010 | 4.944 | 0.000 | 0.030 | 0.070 |
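
One possible way to compare the goodness of fit of the two models is a likelihood-ratio test together with the AIC, both computed from the log-likelihoods reported in the two summaries (-236.07 for the full model, -237.84 for the reduced one). This is a sketch and not the notebook's own comparison. The helper `chi2_sf_even` is a hypothetical name for the closed-form chi-square survival function, which is valid only for even degrees of freedom and is used here to avoid a SciPy dependency.

```python
import math

# Log-likelihoods from the two summaries; k counts coefficients incl. intercept.
ll_full, k_full = -236.07, 10
ll_reduced, k_reduced = -237.84, 6

# Likelihood-ratio statistic for dropping the four extra regressors.
lr_stat = 2 * (ll_full - ll_reduced)
df_diff = k_full - k_reduced

def chi2_sf_even(x, df):
    """Chi-square survival function; closed form valid for even df only."""
    half = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j) for j in range(half))

p_value = chi2_sf_even(lr_stat, df_diff)

# Akaike information criterion: lower is better.
aic_full = 2 * k_full - 2 * ll_full
aic_reduced = 2 * k_reduced - 2 * ll_reduced

print("LR stat:", round(lr_stat, 2), "p-value:", round(p_value, 3))
print("AIC full:", round(aic_full, 2), "AIC reduced:", round(aic_reduced, 2))
```

Under these inputs the test does not reject the reduced model and its AIC is lower, while the pseudo R-squared barely moves (0.208 versus 0.202), so dropping the four variables costs little in fit.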


## Challenge 4: Profile estimation
Using the refined model, generate the estimates in log-odds and then transform them to probabilities with inverse_logit. The profiles to estimate are the following:
- The probability of having coronary heart disease for an individual with characteristics similar to the sample.
- The probability of having coronary heart disease for an individual with high levels of low-density lipoprotein, holding all other characteristics constant.
- The probability of having coronary heart disease for an individual with low levels of low-density lipoprotein, holding all other characteristics constant.
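
A minimal sketch of how these profiles could be computed, assuming the rounded coefficients of the refined model and the sample means and standard deviation reported by `describe()`. The task does not define "high" and "low" ldl, so this example assumes the mean plus or minus one standard deviation. The helper `profile_probability` is a hypothetical name introduced for illustration.

```python
import math

def inverse_logit(log_odds):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Rounded coefficients from the refined-model summary (assumed inputs).
coefs = {"Intercept": -6.4464, "tobacco": 0.0804, "ldl": 0.1620,
         "famhist_recod": 0.9082, "typea": 0.0371, "age": 0.0505}
# Sample means from describe(); the famhist_recod mean is 192/462.
means = {"tobacco": 3.635649, "ldl": 4.740325, "famhist_recod": 192 / 462,
         "typea": 53.103896, "age": 42.816017}
ldl_std = 2.070909

def profile_probability(profile):
    """Probability of chd for a dict of regressor values (hypothetical helper)."""
    log_odds = coefs["Intercept"] + sum(coefs[k] * v for k, v in profile.items())
    return inverse_logit(log_odds)

# ASSUMPTION: "high" / "low" ldl means the sample mean +/- one standard deviation.
p_mean = profile_probability(means)
p_high = profile_probability({**means, "ldl": means["ldl"] + ldl_std})
p_low = profile_probability({**means, "ldl": means["ldl"] - ldl_std})

print(round(p_mean, 3), round(p_high, 3), round(p_low, 3))
```

Other definitions of high and low, such as the quartiles or the observed minimum and maximum, would only change the `ldl` value passed to the helper.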