### Milestone 3 for Determinants of Income
*"Descriptive modeling, which seeks to identify the main determinants of the object of study. Based on this section, the predictive model can be built or refined."*

#### Loading libraries and importing the .csv obtained in the previous milestone

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf

#from graficos import *
import funciones as fun

import warnings
warnings.filterwarnings('ignore')

#plt.style.use('seaborn') # Seaborn-style plots
#plt.rcParams["figure.figsize"] = (8,6) # Figure size (5, 3)
#plt.rcParams["figure.dpi"] = 75 # Figure resolution 100

In [2]:
df = pd.read_csv('income_ready.csv')
print(df.shape)
df.head()

(48842, 15)


Unnamed: 0,age,workclass_recod,fnlwgt,educ_recod,educational-num,civstatus,collars,relationship,race,gender,capital-gain,capital-loss,hours-per-week,region,income
0,25,private,226802,high-school,7,never-married,blue-collar,Own-child,Black,Male,0,0,40,America,0
1,38,private,89814,high-school,9,married,blue-collar,Husband,White,Male,0,0,50,America,0
2,28,state-level-gov,336951,college,12,married,blue-collar,Husband,White,Male,0,0,40,America,1
3,44,private,160323,college,10,married,blue-collar,Husband,Black,Male,7688,0,40,America,1
4,18,,103497,college,10,never-married,,Own-child,White,Female,0,0,30,America,0


### Descriptive modeling

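- Categorical variables are transformed with dummy (one-hot) encoding via <code>pd.get_dummies</code>; with <code>drop_first=True</code> the first level of each variable is dropped and acts as the reference category.
- Rows with missing values are dropped and the dataframe index is reset.
- Hyphens in column names are replaced with underscores so the names can be used in a model formula.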


In [3]:
df_dummies = pd.get_dummies(df.dropna(), drop_first=True).reset_index(drop=True)
df_dummies.columns = [col.replace('-', '_') for col in df_dummies.columns]
print('Dataframe size with dummies:', df_dummies.shape)


Dataframe size with dummies: (46033, 34)
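The encoding step can be checked on a small example. The sketch below uses a made-up dataframe (not rows from income_ready.csv) and assumes a recent pandas version; it only illustrates how <code>dropna</code> and <code>drop_first=True</code> interact.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from income_ready.csv)
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "gender": ["Male", "Female", "Male", None],
    "race": ["Black", "White", "White", "Other"],
})

# Drop incomplete rows, then encode. The first level of each categorical
# variable (in sorted order) is dropped and becomes the reference category.
toy_dummies = pd.get_dummies(toy.dropna(), drop_first=True).reset_index(drop=True)

print(toy_dummies.columns.tolist())
print(toy_dummies.shape)
```

Here the row with a missing gender is removed first, so "Other" never appears as a race level, and "Female" and "Black" become the reference categories.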


A binary nonlinear econometric model (Logit) is estimated using df_dummies:

In [4]:
base = fun.mf(df_dummies, var_obj='income')             # Build the model formula with the helper function
model_logit = smf.logit(base, data=df_dummies).fit()    # Fit the model
model_logit.summary()

Optimization terminated successfully.
         Current function value: 0.332089
         Iterations 11


Dep. Variable:,income,No. Observations:,46033.0
Model:,Logit,Df Residuals:,45999.0
Method:,MLE,Df Model:,33.0
Date:,"Sat, 09 Jul 2022",Pseudo R-squ.:,0.4073
Time:,11:20:57,Log-Likelihood:,-15287.0
converged:,True,LL-Null:,-25791.0
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-7.6740,0.309,-24.805,0.000,-8.280,-7.068
age,0.0254,0.001,18.931,0.000,0.023,0.028
fnlwgt,7.33e-07,1.39e-07,5.280,0.000,4.61e-07,1.01e-06
educational_num,0.2715,0.016,17.171,0.000,0.241,0.302
capital_gain,0.0003,8.56e-06,37.179,0.000,0.000,0.000
capital_loss,0.0007,3.05e-05,21.720,0.000,0.001,0.001
hours_per_week,0.0296,0.001,22.654,0.000,0.027,0.032
workclass_recod_private,-0.4532,0.074,-6.102,0.000,-0.599,-0.308
workclass_recod_self_employed,-0.7927,0.081,-9.753,0.000,-0.952,-0.633
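The source of <code>fun.mf</code> is not shown in this notebook. Since its result is passed as the first argument of <code>smf.logit</code>, it presumably returns a formula string. A minimal sketch of such a helper, under that assumption (the name <code>build_formula</code> and its signature are hypothetical, not the actual implementation in <code>funciones</code>):

```python
def build_formula(columns, target):
    """Return a formula string 'target ~ x1 + x2 + ...' (hypothetical helper)."""
    # Every column except the target is used as a predictor
    predictors = [col for col in columns if col != target]
    if not predictors:
        raise ValueError("at least one predictor column is required")
    return f"{target} ~ " + " + ".join(predictors)


cols = ["age", "fnlwgt", "educational_num", "income"]
formula = build_formula(cols, target="income")
print(formula)  # income ~ age + fnlwgt + educational_num
```

A string built this way only works when the column names are valid in a formula, which is why the hyphens were replaced with underscores earlier.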


#### First analysis of results:

- The <code>Pseudo R-squ. is 0.4073</code>. This is McFadden's pseudo R²: the fitted model improves the log-likelihood by about 41% over the intercept-only model. Unlike the R² of a linear regression, it is not the share of the variance of the target variable (income) that is explained.

- The <code>likelihood-ratio test indicates that the model is significant</code> (LLR p-value: 0.0), rejecting the null hypothesis that all slope coefficients are zero. This is consistent with the maximum-likelihood results: the log-likelihood rises from -25791 (LL-Null) to -15287.

- Judging by the 95% confidence intervals ([0.025, 0.975]), the following variables are not significant, since their intervals contain zero:<code> workclass_recod_unemployed, educ_recod (all its classes), civstatus_separated, civstatus_widowed, race_Black, race_Other, region_Asia, and region_Europe</code>.
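The first two figures can be reproduced by hand from the log-likelihoods in the summary. The sketch below uses the rounded values as printed, so the results are approximate.

```python
# Values copied from the summary above (rounded as printed by statsmodels)
ll_model = -15287.0   # Log-Likelihood of the fitted model
ll_null = -25791.0    # LL-Null (intercept-only model)

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic: 2 * (LL_model - LL_null).
# Under the null hypothesis it follows a chi-squared distribution with
# Df Model = 33 degrees of freedom.
lr_stat = 2 * (ll_model - ll_null)

print(round(pseudo_r2, 4))  # 0.4073
print(lr_stat)              # 21008.0
```

A statistic of this size is far beyond any conventional chi-squared critical value for 33 degrees of freedom, which is why the reported LLR p-value is 0.0.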

We remove the non-significant variables from the model:

In [5]:
clear_var_ns = ['workclass_recod_unemployed', 'educ_recod_elementary_school', 'educ_recod_high_school', 'educ_recod_preschool', 'educ_recod_university', 'civstatus_separated', 'civstatus_widowed', 'race_Black', 'race_Other', 'region_Asia', 'region_Europe']
df_mod = df_dummies.drop(clear_var_ns, axis=1)
print('Dataframe size after dropping non-significant variables:', df_mod.shape)

Dataframe size after dropping non-significant variables: (46033, 23)
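The list of variables to drop was written by hand from the summary. The same rule (an interval that contains zero) can be applied programmatically. The sketch below works on a plain dictionary of interval bounds; in the notebook these bounds would presumably come from <code>model_logit.conf_int()</code>, which is an assumption about how one would wire it up.

```python
# (lower, upper) bounds of the 95% confidence intervals.
# The first two entries are copied from the summary above; the last one is a
# made-up interval, included only to show a variable being flagged.
conf_int = {
    "age": (0.023, 0.028),
    "workclass_recod_private": (-0.599, -0.308),
    "hypothetical_var": (-0.120, 0.045),
}

# A coefficient is not significant at the 5% level when its interval contains zero
not_significant = [
    name for name, (low, high) in conf_int.items() if low <= 0 <= high
]

print(not_significant)  # ['hypothetical_var']
```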


The model is re-estimated with the modified dataframe:

In [6]:
base = fun.mf(df_mod, var_obj='income')             # Build the model formula with the helper function
model_logit2 = smf.logit(base, data=df_mod).fit()   # Fit the model
model_logit2.summary()

Optimization terminated successfully.
         Current function value: 0.332362
         Iterations 9


Dep. Variable:,income,No. Observations:,46033.0
Model:,Logit,Df Residuals:,46010.0
Method:,MLE,Df Model:,22.0
Date:,"Sat, 09 Jul 2022",Pseudo R-squ.:,0.4068
Time:,11:20:57,Log-Likelihood:,-15300.0
converged:,True,LL-Null:,-25791.0
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-7.7315,0.209,-37.042,0.000,-8.141,-7.322
age,0.0256,0.001,19.346,0.000,0.023,0.028
fnlwgt,7.485e-07,1.38e-07,5.422,0.000,4.78e-07,1.02e-06
educational_num,0.2958,0.007,41.308,0.000,0.282,0.310
capital_gain,0.0003,8.55e-06,37.199,0.000,0.000,0.000
capital_loss,0.0007,3.05e-05,21.729,0.000,0.001,0.001
hours_per_week,0.0296,0.001,22.653,0.000,0.027,0.032
workclass_recod_private,-0.4338,0.074,-5.889,0.000,-0.578,-0.289
workclass_recod_self_employed,-0.7735,0.081,-9.585,0.000,-0.932,-0.615


#### Second analysis of results

This model is considered more suitable because it uses fewer regressors (22 instead of 33) while barely losing fit:

- The <code>Pseudo R-squ. is 0.4068</code>, so the improvement over the intercept-only model is still about 41%, practically unchanged from the first model (0.4073).
- The likelihood-ratio test still indicates that the model is significant,<code> rejecting the null hypothesis </code>(that all slope coefficients are zero).
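One way to weigh fit against parsimony is to compare information criteria for the two models. The sketch below computes AIC and BIC from the rounded log-likelihoods in the two summaries, counting the intercept as an estimated parameter; it is an illustrative check, not part of the original analysis.

```python
import math

n_obs = 46033  # No. Observations (same for both models)

# Log-likelihood and number of estimated parameters (Df Model + intercept),
# copied from the two summaries above
models = {
    "model_logit": {"ll": -15287.0, "k": 33 + 1},
    "model_logit2": {"ll": -15300.0, "k": 22 + 1},
}

for name, m in models.items():
    # AIC = 2k - 2LL; BIC = k*ln(n) - 2LL (lower is better for both)
    m["aic"] = 2 * m["k"] - 2 * m["ll"]
    m["bic"] = m["k"] * math.log(n_obs) - 2 * m["ll"]
    print(name, round(m["aic"], 1), round(m["bic"], 1))
```

With these rounded values the AIC is marginally lower for the first model (30642 against 30646), while the BIC, which penalizes extra parameters more heavily, is lower for the reduced model. The two criteria therefore point in different directions, and the choice of the reduced model rests mainly on parsimony.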

#### Definition of the predictive modeling strategies:

After reviewing the data and completing the descriptive modeling, predictions will be made with logistic regression.

- The <code>logistic regression</code> module from the <code>scikit-learn</code> library will be used.
    - Training and validation sets will be generated.
    - The data will be standardized, since some variables have value ranges that could bias the model (for example "fnlwgt").
    - The model will be fitted, predictions will be made, and these will be compared against new models built from the results obtained along the way.
    

The dataframe obtained in the second analysis of this milestone will be used (<code>df_mod</code>, with the refined set of variables obtained with Logit).

In [7]:
# Export the processed .csv
df_mod.to_csv('income_mod.csv', index=False)
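The predictive strategy described above (train/validation split, standardization, logistic regression) could be sketched as follows. The data here is synthetic and made up for illustration, standing in for <code>df_mod</code>; the split size, random seeds and default model settings are arbitrary assumptions, not choices taken from the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_mod (made-up data): one small-scale and one
# large-scale feature, loosely mimicking hours_per_week and fnlwgt
rng = np.random.default_rng(0)
n = 1000
hours = rng.normal(40, 10, n)
weight = rng.normal(190_000, 100_000, n)
X = np.column_stack([hours, weight])
# Binary target driven by the first feature plus noise
y = (hours + rng.normal(0, 5, n) > 45).astype(int)

# Hold out a validation set (test_size and random_state are arbitrary)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Inside the pipeline the scaler is fitted on the training set only,
# so no information from the validation set leaks into the model
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

accuracy = clf.score(X_val, y_val)
print(f"Validation accuracy: {accuracy:.3f}")
```

Putting the scaler and the classifier in one pipeline keeps the standardization consistent between fitting and prediction, which matters when a feature such as "fnlwgt" is several orders of magnitude larger than the others.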