# EBAC - Regression II - multiple regression

## Task I

#### Income prediction

We will work with the 'previsao_de_renda.csv' dataset, which is the dataset for your next project. We will apply the techniques we have seen so far to it.

|variable|description|
|-|-|
|data_ref                | Reference date on which the variables were collected |
|index                   | Customer identification code|
|sexo                    | Customer's sex|
|posse_de_veiculo        | Whether the customer owns a vehicle|
|posse_de_imovel         | Whether the customer owns real estate|
|qtd_filhos              | Number of children the customer has|
|tipo_renda              | Customer's type of income|
|educacao                | Customer's education level|
|estado_civil            | Customer's marital status|
|tipo_residencia         | Customer's type of residence (owned, rented etc.)|
|idade                   | Customer's age|
|tempo_emprego           | Time in the current job|
|qt_pessoas_residencia   | Number of people living in the residence|
|renda                   | Income in reais|

In [1]:
import patsy
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('previsao_de_renda.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             15000 non-null  int64  
 1   data_ref               15000 non-null  object 
 2   id_cliente             15000 non-null  int64  
 3   sexo                   15000 non-null  object 
 4   posse_de_veiculo       15000 non-null  bool   
 5   posse_de_imovel        15000 non-null  bool   
 6   qtd_filhos             15000 non-null  int64  
 7   tipo_renda             15000 non-null  object 
 8   educacao               15000 non-null  object 
 9   estado_civil           15000 non-null  object 
 10  tipo_residencia        15000 non-null  object 
 11  idade                  15000 non-null  int64  
 12  tempo_emprego          12427 non-null  float64
 13  qt_pessoas_residencia  15000 non-null  float64
 14  renda                  15000 non-null  float64
dtypes:
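
The `df.info()` output above shows that `tempo_emprego` has only 12427 non-null values out of 15000. Patsy's default behaviour, as far as its documented `NA_action='drop'` goes, is to discard every row with a missing value in any variable used by the formula, which would explain a model fitted on 12427 observations. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical toy data: two of the five rows lack tempo_emprego.
toy = pd.DataFrame({
    "renda": [1000.0, 2500.0, 1800.0, 3200.0, 900.0],
    "idade": [25, 40, 33, 51, 29],
    "tempo_emprego": [1.5, np.nan, 4.0, np.nan, 0.5],
})

# With the default NA handling, incomplete rows are dropped from y and X.
y_toy, X_toy = patsy.dmatrices(
    "np.log(renda) ~ idade + tempo_emprego", data=toy, return_type="dataframe"
)

n_missing = int(toy["tempo_emprego"].isna().sum())
print(f"rows in data: {len(toy)}, missing: {n_missing}, rows in X: {len(X_toy)}")
```

If the missing values should be kept, they have to be imputed before the design matrices are built; whether that is appropriate here is a modelling decision this sketch does not address.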

1. Fit a model to predict log(renda) using all available covariates.
    - Using Patsy, encode the qualitative variables as *dummies*.
    - Always keep the most frequent category as the reference level.
    - Evaluate the parameters and check whether they make practical sense.
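
By default Patsy takes the first level in sorted order as the reference, which is not necessarily the most frequent one. One possible way to follow the instruction above is to wrap each categorical variable in `C(..., Treatment(reference=...))`. The sketch below assumes string-valued columns; the helper `term_with_modal_reference` and the `toy` data are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical toy data: level "b" is the most frequent one.
toy = pd.DataFrame({
    "renda": [1000.0, 2500.0, 1800.0, 3200.0, 900.0, 2100.0],
    "grupo": ["a", "b", "b", "c", "b", "a"],
})


def term_with_modal_reference(data, column):
    """Return a patsy term whose reference level is the column's mode."""
    mode = data[column].value_counts().idxmax()
    return f"C({column}, Treatment(reference={mode!r}))"


formula = "np.log(renda) ~ " + term_with_modal_reference(toy, "grupo")
y_toy, X_toy = patsy.dmatrices(formula, data=toy, return_type="dataframe")

# The design matrix has dummies for "a" and "c" only; "b" is the baseline.
print(list(X_toy.columns))
```

For the real dataset, the same helper could be applied to each qualitative column and the resulting terms joined with `+`, so that every coefficient is read relative to the most common category.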




    

In [16]:
# 1.

# Build the log-transformed response and the design matrix with Patsy, which encodes the categorical variables as dummies
y, X = patsy.dmatrices('np.log(renda) ~ sexo + posse_de_veiculo + posse_de_imovel + qtd_filhos + tipo_renda + educacao + estado_civil + tipo_residencia + idade + tempo_emprego + qt_pessoas_residencia', data=df, return_type='dataframe')


# Check the first rows to make sure everything was encoded correctly
print(X.head())


   Intercept  sexo[T.M]  posse_de_veiculo[T.True]  posse_de_imovel[T.True]  \
0        1.0        0.0                       0.0                      1.0   
1        1.0        1.0                       1.0                      1.0   
2        1.0        0.0                       1.0                      1.0   
3        1.0        0.0                       0.0                      1.0   
4        1.0        1.0                       1.0                      0.0   

   tipo_renda[T.Bolsista]  tipo_renda[T.Empresário]  \
0                     0.0                       1.0   
1                     0.0                       0.0   
2                     0.0                       1.0   
3                     0.0                       0.0   
4                     0.0                       0.0   

   tipo_renda[T.Pensionista]  tipo_renda[T.Servidor público]  \
0                        0.0                             0.0   
1                        0.0                             0.0   
2       

In [17]:
modelo = sm.OLS(y,X).fit()

modelo.summary()

Dep. Variable:,np.log(renda),R-squared:,0.357
Model:,OLS,Adj. R-squared:,0.356
Method:,Least Squares,F-statistic:,287.5
Date:,"Tue, 12 Mar 2024",Prob (F-statistic):,0.0
Time:,20:17:22,Log-Likelihood:,-13568.0
No. Observations:,12427,AIC:,27190.0
Df Residuals:,12402,BIC:,27370.0
Df Model:,24,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.5847,0.235,28.006,0.000,6.124,7.046
sexo[T.M],0.7874,0.015,53.723,0.000,0.759,0.816
posse_de_veiculo[T.True],0.0441,0.014,3.119,0.002,0.016,0.072
posse_de_imovel[T.True],0.0829,0.014,5.926,0.000,0.055,0.110
tipo_renda[T.Bolsista],0.2209,0.241,0.916,0.360,-0.252,0.694
tipo_renda[T.Empresário],0.1551,0.015,10.387,0.000,0.126,0.184
tipo_renda[T.Pensionista],-0.3087,0.241,-1.280,0.201,-0.782,0.164
tipo_renda[T.Servidor público],0.0576,0.022,2.591,0.010,0.014,0.101
educacao[T.Pós graduação],0.1071,0.159,0.673,0.501,-0.205,0.419

Omnibus:,0.858,Durbin-Watson:,2.023
Prob(Omnibus):,0.651,Jarque-Bera (JB):,0.839
Skew:,0.019,Prob(JB):,0.657
Kurtosis:,3.012,Cond. No.,2180.0
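
Because the response is log(renda), a coefficient is not a change in reais: a coefficient b corresponds to multiplying renda by exp(b), holding the other covariates fixed. The short sketch below converts some of the coefficients printed above into percentage effects, which helps judge whether they make practical sense. The helper name `pct_effect` is hypothetical.

```python
import math


def pct_effect(coef):
    """Percentage change in renda implied by a coefficient on log(renda)."""
    return (math.exp(coef) - 1.0) * 100.0


# Coefficients copied from the summary above.
coefs = {
    "sexo[T.M]": 0.7874,
    "posse_de_veiculo[T.True]": 0.0441,
    "posse_de_imovel[T.True]": 0.0829,
    "tipo_renda[T.Empresário]": 0.1551,
}

for name, coef in coefs.items():
    print(f"{name}: {pct_effect(coef):+.1f}%")
```

For example, the `sexo[T.M]` coefficient of 0.7874 corresponds to an income roughly 120% higher than the reference category, not 79% higher, which is why the exponential conversion matters for large coefficients.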


In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

modelo = sm.OLS(y_train, X_train).fit()
modelo.summary()

Dep. Variable:,np.log(renda),R-squared:,0.355
Model:,OLS,Adj. R-squared:,0.353
Method:,Least Squares,F-statistic:,227.4
Date:,"Tue, 12 Mar 2024",Prob (F-statistic):,0.0
Time:,20:18:12,Log-Likelihood:,-10864.0
No. Observations:,9941,AIC:,21780.0
Df Residuals:,9916,BIC:,21960.0
Df Model:,24,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.5867,0.244,26.940,0.000,6.107,7.066
sexo[T.M],0.7873,0.016,48.058,0.000,0.755,0.819
posse_de_veiculo[T.True],0.0364,0.016,2.310,0.021,0.006,0.067
posse_de_imovel[T.True],0.0922,0.016,5.878,0.000,0.061,0.123
tipo_renda[T.Bolsista],0.2146,0.242,0.888,0.375,-0.259,0.688
tipo_renda[T.Empresário],0.1649,0.017,9.828,0.000,0.132,0.198
tipo_renda[T.Pensionista],-0.3045,0.256,-1.188,0.235,-0.807,0.198
tipo_renda[T.Servidor público],0.0727,0.025,2.922,0.003,0.024,0.121
educacao[T.Pós graduação],0.1032,0.184,0.562,0.574,-0.257,0.463

Omnibus:,0.943,Durbin-Watson:,2.002
Prob(Omnibus):,0.624,Jarque-Bera (JB):,0.908
Skew:,0.019,Prob(JB):,0.635
Kurtosis:,3.027,Cond. No.,2010.0


2. Remove the least significant variable and analyze:
    - Look at the indicators we have seen and assess whether, in your opinion, the model got better or worse.
    - Look at the parameters and check whether any of them changed a lot.

In [21]:
# 2.
# Removing the least significant variable (tipo_residencia)

y, X = patsy.dmatrices('np.log(renda) ~ sexo + posse_de_veiculo + posse_de_imovel + qtd_filhos + tipo_renda + educacao + estado_civil + idade + tempo_emprego + qt_pessoas_residencia', data=df, return_type='dataframe')

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

modelo = sm.OLS(y_train, X_train).fit()
modelo.summary()

Dep. Variable:,np.log(renda),R-squared:,0.355
Model:,OLS,Adj. R-squared:,0.353
Method:,Least Squares,F-statistic:,287.1
Date:,"Tue, 12 Mar 2024",Prob (F-statistic):,0.0
Time:,20:28:20,Log-Likelihood:,-10866.0
No. Observations:,9941,AIC:,21770.0
Df Residuals:,9921,BIC:,21920.0
Df Model:,19,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.5158,0.237,27.505,0.000,6.051,6.980
sexo[T.M],0.7892,0.016,48.321,0.000,0.757,0.821
posse_de_veiculo[T.True],0.0360,0.016,2.284,0.022,0.005,0.067
posse_de_imovel[T.True],0.0918,0.015,5.969,0.000,0.062,0.122
tipo_renda[T.Bolsista],0.2164,0.242,0.896,0.370,-0.257,0.690
tipo_renda[T.Empresário],0.1650,0.017,9.853,0.000,0.132,0.198
tipo_renda[T.Pensionista],-0.3071,0.256,-1.198,0.231,-0.809,0.195
tipo_renda[T.Servidor público],0.0736,0.025,2.964,0.003,0.025,0.122
educacao[T.Pós graduação],0.1104,0.184,0.601,0.548,-0.249,0.470

Omnibus:,0.857,Durbin-Watson:,2.002
Prob(Omnibus):,0.651,Jarque-Bera (JB):,0.825
Skew:,0.019,Prob(JB):,0.662
Kurtosis:,3.024,Cond. No.,1990.0
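
To compare the two training-set models, the AIC and BIC can be reproduced from the printed log-likelihood: AIC = 2k - 2·llf and BIC = k·ln(n) - 2·llf, where k is `Df Model` plus one for the intercept. The sketch below plugs in the values shown in the two summaries; since the printed log-likelihoods are rounded, the results match the reported figures only approximately.

```python
import math


def aic(llf, n_params):
    """Akaike information criterion from the log-likelihood."""
    return 2 * n_params - 2 * llf


def bic(llf, n_params, n_obs):
    """Bayesian information criterion from the log-likelihood."""
    return n_params * math.log(n_obs) - 2 * llf


n_obs = 9941
# (log-likelihood, Df Model + 1 for the intercept), as printed above.
models = {
    "with tipo_residencia": (-10864.0, 25),
    "without tipo_residencia": (-10866.0, 20),
}

for name, (llf, k) in models.items():
    print(f"{name}: AIC={aic(llf, k):.0f}  BIC={bic(llf, k, n_obs):.0f}")
```

Lower values are better for both criteria. Dropping `tipo_residencia` barely changes the log-likelihood while saving five parameters, so both AIC and BIC decrease slightly, and BIC decreases more because it penalizes each parameter by ln(n) rather than 2.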


3. Keep removing the least significant variables as long as the *p-value* is greater than 5%. Compare the final model with the initial one. Look at the indicators and conclude whether the model seems better.

In [27]:
# 3.
# Removing the least significant variables (posse_de_veiculo, educacao)

y_novo, X_novo = patsy.dmatrices('np.log(renda) ~ sexo + posse_de_imovel + qtd_filhos + tipo_renda + estado_civil + idade + tempo_emprego + qt_pessoas_residencia', data=df, return_type='dataframe')


In [28]:
X_train_novo, X_test_novo, y_train_novo, y_test_novo = train_test_split(X_novo, y_novo, test_size=0.2, random_state=42)

modelo_novo = sm.OLS(y_train_novo, X_train_novo).fit()
modelo_novo.summary()

Dep. Variable:,np.log(renda),R-squared:,0.351
Model:,OLS,Adj. R-squared:,0.35
Method:,Least Squares,F-statistic:,383.0
Date:,"Tue, 12 Mar 2024",Prob (F-statistic):,0.0
Time:,20:38:29,Log-Likelihood:,-10897.0
No. Observations:,9941,AIC:,21820.0
Df Residuals:,9926,BIC:,21930.0
Df Model:,14,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.5648,0.225,29.219,0.000,6.124,7.005
sexo[T.M],0.7961,0.016,51.101,0.000,0.766,0.827
posse_de_imovel[T.True],0.0965,0.015,6.261,0.000,0.066,0.127
tipo_renda[T.Bolsista],0.2815,0.242,1.163,0.245,-0.193,0.756
tipo_renda[T.Empresário],0.1766,0.017,10.563,0.000,0.144,0.209
tipo_renda[T.Pensionista],-0.2517,0.257,-0.980,0.327,-0.755,0.252
tipo_renda[T.Servidor público],0.0928,0.025,3.748,0.000,0.044,0.141
estado_civil[T.Separado],0.3371,0.115,2.933,0.003,0.112,0.562
estado_civil[T.Solteiro],0.2850,0.112,2.539,0.011,0.065,0.505

Omnibus:,0.873,Durbin-Watson:,1.999
Prob(Omnibus):,0.646,Jarque-Bera (JB):,0.842
Skew:,0.02,Prob(JB):,0.656
Kurtosis:,3.022,Cond. No.,1960.0


In [29]:
y_novo, X_novo = patsy.dmatrices('np.log(renda) ~ sexo + posse_de_imovel + idade + tempo_emprego + qt_pessoas_residencia', data=df, return_type='dataframe')


In [30]:
X_train_novo, X_test_novo, y_train_novo, y_test_novo = train_test_split(X_novo, y_novo, test_size=0.2, random_state=42)

modelo_novo = sm.OLS(y_train_novo, X_train_novo).fit()
modelo_novo.summary()

Dep. Variable:,np.log(renda),R-squared:,0.342
Model:,OLS,Adj. R-squared:,0.342
Method:,Least Squares,F-statistic:,1033.0
Date:,"Tue, 12 Mar 2024",Prob (F-statistic):,0.0
Time:,20:41:08,Log-Likelihood:,-10963.0
No. Observations:,9941,AIC:,21940.0
Df Residuals:,9935,BIC:,21980.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7.2279,0.043,166.881,0.000,7.143,7.313
sexo[T.M],0.7807,0.016,50.344,0.000,0.750,0.811
posse_de_imovel[T.True],0.0998,0.015,6.450,0.000,0.069,0.130
idade,0.0047,0.001,5.491,0.000,0.003,0.006
tempo_emprego,0.0605,0.001,52.377,0.000,0.058,0.063
qt_pessoas_residencia,0.0151,0.008,1.871,0.061,-0.001,0.031

Omnibus:,1.093,Durbin-Watson:,1.993
Prob(Omnibus):,0.579,Jarque-Bera (JB):,1.068
Skew:,0.024,Prob(JB):,0.586
Kurtosis:,3.016,Cond. No.,255.0


- There were no substantial changes between the initial and the final model in the share of the response variable's variability explained by the independent variables (training R² of 0.355 versus 0.342). Even so, because the final model is simpler, it is easier to interpret and less prone to overfitting, even though its explanatory power (about 34%) is slightly lower than that of the initial model.
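
The adjusted R² makes this comparison fairer because it penalizes the number of predictors: adj R² = 1 - (1 - R²)(n - 1)/(n - k - 1), with k equal to `Df Model`. The sketch below recomputes it from the rounded values printed in the initial and final training summaries; the helper name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


n_obs = 9941
# (R-squared, Df Model) as printed in the training summaries.
initial = adjusted_r2(0.355, n_obs, 24)
final = adjusted_r2(0.342, n_obs, 5)

print(f"initial model: {initial:.3f}")
print(f"final model:   {final:.3f}")
print(f"difference:    {initial - final:.3f}")
```

With almost ten thousand observations the penalty is tiny, so the adjusted values stay close to the raw R² and the gap between the two models remains about one percentage point.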