# Regression IV - multiple regression


#### Income prediction

We will work with the dataset 'previsao_de_renda.csv', which is the dataset for your next project. We will apply the tools covered so far to it.

|variable|description|
|-|-|
|data_ref                | Reference date on which the variables were collected |
|id_cliente              | Customer identification code|
|sexo                    | Customer's sex|
|posse_de_veiculo        | Whether the customer owns a vehicle|
|posse_de_imovel         | Whether the customer owns real estate|
|qtd_filhos              | Number of children the customer has|
|tipo_renda              | Customer's type of income|
|educacao                | Customer's education level|
|estado_civil            | Customer's marital status|
|tipo_residencia         | Customer's type of residence (owned, rented etc.)|
|idade                   | Customer's age|
|tempo_emprego           | Time in current job|
|qt_pessoas_residencia   | Number of people living in the residence|
|renda                   | Income in reais|

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy

In [2]:
df = pd.read_csv('previsao_de_renda.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             15000 non-null  int64  
 1   data_ref               15000 non-null  object 
 2   id_cliente             15000 non-null  int64  
 3   sexo                   15000 non-null  object 
 4   posse_de_veiculo       15000 non-null  bool   
 5   posse_de_imovel        15000 non-null  bool   
 6   qtd_filhos             15000 non-null  int64  
 7   tipo_renda             15000 non-null  object 
 8   educacao               15000 non-null  object 
 9   estado_civil           15000 non-null  object 
 10  tipo_residencia        15000 non-null  object 
 11  idade                  15000 non-null  int64  
 12  tempo_emprego          12427 non-null  float64
 13  qt_pessoas_residencia  15000 non-null  float64
 14  renda                  15000 non-null  float64
dtypes:

1. Fit a model to predict log(renda) using all available covariates.
    - Using Patsy, encode the qualitative variables as *dummies*.
    - Always keep the most frequent category as the reference level.
    - Evaluate the parameters and check whether they make practical sense.
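
Patsy's `Treatment` contrast accepts the reference either as a position in the sorted levels or as a level name. A minimal sketch of choosing the reference by name from the most frequent category, assuming a small made-up DataFrame `toy` and a hypothetical helper `treatment_term` (neither is part of the original notebook):

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical toy data standing in for previsao_de_renda.csv.
toy = pd.DataFrame({
    "renda": [1000.0, 2500.0, 1800.0, 3200.0, 900.0, 4100.0],
    "tipo_renda": ["Assalariado", "Empresário", "Assalariado",
                   "Assalariado", "Pensionista", "Empresário"],
})

def treatment_term(data, column):
    """Build a Patsy term whose reference level is the column's most frequent value."""
    most_frequent = data[column].value_counts().idxmax()
    return f"C({column}, Treatment(reference='{most_frequent}'))"

term = treatment_term(toy, "tipo_renda")
y_toy, x_toy = patsy.dmatrices(f"np.log(renda) ~ {term}", data=toy)

# The reference level gets no dummy column of its own.
columns = x_toy.design_info.column_names
print(columns)
```

Selecting the reference by name keeps the formula correct even if the set of levels changes, whereas a positional index depends on the alphabetical order of the levels.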

In [24]:
# Treatment(2) picks the reference level of tipo_renda by position (index 2 of the sorted levels).
y, x = patsy.dmatrices('np.log(renda) ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + qtd_filhos + C(tipo_renda, Treatment(2)) + educacao + estado_civil + tipo_residencia + idade + tempo_emprego + qt_pessoas_residencia', data=df)
x

DesignMatrix with shape (12427, 25)
  Columns:
    ['Intercept',
     'sexo[T.M]',
     'C(posse_de_veiculo)[T.True]',
     'C(posse_de_imovel)[T.True]',
     'C(tipo_renda, Treatment(2))[T.Assalariado]',
     'C(tipo_renda, Treatment(2))[T.Bolsista]',
     'C(tipo_renda, Treatment(2))[T.Pensionista]',
     'C(tipo_renda, Treatment(2))[T.Servidor público]',
     'educacao[T.Pós graduação]',
     'educacao[T.Secundário]',
     'educacao[T.Superior completo]',
     'educacao[T.Superior incompleto]',
     'estado_civil[T.Separado]',
     'estado_civil[T.Solteiro]',
     'estado_civil[T.União]',
     'estado_civil[T.Viúvo]',
     'tipo_residencia[T.Casa]',
     'tipo_residencia[T.Com os pais]',
     'tipo_residencia[T.Comunitário]',
     'tipo_residencia[T.Estúdio]',
     'tipo_residencia[T.Governamental]',
     'qtd_filhos',
     'idade',
     'tempo_emprego',
     'qt_pessoas_residencia']
  Terms:
    'Intercept' (column 0)
    'sexo' (column 1)
    'C(posse_de_veiculo)' (column 2)
    '

In [26]:
sm.OLS(y, x).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|np.log(renda)|R-squared:|0.357|
|Model:|OLS|Adj. R-squared:|0.356|
|Method:|Least Squares|F-statistic:|287.5|
|Date:|Tue, 14 Mar 2023|Prob (F-statistic):|0.0|
|Time:|08:26:02|Log-Likelihood:|-13568.0|
|No. Observations:|12427|AIC:|27190.0|
|Df Residuals:|12402|BIC:|27370.0|
|Df Model:|24| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.7399|0.235|28.665|0.000|6.279|7.201|
|sexo[T.M]|0.7874|0.015|53.723|0.000|0.759|0.816|
|C(posse_de_veiculo)[T.True]|0.0441|0.014|3.119|0.002|0.016|0.072|
|C(posse_de_imovel)[T.True]|0.0829|0.014|5.926|0.000|0.055|0.110|
|C(tipo_renda, Treatment(2))[T.Assalariado]|-0.1551|0.015|-10.387|0.000|-0.184|-0.126|
|C(tipo_renda, Treatment(2))[T.Bolsista]|0.0657|0.241|0.272|0.785|-0.407|0.539|
|C(tipo_renda, Treatment(2))[T.Pensionista]|-0.4639|0.241|-1.922|0.055|-0.937|0.009|
|C(tipo_renda, Treatment(2))[T.Servidor público]|-0.0976|0.024|-4.054|0.000|-0.145|-0.050|
|educacao[T.Pós graduação]|0.1071|0.159|0.673|0.501|-0.205|0.419|

| | | | |
|-|-|-|-|
|Omnibus:|0.858|Durbin-Watson:|2.023|
|Prob(Omnibus):|0.651|Jarque-Bera (JB):|0.839|
|Skew:|0.019|Prob(JB):|0.657|
|Kurtosis:|3.012|Cond. No.|2180.0|


Judging by the p-value of each explanatory variable, tipo_residencia and educacao do not appear to be significant in this model at the 5% significance level (95% confidence).
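
The p-values in the summary test one dummy at a time, so they do not directly answer whether a categorical variable matters as a whole. One way to check that is a joint F-test between nested models. The sketch below assumes simulated data in which `educacao` has no effect by construction; the DataFrame `toy` and the column `log_renda` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Simulated stand-in data: the response depends only on tempo_emprego.
toy = pd.DataFrame({
    "tempo_emprego": rng.uniform(0, 30, n),
    "educacao": rng.choice(["Primário", "Secundário", "Superior completo"], n),
})
toy["log_renda"] = 7.0 + 0.06 * toy["tempo_emprego"] + rng.normal(0, 0.7, n)

# Full model with the categorical variable, reduced model without it.
full = smf.ols("log_renda ~ tempo_emprego + educacao", data=toy).fit()
reduced = smf.ols("log_renda ~ tempo_emprego", data=toy).fit()

# Joint F-test of all educacao dummies at once.
f_value, p_value, df_diff = full.compare_f_test(reduced)
print(f"F = {f_value:.3f}, p = {p_value:.3f}, restrictions = {df_diff:.0f}")
```

A large p-value here suggests that dropping the whole variable does not significantly worsen the fit.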

2. Remove the least significant variable and analyze:
    - Look at the indicators we have seen and assess whether, in your opinion, the model got better or worse.
    - Look at the parameters and check whether any of them changed substantially.
    

In [27]:
# Same model without educacao and tipo_residencia.
y, x = patsy.dmatrices('np.log(renda) ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + qtd_filhos + C(tipo_renda, Treatment(2)) + estado_civil + idade + tempo_emprego + qt_pessoas_residencia', data=df)
sm.OLS(y, x).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|np.log(renda)|R-squared:|0.354|
|Model:|OLS|Adj. R-squared:|0.353|
|Method:|Least Squares|F-statistic:|453.1|
|Date:|Tue, 14 Mar 2023|Prob (F-statistic):|0.0|
|Time:|08:27:23|Log-Likelihood:|-13603.0|
|No. Observations:|12427|AIC:|27240.0|
|Df Residuals:|12411|BIC:|27360.0|
|Df Model:|15| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.7606|0.219|30.928|0.000|6.332|7.189|
|sexo[T.M]|0.7819|0.015|53.480|0.000|0.753|0.811|
|C(posse_de_veiculo)[T.True]|0.0535|0.014|3.789|0.000|0.026|0.081|
|C(posse_de_imovel)[T.True]|0.0848|0.014|6.172|0.000|0.058|0.112|
|C(tipo_renda, Treatment(2))[T.Assalariado]|-0.1655|0.015|-11.120|0.000|-0.195|-0.136|
|C(tipo_renda, Treatment(2))[T.Bolsista]|0.1343|0.242|0.556|0.579|-0.340|0.608|
|C(tipo_renda, Treatment(2))[T.Pensionista]|-0.4195|0.242|-1.735|0.083|-0.894|0.055|
|C(tipo_renda, Treatment(2))[T.Servidor público]|-0.0886|0.024|-3.681|0.000|-0.136|-0.041|
|estado_civil[T.Separado]|0.3241|0.112|2.907|0.004|0.106|0.543|

| | | | |
|-|-|-|-|
|Omnibus:|1.149|Durbin-Watson:|2.024|
|Prob(Omnibus):|0.563|Jarque-Bera (JB):|1.121|
|Skew:|0.021|Prob(JB):|0.571|
|Kurtosis:|3.019|Cond. No.|2130.0|


Removing these variables confirms that they contribute little to this model: R² goes from 0.357 to 0.354 and adjusted R² from 0.356 to 0.353. AIC rises slightly (27190 to 27240), while BIC falls slightly (27370 to 27360).
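
AIC and BIC can move in opposite directions because BIC penalizes each parameter by log(n) instead of 2. As a rough check, the sketch below recomputes both criteria from the log-likelihoods and parameter counts reported in the two summaries above. The log-likelihoods are displayed rounded, so the results only match the printed AIC and BIC to display precision; the function names `aic` and `bic` are illustrative.

```python
import math

n_obs = 12427  # "No. Observations" in both summaries

def aic(log_likelihood, n_params):
    """Akaike criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian criterion: k log(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# (log-likelihood, number of parameters including the intercept),
# copied from the summaries: Df Model + 1.
models = {
    "full": (-13568.0, 25),
    "without educacao and tipo_residencia": (-13603.0, 16),
}

for name, (log_lik, k) in models.items():
    print(name, round(aic(log_lik, k)), round(bic(log_lik, k, n_obs)))
```

With about 12 thousand observations, each parameter costs roughly 9.4 in BIC against 2 in AIC, so dropping nine dummies lowers BIC even though the log-likelihood gets worse.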

3. Keep removing the least significant variables as long as the *p-value* is greater than 5%. Compare the final model with the initial one. Look at the indicators and conclude whether the model looks better.

In [29]:
# Same model without estado_civil and tipo_renda.
y, x = patsy.dmatrices('np.log(renda) ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + qtd_filhos + idade + tempo_emprego + qt_pessoas_residencia', data=df)
sm.OLS(y, x).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|np.log(renda)|R-squared:|0.347|
|Model:|OLS|Adj. R-squared:|0.346|
|Method:|Least Squares|F-statistic:|940.8|
|Date:|Tue, 14 Mar 2023|Prob (F-statistic):|0.0|
|Time:|08:33:27|Log-Likelihood:|-13672.0|
|No. Observations:|12427|AIC:|27360.0|
|Df Residuals:|12419|BIC:|27420.0|
|Df Model:|7| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|7.2377|0.043|166.875|0.000|7.153|7.323|
|sexo[T.M]|0.7694|0.015|52.676|0.000|0.741|0.798|
|C(posse_de_veiculo)[T.True]|0.0569|0.014|4.022|0.000|0.029|0.085|
|C(posse_de_imovel)[T.True]|0.0866|0.014|6.275|0.000|0.060|0.114|
|qtd_filhos|0.0338|0.019|1.735|0.083|-0.004|0.072|
|idade|0.0049|0.001|6.408|0.000|0.003|0.006|
|tempo_emprego|0.0610|0.001|59.075|0.000|0.059|0.063|
|qt_pessoas_residencia|-0.0092|0.016|-0.566|0.572|-0.041|0.023|

| | | | |
|-|-|-|-|
|Omnibus:|1.24|Durbin-Watson:|2.025|
|Prob(Omnibus):|0.538|Jarque-Bera (JB):|1.213|
|Skew:|0.022|Prob(JB):|0.545|
|Kurtosis:|3.019|Cond. No.|300.0|


Removing estado_civil and tipo_renda makes the model slightly worse (R² falls from 0.354 to 0.347), but the impact is small. However, qt_pessoas_residencia is now no longer significant (p = 0.572).

In [30]:
# Same model without qt_pessoas_residencia.
y, x = patsy.dmatrices('np.log(renda) ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + qtd_filhos + idade + tempo_emprego', data=df)
sm.OLS(y, x).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|np.log(renda)|R-squared:|0.347|
|Model:|OLS|Adj. R-squared:|0.346|
|Method:|Least Squares|F-statistic:|1098.0|
|Date:|Tue, 14 Mar 2023|Prob (F-statistic):|0.0|
|Time:|08:34:43|Log-Likelihood:|-13673.0|
|No. Observations:|12427|AIC:|27360.0|
|Df Residuals:|12420|BIC:|27410.0|
|Df Model:|6| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|7.2223|0.034|213.646|0.000|7.156|7.289|
|sexo[T.M]|0.7688|0.015|52.768|0.000|0.740|0.797|
|C(posse_de_veiculo)[T.True]|0.0561|0.014|3.986|0.000|0.029|0.084|
|C(posse_de_imovel)[T.True]|0.0866|0.014|6.278|0.000|0.060|0.114|
|qtd_filhos|0.0239|0.009|2.767|0.006|0.007|0.041|
|idade|0.0049|0.001|6.399|0.000|0.003|0.006|
|tempo_emprego|0.0610|0.001|59.084|0.000|0.059|0.063|

| | | | |
|-|-|-|-|
|Omnibus:|1.243|Durbin-Watson:|2.026|
|Prob(Omnibus):|0.537|Jarque-Bera (JB):|1.216|
|Skew:|0.022|Prob(JB):|0.545|
|Kurtosis:|3.02|Cond. No.|223.0|
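
Because the response is log(renda), each coefficient acts multiplicatively on income: a one-unit increase in a covariate multiplies the expected income by exp(coef), holding the others fixed. To judge whether the parameters make practical sense, a small sketch that converts the coefficients of the last summary above into approximate percentage effects (the values are copied by hand from that table; the names `coefs` and `pct_effect` are illustrative):

```python
import numpy as np
import pandas as pd

# Coefficients copied from the final summary (intercept omitted).
coefs = pd.Series({
    "sexo[T.M]": 0.7688,
    "C(posse_de_veiculo)[T.True]": 0.0561,
    "C(posse_de_imovel)[T.True]": 0.0866,
    "qtd_filhos": 0.0239,
    "idade": 0.0049,
    "tempo_emprego": 0.0610,
})

# Percentage change in income per one-unit increase: (exp(coef) - 1) * 100.
pct_effect = np.expm1(coefs) * 100
print(pct_effect.round(1))
```

For small coefficients the percentage effect is close to 100 times the coefficient itself; the gap only becomes large for values such as the one for sexo[T.M].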
