# Regressão II - regressão múltipla


#### Previsão de renda II

Vamos continuar trabalhando com a base 'previsao_de_renda.csv', que é a base do seu próximo projeto. Vamos usar os recursos que vimos até aqui nesta base.

|variavel|descrição|
|-|-|
|data_ref                | Data de referência de coleta das variáveis |
|index                   | Código de identificação do cliente|
|sexo                    | Sexo do cliente|
|posse_de_veiculo        | Indica se o cliente possui veículo|
|posse_de_imovel         | Indica se o cliente possui imóvel|
|qtd_filhos              | Quantidade de filhos do cliente|
|tipo_renda              | Tipo de renda do cliente|
|educacao                | Grau de instrução do cliente|
|estado_civil            | Estado civil do cliente|
|tipo_residencia         | Tipo de residência do cliente (própria, alugada etc)|
|idade                   | Idade do cliente|
|tempo_emprego           | Tempo no emprego atual|
|qt_pessoas_residencia   | Quantidade de pessoas que moram na residência|
|renda                   | Renda em reais|

In [211]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy

%matplotlib inline

In [212]:
df = pd.read_csv('previsao_de_renda.csv')

In [213]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             15000 non-null  int64  
 1   data_ref               15000 non-null  object 
 2   id_cliente             15000 non-null  int64  
 3   sexo                   15000 non-null  object 
 4   posse_de_veiculo       15000 non-null  bool   
 5   posse_de_imovel        15000 non-null  bool   
 6   qtd_filhos             15000 non-null  int64  
 7   tipo_renda             15000 non-null  object 
 8   educacao               15000 non-null  object 
 9   estado_civil           15000 non-null  object 
 10  tipo_residencia        15000 non-null  object 
 11  idade                  15000 non-null  int64  
 12  tempo_emprego          12427 non-null  float64
 13  qt_pessoas_residencia  15000 non-null  float64
 14  renda                  15000 non-null  float64
dtypes:

1. Separe a base em treinamento e teste (25% para teste, 75% para treinamento).


Sendo conservador, incluiremos a média nos dados faltantes de tempo de emprego.

In [214]:
df.tempo_emprego.fillna(df.tempo_emprego.mean(), inplace = True)

In [215]:
y = df.renda
X = df.drop(['Unnamed: 0', 'data_ref', 'id_cliente', 'renda'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

2. Rode uma regularização *ridge* com alpha = [0, 0.001, 0.005, 0.01, 0.05, 0.1] e avalie o $R^2$ na base de testes. Qual o melhor modelo?

In [216]:
dados_treino = pd.concat([X_train, y_train], axis = 1)

alphas = [0, 0.001, 0.005, 0.01, 0.05, 0.1] 

for alpha in alphas:
    modelo = 'renda ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + C(posse_de_imovel)+ qtd_filhos + idade + tempo_emprego'
    md = smf.ols(modelo, data = dados_treino)
    ridge_treino = md.fit_regularized(method = 'elastic_net' 
                             , refit = True
                             , L1_wt = 0.001
                             , alpha = alpha)
    print(ridge_treino.summary())

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.252
Model:                            OLS   Adj. R-squared:                  0.252
Method:                 Least Squares   F-statistic:                     540.9
Date:                Sat, 18 Mar 2023   Prob (F-statistic):               0.00
Time:                        17:22:36   Log-Likelihood:            -1.1583e+05
No. Observations:               11250   AIC:                         2.317e+05
Df Residuals:                   11243   BIC:                         2.317e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.252
Model:                            OLS   Adj. R-squared:                  0.252
Method:                 Least Squares   F-statistic:                     540.9
Date:                Sat, 18 Mar 2023   Prob (F-statistic):               0.00
Time:                        17:22:37   Log-Likelihood:            -1.1583e+05
No. Observations:               11250   AIC:                         2.317e+05
Df Residuals:                   11243   BIC:                         2.317e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

In [217]:
dados_teste = pd.concat([X_test, y_test], axis = 1)

for alpha in alphas:
    modelo = 'renda ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + C(posse_de_imovel)+ qtd_filhos + idade + tempo_emprego'
    md = smf.ols(modelo, data = dados_teste)
    ridge_teste = md.fit_regularized(method = 'elastic_net' 
                             , refit = True
                             , L1_wt = 0.001
                             , alpha = alpha)
    print(ridge_teste.summary())

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     191.4
Date:                Sat, 18 Mar 2023   Prob (F-statistic):          3.72e-243
Time:                        17:22:38   Log-Likelihood:                -38545.
No. Observations:                3750   AIC:                         7.711e+04
Df Residuals:                    3743   BIC:                         7.716e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     191.4
Date:                Sat, 18 Mar 2023   Prob (F-statistic):          3.72e-243
Time:                        17:22:39   Log-Likelihood:                -38545.
No. Observations:                3750   AIC:                         7.711e+04
Df Residuals:                    3743   BIC:                         7.716e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

In [218]:
ridge_teste

<statsmodels.regression.linear_model.OLSResults at 0x1adb388bd30>

In [219]:
pred_ridge = ridge_teste.predict(X_test)
r2_score(y_test,pred_ridge)

0.2635513969545923

Todos modelos tem R2 similares

3. Faça o mesmo que no passo 2, com uma regressão *LASSO*. Qual método chega a um melhor resultado?


In [220]:
for alpha in alphas:
    modelo = 'renda ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + C(posse_de_imovel)+ qtd_filhos + idade + tempo_emprego'
    md = smf.ols(modelo, data = dados_treino)
    lasso_treino = md.fit_regularized(method = 'elastic_net' 
                             , refit = True
                             , L1_wt = 1
                             , alpha = alpha)
    print(reg.summary())

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     191.4
Date:                Sat, 18 Mar 2023   Prob (F-statistic):          3.72e-243
Time:                        17:22:39   Log-Likelihood:                -38545.
No. Observations:                3750   AIC:                         7.711e+04
Df Residuals:                    3743   BIC:                         7.716e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     191.4
Date:                Sat, 18 Mar 2023   Prob (F-statistic):          3.72e-243
Time:                        17:22:41   Log-Likelihood:                -38545.
No. Observations:                3750   AIC:                         7.711e+04
Df Residuals:                    3743   BIC:                         7.716e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

In [221]:
for alpha in alphas:
    modelo = 'renda ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + C(posse_de_imovel)+ qtd_filhos + idade + tempo_emprego'
    md = smf.ols(modelo, data = dados_teste)
    lasso_teste = md.fit_regularized(method = 'elastic_net' 
                             , refit = True
                             , L1_wt = 1
                             , alpha = alpha)
    print(reg.summary())
    

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     191.4
Date:                Sat, 18 Mar 2023   Prob (F-statistic):          3.72e-243
Time:                        17:22:41   Log-Likelihood:                -38545.
No. Observations:                3750   AIC:                         7.711e+04
Df Residuals:                    3743   BIC:                         7.716e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     191.4
Date:                Sat, 18 Mar 2023   Prob (F-statistic):          3.72e-243
Time:                        17:22:42   Log-Likelihood:                -38545.
No. Observations:                3750   AIC:                         7.711e+04
Df Residuals:                    3743   BIC:                         7.716e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

In [222]:
pred_lasso = lasso_teste.predict(X_test)
r2_score(y_test,pred_lasso)

0.2635513969545923

O melho resultado seria para alfa = 0

4. Rode um modelo *stepwise*. Avalie o $R^2$ na vase de testes. Qual o melhor resultado?


In [223]:
def stepwise_selection(X, y,
                       initial_list=[], 
                       threshold_in=0.05, 
                       threshold_out = 0.05, 
                       verbose=True):
    included = list(initial_list)
    while True:
        changed=False
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded, dtype=np.dtype('float64'))
        for new_column in excluded:
#             y, x = patsy.dmatrices('np.log(renda) ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + C(posse_de_imovel)+ qtd_filhos + C(tipo_renda, Treatment(2)) + educacao + estado_civil + tipo_residencia + idade + tempo_emprego + qt_pessoas_residencia', data = df)
#             model = smf.ols(y, x, data = df)
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.index[new_pval.argmin()]
            included.append(best_feature)
            changed=True
            if verbose:
                 print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        print("#############")
        print(included)
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max() # null if pvalues is empty
        if worst_pval > threshold_out:
            changed=True
            worst_feature = pvalues.argmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included, model

In [224]:
X_train_dummies = pd.get_dummies(X_train)
zero_um = {True : 1, False : 0}
X_train_dummies['posse_de_veiculo'] = X_train_dummies['posse_de_veiculo'].map(zero_um)
X_train_dummies.posse_de_imovel = X_train_dummies.posse_de_imovel.map(zero_um)
X_train_dummies.head()

Unnamed: 0,posse_de_veiculo,posse_de_imovel,qtd_filhos,idade,tempo_emprego,qt_pessoas_residencia,sexo_F,sexo_M,tipo_renda_Assalariado,tipo_renda_Bolsista,...,estado_civil_Separado,estado_civil_Solteiro,estado_civil_União,estado_civil_Viúvo,tipo_residencia_Aluguel,tipo_residencia_Casa,tipo_residencia_Com os pais,tipo_residencia_Comunitário,tipo_residencia_Estúdio,tipo_residencia_Governamental
7410,1,1,0,49,5.682192,2.0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
10894,1,1,0,42,1.076712,2.0,0,1,1,0,...,0,0,0,0,0,1,0,0,0,0
1934,1,0,0,43,13.865753,2.0,0,1,1,0,...,0,0,0,0,0,1,0,0,0,0
11539,1,1,2,38,1.156164,3.0,1,0,1,0,...,0,1,0,0,0,1,0,0,0,0
2952,1,1,0,59,7.722635,2.0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [225]:
var, model = stepwise_selection(X_train_dummies, y_train)

  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd

Add  tempo_emprego                  with p-value 0.0
#############
['tempo_emprego']
Add  sexo_F                         with p-value 0.0
#############
['tempo_emprego', 'sexo_F']


  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd

Add  sexo_M                         with p-value 0.0
#############
['tempo_emprego', 'sexo_F', 'sexo_M']


  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd

Add  tipo_renda_Empresário          with p-value 3.91573e-11
#############
['tempo_emprego', 'sexo_F', 'sexo_M', 'tipo_renda_Empresário']


  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)


Add  educacao_Superior completo     with p-value 1.52119e-06
#############
['tempo_emprego', 'sexo_F', 'sexo_M', 'tipo_renda_Empresário', 'educacao_Superior completo']


  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)


Add  tipo_renda_Pensionista         with p-value 3.98185e-05
#############
['tempo_emprego', 'sexo_F', 'sexo_M', 'tipo_renda_Empresário', 'educacao_Superior completo', 'tipo_renda_Pensionista']


  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)


Add  idade                          with p-value 3.37147e-06
#############
['tempo_emprego', 'sexo_F', 'sexo_M', 'tipo_renda_Empresário', 'educacao_Superior completo', 'tipo_renda_Pensionista', 'idade']


  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)


Add  estado_civil_Casado            with p-value 0.0191784
#############
['tempo_emprego', 'sexo_F', 'sexo_M', 'tipo_renda_Empresário', 'educacao_Superior completo', 'tipo_renda_Pensionista', 'idade', 'estado_civil_Casado']
#############
['tempo_emprego', 'sexo_F', 'sexo_M', 'tipo_renda_Empresário', 'educacao_Superior completo', 'tipo_renda_Pensionista', 'idade', 'estado_civil_Casado']


  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)
  x = pd.concat(x[::order], 1)


In [226]:
var

['tempo_emprego',
 'sexo_F',
 'sexo_M',
 'tipo_renda_Empresário',
 'educacao_Superior completo',
 'tipo_renda_Pensionista',
 'idade',
 'estado_civil_Casado']

In [227]:
y, x = patsy.dmatrices('renda ~ tempo_emprego + sexo + tipo_renda + educacao + idade + estado_civil', data = dados_teste)
sw = sm.OLS(y, x).fit()
sw.summary()

0,1,2,3
Dep. Variable:,renda,R-squared:,0.271
Model:,OLS,Adj. R-squared:,0.268
Method:,Least Squares,F-statistic:,92.48
Date:,"Sat, 18 Mar 2023",Prob (F-statistic):,1.48e-242
Time:,17:22:45,Log-Likelihood:,-38527.0
No. Observations:,3750,AIC:,77090.0
Df Residuals:,3734,BIC:,77180.0
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-2740.0457,1358.067,-2.018,0.044,-5402.671,-77.420
sexo[T.M],5984.9757,255.641,23.412,0.000,5483.766,6486.185
tipo_renda[T.Bolsista],-1969.5439,4065.529,-0.484,0.628,-9940.419,6001.331
tipo_renda[T.Empresário],619.9958,291.295,2.128,0.033,48.883,1191.109
tipo_renda[T.Pensionista],-1936.2940,411.729,-4.703,0.000,-2743.530,-1129.058
tipo_renda[T.Servidor público],190.0174,412.185,0.461,0.645,-618.113,998.148
educacao[T.Pós graduação],-418.5494,2647.575,-0.158,0.874,-5609.383,4772.285
educacao[T.Secundário],51.3215,1233.501,0.042,0.967,-2367.080,2469.723
educacao[T.Superior completo],664.9472,1240.026,0.536,0.592,-1766.247,3096.142

0,1,2,3
Omnibus:,5072.527,Durbin-Watson:,1.987
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2137386.184
Skew:,7.419,Prob(JB):,0.0
Kurtosis:,119.013,Cond. No.,1620.0


In [228]:
pred_sw = sw.predict(x)
r2_score(y,pred_sw)

0.27086905468459643

5. Compare os parâmetros e avalie eventuais diferenças. Qual modelo você acha o melhor de todos?

Analisando todos os R2, o modelo de lasso para teste teve os melhores resultados.

6. Partindo dos modelos que você ajustou, tente melhorar o $R^2$ na base de testes. Use a criatividade, veja se consegue inserir alguma transformação ou combinação de variáveis.

In [229]:
for alpha in alphas:
    modelo = 'np.log(renda) ~ sexo + C(posse_de_veiculo) + C(posse_de_imovel) + C(posse_de_imovel)+ qtd_filhos + idade + tempo_emprego'
    md = smf.ols(modelo, data = dados_teste)
    lasso2 = md.fit_regularized(method = 'elastic_net' 
                             , refit = True
                             , L1_wt = 1
                             , alpha = alpha)
    print(reg.summary())

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     191.4
Date:                Sat, 18 Mar 2023   Prob (F-statistic):          3.72e-243
Time:                        17:22:45   Log-Likelihood:                -38545.
No. Observations:                3750   AIC:                         7.711e+04
Df Residuals:                    3743   BIC:                         7.716e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

                            OLS Regression Results                            
Dep. Variable:                  renda   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     191.4
Date:                Sat, 18 Mar 2023   Prob (F-statistic):          3.72e-243
Time:                        17:22:46   Log-Likelihood:                -38545.
No. Observations:                3750   AIC:                         7.711e+04
Df Residuals:                    3743   BIC:                         7.716e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

7. Ajuste uma árvore de regressão e veja se consegue um $R^2$ melhor com ela.

In [230]:
dados = df.copy()
y_dados = dados.renda
x_dados = dados.drop(['Unnamed: 0', 'data_ref', 'id_cliente', 'renda'], axis = 1)

x_dados = pd.get_dummies(dados)
x_dados['posse_de_veiculo'] = x_dados['posse_de_veiculo'].map(zero_um)
x_dados.posse_de_imovel = x_dados.posse_de_imovel.map(zero_um)

X_train, X_test, y_train, y_test = train_test_split(x_dados, y_dados, test_size=0.25, random_state=42)

In [233]:
from sklearn.tree import DecisionTreeRegressor
regr_1 = DecisionTreeRegressor()
regr_1.fit(X_train, y_train)
prev1 = regr_1.predict(X_test)
r2_score(y_test,prev1)

0.9989871380237538