# House Prices - Prediction

After carrying out a detailed exploratory data analysis, I filtered the data as much as possible, always aiming to retain as much useful information as I could.

Now, let's run a battery of prediction models on the filtered data.

## Initialization

Loading the libraries needed for this report, along with the dataset.

In [1]:
# libraries
import random as rd

import numpy as np
import pandas as pd
import sklearn.metrics as metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# dataset
dados = pd.read_csv("./dados_tratados.csv")

Checking the integrity of the data

In [2]:
dados

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,1,0,Reg,Lvl,AllPub,Inside,...,0,0,0.0,0,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,1,0,Reg,Lvl,AllPub,FR2,...,0,0,0.0,0,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,1,0,IR1,Lvl,AllPub,Inside,...,0,0,0.0,0,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,1,0,IR1,Lvl,AllPub,Corner,...,0,0,0.0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,1,0,IR1,Lvl,AllPub,FR2,...,0,0,0.0,0,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1442,60,RL,62.0,7917,1,0,Reg,Lvl,AllPub,Inside,...,0,0,0.0,0,0,8,2007,WD,Normal,175000
1443,20,RL,85.0,13175,1,0,Reg,Lvl,AllPub,Inside,...,0,0,0.0,0,0,2,2010,WD,Normal,210000
1444,70,RL,66.0,9042,1,0,Reg,Lvl,AllPub,Inside,...,0,0,0.0,1,2500,5,2010,WD,Normal,266500
1445,20,RL,68.0,9717,1,0,Reg,Lvl,AllPub,Inside,...,0,0,0.0,0,0,4,2010,WD,Normal,142125


### Splitting the data into train and test sets

The data at hand already come from a train/test split. Even so, we split them again to assess the performance of the models under test. We also keep only the covariates that directly affect the response variable; these were defined in the exploratory analysis and data treatment report.

In [3]:
# fixing a seed for reproducibility of the tests
# (note: this seeds Python's `random` module only; the split further down
# is controlled by the `random_state` argument of train_test_split)
rd.seed(10)

# only 8 columns of interest hold numeric values, plus the response variable
numerics = ['OverallQual',
            'GrLivArea',
            'ExterQual',
            'KitchenQual',
            'GarageArea',
            'GarageCars',
            'TotalBsmtSF',
            '1stFlrSF',
            'SalePrice']

# placing these columns in a new dataframe
novo_dados = dados[numerics]

Let's now dummy-encode the categorical variables so that they can be used properly in a regression model.

In [4]:
# selecting the columns whose values are all strings (categorical variables)
# (note: `applymap` is deprecated since pandas 2.1 in favor of `DataFrame.map`)
idx = (dados.applymap(type) == str).all(0)

# placing the categorical variables in an auxiliary dataframe
df_new = dados[dados.columns[idx]]
df_new = pd.get_dummies(df_new)

# joining the dummy-encoded variables to our numeric dataframe
novo_dados = novo_dados.join(df_new)
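
As a small illustration of what the dummy encoding above produces, here is a minimal sketch on a made-up three-row dataframe. The toy values and the names `toy`, `categorical`, `dummies` and `encoded` are assumptions for the example, and `select_dtypes` is used here as an alternative way of picking the non-numeric columns, not the selection used in the cell above.

```python
import pandas as pd

# hypothetical toy data, not taken from dados_tratados.csv
toy = pd.DataFrame({
    "OverallQual": [7, 6, 7],
    "MSZoning": ["RL", "RM", "RL"],
    "SaleCondition": ["Normal", "Normal", "Abnorml"],
})

# pick the non-numeric (categorical) columns
categorical = toy.select_dtypes(exclude="number")

# one indicator column per (variable, category) pair
dummies = pd.get_dummies(categorical)

# join the indicators back onto the numeric part
encoded = toy[["OverallQual"]].join(dummies)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each category becomes its own indicator column named `<variable>_<category>`. For a linear model with an intercept, `pd.get_dummies(..., drop_first=True)` is one option for avoiding perfectly collinear indicator columns; whether it helps here would need to be checked against the results.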

In [5]:
novo_dados

Unnamed: 0,OverallQual,GrLivArea,ExterQual,KitchenQual,GarageArea,GarageCars,TotalBsmtSF,1stFlrSF,SalePrice,MSZoning_C (all),...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,7,1710,4,4,548,2,856,856,208500,0,...,0,0,0,1,0,0,0,0,1,0
1,6,1262,3,3,460,2,1262,1262,181500,0,...,0,0,0,1,0,0,0,0,1,0
2,7,1786,4,4,608,2,920,920,223500,0,...,0,0,0,1,0,0,0,0,1,0
3,7,1717,3,4,642,3,756,961,140000,0,...,0,0,0,1,1,0,0,0,0,0
4,8,2198,4,4,836,3,1145,1145,250000,0,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1442,6,1647,3,3,460,2,953,953,175000,0,...,0,0,0,1,0,0,0,0,1,0
1443,6,2073,3,3,500,2,1542,2073,210000,0,...,0,0,0,1,0,0,0,0,1,0
1444,7,2340,5,4,252,1,1152,1188,266500,0,...,0,0,0,1,0,0,0,0,1,0
1445,5,1078,3,4,240,1,1078,1078,142125,0,...,0,0,0,1,0,0,0,0,1,0


In [18]:
# selecting the columns with the covariates
X = novo_dados.loc[:, novo_dados.columns != 'SalePrice']

# selecting the column with the response variable
y = novo_dados.loc[:, novo_dados.columns == 'SalePrice']

# splitting the data into train and test sets
# (note: test_size=0.8 holds out 80% of the rows for testing and trains on
# only 20%; the metrics reported below were obtained with this setting)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=42)
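
To make the effect of `test_size` and `random_state` concrete, the sketch below splits a made-up array of 10 rows in two ways. The arrays and variable names are assumptions for the illustration, not part of the report's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.8 keeps 80% of the rows for testing (as in the cell above)
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_toy, y_toy, test_size=0.8, random_state=42)

# test_size=0.2 is the more common choice: 80% train, 20% test
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

print(len(X_tr_a), len(X_te_a))
print(len(X_tr_b), len(X_te_b))

# the same random_state reproduces exactly the same split
_, _, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.8, random_state=42)
same_split = np.array_equal(y_te_a, y_te_again)
print(same_split)
```

Reproducibility of the split comes from `random_state`, not from seeding Python's `random` module. A model trained on a smaller share of the rows may generalize differently, so the two settings are not expected to give the same metrics.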

In [19]:
modelo1 = LinearRegression()
modelo1.fit(X_train, y_train)

y_pred = modelo1.predict(X_test)

# function that evaluates the model's performance
def regression_results(y_true, y_pred):

    # Regression metrics
    explained_variance=metrics.explained_variance_score(y_true, y_pred)
    mean_absolute_error=metrics.mean_absolute_error(y_true, y_pred) 
    mse = metrics.mean_squared_error(y_true, y_pred) 
    median_absolute_error=metrics.median_absolute_error(y_true, y_pred)
    r2=metrics.r2_score(y_true, y_pred)

    print('explained_variance: ', round(explained_variance,4))    
    print('r2: ', round(r2,4))
    print('MAE: ', round(mean_absolute_error,4))
    print('MSE: ', round(mse,4))
    print('RMSE: ', round(np.sqrt(mse),4))
    
regression_results(y_test, y_pred)

explained_variance:  0.7278
r2:  0.7276
MAE:  28694.728
MSE:  1615303414.2665
RMSE:  40190.8374
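
To help read these numbers, the sketch below computes MAE, MSE, RMSE and R² by hand with NumPy on four made-up prices (illustrative values, not taken from the dataset), and then checks that the RMSE reported above is the square root of the reported MSE.

```python
import numpy as np

# hypothetical true and predicted prices, for illustration only
y_true = np.array([200000.0, 180000.0, 220000.0, 140000.0])
y_hat = np.array([210000.0, 175000.0, 215000.0, 150000.0])

residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, in price units

# R² = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE: ", round(mae, 4))
print("MSE: ", round(mse, 4))
print("RMSE:", round(rmse, 4))
print("r2:  ", round(r2, 4))

# consistency check on the values printed in the report
reported_rmse = np.sqrt(1615303414.2665)
print(round(reported_rmse, 4))
```

Because RMSE and MAE are expressed in the same units as `SalePrice`, they are usually easier to interpret than MSE: the reported RMSE corresponds to a typical error of roughly 40,000 in price units.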
