<a href="https://colab.research.google.com/github/audrey-siqueira/Desafio_Keycash/blob/main/Desafio2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing PyCaret

This project is built on the PyCaret library, which is easy to use and gets us quickly to our goal: pricing real estate.

In [None]:
# install the library
!pip install pycaret -q

In [162]:
# configure PyCaret for Colab
from pycaret.utils import enable_colab
enable_colab()

Colab mode enabled.


## Loading the data

The .csv file containing the data is imported.

The column headers and the first rows of the dataset are shown in the table below:

In [163]:
# load the data (the file is semicolon-separated)
import pandas as pd

dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Keycash/Desafio2_input.csv', sep=';')
dataset.head()

Unnamed: 0,Id,Zona,Area,Qualidade,AnoConstrucao,QualidadeAquecimento,Banheiros,Quartos_t1,Quartos_t2,Comodos,Lareiras,Garagem,Preco
0,1,RL,9600,6,1976,Ex,2,3,4,6,1,2,181500
1,2,RL,14115,5,1993,Ex,1,1,2,5,0,2,143000
2,3,RL,11200,5,1965,Ex,1,3,4,5,0,1,129500
3,4,RL,12968,5,1962,TA,1,2,3,4,0,1,144000
4,5,RL,10920,6,1960,TA,1,2,3,5,1,1,157000
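The `sep=';'` argument matters because the file uses semicolons rather than commas as delimiters. A minimal sketch of the difference, using two rows copied from the table above and an in-memory string instead of the real file (the `raw` text and the `parsed` and `wrong` names are illustrative assumptions):

```python
import io

import pandas as pd

# Two rows in the same layout as Desafio2_input.csv (semicolon-separated).
raw = (
    "Id;Zona;Area;Qualidade;AnoConstrucao;QualidadeAquecimento;Banheiros;"
    "Quartos_t1;Quartos_t2;Comodos;Lareiras;Garagem;Preco\n"
    "1;RL;9600;6;1976;Ex;2;3;4;6;1;2;181500\n"
    "2;RL;14115;5;1993;Ex;1;1;2;5;0;2;143000\n"
)

# With the right separator every field becomes its own column.
parsed = pd.read_csv(io.StringIO(raw), sep=";")

# With the default comma separator each line collapses into a single column.
wrong = pd.read_csv(io.StringIO(raw))

print(parsed.shape)
print(wrong.shape)
```

If `dataset.shape` ever reports a single column, a wrong separator is the first thing to check.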


In [164]:
# check the number of rows and columns in the dataset
dataset.shape

(629, 13)

## Splitting the data

To demonstrate the predict_model() function on unseen data, a sample of 63 records (10% of the dataset) was held back from the original dataset to be used for predictions.

This should not be confused with a train/test split: this particular split is made to simulate a real-life scenario.

Another way to think about it is that these 63 records were not available when the machine learning experiment was run.

In [165]:
# keep 90% of the rows for modeling and hold back the remaining 10% as unseen data
data = dataset.sample(frac=0.9)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Data for Modeling: (566, 13)
Unseen Data For Predictions: (63, 13)
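The split above draws a different random sample on every run, so the 63 held-back records change each time the notebook is executed. A minimal sketch of a reproducible variant follows. The `random_state=42` seed is our assumption (the original cell does not set one), and the two-column DataFrame is a placeholder standing in for the real 629-row dataset:

```python
import pandas as pd

# Placeholder with the same number of rows as the real dataset (629).
dataset = pd.DataFrame({"Id": range(1, 630), "Preco": range(100000, 100629)})

# Fixing random_state makes the 90/10 split repeatable across runs.
data = dataset.sample(frac=0.9, random_state=42)
data_unseen = dataset.drop(data.index)

data = data.reset_index(drop=True)
data_unseen = data_unseen.reset_index(drop=True)

print("Data for Modeling: " + str(data.shape))
print("Unseen Data For Predictions: " + str(data_unseen.shape))
```

The row counts match the original output (566 and 63), and no record appears in both parts.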


## Setting up the data in PyCaret

We import the packages needed to use PyCaret for regression.

In [166]:
# import the required packages
from pycaret.regression import *

Since our goal is to price properties, the Preco (price) column is our dependent variable.

The Id column is excluded because it carries no information about the price.

The remaining columns are used as independent variables for training, giving the final dataset.

PyCaret takes care of splitting the variables into numeric and categorical, imputing missing values, and recording various other properties of the dataset, as listed below:

In [167]:
reg = setup(data=data, target='Preco', ignore_features=['Id'])

Unnamed: 0,Description,Value
0,session_id,4302
1,Target,Preco
2,Original Data,"(566, 13)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,9
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(396, 29)"


## Comparing models to choose the best one

PyCaret ships with roughly 20 different regression algorithms.

PyCaret evaluates each model on a set of metrics, ranging from R-squared (R2) to Mean Absolute Error (MAE), and returns a table ranked by how well each model is likely to suit our project, as shown below.


In [168]:
# compare all available models, ranked by R2
best = compare_models(sort='R2')

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
ridge,Ridge Regression,11425.2,219550700.0,14652.58,0.6925,0.1043,0.0814,0.013
llar,Lasso Least Angle Regression,11478.03,220835900.0,14705.43,0.6903,0.1046,0.0818,0.018
lasso,Lasso Regression,11479.11,220996800.0,14710.06,0.69,0.1047,0.0818,0.019
lr,Linear Regression,11481.31,221120400.0,14714.49,0.6898,0.1047,0.0818,0.013
rf,Random Forest Regressor,11634.52,236265000.0,15226.7,0.6676,0.1098,0.084,0.486
catboost,CatBoost Regressor,11752.97,240912800.0,15365.51,0.6619,0.1095,0.0842,1.041
gbr,Gradient Boosting Regressor,11730.98,245149600.0,15495.6,0.6545,0.1105,0.0843,0.069
lightgbm,Light Gradient Boosting Machine,12054.49,245969500.0,15541.54,0.652,0.1105,0.0862,0.036
ada,AdaBoost Regressor,12968.54,278107600.0,16543.77,0.6162,0.1178,0.0919,0.089
en,Elastic Net,13723.6,295299600.0,17024.58,0.5939,0.1223,0.0984,0.015
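For reference, the metric columns in the table can be computed as follows. This is a plain-Python sketch of the standard definitions applied to made-up numbers, not PyCaret's implementation. RMSLE is written with `log1p`, a common convention that we assume here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return the usual regression metrics for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    sq_errors = [e * e for e in errors]
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    log_sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sum(sq_errors) / n,
        "RMSE": math.sqrt(sum(sq_errors) / n),
        "R2": 1 - sum(sq_errors) / ss_tot,
        "RMSLE": math.sqrt(sum(log_sq) / n),
        "MAPE": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


# Hypothetical prices and predictions, chosen for easy arithmetic.
metrics = regression_metrics([100, 200, 300], [110, 190, 330])
for name, value in metrics.items():
    print(name, round(value, 4))
```

MAE, RMSE and MAPE are errors (lower is better), while R2 is a goodness-of-fit score (higher is better), which is why the table is sorted by R2 in descending order.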


## Creating the chosen model

The regression algorithm with the best R-squared was Ridge Regression.

The function below trains the chosen model and evaluates its performance.

The output shows how the model behaves on each cross-validation fold.

In [186]:
# train the Ridge Regression model
modelo = create_model('ridge')

Unnamed: 0,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,9433.6387,164071100.0,12809.0225,0.7402,0.0901,0.0665
1,10067.2715,186414400.0,13653.3652,0.817,0.1087,0.0782
2,8944.5986,130331800.0,11416.2939,0.7356,0.076,0.0591
3,10470.2969,156729200.0,12519.1514,0.7806,0.0887,0.0734
4,11298.667,215836100.0,14691.3623,0.737,0.1037,0.0836
5,13446.3096,269379100.0,16412.7734,0.5744,0.1129,0.0918
6,13394.6924,363901900.0,19076.2109,0.4384,0.142,0.1071
7,11613.4277,235576200.0,15348.4902,0.7737,0.115,0.085
8,10805.9971,187701100.0,13700.4062,0.663,0.0902,0.071
9,14777.0977,285566200.0,16898.7031,0.6651,0.1155,0.0986
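Each row above is the score on one of the ten cross-validation folds. As a quick check, averaging the R2 column reproduces the 0.6925 reported for Ridge Regression in the comparison table. The values are copied from the output above; the variable names are ours.

```python
import statistics

# R2 per fold, copied from the create_model('ridge') output.
fold_r2 = [0.7402, 0.817, 0.7356, 0.7806, 0.737, 0.5744, 0.4384, 0.7737, 0.663, 0.6651]

mean_r2 = statistics.mean(fold_r2)
spread_r2 = statistics.stdev(fold_r2)

print("mean R2:", round(mean_r2, 4))
print("std of R2 across folds:", round(spread_r2, 4))
print("worst fold:", min(fold_r2), "best fold:", max(fold_r2))
```

The gap between the worst fold (0.4384) and the best (0.817) is a reminder that, with only a few hundred rows, a single train/test split can give a very different picture from the cross-validated mean.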


The parameter values of the algorithm chosen for training are shown below:

In [187]:
# inspect the parameters
print(modelo)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=4302, solver='auto', tol=0.001)


## Optimizing the model with hyperparameter tuning

PyCaret also makes hyperparameter tuning very easy. Just call `tune_model` with the model to be optimized; the metric to optimize can be chosen through its `optimize` parameter.

In [188]:
# hyperparameter tuning
tuned_modelo = tune_model(modelo)

Unnamed: 0,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,9316.0215,160215900.0,12657.6406,0.7463,0.0886,0.0656
1,10196.1094,188893600.0,13743.8584,0.8146,0.1095,0.0792
2,8771.3125,127630700.0,11297.375,0.7411,0.0755,0.0581
3,10418.1982,155427000.0,12467.0352,0.7825,0.0883,0.073
4,11157.6777,214345400.0,14640.5391,0.7388,0.1036,0.0829
5,13420.5293,269114800.0,16404.7188,0.5748,0.1128,0.0916
6,13337.373,361444300.0,19011.6895,0.4422,0.1416,0.1066
7,11585.4219,234110500.0,15300.6709,0.7751,0.1149,0.0851
8,10846.4004,189748500.0,13774.9248,0.6593,0.0905,0.0711
9,14919.5674,291196100.0,17064.4688,0.6585,0.1162,0.0994


The parameter values after tuning are shown below:

In [189]:
# inspect the parameters
print(tuned_modelo)

Ridge(alpha=2.09, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=4302, solver='auto', tol=0.001)
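The only parameter that changed is `alpha`, which went from 1.0 to 2.09. In ridge regression `alpha` is the strength of the L2 penalty: larger values shrink the coefficients more. The sketch below shows this with the closed-form solution on synthetic data. It is a simplified illustration that omits the intercept (the model above uses `fit_intercept=True`), and the data, seed and function name are assumptions.

```python
import numpy as np


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data: 50 samples, 3 features, known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

norms = {}
for alpha in (0.0, 1.0, 2.09):
    w = ridge_coefficients(X, y, alpha)
    norms[alpha] = float(np.linalg.norm(w))
    print(alpha, np.round(w, 3), round(norms[alpha], 3))
```

With `alpha=0.0` the result is ordinary least squares; as `alpha` grows, the norm of the coefficient vector decreases.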


## Plotting the model

Graphical views of the model for different plot types.

In [190]:
# evaluate the model
evaluate_model(tuned_modelo)


## Predicting on the test set with the tuned model

Now that we have created, trained, and evaluated our model, it is time to make predictions on our hold-out test set.

In [191]:
# make predictions on the hold-out test set
predict_model(tuned_modelo)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Ridge Regression,11830.207031,241328928.0,15534.766602,0.7188,0.1136,0.0876


Unnamed: 0,Area,AnoConstrucao,Zona_RL,Qualidade_4,Qualidade_5,Qualidade_6,Qualidade_7,Qualidade_8,QualidadeAquecimento_Ex,QualidadeAquecimento_Fa,QualidadeAquecimento_Gd,QualidadeAquecimento_TA,Banheiros_1,Quartos_t2_2,Quartos_t2_3,Quartos_t2_4,Comodos_3,Comodos_4,Comodos_5,Comodos_6,Comodos_7,Comodos_8,Lareiras_0,Lareiras_1,Lareiras_2,Garagem_0,Garagem_1,Garagem_2,Garagem_3,Preco,Label
0,3072.0,2004.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,178740.0,165518.7500
1,3922.0,2006.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,172500.0,177387.3125
2,1680.0,1971.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,91500.0,101301.0000
3,8461.0,2005.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,163990.0,176402.3125
4,15578.0,2006.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,172785.0,186536.0625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,12886.0,1963.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,175000.0,164954.1250
166,11310.0,1954.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,140000.0,151135.9375
167,14000.0,1950.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,158500.0,148397.6875
168,8335.0,1954.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,93000.0,116397.1250


## Finalizing the model

Finalizing the model is the last step of the experiment.

This workflow ultimately leads us to the best model for making predictions on new, unseen data.

The finalize_model() function fits the model on the complete dataset, including the hold-out test sample (30% in this case).

Once the model has been finalized with finalize_model(), the entire dataset, including the test set, has been used for training.

As a result, if the finalized model is used to predict on the test data, the reported metrics can be misleading, because you are predicting the same data that was used for modeling.

Just to demonstrate this point, we will use the unseen data set aside at the start of the tutorial to test the finalized model.

In [192]:
# finalize the model
final_modelo = finalize_model(tuned_modelo)
print(final_modelo)

Ridge(alpha=2.09, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=4302, solver='auto', tol=0.001)


The example below can be misleading.

Validating the finalized model on the test data, which it was also trained on, gives an overly optimistic picture. For that reason, in the next section we use the unseen data to validate the finalized model.

In [193]:
# predict on the test set with the finalized model
predict_model(final_modelo)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Ridge Regression,11226.986328,221727344.0,14890.511719,0.7417,0.1085,0.0822


Unnamed: 0,Area,AnoConstrucao,Zona_RL,Qualidade_4,Qualidade_5,Qualidade_6,Qualidade_7,Qualidade_8,QualidadeAquecimento_Ex,QualidadeAquecimento_Fa,QualidadeAquecimento_Gd,QualidadeAquecimento_TA,Banheiros_1,Quartos_t2_2,Quartos_t2_3,Quartos_t2_4,Comodos_3,Comodos_4,Comodos_5,Comodos_6,Comodos_7,Comodos_8,Lareiras_0,Lareiras_1,Lareiras_2,Garagem_0,Garagem_1,Garagem_2,Garagem_3,Preco,Label
0,3072.0,2004.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,178740.0,166173.9375
1,3922.0,2006.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,172500.0,175883.0000
2,1680.0,1971.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,91500.0,100524.5625
3,8461.0,2005.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,163990.0,176954.7500
4,15578.0,2006.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,172785.0,183849.1875
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,12886.0,1963.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,175000.0,164873.4375
166,11310.0,1954.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,140000.0,151441.9375
167,14000.0,1950.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,158500.0,149128.8750
168,8335.0,1954.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,93000.0,116397.9375
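The R2 on the test set rose from 0.7188 before finalization to 0.7417 after it, simply because the finalized model has already seen those rows. The effect is mild for a regularized linear model. The deliberately extreme toy below makes it obvious: a "model" that memorizes its training data scores perfectly on any row it was trained on. The class, the numbers and the names are all hypothetical and have nothing to do with how Ridge works internally.

```python
class MemorizingModel:
    """Toy model: returns the stored target for seen inputs, the training mean otherwise."""

    def fit(self, X, y):
        self.lookup = dict(zip(X, y))
        self.fallback = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.lookup.get(x, self.fallback) for x in X]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


train_X, train_y = [1, 2, 3, 4], [10.0, 20.0, 30.0, 40.0]
test_X, test_y = [5, 6], [55.0, 58.0]

# Honest evaluation: the model never saw the test rows.
honest = MemorizingModel().fit(train_X, train_y)
honest_mae = mean_absolute_error(test_y, honest.predict(test_X))

# "Finalized" model: refit on train + test, then scored on the same test rows.
finalized = MemorizingModel().fit(train_X + test_X, train_y + test_y)
leaky_mae = mean_absolute_error(test_y, finalized.predict(test_X))

print("MAE on unseen test rows:", honest_mae)
print("MAE after refitting on the test rows:", leaky_mae)
```

The second number says nothing about how the model behaves on new data, which is why a separate unseen sample is needed once the model has been finalized.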


## Predicting on the unseen data with the finalized model

The predict_model() function is also used to predict on the unseen dataset.

data_unseen is the variable created at the start of the tutorial; it contains the 10% of the original dataset that was never exposed to PyCaret.

In [194]:
unseen_predictions = predict_model(final_modelo, data=data_unseen)
unseen_predictions.head()

Unnamed: 0,Id,Zona,Area,Qualidade,AnoConstrucao,QualidadeAquecimento,Banheiros,Quartos_t1,Quartos_t2,Comodos,Lareiras,Garagem,Preco,Label
0,11,RL,7200,5,1951,TA,1,3,4,5,0,2,134800,124606.1875
1,15,RL,8532,5,1954,Gd,1,3,4,5,1,2,153000,139627.875
2,16,RL,7922,5,1953,TA,1,3,4,5,0,1,109000,116721.5625
3,23,RL,13869,6,1997,Gd,2,3,4,6,0,2,177000,178458.4375
4,28,RL,13072,6,2004,Ex,1,3,4,5,0,2,158000,163585.875


In [195]:
from pycaret.utils import check_metric
check_metric(unseen_predictions.Preco, unseen_predictions.Label, 'R2')

0.6457
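An R2 of 0.6457 summarizes all 63 unseen rows in a single number. To get a feel for individual predictions, the sketch below computes the absolute percentage error for the five rows displayed by `unseen_predictions.head()`. The values are copied from that output; five rows are not representative of the whole sample, and the variable names are ours.

```python
# (Id, actual Preco, predicted Label) copied from the head() output above.
rows = [
    (11, 134800, 124606.1875),
    (15, 153000, 139627.875),
    (16, 109000, 116721.5625),
    (23, 177000, 178458.4375),
    (28, 158000, 163585.875),
]

# Absolute percentage error per property.
pct_errors = {row_id: abs(actual - predicted) / actual for row_id, actual, predicted in rows}

for row_id, err in pct_errors.items():
    print(row_id, f"{err:.1%}")

worst_id = max(pct_errors, key=pct_errors.get)
print("largest error among these rows: Id", worst_id)
```

All five predictions fall within 10% of the actual price, with the largest miss on the property with Id 15.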

## Saving the model

With the experimentation phase finished, PyCaret also makes deployment easier: saving the model for later use is straightforward.

To do so, we use the `save_model` function, passing the model and the name of the file to save.

In [196]:
# save the model
save_model(final_modelo, "Modelo (Precificando Imóveis")

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True,
                                       features_todrop=['Id'], id_columns=[],
                                       ml_usecase='regression',
                                       numerical_features=[], target='Preco',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('dummy', Dummify(target='Preco')),
                 ('fix_perfect', Remove_100(target='Preco')),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs

## Loading a model

To load a previously saved model in PyCaret we use the `load_model` function, passing the name of the model file.

In [197]:
# load the model
saved_final_modelo = load_model('Modelo (Precificando Imóveis')

Transformation Pipeline and Model Successfully Loaded


With the model loaded, making new predictions is easy, just as we did after finalizing the model.

In [198]:
# make predictions
new_prediction = predict_model(saved_final_modelo, data=data_unseen)
new_prediction.head()

Unnamed: 0,Id,Zona,Area,Qualidade,AnoConstrucao,QualidadeAquecimento,Banheiros,Quartos_t1,Quartos_t2,Comodos,Lareiras,Garagem,Preco,Label
0,11,RL,7200,5,1951,TA,1,3,4,5,0,2,134800,124606.1875
1,15,RL,8532,5,1954,Gd,1,3,4,5,1,2,153000,139627.875
2,16,RL,7922,5,1953,TA,1,3,4,5,0,1,109000,116721.5625
3,23,RL,13869,6,1997,Gd,2,3,4,6,0,2,177000,178458.4375
4,28,RL,13072,6,2004,Ex,1,3,4,5,0,2,158000,163585.875


Note that the results of unseen_predictions and new_prediction are identical.

In [199]:
# evaluate the model
from pycaret.utils import check_metric
check_metric(new_prediction.Preco, new_prediction.Label, 'R2')

0.6457
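The identical score is what we expect from a save and load round trip: serialization stores the fitted parameters, so the restored object makes the same predictions. The sketch below shows the idea with the standard-library `pickle` module on a toy parameter dictionary. It is not PyCaret's implementation, and the coefficients, file name and `predict` helper are hypothetical.

```python
import os
import pickle
import tempfile

# Toy "fitted model": an intercept plus one coefficient per feature (made-up values).
toy_model = {"intercept": 50000.0, "coef": {"Area": 2.5, "Garagem": 10000.0}}


def predict(model, row):
    """Linear prediction from the stored parameters."""
    return model["intercept"] + sum(model["coef"][name] * row[name] for name in model["coef"])


row = {"Area": 9600, "Garagem": 2}

# Write the model to disk and read it back from a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

before = predict(toy_model, row)
after = predict(restored, row)
print(before, after)
```

As with any pickle-based format, a saved model should only be loaded from a trusted source.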