## Estimating house 🏠 prices 💸 in Iowa's (USA) Ames residential area
#### AutoML 🏎️💨 approach

#### Goal:
to estimate house prices using the AutoML regression approach

#### Data:
Ames house features and prices, originally prepared by Dean De Cock, later modified by Kaggle. Data are accessible at: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

#### Processing steps:
1. Downloading the source data using Kaggle API and unzipping it 
2. Loading the training part into Pandas 
3. Performing AutoML
4. Analyzing the results
5. Evaluating the final model on test data

#### Sources: 
[1] Kaggle API usage advice: https://stackoverflow.com/questions/55934733/documentation-for-kaggle-api-within-python <br>
[2] AutoML mljar-supervised. API documentation. fit() https://supervised.mljar.com/api/#supervised.automl.AutoML.fit <br>
[3] AutoML mljar-supervised. Features documentation. https://supervised.mljar.com/features/automl/<br>

### Step 0: Importing packages

In [15]:
import copy
import os
from math import sqrt
import zipfile

from kaggle.api.kaggle_api_extended import KaggleApi
import pandas as pd
from sklearn.metrics import mean_squared_error
from supervised.automl import AutoML

### Step 1: Downloading the source data using Kaggle API and unzipping it

As the first step, we need to create a Kaggle account and download the Kaggle API token following the instructions in [1]. Then we are ready to download the source data using the API:

In [2]:
api = KaggleApi()

In [3]:
api.authenticate()
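`authenticate()` needs API credentials to be available. As far as I know, the Kaggle client looks for them in the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables or in a `kaggle.json` file under `~/.kaggle`; the exact lookup rules may differ between client versions. The helper below, `find_kaggle_credentials`, is a hypothetical pre-flight check written for this notebook and is not part of the Kaggle package:

```python
import os


def find_kaggle_credentials(env=None, home=None):
    """Return where Kaggle credentials appear to be available, or None.

    Hypothetical helper: it only mirrors the two common locations
    (environment variables, ~/.kaggle/kaggle.json) and is an assumption
    about where the client looks, not the client's own logic.
    """
    env = os.environ if env is None else env
    home = os.path.expanduser("~") if home is None else home

    # Option 1: both environment variables are set and non-empty.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment"

    # Option 2: the token file downloaded from the Kaggle account page.
    token_path = os.path.join(home, ".kaggle", "kaggle.json")
    if os.path.isfile(token_path):
        return token_path

    return None


print(find_kaggle_credentials())
```

If the function prints `None`, the token has probably not been placed where the client expects it.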



In [4]:
competition_name = 'house-prices-advanced-regression-techniques'

In [5]:
api.competition_download_files(competition_name)

The downloaded *.zip* file can be unzipped with the code:

In [6]:
with zipfile.ZipFile(competition_name + '.zip', 'r') as zip_ref:
    zip_ref.extractall(competition_name)
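When the notebook is re-run, the archive gets extracted again each time. A minimal sketch of how this could be skipped is shown below; `extract_if_needed` is a hypothetical helper, and the demo uses a throwaway archive with made-up contents rather than the Kaggle download:

```python
import os
import tempfile
import zipfile


def extract_if_needed(zip_path, target_dir):
    """Extract zip_path into target_dir unless target_dir already has content.

    Returns the sorted names found in target_dir afterwards.
    """
    if not os.path.isdir(target_dir) or not os.listdir(target_dir):
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(target_dir)
    return sorted(os.listdir(target_dir))


# Toy demonstration: build a small archive in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("train.csv", "Id,SalePrice\n1,208500\n")
        zf.writestr("test.csv", "Id\n2\n")

    extracted = extract_if_needed(demo_zip, os.path.join(tmp, "demo"))
    print(extracted)
```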

### Step 2: Loading the training part into Pandas

Let's check what is included in the unzipped folder:

In [7]:
for entry in os.scandir(competition_name):
    if entry.is_file():
        print(entry.name)

sample_submission.csv
data_description.txt
test.csv
train.csv


In [8]:
train_df = pd.read_csv(os.path.join(competition_name, 'train.csv'))

In [9]:
train_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


### Step 3: Performing AutoML 

Training with MLjar's AutoML requires X (features) and y (target) as inputs, in the form of a Pandas DataFrame/Series or a NumPy array [2]:

In [10]:
X_train_df = train_df.drop(['Id', 'SalePrice'], axis=1)
y_train_df = train_df[['SalePrice']]

To start training, we first define an AutoML object with the parameters of our choice and then call its *fit()* method:

In [17]:
automl_explain = AutoML(mode="Explain", results_path="AutoML_Ames_house_prices_explain")

In [12]:
automl_explain.fit(X_train_df, y_train_df)

AutoML directory: AutoML_Ames_house_prices_explain
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
1_Baseline rmse 71466.688145 trained in 1.0 seconds
2_DecisionTree rmse 39466.587016 trained in 17.37 seconds
3_Linear rmse 42927.088618 trained in 8.37 seconds
* Step default_algorithms will try to check up to 3 models
4_Default_Xgboost rmse 25655.843849 trained in 10.97 seconds
5_Default_NeuralNetwork rmse 26341.418205 trained in 89.79 seconds
6_Default_RandomForest rmse 30556.950933 trained in 22.01 seconds
* Step ensemble will try to check up to 1 model
Ensemble rmse 22919.375107 trained in 0.41 seconds
AutoML fit time: 162.44 seconds
AutoML best model: Ensemble


AutoML(results_path='AutoML_Ames_house_prices_explain')

In the experiment above I set the *mode* parameter to *"Explain"* in order to explore the learning process in more detail. As the next step, I would like to focus on finding the most accurate approach to estimating house prices. I therefore select the *"Compete"* mode, as it evaluates more algorithms and performs more advanced feature engineering and fine-tuning. I also set *validation_strategy* to 5-fold cross-validation to see how consistent the models' results are, and *total_time_limit* to 86400 seconds (24 hours):

In [16]:
automl_compete = AutoML(mode="Compete", results_path="AutoML_Ames_house_prices_compete", 
                        total_time_limit=86400, 
                        algorithms=["Random Forest", 
                                    "Extra Trees", 
                                    "Xgboost", 
                                    "CatBoost", 
                                    "Neural Network"],
                        validation_strategy={
                            "validation_type": "kfold", 
                            "k_folds": 5, 
                            "shuffle": True,
                            "stratify": True,
                            "random_seed": 1234
                        }
)
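To illustrate what a shuffled 5-fold split means, here is a minimal pure-Python sketch. It is not mljar-supervised's internal splitting code and it ignores the `stratify` option; `kfold_indices` is a hypothetical helper and the sample count of 23 is a toy value:

```python
import random


def kfold_indices(n_samples, k_folds=5, shuffle=True, random_seed=1234):
    """Yield (train_idx, valid_idx) pairs for a plain k-fold split."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(random_seed).shuffle(indices)

    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // k_folds + (1 if i < n_samples % k_folds else 0)
        for i in range(k_folds)
    ]

    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(kfold_indices(n_samples=23))
for fold_no, (train_idx, valid_idx) in enumerate(folds):
    print(fold_no, len(train_idx), len(valid_idx))
```

Every sample lands in the validation part of exactly one fold, so each model is scored on all of the training data, one fifth at a time.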

In [14]:
automl_compete.fit(X_train_df, y_train_df)

AutoML directory: AutoML_Ames_house_prices_compete
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Random Forest', 'Extra Trees', 'Xgboost', 'CatBoost', 'Neural Network']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'mix_encoding', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
Skip simple_algorithms because no parameters were generated.
* Step default_algorithms will try to check up to 5 models
1_Default_Xgboost rmse 28423.938295 trained in 16.87 seconds
2_Default_CatBoost rmse 27074.351564 trained in 44.18 seconds
3_Default_NeuralNetwork rmse 40392.537148 trained in 473.92 seconds
4_Default_RandomForest rmse 34860.934331 trained in 12.29 seconds
5_Default_ExtraTrees rmse 36730.560323 trained in 10.46 seconds
* Step not_so_rand

AutoML(algorithms=['Random Forest', 'Extra Trees', 'Xgboost', 'CatBoost',
                   'Neural Network'],
       mode='Compete', results_path='AutoML_Ames_house_prices_compete',
       total_time_limit=86400,
       validation_strategy={'k_folds': 5, 'random_seed': 1234, 'shuffle': True,
                            'stratify': True, 'validation_type': 'kfold'})

The computation results are saved as a folder at '*results_path*', which was defined as a parameter of the AutoML object. <br>
The folder includes:
- subfolders for the tested algorithms, including the naive '*Baseline*'; in '*compete*' mode there are many more subfolders, because not only algorithms are compared but also hyperparameters (those related to the algorithm's design, found in '*framework.json*') and the feature selection/extraction approaches applied (denoted in the subfolder's name)
- '*.json*' files describing the input data characteristics ('*data.json*'), 'Features with lower importance than random_feature' [3] ('*drop_features.json*'), the characteristics of the newly created features ('*golden_features.json*'), and the parameters of the AutoML process as started ('*params.json*') and as completed ('*progress.json*')
- '*.png*' and '*.csv*' files with summary results that make it possible to compare the algorithms
- '*.npy*' files showing which samples were used for training and validation in each fold
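The layout described above can also be inspected programmatically. The sketch below uses a hypothetical helper, `summarize_results_dir`, and runs it on a mocked folder that borrows a few names from this notebook; a real results folder contains many more entries:

```python
import os
import tempfile


def summarize_results_dir(results_path):
    """Split the entries of results_path into subfolders and files grouped by extension."""
    subfolders, files_by_ext = [], {}
    for entry in os.scandir(results_path):
        if entry.is_dir():
            subfolders.append(entry.name)
        else:
            ext = os.path.splitext(entry.name)[1]
            files_by_ext.setdefault(ext, []).append(entry.name)
    return sorted(subfolders), {ext: sorted(names) for ext, names in files_by_ext.items()}


# Mocked layout for demonstration purposes only.
with tempfile.TemporaryDirectory() as tmp:
    for folder in ("1_Baseline", "4_Default_Xgboost", "Ensemble"):
        os.mkdir(os.path.join(tmp, folder))
    for name in ("leaderboard.csv", "ldb_performance.png", "params.json", "progress.json"):
        open(os.path.join(tmp, name), "w").close()

    subfolders, files_by_ext = summarize_results_dir(tmp)
    print(subfolders)
    print(files_by_ext)
```

On the real data, the same call could be pointed at `AutoML_Ames_house_prices_explain` or `AutoML_Ames_house_prices_compete`.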

<img src="image/mljar_explain_results.jpg" width="400"/> <img src="image/mljar_compete_results.jpg" width="400"/>

### Step 4: Analyzing the results

First, let's have a look at selected summary results: the files named '*ldb_performance.png*'. The Y-axis shows RMSE values, while the X-axis shows the consecutive models tested (the name '*#iteration*' may be a bit misleading, but it just denotes the order of the AutoML calculations). The plot on the left ('*explain*' mode) is a bit easier to interpret, as there are only 7 results to compare. We can observe that the '*Ensemble*' model (in this case a combination of Xgboost and Neural Network) works best. Apart from the naive Baseline, the Decision Tree and Linear models perform worst. The plot on the right ('*compete*' mode) covers such a wide range of RMSE values that the best-performing algorithms cannot be told apart:

<img src="AutoML_Ames_house_prices_explain/ldb_performance.png" width="500"/> <img src="AutoML_Ames_house_prices_compete/ldb_performance.png" width="500"/>

We can, however, check this information by looking into '*leaderboard.csv*':

In [18]:
lb_compete_df = pd.read_csv(os.path.join('AutoML_Ames_house_prices_compete', 'leaderboard.csv'))

In [19]:
lb_compete_df.sort_values(by='metric_value', inplace=True)

In [20]:
lb_compete_df.head()

Unnamed: 0,name,model_type,metric_type,metric_value,train_time
152,Ensemble_Stacked,Ensemble,rmse,24488.359141,87.89
102,Ensemble,Ensemble,rmse,24805.691622,40.36
84,72_CatBoost_GoldenFeatures_SelectedFeatures,CatBoost,rmse,25248.397843,124.72
58,15_CatBoost_GoldenFeatures_SelectedFeatures,CatBoost,rmse,25252.068649,133.33
106,41_ExtraTrees_SelectedFeatures_Stacked,Extra Trees,rmse,25286.78969,19.9
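With a leaderboard this long, it can help to keep only the best entry per model type. A minimal sketch follows, using the five rows shown above typed in by hand; on the real data the same two calls could be applied to `lb_compete_df`:

```python
import pandas as pd

# The five leaderboard rows shown above, entered manually for illustration.
leaderboard = pd.DataFrame({
    "name": [
        "Ensemble_Stacked",
        "Ensemble",
        "72_CatBoost_GoldenFeatures_SelectedFeatures",
        "15_CatBoost_GoldenFeatures_SelectedFeatures",
        "41_ExtraTrees_SelectedFeatures_Stacked",
    ],
    "model_type": ["Ensemble", "Ensemble", "CatBoost", "CatBoost", "Extra Trees"],
    "metric_value": [24488.359141, 24805.691622, 25248.397843, 25252.068649, 25286.78969],
})

# Sort by RMSE (lower is better), then keep the first row seen for each model type.
best_per_type = (
    leaderboard.sort_values("metric_value")
    .drop_duplicates("model_type")
    .reset_index(drop=True)
)
print(best_per_type)
```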


We can see that, similarly to the '*explain*' mode's results, the results obtained for '*compete*' point to an ensemble model as the best one. This time, however, it is built from 8 elements, not 2. To check the absolute difference between the best '*explain*' and '*compete*' validation results, we can do a quick calculation:

In [21]:
lb_explain_df = pd.read_csv(os.path.join('AutoML_Ames_house_prices_explain', 'leaderboard.csv'))

In [22]:
lb_explain_df.sort_values(by='metric_value', inplace=True)

In [23]:
abs_diff = abs(lb_compete_df.iloc[0]['metric_value'] - lb_explain_df.iloc[0]['metric_value'])
round(abs_diff, 2)

1568.98

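The difference is not huge compared with absolute sale prices, but it is probably significant for '*compete*' mode users taking part in Kaggle challenges. Note that the best '*compete*' RMSE (24488.36) is actually higher than the best '*explain*' RMSE (22919.38). The two runs were validated differently (5-fold cross-validation for '*compete*', the mode's default scheme for '*explain*'), so the two values are not directly comparable.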
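To put the gap of 1568.98 in proportion, it can also be expressed relative to the RMSE itself. The values below are copied from the training log and the leaderboard shown earlier; the percentage is only an illustrative reading of them:

```python
# Best validation RMSE values copied from the outputs above.
rmse_explain = 22919.375107   # Ensemble, 'explain' mode
rmse_compete = 24488.359141   # Ensemble_Stacked, 'compete' mode

abs_diff = abs(rmse_compete - rmse_explain)
rel_diff_pct = 100 * abs_diff / rmse_explain

print(round(abs_diff, 2))      # absolute gap in RMSE units
print(round(rel_diff_pct, 2))  # gap as a percentage of the 'explain' RMSE
```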

MLjar also provides useful plots for a detailed analysis of the results obtained for each algorithm. We can view the learning curves (on the left CatBoost, on the right Random Forest):

<img src="AutoML_Ames_house_prices_compete/15_CatBoost/learning_curves.png" width="500"/> <img src="AutoML_Ames_house_prices_compete/31_RandomForest/learning_curves.png" width="500"/> 

Residuals (for Random Forest visibly more dispersed):

<img src="AutoML_Ames_house_prices_compete/15_CatBoost/predicted_vs_residuals.png" width="500"/> <img src="AutoML_Ames_house_prices_compete/31_RandomForest/predicted_vs_residuals.png" width="500"/> 

Relationship between target and predicted values (for Random Forest, predicted values between 10.000 and 20.000 seem to be lower than the true ones):

<img src="AutoML_Ames_house_prices_compete/15_CatBoost/true_vs_predicted.png" width="500"/> <img src="AutoML_Ames_house_prices_compete/31_RandomForest/true_vs_predicted.png" width="500"/> 

The '*explain*' mode also offers feature importance plots (here permutation-based, for the Xgboost algorithm). The plot shows that '*OverallQual*' ('*Rates the overall material and finish of the house*'), '*GrLivArea*' ('*Above grade (ground) living area square feet*') and '*TotalBsmtSF*' ('*Total square feet of basement area*') are the top three predictors of the sale price:

<img src="AutoML_Ames_house_prices_explain/4_Default_Xgboost/permutation_importance.png" width="500"/> 
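The idea behind permutation-based importance can be sketched in a few lines: shuffle one feature column, re-score the model, and see how much the error grows. The example below is a toy illustration under that assumption and not mljar-supervised's implementation; `permutation_importance`, `toy_model` and the data are all made up for the demonstration:

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, rows, y_true, n_features, seed=0):
    """Return the RMSE increase per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = rmse(y_true, [predict(row) for row in rows])
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in position j.
        shuffled_rows = [row[:j] + (value,) + row[j + 1:] for row, value in zip(rows, column)]
        permuted = rmse(y_true, [predict(row) for row in shuffled_rows])
        importances.append(permuted - baseline)
    return importances


# Toy data: the target depends on feature 0 only; feature 1 is pure noise.
data_rng = random.Random(42)
rows = [(data_rng.uniform(0, 10), data_rng.uniform(0, 10)) for _ in range(200)]
y = [3.0 * x0 for x0, _ in rows]


def toy_model(row):
    # Stand-in for a trained model: it uses feature 0 and ignores feature 1.
    return 3.0 * row[0]


importances = permutation_importance(toy_model, rows, y, n_features=2)
print(importances)
```

Shuffling the feature the model relies on raises the error noticeably, while shuffling the ignored feature leaves it unchanged, which is the pattern the importance plot summarizes for the real features.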

### Step 5: Evaluating the final model on test data 

First, we need to prepare the test features and predict the sale price using the trained AutoML objects:

In [24]:
test_df = pd.read_csv(os.path.join(competition_name, 'test.csv'))

In [25]:
test_df.shape # almost equal to the size of the training data

(1459, 80)

In [26]:
X_test_df = test_df.drop('Id', axis=1)

In [28]:
preds_explain = automl_explain.predict(X_test_df)



In [29]:
preds_compete = automl_compete.predict(X_test_df)



In [33]:
type(preds_explain), preds_explain.shape

(numpy.ndarray, (1459,))

In [32]:
preds_explain

array([125835.13331867, 160868.93409775, 165978.22316025, ...,
       166623.73097275, 123589.31567258, 231148.82614762])

The resulting predictions are NumPy arrays. They need to be transformed into Pandas DataFrames with '*Id*' and '*SalePrice*' columns to fit the submission format:

In [40]:
preds_explain_df = copy.deepcopy(test_df)
preds_compete_df = copy.deepcopy(test_df)

In [41]:
preds_explain_df['SalePrice'] = list(preds_explain)
preds_compete_df['SalePrice'] = list(preds_compete)

In [43]:
preds_explain_df = preds_explain_df[['Id', 'SalePrice']]
preds_compete_df = preds_compete_df[['Id', 'SalePrice']]

In [46]:
preds_explain_df.to_csv('test_predictions_explain.csv', index=False)
preds_compete_df.to_csv('test_predictions_compete.csv', index=False)
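The same two-column layout can also be built directly from the '*Id*' column and the prediction array, without copying the whole test frame. This is a sketch with toy stand-ins: the Id values and the `LotArea` column are made up, and the three predictions are rounded from the array shown above:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for test_df and the prediction array.
toy_test_df = pd.DataFrame({"Id": [101, 102, 103], "LotArea": [8450, 9600, 11250]})
toy_preds = np.array([125835.13, 160868.93, 165978.22])

# Guard against a silent length mismatch before building the frame.
assert len(toy_preds) == len(toy_test_df)

submission_df = pd.DataFrame({"Id": toy_test_df["Id"], "SalePrice": toy_preds})
print(submission_df)
```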

Finally, we can upload the '*explain*' and '*compete*' predictions to Kaggle for evaluation:

<img src="kaggle_screen.png" width="700"/> 

The score does not look familiar, as it is the RMSE between the log-scaled predicted and true values. Using the AutoML '*compete*' mode gave a result in the 16th percentile of the whole challenge's leaderboard 🎉
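For reference, the kind of score Kaggle reports can be sketched as below, assuming the natural logarithm is used. `rmse_of_logs` is a hypothetical helper and the prices are toy values. Since the test file has no '*SalePrice*' column, such a score could only be reproduced locally on held-out training data:

```python
import math


def rmse_of_logs(y_true, y_pred):
    """RMSE between the natural logs of the true and predicted values."""
    squared_errors = [(math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values: the same 10% overestimate on a cheap and an expensive house.
true_prices = [100000.0, 500000.0]
pred_prices = [110000.0, 550000.0]

score = rmse_of_logs(true_prices, pred_prices)
print(score)
```

Both houses contribute equally to the score because the metric depends on the ratio between predicted and true price, not on the absolute error in dollars.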