In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_selection import RFECV

In [2]:
df = pd.read_csv('apartamentos_clean.csv')

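First of all, let's get rid of outliers: we drop every row where `aluguel`, `condominio` or `area` lies 3 or more standard deviations away from its column mean.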

In [3]:
con = ['aluguel', 'condominio', 'area']
df = df[(np.abs(stats.zscore(df[con])) < 3).all(axis=1)]
df.head()

Unnamed: 0,aluguel,condominio,area,bairro,quartos,banheiros,garagem,anunciante
0,4700.0,750.0,82.0,Chácara Santo Antônio,2.0,2.0,2.0,quintoandar
1,3720.0,970.0,50.0,Vila Olímpia,1.0,1.0,0.0,quintoandar
2,3300.0,443.0,85.0,Vila das Mercês,2.0,2.0,1.0,quintoandar
3,1250.0,752.0,35.0,Vila Mascote,1.0,1.0,1.0,quintoandar
4,2835.0,740.0,72.0,Saúde,3.0,2.0,1.0,quintoandar
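To make the filter above concrete, here is a minimal sketch on made-up numbers (the toy values are assumptions, not rows from the dataset). It computes the z-scores with plain NumPy using the population standard deviation, which is what `scipy.stats.zscore` uses by default, and keeps only the rows where every column is within 3 standard deviations of its mean.

```python
import numpy as np

# Hypothetical toy data: 20 ordinary rows (rent, condo fee, area)...
n = 20
toy = np.column_stack([
    np.linspace(2000.0, 4000.0, n),  # rent
    np.linspace(500.0, 900.0, n),    # condo fee
    np.linspace(40.0, 100.0, n),     # area
])
# ...plus one row with an extreme rent.
toy = np.vstack([toy, [50000.0, 700.0, 70.0]])

# Column-wise z-scores (population standard deviation, ddof=0).
z = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Keep a row only if all of its columns are within 3 standard deviations.
keep = (np.abs(z) < 3).all(axis=1)
filtered = toy[keep]

print(toy.shape, "->", filtered.shape)
```

Note that a single extreme value also inflates the column's standard deviation, so in small samples a z-score filter can miss outliers that a robust rule (for example one based on quartiles) would catch.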


## Applying the model

#### The simplest of applications:
In the function below we fit a very basic, non-optimized random forest regressor to our data and make predictions on the train and test sets.

We also define a function to print out the predictions' MAE.

We choose the mean absolute error because the price still varies a lot and the mean squared error would penalize this variation too heavily. We could restrict the price to a range that fits one's particular budget, but it's best not to modify the data any further.
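The difference between the two metrics can be sketched with a few made-up prices (the numbers below are illustrative assumptions, not taken from the dataset). Four predictions miss by R$500 and one misses by R$4000; the sketch compares how much of each metric is due to that single large miss.

```python
import numpy as np

# Hypothetical true and predicted prices.
y_true = np.array([2000.0, 2500.0, 3000.0, 3500.0, 9000.0])
y_pred = np.array([2500.0, 2000.0, 3500.0, 3000.0, 5000.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)

# Share of each metric that comes from the single large miss.
share_mae = np.abs(errors[-1]) / np.abs(errors).sum()
share_mse = errors[-1] ** 2 / (errors ** 2).sum()

print("MAE: {:0.2f}, MSE: {:0.2f}".format(mae, mse))
print("Share of the large miss: {:0.0%} (MAE) vs {:0.0%} (MSE)".format(share_mae, share_mse))
```

Because the errors are squared, the one expensive apartment dominates the MSE far more than it dominates the MAE.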

In [4]:
def simple_rf(X_train, X_test, y_train, y_test):
    rf = RandomForestRegressor(n_estimators=20, random_state=1)
    rf.fit(X_train, y_train)

    predictions_train = rf.predict(X_train)
    predictions_test = rf.predict(X_test)
    return (predictions_train, predictions_test)

In [5]:
def print_mae(y_test, y_train, y_p_test, y_p_train):
    mae_train = mean_absolute_error(y_train, y_p_train)
    mae_test = mean_absolute_error(y_test, y_p_test)
    spread = np.abs(mae_train - mae_test)
    print("MAE train: {:0.2f}".format(mae_train))
    print("-"*100)
    print("MAE test: {:0.2f}".format(mae_test))
    print("-"*100)
    print("Spread: {:0.2f}".format(spread))

In [6]:
x = df.drop(['aluguel', 'bairro', 'anunciante'], axis=1)
y = df['aluguel']
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 1)


predictions_train, predictions_test = simple_rf(X_train, X_test, y_train, y_test)
print_mae(y_test, y_train, predictions_test, predictions_train)

MAE train: 423.35
----------------------------------------------------------------------------------------------------
MAE test: 998.22
----------------------------------------------------------------------------------------------------
Spread: 574.87


In [9]:
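# Series.mad() was removed in pandas 2.0; this computes the same mean absolute deviation
(y_test - y_test.mean()).abs().mean()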

1237.794132922796

We see the model is strongly overfitting. But at least it beats the barest of benchmarks: the test MAE is well below the mean absolute deviation of the test targets.
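The reason the mean absolute deviation works as a benchmark is that it equals the MAE of a "model" that always predicts the mean of the targets. A minimal sketch with made-up prices (the values are assumptions for illustration):

```python
import numpy as np

# Hypothetical rents.
y = np.array([1500.0, 2200.0, 3000.0, 4100.0, 6200.0])

# A constant predictor that always outputs the mean of y.
constant_prediction = np.full_like(y, y.mean())
mae_constant = np.mean(np.abs(y - constant_prediction))

# Mean absolute deviation around the mean.
mad = np.mean(np.abs(y - y.mean()))

print(mae_constant, mad)
```

In scikit-learn the same kind of baseline can be obtained with `sklearn.dummy.DummyRegressor(strategy='mean')`, which would use the mean of the training targets rather than of the test targets.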

#### With smaller area:

As in the linear regression model, we restrict the apartments by area: only those with at most 200 m^2 are considered.

In [10]:
smaller = df[df['area'] <= 200]

x = smaller.drop(['aluguel', 'bairro', 'anunciante'], axis=1)
y = smaller['aluguel']
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 1)

predictions_train, predictions_test = simple_rf(X_train, X_test, y_train, y_test)
print_mae(y_test, y_train, predictions_test, predictions_train)

MAE train: 390.16
----------------------------------------------------------------------------------------------------
MAE test: 883.83
----------------------------------------------------------------------------------------------------
Spread: 493.67


Now with the target as the total price:

In [11]:
smaller = df[df['area'] <= 200]

x = smaller.drop(['aluguel', 'bairro', 'anunciante', 'condominio'], axis=1)
y = smaller['aluguel'] + smaller['condominio']
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 1)

predictions_train, predictions_test = simple_rf(X_train, X_test, y_train, y_test)
print_mae(y_test, y_train, predictions_test, predictions_train)

MAE train: 815.31
----------------------------------------------------------------------------------------------------
MAE test: 1013.56
----------------------------------------------------------------------------------------------------
Spread: 198.25


With linear regression, this setup (total price as the target) gave the best model we found. Let's compare the two models:

In [12]:
features = ['area', 'quartos']
x = smaller[features]
y = smaller['aluguel'] + smaller['condominio']
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 1)

lr = LinearRegression()
lr.fit(X_train, y_train)
prediction_test_lr = lr.predict(X_test)
predictions_train_lr = lr.predict(X_train)

print("Linear regression MAE: {:0.2f}".format(mean_absolute_error(y_test, prediction_test_lr)))

Linear regression MAE: 1078.33


We see that even though our model is not yet optimized (it is strongly overfitting, we're only using `20` trees, and so on), we reached a slightly better result with it than with our linear regression.

This suggests that random forests (and other tree-based models, like xgboost) can do better on this kind of data, since they don't require a linear relationship between the features and the target. The downside is that they are slower to train. This isn't really an issue here, but with larger datasets it can be quite cumbersome.
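The point about linearity can be sketched on synthetic data (everything below is an assumption made for illustration, not the apartment dataset). The target depends on a single feature through a sine curve, which a straight line cannot follow but a forest can approximate piece by piece.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, strongly non-linear relationship with a little noise.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = 1000.0 * np.sin(X[:, 0]) + rng.normal(0.0, 50.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

lr = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_tr, y_tr)

lr_mae = mean_absolute_error(y_te, lr.predict(X_te))
rf_mae = mean_absolute_error(y_te, rf.predict(X_te))

print("Linear regression MAE: {:0.2f}".format(lr_mae))
print("Random forest MAE: {:0.2f}".format(rf_mae))
```

The reverse also holds: when the relationship really is close to linear, a linear model can match or beat a forest, and it extrapolates beyond the training range, which trees cannot do.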

#### With the neighborhoods

Now, as a last tweak, let's see what the result is if we include the neighborhoods as variables:

In [13]:
df_b = pd.get_dummies(df, columns=['bairro'])

In [14]:
smaller_b = df_b[df_b['area'] <= 200]

x = smaller_b.drop(['aluguel', 'anunciante'], axis=1)
y = smaller_b['aluguel']
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 1)

predictions_train, predictions_test = simple_rf(X_train, X_test, y_train, y_test)
print_mae(y_test, y_train, predictions_test, predictions_train)

MAE train: 293.67
----------------------------------------------------------------------------------------------------
MAE test: 749.83
----------------------------------------------------------------------------------------------------
Spread: 456.16


The model performs a bit better, and the overfitting is not much different from the run with smaller areas and no neighborhood variables.
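For reference, this is roughly what `pd.get_dummies` does to the `bairro` column, sketched on a three-row toy frame (the rows are made up for illustration). Each distinct neighborhood becomes its own indicator column, which is why the feature matrix grows to several hundred columns.

```python
import pandas as pd

# Hypothetical miniature version of the dataset.
toy = pd.DataFrame({
    "area": [50.0, 85.0, 35.0],
    "bairro": ["Pinheiros", "Saúde", "Pinheiros"],
})

# One indicator column per distinct value of 'bairro'.
encoded = pd.get_dummies(toy, columns=["bairro"])

print(encoded.columns.tolist())
print(encoded)
```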

With the neighborhoods targeting the total price:

In [15]:
smaller_b = df_b[df_b['area'] <= 200]

x = smaller_b.drop(['aluguel', 'anunciante', 'condominio'], axis=1)
y = smaller_b['aluguel'] + smaller_b['condominio']
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 1)

predictions_train, predictions_test = simple_rf(X_train, X_test, y_train, y_test)
print_mae(y_test, y_train, predictions_test, predictions_train)

MAE train: 413.55
----------------------------------------------------------------------------------------------------
MAE test: 812.13
----------------------------------------------------------------------------------------------------
Spread: 398.58


Among the runs targeting the total price, we achieve the best result here, where the neighborhoods are also considered.

There's no escaping it: `condominio` is related to `aluguel` somehow, so excluding it as an independent variable affects our results. But in the end it's the total price we're interested in knowing.

## Feature Selection

We'll optimize our model in two steps: first we'll select the features that are most relevant to the regression and then we'll go on tuning the hyperparameters.

For feature selection we'll use `RFECV` from `sklearn.feature_selection`. It performs recursive feature elimination: it fits the model, drops the least important feature(s), and repeats, using cross-validation to decide how many features to keep.
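A minimal sketch of how `RFECV` behaves, on synthetic data where we know which features matter (the data, the number of trees and the variable names are assumptions chosen so the example runs quickly):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic data: only the first two of six columns drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

selector = RFECV(
    RandomForestRegressor(n_estimators=20, random_state=1),
    step=1,                            # drop one feature per round
    scoring="neg_mean_absolute_error",
    cv=3,
)
selector.fit(X, y)

print("Number of features kept:", selector.n_features_)
print("Boolean mask of kept features:", selector.support_)
print("Ranking (1 = kept):", selector.ranking_)
```

The attribute `support_` is the boolean mask that is later used to pick columns, and `ranking_` shows the order in which the other features were eliminated.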

First, let's define again our `x` and `y` and split them between test and train sets:

In [16]:
x = smaller_b.drop(['aluguel', 'anunciante', 'condominio'], axis=1)
y = smaller_b['aluguel'] + smaller_b['condominio']
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 1)

Then we instantiate the model we want (a random forest regressor) and the `RFECV`, using the MAE as the scoring criterion for the reason explained above.

We get the selected features by using `selector.support_` as a boolean mask over the columns of `x`.

Since this selection takes some time, we ran it on another computer and saved the results to `optimized_columns.data`, so the cell below contains only the unexecuted code.

In [None]:
rf = RandomForestRegressor(random_state=1)

selector = RFECV(rf, scoring='neg_mean_absolute_error', cv=3)
selector.fit(x, y)

optimized_columns = X_train.columns[selector.support_]

import pickle

with open('optimized_columns.data', 'wb') as filehandle:
    pickle.dump(optimized_columns, filehandle)

Now we load the `optimized_columns`:

In [18]:
import pickle
with open('optimized_columns.data', 'rb') as filehandle:
    optimized_columns = pickle.load(filehandle)
    
print(optimized_columns)

Index(['area', 'quartos', 'banheiros', 'garagem', 'bairro_Aclimação',
       'bairro_Alto da Boa Vista', 'bairro_Alto da Lapa',
       'bairro_Alto da Mooca', 'bairro_Alto de Pinheiros',
       'bairro_Americanópolis',
       ...
       'bairro_Vila do Bosque', 'bairro_Vila do Castelo',
       'bairro_Vila do Encontro', 'bairro_Vila dos Remédios',
       'bairro_Várzea da Barra Funda', 'bairro_Várzea de Baixo',
       'bairro_Água Branca', 'bairro_Água Fria', 'bairro_Água Funda',
       'bairro_Água Rasa'],
      dtype='object', length=453)


In [19]:
X_train = X_train[optimized_columns]
X_test = X_test[optimized_columns]

In [20]:
rf2 = RandomForestRegressor(random_state=1)
rf2.fit(X_train, y_train)
predictions_train = rf2.predict(X_train)
predictions_test = rf2.predict(X_test)

print_mae(y_test, y_train, predictions_test, predictions_train)

MAE train: 405.88
----------------------------------------------------------------------------------------------------
MAE test: 804.09
----------------------------------------------------------------------------------------------------
Spread: 398.22


## Tuning Hyperparameters

Now that we've seen what we can do and have a basic benchmark, let's do some optimization: we'll run a cross-validated grid search (using `GridSearchCV`).

Basically, we set a grid of parameters and `GridSearchCV` runs a cross-validation for each combination.

In the end we should arrive at the best combination of parameters within our initial grid.
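As a small-scale sketch of the procedure, here is a grid search over just two hyperparameters on synthetic data (the data and the tiny grid are assumptions chosen so that it runs in a few seconds; they are not the grid used in this notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(150, 2))
y = 100.0 * X[:, 0] + 20.0 * X[:, 1] ** 2 + rng.normal(scale=10.0, size=150)

# 2 x 2 = 4 combinations, each evaluated with 3-fold cross-validation.
small_grid = {
    "n_estimators": [10, 30],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid=small_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated MAE: {:0.2f}".format(-search.best_score_))
```

The cost grows multiplicatively with every parameter added. The grid used in this notebook has 2 x 4 x 3 x 3 x 3 x 3 x 3 = 1944 combinations, so with `cv = 3` it requires 5832 model fits, which is why it was run separately.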

Let's work with the following setup: smaller areas, optimized features, and the total price as the target.

In [21]:
x = smaller_b[optimized_columns]
y = smaller_b['aluguel'] + smaller_b['condominio']
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 1)

The cell below contains the code for the grid search. Since it takes quite a while to run, we ran it on another computer and only show the code here, together with the results.

In [None]:
bootstrap = [True, False]
max_depth = [5, 10, 100, None]
min_samples_split = [2,5,10]
min_samples_leaf = [2,3,5]
max_features = ['log2', 'sqrt', 1.0]  # 'auto' (all features, for regressors) was removed in scikit-learn 1.3; 1.0 is equivalent
n_estimators = [200, 500, 1000]
min_impurity_decrease = [0.0, 0.25, 0.5]


grid = {'n_estimators': n_estimators,
        'max_features': max_features,
        'max_depth': max_depth,
        'min_samples_split': min_samples_split,
        'min_samples_leaf': min_samples_leaf,
        'bootstrap': bootstrap,
        'min_impurity_decrease': min_impurity_decrease
       }

rf = RandomForestRegressor(criterion='absolute_error')  # named 'mae' before scikit-learn 1.2
rf_gridsearch = GridSearchCV(estimator=rf,
                             param_grid=grid,
                             scoring='neg_mean_absolute_error', 
                             cv = 3,
                             verbose=2)

rf_gridsearch.fit(x, y)
rf_gridsearch.best_params_

# This gives:

# {'n_estimators': 500,
#         'max_features': 'sqrt',
#         'max_depth': None,
#         'min_samples_split': 10,
#         'min_samples_leaf': 3,
#         'bootstrap': False,
#         'min_impurity_decrease': 0.25
#        }

In [22]:
rf = RandomForestRegressor(n_estimators=500,
                           min_samples_split=10,
                           min_samples_leaf=3,
                           max_features='sqrt',
                           max_depth=None,
                           bootstrap=False,
                           min_impurity_decrease=0.25,
                           random_state=1)
rf.fit(X_train, y_train)

predictions_train = rf.predict(X_train)
predictions_test = rf.predict(X_test)

print_mae(y_test, y_train, predictions_test, predictions_train)

MAE train: 772.70
----------------------------------------------------------------------------------------------------
MAE test: 827.73
----------------------------------------------------------------------------------------------------
Spread: 55.02


We see that even though our test MAE is marginally worse than the non-optimized ones (by about R$`16` compared to the run with all features, or R$`24` compared to the run with the selected features), we decreased the overfitting considerably: the spread went from `398.22` to `55.02`.

This is not trivial. Decision tree models overfit very easily (they can pick up the particularities of the training set and build a sort of monster out of them), and a drastic decrease like this (of about `86%`) suggests our model should generalize better to data it has not seen.

This is much more realistic.
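The figures quoted above follow directly from the MAE outputs printed earlier in this notebook; the arithmetic is:

```python
# Values copied from the outputs above.
spread_before = 398.22      # selected features, default hyperparameters
spread_after = 55.02        # selected features, tuned hyperparameters
mae_test_tuned = 827.73
mae_test_all_features = 812.13
mae_test_selected = 804.09

relative_decrease = (spread_before - spread_after) / spread_before
gap_vs_all_features = mae_test_tuned - mae_test_all_features
gap_vs_selected = mae_test_tuned - mae_test_selected

print("Spread decrease: {:0.1%}".format(relative_decrease))
print("Test MAE gap vs. all features: {:0.2f}".format(gap_vs_all_features))
print("Test MAE gap vs. selected features: {:0.2f}".format(gap_vs_selected))
```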

## A particular case

Say we want to know if the price of a particular apartment is above or below the market price for similar apartments.

A naive approach is to take the mean price of apartments sharing similar attributes. To make the case concrete, let's say the person is evaluating an apartment in Pinheiros of about 85 square meters, with 2 bedrooms, 1 bathroom and 1 garage space:

In [23]:
df_eval = df[(df['bairro'] == 'Pinheiros') & (df['area'].between(80.0,90.0)) & (df['quartos'] == 2) & (df['banheiros'] == 1) & (df['garagem'] == 1)]

y_eval = df_eval['aluguel'] + df_eval['condominio']

print(y_eval.mean())

4229.0


Even if the person uses all these attributes to compute the mean and decide whether the apartment is a good deal (setting aside whether using all of them is really the best way to evaluate, which is not obvious), the mean absolute deviation is still too high compared to the MAE of our model:

In [24]:
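# Series.mad() was removed in pandas 2.0; this computes the same mean absolute deviation
print((y_eval - y_eval.mean()).abs().mean())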

934.0


Our model's test MAE (`827.73`) is a little more than `11%` lower than this deviation.

Now, to evaluate the price of such an apartment we pass it to our model:

In [25]:
case = pd.DataFrame({'aluguel':[np.nan], 'condominio':[np.nan], 'area':[85.0], 'bairro':['Pinheiros'], 'quartos':[2.0], 'banheiros':[1.0], 'garagem':[1.0], 'anunciante':[np.nan]})
df_case = df.copy()
df_case = pd.concat([df_case, case])  # DataFrame.append was removed in pandas 2.0
df_case = pd.get_dummies(df_case, columns=['bairro'])

In [26]:
df_case.iloc[-1]

aluguel                   NaN
condominio                NaN
area                       85
quartos                     2
banheiros                   1
                         ... 
bairro_Várzea de Baixo      0
bairro_Água Branca          0
bairro_Água Fria            0
bairro_Água Funda           0
bairro_Água Rasa            0
Name: 0, Length: 480, dtype: object

In [27]:
# iloc[[-1]] selects the last row as a one-row DataFrame, so no transpose is needed
predict_case = rf.predict(df_case[optimized_columns].iloc[[-1]])
print(predict_case)

[3766.22325437]
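Appending the new row to the whole dataset before encoding guarantees that it gets the same dummy columns as the training data. A possible alternative, sketched here under assumptions, is to encode the single row on its own and align it to the training columns with `reindex`. The short `training_columns` list below is a hypothetical stand-in for `optimized_columns`.

```python
import pandas as pd

# Hypothetical stand-in for the columns the model was trained on.
training_columns = ["area", "quartos", "banheiros", "garagem",
                    "bairro_Pinheiros", "bairro_Saúde", "bairro_Vila Olímpia"]

# The apartment we want to evaluate.
case = pd.DataFrame({"area": [85.0], "bairro": ["Pinheiros"], "quartos": [2.0],
                     "banheiros": [1.0], "garagem": [1.0]})

# Encode the row, then add the missing dummy columns (filled with 0)
# and put everything in the same order as the training columns.
case_encoded = (
    pd.get_dummies(case, columns=["bairro"])
      .reindex(columns=training_columns, fill_value=0)
)

print(case_encoded)
```

One caveat of this approach: a neighborhood that never appeared in the training data would be dropped silently by `reindex`, so it is worth checking for that before predicting.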


## Conclusion

We were able to improve not only upon the benchmark of the mean plus absolute deviation (in a *very* restricted case), but also upon our previous linear regression model.

We could improve the model further with more feature engineering and finer tuning of the hyperparameters (and with combinations of both). But we choose to move on to exploring the `xgboost` algorithm (in another notebook), not only because optimizing these tree models is quite time consuming, but also because we already know that the non-optimized `xgboost` model gives a better MAE than the random forest.