# Modeling house prices

In this notebook, we model our data with two different models, RandomForestRegressor and XGBRegressor, and compare them.

In [1]:
from math import sqrt

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from numpy import random

from sklearn import linear_model, preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import (
    KFold,
    RandomizedSearchCV,
    RepeatedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBRegressor

# Import data

Here, we import the data that was prepared in the last notebook.

In [2]:
X=pd.read_csv('X')
y=pd.read_csv('y')
X_test=pd.read_csv('X_test')

Then we split the data into train and validation sets.

In [3]:
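SEED = 0
# train_test_split draws from NumPy's global random state when random_state is not set,
# so seeding it here makes the split reproducible.
random.seed(SEED)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)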
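The split above is made reproducible by seeding NumPy's global random state. As a minimal sketch of an alternative, the seed can also be passed directly through the `random_state` argument of `train_test_split`. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the house price data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

SEED = 0

# Passing random_state fixes the shuffle for this call only,
# independently of the global NumPy random state.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=SEED)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=SEED)

X_tr, X_va, y_tr, y_va = split_a
print(X_tr.shape, X_va.shape)

# Two calls with the same seed return identical splits.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

With this variant, the split does not change if other code consumes random numbers before the call.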

# Random Forest model

Now we need to model our data with some kind of regression. First we try the random forest regressor, a robust model that doesn't require feature selection or scaling. For a quick result, let's first use the train and validation sets obtained with train_test_split.

In [4]:
model = RandomForestRegressor(n_estimators=300, min_samples_leaf=1, n_jobs=-1, criterion='absolute_error', random_state=0)
model.fit(X_train, y_train.values.ravel())
predictions = model.predict(X_valid)


score2 = r2_score(y_valid, predictions)
# r2_score is the coefficient of determination (R²), not a classification accuracy.
print("The R² score of our model is {}%".format(round(score2, 2) * 100))
print('Root Mean Squared Error (RMSE):', mean_squared_error(y_valid, predictions) ** 0.5)

The R² score of our model is 85.0%
Root Mean Squared Error (RMSE): 32609.166470683933
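To make the two printed metrics concrete, here is a small sketch that computes R² (the value returned by `r2_score`) and the root of the mean squared error by hand. The helper `r2_and_rmse` and the four demo prices are hypothetical and only illustrate the formulas.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R², RMSE) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Sum of squared prediction errors.
    residual_ss = np.sum((y_true - y_pred) ** 2)
    # Sum of squared deviations from the mean of the true values.
    total_ss = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - residual_ss / total_ss
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r2, rmse

# Hypothetical prices: every prediction is off by exactly 10,000.
y_true_demo = [100_000, 150_000, 200_000, 250_000]
y_pred_demo = [110_000, 140_000, 210_000, 240_000]

r2_demo, rmse_demo = r2_and_rmse(y_true_demo, y_pred_demo)
print("R2:", round(r2_demo, 3))
print("RMSE:", rmse_demo)
```

R² is a unitless share of explained variance, while the RMSE is expressed in the same unit as the target (here, the sale price).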


## Cross validation analysis for Random Forest model

When we use train_test_split(), we split our data in just one way, so the result is influenced by that particular choice. To reduce this dependence, we use K-fold cross-validation (KFold), which generates several different train and test sets.
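As an illustration of how K-fold splitting works, the sketch below prints the train and validation indices of each fold for a tiny hypothetical array (`X_demo`). Every row ends up in the validation part exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

validation_indices = []
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X_demo)):
    print(f"fold {fold}: train={train_idx}, valid={valid_idx}")
    validation_indices.extend(valid_idx.tolist())

# Across the folds, each row index is used for validation once.
print(sorted(validation_indices))
```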

Furthermore, in order to optimize our model, we use a randomized search to estimate the best parameters.

In [5]:
SEED =0
np.random.seed(SEED)

model = RandomForestRegressor()

# Define the grid of hyperparameters to search
espaço_de_parametros = {'bootstrap': [True, False],
                        'n_estimators': [50, 100, 200, 300],
                        'n_jobs': [-1],
                        'min_samples_leaf': [1, 4, 8, 16, 32]
                        }

# Set up the random search
busca = RandomizedSearchCV(estimator=model,
                    param_distributions=espaço_de_parametros,
                    cv=KFold(n_splits=10, shuffle = True),
                    n_iter=16,
                    random_state=SEED)
busca.fit(X,y.values.ravel())
resultados=pd.DataFrame(busca.cv_results_)
resultados.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_jobs | param_n_estimators | param_min_samples_leaf | param_bootstrap | params | split0_test_score | ... | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.874007 | 0.62003 | 0.042301 | 0.009151 | -1 | 200 | 1 | False | {'n_jobs': -1, 'n_estimators': 200, 'min_sampl... | 0.674091 | ... | 0.605937 | 0.769375 | 0.748209 | 0.748716 | 0.789028 | 0.842214 | 0.753493 | 0.726958 | 0.069806 | 14 |
| 1 | 1.207811 | 0.070562 | 0.016749 | 0.003105 | -1 | 50 | 1 | False | {'n_jobs': -1, 'n_estimators': 50, 'min_sample... | 0.728395 | ... | 0.595029 | 0.779399 | 0.747285 | 0.747771 | 0.790695 | 0.843079 | 0.759461 | 0.731981 | 0.072529 | 13 |
| 2 | 1.437687 | 0.07322 | 0.022536 | 0.002781 | -1 | 100 | 4 | False | {'n_jobs': -1, 'n_estimators': 100, 'min_sampl... | 0.592751 | ... | 0.594053 | 0.600902 | 0.754432 | 0.755361 | 0.817732 | 0.825108 | 0.763688 | 0.706145 | 0.097671 | 15 |
| 3 | 0.523662 | 0.003473 | 0.016317 | 0.00355 | -1 | 50 | 4 | True | {'n_jobs': -1, 'n_estimators': 50, 'min_sample... | 0.82701 | ... | 0.871283 | 0.77196 | 0.913901 | 0.86531 | 0.846454 | 0.866505 | 0.890038 | 0.848279 | 0.048624 | 2 |
| 4 | 1.400132 | 0.007517 | 0.036546 | 0.003224 | -1 | 200 | 8 | True | {'n_jobs': -1, 'n_estimators': 200, 'min_sampl... | 0.839225 | ... | 0.858819 | 0.773061 | 0.906319 | 0.854013 | 0.814119 | 0.867496 | 0.883276 | 0.843836 | 0.045159 | 3 |


After the search, we retrieve the estimator with the best parameters.

In [6]:
melhor = busca.best_estimator_
melhor
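Note that `head()` lists the first candidates in the order they were sampled, not the best ones. The following sketch, run on synthetic data from `make_regression` with a deliberately tiny search space (both are assumptions made for speed), shows one way to sort the candidates by rank and read the winning parameters from `best_params_` and `best_score_`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for the house price data.
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [10, 20],
                         "min_samples_leaf": [1, 4]},
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    n_iter=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# Sort candidates so that the best one (rank 1) comes first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranking = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")[columns]

print(search.best_params_)
print(search.best_score_)
print(ranking)
```

The first row of `ranking` corresponds to `best_estimator_`, which the search refits on all the data passed to `fit` by default.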

Then we define a function that prints the mean, minimum, and maximum estimator performance. This way we can compare the performance of different models.

In [7]:
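def imprime_resultados(results):
    # cross_validate stores one score per fold in 'test_score';
    # for regressors the default score is R², shown here as a percentage.
    media = results['test_score'].mean() * 100
    minimum = results['test_score'].min() * 100
    maximum = results['test_score'].max() * 100
    print("Accuracy médio %.2f" % media)  # mean score
    print("Intervalo [%.2f, %.2f]" % (minimum, maximum))  # [minimum, maximum] range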
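For reference, the sketch below shows the structure that `cross_validate` returns and how a summary like the one above can be derived from it. It uses synthetic data and a plain `LinearRegression` to stay fast. The helper `summarize_scores` is hypothetical and returns the values instead of printing them.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

def summarize_scores(results):
    """Return (mean, min, max) of the per-fold scores, in percent."""
    scores = results["test_score"] * 100
    return scores.mean(), scores.min(), scores.max()

# Synthetic stand-in for the house price data.
X_demo, y_demo = make_regression(n_samples=100, n_features=3,
                                 noise=5.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results_demo = cross_validate(LinearRegression(), X_demo, y_demo, cv=cv)

# The result is a dict of arrays with one entry per fold.
print(sorted(results_demo.keys()))

mean_pct, min_pct, max_pct = summarize_scores(results_demo)
print("mean %.2f, range [%.2f, %.2f]" % (mean_pct, min_pct, max_pct))
```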

And finally we apply the function imprime_resultados to see the mean, minimum, and maximum performance of the best estimator.

In [8]:
model = RandomForestRegressor(n_estimators=200, n_jobs=-1)

cv = KFold(n_splits=10, shuffle = True)
results = cross_validate(model, X, y.values.ravel(), cv = cv)
imprime_resultados(results)


Accuracy médio 84.10
Intervalo [66.22, 90.76]


# Random Forest Test

Here we retrain our best model with the full dataset since more data tends to improve the model. 

Then, we use this model to estimate the prices of the houses in the test set. 

Submitting these results to Kaggle gave position 1135 out of a total of 3975.

In [9]:
model.fit(X, y.values.ravel())

In [10]:
predictions = model.predict(X_test)

In [11]:
sub = pd.Series(predictions, index=X_test['Id'], name='SalePrice')
sub.shape

(1459,)

In [12]:
sub.to_csv("Teste_RandomForest.csv", header=True)
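The submission file written above has two columns, the `Id` index and the `SalePrice` values. A minimal sketch of the same construction, using three hypothetical ids and predictions and an in-memory buffer instead of a file, shows the resulting layout.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and predictions, standing in for X_test['Id'] and model output.
ids = pd.Series([1461, 1462, 1463], name="Id")
preds = np.array([120000.0, 155000.5, 180250.0])

# Using the Id column as index carries its name into the CSV header.
submission = pd.Series(preds, index=ids, name="SalePrice")

buffer = io.StringIO()
submission.to_csv(buffer, header=True)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back confirms the two-column structure.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(list(round_trip.columns))
```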

# XGBoost model

Now we will try the XGBoost regressor, a robust model that doesn't require feature selection or scaling. For a quick result, let's first use the train and validation sets obtained with train_test_split.

In [13]:
model2 = XGBRegressor(random_state=0,
                     learning_rate = 0.20,
                     n_estimators=800,
                     max_depth=2,
                     min_child_weight=4,
                     #colsample_bynode=1
                     #subsample=0.75
                     #num_parallel_tree=9
                     objective='reg:squarederror'
                     )
model2.fit(X_train, y_train)
predictions = model2.predict(X_valid)

score2 = r2_score(y_valid, predictions)
# r2_score is the coefficient of determination (R²), not a classification accuracy.
print("The R² score of our model is {}%".format(round(score2, 2) * 100))
print('Root Mean Squared Error (RMSE):', mean_squared_error(y_valid, predictions) ** 0.5)

The R² score of our model is 84.0%
Root Mean Squared Error (RMSE): 32774.91089271422


## Cross validation analysis for XGBoost model

When we use train_test_split(), we split our data in just one way, so the result is influenced by that particular choice. To reduce this dependence, we use K-fold cross-validation (KFold), which generates several different train and test sets.

Furthermore, in order to optimize our model, we use a randomized search to estimate the best parameters.

In [14]:
SEED =0
np.random.seed(SEED)

model2 = XGBRegressor()

# Define the grid of hyperparameters to search
espaço_de_parametros = {'learning_rate': [0.01, 0.1, 0.2, 0.4, 1],
                        'n_estimators': [100, 400, 800],
                        'max_depth': [1, 2, 4, 8, 16, 32],
                        'min_child_weight': [1, 4, 8, 16, 32]
                        }

# Set up the random search
busca2 = RandomizedSearchCV(estimator=model2,
                    param_distributions=espaço_de_parametros,
                    cv=KFold(n_splits=10, shuffle = True),
                    n_iter=16,
                    random_state=SEED)
busca2.fit(X,y)
resultados2=pd.DataFrame(busca2.cv_results_)
resultados2.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_estimators | param_min_child_weight | param_max_depth | param_learning_rate | params | split0_test_score | ... | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.163052 | 0.133422 | 0.015923 | 0.003924 | 400 | 4 | 4 | 0.1 | {'n_estimators': 400, 'min_child_weight': 4, '... | 0.842831 | ... | 0.909502 | 0.74622 | 0.922053 | 0.91322 | 0.897265 | 0.893884 | 0.900577 | 0.874103 | 0.054425 | 1 |
| 1 | 0.769298 | 0.082086 | 0.015345 | 0.002356 | 100 | 16 | 8 | 0.01 | {'n_estimators': 100, 'min_child_weight': 16, ... | 0.677934 | ... | 0.731954 | 0.683863 | 0.738298 | 0.698984 | 0.662822 | 0.704534 | 0.717376 | 0.706028 | 0.028204 | 14 |
| 2 | 2.811425 | 0.255358 | 0.017312 | 0.000508 | 400 | 32 | 32 | 0.2 | {'n_estimators': 400, 'min_child_weight': 32, ... | 0.826064 | ... | 0.88542 | 0.767914 | 0.885236 | 0.885132 | 0.90014 | 0.860625 | 0.870948 | 0.859266 | 0.038163 | 6 |
| 3 | 1.660806 | 0.164682 | 0.016202 | 0.002833 | 800 | 8 | 2 | 0.4 | {'n_estimators': 800, 'min_child_weight': 8, '... | 0.840929 | ... | 0.905594 | 0.751263 | 0.892737 | 0.906858 | 0.901477 | 0.865407 | 0.87559 | 0.865715 | 0.04477 | 5 |
| 4 | 5.354753 | 0.862188 | 0.020411 | 0.006902 | 800 | 4 | 8 | 0.2 | {'n_estimators': 800, 'min_child_weight': 4, '... | 0.832811 | ... | 0.865989 | 0.808525 | 0.867671 | 0.861132 | 0.893028 | 0.887432 | 0.89718 | 0.855499 | 0.03975 | 7 |


After the search, we retrieve the estimator with the best parameters.

In [15]:
melhor = busca2.best_estimator_
melhor

We apply the function imprime_resultados to see the mean, minimum, and maximum performance of the best estimator.

In [16]:
model2 = XGBRegressor(random_state=0,
                     learning_rate = 0.1,
                     n_estimators=800,
                     max_depth=4,
                     min_child_weight=32
                     )


cv = KFold(n_splits=10, shuffle = True)
results2 = cross_validate(model2, X, y, cv = cv)
imprime_resultados(results2)

Accuracy médio 83.63
Intervalo [37.48, 93.84]


# XGBoost Test

Here we retrain our best model with the full dataset since more data tends to improve the model. 

Then, we use this model to estimate the prices of the houses in the test set. 

Submitting these results to Kaggle gave position 1133 out of a total of 3975.

In [17]:
model2.fit(X, y)

In [18]:
predictions2 = model2.predict(X_test)

In [19]:
sub2 = pd.Series(predictions2, index=X_test['Id'], name='SalePrice')
sub2.shape

(1459,)

In [20]:
sub2.to_csv("Teste_XGBoost.csv", header=True)

# Conclusions

We tested two different models: Random Forest and XGBoost. In 10-fold cross-validation, Random Forest reached a mean score (R², reported above as accuracy) of 84.10% with a range of [66.22%, 90.76%], while XGBoost reached a mean of 83.63% with a range of [37.48%, 93.84%]. It seems that Random Forest has a slightly better mean value and is more stable, although on the test set XGBoost performs slightly better than Random Forest (Kaggle position 1133 versus 1135).
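Because the two cross-validation runs in this notebook shuffle their folds independently, part of the difference between the models may come from the folds themselves. One possible way to compare models on identical folds is sketched below, assuming synthetic data and using scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor` so that the example runs without xgboost installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the house price data.
X_demo, y_demo = make_regression(n_samples=150, n_features=5,
                                 noise=15.0, random_state=0)

# A fixed random_state makes KFold yield the same folds on every call,
# so both models are scored on exactly the same splits.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "random_forest": RandomForestRegressor(n_estimators=30, random_state=0),
    # Stand-in for XGBRegressor (an assumption of this sketch).
    "gradient_boosting": GradientBoostingRegressor(n_estimators=30, random_state=0),
}

summary = {}
for name, estimator in candidates.items():
    scores = cross_validate(estimator, X_demo, y_demo, cv=cv)["test_score"]
    summary[name] = (scores.mean(), scores.std())
    print("%s: mean R2 %.3f, std %.3f" % (name, scores.mean(), scores.std()))
```

The standard deviation across folds gives a rough measure of the stability discussed above.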