
## 1. Predicting housing prices using Regression
Simple example of supervised learning using decision trees and random forests.  From "Introduction to Machine Learning"

In [111]:
import pandas as pd
import numpy as np
from pandas import DataFrame as df
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

In [25]:
home_file_path = '~/Documents/ML_course/train.csv'
home_data = pd.read_csv(home_file_path) 
home_data.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


1. Predicting values using DecisionTreeRegressor, model accuracy using mean_absolute_error

In [73]:
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Define model
home_model = DecisionTreeRegressor()
home_model.fit(X, y)
mean_value = df.mean(home_data.SalePrice)
print("relative error (in sample, no split) (%)", mean_absolute_error(y, home_model.predict(X))/mean_value*100)

relative error (in sample, no split) (%) 0.03446491583955324


Models' practical value $\rightarrow$ making predictions on $\textbf{new}$ data, so:

2. We exclude some data from the model-building process, then use those to test it - dividing the data in train data and test data.


In [74]:
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 0)
# Define model
home_model = DecisionTreeRegressor()
# Fit model
home_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = home_model.predict(test_X)                 
print("relative error (%)", mean_absolute_error(test_y, val_predictions)/mean_value*100)

relative error (%) 18.56042023230685


The error of almost 20% makes this model unusable. Since the "in sample error" is very small, we conclude that our model is overfit.
To improve this movel, we can experiment with:
* choosing different features
* different model types.

First approach:

3. Finding a sweet spot between over/underfitting by controlling the tree depth - number of nodes.

In [110]:
##function to compare Mean Absolute Error (MAE) scores for different numbers of nodes
def get_mae(max_leaf_nodes, train_X, test_X, train_y, test_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(test_X)
    mae = mean_absolute_error(test_y, preds_val)
    return(mae)

def best_nodes():
    for leaf_size in np.arange(2,200,50):
        scores = {leaf_size: get_mae(leaf_size, train_X, test_X, train_y, test_y)}
    return min(scores, key=scores.get)


# Fit the model with best_tree_size. Fill in argument to make optimal size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_nodes(), random_state=1)

# fit the final model
final_model.fit(X, y)
val_predictions = final_model.predict(test_X)                 
print("relative error (%)", mean_absolute_error(test_y, val_predictions)/mean_value*100)
#best_nodes(max_leaf_nodes, train_X, test_X, train_y, test_y)

relative error (%) 7.838130878949268


4. We improved our model by 10% just adjusting the amount of nodes! Now using RandomForestRegressor instead of decision trees, we actually didn't find a better result:


In [116]:
forest_model = RandomForestRegressor(random_state=2)
forest_model.fit(train_X, train_y)
home_preds = forest_model.predict(test_X)
print("relative error (%)",mean_absolute_error(test_y, home_preds)/mean_value*100)

relative error (%) 12.679975565349173
