## Machine Learning Model ##

In [1]:
import pandas as pd

df = pd.read_csv('melb_df.csv')
df

Unnamed: 0.1,Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,...,1.0,0.0,156.0,79.00,1900.0,Yarra,-37.80790,144.99340,Northern Metropolitan,4019.0
1,4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,...,2.0,0.0,134.0,150.00,1900.0,Yarra,-37.80930,144.99440,Northern Metropolitan,4019.0
2,6,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,...,1.0,2.0,120.0,142.00,2014.0,Yarra,-37.80720,144.99410,Northern Metropolitan,4019.0
3,11,Abbotsford,124 Yarra St,3,h,1876000.0,S,Nelson,7/05/2016,2.5,...,2.0,0.0,245.0,210.00,1910.0,Yarra,-37.80240,144.99930,Northern Metropolitan,4019.0
4,14,Abbotsford,98 Charles St,2,h,1636000.0,S,Nelson,8/10/2016,2.5,...,1.0,2.0,256.0,107.00,1890.0,Yarra,-37.80600,144.99540,Northern Metropolitan,4019.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6191,19732,Whittlesea,30 Sherwin St,3,h,601000.0,S,Ray,29/07/2017,35.5,...,2.0,1.0,972.0,149.00,1996.0,Whittlesea,-37.51232,145.13282,Northern Victoria,2170.0
6192,19733,Williamstown,75 Cecil St,3,h,1050000.0,VB,Williams,29/07/2017,6.8,...,1.0,0.0,179.0,115.00,1890.0,Hobsons Bay,-37.86558,144.90474,Western Metropolitan,6380.0
6193,19734,Williamstown,2/29 Dover Rd,1,u,385000.0,SP,Williams,29/07/2017,6.8,...,1.0,1.0,0.0,35.64,1967.0,Hobsons Bay,-37.85588,144.89936,Western Metropolitan,6380.0
6194,19736,Windsor,201/152 Peel St,2,u,560000.0,PI,hockingstuart,29/07/2017,4.6,...,1.0,1.0,0.0,61.60,2012.0,Stonnington,-37.85581,144.99025,Southern Metropolitan,4380.0


In [2]:
y = df['Price'] # the target (dependent variable)
X = df[['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']] # independent variables

### Implement the Random Forest Algorithm 

In [3]:
from sklearn.ensemble import RandomForestRegressor

rf_prediction_model = RandomForestRegressor() # declare the algorithm object
rf_prediction_model.fit(X,y) # fit the model. (independent_variables, target_variable)

RandomForestRegressor()

#### Check model's output

In [4]:
print('input samples:')
print(X.head())
print('Actual prices')
print(y.head())
print('model`s output')
print(rf_prediction_model.predict(X.head()))

input samples:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     156.0   -37.8079    144.9934
1      3       2.0     134.0   -37.8093    144.9944
2      4       1.0     120.0   -37.8072    144.9941
3      3       2.0     245.0   -37.8024    144.9993
4      2       1.0     256.0   -37.8060    144.9954
Actual prices
0    1035000.0
1    1465000.0
2    1600000.0
3    1876000.0
4    1636000.0
Name: Price, dtype: float64
model`s output
[1060955.  1359075.  1524178.5 1719845.  1509950. ]


At this point, the model is imprecise because it overfits and its parameters have not been tuned. Let's work on improving its precision, but first, let's measure the MAE (mean absolute error):

## MAE

In [5]:
from sklearn.metrics import mean_absolute_error

prediction = rf_prediction_model.predict(X)
mean_absolute_error(y, prediction)

69321.3416169162

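The MAE output indicates that the model is off by about $69,321 on average. Note that this figure is measured on the same data the model was trained on, so it is optimistic. To get a more honest estimate, we are going to split the data into two parts: a training set used to fit the model, and a validation set used only to measure the error.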
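
To make the metric concrete, here is a minimal sketch that computes the MAE by hand for the five rows printed earlier, using the actual prices and the model outputs copied from that cell. The variable names are illustrative and not part of the notebook.

```python
# Actual prices and model outputs for the first five rows, copied from the earlier cell
actual = [1035000.0, 1465000.0, 1600000.0, 1876000.0, 1636000.0]
predicted = [1060955.0, 1359075.0, 1524178.5, 1719845.0, 1509950.0]

# MAE is the mean of the absolute differences between actual and predicted values
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae_first_five = sum(abs_errors) / len(abs_errors)

print(abs_errors)
print(mae_first_five)
```

For these five rows the average error comes to roughly $97,981, which is the same calculation that mean_absolute_error performs over every row it is given.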

## Validation

In [6]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X,y) # four variables take the four values returned by the method.

rf_prediction_model = RandomForestRegressor() #declare another algorithm object
rf_prediction_model.fit(train_X,train_y) # fit the new predictor just with the training data

# run the predictor on the validation data to measure the MAE
val_prediction = rf_prediction_model.predict(val_X)
print(mean_absolute_error(val_y, val_prediction)) # MAE on rows the model did not see during fitting

192064.63707122873
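
A validation MAE that is much higher than the MAE measured on the training data is a common sign of overfitting. The sketch below reproduces that kind of gap on synthetic data (not the Melbourne dataset) and fixes random_state so that the split and the forest are repeatable. The data-generating formula and every variable name here are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a noisy linear target (assumption, for illustration only)
rng = np.random.default_rng(0)
demo_X = rng.uniform(size=(300, 3))
demo_y = 10 * demo_X[:, 0] + rng.normal(scale=2.0, size=300)

# random_state makes the split repeatable; the default test_size holds out 25% of the rows
tr_X, va_X, tr_y, va_y = train_test_split(demo_X, demo_y, random_state=0)

demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(tr_X, tr_y)

# Error on rows the model was fitted on vs. rows it has never seen
train_mae = mean_absolute_error(tr_y, demo_model.predict(tr_X))
val_mae = mean_absolute_error(va_y, demo_model.predict(va_X))
print(f"training MAE: {train_mae:.3f}")
print(f"validation MAE: {val_mae:.3f}")
```

Without a fixed random_state, both the split and the forest change on every run, so MAE values from different cells are not strictly comparable.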


To refine the algorithm's performance, we need to determine which maximum number of leaf nodes delivers the best result. To do this, we iterate over a list of candidate max_leaf_nodes values, fit the model with each one, and measure the validation MAE. The value that returns the lowest MAE is the best choice for fitting the model. A get_mae function keeps the loop short.

In [7]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = RandomForestRegressor(max_leaf_nodes = max_leaf_nodes) # create a model specifying the max_leaf_nodes parameter
    model.fit(train_X, train_y)
    val_prediction = model.predict(val_X)
    mae = mean_absolute_error(val_y, val_prediction)
    return mae

# start iteration

for max_leaf_nodes in [55,550,5500,5550]:
    mae_value = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f'Max leaf nodes: {max_leaf_nodes}............MAE: {mae_value}')

Max leaf nodes: 55............MAE: 234978.72931908243
Max leaf nodes: 550............MAE: 194484.30262729805
Max leaf nodes: 5500............MAE: 190607.68603594744
Max leaf nodes: 5550............MAE: 191804.16743275232


As we can see, the lowest MAE is obtained with 5500 leaf nodes. The next cell tries a nearby value, 5555.
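
Rather than reading the winner off the printed lines, the selection can be done in code. This is a small sketch that uses the MAE values copied from the output above; the dictionary and variable names are illustrative.

```python
# Validation MAE per max_leaf_nodes value, copied from the output above
mae_by_leaf_nodes = {
    55: 234978.72931908243,
    550: 194484.30262729805,
    5500: 190607.68603594744,
    5550: 191804.16743275232,
}

# min() with key=dict.get returns the key whose MAE is smallest
best_max_leaf_nodes = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print(best_max_leaf_nodes)
```

One caveat worth keeping in mind: if the table at the top is the full dataset (its index runs to 6194), the training split has roughly 4,600 rows, and a tree cannot have more leaves than training rows. Both 5500 and 5550 therefore probably leave the trees unconstrained, and the small difference between them is likely random variation rather than a real effect.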

In [8]:
try_model = RandomForestRegressor(max_leaf_nodes = 5555)
try_model.fit(train_X,train_y)

print("input data")
print(val_X.head())
print("actual values")
print(val_y.head())
print("output")
out = try_model.predict(val_X.head())
print(out)
print("MAE value")
print(mean_absolute_error(val_y.head(), out))

input data
      Rooms  Bathroom  Landsize  Lattitude  Longtitude
5690      4       3.0     153.0  -37.83591   144.94963
1180      3       2.0     118.0  -37.78630   145.10600
2649      1       1.0    1768.0  -37.85100   144.98800
2310      3       2.0     472.0  -37.93870   145.04610
1523      3       2.0     112.0  -37.80010   144.87970
actual values
5690    1800000.0
1180     746000.0
2649     492000.0
2310     917000.0
1523     735000.0
Name: Price, dtype: float64
output
[2280217.          938286.33333333  474103.         1056570.
  650910.        ]
MAE value
182812.06666666665


Since the result still fluctuates with a high error margin, we should use another approach. The GridSearchCV class is designed to search for the best values of an algorithm's parameters, which can improve the model's overall performance.

In [9]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid (all entries are commented out here, so the search runs with an empty grid)
param_grid = {
#     'n_estimators': [300,400,500,600],
#     'max_depth': [40,45,50,55,60,65,70,75],
#     'min_samples_split': [10,15,20],
#     'min_samples_leaf': [1, 2, 4, 5],
#     'max_features': ['auto', 'sqrt', 'log2'],
#     'max_leaf_nodes': [5550,5555,5600,6000]
}

# Create a RandomForestRegressor instance
rf = RandomForestRegressor()

# Use GridSearchCV to find the best parameters
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='neg_mean_absolute_error',n_jobs=-1)
grid_search.fit(train_X,train_y)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

#results: 
#'n_estimators': 280
#'max_depth': 50
#'min_samples_split': 10
#'min_samples_leaf': 2
#'max_features': 'sqrt'
#'max_leaf_nodes': 5555


Best parameters found:  {}
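
With every entry commented out, the grid is empty, which is why the search reports {}. For comparison, here is a minimal sketch of the same search with a populated grid. It runs on synthetic data (not the Melbourne dataset), and the grid values, data, and variable names are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, for illustration only)
rng = np.random.default_rng(0)
demo_X = rng.uniform(size=(200, 3))
demo_y = 10 * demo_X[:, 0] + rng.normal(scale=2.0, size=200)

# A deliberately tiny grid: 2 x 2 = 4 candidate settings
demo_grid = {
    "max_leaf_nodes": [10, 50],
    "min_samples_leaf": [1, 5],
}

demo_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    demo_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
demo_search.fit(demo_X, demo_y)

# best_score_ is the negated MAE, so flip the sign to read it as an error
print("Best parameters found:", demo_search.best_params_)
print("Cross-validated MAE:", -demo_search.best_score_)
```

With the default refit=True, best_estimator_ is already refitted on all the data passed to fit, so it can be used for prediction without another call to fit.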


In [10]:
# Train the model with the best parameters
# best_rf = grid_search.best_estimator_
# best_rf.fit(train_X, train_y)

# or recreate the model manually:
best_rf = RandomForestRegressor(n_estimators=280,
                                max_depth=50,
                                min_samples_split=10,
                                min_samples_leaf=2,
                                max_features='sqrt',
                                max_leaf_nodes=5555)
best_rf.fit(train_X,train_y)

print("input data")
print(val_X.head())
print("actual values")
print(val_y.head())
print("output")
out = best_rf.predict(val_X.head())
print(out)
print("MAE value")
print(mean_absolute_error(val_y.head(), out))

input data
      Rooms  Bathroom  Landsize  Lattitude  Longtitude
5690      4       3.0     153.0  -37.83591   144.94963
1180      3       2.0     118.0  -37.78630   145.10600
2649      1       1.0    1768.0  -37.85100   144.98800
2310      3       2.0     472.0  -37.93870   145.04610
1523      3       2.0     112.0  -37.80010   144.87970
actual values
5690    1800000.0
1180     746000.0
2649     492000.0
2310     917000.0
1523     735000.0
Name: Price, dtype: float64
output
[2024757.023561    974725.46413789  473014.54354695 1166446.2719341
  783653.15556782]
MAE value
154113.47433077352
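
Note that this MAE is computed on val_y.head(), that is, on five rows only, so it is not directly comparable with the MAE measured earlier over the whole validation set. The sketch below uses synthetic data (an assumption, not the Melbourne dataset) to show how far a five-row MAE can drift from the full validation MAE. All names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, for illustration only)
rng = np.random.default_rng(1)
demo_X = rng.uniform(size=(400, 3))
demo_y = 10 * demo_X[:, 0] + rng.normal(scale=2.0, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(demo_X, demo_y, random_state=1)

demo_model = RandomForestRegressor(n_estimators=50, random_state=1).fit(tr_X, tr_y)
va_pred = demo_model.predict(va_X)

# MAE over the whole validation set (100 rows here)
full_mae = mean_absolute_error(va_y, va_pred)

# MAE over consecutive five-row slices of the same validation set
chunk_maes = [
    mean_absolute_error(va_y[i:i + 5], va_pred[i:i + 5])
    for i in range(0, len(va_y), 5)
]
print(f"full validation MAE: {full_mae:.3f}")
print(f"five-row MAE range: {min(chunk_maes):.3f} to {max(chunk_maes):.3f}")
```

In the notebook, calling mean_absolute_error(val_y, best_rf.predict(val_X)) would give a figure that can be compared with the earlier full-validation results.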


## Conclusions
In this case, the best performance for the random forest, among the configurations tried, was achieved with the parameters:
n_estimators=280,
max_depth=50,
min_samples_split=10,
min_samples_leaf=2,
max_features='sqrt',
max_leaf_nodes=5555

In [22]:


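# refit the tuned model on the full dataset before using it for predictions
best_rf.fit(X,y)

def rf_predictor(rooms, bathroom, landsize, lattitude, longtitude):
    # wrap the single sample in a 2D list, since predict expects one row per sample
    out = best_rf.predict([[rooms, bathroom, landsize, lattitude, longtitude]])
    return out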
print("input data")
print(X.head())
print("actual values")
print(y.head())
print("output")
# out = best_rf.predict([[2,1.0,156.0,-37.8079,144.9934]])
# print(out)
print(rf_predictor(2,1.0,156.0,-37.8079,144.9934))
print("MAE value")
# print(mean_absolute_error(y.head(), out))

input data
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     156.0   -37.8079    144.9934
1      3       2.0     134.0   -37.8093    144.9944
2      4       1.0     120.0   -37.8072    144.9941
3      3       2.0     245.0   -37.8024    144.9993
4      2       1.0     256.0   -37.8060    144.9954
actual values
0    1035000.0
1    1465000.0
2    1600000.0
3    1876000.0
4    1636000.0
Name: Price, dtype: float64
output
[1081834.7670911]
MAE value
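
The predictor above returns a one-element array and passes a bare list to a model that was fitted on a DataFrame, which recent scikit-learn versions may flag with a feature-name warning. A possible variant is sketched below: it builds a one-row DataFrame with the training column names and returns a plain float. The names predict_price, FEATURES, and demo_rf are hypothetical, and the model is fitted on just the five rows shown above so that the sketch runs on its own.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["Rooms", "Bathroom", "Landsize", "Lattitude", "Longtitude"]

# Five rows copied from the notebook output, used only so the sketch is runnable
demo_X = pd.DataFrame(
    [
        [2, 1.0, 156.0, -37.8079, 144.9934],
        [3, 2.0, 134.0, -37.8093, 144.9944],
        [4, 1.0, 120.0, -37.8072, 144.9941],
        [3, 2.0, 245.0, -37.8024, 144.9993],
        [2, 1.0, 256.0, -37.8060, 144.9954],
    ],
    columns=FEATURES,
)
demo_y = pd.Series([1035000.0, 1465000.0, 1600000.0, 1876000.0, 1636000.0], name="Price")

demo_rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(demo_X, demo_y)

def predict_price(model, rooms, bathroom, landsize, lattitude, longtitude):
    """Return a single predicted price as a plain float."""
    # A one-row DataFrame with the training column names keeps the feature order explicit
    row = pd.DataFrame([[rooms, bathroom, landsize, lattitude, longtitude]], columns=FEATURES)
    return float(model.predict(row)[0])

price = predict_price(demo_rf, 2, 1.0, 156.0, -37.8079, 144.9934)
print(price)
```

Passing the model as an argument, instead of reading a global, makes the helper easier to reuse with a different fitted estimator.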


