# **Dataset: Melbourne Housing**

The dataset is a snapshot of Melbourne property sales. Its columns include the address, property type, suburb, selling method, number of rooms, price, real estate agent, date of sale, and distance from the CBD.
(Source: https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot)

**Importing pandas**

In [1]:
import pandas as pd

**Load dataset**

In [2]:
# mount Google Drive so it can be used as data storage
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
# read data melb_data.csv
housing_df = pd.read_csv("/content/gdrive/MyDrive/Data Melbourne/melb_data.csv")
housing_df.head() # show the first 5 rows

|   | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


## **Exploratory Data Analysis (EDA)**

**Identify the shape of the dataset**

In [7]:
housing_df.shape # dimensions of housing_df as (rows, columns)

(13580, 21)

**Get the list of columns**

In [6]:
housing_df.columns # list of column names from housing_df

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

**Data summary**

In [5]:
housing_df.describe() # summary statistics of the numeric columns in housing_df

|   | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


**Case: show the largest land size**

In [8]:
housing_df['Landsize'].max() # largest land size, read directly from the column

433014.0

**Data cleaning**

In [9]:
housing_df = housing_df.dropna() # drop every row that contains at least one missing value

In [12]:
housing_df.shape # shape or dimensionality of housing_df after data cleaning

(6196, 21)
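Dropping rows takes the dataset from 13580 to 6196 rows, so it can help to see where the missing values sit before calling `dropna()`. In the `describe()` output, `Car`, `BuildingArea`, and `YearBuilt` have counts below 13580. The following is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration, not rows from melb_data.csv). It counts missing values per column and compares a full `dropna()` with one restricted to chosen columns.

```python
import pandas as pd

# Toy frame standing in for housing_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, None, 142.0, None],
    "YearBuilt": [1900.0, 1900.0, None, None],
})

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Default dropna(): a row is removed if any of its columns is missing.
cleaned = toy_df.dropna()
print(cleaned.shape)

# Restricting the check to selected columns keeps more rows.
cleaned_subset = toy_df.dropna(subset=["Rooms", "BuildingArea"])
print(cleaned_subset.shape)
```

Whether to drop on all columns or only on the ones used later is a design choice. The notebook drops on all columns, and the sketch only shows the alternative.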

**Choose a prediction target (y)**

In [13]:
y = housing_df['Price'] # the sale price is the prediction target
y

1        1035000.0
2        1465000.0
4        1600000.0
6        1876000.0
7        1636000.0
           ...    
12205     601000.0
12206    1050000.0
12207     385000.0
12209     560000.0
12212    2450000.0
Name: Price, Length: 6196, dtype: float64

**Selecting features (X)**

The features used to predict house prices are Rooms, Bathroom, Landsize, Lattitude, and Longtitude. The last two names are spelled this way in the dataset's own column headers.

In [14]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = housing_df[features]
X

|   | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | -37.80790 | 144.99340 |
| 2 | 3 | 2.0 | 134.0 | -37.80930 | 144.99440 |
| 4 | 4 | 1.0 | 120.0 | -37.80720 | 144.99410 |
| 6 | 3 | 2.0 | 245.0 | -37.80240 | 144.99930 |
| 7 | 2 | 1.0 | 256.0 | -37.80600 | 144.99540 |
| ... | ... | ... | ... | ... | ... |
| 12205 | 3 | 2.0 | 972.0 | -37.51232 | 145.13282 |
| 12206 | 3 | 1.0 | 179.0 | -37.86558 | 144.90474 |
| 12207 | 1 | 1.0 | 0.0 | -37.85588 | 144.89936 |
| 12209 | 2 | 1.0 | 0.0 | -37.85581 | 144.99025 |


In [16]:
X.describe() # data summary of features

|   | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


# **Build Machine Learning Models**

Build machine learning models with a Decision Tree Regressor

**Training and Testing Dataset**

In [39]:
from sklearn.model_selection import train_test_split
# split the data into 70% training data and 30% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, train_size=0.7)
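To make the effect of `train_size=0.7` concrete, here is a small sketch on synthetic data. `X_toy` and `y_toy` are assumptions introduced only for this example. It checks the sizes of the two parts and confirms that no sample lands in both.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 2 features each (values are arbitrary).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Same arguments as in the notebook: 70% for training, fixed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=1
)

print(X_tr.shape, X_te.shape)

# Each target value is unique here, so an empty intersection means no overlap.
overlap = set(y_tr.tolist()) & set(y_te.tolist())
print(len(overlap))
```

Fixing `random_state` makes the split repeatable, so later models are compared on the same test rows.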


In [18]:
X_train.head()

|   | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 10813 | 3 | 2.0 | 81.0 | -37.83631 | 144.95536 |
| 6316 | 3 | 2.0 | 377.0 | -37.7578 | 145.0076 |
| 6881 | 7 | 2.0 | 614.0 | -37.75 | 144.8858 |
| 2005 | 3 | 1.0 | 102.0 | -37.736 | 144.9665 |
| 3984 | 1 | 1.0 | 0.0 | -37.8157 | 144.9727 |


**Importing DecisionTreeRegressor**

In [19]:
from sklearn.tree import DecisionTreeRegressor

**Model configuration and training**

In [21]:
housing_model = DecisionTreeRegressor(random_state=1)
housing_model.fit(X_train, y_train)

DecisionTreeRegressor(random_state=1)

**Importing evaluation metric**

In [23]:
from sklearn.metrics import mean_absolute_percentage_error

**Model evaluation**

In [27]:
y_hat = housing_model.predict(X_test)
mean_absolute_percentage_error(y_test, y_hat) 

0.23899042931701311
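scikit-learn reports MAPE as a fraction rather than a percentage, so the value above corresponds to an average error of roughly 23.9% of the true price. The sketch below writes the metric out by hand to show what is being averaged. The `mape` helper and the prices are made up for illustration, and the helper omits the guard against zero targets that the library version includes.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.25 means 25%)."""
    # Relative error of each prediction, measured against the true value.
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Invented prices, for illustration only.
actual = [1_000_000.0, 500_000.0, 800_000.0]
predicted = [900_000.0, 600_000.0, 800_000.0]

# Relative errors are 0.1, 0.2 and 0.0, so the mean is about 0.1.
score = mape(actual, predicted)
print(score)
```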

**Model optimization**

In [32]:
def get_mape(max_leaf_nodes, X_train, X_test, y_train, y_test):
    """Train a decision tree limited to max_leaf_nodes and return its MAPE on the test data."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mape = mean_absolute_percentage_error(y_test, y_hat)
    return mape

**Comparing MAPE with several values of max_leaf_nodes to find the optimum number of leaves**

In [33]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    leaf_mape = get_mape(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print(f"Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Percentage Error: {leaf_mape}")

Max leaf nodes: 5 	 Mean Absolute Percentage Error: 0.40193485771459436
Max leaf nodes: 50 	 Mean Absolute Percentage Error: 0.2621266250239739
Max leaf nodes: 500 	 Mean Absolute Percentage Error: 0.22911244195276875
Max leaf nodes: 5000 	 Mean Absolute Percentage Error: 0.24502910124021718
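Reading the best setting off the printed lines works for four candidates, but the choice can also be made in code. The sketch below uses the MAPE values copied from the output above. The names `reported_mape` and `best_leaf_nodes` are introduced here for illustration.

```python
# MAPE values copied from the printed output, keyed by max_leaf_nodes.
reported_mape = {
    5: 0.40193485771459436,
    50: 0.2621266250239739,
    500: 0.22911244195276875,
    5000: 0.24502910124021718,
}

# The best candidate is the key with the smallest error.
best_leaf_nodes = min(reported_mape, key=reported_mape.get)
print(best_leaf_nodes, reported_mape[best_leaf_nodes])
```

The error falls from 5 to 500 leaves and rises again at 5000. That pattern is consistent with a tree that is first too simple and then starts to overfit the training data.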


# **Exploring Random Forest Models**

**Building machine learning models with Random Forest Regressor**

**Importing RandomForestRegressor**

In [34]:
from sklearn.ensemble import RandomForestRegressor

**Model training and evaluation**

In [40]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=1)
rf_model.fit(X_train, y_train)
y_hat = rf_model.predict(X_test)
print(f"Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_hat)}")

Mean Absolute Percentage Error: 0.18247748965116836
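A random forest regressor predicts by averaging the predictions of its individual trees, which is why `n_estimators=100` amounts to using 100 decision trees. The sketch below checks this on synthetic data. `X_toy`, `y_toy`, and the formula that generates them are assumptions made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (arbitrary formula plus noise), for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + X_toy[:, 1] ** 2 + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(X_toy, y_toy)

# Predict the first five samples with every tree, then average across trees.
per_tree = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the manual average.
matches = np.allclose(manual_average, forest.predict(X_toy[:5]))
print(len(forest.estimators_), matches)
```

Each tree is trained on a different bootstrap sample, so averaging tends to cancel out the errors of individual trees.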


# **Conclusion**

Building the models started with exploratory data analysis to understand the characteristics of the data, followed by data cleaning. The data was then split into 70% training data and 30% testing data, and two models were trained: a Decision Tree Regressor and a Random Forest Regressor.

For the Decision Tree Regressor, a max_leaf_nodes value of 500 gave the lowest MAPE (about 0.229) among the tested values of 5, 50, 500, and 5000. The Random Forest Regressor, which combines 100 decision trees, reached a lower MAPE of about 0.182 on the same test set, so it outperforms the Decision Tree Regressor with 500 max leaf nodes.