# MAST30034 Project 2 Group 1 Summary Notebook

## 1. Data collection

### Data sources

#### External datasets
- **ABS Census data:** Downloaded from ABS Census DataPacks - General Community Profile. As of 2021.
- **Crime data:** Downloaded from Crime Statistics Agency - Criminal incident by LGA - Year Ending Mar 2025.
- **Population data:** Downloaded from Department of Transport and Planning. VIF2023 Victoria Demographic Projections to 2051.
- **Public transport data:** Downloaded from PTV General Transit Feed Specification (GTFS) Data. As of September 2025.
- **School data:** Downloaded from Department of Education. School Locations 2024.
- **Median property prices:** Downloaded from VIC.GOV.AU Victorian Property Sales Report - Median House by Suburb Quarterly.

#### Given datasets
- **Rental property listings:** Scraped from domain.com.au. As of September 2025.
- **Moving annual median rent by suburb and town:** Victorian Government Department of Families, Fairness and Housing Rental Report. As of March quarter 2025.

**Assumptions and limitations:**

The rental dataset reflects only observed/scraped listings and may not necessarily be representative of all rental properties available within Victoria.

Additionally, using several external datasets (resulting in 57 total features) may introduce potential multicollinearity between features (e.g. number of schools vs suburb population). Thus, the model may capture correlations rather than causal effects. However, we aim to address this through feature selection and feature importance.

Finally, the external datasets span different time periods—some are current (e.g., crime and transport data), while others, such as the ABS Census (last updated in 2021), are less recent. This temporal mismatch may introduce inconsistencies, as demographic or social trends could have shifted in the intervening years. Thus, we assume that the characteristics captured in older datasets remain reasonably stable over time, and any changes since collection are gradual enough not to invalidate their use in modelling current rental prices.

## 2. Dataset building

The external datasets listed above were processed and cleaned so that they could be merged with the rental property listings dataset.

This means that each rental listing has listing-specific attributes (e.g. number of bedrooms, number of schools within 2km) as well as suburb-level attributes (e.g. suburb population, suburb crime index).

This gives us 57 initial attributes/features, which need to be further cleaned, processed, filtered and selected to analyse rental prices.

In [1]:
import pandas as pd
merged_dataset = pd.read_csv("../data/processed/real_estate/vic_rentals_all_enriched.csv")
merged_dataset.columns

Index(['listing_id', 'suburb', 'postcode', 'weekly_rent', 'bond',
       'available_date', 'date_listed', 'days_listed', 'bedrooms', 'bathrooms',
       'carspaces', 'property_type', 'address', 'lat', 'lon', 'photo_count',
       'video_count', 'floorplans_count', 'virtual_tour', 'primary_type',
       'secondary_type', 'agency', 'agent_names', 'land_area',
       'num_metro_bus_stops', 'num_metro_tram_stops', 'num_metro_train_stops',
       'num_regional_bus_stops', 'num_regional_train_stops', 'num_schools_2km',
       'Median_age_persons', 'Median_mortgage_repay_monthly',
       'Median_tot_prsnl_inc_weekly', 'Median_rent_weekly',
       'Median_tot_fam_inc_weekly', 'Average_num_psns_per_bedroom',
       'Median_tot_hhd_inc_weekly', 'Average_household_size',
       'Owner occupied (%)', 'Mortgage (%)', 'Total rented (%)',
       'Other tenure (%)', 'Unemployment', 'post_gradutae (%)',
       'Graduate_diploma_certificate(%)', 'Bachelor (%)',
       'Advanced_&_Diploma (%)', 'Certific

## 3. Feature Engineering

### Handling missing values

With the initial merged dataset, we need to deal with any NaN values in each feature. This follows a 2-step process:

**1. Mean imputation:** We create a lookup dictionary grouped by the property's suburb, property_type, bedrooms and bathrooms, and impute each missing value with the mean stored for its group. Next, we create a relaxed version of this lookup dictionary keyed on 'property_type' and 'bedrooms' only, and use the same pattern to impute more NaNs. This is done for the "weekly_rent" and "carspaces" features, which have among the highest numbers of missing values.

**2. Listwise deletion:** After imputation, significantly fewer missing values remain, so the affected rows are simply dropped.

Note: land_area was dropped, despite being a potentially useful feature, as it had too many missing values (12329 of 12331).

**Assumptions and limitations**
- Assumes values are missing at random and not systematically biased.
- Mean imputation may be too simple a method and may ignore natural variance in the features, although grouping the means through the lookup dictionaries is intended to reduce this.
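
The two-step imputation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it uses `groupby(...).transform("mean")` in place of explicit lookup dictionaries, and the helper name `impute_by_group_mean` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_by_group_mean(df, target, strict_keys, relaxed_keys):
    """Fill NaNs in `target` with the strict group mean, then the relaxed group mean."""
    out = df.copy()
    for keys in (strict_keys, relaxed_keys):
        # Mean of the target within each group, aligned to the original rows
        group_mean = out.groupby(keys)[target].transform("mean")
        out[target] = out[target].fillna(group_mean)
    return out


# Hypothetical toy listings (not project data)
toy = pd.DataFrame({
    "suburb": ["A", "A", "A", "B", "B"],
    "property_type": ["house", "house", "house", "house", "unit"],
    "bedrooms": [3, 3, 3, 3, 1],
    "bathrooms": [2, 2, 2, 2, 1],
    "weekly_rent": [600.0, 700.0, np.nan, np.nan, np.nan],
})

imputed = impute_by_group_mean(
    toy,
    target="weekly_rent",
    strict_keys=["suburb", "property_type", "bedrooms", "bathrooms"],
    relaxed_keys=["property_type", "bedrooms"],
)

# Step 2: listwise deletion of rows that could not be imputed
remaining = imputed.dropna(subset=["weekly_rent"])
print(imputed["weekly_rent"].tolist())
print(len(remaining))
```

In this toy example the third row is filled from its exact group, the fourth row only from the relaxed group, and the last row has no comparable listings, so it is dropped.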

In [2]:
def find_nans(data):
    missing_list = [(col, data[col].isnull().sum()) for col in data.columns]
    non_nans = [(col, cnt) for col, cnt in missing_list if cnt != 0]
    return sorted(non_nans, key=lambda x: x[1], reverse=True)  # sort by missing count, descending

### Outlier detection

These numerical features were considered to determine outliers: ['weekly_rent', 'bedrooms', 'bathrooms', 'carspaces', 'num_metro_bus_stops', 'num_metro_tram_stops', 'num_schools_2km', 'incidents_recorded']

We assume that extreme values are unrepresentative, which should be a valid assumption as there are only 27 rental properties above $3000, which should have little impact on model performance.

We used $3000 as the upper limit because the highest rental price for houses reported on the Victorian Government housing website is $2885 (https://www.housing.vic.gov.au/what-does-rent-cost-victoria).

In [5]:
data = merged_dataset

# Count listings with a weekly_rent of 0
zero_rent_count = (data["weekly_rent"] == 0).sum()
print("Zero rent count:", zero_rent_count)

# Count high-outlier listings, i.e. weekly_rent at or above 3000
highoutlier_rent_count = (data["weekly_rent"] >= 3000).sum()
print("High outlier rent count:", highoutlier_rent_count)

# Count listings with 50 or more bedrooms
high_bedroom_count = (data["bedrooms"] >= 50).sum()
print("High bedroom count:", high_bedroom_count)

Zero rent count: 16
High outlier rent count: 26
High bedroom count: 1


Surprisingly, not many outliers were detected. Removing these outliers gives the following distribution for the numerical values:

In [6]:
# Remove outlier rows; .copy() avoids SettingWithCopyWarning when columns are added later
data = data[(data["weekly_rent"] > 0) & (data["weekly_rent"] <= 3000) & (data["bedrooms"] < 50)].copy()
# Summarise the numerical variables
data[['weekly_rent', 'bedrooms', 'bathrooms', 'carspaces', 'num_metro_bus_stops', 'num_metro_tram_stops', 'num_schools_2km', 'incidents_recorded']].describe()

| | weekly_rent | bedrooms | bathrooms | carspaces | num_metro_bus_stops | num_metro_tram_stops | num_schools_2km | incidents_recorded |
|---|---|---|---|---|---|---|---|---|
| count | 11913.0 | 11913.0 | 11910.0 | 10264.0 | 11913.0 | 11913.0 | 11913.0 | 11913.0 |
| mean | 622.27726 | 2.721061 | 1.588161 | 1.66855 | 62.066734 | 21.067237 | 8.073533 | 13273.937392 |
| std | 249.702662 | 1.080075 | 0.629483 | 0.9347 | 43.156864 | 35.080943 | 4.794813 | 5812.202109 |
| min | 33.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 77.0 |
| 25% | 490.0 | 2.0 | 1.0 | 1.0 | 23.0 | 0.0 | 4.0 | 9525.0 |
| 50% | 560.0 | 3.0 | 2.0 | 2.0 | 66.0 | 0.0 | 8.0 | 13140.5 |
| 75% | 685.0 | 4.0 | 2.0 | 2.0 | 96.0 | 35.0 | 12.0 | 17495.333333 |
| max | 3000.0 | 11.0 | 12.0 | 22.0 | 183.0 | 127.0 | 23.0 | 34620.0 |


Finally, after encoding categorical variables and extracting time features from 'available_date', our dataset is cleaned and ready to be analysed to determine the important features for predicting rental prices.

## 4. Modelling Feature Importance

Models: Random Forest Regressor and XGBoost were selected for their ability to capture complex relationships in data and for their built-in feature importance functions, which help us understand which features were most important in model predictions.

### Feature Engineering and Encoding

Time data was decomposed into day, month and year; day and month were then encoded cyclically to help the model capture potential seasonal changes in rent prices.

Frequency encoding was used for non-numerical features, i.e. postcode, property_type and agency.

In [7]:
#Feature Engineering time data
data['available_date'] = pd.to_datetime(data['available_date'], errors='coerce')
data['available_day'] = data['available_date'].dt.day
data['available_month'] = data['available_date'].dt.month   
data['available_year'] = data['available_date'].dt.year
data = data.drop(columns=['available_date'])


In [9]:
import numpy as np

# Encode month cyclically 
data['month_sin'] = np.sin(data['available_month'] / 12 * 2 * np.pi)
data['month_cos'] = np.cos(data['available_month'] / 12 * 2 * np.pi)
data = data.drop(columns=['available_month'])

#Encode day cyclically
data['day_sin'] = np.sin(data['available_day'] / 31 * 2 * np.pi)
data['day_cos'] = np.cos(data['available_day'] / 31 * 2 * np.pi)
data = data.drop(columns=['available_day'])

#Frequency encoding for non-numerical columns
post_freq = data['postcode'].value_counts(normalize=True)
data['postcode'] = data['postcode'].map(post_freq)
property_freq = data['property_type'].value_counts(normalize=True)
data['property_type'] = data['property_type'].map(property_freq)
agency_freq = data['agency'].value_counts(normalize=True)
data['agency'] = data['agency'].map(agency_freq)

#### Limitations and Assumptions

We assume that postcode, property_type and agency have certain categories (e.g. a common postcode, a popular property type, etc.) that can influence weekly rent prices when using frequency encoding.

Cyclic encoding treats month and day patterns as if they always repeat identically, so training only on 2025 data might bias predictions for a different year. This should not be an issue, however, as this dataset will not be used in forecasting.
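
To illustrate why the months are encoded cyclically, the short sketch below (the helper name `encode_cyclic` is hypothetical) compares December, January and February. On the raw scale December and January are 11 units apart, whereas in (sin, cos) space they are exactly as close as January and February.

```python
import numpy as np


def encode_cyclic(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])


dec, jan, feb = (encode_cyclic(month, 12) for month in (12, 1, 2))

# Euclidean distances between neighbouring months in the encoded space
dist_dec_jan = np.linalg.norm(dec - jan)
dist_jan_feb = np.linalg.norm(jan - feb)

print("Raw gap Dec-Jan:", abs(12 - 1))
print("Encoded distance Dec-Jan:", dist_dec_jan)
print("Encoded distance Jan-Feb:", dist_jan_feb)
```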

### Data leakage

Certain features in the data, such as 'Median_rent_weekly', 'Median_mortgage_repay_monthly' and 'bond', can cause data leakage and hence were removed ('bond' was removed earlier in preprocessing).

In [10]:
#Drop lat and lon, plus the leakage-prone median rent and mortgage columns, before modelling
data = data.drop(columns=['lat', 'lon', 'Median_rent_weekly', 'Median_mortgage_repay_monthly'])

All data was rescaled using scikit-learn's StandardScaler.

In [12]:
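from sklearn.preprocessing import StandardScaler

# Standardise all features (specific columns can be targeted if needed).
# Assumes X_train and X_test already exist from an earlier train/test split,
# which is not shown in this summary notebook.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test = scaler.transform(X_test)        # reuse the training mean and standard deviation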
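
Because the train/test split is not shown in this notebook, the following self-contained sketch reproduces the scaling step on synthetic data. Everything here is an assumption for illustration only: the `*_demo` names, the 80/20 split and the random data are hypothetical and are not the project's actual features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[500, 3], scale=[200, 1], size=(200, 2))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(size=200)

# Assumed 80/20 split; the project's real split is not shown here
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit on the training set only, then apply the same transform to the test set
scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Fitting on the training portion only keeps test-set statistics out of the preprocessing step, which is consistent with the data leakage concerns raised above.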

### Findings

Feature selection was conducted using Mutual Information (MI), as we have 42 features. 20 features were selected from the 42.

Results show that models without feature selection perform significantly better, with an average R² of 0.7 and an average MAE of 70, compared with an average R² of 0.25 and an average MAE of 126.5 for models with feature selection.

Cause: this is likely because MI only captures univariate relationships between a single feature and the target variable, while our data contains strong multivariable interactions; MI therefore cannot account for them and discards useful features.

In [14]:
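from sklearn.feature_selection import mutual_info_regression

# Assumes X (feature DataFrame) and y (weekly_rent target) are already defined;
# their construction is not shown in this summary notebook.

# Compute MI between each feature and the target
mi = mutual_info_regression(X, y, discrete_features="auto", random_state=0)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)

# Keep the k features with the highest MI scores
k = 20
selected_features = mi_scores.head(k).index.tolist()

X_selection = X[selected_features]
X_selection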
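
The univariate limitation of MI suggested above can be illustrated with a small synthetic example. This is a sketch on made-up data, not a reproduction of the project's results: the target depends mainly on an interaction between `x1` and `x2`, yet each of them is individually independent of the target, so MI scores both near zero and ranks the smaller additive effect `x3` first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000
demo_X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})

# Target driven mainly by an x1-x2 interaction, plus a smaller additive effect of x3
demo_y = np.sign(demo_X["x1"]) * np.sign(demo_X["x2"]) + 0.5 * demo_X["x3"]

# MI is computed one feature at a time, so the interaction is invisible to it
demo_mi = pd.Series(
    mutual_info_regression(demo_X, demo_y, random_state=0),
    index=demo_X.columns,
).sort_values(ascending=False)
print(demo_mi)
```

A top-k cut on these scores would discard `x1` and `x2` even though together they explain most of the target, which is one plausible mechanism behind the performance drop reported above.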

#### Model Findings
The most important and strongest indicator variables are **Bedrooms**, **Bathrooms** and **Bachelor (%)**, with secondary drivers such as **education level**, **household income** and **public transport convenience** providing additional context for more accurate model predictions of rental prices.
 

In [15]:
rf_feature_importance = pd.read_csv("../data/outputs/rf_importance_features.csv")
rf_feature_importance.head(10)

| | Feature | Importance(%) for Prediction |
|---|---|---|
| 0 | bathrooms | 21.96 |
| 1 | Bachelor (%) | 16.2 |
| 2 | bedrooms | 13.21 |
| 3 | days_listed | 3.66 |
| 4 | Median_tot_fam_inc_weekly | 2.91 |
| 5 | Median_tot_prsnl_inc_weekly | 2.73 |
| 6 | num_metro_bus_stops | 2.56 |
| 7 | num_metro_tram_stops | 2.55 |
| 8 | agency | 2.16 |
| 9 | postcode | 2.13 |


In [16]:
xgboost_feature_importance = pd.read_csv("../data/outputs/xgboost_importance_features.csv")
xgboost_feature_importance.head(10)

| | Feature | Importance(%) for Prediction |
|---|---|---|
| 0 | bathrooms | 26.63 |
| 1 | bedrooms | 20.99 |
| 2 | Bachelor (%) | 14.36 |
| 3 | Median_tot_fam_inc_weekly | 8.52 |
| 4 | Certificate_level (%) | 3.06 |
| 5 | Median_tot_prsnl_inc_weekly | 2.65 |
| 6 | Graduate_diploma_certificate(%) | 2.46 |
| 7 | num_metro_tram_stops | 2.27 |
| 8 | population_est | 2.18 |
| 9 | carspaces | 1.93 |


#### Limitations and assumptions

While both models agree on the same set of primary features, they differ slightly in the order of importance of secondary drivers. These discrepancies arise from differences in model assumptions and algorithms. Nonetheless, the close agreement on the top predictors indicates that the key drivers identified are reliable.

We assume correlated features don’t overly bias interpretations as models can distribute importance to these variables differently.

## 5. Liveability & affordability



### Components that go into the liveability score of a suburb:
- number of schools
- number of train, bus and tram stops
- crime index, a normalised (0-100) score of total offences recorded within the suburb, where below 20 is considered low crime

To quantify liveability we assigned weights to the components above. We assumed that crime is the most important indicator of a suburb's quality, hence its weight of 0.4. Number of schools is the second most important factor with a weight of 0.3, followed by the number of public transport stops with an overall weight of 0.3.
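
A minimal sketch of how such a weighted score could be computed is shown below. Only the weights (0.4, 0.3, 0.3) come from the text above; the min-max normalisation of school and stop counts, the inversion of the crime index so that lower crime scores higher, and all column and function names are assumptions made for illustration.

```python
import pandas as pd

# Weights from the text; "safety" is the inverted crime index (assumption)
WEIGHTS = {"safety": 0.4, "schools": 0.3, "transport": 0.3}


def min_max(series):
    """Rescale a series to [0, 1]; returns zeros if all values are equal."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


def liveability_score(suburbs):
    """Weighted sum of safety, school and transport components, each in [0, 1]."""
    safety = 1 - suburbs["crime_index"] / 100  # crime_index assumed to be on a 0-100 scale
    schools = min_max(suburbs["num_schools"])
    transport = min_max(suburbs["num_stops"])
    return (
        WEIGHTS["safety"] * safety
        + WEIGHTS["schools"] * schools
        + WEIGHTS["transport"] * transport
    )


# Hypothetical suburbs (not project data)
toy_suburbs = pd.DataFrame(
    {"crime_index": [10, 50, 90], "num_schools": [2, 10, 6], "num_stops": [20, 100, 60]},
    index=["A", "B", "C"],
)
scores = liveability_score(toy_suburbs)
print(scores.sort_values(ascending=False))
```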

### Affordability calculation 

Affordability was quantified as the ratio of weekly rent to median weekly household income in a suburb.
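
As a rough illustration, the ratio can be computed directly from the `weekly_rent` and `Median_tot_hhd_inc_weekly` columns of the merged dataset. The toy values and the `rent_to_income` column name below are hypothetical, and whether the project used listing-level or suburb-level rent is not stated, so this is only a sketch.

```python
import pandas as pd

# Hypothetical suburb-level values (not project data)
toy_rent = pd.DataFrame({
    "suburb": ["A", "B"],
    "weekly_rent": [500.0, 600.0],
    "Median_tot_hhd_inc_weekly": [2000.0, 1500.0],
})

# Share of median weekly household income spent on rent
toy_rent["rent_to_income"] = toy_rent["weekly_rent"] / toy_rent["Median_tot_hhd_inc_weekly"]
print(toy_rent[["suburb", "rent_to_income"]])
```

Under this definition a lower ratio means rent takes up a smaller share of household income.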

The graph below shows the top 10 suburbs with the highest affordability score.

![Top 10 Affordability](/graphs/Most_affordable.png)

The bar chart below shows the 10 suburbs with the highest liveability score.

![Top 10 Livability](/graphs/Most_livable.png)


## 6. Rent Forecasting
The model we used for rent forecasting was a time series model fitted with Auto ARIMA, designed to capture seasonal trends in quarterly rental data from 2000 to 2024. Each suburb and property type was modelled separately to account for local variations in rental dynamics. The model produced five-year forecasts (2025–2029), allowing us to estimate future rent trajectories across different property categories. The results highlight that 2-bedroom houses and flats were the most predictable, while larger properties such as 3- and 4-bedroom dwellings showed greater volatility, likely reflecting lower transaction volumes and higher sensitivity to economic fluctuations. The suburb with the largest predicted growth is Mildura, while the suburb with the lowest is Clayton.



### Findings 

Our model allowed us to forecast the weekly rental price for each quarter of the next 5 years. When we ran the model on a validation set of historical data, it achieved an accuracy of 94-97%, which we consider satisfactory.

![Forecast example](/graphs/forecast.png)

![Forecast example 2](/graphs/forecast_2.png)



### Below is an analysis of our predictions, divided into sections:

1. Average growth of rental prices by suburb, first at the overall level and then computed separately for houses only and apartments only.

Average growth is calculated as

$$\text{average growth} = \frac{\operatorname{avg}(\text{future rent prices}) - \text{latest historical rent price}}{\text{latest historical rent price}}$$
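
As a worked example of the growth calculation, the sketch below uses made-up forecast values and a hypothetical helper name `average_growth`.

```python
def average_growth(future_rents, latest_rent):
    """(mean of forecast rents - latest historical rent) / latest historical rent."""
    avg_future = sum(future_rents) / len(future_rents)
    return (avg_future - latest_rent) / latest_rent


# Hypothetical forecast of weekly rents against a latest observed rent of 500
growth = average_growth([520, 540, 560], latest_rent=500)
print(f"Average growth: {growth:.1%}")
```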

Below are bar charts showing the 10 suburbs with the highest predicted rental price increase: overall, houses only and apartments only.

![Top 10 Overall](/graphs/Overall_top_growth.png)

![Top 10 Houses](/graphs/Houses_top_growth.png)

![Top 10 Apartments](/graphs/Apartments_top_growth.png)


Below are graphs showing a similar analysis for the bottom 10 suburbs.

![Bottom 10 Overall](/graphs/Overall_bottom_growth.png)

![Bottom 10 Houses](/graphs/Houses_bottom_growth.png)

![Bottom 10 Apartments](/graphs/Apartments_bottom_growth.png)

2. Return on Investment analysis, which incorporates current property prices for houses and apartments (extracted from the Victorian government website). The question we answer here is:

### What property type (house/apartment) and in which suburb should I buy to maximize my return on investment (ROI)? 

ROI is calculated as the average rental income over the next 5 years divided by the median property price today.

We are limited by the property price data, which includes only median prices at the suburb level. Therefore we cannot analyse by number of bedrooms and must stay at house/unit granularity.
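
The ROI definition above could be sketched as follows. The text does not state how rental income is annualised, so this example assumes the average forecast weekly rent is multiplied by 52 to give an average annual rental income; the helper name `rental_roi` and the figures are hypothetical.

```python
WEEKS_PER_YEAR = 52  # assumption: weekly rent is annualised over 52 weeks


def rental_roi(forecast_weekly_rents, median_price):
    """Average annual rental income over the forecast horizon / median price today."""
    avg_weekly = sum(forecast_weekly_rents) / len(forecast_weekly_rents)
    return avg_weekly * WEEKS_PER_YEAR / median_price


# Hypothetical forecast weekly rents and median property price
roi = rental_roi([480, 500, 520], median_price=650_000)
print(f"ROI: {roi:.1%}")
```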

The bar chart below shows the 5 investments (property type and suburb) with the highest 5-year return on investment.

![Top 5 ROI](/graphs/ROI_overall.png)


## 7. Overall Limitations

- Temporal mismatches in rental prices: Rental data reflects current listings, while external datasets (e.g., census and population) are lagging indicators

- Sample bias: Scraped domain.com.au dataset may not be representative of the Victorian rental market

- Correlation versus causation: feature importance reflects association with rental prices and does not necessarily indicate a causal effect

## 8. Product Demo: Live Integrated Dashboards

The dashboard is too large to display here; please run final_results.ipynb and open map.html in a browser to view it.

Essentially, this demo product integrates our key analysis across metrics indicative of rental desirability, affordability, liveability and future growth.

Currently, these are visualised independently, as it would not make sense to aggregate these 4 drivers into a single index since they measure different things.

However, depending on a client's needs, this can easily be accomplished by creating a weighted linear combination of the 4 factors.
For example, if a client/investor prioritises future growth, one possible weighting is desirability = 0.1, affordability = 0.2, liveability = 0.2, future growth = 0.5. This results in a much neater dashboard with a single view tailored to the client's needs rather than multiple layers.

As such, since we do not know what priorities potential investors have, the dashboard currently visualises each factor independently.
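
The weighted linear combination described above could be sketched as follows, using the example weights from the text. The function name `composite_score`, the assumption that each factor is already scaled to the 0-1 range, and the example suburb values are hypothetical.

```python
# Example weights from the text for an investor who prioritises future growth
GROWTH_FOCUSED_WEIGHTS = {
    "desirability": 0.1,
    "affordability": 0.2,
    "liveability": 0.2,
    "future_growth": 0.5,
}


def composite_score(factor_scores, weights=GROWTH_FOCUSED_WEIGHTS):
    """Weighted sum of factor scores; assumes every factor is scaled to [0, 1]."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * factor_scores[name] for name in weights)


# Hypothetical normalised scores for one suburb
example_suburb = {
    "desirability": 0.5,
    "affordability": 0.4,
    "liveability": 0.6,
    "future_growth": 0.8,
}
suburb_score = composite_score(example_suburb)
print(round(suburb_score, 2))
```

Swapping in a different weight dictionary would produce a view tailored to a different client priority without changing the underlying factor scores.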

A static screenshot for reference:
![image.png](attachment:image.png)