This notebook summarises the approach, assumptions, limitations, and results of our project.

## 1. Result Summary

#### Objective 1: **Identify key internal and external features that influence rental price**
+ Best internal features (most to least important): number of bathrooms, number of bedrooms, whether the property is furnished, number of parking spaces
+ Best external features (most to least important): distance to CBD, region average rental price, region population density, distance to nearest hospital, region average income

<br>

#### Objective 2: **Identify the top 10 regions with the highest predicted growth rates**
Top 10 regions by predicted growth rate (highest to lowest): Port Melbourne, South Yarra – South, Melbourne CBD – East, Fitzroy, Melbourne CBD – West, Southbank – East, Collingwood, South Yarra – West, South Melbourne, Southbank (West) – South Wharf


<div style='text-align: center;'>
  <img src="markdown-img/top_10_growth.png">
</div>

<br>

#### Objective 3: **Determine the most liveable and affordable region**
+ Top 3 most affordable regions: Alphington – Fairfield, Prahran – Windsor, Seddon – Kingsville
<div style='text-align: center;'>
  <img src="markdown-img/top_3_affordability.png">
</div>

+ Top 3 most livable regions: Melbourne CBD – North, Melbourne CBD – West, Melbourne CBD – East 
<div style='text-align: center;'>
  <img src="markdown-img/top_3_livability.png">
</div>



<br>

#### Recommendations:
+ Investors should look into Buloke, Gannawarra, and Ballarat, as these 3 regions are predicted to have high **rent growth**
+ Landlords should split their larger properties into multiple smaller ones for more profit
+ Landlords should furnish their properties to earn on average $79 more per week. We estimate that they will break even 6-8 months after purchasing the furniture

***

## 2. Assumptions

+ The distribution of data within a region was assumed to be uniform. For example, the centre of Edenhope is assumed to have the same rent as its outer edge
+ Growth rates of all SA2 districts were assumed to be linear
+ Our models are based on risk-free rental prices. For example, the insight that furnishing a property earns a landlord $79 more per week doesn't take into account a tenant potentially damaging the furniture and the cost of replacing it

***

## 3. Approach



### 3.1. Dataset

***
#### 3.1.1.  Scrape property data from [domain.com](domain.com) using `scripts/domain_rent_scrape.py`

We chose to scrape our property data from `Domain.com`. Using a random user agent from the fake-useragent package and a random delay between requests, we successfully scraped 9638 properties, with only 10 failures. For each property, we scraped:
+ address line1
+ address line2
+ suburb
+ state
+ postcode
+ price
+ area
+ type
+ description
+ latitude
+ longitude
+ bed
+ bath
+ parking
+ bond

The result is saved in `data/landing/domain_properties.json`
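
The scraping loop described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/domain_rent_scrape.py`: the `scrape_all` function, the `fetch` callback and the hard-coded `USER_AGENTS` list are all hypothetical stand-ins (the project drew its user agents from the fake-useragent package), and the delay bounds are assumed values.

```python
import random
import time

# Hypothetical pool of User-Agent strings; the project drew these from the
# fake-useragent package instead of a hard-coded list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_4)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def scrape_all(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch every URL with a random User-Agent and a random pause.

    `fetch(url, headers)` is a caller-supplied function (for example a thin
    wrapper around an HTTP client) that returns the parsed property or
    raises an exception on failure.
    """
    results, failed = {}, []
    for url in urls:
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            results[url] = fetch(url, headers)
        except Exception:
            failed.append(url)  # record the failure and keep going
        # Assumed delay bounds; pause after every attempt, success or not.
        sleep(random.uniform(min_delay, max_delay))
    return results, failed
```

Keeping the failed URLs in a separate list makes it easy to report how many properties could not be scraped, or to retry them later.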
***


#### 3.1.2. External datasets: downloaded using `scripts/external_datasets.py`

Datasets from [abs.gov.au](abs.gov.au):
+ Population by SA2 district: `data/landing/SA2_Population.xlsx`
+ Income by SA2 district: `data/landing/SA2_Income.xlsx`
+ SA2 district shapefile: `data/landing/SA2_Borders.zip`, extracted in the same folder

School location data from [education.vic.gov.au](education.vic.gov.au)
+ School location: `data/landing/School_Locations.csv`

Past rental price data from [dffh.vic.gov.au](dffh.vic.gov.au)
+ Rent average: `data/landing/moving_annual_rent.xlsx`

Crime data from [crimestatistics.vic.gov.au](crimestatistics.vic.gov.au)
+ Crime history by region: `data/landing/crime_stat_March_2023.xlsx`
***


#### 3.1.3. Spatial data: retrieved using [Google Map API](https://mapsplatform.google.com/maps-products/)

**Place of interest**: 

This is done in `notebooks/spatial-data/infrastructure.ipynb`.
Google provides $200 of free credit per month per account, which we utilised. The Google Maps Text Search API returns places that match a text description, biased toward a given location. However, each call returns at most 20 places for that location. We therefore split VIC into a grid of cells of roughly 10 km x 10 km and searched for places within a radius of 7071 metres of each cell's centre (half the cell's diagonal, so the search circle covers the whole cell). This approach still occasionally omitted some places, but it was a lot better.

<img src="markdown-img/VIC_splited_grid.png" alt="Map showing VIC grid" width="692" height="422">
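
The grid construction can be sketched as below. This is an illustrative approximation rather than the notebook's code: `grid_centres` is a hypothetical helper, the metres-per-degree constant is a common approximation, and the bounding box in the usage note is a rough assumed value.

```python
import math

CELL_M = 10_000  # side length of each grid cell in metres (from the text)
# Half-diagonal of the cell: the smallest radius whose circle covers the square.
SEARCH_RADIUS_M = round(math.hypot(CELL_M / 2, CELL_M / 2))  # 7071

M_PER_DEG_LAT = 111_320  # approximate metres per degree of latitude


def grid_centres(south, west, north, east, cell_m=CELL_M):
    """Yield (lat, lon) centres of roughly square cells covering a bounding box."""
    lat_step = cell_m / M_PER_DEG_LAT
    lat_edge = south
    while lat_edge < north:
        lat = lat_edge + lat_step / 2
        # A degree of longitude shrinks with latitude, so rescale per row.
        lon_step = cell_m / (M_PER_DEG_LAT * math.cos(math.radians(lat)))
        lon_edge = west
        while lon_edge < east:
            yield lat, lon_edge + lon_step / 2
            lon_edge += lon_step
        lat_edge += lat_step
```

For example, `list(grid_centres(-39.2, 140.9, -33.9, 150.0))` would produce search centres for a rough (assumed) bounding box around VIC; each centre would then be queried with `SEARCH_RADIUS_M` as the search radius.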

Number of places retrieved, by category:
+ Train station: 1983
+ Tram stop: 412
+ Supermarket: 2195
+ Market (this also returns grocery shops): 1509
+ Police station: 756
+ Hospital: 528
+ Shopping centre (some shopping centres include a supermarket): 664

The result is saved in `data/spatial-data/infrastructure.csv`

<br>

**Distance to nearest place of interest**:

After the places of interest are retrieved, the distance from each property to the nearest place of interest is calculated in `notebooks/spatial-data/distance.ipynb`.

We initially used the [OpenRouteService](openrouteservice.org) API to get the distance and duration. However, this took too long, so we decided to use the rest of our Google free credit.

The result is saved in `data/spatial-data/distance_duration.csv`
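
The project obtained travel distance and duration from routing APIs. As a loosely related illustration, the sketch below finds the nearest place by straight-line (haversine) distance, which is a different and cheaper measure that could serve as a pre-filter before calling a routing API. The function names and the `lat`/`lon` keys are assumptions, not taken from the notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_place(prop, places):
    """Return (place, distance_m) for the closest place to prop = (lat, lon)."""
    def dist(place):
        return haversine_m(prop[0], prop[1], place["lat"], place["lon"])

    best = min(places, key=dist)
    return best, dist(best)
```

Straight-line distance ignores the road network, so it can rank places differently from the travel distances stored in `distance_duration.csv`.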

***
<BR>

### 3.2. Preprocess

#### preprocess-raw
`/notebooks/preprocess-raw/population_to_raw.ipynb`:
+ Imported population data from `data/landing/SA2_Population.xlsx`
+ Removed unnecessary columns
+ Changed column names
+ The result is saved to `data/raw/processed_population_data.csv`

`/notebooks/preprocess-raw/annual_rent_to_raw.ipynb`:
+ Imported annual rent data from `data/landing/moving_annual_rent.xlsx`
+ Iterated through columns to extract the month/year and the corresponding count and median
+ Extracted count and median values for the suburbs
+ The result is saved to `data/raw/annual_rent`

`/notebooks/preprocess-raw/properties_domain_to_raw.ipynb`:
+ Imported scraped Domain property data from `data/landing/domain_properties.json`
+ Removed unnecessary columns
+ Converted data to the desired data types
+ The result is saved to `data/raw/raw_domain_properties.csv`

`/notebooks/preprocess-raw/crime_to_raw.ipynb`:
+ Imported crime data from `data/landing/crime_stat_March_2023.xlsx`
+ Removed unnecessary columns
+ Converted data to the desired data types
+ The result is saved to `data/raw/crime.csv`

`/notebooks/preprocess-raw/school_location_to_raw.ipynb`:
+ Imported school location data from `data/landing/School_Locations.csv`
+ Converted all feature names to lowercase
+ Dropped columns containing missing values
+ Converted data types into the correct format
+ Renamed columns for simplicity
+ Visualised the distribution of schools per suburb
+ The result is saved to `data/raw/raw_school_locations.csv`

#### preprocess-curated

`/notebooks/preprocess-curated/annual_rent_to_curated.ipynb`:
+ Imported all annual rent datasets from `data/raw/annual_rent`
+ Dropped property count columns
+ Converted column names to a new format
+ Dropped columns from before 2015
+ Converted data from str to int
+ Calculated the average growth rate across all suburbs
+ Used this growth rate to impute missing values
+ For rows with all values missing, imputed using the column mean
+ The result is saved to the new directory `data/curated/annual_rent/`
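
One possible reading of the imputation steps above is sketched below. The exact scheme used in the notebook is not spelled out here, so the details are assumptions: growth is taken as the mean period-on-period ratio across suburbs, gaps are filled forward (and leading gaps backward) from the nearest known value, and `impute_rents` is a hypothetical name.

```python
from statistics import mean


def impute_rents(table):
    """Fill None entries in {suburb: [rent per period]}.

    Assumed scheme: the mean period-on-period growth ratio across suburbs
    carries a known value forward (or backward) into a gap; rows with no
    observations at all fall back to the column mean.
    """
    n = len(next(iter(table.values())))
    # growth[t] = average of v[t] / v[t-1] over suburbs observed in both periods
    growth = [1.0] * n
    for t in range(1, n):
        ratios = [r[t] / r[t - 1] for r in table.values()
                  if r[t] is not None and r[t - 1] is not None]
        if ratios:
            growth[t] = mean(ratios)

    filled = {}
    for suburb, row in table.items():
        row = list(row)
        if any(v is not None for v in row):
            for t in range(1, n):           # fill gaps after a known value
                if row[t] is None and row[t - 1] is not None:
                    row[t] = row[t - 1] * growth[t]
            for t in range(n - 2, -1, -1):  # fill gaps before the first known value
                if row[t] is None:
                    row[t] = row[t + 1] / growth[t + 1]
        filled[suburb] = row

    # Column means over rows that have data, used for fully missing rows.
    col_means = [mean(r[t] for r in filled.values() if r[t] is not None)
                 for t in range(n)]
    return {s: (r if r[0] is not None else list(col_means))
            for s, r in filled.items()}
```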


`/notebooks/preprocess-curated/properties_domain_to_curated.ipynb`:
+ Imported Domain property data from `data/raw/raw_domain_properties.csv`
+ Ensured correct data types
+ Converted rent to price per week
+ Removed duplicate properties
+ Encoded URLs as IDs
+ Imputed missing bond amounts using the average for that property type and suburb
+ Filled missing values with 0 for bedrooms, bathrooms and parking spots
+ Created a variable indicating whether the property is furnished
+ Aggregated house types into `House`, `Apartment/Flat/Unit` and `Other`
+ Removed outliers with too low or too high a price
+ The result is saved to `data/curated/curated_domain_properties.csv`
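
The bond imputation step can be illustrated with a small, dependency-free sketch. The record layout (dictionaries with `type`, `suburb` and `bond` keys) and the `impute_bond` name are assumptions made for illustration; the notebook itself may implement this differently.

```python
from collections import defaultdict
from statistics import mean


def impute_bond(properties):
    """Fill missing `bond` with the mean bond of the same (type, suburb) group."""
    groups = defaultdict(list)
    for p in properties:
        if p["bond"] is not None:
            groups[(p["type"], p["suburb"])].append(p["bond"])
    group_mean = {key: mean(values) for key, values in groups.items()}

    out = []
    for p in properties:
        p = dict(p)  # copy so the input records are left untouched
        if p["bond"] is None:
            # Stays None when the group has no observed bond at all.
            p["bond"] = group_mean.get((p["type"], p["suburb"]))
        out.append(p)
    return out
```

Groups with no observed bond are left as `None` here; how to handle them (for example falling back to a suburb-wide or overall mean) is a separate decision.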

`/notebooks/preprocess-curated/crime_to_curated.ipynb`:
+ Imported crime data from `data/raw/crime.csv`
+ Changed the date format
+ Grouped crimes by offence division
+ Dropped data from before 2015
+ The result is saved to `data/curated/curated_crime.csv`

`/notebooks/preprocess-curated/school_location_to_curated.ipynb`:
+ Import school location data from `data/raw/raw_school_locations.csv`
+ Drop unnecessary columns
+ Convert 'suburb' values to lowercase
+ The result is saved to `data/curated/curated_school_location.csv`

`/notebooks/preprocess-curated/population_to_curated.ipynb`:
+ Import population data from `data/raw/processed_population_data.csv`
+ Drop data from before 2015, as well as the 'SA2 Code' columns
+ Predict the 2023 population using linear regression
+ Convert everything to lowercase
+ The result is saved to `data/curated/curated_population.csv`
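
The 2023 prediction step amounts to fitting a straight line through the yearly values of a district and extrapolating one step. A minimal sketch using ordinary least squares is shown below; `linear_forecast` is a hypothetical helper and the numbers in the usage note are made up.

```python
def linear_forecast(years, values, target_year):
    """Fit y = intercept + slope * year by least squares and extrapolate."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * target_year
```

For instance, `linear_forecast([2019, 2020, 2021, 2022], [1000, 1050, 1100, 1150], 2023)` extends a perfectly linear (made-up) series by one more step of 50.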


`/notebooks/preprocess-curated/income_to_curated.ipynb`:
+ Import income data from `data/raw/processed_income_data.csv`
+ Predict income for 2020, 2021 and 2022 using the previous year's GDP per capita growth rate
+ Predict 2023 income using linear regression
+ Drop unnecessary columns and convert to lowercase
+ The result is saved to `data/curated/curated_income.csv`


`notebooks/spatial-data/region_to_SA2.ipynb`:
+ Get the bounding box of each region using the Google Maps API
+ Assign each region a weight for each SA2 district based on the area the region shares with that district
+ The result is saved to `data/spatial-data/mapper_matrix.csv`
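
The weighting idea can be sketched with axis-aligned bounding boxes. Everything below is an assumption made for illustration: boxes are `(min_lon, min_lat, max_lon, max_lat)` tuples, areas are computed in squared degrees (a planar approximation), and the weights are normalised to sum to 1, which may not match how `mapper_matrix.csv` was built.

```python
def overlap_area(a, b):
    """Area shared by two boxes given as (min_lon, min_lat, max_lon, max_lat)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def region_weights(region_box, sa2_boxes):
    """Share of the region's total overlap that falls in each SA2 district."""
    areas = {code: overlap_area(region_box, box)
             for code, box in sa2_boxes.items()}
    total = sum(areas.values())
    if total == 0:
        return {}  # the region does not touch any district
    return {code: area / total for code, area in areas.items() if area > 0}
```

A region's value could then be distributed to each SA2 district in proportion to these weights.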

***
<BR>

### 3.3. Merge data

`notebooks/merge_data.ipynb`
+ Map each region's data to SA2 districts, combine it with the curated SA2 district datasets into one file, and save it to `data/curated/sa2_data_by_year.csv`
+ Join the 2023 SA2 data onto the property data and save it to `data/curated/prop_feature.csv`


***

### 3.4. Model
***

#### 3.4.1. Property feature

This model is intended to answer the first objective: **Identify key internal and external features that influence rental price** and is done in `models/full_model.ipynb`


+ To achieve this, we used the data of each property with the following initial features:

Features: id, postcode, type, price (per week), bond, bed, bath, parking, is_furnished, latitude, longitude, sa2, dist_CBD,
dist_public_transport, dist_hospital, dist_police_station,
dist_supermarket, dist_market, dist_shopping_center,
dist_school, dur_CBD, dur_public_transport, dur_hospital,
dur_police_station, dur_supermarket, dur_market,
dur_shopping_center, dur_school, income, population_density,
crime_density, rent (SA2 district average rent)

+ We then performed stepwise selection using OLS to gain an initial understanding and selected a subset of the features:

Features selected, in order of importance: bath, rent, bond, income, bed, crime_density, is_furnished, dist_police_station, dist_hospital, parking, dist_CBD, dist_market, population_density, dist_shopping_center

+ We then fitted a RandomForest model to the data, used grid search to find the best hyperparameters, and extracted the feature importances of the best model

<img src="markdown-img/full_model_feature_importance.png" alt="feature_importance" width=220 height=375>
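
The random forest step could look roughly like the sketch below, which uses scikit-learn's `GridSearchCV` and `RandomForestRegressor` on synthetic data. The parameter grid, the scoring choice, the three feature names used and the synthetic target are all assumptions for illustration; they are not the settings or data from `models/full_model.ipynb`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
feature_names = ["bath", "bed", "dist_CBD"]  # a small subset, for illustration
X = rng.normal(size=(300, 3))
# Synthetic target: the first feature is built to dominate.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Hypothetical hyperparameter grid; the project's grid is not listed here.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# Impurity-based importances of the best model found by the search.
importances = search.best_estimator_.feature_importances_
ranking = sorted(zip(feature_names, importances),
                 key=lambda pair: pair[1], reverse=True)
```

`ranking` then lists features from most to least important, which is the kind of ordering shown in the figure above.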

***

#### 3.4.2. SA2 model

This model aims to answer the second and third objectives:
**Identify the top 10 regions with the highest predicted growth rates**, and
**Determine the most liveable and affordable region**. It is done in `models/sa2_model.ipynb`

+ We first scaled our data and grid searched for the best hyperparameters of AgglomerativeClustering and KMeans, with silhouette_score as the evaluation metric.
+ We then labelled the SA2 districts using Agglomerative Clustering with 3 clusters.
+ For each of the features (average rental price, income, population density, crime density), we fitted a separate linear regression on the data together with the group label.
+ We then predicted the growth of each SA2 district to 2028
+ The growth rate for each year is then calculated with the composite growth index (CGI) as follows:

<div align="center">
  <img src="markdown-img/growth_by_year_formular.png" width=335 height=51>
</div>

<div align="center">
  <img src="markdown-img/CGI_define.png" width=622 height=51>
</div>

+ We then calculated livability and affordability using the metrics below, where AFF is affordability, GD is grocery shop density, PSG is police station density, HD is hospital density, PTD is public transport density, and SCD is shopping centre density

<div style='text-align: center;'>
  <img src="markdown-img/Affordability_Livability_define.png" width=383 height=118>
</div>
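
The clustering model selection described at the start of this section can be sketched as follows, using scikit-learn on synthetic data. The candidate range of cluster counts, the default linkage and the three synthetic blobs are assumptions for illustration; they are not the search space or data from `models/sa2_model.ipynb`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for SA2 features: three well-separated groups.
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centres])
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 6):  # assumed candidate cluster counts
    candidates = {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0),
        "agglomerative": AgglomerativeClustering(n_clusters=k),
    }
    for name, model in candidates.items():
        labels = model.fit_predict(X_scaled)
        # Higher silhouette means tighter, better separated clusters.
        scores[(name, k)] = silhouette_score(X_scaled, labels)

best_model, best_k = max(scores, key=scores.get)
```

The labels from the winning configuration would then serve as the group label in the per-feature regressions.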




## 4. Limitations

+ Due to COVID-19, there was a drop in growth rate in 2021. This anomaly is included in the data behind our model and could skew the prediction of future growth
+ We couldn't find shapefiles for all VIC regions, so we had to approximate the data for SA2 districts using the region bounding boxes and SA2 district bounding boxes
