# Supervised Learning - House Price Prediction



## Introduction

Vietnam is a developing country in Southeast Asia with a population of about 100 million people, and it has been one of the fastest-growing economies in recent years. Along with rapid economic development and urbanization, housing prices have also increased rapidly and become too expensive for most people. Housing prices are one of the issues of greatest concern to many families.

In this assignment, I'm going to build a simple supervised learning model to predict housing prices in Ho Chi Minh city, the economic hub and largest city of Vietnam. For price prediction, I'm going to use sklearn regression methods. Various analytical techniques are also used to explore and clean the data.

### About the dataset:

I will be using the **'House Price Prediction Dataset Vietnam - 2024'** dataset, downloaded from Kaggle (https://www.kaggle.com/datasets/nguyentiennhan/vietnam-housing-dataset-2024/data), for this study.

According to the dataset description from Kaggle's website: This dataset contains information about various housing properties in Vietnam. It includes detailed attributes of each property, such as its location, physical characteristics, and legal and furnishing status, along with the price. 

The data was crawled from batdongsan.vn, which is one of the largest real estate listing websites in Vietnam.

Features:
* Address: The complete address of the property, including details such as the project name, street, ward, district, and city.
* Area: The total area of the property, measured in square meters.
* Frontage: The width of the front side of the property, measured in meters.
* Access Road: The width of the road providing access to the property, measured in meters.
* House Direction: The cardinal direction the front of the house is facing (e.g., East, West, North, South).
* Balcony Direction: The cardinal direction the balcony is facing.
* Floors: The total number of floors in the property.
* Bedrooms: The number of bedrooms in the property.
* Bathrooms: The number of bathrooms in the property.
* Legal Status: Indicates the legal status of the property, such as whether it has a certificate of ownership or is under a sale contract.
* Furniture State: Indicates the state of furnishing in the property, such as fully furnished, partially furnished, or unfurnished.

Target: 
* Price: The price of the property, represented in billions of Vietnamese Dong (VND).

### GitHub repository

[https://github.com/dongndp/csca-5622](https://github.com/dongndp/csca-5622)

In [None]:
# type: ignore
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Initial Exploration & Data Cleaning

In [None]:
# Loading data
df = pd.read_csv('data/vietnam_housing_dataset.csv')
# Replace all spaces in column names with underscores
df.columns = df.columns.str.replace(" ", "_")
# Display information about the dataset
df.info()

First observations: The dataset contains 30229 samples, 11 features and 1 target variable. Among the 11 features, there are 6 numeric and 5 categorical features. 
There are many features with missing values. Let's look at a few rows.

In [None]:
# Listing first few rows
df.head()

In [None]:
# Let's see more detail about address
for i in range(5):
    print(df['Address'][i])

From the result, the Address field is a combination of the construction project name, building, street, town/ward, city/district and city/province, separated by commas (,). Since I only want to predict house prices in Ho Chi Minh city, I'm going to remove all entries located elsewhere, create a new District column from the address, and then drop the original Address column as I no longer need it.

In [None]:
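# Keep only listings whose last address component is Ho Chi Minh city
city = df['Address'].str.split(", ").str[-1].str.replace(".", "", regex=False)
df = df[city == "Hồ Chí Minh"].copy()  # .copy() avoids SettingWithCopyWarning below
# The second-to-last address component is the district
df['District'] = df['Address'].str.split(", ").str[-2].str.replace(".", "", regex=False)
df.drop('Address', inplace=True, axis=1)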
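To make the address parsing concrete, here is a minimal sketch on three made-up addresses (the project, street and ward names are placeholders, not rows from the dataset). It assumes the city is always the last comma-separated component and the district the second-to-last, which is what the cell above relies on.

```python
import pandas as pd

CITY = "Hồ Chí Minh"

# Hypothetical addresses that mimic the comma-separated layout described above
sample = pd.DataFrame({
    "Address": [
        f"Dự án A, Đường B, Phường C, Quận 7, {CITY}",
        f"Đường D, Xã E, Củ Chi, {CITY}.",      # trailing period on purpose
        "Đường F, Phường G, Ba Đình, Hà Nội",   # outside the target city
    ]
})

# Split once, then read components from the end of each list
parts = sample["Address"].str.split(", ")
city = parts.str[-1].str.replace(".", "", regex=False)
district = parts.str[-2].str.replace(".", "", regex=False)

# Keep only Ho Chi Minh city rows and attach the district
hcm = sample[city == CITY].copy()
hcm["District"] = district.loc[hcm.index]
print(hcm[["District"]])
```

Reading components from the end of the list keeps the logic independent of how many leading parts (project, building, street) a given listing has.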

In [None]:
# Now I want to look at 'District'
df['District'].value_counts()

I'm going to drop all districts that have fewer than 10 entries. I believe the district names of these entries were not entered correctly.

In [None]:
min_district_samples = 10

unique_districts = df['District'].value_counts()
unique_districts = unique_districts[unique_districts >= min_district_samples].index
df = df[df['District'].isin(unique_districts)]

In [None]:
# Check null/missing values
df.isnull().sum()

In [None]:
# Inspect house direction feature
df['House_direction'].value_counts()

In [None]:
# Check balcony direction
df['Balcony_direction'].value_counts()

House direction and Balcony direction are categorical features. They indicate the cardinal direction the house or balcony faces, such as East (Đông), West (Tây), South (Nam), North (Bắc), South-East (Đông Nam), North-West (Tây Bắc), etc. While orientation is one of the factors that influence buyers' decisions, I assume it has little or no impact on home prices. Since both features also have many missing values, I decided to drop them.

In [None]:
# Drop House/Balcony direction features
df.drop('House_direction', inplace = True, axis = 1)
df.drop('Balcony_direction', inplace = True, axis = 1)

It does not make sense for a house or apartment to have zero floors, bedrooms or bathrooms, or no access road, so missing values cannot simply be treated as zero. Instead, for each numeric feature I'm going to fill the missing values with the median, and for each categorical feature I'm going to fill them with the mode.

In [None]:
# Fill missing values
for col in ['Floors', 'Frontage', 'Access_Road', 'Bedrooms', 'Bathrooms']: 
    fill_value = df[col].median()
    df.fillna({col: fill_value}, inplace=True)

for col in ['Legal_status', 'Furniture_state']:
    fill_value = df[col].mode()
    df.fillna({col: fill_value[0]}, inplace=True)
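As a small illustration of the two fill rules, the sketch below applies them to a made-up frame (the values and the category labels are invented, not taken from the dataset). The median is less sensitive to extreme values than the mean, which is one reason to prefer it here.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Bedrooms": [2.0, 3.0, np.nan, 3.0, 10.0],
    "Legal_status": ["Have certificate", None, "Have certificate", "Sale contract", None],
})

bedrooms_median = toy["Bedrooms"].median()   # NaN values are skipped by default
legal_mode = toy["Legal_status"].mode()[0]   # mode() returns a Series; take the first entry

# Fill each column with its own replacement value
filled = toy.fillna({"Bedrooms": bedrooms_median, "Legal_status": legal_mode})
print(filled)
```

In this toy frame the single value of 10 bedrooms pulls the mean up to 4.5, while the median stays at 3.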

In [None]:
#Reset index
df.reset_index(drop=True, inplace=True)
df.info()

### Summary of Initial Exploration and Data Cleaning

The original dataset contains 30,227 samples and 11 features. Many entries have null values.

What I have done so far:
* Remove all house entries which are not located in Ho Chi Minh city since I want to predict the housing prices in Ho Chi Minh city only.
* Remove features that contain too many null values and are not worth keeping, namely the House and Balcony directions.
* Fill all remaining missing values with the median (for numerical features) or the mode (for categorical features).

After all of the above cleaning tasks, I have a clean dataset of 11,751 house listings in Ho Chi Minh city, each with 9 features: Area, Frontage, Access Road, Floors, Number of Bedrooms, Number of Bathrooms, Legal status, Furniture state, and the District where it is located.

## Exploratory Data Analysis


By keeping a District feature above, I implicitly assumed that house prices vary by district. I'm going to check whether the data supports this hypothesis.

In [None]:
# Draw a box plot to see if house prices vary by district
fig, ax = plt.subplots(figsize=(10, 4))
fig.suptitle('Price vs District')
sns.boxplot(x='District', y='Price', data=df, ax=ax)
ax.set(ylabel='Price (Billion VND)', xlabel='District')
plt.xticks(rotation=45, ha='right')
plt.show()

From the box plot above, I can conclude that the data supports my assumption. Prices of houses located downtown (Quận 1) or in new areas such as Quận 2 tend to be higher than prices of houses located in suburbs such as Củ Chi or Bình Chánh.

The plot also shows some outliers. I assume that these are true outliers and I'm not going to remove them.

### Price Distribution

Now I want to see what the distribution of prices looks like.

In [None]:
# Display the distribution of prices
fig, ax = plt.subplots(figsize=(10, 4))
fig.suptitle('House price distribution')
sns.histplot(df['Price'], kde=True, ax=ax)
ax.set(xlabel='Price (Billion VND)', ylabel='Count')
plt.show()

The distribution of prices looks a little bit skewed.
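The visual impression of skew can also be quantified. The sketch below is only an illustration on synthetic data: the log-normal sample is an assumed stand-in for `df['Price']`, chosen because it is skewed by construction. On the real data the same two checks would be run on the Price column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['Price']; log-normal values are right-skewed by construction
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=1.5, sigma=0.5, size=5000), name="Price")

skewness = prices.skew()   # sample skewness; > 0 means a longer right tail
mean_price = prices.mean()
median_price = prices.median()

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_price:.2f}, median: {median_price:.2f}")
```

A positive skewness together with a mean above the median indicates a right tail of expensive listings; values near zero would indicate a roughly symmetric distribution.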

### Correlation Heatmap

Let's view the correlation heatmap.

In [None]:
# Select numerical columns and display the correlation heatmap
df1 = df.select_dtypes(include=[np.number])
plt.figure(figsize=(10, 4))
sns.heatmap(df1.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

The heatmap shows that Bathrooms, Bedrooms, Floors and Area are correlated with the target variable. It also shows that several features are correlated with each other: Bathrooms and Bedrooms strongly (0.80), Bathrooms and Floors (0.61), and Bedrooms and Floors (0.54).
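The same information can be read as a ranked list rather than a heatmap. The following sketch uses synthetic columns whose relationships are invented for illustration (Bathrooms tied to Bedrooms, Frontage unrelated to Price); it is not a reproduction of the values above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns; all relationships here are made up
rng = np.random.default_rng(42)
n = 500
bedrooms = rng.integers(1, 6, size=n)
bathrooms = bedrooms + rng.integers(-1, 2, size=n)   # closely tied to bedrooms
frontage = rng.uniform(3, 10, size=n)                # independent of price
price = 1.0 + 0.8 * bedrooms + rng.normal(0, 1.0, size=n)

toy = pd.DataFrame({"Bedrooms": bedrooms, "Bathrooms": bathrooms,
                    "Frontage": frontage, "Price": price})

corr = toy.corr()  # Pearson correlation by default
# Correlation of each feature with the target, strongest (by absolute value) first
with_target = corr["Price"].drop("Price").sort_values(key=abs, ascending=False)
print(with_target.round(2))
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.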

### Summary of EDA

What I've found so far:

* House prices vary by area: houses located downtown or in new areas are priced higher than those in the suburbs.
* The distribution of house prices is a bit skewed.
* Bathrooms, Bedrooms, Floors, and Area are correlated with the price, but not strongly, while Frontage isn't.
* Bathrooms and Bedrooms are strongly correlated; Bathrooms and Floors, and Bedrooms and Floors, are also correlated with each other.

## Models

### Convert categorical variables to numerical variables and prepare train and test dataset

In [None]:
from sklearn.model_selection import train_test_split

# Converts categorical variables into dummy/indicator variables
df = pd.get_dummies(df, drop_first=True, dtype=int)

# split into train and test dataset
X = df.drop(columns=['Price'])
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
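To show what `pd.get_dummies(..., drop_first=True)` does to a categorical column, here is a minimal sketch on a made-up three-row frame. The district names are used only as example labels.

```python
import pandas as pd

# Tiny made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    "Area": [50.0, 80.0, 65.0],
    "District": ["Quận 1", "Quận 2", "Bình Chánh"],
})

# Each category becomes a 0/1 column; drop_first removes the first category,
# which then acts as the baseline (all indicator columns equal to 0)
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Dropping one category avoids perfectly collinear indicator columns, which matters for the linear regression model but is not required by tree-based models.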

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

lr_model = LinearRegression().fit(X_train, y_train)
lr_score = lr_model.score(X_test, y_test)
lr_y_pred = lr_model.predict(X_test)
lr_rmse = root_mean_squared_error(y_test, lr_y_pred)
lr_residuals = y_test - lr_y_pred
print(f"Score: {lr_score:.2f}, RMSE: {lr_rmse:.2f}")
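For a scikit-learn regressor, `score` returns the coefficient of determination $R^2 = 1 - SS_{res} / SS_{tot}$, and RMSE is the square root of the mean squared residual. The sketch below computes both by hand on made-up actual and predicted prices, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted prices (billion VND)
y_true = np.array([2.0, 3.5, 5.0, 7.5, 4.0])
y_pred = np.array([2.5, 3.0, 5.5, 6.5, 4.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"R^2: {r2:.2f}, RMSE: {rmse:.2f}")
print(f"sklearn r2_score: {r2_score(y_true, y_pred):.2f}")
```

RMSE is expressed in the units of the target (billion VND here), while $R^2$ is unitless and compares the model against always predicting the mean price.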

### RandomForest Regressor

The linear regression model gave a score of 0.52 and an RMSE of 1.44. Next, I'm going to use a RandomForest Regressor to see if it can achieve a better score. I also use GridSearchCV to try several values of the max_depth hyper-parameter and select the one that gives the highest cross-validated score.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {'max_depth': [1, 5, 10, 15, 20, 25, 30]}
base_estimator = RandomForestRegressor(max_features='sqrt')
clf = GridSearchCV(base_estimator, param_grid, cv=5).fit(X_train, y_train)
rf_model = clf.best_estimator_
rf_params = clf.best_params_
rf_score = rf_model.score(X_test, y_test)
rf_y_pred = rf_model.predict(X_test)
rf_residuals = y_test - rf_y_pred
rf_rmse = root_mean_squared_error(y_test, rf_y_pred)
print(f"Best params: {rf_params}, Best score: {rf_score:.2f}, Best RMSE: {rf_rmse:.2f}")

As the results show, the RandomForest Regressor gave a score of 0.64 and an RMSE of 1.25, a clear improvement over the Linear Regression model (0.52 and 1.44).
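It can help to look at the cross-validated score of every candidate rather than only the winner. The sketch below does this on synthetic data from `make_regression`, with a smaller grid and forest than the cell above so that it runs quickly; the dataset, grid and `n_estimators` value are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the housing features
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, max_features='sqrt', random_state=0),
    param_grid={'max_depth': [1, 5, 10]},
    cv=3,
).fit(X_tr, y_tr)

# Mean cross-validated R^2 for every candidate depth
results = pd.DataFrame(search.cv_results_)[['param_max_depth', 'mean_test_score']]
print(results)
print("Best params:", search.best_params_)
print("CV score of best params:", round(search.best_score_, 2))
print("Held-out test score:", round(search.score(X_te, y_te), 2))
```

Note that `best_score_` is the mean cross-validated score on the training folds, whereas the score printed in the cell above is computed on the held-out test set, so the two numbers are not expected to match exactly.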

### Residual Plot

In [None]:
# Plot residuals
fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 4))
fig.suptitle('Histogram of Residuals')
sns.histplot(lr_residuals, kde=True, ax=ax1)
ax1.set(xlabel=f'Residual (LinearRegression), RMSE={lr_rmse:.2f}', ylabel='Frequency')
sns.histplot(rf_residuals, kde=True, ax=ax2)
ax2.set(xlabel=f'Residual (RandomForestRegressor), RMSE={rf_rmse:.2f}')
plt.show()

### Predicted vs Actual values plot

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 4))
fig.suptitle('Predicted vs Actual Values')
ax1.scatter(y_test, lr_y_pred, alpha=0.5)
ax1.plot([1, 10], [1, 10], 'r--')
ax1.set(xlabel='Actual Value (LinearRegression)', ylabel='Predicted Value (Billion VND)')
ax2.scatter(y_test, rf_y_pred, alpha=0.5)
ax2.plot([1, 10], [1, 10], 'r--')
ax2.set(xlabel='Actual Value (RandomForestRegressor)')
plt.show()

## Summary

In this exercise, I explored the Vietnam housing price dataset, selected the houses in Ho Chi Minh city, and performed data cleaning tasks such as removing features with too many missing values and filling the remaining missing values with the median or mode.

During EDA, I looked at the data distribution, assumed the outliers are true outliers, and decided not to remove them. I plotted the correlation heatmap and analyzed the correlation between the features and the target variable.

Before testing my models, I prepared the data by converting all categorical features to numerical ones and then split the data into train and test datasets.

Finally, I tested two sklearn regression models: LinearRegression and RandomForestRegressor. For the latter, I also tried different values of the max_depth hyper-parameter using GridSearchCV() to find the model that gives the highest score. As the results show, RandomForestRegressor gives a score of 0.64, which is clearly better than the 0.52 of the LinearRegression model. I believe the model can be further improved by removing outliers, tuning more of the model's hyper-parameters, or exploring other models.