CS4001/4042 Assignment 1, Part B, Q1
---

Real world datasets often have a mix of numeric and categorical features – this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [1]:
#!pip install pytorch_tabular

In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

> Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

df.head()

Unnamed: 0,month,year,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price
0,1,2017,ANG MO KIO,406 ANG MO KIO AVENUE 10,Ang Mo Kio,1.007264,7.006044,0.016807,0.006243,"2 ROOM, Improved",61.333333,44.0,10 TO 12,232000.0
1,1,2017,ANG MO KIO,108 ANG MO KIO AVENUE 4,Ang Mo Kio,1.271389,7.983837,0.016807,0.006243,"3 ROOM, New Generation",60.583333,67.0,01 TO 03,250000.0
2,1,2017,ANG MO KIO,602 ANG MO KIO AVENUE 5,Yio Chu Kang,1.069743,9.0907,0.016807,0.002459,"3 ROOM, New Generation",62.416667,67.0,01 TO 03,262000.0
3,1,2017,ANG MO KIO,465 ANG MO KIO AVENUE 10,Ang Mo Kio,0.94689,7.519889,0.016807,0.006243,"3 ROOM, New Generation",62.083333,68.0,04 TO 06,265000.0
4,1,2017,ANG MO KIO,601 ANG MO KIO AVENUE 5,Yio Chu Kang,1.092551,9.130489,0.016807,0.002459,"3 ROOM, New Generation",62.416667,67.0,01 TO 03,265000.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159553 entries, 0 to 159552
Data columns (total 14 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   month                   159553 non-null  int64  
 1   year                    159553 non-null  int64  
 2   town                    159553 non-null  object 
 3   full_address            159553 non-null  object 
 4   nearest_stn             159553 non-null  object 
 5   dist_to_nearest_stn     159553 non-null  float64
 6   dist_to_dhoby           159553 non-null  float64
 7   degree_centrality       159553 non-null  float64
 8   eigenvector_centrality  159553 non-null  float64
 9   flat_model_type         159553 non-null  object 
 10  remaining_lease_years   159553 non-null  float64
 11  floor_area_sqm          159553 non-null  float64
 12  storey_range            159553 non-null  object 
 13  resale_price            159553 non-null  float64
dtypes: float64(7), int64

In [5]:
# TODO: Enter your code here

# Training Data Set: Year 2019 and before
df_train = df[df['year'] <= 2019].copy()
# Validation Data Set: Year 2020
df_val = df[df['year'] == 2020].copy()
# Testing Data Set: Year 2021
df_test = df[df['year'] == 2021].copy()

# Dropping Unncessary Columns
df_train.drop(columns=['year','full_address','nearest_stn'], inplace=True)
df_val.drop(columns=['year','full_address','nearest_stn'], inplace=True)
df_test.drop(columns=['year','full_address','nearest_stn'], inplace=True)

print("Training Data:", df_train.shape)
print("Validation Data:", df_val.shape)
print("Testing Data:", df_test.shape)

Training Data: (64057, 11)
Validation Data: (23313, 11)
Testing Data: (29057, 11)


> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to be 1024 and set max_epoch as 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) nor scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [6]:
num_col_names = ['dist_to_nearest_stn','dist_to_dhoby','degree_centrality','eigenvector_centrality',
                 'remaining_lease_years','floor_area_sqm']
cat_col_names = ['month','town','flat_model_type','storey_range']

In [7]:
# TODO: Enter your code here
data_config = DataConfig(
    target=["resale_price"],  
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    auto_lr_find=True,  # Runs the LRFinder to automatically derive a learning rate
    batch_size=1024,
    max_epochs=50,
)
optimizer_config = OptimizerConfig()

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

2023-10-12 02:17:29,945 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off


In [8]:
#!pip install torch_optimizer

In [9]:
from torch_optimizer import QHAdam
# Training Tabular Model
tabular_model.fit(df_train, 
                  validation=df_val, 
                  optimizer=QHAdam)

Global seed set to 42
2023-10-12 02:17:29,975 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-12 02:17:29,977 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-12 02:17:30,037 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-12 02:17:30,056 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-12 02:17:30,087 - {pytorch_tabular.tabular_model:573} - INFO - Auto LR Find Started
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
  rank_zero_warn(
  rank_zero_warn(
  rank_zero_warn(
  rank_zero_warn(


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

  rank_zero_warn(
  rank_zero_warn(
`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at /Users/enlih/Library/CloudStorage/OneDrive-NanyangTechnologicalUniversity/SC4001 Neural Network & Deep Learning/Project/.lr_find_69d4ba94-61a9-4110-9d07-11a423d2699a.ckpt
Restored all states from the checkpoint file at /Users/enlih/Library/CloudStorage/OneDrive-NanyangTechnologicalUniversity/SC4001 Neural Network & Deep Learning/Project/.lr_find_69d4ba94-61a9-4110-9d07-11a423d2699a.ckpt
2023-10-12 02:17:32,870 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-12 02:17:32,871 - {pytorch_tabular.tabular_model:582} - INFO - Training Started


Output()

2023-10-12 02:17:48,963 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-12 02:17:48,964 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
  rank_zero_deprecation(


<pytorch_lightning.trainer.trainer.Trainer at 0x7fcd4c2bb5b0>

In [10]:
# Evaluation and Prediction
evaluation = tabular_model.evaluate(df_test)
predicted = tabular_model.predict(df_test)

Output()

  rank_zero_warn(


Output()

In [11]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# Getting the True Values and Predicted Values for RMSE and R2
y_true = predicted['resale_price']
y_pred = predicted['resale_price_prediction']

r2 = r2_score(y_true, y_pred)
print("R^2 Score:", r2)
mse = mean_squared_error(y_true, y_pred)
print("Root Mean Squared Error (RMSE):", np.sqrt(mse))

R^2 Score: 0.7776187688137617
Root Mean Squared Error (RMSE): 76696.92315065298


> Report the test RMSE error and the test R2 value that you obtained.



\# TODO: \<The R^2 value is 0.7776187688137617 and RMSE error is 76696.92315065298\>

> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [12]:
# TODO: Enter your code here
# Calculate the errors
df['error'] = abs(y_true - y_pred)

# Sort the DataFrame by error in descending order
sorted_df = df.sort_values(by='error', ascending=False)

# Select the top 25 samples with the largest errors
top_25_errors = sorted_df.head(25)

# Reset Index to see the position of each housing error
top_25_errors = top_25_errors.reset_index(drop=True)

# Printing out the result
top_25_errors

Unnamed: 0,month,year,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price,error
0,11,2021,BUKIT MERAH,46 SENG POH ROAD,Tiong Bahru,0.581977,2.309477,0.016807,0.047782,"3 ROOM, Standard",50.166667,88.0,01 TO 03,780000.0,415313.5
1,6,2021,BUKIT BATOK,288A BUKIT BATOK STREET 25,Bukit Batok,1.29254,10.763777,0.016807,0.000217,"EXECUTIVE, Apartment",75.583333,144.0,10 TO 12,968000.0,353501.75
2,12,2021,TAMPINES,156 TAMPINES STREET 12,Tampines,0.370873,12.479752,0.033613,0.000229,"EXECUTIVE, Maisonette",61.75,148.0,01 TO 03,998000.0,342406.375
3,12,2021,BISHAN,273B BISHAN STREET 24,Bishan,0.776182,6.297489,0.033613,0.015854,"5 ROOM, DBSS",88.833333,120.0,37 TO 39,1360000.0,339411.375
4,12,2021,QUEENSTOWN,89 DAWSON ROAD,Queenstown,0.658035,3.807573,0.016807,0.008342,"4 ROOM, Premium Apartment Loft",93.333333,109.0,04 TO 06,968000.0,332061.125
5,6,2021,BUKIT MERAH,17 TIONG BAHRU ROAD,Tiong Bahru,0.693391,2.058774,0.016807,0.047782,"3 ROOM, Standard",50.583333,88.0,01 TO 03,680888.0,322531.28125
6,8,2021,CENTRAL AREA,4 TANJONG PAGAR PLAZA,Tanjong Pagar,0.451637,2.594828,0.016807,0.103876,"5 ROOM, Adjoined flat",54.583333,118.0,16 TO 18,938000.0,320045.875
7,12,2021,BUKIT MERAH,49 KIM PONG ROAD,Tiong Bahru,0.468378,2.365532,0.016807,0.047782,"3 ROOM, Standard",50.166667,88.0,01 TO 03,695000.0,318997.75
8,6,2021,QUEENSTOWN,91 DAWSON ROAD,Queenstown,0.745596,3.720593,0.016807,0.008342,"4 ROOM, Premium Apartment Loft",93.916667,97.0,07 TO 09,930000.0,317532.1875
9,8,2021,BISHAN,275A BISHAN STREET 24,Bishan,0.827889,6.370404,0.033613,0.015854,"5 ROOM, DBSS",88.916667,120.0,25 TO 27,1280000.0,317021.875


In [13]:
top_25_errors['town'].value_counts()

town
BUKIT MERAH     10
QUEENSTOWN       6
BISHAN           3
CENTRAL AREA     2
BUKIT BATOK      1
TAMPINES         1
ANG MO KIO       1
HOUGANG          1
Name: count, dtype: int64

In [14]:
df.isna().sum()

month                          0
year                           0
town                           0
full_address                   0
nearest_stn                    0
dist_to_nearest_stn            0
dist_to_dhoby                  0
degree_centrality              0
eigenvector_centrality         0
flat_model_type                0
remaining_lease_years          0
floor_area_sqm                 0
storey_range                   0
resale_price                   0
error                     130496
dtype: int64

\# TODO: \< I have realised that the top 25 errors come from houses that are in the year of 2021 which is the test data used for evaluation and prediction of the tabular model. Hence, I think ensuring that the data is properly scaled and normalized is neccesary especially if algorithms are sensitive to the scale of the input features. There should also be re-evaluation of the model using cross-validation to ensure that the observed errors are not just due to a particular split of the data. \>