# Question B1 (15 marks)

Real-world datasets often contain a mix of numeric and categorical features, and this dataset is one example. Before a model can be built on such data, the categorical features have to be encoded or embedded.
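
To make "encoded or embedded" concrete, here is a minimal, library-free sketch. The helper names `build_vocab` and `one_hot` and the sample values are illustrative assumptions, not part of PyTorch Tabular. An integer index per category is the kind of input an embedding layer looks up, while a one-hot vector is the classic encoding alternative.

```python
def build_vocab(values):
    """Map each distinct category to an integer index, in first-seen order."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab)
    return vocab


def one_hot(index, size):
    """Return a list of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


# Hypothetical sample in the style of the `storey_range` column.
storey_range = ["10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06"]

vocab = build_vocab(storey_range)
indices = [vocab[v] for v in storey_range]              # input for an embedding lookup
encoded = [one_hot(i, len(vocab)) for i in indices]     # one-hot alternative

print(vocab)
print(indices)
print(encoded[0])
```

One-hot vectors grow with the number of categories, whereas an embedding maps each index to a short learned vector, which is why embeddings are often preferred for high-cardinality columns such as `town`.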

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [1]:
!pip install pytorch_tabular[extra]

Collecting pytorch_tabular[extra]
  Downloading pytorch_tabular-1.1.0-py2.py3-none-any.whl.metadata (21 kB)
Collecting pytorch-lightning<2.2.0,>=2.0.0 (from pytorch_tabular[extra])
  Downloading pytorch_lightning-2.1.4-py3-none-any.whl.metadata (21 kB)
Collecting omegaconf>=2.3.0 (from pytorch_tabular[extra])
  Downloading omegaconf-2.3.0-py3-none-any.whl.metadata (3.9 kB)
Collecting torchmetrics<1.3.0,>=0.10.0 (from pytorch_tabular[extra])
  Downloading torchmetrics-1.2.1-py3-none-any.whl.metadata (20 kB)
Collecting pytorch-tabnet==4.1 (from pytorch_tabular[extra])
  Downloading pytorch_tabnet-4.1.0-py3-none-any.whl.metadata (15 kB)
Collecting PyYAML<6.1.0,>=5.4 (from pytorch_tabular[extra])
  Downloading PyYAML-6.0.1-cp38-cp38-win_amd64.whl.metadata (2.1 kB)
Collecting ipywidgets (from pytorch_tabular[extra])
  Downloading ipywidgets-8.1.2-py3-none-any.whl.metadata (2.4 kB)
Collecting einops<0.8.0,>=0.6.0 (from pytorch_tabular[extra])
  Using cached einops-0.7.0-py3-none-any.whl.meta

In [35]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

1. Divide the dataset (`hdb_price_prediction.csv`) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [55]:
df = pd.read_csv('hdb_price_prediction.csv')
df.head()

Unnamed: 0,month,year,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price
0,1,2017,ANG MO KIO,406 ANG MO KIO AVENUE 10,Ang Mo Kio,1.007264,7.006044,0.016807,0.006243,"2 ROOM, Improved",61.333333,44.0,10 TO 12,232000.0
1,1,2017,ANG MO KIO,108 ANG MO KIO AVENUE 4,Ang Mo Kio,1.271389,7.983837,0.016807,0.006243,"3 ROOM, New Generation",60.583333,67.0,01 TO 03,250000.0
2,1,2017,ANG MO KIO,602 ANG MO KIO AVENUE 5,Yio Chu Kang,1.069743,9.0907,0.016807,0.002459,"3 ROOM, New Generation",62.416667,67.0,01 TO 03,262000.0
3,1,2017,ANG MO KIO,465 ANG MO KIO AVENUE 10,Ang Mo Kio,0.94689,7.519889,0.016807,0.006243,"3 ROOM, New Generation",62.083333,68.0,04 TO 06,265000.0
4,1,2017,ANG MO KIO,601 ANG MO KIO AVENUE 5,Yio Chu Kang,1.092551,9.130489,0.016807,0.002459,"3 ROOM, New Generation",62.416667,67.0,01 TO 03,265000.0


In [56]:
# YOUR CODE HERE
# Target: resale_price
# Continuous features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
# Categorical features: month, town, flat_model_type, storey_range

# Exclude the years 2022 and 2023, which must not be used.
print(f"df length before dropping 2022-23: {len(df)}")
df = df[~df.year.isin([2022, 2023])]
print(f"df length after dropping 2022-23: {len(df)}")

# Drop the columns that are not used as features.
df = df.drop(columns=['full_address', 'nearest_stn'])
# print(f"df:\n{df}")

train = df[df.year < 2020]
val = df[df.year == 2020]
test = df[df.year == 2021]

# print(f"test:\n{test}")

continuous_cols = ["dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality", "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm"]
categorical_cols = [ "month", "town", "flat_model_type", "storey_range"]

df length before dropping 2022-23: 159553
df length after dropping 2022-23: 116427


2. Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set `batch_size` to 1024 and `max_epochs` to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose the Adam optimiser. There is no need to set the learning rate (it will be tuned automatically) or a scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [57]:
continuous_cols = ["dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality", "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm"]
categorical_cols = [ "month", "town", "flat_model_type", "storey_range"]
target = ['resale_price']

# 1. Define the DataConfig
data_config = DataConfig(
    target= target,
    continuous_cols=continuous_cols,
    categorical_cols=categorical_cols,
)

# 2. Define the TrainerConfig
trainer_config = TrainerConfig(
    auto_lr_find=True, 
    batch_size=1024, 
    max_epochs=50, 
)

# 3. Define the CategoryEmbeddingModelConfig
model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50"  # One hidden layer with 50 neurons.
)

# 4. Create the OptimizerConfig (Adam; no learning rate or scheduler is set)
optimizer_config = OptimizerConfig(optimizer="Adam")

# 5. Create the TabularModel
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

# 6. Train the tabular model, then evaluate and predict on the test set
tabular_model.fit(train=train, validation=val)
result = tabular_model.evaluate(test)
pred_df = tabular_model.predict(test)

Seed set to 42


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at C:\Users\nsupr\CZ4042 NN & DL Assignment\.lr_find_482f3ae5-19fa-4873-9838-a8b2c923751c.ckpt
Restored all states from the checkpoint at C:\Users\nsupr\CZ4042 NN & DL Assignment\.lr_find_482f3ae5-19fa-4873-9838-a8b2c923751c.ckpt



3. Report the test RMSE and the test R² value that you obtained.
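
As a reference for what these two metrics measure, the following is a minimal pure-Python sketch. The functions `rmse` and `r2` and the toy prices are illustrative assumptions, not values from this dataset. RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices, chosen only to keep the arithmetic easy to follow.
y_true = [200000.0, 300000.0, 400000.0]
y_pred = [210000.0, 290000.0, 400000.0]

print(f"RMSE: {rmse(y_true, y_pred):.2f}")
print(f"R Squared: {r2(y_true, y_pred):.4f}")
```

Because RMSE is expressed in the same unit as the target, it can be read directly as a typical error in dollars, while R² is unitless.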



In [58]:
pred_df.head()

Unnamed: 0,resale_price_prediction
87370,136601.484375
87371,166652.90625
87372,299092.6875
87373,294914.8125
87374,266336.75


In [59]:
# YOUR CODE & RESULT HERE
from sklearn.metrics import r2_score

# evaluate() reports the test MSE; its square root is the RMSE.
test_mse = float(result[0]['test_mean_squared_error'])
print(f"{test_mse}")
rmse = np.sqrt(test_mse)

# Ground-truth resale prices for the test set, in the same row order as the predictions.
y = test["resale_price"].values
pred = pred_df["resale_price_prediction"].values
rsquared = r2_score(y, pred)
print(f"test loss: {result[0]['test_loss']}")
print(f"RMSE: {rmse}")
print(f"R Squared: {rsquared}")

6465972736.0

4. Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.
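
One way to look for a trend is to take the rows with the largest absolute errors and count how often each category value appears among them. The sketch below uses only the standard library; the rows, the placeholder town names and the helper `top_k_errors` are hypothetical and serve only to illustrate the idea.

```python
import heapq
from collections import Counter


def abs_error(row):
    """Absolute difference between the actual and the predicted price."""
    return abs(row["resale_price"] - row["prediction"])


def top_k_errors(rows, k):
    """Return the k rows with the largest absolute error, worst first."""
    return heapq.nlargest(k, rows, key=abs_error)


# Hypothetical test rows; "X", "Y" and "Z" are placeholder town names.
rows = [
    {"town": "X", "resale_price": 900000.0, "prediction": 600000.0},
    {"town": "Y", "resale_price": 400000.0, "prediction": 390000.0},
    {"town": "X", "resale_price": 1000000.0, "prediction": 650000.0},
    {"town": "Z", "resale_price": 300000.0, "prediction": 320000.0},
    {"town": "X", "resale_price": 850000.0, "prediction": 640000.0},
]

worst = top_k_errors(rows, 3)
town_counts = Counter(row["town"] for row in worst)

for row in worst:
    print(row["town"], abs_error(row))
print(town_counts.most_common())
```

If one category dominates the worst rows, or the worst rows are all under- or over-predicted, that points to where the model lacks signal.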



In [41]:
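# pred_df holds only the prediction column (see pred_df.head() above),
# so attach it to the test rows; both share the same index.
error_df = test.copy()
error_df['resale_price_prediction'] = pred_df['resale_price_prediction']
error_df['error'] = (error_df['resale_price'] - error_df['resale_price_prediction']).abs()

# Keep the 25 test samples with the largest absolute errors.
error_df = error_df.sort_values("error", ascending=False).head(25)
error_df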