# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which each categorical feature gets a learnable embedding. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We use only the deeptabular component of this library, through the TabMlp network.

In [1]:
#!pip install pytorch-widedeep

In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here

# Training Data
df_train = df[df['year'] <= 2020].copy()
# Testing Data
df_test = df[df['year'] >= 2021].copy()

# Dropping Unnecessary Columns
df_train.drop(columns=['year','full_address','nearest_stn'], inplace=True)
df_test.drop(columns=['year','full_address','nearest_stn'], inplace=True)

print("Training Data:", df_train.shape)
print("Testing Data:", df_test.shape)

Training Data: (87370, 11)
Testing Data: (72183, 11)
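The split above is purely chronological, so it is worth confirming that the two partitions are disjoint and together cover every row. A minimal sketch of that check on a toy frame follows; the values are made up for illustration and are not taken from `hdb_price_prediction.csv`.

```python
import pandas as pd

# Toy data for illustration only; not taken from hdb_price_prediction.csv.
toy = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, 2022],
    "resale_price": [300000.0, 310000.0, 320000.0, 400000.0, 450000.0],
})

# Same boolean-mask split as above: 2020 and earlier vs. 2021 and later.
toy_train = toy[toy["year"] <= 2020].copy()
toy_test = toy[toy["year"] >= 2021].copy()

# Sanity checks: no row lands in both sets, and no row is lost.
assert toy_train.index.intersection(toy_test.index).empty
assert len(toy_train) + len(toy_test) == len(toy)

print(len(toy_train), len(toy_test))
```

For the notebook's own data, the same check holds by arithmetic: 87,370 + 72,183 = 159,553, which matches the number of entries that `df.info()` reports for the full dataset.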


In [4]:
df.head()

| | month | year | town | full_address | nearest_stn | dist_to_nearest_stn | dist_to_dhoby | degree_centrality | eigenvector_centrality | flat_model_type | remaining_lease_years | floor_area_sqm | storey_range | resale_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017 | ANG MO KIO | 406 ANG MO KIO AVENUE 10 | Ang Mo Kio | 1.007264 | 7.006044 | 0.016807 | 0.006243 | 2 ROOM, Improved | 61.333333 | 44.0 | 10 TO 12 | 232000.0 |
| 1 | 1 | 2017 | ANG MO KIO | 108 ANG MO KIO AVENUE 4 | Ang Mo Kio | 1.271389 | 7.983837 | 0.016807 | 0.006243 | 3 ROOM, New Generation | 60.583333 | 67.0 | 01 TO 03 | 250000.0 |
| 2 | 1 | 2017 | ANG MO KIO | 602 ANG MO KIO AVENUE 5 | Yio Chu Kang | 1.069743 | 9.0907 | 0.016807 | 0.002459 | 3 ROOM, New Generation | 62.416667 | 67.0 | 01 TO 03 | 262000.0 |
| 3 | 1 | 2017 | ANG MO KIO | 465 ANG MO KIO AVENUE 10 | Ang Mo Kio | 0.94689 | 7.519889 | 0.016807 | 0.006243 | 3 ROOM, New Generation | 62.083333 | 68.0 | 04 TO 06 | 265000.0 |
| 4 | 1 | 2017 | ANG MO KIO | 601 ANG MO KIO AVENUE 5 | Yio Chu Kang | 1.092551 | 9.130489 | 0.016807 | 0.002459 | 3 ROOM, New Generation | 62.416667 | 67.0 | 01 TO 03 | 265000.0 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159553 entries, 0 to 159552
Data columns (total 14 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   month                   159553 non-null  int64  
 1   year                    159553 non-null  int64  
 2   town                    159553 non-null  object 
 3   full_address            159553 non-null  object 
 4   nearest_stn             159553 non-null  object 
 5   dist_to_nearest_stn     159553 non-null  float64
 6   dist_to_dhoby           159553 non-null  float64
 7   degree_centrality       159553 non-null  float64
 8   eigenvector_centrality  159553 non-null  float64
 9   flat_model_type         159553 non-null  object 
 10  remaining_lease_years   159553 non-null  float64
 11  floor_area_sqm          159553 non-null  float64
 12  storey_range            159553 non-null  object 
 13  resale_price            159553 non-null  float64
dtypes: float64(7), int64(2), object(5)

>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [6]:
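# Categorical columns given as (column name, embedding dimension) tuples.
# Note: pytorch-widedeep reads the second element as the embedding size,
# so the number of unique values is used here as the embedding dimension.
cat_embed_cols = [
    ('month', len(np.unique(df['month']))),
    ('town', len(np.unique(df['town']))),
    ('flat_model_type', len(np.unique(df['flat_model_type']))),
    ('storey_range', len(np.unique(df['storey_range']))),
]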


continuous_cols = ['dist_to_nearest_stn','dist_to_dhoby','degree_centrality','eigenvector_centrality',
                 'remaining_lease_years','floor_area_sqm']

In [7]:
# TODO: Enter your code here
tab_preprocessor = TabPreprocessor(
    cat_embed_cols = cat_embed_cols, continuous_cols = continuous_cols
)

# Fit the preprocessor on the training data only, then transform it
X_tab = tab_preprocessor.fit_transform(df_train)
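The key property of calling `fit_transform` on the training set, and only `transform` on the test set later in the notebook, is that category encodings and any scaling statistics are learned from training data alone. The sketch below illustrates that pattern with a hypothetical `SimplePreprocessor` class that is not part of pytorch-widedeep. Its behavior (integer codes starting at 1, code 0 for unseen categories, standardization of the continuous column) is an assumption made for illustration, and the toy values are made up.

```python
import numpy as np

class SimplePreprocessor:
    """Hypothetical stand-in for illustration; not part of pytorch-widedeep."""

    def fit(self, categories, values):
        # Learn the category -> integer mapping from the training data only.
        # Code 0 is reserved for categories never seen during fit.
        self.mapping = {c: i + 1 for i, c in enumerate(sorted(set(categories)))}
        # Learn the scaling statistics from the training data only.
        self.mean = float(np.mean(values))
        self.std = float(np.std(values))
        return self

    def transform(self, categories, values):
        codes = np.array([self.mapping.get(c, 0) for c in categories])
        scaled = (np.asarray(values, dtype=float) - self.mean) / self.std
        return np.column_stack([codes, scaled])

    def fit_transform(self, categories, values):
        return self.fit(categories, values).transform(categories, values)

prep = SimplePreprocessor()
# Made-up toy values: one categorical column and one continuous column.
X_train_toy = prep.fit_transform(["A", "B", "A"], [40.0, 60.0, 80.0])
X_test_toy = prep.transform(["B", "C"], [60.0, 100.0])
print(X_train_toy)
print(X_test_toy)
```

In this toy run the test category `"C"` never appears in training, so it maps to the reserved code 0, and the test values are scaled with the training mean and standard deviation rather than their own.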



In [8]:
# Creating the TabMlp Model
model = TabMlp(tab_preprocessor.column_idx, 
                cat_embed_input = tab_preprocessor.cat_embed_input, 
                cat_embed_dropout = 0.1,
                continuous_cols = continuous_cols,
                mlp_hidden_dims = [200, 100])

wide_deep = WideDeep(deeptabular = model) 
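Conceptually, a model of this kind looks up an embedding vector for each categorical feature, concatenates those vectors with the continuous features, and passes the result through the MLP. The NumPy sketch below illustrates only that data flow. The weights are random, the cardinalities and embedding sizes are made up, and `forward` is a hypothetical helper; it is not the library's implementation, and it omits dropout, training, and the final prediction layer that `WideDeep` is assumed to add on top.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up cardinalities and embedding sizes for two categorical features.
n_categories = [12, 26]
embed_dims = [4, 8]
n_continuous = 6
hidden_dims = [200, 100]  # the two MLP layers requested in the task

# One embedding table per categorical feature (row 0 kept for unseen values).
embeddings = [rng.normal(size=(n + 1, d)) for n, d in zip(n_categories, embed_dims)]

# Dense layer weights: input -> 200 -> 100.
in_dim = sum(embed_dims) + n_continuous
dims = [in_dim] + hidden_dims
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]

def forward(cat_idx, cont):
    """cat_idx: (batch, 2) integer codes; cont: (batch, 6) floats."""
    # Embedding lookup is plain row indexing into each table.
    looked_up = [emb[cat_idx[:, j]] for j, emb in enumerate(embeddings)]
    x = np.concatenate(looked_up + [cont], axis=1)
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)  # linear layer followed by ReLU
    return x

batch_cat = np.array([[1, 3], [12, 26], [5, 0]])
batch_cont = rng.normal(size=(3, n_continuous))
out = forward(batch_cat, batch_cont)
print(out.shape)
```

The input width of the first layer is the sum of the embedding sizes plus the number of continuous columns, which is why the choice of embedding dimension directly affects the size of the model.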

In [9]:
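# Creating the Trainer
# Note: the "regression" objective is assumed to minimize mean squared error,
# which is consistent with the ~1e9 loss values in the training log. The task asks
# for an RMSE cost function; the library also lists objective = "rmse" for that.
trainer = Trainer(model = wide_deep,
                  objective = "regression",
                  lr_scheduler_step = False,
                  num_workers = 0,
                  metrics = [R2Score])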

In [10]:
# Training Model with 100 Epochs with Batch Size 64
trainer.fit(X_tab = X_tab, 
            target = df_train['resale_price'].values, 
            n_epochs = 100, 
            batch_size = 64)

epoch 1: 100%|███| 1366/1366 [00:07<00:00, 193.17it/s, loss=6.9e+10, metrics={'r2': -1.9059}]
epoch 2: 100%|█████| 1366/1366 [00:07<00:00, 183.53it/s, loss=9.7e+9, metrics={'r2': 0.5917}]
epoch 3: 100%|████| 1366/1366 [00:07<00:00, 188.26it/s, loss=6.57e+9, metrics={'r2': 0.7235}]
epoch 4: 100%|█████| 1366/1366 [00:07<00:00, 184.54it/s, loss=5.6e+9, metrics={'r2': 0.7644}]
epoch 5: 100%|████| 1366/1366 [00:08<00:00, 167.43it/s, loss=5.13e+9, metrics={'r2': 0.7843}]
epoch 6: 100%|████| 1366/1366 [00:08<00:00, 158.64it/s, loss=4.84e+9, metrics={'r2': 0.7963}]
epoch 7: 100%|████| 1366/1366 [00:10<00:00, 125.14it/s, loss=4.66e+9, metrics={'r2': 0.8041}]
epoch 8: 100%|████| 1366/1366 [00:10<00:00, 135.84it/s, loss=4.46e+9, metrics={'r2': 0.8123}]
epoch 9: 100%|████| 1366/1366 [00:09<00:00, 145.54it/s, loss=4.35e+9, metrics={'r2': 0.8169}]
epoch 10: 100%|███| 1366/1366 [00:09<00:00, 147.67it/s, loss=4.34e+9, metrics={'r2': 0.8175}]
epoch 11: 100%|███| 1366/1366 [00:09<00:00, 141.30it/s, loss

epoch 88: 100%|████| 1366/1366 [00:08<00:00, 152.22it/s, loss=3.6e+9, metrics={'r2': 0.8487}]
epoch 89: 100%|████| 1366/1366 [00:09<00:00, 146.54it/s, loss=3.6e+9, metrics={'r2': 0.8484}]
epoch 90: 100%|███| 1366/1366 [00:08<00:00, 160.25it/s, loss=3.61e+9, metrics={'r2': 0.8482}]
epoch 91: 100%|████| 1366/1366 [00:08<00:00, 160.75it/s, loss=3.61e+9, metrics={'r2': 0.848}]
epoch 92: 100%|███| 1366/1366 [00:08<00:00, 158.24it/s, loss=3.57e+9, metrics={'r2': 0.8498}]
epoch 93: 100%|████| 1366/1366 [00:08<00:00, 163.05it/s, loss=3.6e+9, metrics={'r2': 0.8487}]
epoch 94: 100%|████| 1366/1366 [00:08<00:00, 159.00it/s, loss=3.6e+9, metrics={'r2': 0.8485}]
epoch 95: 100%|███| 1366/1366 [00:08<00:00, 163.22it/s, loss=3.61e+9, metrics={'r2': 0.8481}]
epoch 96: 100%|███| 1366/1366 [00:09<00:00, 146.00it/s, loss=3.58e+9, metrics={'r2': 0.8494}]
epoch 97: 100%|███| 1366/1366 [00:09<00:00, 151.09it/s, loss=3.57e+9, metrics={'r2': 0.8497}]
epoch 98: 100%|███| 1366/1366 [00:09<00:00, 144.79it/s, loss
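
The logged loss values are on the order of 10^9, which is what a squared-error loss would produce for prices in the hundreds of thousands. The sketch below assumes that the `"regression"` objective corresponds to mean squared error; under that assumption, taking the square root of a logged loss gives a rough RMSE in the same units as `resale_price`.

```python
import math

# Loss values copied from the training log above (epochs 1 and 97).
logged_losses = {1: 6.9e10, 97: 3.57e9}

# Assumption: the logged loss is a mean squared error, so its square
# root is an RMSE expressed in the same units as resale_price.
rmse_by_epoch = {epoch: math.sqrt(loss) for epoch, loss in logged_losses.items()}

for epoch, value in rmse_by_epoch.items():
    print(f"epoch {epoch}: sqrt(loss) = {value:,.0f}")
```

Read this way, the training error near the end of the run is a little under 60,000, which is a useful reference point when interpreting the test RMSE.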

>Report the test RMSE and the test R2 value that you obtained.

In [13]:
# Transform the test data with the preprocessor fitted on the training data
X_test = tab_preprocessor.transform(df_test)

# Make predictions on the test dataset
y_pred = trainer.predict(X_tab = X_test)

predict: 100%|██████████████████████████████████████████| 1128/1128 [00:02<00:00, 499.53it/s]


In [14]:
y_pred

array([103143.89, 135814.58, 273692.06, ..., 632185.75, 582168.06,
       600626.3 ], dtype=float32)

In [16]:
# TODO: Enter your code here
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

y_true = df_test['resale_price']

r2 = r2_score(y_true, y_pred)
print("R^2 Score:", r2)
mse = mean_squared_error(y_true, y_pred)
print("Root Mean Squared Error (RMSE):", np.sqrt(mse))

R^2 Score: 0.6162854537452702
Root Mean Squared Error (RMSE): 104798.4517345718


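The test R^2 obtained is approximately 0.616, and the test RMSE is approximately 104,798. For comparison, the training R^2 in the last logged epochs is about 0.85, so the model fits the 2020-and-earlier training data noticeably better than the 2021-and-later test data.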
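
For reference, both reported metrics can be written directly from their definitions: RMSE is the square root of the mean squared residual, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers (not the notebook's predictions) and hypothetical helper names `rmse` and `r2`; it should agree with the `sklearn.metrics` functions used above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration only.
y_true_toy = [300000.0, 400000.0, 500000.0]
y_pred_toy = [310000.0, 390000.0, 520000.0]

print(rmse(y_true_toy, y_pred_toy))
print(r2(y_true_toy, y_pred_toy))
```

Because R^2 compares the model against always predicting the mean, it can be negative when the model does worse than that baseline, as in epoch 1 of the training log (r2 = -1.9059).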