<h2><center> 🏠 📃 Rental & Buying Properties – The Real Estate Tyrone In 21st Century 💸 🍃 </center></h2>

<h4><center> From Raw Data To Analytic Insights: Predictive Reasoning & Empirical Data </center></h4>

<p><b>Dataset Author:</b> Cresht. (2025). </p>

<p><b>Official Source:</b> BatDongSanVN. 
<a href="https://batdongsan.vn/" target="_blank">[Link]</a></p>

<p><b> Launch Date:</b> September 19th, 2025 - Present (2026) </p>

<p><b>Predecessors</b>:
    <ul>
        <li> Trung, T, Trung, L, Nghi, N & Hai, N. (2023). <i> Group 16 - House Price Prediction - 21KDL </i>. Github. <a href="https://github.com/TrungNotHot/House-Price-Prediction" target="_blank">[Repository]</a></li>
        <li> Surjyanee. (2020). <i> Linear-Regression Model for House Price Prediction </i>. Github. <a href = "https://github.com/huzaifsayed/Linear-Regression-Model-for-House-Price-Prediction">[Repository]</a></li>
        <li> Rishabh, T. (2022). <i> House Price Prediction </i>. Github. <a href = "https://github.com/Rishabh-Tripathi1/House-Price-Prediction"> [Repository]</a></li>
    </ul>
</p>

<p><b>Feature Engineering Performer:</b> Cresht </p>

### 🎨 Abstract 🌌 ###

This feature engineering notebook builds on insights from the exploratory data analysis to transform raw real estate buying and rental datasets into structured, model-ready inputs. The main steps are handling missing and inconsistent values, encoding categorical attributes, normalizing numerical variables, and deriving informative features that capture property characteristics, pricing dynamics, spatial patterns, and temporal trends. Domain-driven transformations such as aggregated location indicators and time-based features are integrated to improve predictive relevance. Manual outlier treatment and feature selection are also applied to improve data quality and reduce noise. Together, these steps aim to improve model interpretability, robustness, and performance in downstream machine learning tasks, namely price prediction and market behaviour analysis.

### 📚 Libraries 🔖 ###

In [1]:
#========================== Backbone of the exploratory analysis ==========================
#Feature Engineering
import numpy as np
import pandas as pd
import seaborn as sns

#Visualisation
import matplotlib.pyplot as plt
import matplotlib.figure
import matplotlib.cm as cm

#Preprocessing
from sklearn.model_selection import train_test_split
import joblib

#========================== Miscellaneous ==========================
#Extension for multiple savestates
import sys, os, string, re

#Absolute path
from pathlib import Path

#JSON Format for label mapping
import json

#Timeline estimation
from datetime import datetime, timedelta
import calendar
    
#Suppress warnings (including deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

### 🗃️ Data Pipeline 📦 ###

In [2]:
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
src_path = os.path.join(project_root)

if src_path not in sys.path:
    sys.path.append(src_path)

#Import necessary preprocessing features
from workarounds.preprocessing.feature_preprocessing.encoders.label_encoder import LabelEncoding
from workarounds.preprocessing.feature_preprocessing.encoders.target_encoder import TargetEncoding
from workarounds.preprocessing.feature_preprocessing.encoders.one_hot_encoder import OneHotEncoding
from workarounds.preprocessing.feature_preprocessing.scalers.minmax_scaler import MinMaxScaling
from workarounds.preprocessing.feature_preprocessing.scalers.standard_scaler import StandardScaling
from workarounds.preprocessing.feature_preprocessing.normalization.normalizer import Normalizing
from workarounds.preprocessing.feature_preprocessing.outliers.iqr_removal import IQRMethod
from workarounds.preprocessing.feature_preprocessing.outliers.zscore_removal import ZScoreMethod
from workarounds.preprocessing.feature_preprocessing.transformers.boxcox import BoxCoxTransformer
from workarounds.preprocessing.feature_preprocessing.transformers.yeo_johnson import YeoJohnsonTransformer
from workarounds.preprocessing.feature_preprocessing.pipeline import PreprocessingPipeline
from workarounds.preprocessing.feature_preprocessing.imputers.mice import MICEImputation
from workarounds.preprocessing.feature_preprocessing.scalers.robust_scaler import RobustScaling

2025-12-30 15:32:12.742 | INFO     | workarounds.config:<module>:11 - PROJ_ROOT path is: C:\Users\Admin\Downloads\catalyst-pre


### 🧪 Experiment 🔬 ###

#### Add the dataframe ####

In [122]:
np.random.seed(42)

# Simulate 150 rows
n = 150

df = pd.DataFrame({
    "city": np.random.choice(["Hanoi", "Tokyo", "Jakarta", "Bangkok", "Seoul"], n),
    "gender": np.random.choice(["Male", "Female", "Other"], n),
    "age": np.random.normal(loc=35, scale=10, size=n).round(1),
    "income": np.random.normal(loc=50000, scale=15000, size=n).round(2),
    "spending": np.random.normal(loc=1500, scale=300, size=n).round(2),
    "score": np.random.exponential(scale=50, size=n).round(2)
})

#### Workaround (Constraints) ####

In [4]:
# Fix invalids for Box-Cox (must be positive)
df["age"] = df["age"].apply(lambda x: max(1, x))
df["score"] = df["score"].apply(lambda x: max(1, x))
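
The positivity constraint can be seen directly from the Box-Cox formula. The sketch below is a minimal NumPy illustration, not the project's `BoxCoxTransformer` (whose internals are not shown here): the `box_cox` helper and the fixed `lam = 0.5` are assumptions made for the example, whereas a real transformer would normally estimate lambda from the data.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for a fixed lambda; defined only for x > 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (np.power(x, lam) - 1.0) / lam

# Hypothetical values, one of them negative
raw = np.array([-2.0, 0.5, 4.0])

# Same idea as max(1, x) in the cell above: floor every value at 1
floored = np.clip(raw, 1.0, None)

with np.errstate(invalid="ignore"):
    unsafe = box_cox(raw, 0.5)   # the negative input yields NaN
safe = box_cox(floored, 0.5)

print(unsafe)
print(safe)
```

Note that flooring at 1 also changes valid positive values below 1 (0.5 becomes 1 here), so it is a convenient workaround for synthetic test data rather than a neutral fix.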

#### Outlier Generator ####

In [5]:
# Introduce some outliers for testing
outlier_indices = np.random.choice(df.index, size=5, replace=False)
df.loc[outlier_indices, "income"] *= 5
df.loc[outlier_indices, "spending"] *= 3

#### Pipeline ####

In [6]:
pipeline = PreprocessingPipeline([
    # Outliers Removers (Numeric) 
    ZScoreMethod(column="spending"),
    IQRMethod(column="income"),

    # Encoders & Transformers
    LabelEncoding(column = "city"),
    OneHotEncoding(column = "gender"),
    BoxCoxTransformer(column = "score"),
    YeoJohnsonTransformer(column = "age"),

    # Scalers
    Normalizing(columns = ["spending"]),
    StandardScaling(columns = ["age"]),
    MinMaxScaling(columns = ["income"]),
])

In [7]:
df_transformed = pipeline.fit_transform(df)

In [8]:
print(df_transformed)

     city       age    income  spending     score  gender_Female  gender_Male  \
0       0 -0.043497  0.770313       1.0  1.619781          False         True   
1       3  1.339479  0.688730       1.0  1.562839          False        False   
2       2 -0.308604  0.651429       1.0  0.063852           True        False   
3       3  2.525175  0.652370       1.0 -1.104333          False         True   
5       4 -0.905980  0.312723       1.0  1.156571          False         True   
..    ...       ...       ...       ...       ...            ...          ...   
145     0  0.528949  0.000000       1.0  0.620257          False         True   
146     0 -0.865765  0.175151       1.0 -0.802965          False        False   
147     2  1.940405  0.816790       1.0  1.497082           True        False   
148     4 -1.057334  0.878703       1.0 -0.021278          False        False   
149     0 -1.260556  0.457037       1.0 -0.118073          False        False   

     gender_Other  
0      

### 📜 Introduction 🪶 ###

In [65]:
# Buying properties folder path
data_path = os.path.join('..', 'data', 'raw', 'house_buying_dec29th_2025.csv')

# Load the dataset
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,id,detail_url,title,location,timeline_hours,area_m2,bedrooms,bathrooms,floors,frontage,price_million_vnd
0,196958,https://batdongsan.vn/ban-nha-nguyen-son-long-...,Bán nhà Nguyễn Sơn Long Biên- Dân trí Tuyệt vờ...,"Long Biên, Hà Nội",2,50.0,4.0,4.0,6.0,False,12500.0
1,196967,https://batdongsan.vn/ban-nha-hem-xe-hoi-4m-2-...,"BÁN NHÀ HẺM XE HƠI 4M, 2 MẶT TIỀN - 2 TẦNG , N...","Bình Chánh, Hồ Chí Minh",2,98.0,4.0,4.0,3.0,True,5500.0
2,196983,https://batdongsan.vn/o-to-ngu-nha-62m2-5-tang...,Ô TÔ NGỦ NHÀ – 62M² – 5 Tầng/4PN – Gần kdc CIT...,"Gò Vấp, Hồ Chí Minh",2,62.0,4.0,5.0,6.0,False,8740.0
3,196962,https://batdongsan.vn/ban-nha-phuc-loi-ngo-non...,Bán nhà Phúc Lợi - ngõ nông - ô tô vào nhà - t...,"Long Biên, Hà Nội",2,39.0,3.0,4.0,6.0,False,6500.0
4,196972,https://batdongsan.vn/tran-duy-hung-6-tang-tha...,TRẦN DUY HƯNG - 6 TẦNG THANG MÁY - NGÕ THÔNG D...,"Cầu Giấy, Hà Nội",2,45.0,3.0,3.0,7.0,True,16980.0


In [66]:
df.describe()

Unnamed: 0,id,timeline_hours,area_m2,bedrooms,bathrooms,floors,price_million_vnd
count,66912.0,66912.0,50344.0,55273.0,51548.0,52278.0,64809.0
mean,141092.732768,8749.222666,1377.977,4.429649,4.296617,14775.95,13810.677155
std,32545.265005,6863.201079,276334.7,4.500243,4.618477,3377399.0,35144.733752
min,80084.0,2.0,1.0,1.0,1.0,0.0,0.0
25%,116152.75,3600.0,42.0,3.0,2.0,3.0,4600.0
50%,142310.0,7200.0,60.0,4.0,3.0,4.0,7000.0
75%,168691.5,8760.0,92.0,5.0,5.0,6.0,12800.0
max,197061.0,26280.0,62000000.0,127.0,127.0,772221300.0,989000.0


### ⚙️ Data Preprocessing 🔽 ###

The pipeline begins by preserving the original dataset and removing redundant identifiers, then splits the composite location field into ward and province columns. Temporal information is reconstructed from relative timelines into separate day, month and year components; day and month are later encoded cyclically. Data quality is improved by removing invalid records, filtering extreme outliers with quantile-based thresholds, and log-transforming the dependent variable (price in million VND) to stabilise its variance. Rare ward-province combinations are consolidated to the province level to alleviate sparsity, and the boolean and missingness indicators are encoded as integers. After the train-test split, numerical attributes undergo multivariate imputation and robust scaling, and categorical variables are one-hot encoded, with the pipeline fitted on the training split only. Finally, inverse mappings preserve the link between encoded and original values (prices, years and administrative units) so that model outputs remain interpretable.

#### 0. Preservation #### 

In [67]:
org_hb = df.copy()

#### 1. Unnecessary Feature Removal ####

In [68]:
org_hb = org_hb.drop(columns = ["title", "id", "detail_url"])

#### 2. Aggregation ####

In [69]:
#Split the composite location attribute into separate columns
org_hb[["ward", "province"]] = org_hb["location"].str.split(",", n = 1, expand = True)

#Remove leading/trailing spaces
org_hb["ward"] = org_hb["ward"].str.strip()
org_hb["province"] = org_hb["province"].str.strip()

#Remove redundant location column
org_hb = org_hb.drop(columns = "location")

#### 3. Timeline Separation ####

In [70]:
#Convert the relative timeline (hours before the reference date) into a calendar date

#Fixed reference date for the conversion
reference_date = datetime(2025, 9, 21)


# Step 1: Convert timeline_hours to datetime
org_hb['date'] = org_hb['timeline_hours'].apply(lambda h: reference_date - timedelta(hours=h))

# Step 2: Extract day, month, year
org_hb['day'] = org_hb['date'].dt.day
org_hb['month'] = org_hb['date'].dt.month
org_hb['year'] = org_hb['date'].dt.year

# Remove the redundant date
org_hb = org_hb.drop(columns = ["timeline_hours", "date"])
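
As a quick sanity check on the conversion above, the sketch below applies the same subtraction to three hand-picked offsets (2, 3600 and 8760 hours, values that also appear in the `df.describe()` output). It uses the vectorised `pd.to_timedelta` instead of a row-wise `apply`, which is an alternative formulation and not the notebook's own code; the `toy` frame is hypothetical.

```python
import pandas as pd

# Same reference point as the cell above
reference_date = pd.Timestamp("2025-09-21")

# Hypothetical offsets: 2 hours, 150 days and 365 days before the reference
toy = pd.DataFrame({"timeline_hours": [2, 3600, 8760]})

# Vectorised equivalent of reference_date - timedelta(hours=h)
toy["date"] = reference_date - pd.to_timedelta(toy["timeline_hours"], unit="h")

# Split the reconstructed date into its components
toy["day"] = toy["date"].dt.day
toy["month"] = toy["date"].dt.month
toy["year"] = toy["date"].dt.year

print(toy[["timeline_hours", "day", "month", "year"]])
```

Because the reference date is midnight, even a 2-hour offset lands on the previous calendar day, which matches the `20,9,2025` values in the first rows of the output.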

In [71]:
org_hb

Unnamed: 0,area_m2,bedrooms,bathrooms,floors,frontage,price_million_vnd,ward,province,day,month,year
0,50.0,4.0,4.0,6.0,False,12500.0,Long Biên,Hà Nội,20,9,2025
1,98.0,4.0,4.0,3.0,True,5500.0,Bình Chánh,Hồ Chí Minh,20,9,2025
2,62.0,4.0,5.0,6.0,False,8740.0,Gò Vấp,Hồ Chí Minh,20,9,2025
3,39.0,3.0,4.0,6.0,False,6500.0,Long Biên,Hà Nội,20,9,2025
4,45.0,3.0,3.0,7.0,True,16980.0,Cầu Giấy,Hà Nội,20,9,2025
...,...,...,...,...,...,...,...,...,...,...,...
66907,65.0,3.0,3.0,3.0,False,10200.0,Hoàng Mai,Hà Nội,25,12,2024
66908,51.0,,,3.0,True,16200.0,Hai Bà Trưng,Hà Nội,25,12,2024
66909,37.0,2.0,2.0,,False,10300.0,Hoàng Mai,Hà Nội,25,12,2024
66910,69.0,1.0,1.0,5.0,False,3990.0,Quận 12,Hồ Chí Minh,25,12,2024


#### 4. Invalid Attributes & Outliers Removal ####

In [72]:
# Drop rows where bedrooms, bathrooms and floors are all missing
detail_cols = ["bedrooms", "bathrooms", "floors"]
mask_all_missing = org_hb[detail_cols].isna().all(axis=1)
org_hb_pf = org_hb.loc[~mask_all_missing]

# Among rows with no area_m2, drop those missing at least 2 of the 3 detail columns
area_missing = org_hb_pf.loc[org_hb_pf["area_m2"].isna()]
mask_drop = area_missing[detail_cols].isna().sum(axis=1) >= 2
org_hb_strict = org_hb_pf.drop(index=area_missing.loc[mask_drop].index)

# Apply the same rule to rows with no price
mask_drop = (
    org_hb_strict["price_million_vnd"].isna()
    & (org_hb_strict[detail_cols].isna().sum(axis=1) >= 2)
)
org_hb_pf2 = org_hb_strict.loc[~mask_drop]

In [73]:
org_hb_pf2

Unnamed: 0,area_m2,bedrooms,bathrooms,floors,frontage,price_million_vnd,ward,province,day,month,year
0,50.0,4.0,4.0,6.0,False,12500.0,Long Biên,Hà Nội,20,9,2025
1,98.0,4.0,4.0,3.0,True,5500.0,Bình Chánh,Hồ Chí Minh,20,9,2025
2,62.0,4.0,5.0,6.0,False,8740.0,Gò Vấp,Hồ Chí Minh,20,9,2025
3,39.0,3.0,4.0,6.0,False,6500.0,Long Biên,Hà Nội,20,9,2025
4,45.0,3.0,3.0,7.0,True,16980.0,Cầu Giấy,Hà Nội,20,9,2025
...,...,...,...,...,...,...,...,...,...,...,...
66907,65.0,3.0,3.0,3.0,False,10200.0,Hoàng Mai,Hà Nội,25,12,2024
66908,51.0,,,3.0,True,16200.0,Hai Bà Trưng,Hà Nội,25,12,2024
66909,37.0,2.0,2.0,,False,10300.0,Hoàng Mai,Hà Nội,25,12,2024
66910,69.0,1.0,1.0,5.0,False,3990.0,Quận 12,Hồ Chí Minh,25,12,2024
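
The three row-filtering rules in the cell above can be traced on a handful of rows. The sketch below is a hypothetical toy frame with one row per case; it combines the rules into a single boolean mask, which gives the same result as applying them one after another because each rule only looks at its own row.

```python
import numpy as np
import pandas as pd

detail_cols = ["bedrooms", "bathrooms", "floors"]

# Hypothetical listings covering each rule
toy = pd.DataFrame({
    "area_m2":           [50.0, np.nan, np.nan, 60.0, 70.0],
    "price_million_vnd": [900.0, 800.0, 700.0, np.nan, 600.0],
    "bedrooms":          [np.nan, 2.0, np.nan, np.nan, np.nan],
    "bathrooms":         [np.nan, 1.0, np.nan, 2.0, np.nan],
    "floors":            [np.nan, np.nan, 3.0, np.nan, 4.0],
})

# Number of missing detail columns per row
n_missing = toy[detail_cols].isna().sum(axis=1)

drop = (
    (n_missing == len(detail_cols))                         # all details missing
    | (toy["area_m2"].isna() & (n_missing >= 2))            # no area, 2+ details missing
    | (toy["price_million_vnd"].isna() & (n_missing >= 2))  # no price, 2+ details missing
)

kept = toy.loc[~drop]
print(kept.index.tolist())
```

Row 4 is kept even though two detail columns are missing, because both its area and its price are present.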


In [82]:
# Keep rows with a strictly positive price (this also drops rows whose price is missing)
df = org_hb_pf2[org_hb_pf2["price_million_vnd"] > 0].copy()

# Upper quantile cutoffs for extreme values
floor_cutoff = df["floors"].quantile(0.995)
area_cutoff = df["area_m2"].quantile(0.98)
bed_cutoff = df["bedrooms"].quantile(0.99)
bath_cutoff = df["bathrooms"].quantile(0.99)

# Keep rows at or below each cutoff; missing values are kept for later imputation
df = df.loc[
    ((df["floors"] <= floor_cutoff) | (df["floors"].isna())) &
    ((df["area_m2"] <= area_cutoff) | (df["area_m2"].isna())) &
    ((df["bedrooms"] <= bed_cutoff) | (df["bedrooms"].isna())) &
    ((df["bathrooms"] <= bath_cutoff) | (df["bathrooms"].isna()))
]

# Store the result as an independent copy so later column assignments do not act on a slice
cleaned_hb = df.copy()

In [83]:
cleaned_hb.describe()

Unnamed: 0,area_m2,bedrooms,bathrooms,floors,price_million_vnd,day,month,year
count,46440.0,52432.0,49073.0,46102.0,55934.0,55934.0,55934.0,55934.0
mean,74.233441,4.12603,3.984737,4.32029,12328.114743,22.84176,7.123092,2024.277273
std,53.986437,2.66877,2.743406,1.863598,27908.533225,2.417917,3.13226,0.832887
min,1.0,1.0,1.0,0.0,1.0,7.0,1.0,2022.0
25%,41.0,3.0,2.0,3.0,4800.0,22.0,5.0,2024.0
50%,60.0,4.0,3.0,4.0,7200.0,23.0,9.0,2024.0
75%,86.0,5.0,5.0,6.0,12500.0,24.0,9.0,2025.0
max,436.0,23.0,23.0,11.0,950000.0,31.0,12.0,2025.0
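
The `| isna()` clauses in the outlier filter matter because any comparison with `NaN` evaluates to `False`. The sketch below shows the difference on a hypothetical five-row column; the 0.75 quantile is chosen only to make the arithmetic easy to follow and is not one of the notebook's thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one extreme entry
toy = pd.DataFrame({"floors": [2.0, 3.0, 4.0, np.nan, 500.0]})

# quantile() skips NaN, so the cutoff comes from the 4 observed values
cutoff = toy["floors"].quantile(0.75)

# A plain comparison silently drops the NaN row (NaN <= x is False)
strict = toy.loc[toy["floors"] <= cutoff]

# Adding the isna() clause keeps it for later imputation
lenient = toy.loc[(toy["floors"] <= cutoff) | toy["floors"].isna()]

print(cutoff, len(strict), len(lenient))
```

Without the extra clause, every listing with a missing value in any filtered column would be discarded before imputation had a chance to fill it.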


In [84]:
# Log-transform price
cleaned_hb["log_price"] = np.log1p(cleaned_hb["price_million_vnd"])

price_million_bak = cleaned_hb["price_million_vnd"]
cleaned_hb = cleaned_hb.drop(columns = ["price_million_vnd"])

In [85]:
cleaned_hb.describe()

Unnamed: 0,area_m2,bedrooms,bathrooms,floors,day,month,year,log_price
count,46440.0,52432.0,49073.0,46102.0,55934.0,55934.0,55934.0,55934.0
mean,74.233441,4.12603,3.984737,4.32029,22.84176,7.123092,2024.277273,8.893177
std,53.986437,2.66877,2.743406,1.863598,2.417917,3.13226,0.832887,1.08559
min,1.0,1.0,1.0,0.0,7.0,1.0,2022.0,0.693147
25%,41.0,3.0,2.0,3.0,22.0,5.0,2024.0,8.47658
50%,60.0,4.0,3.0,4.0,23.0,9.0,2024.0,8.881975
75%,86.0,5.0,5.0,6.0,24.0,9.0,2025.0,9.433564
max,436.0,23.0,23.0,11.0,31.0,12.0,2025.0,13.764218
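
The `log_price` extremes in the table can be reproduced by hand from the price extremes reported earlier (1 and 950000 million VND). The short sketch below does that and confirms that `np.expm1` undoes `np.log1p`; it is only a check on the arithmetic, using values taken from the two `describe()` outputs.

```python
import numpy as np

# Minimum and maximum prices (million VND) reported by describe() before the transform
prices = np.array([1.0, 950000.0])

log_prices = np.log1p(prices)      # log(1 + price)
restored = np.expm1(log_prices)    # exp(x) - 1, the inverse of log1p

print(np.round(log_prices, 6))
print(np.allclose(restored, prices))
```

The transform compresses a range spanning almost six orders of magnitude into roughly 0.69 to 13.76, which is why the standard deviation drops from about 27909 to about 1.09.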


#### 5. Feature Encoding ####

#### Target & Feature Separation ####

In [86]:
target_col = "log_price"

y = cleaned_hb[target_col].copy()
X = cleaned_hb.drop(columns = [target_col])

#### Hierarchical Encoding ####

In [87]:
# Ensure no records are NaN / Null
X[["ward", "province"]] = X[["ward", "province"]].fillna("Unknown")

# Create combined key
X["ward_province"] = X["ward"] + " | " + X["province"]

# Frequency counts
combo_counts = X["ward_province"].value_counts()

MIN_TARGET = 15
rare_combos = combo_counts[combo_counts < MIN_TARGET].index

# Collapse rare combinations to province-level
X["ward_province_clean"] = X["ward_province"]
X.loc[
    X["ward_province"].isin(rare_combos),
    "ward_province_clean"
] = "OtherWard | " + X["province"]

# Diagnostic: share of distinct ward-province combinations that were collapsed
collapse_rate = len(rare_combos) / X["ward_province"].nunique()
print(f"Rare category collapse rate: {collapse_rate:.2f}")

Rare category collapse rate: 0.61
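
The printed rate counts distinct ward-province categories, not listings, so 0.61 does not mean that 61% of rows were relabelled. The sketch below makes the distinction concrete on a hypothetical six-row frame; `MIN_TARGET = 3` is an illustrative threshold (the notebook uses 15), and `Series.where` is used as a compact equivalent of the `.loc` assignment.

```python
import pandas as pd

# Hypothetical listings: one frequent ward and two rare ones
toy = pd.DataFrame({
    "ward": ["A"] * 4 + ["B", "C"],
    "province": ["P1"] * 5 + ["P2"],
})
toy["ward_province"] = toy["ward"] + " | " + toy["province"]

MIN_TARGET = 3  # illustrative threshold; the notebook uses 15
counts = toy["ward_province"].value_counts()
rare = counts[counts < MIN_TARGET].index

# Keep frequent combinations, replace rare ones with a province-level bucket
toy["ward_province_clean"] = toy["ward_province"].where(
    ~toy["ward_province"].isin(rare),
    "OtherWard | " + toy["province"],
)

category_rate = len(rare) / toy["ward_province"].nunique()
row_rate = toy["ward_province"].isin(rare).mean()
print(f"categories collapsed: {category_rate:.2f}, rows affected: {row_rate:.2f}")
```

Here two of three categories are collapsed while only a third of the rows change, so reporting both figures gives a fuller picture of how much information the consolidation removes.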


#### Boolean Encoding ####

In [88]:
X["frontage"] = X["frontage"].fillna(False).astype(int)

#### Cyclical Encoding ####

In [89]:
def encode_day_cyclic(df):
    # Number of days in each row's month (leap years are handled by calendar.monthrange)
    days_in_month = df.apply(lambda row : calendar.monthrange(int(row['year']), int(row['month']))[1], axis = 1)

    # Cyclical encoding
    df["day_sin"] = np.sin(2 * np.pi * df["day"] / days_in_month)
    df["day_cos"] = np.cos(2 * np.pi * df["day"] / days_in_month)
    
    return df

# Preserve original date (mapping)
org_date = X[["day", "month"]].copy().to_dict(orient = "index")

# Day encoding
X = encode_day_cyclic(X)

# Month encoding
X["month_sin"] = np.sin(2 * np.pi * X["month"] / 12)
X["month_cos"] = np.cos(2 * np.pi * X["month"] / 12)

# Exclude redundant columns
X = X.drop(columns = ["day", "month"])
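
The point of the sine/cosine pair is that the end of a cycle sits next to its beginning, which a raw month number cannot express (12 and 1 are eleven units apart). The sketch below checks this for months using the same formula as the cell above; the helper names `month_to_circle` and `circle_distance` are hypothetical and exist only for this illustration.

```python
import numpy as np

def month_to_circle(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(m1, m2):
    """Euclidean distance between two encoded months."""
    return float(np.linalg.norm(month_to_circle(m1) - month_to_circle(m2)))

dec_jan = circle_distance(12, 1)   # neighbours across the year boundary
jan_feb = circle_distance(1, 2)    # neighbours within the year
jan_jul = circle_distance(1, 7)    # opposite sides of the circle

print(round(dec_jan, 4), round(jan_feb, 4), round(jan_jul, 4))
```

December to January is exactly as far as January to February, while months half a year apart are at the maximum distance of 2.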

#### Missing Indicators (Before Imputation) ####

In [90]:
num_impute_cols = ["area_m2", "bedrooms", "bathrooms", "floors"]

for col in num_impute_cols:
    X[f"{col}_missing"] = X[col].isna().astype(int)

#### Primary Encoding ####

In [91]:
cat_cols = ["province", "ward_province_clean"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing Pipeline
prepipeline = PreprocessingPipeline([
    MICEImputation(
        columns=num_impute_cols,
        max_iter=5,
        random_state=42,
        n_nearest_features=5
    ),
    RobustScaling(columns=["area_m2", "bedrooms", "bathrooms", "floors", "year"]),
    OneHotEncoding(cat_cols)
])

# Fit only on train
X_train_enc = prepipeline.fit_transform(X_train)

# Transform test using the same fitted pipeline
X_test_enc = prepipeline.transform(X_test)

# Reattach target
train_df = X_train_enc.copy()
train_df[target_col] = y_train.values

test_df = X_test_enc.copy()
test_df[target_col] = y_test.values

# Drop redundant original categorical columns
train_df = train_df.drop(columns=["ward", "ward_province"])
test_df = test_df.drop(columns=["ward", "ward_province"])
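
Fitting on the training split only has one practical consequence for one-hot encoding: the test split may contain categories the encoder never saw. The project's `OneHotEncoding` class is not shown, so how it treats that case is unknown; the sketch below is a pandas-only illustration of one common convention (align the test columns to the training columns), with hypothetical province labels.

```python
import pandas as pd

# Hypothetical split: "P3" appears only in the test rows
train = pd.DataFrame({"province": ["P1", "P2", "P1"]})
test = pd.DataFrame({"province": ["P2", "P3"]})

# "Fit": learn the dummy columns from the training rows only
train_enc = pd.get_dummies(train, columns=["province"], dtype=int)
train_columns = train_enc.columns

# "Transform": encode the test rows, then align them to the training columns.
# Unseen categories are dropped and absent columns are filled with 0.
test_enc = pd.get_dummies(test, columns=["province"], dtype=int)
test_enc = test_enc.reindex(columns=train_columns, fill_value=0)

print(test_enc)
```

Under this convention a listing from an unseen province becomes an all-zero row across the province columns, so the model receives the same feature layout for both splits.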

In [94]:
train_df

Unnamed: 0,area_m2,bedrooms,bathrooms,floors,frontage,year,day_sin,day_cos,month_sin,month_cos,...,ward_province_clean_OtherWard | Phú Yên,ward_province_clean_OtherWard | Bến Tre,ward_province_clean_OtherWard | Quảng Trị,ward_province_clean_OtherWard | Quảng Nam,ward_province_clean_OtherWard | Vĩnh Long,ward_province_clean_OtherWard | Đồng Tháp,ward_province_clean_OtherWard | Hòa Bình,ward_province_clean_OtherWard | Điện Biên,ward_province_clean_OtherWard | Sóc Trăng,log_price
59338,0.504732,0.520792,0.381793,0.252704,0,1.0,-9.377521e-01,0.347305,1.000000e+00,6.123234e-17,...,0,0,0,0,0,0,0,0,0,8.575651
23002,0.937359,-0.520792,0.381793,1.000000,1,1.0,-9.510565e-01,0.309017,8.660254e-01,-5.000000e-01,...,0,0,0,0,0,0,0,0,0,10.609082
21821,0.384558,0.000000,0.381793,0.500000,0,1.0,-9.510565e-01,0.309017,8.660254e-01,-5.000000e-01,...,0,0,0,0,0,0,0,0,0,8.909370
55302,0.624906,0.520792,0.381793,0.500000,0,1.0,-9.884683e-01,0.151428,5.000000e-01,-8.660254e-01,...,0,0,0,0,0,0,0,0,0,10.571317
39338,-1.057533,-0.520792,0.048459,-1.500000,0,0.0,-9.510565e-01,-0.309017,-1.000000e+00,-1.836970e-16,...,0,0,0,0,0,0,0,0,0,8.146419
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54043,-0.360523,-1.041585,-0.618207,-0.156714,0,1.0,-9.945219e-01,0.104528,1.224647e-16,-1.000000e+00,...,0,0,0,0,0,0,0,0,0,8.071219
65082,0.913324,0.000000,0.048459,-0.500000,1,1.0,-9.884683e-01,0.151428,5.000000e-01,8.660254e-01,...,0,0,0,0,0,0,0,0,0,9.752723
44519,0.613001,-0.520792,-0.284874,-1.000000,0,-1.0,-9.945219e-01,-0.104528,-1.000000e+00,-1.836970e-16,...,0,0,0,0,0,0,0,0,0,8.779711
880,-0.432627,1.562377,0.048459,0.000000,0,1.0,-2.449294e-16,1.000000,-8.660254e-01,-5.000000e-01,...,0,0,0,0,0,0,0,0,0,8.764210


#### 6. Target Mapping ####

In [95]:
# Inverse transformations 
def inverse_log_price(log_price):
    """Convert log1p-transformed price back to original."""
    return np.expm1(log_price)

def inverse_year(y_scaled):
    """Convert scaled year back to original year."""
    year_center = X_train["year"].median()
    year_scale = X_train["year"].quantile(0.75) - X_train["year"].quantile(0.25)
    return y_scaled * year_scale + year_center

# Ward / Province Feature Mapping 
ward_province_features = [
    col for col in train_df.columns
    if col.startswith("ward_province_clean_")
]

ward_province_feature_map = {
    col: col.replace("ward_province_clean_", "")
    for col in ward_province_features
}

# Original Prices 
org_price_train = inverse_log_price(train_df["log_price"]).to_list()
org_price_test = inverse_log_price(test_df["log_price"]).to_list()

# Original Years (the inverse is applied to the scaled column, assuming X_train itself was left unscaled)
original_year_train = inverse_year(train_df["year"]).to_list()
original_year_test = inverse_year(test_df["year"]).to_list()
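
The year inverse relies on robust scaling being `(x - median) / IQR` with statistics taken from the training split. That is an assumption about the project's `RobustScaling` class, whose code is not shown. Under that assumption, the NumPy sketch below shows the round trip on hypothetical years; the helper names are illustrative.

```python
import numpy as np

# Hypothetical training and test years
train_years = np.array([2022, 2023, 2024, 2024, 2025, 2025], dtype=float)
test_years = np.array([2023, 2025], dtype=float)

# Statistics learned from the training split only
center = np.median(train_years)
q75, q25 = np.percentile(train_years, [75, 25])
scale = q75 - q25

def robust_scale(values):
    """(x - median) / IQR, using the training statistics."""
    return (values - center) / scale

def robust_unscale(scaled):
    """Inverse of robust_scale."""
    return scaled * scale + center

scaled_test = robust_scale(test_years)
restored_test = robust_unscale(scaled_test)
print(scaled_test, restored_test)
```

The inverse only recovers calendar years when it receives scaled values; feeding it years that were never scaled would shift and stretch them instead.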

### 📥 Savestates 🔖 ###

Prior to model training, reproducible savestates are created to maintain consistency, portability and interpretability across machine learning workflows. The fully preprocessed training and testing datasets can be exported in several formats (CSV is used here), while auxiliary label and inverse-mapping metadata are saved separately to preserve the links between encoded and original features. The fitted preprocessing pipeline is also persisted as a timestamped artifact, which allows identical transformations to be reapplied during model deployment or future experimentation. Together, these savestates establish a reliable boundary between preprocessing and model experiments and support reproducibility.

#### Interdisciplinary ML Dataset Versions ####

In [96]:
# Save a dataframe in one or more formats and optionally save the label mappings
def save_dataset(df, filename, folder="data/processed", formats=None, label_maps=None):
    # Resolve target folder relative to project root
    root = Path.cwd().resolve().parent          # catalyst/
    target_folder = root / folder               # catalyst/data/processed
    target_folder.mkdir(parents=True, exist_ok=True)

    # Save dataset in requested formats (nothing is written if none are given)
    for fmt in formats or []:
        path = target_folder / f"{filename}.{fmt}"
        if fmt == "csv":
            df.to_csv(path, index=False)
        elif fmt == "parquet":
            df.to_parquet(path, index=False)
        elif fmt == "xlsx":
            df.to_excel(path, index=False)
        else:
            raise ValueError(f"Unsupported format: {fmt}")
        print(f"Saving Completed: {filename} -> {fmt}")

    # Save label mappings if provided
    if label_maps:
        for col, mapping in label_maps.items():
            map_path = target_folder / f"{filename}_{col}_mapping.json"
    
            # Normalize keys to JSON-safe native types
            clean_mapping = {
                int(k) if isinstance(k, (int, float)) else str(k): v
                for k, v in mapping.items()
            }
    
            with open(map_path, "w", encoding="utf-8") as f:
                json.dump(clean_mapping, f, ensure_ascii=False, indent=4)
    
            print(f"Mapping Saved: {col} -> {map_path}")

In [97]:
label_maps = {
    "ward_province": ward_province_feature_map,
    "price_million_vnd_train": {i: p for i, p in enumerate(org_price_train)},
    "price_million_vnd_test": {i: p for i, p in enumerate(org_price_test)},
    "year_train": {i: y for i, y in enumerate(original_year_train)},
    "year_test": {i: y for i, y in enumerate(original_year_test)},
    "day_month": org_date
}

# Save train and test 
save_dataset(train_df, "train_df", formats=["csv"], label_maps=label_maps)
save_dataset(test_df, "test_df", formats=["csv"])

# Path to save
export_path = Path("..") / "models"
export_path.mkdir(parents=True, exist_ok=True)

# Timestamped filename
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
model_path = export_path / f"preprocessing_{timestamp}.pkl"

# Save fitted pipeline
joblib.dump(prepipeline, model_path)

print(f"Preprocessing pipeline successfully saved to {model_path}")

Saving Completed: train_df -> csv
Mapping Saved: ward_province -> C:\Users\Admin\Downloads\catalyst-pre\data\processed\train_df_ward_province_mapping.json
Mapping Saved: price_million_vnd_train -> C:\Users\Admin\Downloads\catalyst-pre\data\processed\train_df_price_million_vnd_train_mapping.json
Mapping Saved: price_million_vnd_test -> C:\Users\Admin\Downloads\catalyst-pre\data\processed\train_df_price_million_vnd_test_mapping.json
Mapping Saved: year_train -> C:\Users\Admin\Downloads\catalyst-pre\data\processed\train_df_year_train_mapping.json
Mapping Saved: year_test -> C:\Users\Admin\Downloads\catalyst-pre\data\processed\train_df_year_test_mapping.json
Mapping Saved: day_month -> C:\Users\Admin\Downloads\catalyst-pre\data\processed\train_df_day_month_mapping.json
Saving Completed: test_df -> csv
Preprocessing pipeline successfully saved to ..\models\preprocessing_20251230_163143.pkl
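
One detail to keep in mind when the mapping files are read back: JSON object keys are always strings, so the integer row positions used as keys in the price and year mappings come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip with the standard library on a hypothetical two-entry mapping.

```python
import json

# Hypothetical mapping from row position to original price (million VND)
price_map = {0: 12500.0, 1: 5500.0}

# Integer keys are converted to strings when the mapping is serialised
payload = json.dumps(price_map)
loaded = json.loads(payload)

# Restore integer keys after loading
restored = {int(k): v for k, v in loaded.items()}

print(list(loaded.keys()), list(restored.keys()))
```

Converting the keys back right after loading keeps later lookups by row position from failing silently.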


### ❯❯❯❯  Coming Soon: Model Experiments ⚗️ ###