## 📘 HDB Resale Flat Prices

### 📌 Notebook Description

- **Team:** Team A  
- **Members:** Ben, Shazlin, Alan  
- **Project Name:** HDB Resale Flat Data Engineering Pipeline
- **Description:** Implements automated data ingestion from data.gov.sg and performs dataset merging to produce a unified, analysis-ready dataset.
- **Data Artifacts:**  
    - `/DataLake/<raw files>`  
    - `/Staging/Main.csv`

### 📦 Import Required Libraries

In [1]:
import pandas as pd

#---Customized-----------------------------------------
import control_output
pd.set_option("display.float_format", "{:,.2f}".format)
control_output.css

### 🔧 Data Processing

In [2]:
#hdb_data = pd.read_csv('datasets/Main.csv', low_memory=False, usecols=lambda col: not col.startswith("Unnamed"))
hdb_data = pd.read_csv('../Project-HDB-Store/staging/Main.csv', low_memory=False, usecols=lambda col: not col.startswith("Unnamed"))
hdb_data = hdb_data.rename(columns={'month': 'year_month'})
hdb_data["year_month"] = pd.to_datetime(hdb_data["year_month"], format="%Y-%m-%d")
hdb_data.set_index('year_month', inplace=True)

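# Important: label-based date slicing is only reliable on a sorted DatetimeIndex
hdb_data = hdb_data.sort_index()

# Keep transactions from January 2000 through December 2022 (both ends inclusive)
hdb_data_final = hdb_data.loc['2000-01':'2022-12']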

df_stat_monthly = pd.read_csv('../Project-HDB-Store/staging/stat_monthly.csv',
    index_col="year_month",      # use year_month as index during load
    parse_dates=["year_month"]   # convert to datetime automatically
)

df_stat_yearly = pd.read_csv('../Project-HDB-Store/staging/stat_yearly.csv',
    index_col="year",      # use year as index during load
    parse_dates=["year"]   # convert to datetime automatically
)

df_merged = hdb_data_final.join(df_stat_monthly, how="left")
#df_merged = df_merged.join(df_stat_yearly, how="left")
#df_merged.to_csv("staging/Main_final.csv")
#df_merged.to_csv("../Project-HDB-Store/staging/Main_final.csv", index=True, index_label="year_month")
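
The left join above keeps one row per transaction only if the monthly statistics have a unique `year_month` index. A minimal sketch of that sanity check follows, using toy stand-ins for `hdb_data_final` and `df_stat_monthly`; the names `transactions` and `monthly_stats` and the tiny frames are assumptions for illustration, not the project's data.

```python
import pandas as pd

# Toy stand-ins for hdb_data_final and df_stat_monthly (illustrative only)
transactions = pd.DataFrame(
    {"resale_price": [147000.0, 382500.0, 166000.0]},
    index=pd.to_datetime(["2000-01-01", "2000-01-01", "2000-02-01"]).rename("year_month"),
)
monthly_stats = pd.DataFrame(
    {"birth": [3585]},
    index=pd.to_datetime(["2000-01-01"]).rename("year_month"),
)

# A left join preserves the transaction count only if the right-hand index is unique
assert monthly_stats.index.is_unique

joined = transactions.join(monthly_stats, how="left")

# The row count should match the left frame exactly
rows_preserved = len(joined) == len(transactions)

# Months without statistics appear as NaN in the joined columns
no_stats = joined["birth"].isna().to_numpy()
unmatched_months = joined.index[no_stats].unique()

print(rows_preserved, unmatched_months.strftime("%Y-%m").tolist())
```

Running the same two checks against the real frames before writing `Main_final.csv` would surface duplicated or missing months early.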

### 📋 Sample Data (head)

In [3]:
df_merged.head()

year_month,town,flat_type,flat_model,floor_area_sqm,street_name,resale_price,lease_commence_date,storey_range,block,remaining_lease,...,full_address,lat,long,nearest_mrt,nearest_distance_to_mrt,remaining_years,price_per_sqm,birth,marriages,divorces
2000-01-01,ANG MO KIO,3RM,IMPROVED,69.0,ANG MO KIO AVE 4,147000.0,1986,07 TO 09,170,85,...,170 ANG MO KIO AVENUE 4 KEBUN BARU LINK 1 SING...,1.37,103.84,mayflower,0.28,85,2130.43,3585,1602,396
2000-01-01,BISHAN,4RM,MODEL A,113.0,BISHAN ST 22,382500.0,1992,07 TO 09,249,91,...,249 BISHAN STREET 22 SINGAPORE 570249,1.36,103.84,ang mo kio,1.16,91,3384.96,3585,1602,396
2000-01-01,BEDOK,3RM,IMPROVED,68.0,NEW UPP CHANGI RD,166000.0,1985,01 TO 03,57,84,...,57 NEW UPPER CHANGI ROAD SINGAPORE 461057,1.32,103.94,tanah merah,0.62,84,2441.18,3585,1602,396
2000-01-01,PASIR RIS,4RM,MODEL A,110.0,PASIR RIS ST 53,338000.0,1995,16 TO 18,574,94,...,574 PASIR RIS STREET 53 SINGAPORE 510574,1.37,103.95,pasir ris,0.38,94,3072.73,3585,1602,396
2000-01-01,PASIR RIS,4RM,MODEL A,108.0,PASIR RIS ST 53,307800.0,1995,04 TO 06,575,94,...,575 PASIR RIS STREET 53 SINGAPORE 510575,1.37,103.95,pasir ris,0.33,94,2850.0,3585,1602,396


### 📋 Check Retrieved Dates

In [4]:
sorted_dates = df_merged.index.unique()
datexs = list((sorted_dates[:12].values)) + list((sorted_dates[-12:].values))

for d in datexs:
    print(d)

2000-01-01T00:00:00.000000000
2000-02-01T00:00:00.000000000
2000-03-01T00:00:00.000000000
2000-04-01T00:00:00.000000000
2000-05-01T00:00:00.000000000
2000-06-01T00:00:00.000000000
2000-07-01T00:00:00.000000000
2000-08-01T00:00:00.000000000
2000-09-01T00:00:00.000000000
2000-10-01T00:00:00.000000000
2000-11-01T00:00:00.000000000
2000-12-01T00:00:00.000000000
2022-01-01T00:00:00.000000000
2022-02-01T00:00:00.000000000
2022-03-01T00:00:00.000000000
2022-04-01T00:00:00.000000000
2022-05-01T00:00:00.000000000
2022-06-01T00:00:00.000000000
2022-07-01T00:00:00.000000000
2022-08-01T00:00:00.000000000
2022-09-01T00:00:00.000000000
2022-10-01T00:00:00.000000000
2022-11-01T00:00:00.000000000
2022-12-01T00:00:00.000000000
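
Printing the first and last twelve months confirms the range boundaries but not that every month in between is present. One possible way to check for gaps is sketched below on a hypothetical index with one month deliberately missing; `idx` stands in for `df_merged.index`.

```python
import pandas as pd

# Hypothetical transaction index with 2000-03 missing (illustrative only)
idx = pd.to_datetime(["2000-01-01", "2000-02-01", "2000-02-01", "2000-04-01"])

# Every month start between the first and last observed month
expected = pd.date_range(idx.min(), idx.max(), freq="MS")

# Months that are expected but never occur in the data
missing = expected.difference(idx.unique())

print(missing.strftime("%Y-%m").tolist())
```

An empty list would indicate that every month in the range has at least one transaction.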


### 📤 Output

In [5]:
print(f"Total HDB Resale Transactions: {len(df_merged)}")

Total HDB Resale Transactions: 599650


In [6]:
# Look up the yearly statistics row for every transaction.
# df_stat_yearly is indexed by January 1 of each year, so each
# transaction date is mapped to "<year>-01-01" before the lookup.
store = []
for ts in df_merged.index:
    year_key = f"{ts.year}-01-01"
    store.append(df_stat_yearly.loc[year_key].values)

# One row per transaction, in the same order as df_merged
df_year_new = pd.DataFrame(store, columns=df_stat_yearly.columns)
df_year_new

,unemployment,inflation,gdp
0,3.60,1.34,96076539925.74
1,3.60,1.34,96076539925.74
2,3.60,1.34,96076539925.74
3,3.60,1.34,96076539925.74
4,3.60,1.34,96076539925.74
...,...,...,...
599645,2.90,3.82,509017841146.56
599646,2.90,3.82,509017841146.56
599647,2.90,3.82,509017841146.56
599648,2.90,3.82,509017841146.56
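
The row-by-row lookup above does one `.loc` call per transaction. A vectorized alternative is sketched here under the assumption that the yearly table is indexed by January 1 of each year, as the lookup key suggests; `yearly` and `tx_index` are toy stand-ins whose values are taken from the output above.

```python
import pandas as pd

# Toy stand-in for df_stat_yearly (two years, values from the output above)
yearly = pd.DataFrame(
    {"unemployment": [3.60, 2.90], "inflation": [1.34, 3.82]},
    index=pd.to_datetime(["2000-01-01", "2022-01-01"]).rename("year"),
)

# Toy stand-in for df_merged.index
tx_index = pd.to_datetime(["2000-01-01", "2000-05-01", "2022-12-01"])

# Map every transaction month to January 1 of its year
year_keys = tx_index.to_period("Y").to_timestamp()

# reindex repeats the yearly row for each transaction;
# a year absent from `yearly` would yield NaN instead of raising KeyError
yearly_per_tx = yearly.reindex(year_keys).reset_index(drop=True)

print(yearly_per_tx)
```

The result has one row per transaction in the original order, which is the shape the later column-wise concatenation expects.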


### Fix for Merging Yearly Statistics

In [9]:
# Reset the index so both frames share the default 0..n-1 RangeIndex;
# pd.concat(axis=1) aligns rows on index labels, not on position.
df1_reset = df_merged.reset_index()

df_mergedx = pd.concat([df1_reset, df_year_new], axis=1)
df_mergedx.set_index('year_month', inplace=True)
df_mergedx

year_month,town,flat_type,flat_model,floor_area_sqm,street_name,resale_price,lease_commence_date,storey_range,block,remaining_lease,...,nearest_mrt,nearest_distance_to_mrt,remaining_years,price_per_sqm,birth,marriages,divorces,unemployment,inflation,gdp
2000-01-01,ANG MO KIO,3RM,IMPROVED,69.00,ANG MO KIO AVE 4,147000.00,1986,07 TO 09,170,85,...,mayflower,0.28,85,2130.43,3585,1602,396,3.60,1.34,96076539925.74
2000-01-01,BISHAN,4RM,MODEL A,113.00,BISHAN ST 22,382500.00,1992,07 TO 09,249,91,...,ang mo kio,1.16,91,3384.96,3585,1602,396,3.60,1.34,96076539925.74
2000-01-01,BEDOK,3RM,IMPROVED,68.00,NEW UPP CHANGI RD,166000.00,1985,01 TO 03,57,84,...,tanah merah,0.62,84,2441.18,3585,1602,396,3.60,1.34,96076539925.74
2000-01-01,PASIR RIS,4RM,MODEL A,110.00,PASIR RIS ST 53,338000.00,1995,16 TO 18,574,94,...,pasir ris,0.38,94,3072.73,3585,1602,396,3.60,1.34,96076539925.74
2000-01-01,PASIR RIS,4RM,MODEL A,108.00,PASIR RIS ST 53,307800.00,1995,04 TO 06,575,94,...,pasir ris,0.33,94,2850.00,3585,1602,396,3.60,1.34,96076539925.74
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-01,WOODLANDS,3RM,MODEL A,68.00,WOODLANDS DR 62,470000.00,2018,10 TO 12,694A,94 years 10 months,...,admiralty,0.41,95,6911.76,2825,3540,486,2.90,3.82,509017841146.56
2022-12-01,HOUGANG,4RM,MODEL A,103.00,HOUGANG AVE 10,590000.00,1987,07 TO 09,512,63 years 07 months,...,hougang,0.42,64,5728.16,2825,3540,486,2.90,3.82,509017841146.56
2022-12-01,GEYLANG,4RM,NEW GENERATION,91.00,EUNOS CRES,560000.00,1977,07 TO 09,11,53 years 05 months,...,eunos,0.26,54,6153.85,2825,3540,486,2.90,3.82,509017841146.56
2022-12-01,PASIR RIS,5RM,IMPROVED,124.00,PASIR RIS ST 11,625000.00,1993,07 TO 09,180,69 years 08 months,...,tampines east,1.08,70,5040.32,2825,3540,486,2.90,3.82,509017841146.56
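
Because `pd.concat(..., axis=1)` matches rows by index label, the column-wise merge above is correct only when both frames have the same length and the same index. A small guard along these lines could be run first; the frames and the `gdp` values here are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_merged and df_year_new (values are illustrative)
left = pd.DataFrame(
    {"town": ["BISHAN", "BEDOK"]},
    index=pd.to_datetime(["2000-01-01", "2000-01-01"]).rename("year_month"),
)
right = pd.DataFrame({"gdp": [1.0, 1.0]})  # default RangeIndex 0..1

left_reset = left.reset_index()

# Guard: same number of rows and identical index labels before concatenating
assert len(left_reset) == len(right)
assert left_reset.index.equals(right.index)

combined = pd.concat([left_reset, right], axis=1).set_index("year_month")

print(combined.shape)
```

If either assertion fails, the concatenation would pad the shorter or mismatched frame with NaN rather than raise an error.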


In [12]:
df_mergedx.to_csv("../Project-HDB-Store/staging/Main_final.csv", index=True, index_label="year_month")