# **Daegu Apartment Price Prediction Model**
---
- Sulaeman Nurhakim
- JCDS 2804-001
- Purwadhika Digital Technology School - Job Connector Data Science

# Chapter I : Business Problem
---
## *1. Background*

The city of Daegu in South Korea has experienced steady growth in population and urban development, which increases the demand for housing, especially apartments. As the demand goes up, apartment prices can vary widely depending on several factors. For property developers, investors, and potential buyers, it is important to be able to estimate the selling price of an apartment accurately.

Machine learning can be a useful tool to solve this problem by using historical data to build predictive models. As a student who is currently learning data science and machine learning, this project is a great opportunity to apply the knowledge I have learned to a real-world case. Through this project, I aim to build prediction models and find out which one works best for estimating apartment prices in Daegu.

## *2. Problem Statements*
This project will try to answer the following questions:

| **No.** | **Question**                                                                                                        |
|---------|----------------------------------------------------------------------------------------------------------------------|
| 1       | How can we use apartment data in Daegu to predict the selling price?                                                |
| 2       | Which machine learning model gives the most accurate results in predicting apartment prices?                        |
| 3       | What features or factors are most important in affecting the price?    

## *3. Goals* 
The main goals of this project are:

| **No.** | **Goal**                                                                                              |
|---------|-------------------------------------------------------------------------------------------------------|
| 1       | Build several machine learning models to predict the selling price of apartments in Daegu.           |
| 2       | Evaluate and compare the performance of each model.                                                  |
| 3       | Identify the key features that have the most impact on apartment prices.                              |

## *4. Stakeholders*
The stakeholders involved in this project include:

| **Stakeholder**              | **Role**                                                                                           |
|------------------------------|----------------------------------------------------------------------------------------------------|
| **Myself (the student)**      | Builds the models and learns from the project, gaining practical experience in data science.       |
| **Lecturer or Instructor**    | Evaluates the results, provides guidance, and facilitates the learning process.                   |
| **Property Developers and Investors (Hypothetical)** | Could benefit from price prediction models to make informed investment decisions in the real estate market. |


## *5. Analytical Approach*
This project will follow these steps:

| **Step**                  | **Description**                                                                                                 |
|---------------------------|-----------------------------------------------------------------------------------------------------------------|
| **Understanding the Data** | Review apartment dataset features such as size, location, year built, etc.                                     |
| **Data Preprocessing**    | Clean missing data, convert data types, and encode categorical variables.                                      |
| **Exploratory Data Analysis (EDA)** | Explore patterns, trends, and correlations within the dataset.                                                |
| **Model Building**        | Build models using algorithms like Linear Regression, Decision Tree, Random Forest, and Gradient Boosting.     |
| **Model Evaluation**      | Compare model performance using metrics such as MAE, RMSE, and R².                                              |
| **Result Interpretation** | Identify and interpret the most influential features affecting apartment prices.                               |




# Chapter II : Data Understanding 
---

In [1]:
# Import library

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
import missingno as msno

# Statistics
from scipy.stats import normaltest
from scipy.stats import skew

# Train Test Split
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV, KFold

# Preprocessing
import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.preprocessing import RobustScaler, MinMaxScaler, StandardScaler, PolynomialFeatures

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# ML algorithm
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, VotingRegressor, StackingRegressor
from sklearn.svm import SVR
import xgboost as xgb
from sklearn.compose import TransformedTargetRegressor

# Evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score

import warnings
warnings.filterwarnings('ignore')

from ydata_profiling import ProfileReport


In [2]:
train = pd.read_csv('data_daegu_apartment.csv')

In [3]:
train.head()

Unnamed: 0,HallwayType,TimeToSubway,SubwayStation,N_FacilitiesNearBy(ETC),N_FacilitiesNearBy(PublicOffice),N_SchoolNearBy(University),N_Parkinglot(Basement),YearBuilt,N_FacilitiesInApt,Size(sqf),SalePrice
0,terraced,0-5min,Kyungbuk_uni_hospital,0.0,3.0,2.0,1270.0,2007,10,1387,346017
1,terraced,10min~15min,Kyungbuk_uni_hospital,1.0,5.0,1.0,0.0,1986,4,914,150442
2,mixed,15min~20min,Chil-sung-market,1.0,7.0,3.0,56.0,1997,5,558,61946
3,mixed,5min~10min,Bangoge,5.0,5.0,4.0,798.0,2005,7,914,165486
4,terraced,0-5min,Sin-nam,0.0,1.0,2.0,536.0,2006,5,1743,311504


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4123 entries, 0 to 4122
Data columns (total 11 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   HallwayType                       4123 non-null   object 
 1   TimeToSubway                      4123 non-null   object 
 2   SubwayStation                     4123 non-null   object 
 3   N_FacilitiesNearBy(ETC)           4123 non-null   float64
 4   N_FacilitiesNearBy(PublicOffice)  4123 non-null   float64
 5   N_SchoolNearBy(University)        4123 non-null   float64
 6   N_Parkinglot(Basement)            4123 non-null   float64
 7   YearBuilt                         4123 non-null   int64  
 8   N_FacilitiesInApt                 4123 non-null   int64  
 9   Size(sqf)                         4123 non-null   int64  
 10  SalePrice                         4123 non-null   int64  
dtypes: float64(4), int64(4), object(3)
memory usage: 354.4+ KB


### Columns Info

| Feature Name                      | Description                                        |
|----------------------------------|----------------------------------------------------|
| Hallway Type                     | Apartment type (based on hallway structure)        |
| TimeToSubway                     | Time needed to reach the nearest subway station    |
| SubwayStation                    | Name of the nearest subway station                 |
| N_FacilitiesNearBy(ETC)          | Number of miscellaneous facilities nearby          |
| N_FacilitiesNearBy(PublicOffice)| Number of public office facilities nearby          |
| N_SchoolNearBy(University)       | Number of universities nearby                      |
| N_Parkinglot(Basement)           | Number of basement parking lots                    |
| YearBuilt                        | Year the apartment was built                       |
| N_FacilitiesInApt                | Number of facilities within the apartment complex  |
| Size(sqft)                       | Apartment size in square feet                      |
| SalePrice                        | Apartment price (in Korean Won)                    |


In [5]:
# Mengubah format penulisan pada feature TimeToSubway

train.loc[train['TimeToSubway']=='5min~10min','TimeToSubway']='5min-10min'
train.loc[train['TimeToSubway']=='10min~15min','TimeToSubway']='10min-15min'
train.loc[train['TimeToSubway']=='15min~20min','TimeToSubway']='15min-20min'

In [6]:
# Membuat laporan profiling
profile = ProfileReport(train, title="Laporan EDA", explorative=True)

# Melihat di Jupyter
profile.to_notebook_iframe()

# Atau simpan ke HTML
profile.to_file("laporan_eda.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

100%|██████████| 11/11 [00:00<00:00, 439.94it/s]


Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

# Chapter III : Explanatory Data Analysis
---
## 1. Summary Statistics

In [7]:
train.describe(include='O')

Unnamed: 0,HallwayType,TimeToSubway,SubwayStation
count,4123,4123,4123
unique,3,5,8
top,terraced,0-5min,Kyungbuk_uni_hospital
freq,2528,1953,1152


In [8]:
train.describe(include='number')

Unnamed: 0,N_FacilitiesNearBy(ETC),N_FacilitiesNearBy(PublicOffice),N_SchoolNearBy(University),N_Parkinglot(Basement),YearBuilt,N_FacilitiesInApt,Size(sqf),SalePrice
count,4123.0,4123.0,4123.0,4123.0,4123.0,4123.0,4123.0,4123.0
mean,1.930876,4.135338,2.746301,568.979141,2002.999757,5.817851,954.630851,221767.926995
std,2.198832,1.80264,1.49661,410.372742,8.905768,2.340507,383.805648,106739.839945
min,0.0,0.0,0.0,0.0,1978.0,1.0,135.0,32743.0
25%,0.0,3.0,2.0,184.0,1993.0,4.0,644.0,144752.0
50%,1.0,5.0,2.0,536.0,2006.0,5.0,910.0,209734.0
75%,5.0,5.0,4.0,798.0,2008.0,7.0,1149.0,291150.0
max,5.0,7.0,5.0,1321.0,2015.0,10.0,2337.0,585840.0


### Insight
| No. | Description |
|-----|-------------|
| 1 | The dataset contains 4,123 rows with 11 features. |
| 2 | The most common apartment type is a terraced apartment. |
| 3 | Most apartments in Daegu are quite close to a station, with the majority within 0–5 minutes walking distance. Kyungbuk Uni Hospital Station is the nearest station for most properties. |
| 4 | Apartments in Daegu have an average of 2 nearby facilities, 4 public offices, 3 universities, and 569 basement parking spaces. |
| 5 | The average apartment was built in 2003. The oldest was built in 1978, and the newest in 2015. |
| 6 | Apartments have an average of 6 internal facilities, with the highest being 10, which is considered a high number of in-building amenities. |
| 7 | Apartment sizes and prices vary widely. The average size is 954.63 sq ft and the average price is 221,767.93 won. This gives an average price of 232.31 won/sq ft. |


In [9]:
# Creating a list of numeric columns for distribution analysis
numeric_columns = train.select_dtypes(include=['float64', 'int64']).columns

# Plotting the distributions of numeric columns
plt.figure(figsize=(15, 10))

for i, column in enumerate(numeric_columns, 1):
    plt.subplot(3, 4, i)
    sns.histplot(train[column], kde=True)
    plt.title(f'Distribution of {column}')
    plt.tight_layout()

plt.show()

### Insight
| No. | Feature | Insight |
|-----|---------|---------|
| 1 | **N_FacilitiesNearBy(ETC)** | Most properties have few or no additional nearby facilities, with a few having up to 5. |
| 2 | **N_FacilitiesNearBy(PublicOffice)** | Shows broader variation, with peaks at 2 and 5 nearby public offices. |
| 3 | **N_SchoolNearBy(University)** | Most properties are near 1 or 2 universities, making them attractive for students and families. |
| 4 | **N_Parkinglot(Basement)** | Right-skewed distribution; most have few basement parking spots, but some offer a large number. |
| 5 | **YearBuilt** | Peaks around the 2000s, suggesting a boom in apartment construction during that time. |
| 6 | **N_FacilitiesInApt** | Most properties have a small number of in-apartment facilities, though some have many. |
| 7 | **Size(sqf)** | Property sizes are concentrated around smaller sizes, with only a few being very large. |
| 8 | **SalePrice** | Sale prices are right-skewed; most are low-priced, with some high-value outliers. |
| 9 | **Business Insight** | These findings can guide pricing strategies, market segmentation, and marketing plans. Agents can target preferred property types or recommend improvements to increase resale value. |


In [10]:
# Normality test feature YearBuilt, Size(Sqf), dan SalePrice

hasil=[]
for i in numeric_columns:
    stats,pval=normaltest(train[i])
    if pval>0.05:
        hasil.append('Distribusi normal')
    else:
        hasil.append('Tidak berdistribusi normal')

pd.DataFrame({'Kolom':numeric_columns, 'Distribusi':hasil})

Unnamed: 0,Kolom,Distribusi
0,N_FacilitiesNearBy(ETC),Tidak berdistribusi normal
1,N_FacilitiesNearBy(PublicOffice),Tidak berdistribusi normal
2,N_SchoolNearBy(University),Tidak berdistribusi normal
3,N_Parkinglot(Basement),Tidak berdistribusi normal
4,YearBuilt,Tidak berdistribusi normal
5,N_FacilitiesInApt,Tidak berdistribusi normal
6,Size(sqf),Tidak berdistribusi normal
7,SalePrice,Tidak berdistribusi normal


In [11]:
# Creating boxplots for the numerical variables
plt.figure(figsize=(15, 10))

for i, column in enumerate(numeric_columns, 1):
    plt.subplot(4, 2, i)
    sns.boxplot(x=train[column])
    plt.title(f'Boxplot of {column}')
    plt.xlabel(column)

plt.tight_layout()
plt.show()

In [12]:
# Creating count plots for categorical variables
categorical_columns = ['HallwayType', 'TimeToSubway', 'SubwayStation']

plt.figure(figsize=(15, 2 * len(categorical_columns)))

for i, column in enumerate(categorical_columns, 1):
    plt.subplot(len(categorical_columns), 1, i)
    sns.countplot(y=train[column], order = train[column].value_counts().index)
    plt.title(f'Count of {column}')
    plt.xlabel('Count')
    plt.ylabel(column)

plt.tight_layout()
plt.show()

### Insight
| No. | Feature | Insight |
|-----|---------|---------|
| 1 | **HallwayType** | The most common hallway type is "terraced", followed by "mixed" and "corridor". This suggests a preference for terraced hallways, possibly due to better privacy or aesthetics. |
| 2 | **TimeToSubway** | Most properties are located within a "0–5min" walk to the subway, showing that proximity to public transportation is a key factor in property selection. Properties farther from stations may be less desirable and need competitive pricing. |
| 3 | **SubwayStation** | "Kyungbuk_uni_hospital" is the most frequently nearest station, indicating its surrounding area is a prime location. Other popular stations include "Myung-duk" and "Banwoldang", while "Chil-sung-market" and "Daegu" are less commonly the closest, possibly due to fewer listings or lower demand. |
| 4 | **Business Insight** | Real estate agents can use this information to adjust pricing strategies. Properties near desirable stations or with "terraced" hallway types may command higher prices. In contrast, those farther from transport hubs may need better offers or marketing approaches. |


In [13]:
# Calculating the correlation matrix for the numeric variables in the dataset
correlation_matrix = train.select_dtypes(include='number').corr()

# Plotting the heatmap for the correlation matrix
plt.figure(figsize=(12, 8))
heatmap = sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm")
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12)
plt.show()

### Insight
| No. | Insight Area | Description |
|-----|--------------|-------------|
| 1 | **Overall Correlation Strength** | All features have a **medium correlation** (0.3–0.7) with apartment prices, indicating a meaningful relationship between each feature and the sale price. |
| 2 | **Highest Positive Correlation** | **Apartment Size** has the highest positive correlation with price, meaning that **larger apartments tend to be more expensive**. |
| 3 | **Other Positively Correlated Features** | - **Number of apartment facilities**  <br> - **Year Built**  <br> - **Number of basement parking lots** <br> These also show medium positive correlations, suggesting that **newer apartments with more amenities and parking tend to cost more**. |
| 4 | **Negatively Correlated Features** | - **Number of public office facilities nearby**  <br> - **Other nearby facilities**  <br> - **Number of nearby universities** <br> These have **medium negative correlations** with apartment prices, implying that **more nearby institutions or facilities might reduce property values**—potentially due to noise, congestion, or zoning. |
| 5 | **Multicollinearity Observed** | Strong correlations were observed between: <br> - Other nearby facilities ↔️ Nearby universities (**0.80**) <br> - Public office facilities ↔️ Nearby universities (**0.74**) <br> These strong inter-feature correlations **indicate multicollinearity**, which can distort regression model interpretations and lead to unstable coefficient estimates. |
| 6 | **Business Implication** | When using models like linear or logistic regression, **care must be taken to address multicollinearity**, perhaps by removing or combining highly correlated variables to ensure model stability and interpretability. |


In [14]:
# Prepare a list of numeric and categorical columns for analysis
numeric_columns = train.select_dtypes(include=['float64', 'int64']).columns.tolist()
numeric_columns.remove('SalePrice')  # Remove the target variable from the numeric columns list
categorical_columns = train.select_dtypes(include=['object']).columns.tolist()

# Plotting the numeric features
fig, axes = plt.subplots(nrows=len(numeric_columns), ncols=1, figsize=(18, 3*len(numeric_columns)))
for i, col in enumerate(numeric_columns):
    sns.scatterplot(data=train, x=col, y='SalePrice', ax=axes[i])
    axes[i].set_title(f'Relationship between {col} and SalePrice')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('SalePrice')

# Adjust layout
plt.tight_layout()

# Show the plots
plt.show()

In [15]:
# Generating horizontal box plots for categorical variables
for col in categorical_columns:
    plt.figure(figsize=(18, 3))
    sns.boxplot(data=train, y=col, x='SalePrice')
    plt.title(f'Relationship between {col} and SalePrice')
    plt.ylabel(col)
    plt.xlabel('SalePrice')
    plt.xticks(rotation=45)
    plt.show()


### Insight

| **Relationship**                                   | **Insights**                                                                                                                                                                                                                   |
|----------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Hallway Type and Selling Price**                 | - Properties with the "terraced" alley type have a higher median selling price compared to "mixed" and "corridor", possibly indicating higher market preference or quality.                                                       |
|                                                    | - Outliers are present, especially in "terraced" and "corridor" types, suggesting that some properties sell at significantly higher prices, likely due to unique features or desirable locations.                               |
| **Time to Train Station and Selling Price**        | - Properties within a 5-minute walk from a train station generally have higher sale prices, highlighting the importance of proximity to public transportation.                                                                   |
|                                                    | - No significant price difference is observed between properties located 10-15 minutes and 15-20 minutes from the station, indicating that 15-20 minutes is considered the upper limit of walkability.                          |
| **Train Station and Selling Price**                | - Significant variations in prices exist between different train stations, with areas near "Kyungbuk_uni_hospital" and "Myung-duk" stations showing higher price distributions, possibly indicating desirable locations.            |
|                                                    | - Stations such as "Daegu" have a lower median price and a narrower distribution, suggesting that properties here are less desirable or have fewer attractive characteristics.                                                   |


# Chapter IV : Data Preprocessing 
---

In [16]:
train_model = train.copy()

In [17]:
# Tipe data, jumlah data kosong, jumlah data unik, dan sampel data unik pada setiap featue

listItem = []
for col in train_model.columns :
    listItem.append([col, train_model[col].dtype, train_model[col].isna().sum(), round((train_model[col].isna().sum()/len(train_model[col])) * 100,2),
                    train_model[col].nunique(), list(train_model[col].drop_duplicates().sample(2).values)])

dfDesc = pd.DataFrame(columns=['Data Features', 'Data Type', 'Null', 'Null Percentage', 'Unique', 'Unique Sample'],
                     data=listItem)
dfDesc

Unnamed: 0,Data Features,Data Type,Null,Null Percentage,Unique,Unique Sample
0,HallwayType,object,0,0.0,3,"[corridor, mixed]"
1,TimeToSubway,object,0,0.0,5,"[no_bus_stop_nearby, 15min-20min]"
2,SubwayStation,object,0,0.0,8,"[Daegu, Chil-sung-market]"
3,N_FacilitiesNearBy(ETC),float64,0,0.0,4,"[5.0, 1.0]"
4,N_FacilitiesNearBy(PublicOffice),float64,0,0.0,8,"[7.0, 6.0]"
5,N_SchoolNearBy(University),float64,0,0.0,6,"[3.0, 1.0]"
6,N_Parkinglot(Basement),float64,0,0.0,20,"[18.0, 605.0]"
7,YearBuilt,int64,0,0.0,16,"[2013, 2015]"
8,N_FacilitiesInApt,int64,0,0.0,9,"[9, 7]"
9,Size(sqf),int64,0,0.0,89,"[508, 1273]"


In general it can be seen that: 
* There are 4,123 data and 11 columns in the dataset.
* There are 3 categorical columns and 8 numerical columns.
* There is no empty data in all dataset columns.
* Several columns in the data have a value of 0 which can be interpreted as the absence of facilities (public offices, universities, basement parking lots, etc.) around the apartment.

### Standardization of Data Types 

In [18]:
# Change data type from float to integer

train_model['N_FacilitiesNearBy(ETC)'] = train_model['N_FacilitiesNearBy(ETC)'].astype('int64')
train_model['N_FacilitiesNearBy(PublicOffice)'] = train_model['N_FacilitiesNearBy(PublicOffice)'].astype('int64')
train_model['N_SchoolNearBy(University)'] = train_model['N_SchoolNearBy(University)'].astype('int64')
train_model['N_Parkinglot(Basement)'] = train_model['N_Parkinglot(Basement)'].astype('int64')

### Missing Values

In [19]:
train_model.isna().sum()

HallwayType                         0
TimeToSubway                        0
SubwayStation                       0
N_FacilitiesNearBy(ETC)             0
N_FacilitiesNearBy(PublicOffice)    0
N_SchoolNearBy(University)          0
N_Parkinglot(Basement)              0
YearBuilt                           0
N_FacilitiesInApt                   0
Size(sqf)                           0
SalePrice                           0
dtype: int64

no missing values

### Duplicates Data

In [20]:
train_model.duplicated().sum()

1422

In [21]:
# Percentage of duplicate data
print('Persentase data duplikat:',len(train_model[train_model.duplicated()])/len(train_model))

Persentase data duplikat: 0.3448944943002668


In [22]:
train_model.drop_duplicates(inplace=True)

In [23]:
train_model.shape

(2701, 11)

There are 1,422 duplicate records in the dataset, making up 34.49% of the data. Duplicate data can lead to model bias and overfitting, as the same data points are counted multiple times. Therefore, we will remove all duplicates. After removing the duplicates, 2,701 records remain from the original 4,123.


### Outlier

In [24]:
num_features = train_model.describe().columns
num_features

Index(['N_FacilitiesNearBy(ETC)', 'N_FacilitiesNearBy(PublicOffice)',
       'N_SchoolNearBy(University)', 'N_Parkinglot(Basement)', 'YearBuilt',
       'N_FacilitiesInApt', 'Size(sqf)', 'SalePrice'],
      dtype='object')

In [25]:
num_feature = train_model.describe().columns
plot = 1

plt.figure(figsize=(20,10))
for feature in num_feature:
    plt.subplot(4,2,plot)
    sns.boxplot(data=train_model, x=feature)
    plt.title(feature, size=15)
    plt.tight_layout()
    plot += 1

plt.suptitle('Check Outlier Data (Numerik)', size=20)
plt.tight_layout()
plt.show()

In [26]:
# Detect outlier
def detect_outliers(train_model):
    outliers = {}
    for col in train_model.columns:
        if train_model[col].dtype in ['int64', 'float64']:
            Q1 = train_model[col].quantile(0.25)
            Q3 = train_model[col].quantile(0.75)
            IQR = Q3-Q1
            lower_bound = Q1-1.5*IQR
            upper_bound = Q3+1.5*IQR
            outliers[col] = len(train_model[(train_model[col]<lower_bound) | (train_model[col]>upper_bound)])
    return outliers
outliers = detect_outliers(train_model)
for col, count in outliers.items():
    print(f'Column: {col}, Outliers total: {count}')

Column: N_FacilitiesNearBy(ETC), Outliers total: 0
Column: N_FacilitiesNearBy(PublicOffice), Outliers total: 0
Column: N_SchoolNearBy(University), Outliers total: 0
Column: N_Parkinglot(Basement), Outliers total: 0
Column: YearBuilt, Outliers total: 0
Column: N_FacilitiesInApt, Outliers total: 0
Column: Size(sqf), Outliers total: 84
Column: SalePrice, Outliers total: 17


In [27]:
# Function to check outlier
def outlier(train_model):
    Q1 = train_model.quantile(0.25)
    Q3 = train_model.quantile(0.75)
    IQR = Q3-Q1
    print(f'''
    IQR: {Q3-Q1}
    Lower bound: {Q1-(1.5*IQR)}
    Upper bound: {Q3+(1.5*IQR)}
    ''')

In [28]:
# Outlier in the apartment size feature
print('Apartment Size')
outlier(train_model['Size(sqf)'])

# Outlier in the apartment prices feature
print('Price Apartment')
outlier(train_model['SalePrice'])

Apartment Size

    IQR: 424.0
    Lower bound: 107.0
    Upper bound: 1803.0
    
Price Apartment

    IQR: 147345.0
    Lower bound: -67478.5
    Upper bound: 521901.5
    


In [29]:
# Descriptive statistical features of apartment size
print('Deskriptif Statistik Ukuran Apartment')
display(train_model['Size(sqf)'].describe())

# Descriptive statistical features of apartment price
print('Deskriptif Statistik Harga Apartment')
display(train_model['SalePrice'].describe())

Deskriptif Statistik Ukuran Apartment


count    2701.000000
mean      984.028878
std       391.982619
min       135.000000
25%       743.000000
50%       910.000000
75%      1167.000000
max      2337.000000
Name: Size(sqf), dtype: float64

Deskriptif Statistik Harga Apartment


count      2701.000000
mean     229511.365790
std      105079.891321
min       32743.000000
25%      153539.000000
50%      221238.000000
75%      300884.000000
max      585840.000000
Name: SalePrice, dtype: float64

Based on the boxplot, the features with outliers are apartment size (84 outliers) and apartment price (17 outliers).

We will examine the descriptive statistics and distribution of these features to decide how to handle the outliers. Removing outliers may lead to the loss of important information and affect the model's accuracy. However, if outliers are due to measurement errors, removal may be necessary.

After reviewing the data, we have decided not to remove the outliers. They are not caused by errors but represent normal variations, such as very large or expensive houses, based on domain knowledge.


# Chapter V : Modelling
---


## 1. Feature Engineering 

### Scaling

Scaling transforms numerical data to a consistent range, improving fairness between variables with different units or ranges. This is important for algorithms like regression that are sensitive to data scale. Scaling can enhance algorithm performance, reduce computing time, and improve model interpretability.


| **Feature**               | **Scaling Method**   | **Reason**                                                                                          |
|---------------------------|----------------------|-----------------------------------------------------------------------------------------------------|
| `N_Parkinglot(Basement)`   | Robust Scaler        | To handle outliers and skewed distribution, improving data consistency and reducing extreme value influence. |
| `Size(Sqf)`                | Robust Scaler        | To handle outliers and skewed distribution, improving data consistency and reducing extreme value influence. |

### Encoding

Encoding converts categorical data into numerical format for use in models.

| **Feature**       | **Encoding Method**    | **Reason**                                                                                           |
|-------------------|------------------------|------------------------------------------------------------------------------------------------------|
| `HallwayType`     | One-Hot Encoding       | A nominal variable with 3 categories; One-Hot Encoding is suitable due to the small number of categories. |
| `SubwayStation`   | Binary Encoding        | A nominal variable with 8 categories; Binary Encoding is used to reduce the number of dummy variables and minimize overfitting. |
| `TimeToSubway`    | Ordinal Encoding       | An ordinal variable with categories ordered based on the time to reach the nearest station.          |


In [30]:
train_model.columns

Index(['HallwayType', 'TimeToSubway', 'SubwayStation',
       'N_FacilitiesNearBy(ETC)', 'N_FacilitiesNearBy(PublicOffice)',
       'N_SchoolNearBy(University)', 'N_Parkinglot(Basement)', 'YearBuilt',
       'N_FacilitiesInApt', 'Size(sqf)', 'SalePrice'],
      dtype='object')

In [31]:
# Scaling and encoding

ordinal_mapping = [
    {'col':'TimeToSubway', 
    'mapping':{'no_bus_stop_nearby':0, '0min-5min':1, '5min-10min':2, '10min-15min':3, '15-20min':4}}
    ]

ordinal_encoder = ce.OrdinalEncoder(cols=['TimeToSubway'], mapping=ordinal_mapping)

transformer = ColumnTransformer([
            ('Robust',RobustScaler(),['N_Parkinglot(Basement)','Size(sqf)']),
            ('OneHotEncoding', OneHotEncoder(drop='first'), ['HallwayType']),
            ('BinaryEncoding', ce.BinaryEncoder(), ['SubwayStation']),
            ('OrdinalEncoding', ce.OrdinalEncoder(), ['TimeToSubway'])
            ], remainder='passthrough')

## 2. Train Test Splitting

In [32]:
feature =  train_model.drop(columns=['SalePrice'], axis=1)
target = train_model['SalePrice']

In [33]:
# SPlit into training and testing
X_train, X_test, y_train, y_test = train_test_split(feature, target, random_state=42, test_size=0.2)
print(X_train.shape)
print(y_train.shape)

(2160, 10)
(2160,)


### Variable Definition

| **Variable** | **Type**            | **Description**                                                   |
|--------------|---------------------|-------------------------------------------------------------------|
| **x**        | Independent Variable| The features used to predict the target variable (SalePrice).     |
| **y**        | Dependent Variable  | The target variable to be predicted (SalePrice).                  |

### Features (x)

| **Feature**                        | **Description**                           |
|------------------------------------|-------------------------------------------|
| HallwayType                        | Type of hallway (nominal variable)        |
| TimeToSubway                       | Time to the nearest subway (ordinal)     |
| SubwayStation                      | The subway station (nominal variable)    |
| N_FacilitiesNearBy(ETC)            | Number of nearby ETC facilities          |
| N_FacilitiesNearBy(PublicOffice)   | Number of nearby public office facilities|
| N_SchoolNearBy(University)         | Number of nearby universities            |
| N_Parkinglot(Basement)             | Number of basement parking lots          |
| YearBuilt                          | Year the property was built              |
| N_FacilitiesInApt                  | Number of facilities in the apartment    |
| Size(sqf)                          | Size of the apartment (in square feet)   |

### Target (y)

| **Target** | **Description**             |
|------------|-----------------------------|
| SalePrice  | The sale price of the apartment|

### Data Split

| **Data Type**   | **Percentage** | **Purpose**                                             |
|-----------------|----------------|---------------------------------------------------------|
| Training Data   | 80%            | Used to train machine learning models                   |
| Testing Data    | 20%            | Used to test the performance of the trained models      |


## 3. Benchmark Model

### Regression Models

Several regression models will be used to select the benchmark model:

| **Model**                        | **Description**                                                                                               |
|-----------------------------------|---------------------------------------------------------------------------------------------------------------|
| **Linear Regression**             | Models the linear relationship between one or more input variables and a target variable.                     |
| **Lasso Regression**              | A linear regression model that reduces overfitting by adding regularization (absolute number of coefficients) to the model, setting some coefficients to zero to focus on the most important features. |
| **Ridge Regression**              | A linear regression model that reduces overfitting by adding regularization (sum of squares of coefficients) to the model. |
| **KNN Regression**                | A regression model based on the K-Nearest Neighbors algorithm, predicting the target variable by finding the K nearest neighbors of the input data. |
| **Decision Tree Regression**      | A regression model structured as a decision tree, consisting of nodes and edges to predict the target variable. |
| **Random Forest Regression**      | A regression model that builds several decision trees using random subsets of training data and features, with bootstrapping and feature bagging techniques. |
| **XGBoost (Extreme Gradient Boosting) Regression** | A regression model using gradient boosting techniques with an ensemble learning approach to improve prediction accuracy. |
| **Support Vector Regression (SVR)** | A regression model that predicts target values using an approach similar to Support Vector Machines (SVM), aiming to find the best hyperplane that separates the data in feature space. |

### Model Evaluation: K-Fold Cross Validation

K-fold cross-validation is used to evaluate model performance. This method divides the dataset into 5 equal partitions, where the model is trained on 4 partitions and tested on the remaining partition. This process is repeated 5 times, each time with different partitions used for validation and training. The final model performance is calculated based on the average performance across all 5 iterations. 

This method helps avoid overfitting or underfitting and improves the model's generalization ability.


In [34]:
LinReg=LinearRegression()
Lasso=Lasso(random_state=42)
Ridge=Ridge(random_state=42)
KNN=KNeighborsRegressor()
Tree=DecisionTreeRegressor(random_state=42)
Forest=RandomForestRegressor(random_state=42)
XGBoost=xgb.XGBRegressor(random_state=42)
SVR=SVR()

models=[LinReg,Lasso,Ridge,KNN,Tree,Forest,XGBoost,SVR]

score_R2=[]
mean_R2=[]
std_R2=[]

score_RMSE=[]
mean_RMSE=[]
std_RMSE=[]

score_MAE=[]
mean_MAE=[]
std_MAE=[]

score_MAPE=[]
mean_MAPE=[]
std_MAPE=[]

crossval=KFold(n_splits=5)

for i in models:
    model_pipeline=Pipeline([
    ('preprocess',transformer),
    ('model',i)
    ])

    # R-Squared
    model_cv_R2=cross_val_score(model_pipeline,X_train,y_train,cv=crossval,scoring='r2')
    score_R2.append(model_cv_R2)
    mean_R2.append(model_cv_R2.mean())
    std_R2.append(model_cv_R2.std())

    # RMSE
    model_cv_RMSE=cross_val_score(model_pipeline,X_train,y_train,cv=crossval,scoring='neg_root_mean_squared_error')
    score_RMSE.append(model_cv_RMSE)
    mean_RMSE.append(model_cv_RMSE.mean())
    std_RMSE.append(model_cv_RMSE.std())

    # MAE
    model_cv_MAE=cross_val_score(model_pipeline,X_train,y_train,cv=crossval,scoring='neg_mean_absolute_error')
    score_MAE.append(model_cv_MAE)
    mean_MAE.append(model_cv_MAE.mean())
    std_MAE.append(model_cv_MAE.std())

    # MAPE
    model_cv_MAPE=cross_val_score(model_pipeline,X_train,y_train,cv=crossval,scoring='neg_mean_absolute_percentage_error')
    score_MAPE.append(model_cv_MAPE)
    mean_MAPE.append(model_cv_MAPE.mean())
    std_MAPE.append(model_cv_MAPE.std())

In [35]:
# Result of evaluation

kfold=pd.DataFrame({
    'Model': ['Linear Regression','Lasso','Ridge','KNN','Decision Tree','Random Forest','XGBoost','SVR'],
    'Mean R2': mean_R2,
    'Standar Deviasi R2': std_R2,
    'Mean RMSE': mean_RMSE,
    'Standard Deviasi RMSE': std_RMSE,
    'Mean MAE': mean_MAE,
    'Standard Deviasi MAE': std_MAE,
    'Mean MAPE': mean_MAPE,
    'Standard Deviasi MAPE': std_MAPE
}).set_index('Model').sort_values(by='Mean MAPE',ascending=False)

kfold

Unnamed: 0_level_0,Mean R2,Standar Deviasi R2,Mean RMSE,Standard Deviasi RMSE,Mean MAE,Standard Deviasi MAE,Mean MAPE,Standard Deviasi MAPE
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Decision Tree,0.807112,0.013265,-46270.785489,884.24817,-37259.323556,955.537365,-0.190018,0.001301
Random Forest,0.807865,0.01337,-46178.994331,885.653165,-37214.822617,965.821287,-0.190385,0.001295
XGBoost,0.806841,0.013167,-46303.876563,865.293541,-37297.524219,922.828733,-0.190406,0.001068
KNN,0.782673,0.017688,-49095.720081,1221.113625,-39484.083333,1473.590874,-0.202705,0.004851
Linear Regression,0.755169,0.010342,-52161.731578,561.053598,-42090.448273,748.236544,-0.216895,0.005181
Lasso,0.755168,0.010343,-52161.801519,561.102159,-42090.970775,748.139394,-0.216901,0.005173
Ridge,0.755153,0.010346,-52163.415479,562.70117,-42099.964056,747.786206,-0.217047,0.005091
SVR,-0.007083,0.005022,-105855.515131,2178.696754,-85494.536366,3745.076433,-0.559577,0.049023


## 4. Voting & Stacking
We will also combine multiple regression models to improve prediction performance. In this technique, several models are trained on the same training data, and their predictions are combined to make a final prediction.

The SVR model will not be used, as previous evaluation results showed that it has much higher mean RMSE, MAE, and MAPE values compared to other models, indicating poor prediction quality.

### Voting Regressor

In [36]:
# Voting Regressor

vc=VotingRegressor([
  ('model1',LinReg),
  ('model2',Lasso),
  ('model3',Ridge),
  ('model4',KNN),
  ('model5',Tree),
  ('model6',Forest),
  ('model7',XGBoost)
])

score_R2_voting=[]
mean_R2_voting=[]
std_R2_voting=[]

score_RMSE_voting=[]
mean_RMSE_voting=[]
std_RMSE_voting=[]

score_MAE_voting=[]
mean_MAE_voting=[]
std_MAE_voting=[]

score_MAPE_voting=[]
mean_MAPE_voting=[]
std_MAPE_voting=[]

crossval=KFold(n_splits=5)

model_pipeline3=Pipeline([
('preprocess',transformer),
('model',vc)
])

# R-Squared
model_cv_R2_voting=cross_val_score(model_pipeline3,X_train,y_train,cv=crossval,scoring='r2')
score_R2_voting.append(model_cv_R2_voting)
mean_R2_voting.append(model_cv_R2_voting.mean())
std_R2_voting.append(model_cv_R2_voting.std())

# RMSE
model_cv_RMSE_voting=cross_val_score(model_pipeline3,X_train,y_train,cv=crossval,scoring='neg_root_mean_squared_error')
score_RMSE_voting.append(model_cv_RMSE_voting)
mean_RMSE_voting.append(model_cv_RMSE_voting.mean())
std_RMSE_voting.append(model_cv_RMSE_voting.std())

# MAE
model_cv_MAE_voting=cross_val_score(model_pipeline3,X_train,y_train,cv=crossval,scoring='neg_mean_absolute_error')
score_MAE_voting.append(model_cv_MAE_voting)
mean_MAE_voting.append(model_cv_MAE_voting.mean())
std_MAE_voting.append(model_cv_MAE_voting.std())

# MAPE
model_cv_MAPE_voting=cross_val_score(model_pipeline3,X_train,y_train,cv=crossval,scoring='neg_mean_absolute_percentage_error')
score_MAPE_voting.append(model_cv_MAPE_voting)
mean_MAPE_voting.append(model_cv_MAPE_voting.mean())
std_MAPE_voting.append(model_cv_MAPE_voting.std())
    

In [37]:
# Result of Voting Evaluation

kfold_voting=pd.DataFrame({
    'Model': ['Voting Regressor'],
    'Mean R2': mean_R2_voting,
    'Standar Deviasi R2': std_R2_voting,
    'Mean RMSE': mean_RMSE_voting,
    'Standard Deviasi RMSE': std_RMSE_voting,
    'Mean MAE': mean_MAE_voting,
    'Standard Deviasi MAE': std_MAE_voting,
    'Mean MAPE': mean_MAPE_voting,
    'Standard Deviasi MAPE': std_MAPE_voting
}).set_index('Model')

kfold_voting

Unnamed: 0_level_0,Mean R2,Standar Deviasi R2,Mean RMSE,Standard Deviasi RMSE,Mean MAE,Standard Deviasi MAE,Mean MAPE,Standard Deviasi MAPE
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Voting Regressor,0.801248,0.010692,-46984.60594,550.603985,-38382.128182,710.306153,-0.197593,0.003136


### Stacking Regressor

In [38]:
# Stacking Regressor

sc=StackingRegressor([
  ('model1',LinReg),
  ('model2',Lasso),
  ('model3',Ridge),
  ('model4',KNN),
  ('model5',Tree),
  ('model6',Forest),
  ('model7',XGBoost)
],final_estimator=XGBoost)

score_R2_stacking=[]
mean_R2_stacking=[]
std_R2_stacking=[]

score_RMSE_stacking=[]
mean_RMSE_stacking=[]
std_RMSE_stacking=[]

score_MAE_stacking=[]
mean_MAE_stacking=[]
std_MAE_stacking=[]

score_MAPE_stacking=[]
mean_MAPE_stacking=[]
std_MAPE_stacking=[]

crossval=KFold(n_splits=5)

model_pipeline4=Pipeline([
('preprocess',transformer),
('model',sc)
])

# R-Squared
model_cv_R2_stacking=cross_val_score(model_pipeline4,X_train,y_train,cv=crossval,scoring='r2')
score_R2_stacking.append(model_cv_R2_stacking)
mean_R2_stacking.append(model_cv_R2_stacking.mean())
std_R2_stacking.append(model_cv_R2_stacking.std())

# RMSE
model_cv_RMSE_stacking=cross_val_score(model_pipeline4,X_train,y_train,cv=crossval,scoring='neg_root_mean_squared_error')
score_RMSE_stacking.append(model_cv_RMSE_stacking)
mean_RMSE_stacking.append(model_cv_RMSE_stacking.mean())
std_RMSE_stacking.append(model_cv_RMSE_stacking.std())

# MAE
model_cv_MAE_stacking=cross_val_score(model_pipeline4,X_train,y_train,cv=crossval,scoring='neg_mean_absolute_error')
score_MAE_stacking.append(model_cv_MAE_stacking)
mean_MAE_stacking.append(model_cv_MAE_stacking.mean())
std_MAE_stacking.append(model_cv_MAE_stacking.std())

# MAPE
model_cv_MAPE_stacking=cross_val_score(model_pipeline4,X_train,y_train,cv=crossval,scoring='neg_mean_absolute_percentage_error')
score_MAPE_stacking.append(model_cv_MAPE_stacking)
mean_MAPE_stacking.append(model_cv_MAPE_stacking.mean())
std_MAPE_stacking.append(model_cv_MAPE_stacking.std())

In [39]:
# Result of Stacking Evaluation

kfold_stacking=pd.DataFrame({
    'Model': ['Stacking Regressor'],
    'Mean R2': mean_R2_stacking,
    'Standar Deviasi R2': std_R2_stacking,
    'Mean RMSE': mean_RMSE_stacking,
    'Standard Deviasi RMSE': std_RMSE_stacking,
    'Mean MAE': mean_MAE_stacking,
    'Standard Deviasi MAE': std_MAE_stacking,
    'Mean MAPE': mean_MAPE_stacking,
    'Standard Deviasi MAPE': std_MAPE_stacking
}).set_index('Model')

kfold_stacking

Unnamed: 0_level_0,Mean R2,Standar Deviasi R2,Mean RMSE,Standard Deviasi RMSE,Mean MAE,Standard Deviasi MAE,Mean MAPE,Standard Deviasi MAPE
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Stacking Regressor,0.781767,0.02449,-49143.6,1739.926253,-39401.1875,1778.2718,-0.203574,0.006333


## 5. Summary

In [40]:
# Summary from result of all model evaluation

summary=pd.DataFrame({
        'Model':['Linear Regression','Lasso Regression','Ridge Regression','KNN Regressor','Decision Tree Regressor','Random Forest Regressor','XGBoost Regressor','SVR','Voting Regressor','Stacking Regressor'],
        'Mean R2': mean_R2+mean_R2_voting+mean_R2_stacking,
        'Standar Deviasi R2': std_R2+std_R2_voting+std_R2_stacking,
        'Mean RMSE': mean_RMSE+mean_RMSE_voting+mean_RMSE_stacking,
        'Standard Deviasi RMSE':  std_RMSE+std_RMSE_voting+std_RMSE_stacking,
        'Mean MAE': mean_MAE+mean_MAE_voting+mean_MAE_stacking,
        'Standard Deviasi MAE':  std_MAE+std_MAE_voting+std_MAE_stacking,
        'Mean MAPE': mean_MAPE+mean_MAPE_voting+mean_MAPE_stacking,
        'Standard Deviasi MAPE':  std_MAPE+std_MAPE_voting+std_MAPE_stacking}).set_index('Model').sort_values(by='Mean MAPE',ascending=False)
summary

Unnamed: 0_level_0,Mean R2,Standar Deviasi R2,Mean RMSE,Standard Deviasi RMSE,Mean MAE,Standard Deviasi MAE,Mean MAPE,Standard Deviasi MAPE
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Decision Tree Regressor,0.807112,0.013265,-46270.785489,884.24817,-37259.323556,955.537365,-0.190018,0.001301
Random Forest Regressor,0.807865,0.01337,-46178.994331,885.653165,-37214.822617,965.821287,-0.190385,0.001295
XGBoost Regressor,0.806841,0.013167,-46303.876563,865.293541,-37297.524219,922.828733,-0.190406,0.001068
Voting Regressor,0.801248,0.010692,-46984.60594,550.603985,-38382.128182,710.306153,-0.197593,0.003136
KNN Regressor,0.782673,0.017688,-49095.720081,1221.113625,-39484.083333,1473.590874,-0.202705,0.004851
Stacking Regressor,0.781767,0.02449,-49143.6,1739.926253,-39401.1875,1778.2718,-0.203574,0.006333
Linear Regression,0.755169,0.010342,-52161.731578,561.053598,-42090.448273,748.236544,-0.216895,0.005181
Lasso Regression,0.755168,0.010343,-52161.801519,561.102159,-42090.970775,748.139394,-0.216901,0.005173
Ridge Regression,0.755153,0.010346,-52163.415479,562.70117,-42099.964056,747.786206,-0.217047,0.005091
SVR,-0.007083,0.005022,-105855.515131,2178.696754,-85494.536366,3745.076433,-0.559577,0.049023


### Model Evaluation Results

Based on the evaluation results, the **Decision Tree Regressor**, **Random Forest Regressor**, and **XGBoost Regressor** models have the smallest mean RMSE, mean MAE, and mean MAPE values compared to other models. Among them, the **Decision Tree Regressor** is the best model. Smaller RMSE, MAE, and MAPE values indicate better model prediction quality. Additionally, the standard deviation values for RMSE, MAE, and MAPE are relatively small, indicating consistent results.

### Benchmark Models
- **Decision Tree Regressor**
- **Random Forest Regressor**
- **XGBoost Regressor**

These models will be used as benchmarks, and predictions will be made on the testing data.


## 6. Prediction From Testing Data With Benchmark 3 Best Models

In [41]:
models={
    'Random Forest':RandomForestRegressor(random_state=42),
    'Decision Tree':DecisionTreeRegressor(random_state=42),
    'XGBoost':xgb.XGBRegressor(random_state=42),
    # 'Voting Regression':VotingRegressor([('model1',LinReg),('model2',Lasso),('model3',Ridge),('model4',KNN),('model5',Tree),('model6',Forest),('model7',XGBoost)])
}

score_R2=[]
score_RMSE=[]
score_MAE=[]
score_MAPE=[]

for i in models:
    model=Pipeline([
        ('preprocessing', transformer),
        ('model', models[i])
        ])
    model.fit(X_train, y_train)
    y_pred=model.predict(X_test)

    score_R2.append(r2_score(y_test, y_pred))
    score_RMSE.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    score_MAE.append(mean_absolute_error(y_test, y_pred))
    score_MAPE.append(mean_absolute_percentage_error(y_test, y_pred))

score_before_tuning=pd.DataFrame({'R2':score_R2,'RMSE': score_RMSE, 'MAE': score_MAE, 'MAPE': score_MAPE}, index=models.keys()).sort_values(by='MAPE')

score_before_tuning

Unnamed: 0,R2,RMSE,MAE,MAPE
XGBoost,0.783239,47985.30175,38951.3125,0.198598
Random Forest,0.781391,48189.460034,39010.786937,0.200358
Decision Tree,0.777481,48618.452969,39169.162646,0.202106


After making predictions on testing data, the XGBoost Regressor model is the best model because it has the smallest mean RMSE, mean MAE and mean MAPE values compared to other models.

## 7. Hyperparameter Tuning 

We will perform hyperparameter tuning to find the best parameters for our models to improve their performance. Based on the evaluation, the Decision Tree Regressor performed best during cross-validation, while the XGBoost Regressor performed best on the testing data.

Therefore, we will tune the hyperparameters of both the Decision Tree Regressor and XGBoost Regressor to improve their performance and select the best model.

### **7.a. Decision Tree Regressor Model**

A Decision Tree Regressor uses a tree structure to divide data into smaller and more homogeneous groups based on feature values. Each node in the tree predicts the target average value for the group of data entering that node.

#### **Parameters for Hyperparameter Tuning**

| **Parameter**        | **Description**                                                                                               | **Range for Tuning**        |
|----------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------|
| **splitter**         | The strategy used to split a node. Options: `best` (best split) or `random` (random split).                    | `best`, `random`            |
| **criterion**        | Function to measure the quality of a split. For regression, options: `MSE`, `MAE`, `Friedman MSE`, `Poisson`.   | `MSE`, `MAE`, `Friedman MSE`, `Poisson` |
| **max_depth**        | Maximum depth of the decision tree. Deeper trees may overfit the data.                                        | 1 to 50                     |
| **max_features**     | The maximum number of features used to split a node. Fewer features reduce overfitting.                        | 1 to 50                     |
| **min_samples_split**| Minimum number of data points required to split a node. Larger values reduce overfitting but may cause underfitting. | 1 to 20                     |
| **min_samples_leaf** | Minimum number of samples required at each leaf. Larger values reduce overfitting but may cause underfitting. | 1 to 20                     |


In [42]:
# Hyperparameter Decision Tree

hyperparam_tree={
        'modeling__splitter': ['best','random'],
        'modeling__criterion':['absolute_error','squared_error','friedman_mse','poisson'],
        'modeling__max_depth':[np.arange(1,51),None],
        'modeling__max_features':['auto','sqrt','log2',None],
        'modeling__min_samples_split':list(np.arange(2,21)),
        'modeling__min_samples_leaf':list(np.arange(1,21))
}

In [43]:
# Algoritma (benchmark model)

Tree=DecisionTreeRegressor(random_state=42)

pipe_model_tree = Pipeline([
        ('preprocessing', transformer),
        ('modeling', Tree)
 ])

crossval=KFold(n_splits=5)

# RandomizedSearch

randomsearch_tree=RandomizedSearchCV(
        estimator=pipe_model_tree,                # Model yang hendak di tuning
        param_distributions=hyperparam_tree,      # Hyperparameter
        cv=crossval,                                 # 5 fold cross validation
        scoring=['r2', 'neg_root_mean_squared_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error'], 
        n_jobs=-1,                                # Memaksimalkan processor
        refit='neg_mean_absolute_percentage_error',
        random_state=42
)

In [44]:
# Fit model dengan hyperparameter tuning pada data training

randomsearch_tree.fit(X_train,y_train)

In [45]:
# Hasil hyperparamer tuning dalam bentuk dataframe yang diurutkan berdasarkan MAPE

pd.DataFrame(randomsearch_tree.cv_results_).sort_values(by=['rank_test_neg_root_mean_squared_error', 'rank_test_neg_mean_absolute_error', 'rank_test_neg_mean_absolute_percentage_error'])

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_modeling__splitter,param_modeling__min_samples_split,param_modeling__min_samples_leaf,param_modeling__max_features,param_modeling__max_depth,param_modeling__criterion,...,std_test_neg_mean_absolute_error,rank_test_neg_mean_absolute_error,split0_test_neg_mean_absolute_percentage_error,split1_test_neg_mean_absolute_percentage_error,split2_test_neg_mean_absolute_percentage_error,split3_test_neg_mean_absolute_percentage_error,split4_test_neg_mean_absolute_percentage_error,mean_test_neg_mean_absolute_percentage_error,std_test_neg_mean_absolute_percentage_error,rank_test_neg_mean_absolute_percentage_error
0,0.04296,0.007641,0.016776,0.002124,best,11,3,,,poisson,...,783.168722,2,-0.190759,-0.196198,-0.192238,-0.198822,-0.197101,-0.195024,0.003036,2
5,0.036444,0.010218,0.013054,0.003584,best,18,15,,,squared_error,...,876.674908,3,-0.185296,-0.200411,-0.196622,-0.200156,-0.203301,-0.197157,0.006298,3
3,0.132511,0.015385,0.011768,0.004181,best,18,2,,,absolute_error,...,1365.511254,1,-0.180799,-0.185806,-0.184404,-0.182708,-0.180838,-0.182911,0.00197,1
9,0.02769,0.003147,0.01133,0.002071,best,10,4,log2,,friedman_mse,...,815.983676,4,-0.193109,-0.199732,-0.197655,-0.200178,-0.205436,-0.199222,0.00399,4
7,0.034109,0.004567,0.014445,0.00125,best,3,3,sqrt,,poisson,...,891.833713,5,-0.190828,-0.196512,-0.197516,-0.217651,-0.204412,-0.201384,0.009208,5
6,0.026046,0.00458,0.01167,0.002389,best,20,17,log2,,squared_error,...,2336.065518,6,-0.190067,-0.222414,-0.214595,-0.219354,-0.232179,-0.215722,0.014059,6
1,0.037646,0.002649,0.0,0.0,random,14,16,auto,,friedman_mse,...,,7,,,,,,,,7
2,0.030656,0.006656,0.0,0.0,best,14,3,sqrt,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...",absolute_error,...,,7,,,,,,,,7
4,0.030533,0.008589,0.0,0.0,random,16,8,auto,,poisson,...,,7,,,,,,,,7
8,0.028359,0.002767,0.0,0.0,random,18,5,auto,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...",squared_error,...,,7,,,,,,,,7


In [46]:
# Best_parameter

print('Parameter terbaik:')
randomsearch_tree.best_params_

Parameter terbaik:


{'modeling__splitter': 'best',
 'modeling__min_samples_split': 18,
 'modeling__min_samples_leaf': 2,
 'modeling__max_features': None,
 'modeling__max_depth': None,
 'modeling__criterion': 'absolute_error'}

In [47]:
# Best score 

print(f'Nilai MAPE setelah hyperparameter tuning: {randomsearch_tree.best_score_}')

Nilai MAPE setelah hyperparameter tuning: -0.18291110559653512


In [48]:
model = {'Decision Tree': DecisionTreeRegressor(random_state=42)}

# Define model terhadap estimator terbaik
tree_tuning = randomsearch_tree.best_estimator_

# Fitting model
tree_tuning.fit(X_train, y_train)

# Prediksi to data testing
y_pred_tree_tuning = tree_tuning.predict(X_test)

nilai_R2_tree_tuning=r2_score(y_test, y_pred_tree_tuning)
nilai_RMSE_tree_tuning=np.sqrt(mean_squared_error(y_test, y_pred_tree_tuning))
nilai_MAE_tree_tuning=mean_absolute_error(y_test, y_pred_tree_tuning)
nilai_MAPE_tree_tuning=mean_absolute_percentage_error(y_test, y_pred_tree_tuning)

score_after_tuning_tree = pd.DataFrame({'R2': nilai_R2_tree_tuning, 'RMSE': nilai_RMSE_tree_tuning, 'MAE': nilai_MAE_tree_tuning, 'MAPE': nilai_MAPE_tree_tuning}, index=model.keys())
score_after_tuning_tree

Unnamed: 0,R2,RMSE,MAE,MAPE
Decision Tree,0.768568,49582.590817,38131.631238,0.191178


In [49]:
# Before hyperparameter tuning

pd.DataFrame(score_before_tuning.loc['Decision Tree']).T

Unnamed: 0,R2,RMSE,MAE,MAPE
Decision Tree,0.777481,48618.452969,39169.162646,0.202106


In [50]:
# After hyperparameter tuning

score_after_tuning_tree

Unnamed: 0,R2,RMSE,MAE,MAPE
Decision Tree,0.768568,49582.590817,38131.631238,0.191178


After hyperparameter tuning, the Decision Tree Regressor model showed improved performance on the testing data. The RMSE, MAE, and MAPE values decreased, indicating smaller prediction errors and more accurate predictions, though the improvement was not very large.

The Decision Tree Regressor is suitable for predicting continuous values, like apartment prices in Daegu, South Korea. Therefore, we will select the tuned Decision Tree Regressor model as the best model for predicting apartment prices in Daegu.

### **7.b. XGBoost Regressor Model**

### XGBoost Hyperparameter Search Space

| **Hyperparameter**        | **Description**                                                                                   | **Range**             |
|---------------------------|---------------------------------------------------------------------------------------------------|-----------------------|
| **max_depth**              | The maximum depth of each tree in the XGBoost model.                                                | 1 to 10               |
| **learning_rate**          | The learning rate used by the XGBoost model.                                                      | 0.01 to 1             |
| **n_estimators**           | The number of trees in the XGBoost model.                                                          | 100 to 200            |
| **subsample**              | The percentage of the training data used for each tree in the XGBoost model.                      | 20% to 90%            |
| **gamma**                  | The minimum reduction in impurity required to split a leaf node in the XGBoost model.             | 1 to 10               |
| **colsample_bytree**       | The percentage of features used for each tree in the XGBoost model.                               | 10% to 90%            |
| **reg_alpha**              | The regularization alpha used in the XGBoost model.                                                | 0.001 to 10           |


In [51]:
# Tree depth
max_depth = list(np.arange(1, 11))

# Learning rate
learning_rate = list(np.arange(1, 100)/100)

# The number or tree
n_estimators = list(np.arange(100, 201))

# Percentage of rows per tree (of the total number of rows in the training set)
subsample = list(np.arange(2, 10)/10)

# Gamma (min_impurity_decrease)
gamma = list(np.arange(1, 11))

# Number of features used for each tree (as a percentage of the total number of columns in the training set)
colsample_bytree = list(np.arange(1, 10)/10)

# Alpha (regularization)
reg_alpha = list(np.logspace(-3, 1, 10))


# Hyperparam space XGboost
hyperparam_xgb = {
    'model__max_depth': max_depth, 
    'model__learning_rate': learning_rate,
    'model__n_estimators': n_estimators,
    'model__subsample': subsample,
    'model__gamma': gamma,
    'model__colsample_bytree': colsample_bytree,
    'model__reg_alpha': reg_alpha
}

In [52]:
# Benchmark the model with hyperparameter tuning
xgb_ = xgb.XGBRegressor(random_state=42, verbosity=0)

# Define algorithm chains
estimator_xgb = Pipeline([
        ('preprocessing', transformer),
        # ('scaler', scaler),
        ('model', xgb_)
        ])

crossval = KFold(n_splits=5)

# Hyperparameter tuning
randomsearch_xgb = RandomizedSearchCV(
    estimator= estimator_xgb, 
    param_distributions = hyperparam_xgb,
    # n_iter = 50,
    cv = crossval, 
    scoring = ['r2', 'neg_root_mean_squared_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error'], 
    n_jobs = -1,
    refit = 'neg_mean_absolute_percentage_error', # Only able to choose one metric for optimization
    random_state = 42 
)

In [53]:
# Fitting the training data to find the best parameters
randomsearch_xgb.fit(X_train, y_train)

In [54]:
# Melihat hasil tuning dalam bentuk dataframe. Diurutkan berdasarkan R-Squared
pd.DataFrame(randomsearch_xgb.cv_results_).sort_values(by=['rank_test_r2', 'rank_test_neg_root_mean_squared_error', 'rank_test_neg_mean_absolute_error', 'rank_test_neg_mean_absolute_percentage_error']).head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__subsample,param_model__reg_alpha,param_model__n_estimators,param_model__max_depth,param_model__learning_rate,param_model__gamma,...,std_test_neg_mean_absolute_error,rank_test_neg_mean_absolute_error,split0_test_neg_mean_absolute_percentage_error,split1_test_neg_mean_absolute_percentage_error,split2_test_neg_mean_absolute_percentage_error,split3_test_neg_mean_absolute_percentage_error,split4_test_neg_mean_absolute_percentage_error,mean_test_neg_mean_absolute_percentage_error,std_test_neg_mean_absolute_percentage_error,rank_test_neg_mean_absolute_percentage_error
2,1.162036,0.221888,0.007939,0.00073,0.9,0.007743,141,3,0.94,6,...,870.813908,6,-0.187777,-0.190409,-0.19024,-0.190598,-0.194548,-0.190715,0.002175,6
8,0.05186,0.005671,0.009773,0.002472,0.4,0.002783,166,5,0.19,2,...,1054.950172,1,-0.189175,-0.191337,-0.189855,-0.188766,-0.189665,-0.18976,0.000876,3
7,0.035813,0.003336,0.007314,0.000681,0.8,0.464159,120,3,0.68,4,...,1015.438695,3,-0.188368,-0.190879,-0.190544,-0.186851,-0.19182,-0.189692,0.001816,2
9,0.040162,0.007938,0.00882,0.002398,0.9,0.059948,105,5,0.38,3,...,990.817238,5,-0.189306,-0.19041,-0.191494,-0.190369,-0.191988,-0.190714,0.000941,5
5,1.150457,0.156532,0.008214,0.000947,0.8,3.593814,154,3,0.77,4,...,980.548954,2,-0.188502,-0.190345,-0.188716,-0.187473,-0.191926,-0.189392,0.001566,1


In [55]:
# Parameter terbaik 

print('Parameter terbaik:')
randomsearch_xgb.best_params_

Parameter terbaik:


{'model__subsample': 0.8,
 'model__reg_alpha': 3.593813663804626,
 'model__n_estimators': 154,
 'model__max_depth': 3,
 'model__learning_rate': 0.77,
 'model__gamma': 4,
 'model__colsample_bytree': 0.9}

In [56]:
# Cek skor dan params terbaik
print('XGBoost')
print('Best_score:', randomsearch_xgb.best_score_)
print('Best_params:', randomsearch_xgb.best_params_)

XGBoost
Best_score: -0.18939231634140014
Best_params: {'model__subsample': 0.8, 'model__reg_alpha': 3.593813663804626, 'model__n_estimators': 154, 'model__max_depth': 3, 'model__learning_rate': 0.77, 'model__gamma': 4, 'model__colsample_bytree': 0.9}


In [57]:
# Model XGBoost
model = {'XGBoost': xgb.XGBRegressor(random_state=42)}

# Define model terhadap estimator terbaik
xgb_tuning = randomsearch_xgb.best_estimator_

# Fitting model
xgb_tuning.fit(X_train, y_train)

# Predict test set
y_pred_xgb_tuning = xgb_tuning.predict(X_test)

# Simpan nilai metrics RMSE, MAE & MAPE setelah tuning
r2_xgb_tuning = r2_score(y_test, y_pred_xgb_tuning)
rmse_xgb_tuning = np.sqrt(mean_squared_error(y_test, y_pred_xgb_tuning))
mae_xgb_tuning = mean_absolute_error(y_test, y_pred_xgb_tuning)
mape_xgb_tuning = mean_absolute_percentage_error(y_test, y_pred_xgb_tuning)

score_after_tuning_xgb = pd.DataFrame({'R2': r2_xgb_tuning, 'RMSE': rmse_xgb_tuning, 'MAE': mae_xgb_tuning, 'MAPE': mape_xgb_tuning}, index=model.keys())
print('Testing Result After Tuning')
score_after_tuning_xgb

Testing Result After Tuning


Unnamed: 0,R2,RMSE,MAE,MAPE
XGBoost,0.78241,48076.943663,38658.09375,0.194595


In [58]:
# Before hyperparameter tuning

pd.DataFrame(score_before_tuning.loc['XGBoost']).T

Unnamed: 0,R2,RMSE,MAE,MAPE
XGBoost,0.783239,47985.30175,38951.3125,0.198598


In [59]:
# After hyperparameter tuning
score_after_tuning_xgb

Unnamed: 0,R2,RMSE,MAE,MAPE
XGBoost,0.78241,48076.943663,38658.09375,0.194595


After hyperparameter tuning, the XGBoost Regressor model showed improved performance on the testing data. The RMSE, MAE, and MAPE values decreased, indicating smaller prediction errors and more accurate predictions, though the improvement was not very large.

The XGBoost Regressor is a good choice for predicting continuous values, such as apartment prices in Daegu, South Korea. Therefore, we will select the tuned XGBoost Regressor model as the best model for predicting apartment prices in Daegu.


## 8. Residual Plot
The XGBoost model will be evaluated exploratively using residual plots to see whether the XGBoost model can predict values accurately or not. The residual plot is a graph that shows the difference between the actual value and the value predicted by the model.

In [60]:
plt.figure(figsize=(14, 8))
sns.scatterplot(x=y_test, y=y_pred_xgb_tuning)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)  # Diagonal line
plt.title('Actual vs. Predicted Price')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.grid(True)
plt.show()


In [61]:
# Residual = y_actual - y_prediksi
residual = y_test-y_pred_xgb_tuning

df_residual = pd.DataFrame({
    'y_pred': y_pred_xgb_tuning,
    'residual': residual 
})

df_residual.head()

Unnamed: 0,y_pred,residual
2927,410250.84375,23527.15625
1427,289316.28125,67320.71875
2081,211153.546875,-9384.546875
352,287319.78125,84361.21875
1861,138235.546875,-18766.546875


In [62]:
# Check whether the residuals have outlier

sns.boxplot(data=df_residual,x=residual) 
plt.show()

Based on the boxplot above, the residuals have no outliers. This also means that RMSE, MAE, and MAPE are suitable for use as evaluation metrics in selecting the best model because outliers do not affect these three evaluation metrics.

In [63]:
# Residual plot

plt.figure(figsize=(14, 8))
sns.scatterplot(x=y_pred_xgb_tuning, y=residual)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Predicted Price')
plt.ylabel('Residuals (Actual - Predicted)')
plt.grid(True)
plt.show()

Based on the residual plot above, the residuals appear to be randomly distributed along the horizontal axis and do not have a particular pattern, indicating that the regression model is generally suitable for apartment price data in Daegu, South Korea.

## 9. Feature Importance
We will evaluate the features that influence apartment prices in Daegu, South Korea in the Decision XGBoost model through the feature importance function.

In [64]:
# Feature importances
feature_imp = pd.Series(xgb_tuning['model'].feature_importances_, transformer.get_feature_names_out()).sort_values(ascending = False)
feature_imp.to_frame(name='Feature Importances')

Unnamed: 0,Feature Importances
OneHotEncoding__HallwayType_terraced,0.85001
remainder__N_FacilitiesNearBy(ETC),0.068863
Robust__Size(sqf),0.018101
remainder__YearBuilt,0.018093
Robust__N_Parkinglot(Basement),0.010572
remainder__N_FacilitiesInApt,0.006543
OrdinalEncoding__TimeToSubway,0.006004
remainder__N_FacilitiesNearBy(PublicOffice),0.004933
BinaryEncoding__SubwayStation_3,0.003909
BinaryEncoding__SubwayStation_1,0.002972


In [65]:
# Plot feature importances
feature_imp = pd.Series(
    xgb_tuning['model'].feature_importances_, 
    index=transformer.get_feature_names_out()
).sort_values(ascending=False).head()

plt.figure(figsize=(10, 6))
feature_imp.plot(kind='barh')
plt.title('Top 5 Feature Importances')
plt.xlabel('Importance Score')
plt.gca().invert_yaxis()  # agar fitur paling penting di atas
plt.grid(True)
plt.tight_layout()
plt.show()

### Key Features Influencing Apartment Prices in Daegu (XGBoost Regressor - Tuned)

- **Type of Terraced Apartment**  
  The type of apartment (e.g., terraced) significantly affects the price, as it often reflects building quality and living standards.

- **Number of Nearby Facilities**  
  Apartments located near more facilities (like schools, stations, hospitals) tend to have higher prices due to better accessibility and convenience.

- **Size of the Apartment**  
  Larger apartments generally have higher prices. Size is one of the most direct indicators of value.


# Chapter VI : Conclusion
---

To predict apartment prices in Daegu, South Korea, three common regression evaluation metrics were used: **RMSE**, **MAE**, and **MAPE**. Among them, **MAPE** was selected as the main metric because it's easy to interpret—especially for stakeholders like real estate agents.

The final model chosen was a **tuned XGBoost Regressor**, which achieved a **MAPE of 19.4%**. This means the model's predictions are, on average, **19.4% off from actual prices**, which according to Lewis (1982), indicates **good forecasting accuracy**.

Even though the model performs well, it may still have **bias** due to **limited features** that don't fully represent all price-influencing factors in Daegu.

The **most important features** affecting price were:
- Apartment type (especially terraced apartments)
- Number of nearby facilities
- Apartment size

This model is useful for real estate agents to:
- Set more accurate selling prices
- Understand how characteristics affect price
- Predict price changes based on apartment features

Before using a regression model, the raw data was hard to interpret. Now, the model provides valuable **insights and predictions** that support better decision-making.


### Summary Table

| **Aspect**                          | **Details**                                                                 |
|------------------------------------|------------------------------------------------------------------------------|
| **Evaluation Metrics Used**        | RMSE, MAE, MAPE                                                              |
| **Main Metric Chosen**             | MAPE (Mean Absolute Percentage Error)                                       |
| **Final Model**                    | Tuned XGBoost Regressor                                                     |
| **MAPE Value**                     | 19.4%                                                                        |
| **MAPE Interpretation**            | Good accuracy (based on Lewis, 1982)                                        |
| **Price Range in Dataset**         | 32,743 won – 521,902 won                                                    |
| **Top Influential Features**       | - Apartment type (terraced)  <br> - Nearby facilities  <br> - Apartment size |
| **Model Usefulness**               | Helps set accurate prices, understand feature impact, predict price changes |
| **Limitation**                     | Possible bias due to limited features                                       |
| **Impact**                         | Real estate agents can make better decisions than with raw data alone       |

# Chapter VII : Recommendation
---

| **No.** | **Recommendation**                                                                                                                                                 | **Purpose**                                                                                                 |
|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
| 1      | Add more features such as floor level, year of sale, number of rooms (bedrooms, bathrooms, kitchen), furniture inclusiveness, etc.                                  | To better capture the factors that influence apartment prices and improve model accuracy.                  |
| 2      | Expand the dataset by collecting more recent and updated apartment price data in Daegu, South Korea.                                                                | To improve the relevance of the dataset, enable better pattern recognition, and enhance model performance. |


# Save Machine Learning
---

In [66]:
# Save model
import pickle

estimator = Pipeline([('preprocess', transformer), ('model', xgb.XGBRegressor())])
estimator.fit(X_train, y_train)

pickle.dump(estimator, open('Daegu_Apartment_XGB.sav', 'wb'))

In [67]:
# Load model
filename = 'Daegu_Apartment_XGB.sav'
loaded_model = pickle.load(open(filename, 'rb'))

In [68]:
mean_absolute_percentage_error(y_test,loaded_model.predict(X_test))

0.19859765470027924

In [69]:
np.sqrt(mean_squared_error(y_test,loaded_model.predict(X_test)))

47985.301749598286

In [72]:
train_model.to_excel('data_daegu_after_modelled.xlsx', index=False)