# Distance to the nearest MRT station in the Xindian district in Taiwan

#### Mikail Durrani, Merwise Hamidi, Raymond Li, Ethan Yang

### Introduction

Taiwan's population exceeds 23 million, and in its high-density areas this large population shapes the housing market. Notably, a substantial portion of the populace relies on public transit, with over 2 million daily bus users and an MRT (Mass Rapid Transit) system serving millions.

Source: https://metasub.org/city-profiles/taipei-taiwan/#:~:text=It%20consists%20of%20108%20stations,is%20Taiwan


Using the UCI Real Estate Valuation dataset, our project examines MRT proximity in Xindian, one of the areas in Taiwan with high demand for MRT service.

<br>

Thus, we pose the question: **What distance to the nearest MRT station can you expect in the Xindian district, given the transaction date, house age, latitude, longitude, and house price of unit area?**


 

### Preliminary Exploratory Data Analysis

In [1]:
# Load the libraries used in the analysis
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 10)
source('tests.R')
source("cleanup.R")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ rsample

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [4]:
library(readxl)

In [37]:
# Read the data from the Excel file and change the column names
Sindian_RE <- read_excel("Group36Dsci100/Data/Real estate valuation data set.xlsx")
new_col_names <- c("Number", "trans_date", "age", "dist_to_mrt",
                   "no_of_convenience_stores", "latitude", "longitude",
                   "house_price_of_unit_area")
colnames(Sindian_RE) <- new_col_names

# Select the columns we are going to use
sindian_re <- Sindian_RE |>
    select(trans_date, age, dist_to_mrt, latitude, longitude, house_price_of_unit_area)
sindian_re

trans_date,age,dist_to_mrt,latitude,longitude,house_price_of_unit_area
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2012.917,32.0,84.87882,24.98298,121.5402,37.9
2012.917,19.5,306.59470,24.98034,121.5395,42.2
2013.583,13.3,561.98450,24.98746,121.5439,47.3
2013.500,13.3,561.98450,24.98746,121.5439,54.8
2012.833,5.0,390.56840,24.97937,121.5425,43.1
⋮,⋮,⋮,⋮,⋮,⋮
2013.000,13.7,4082.01500,24.94155,121.5038,15.4
2012.667,5.6,90.45606,24.97433,121.5431,50.0
2013.250,18.8,390.96960,24.97923,121.5399,40.6
2013.000,8.1,104.81010,24.96674,121.5407,52.5


In [None]:
#Splitting data into training and testing data
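
A minimal sketch of how this split could look, assuming the 75/25 proportion and the stratification on the response described in the Methods section. The helper name `split_sindian` and the seed value are hypothetical, chosen only for illustration.

```r
library(tidymodels)

# Hypothetical helper: split a data frame into training and testing sets,
# stratifying on the response column dist_to_mrt.
split_sindian <- function(data, prop = 0.75, seed = 2023) {
  set.seed(seed)  # arbitrary seed, used only to make the split reproducible
  data_split <- initial_split(data, prop = prop, strata = dist_to_mrt)
  list(train = training(data_split), test = testing(data_split))
}

# Apply the helper to the tibble from the previous cell, if that cell has run.
if (exists("sindian_re")) {
  sindian_split <- split_sindian(sindian_re)
  sindian_training <- sindian_split$train
  sindian_test <- sindian_split$test
}
```

Stratifying on `dist_to_mrt` keeps the distribution of distances similar in both sets, which matters here because a few properties lie several kilometres from a station.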

In [None]:
#Summarize data (eg: find means, show # of missing rows)
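
One possible way to produce this summary is sketched below. It assumes every column is numeric, as stated in the Methods section, and that the training set is stored under the name `sindian_training`. The helper name `summarize_columns` is hypothetical.

```r
library(tidyverse)

# Hypothetical helper: return one row per column with its mean and the
# number of missing values. Assumes all columns are numeric.
summarize_columns <- function(data) {
  data |>
    pivot_longer(everything(), names_to = "variable", values_to = "value") |>
    group_by(variable) |>
    summarize(mean = mean(value, na.rm = TRUE),
              n_missing = sum(is.na(value)))
}

# Summarize the training set, if it has been created.
if (exists("sindian_training")) {
  summarize_columns(sindian_training)
}
```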

In [None]:
#Plot the distribution of each predictor variable 
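
The distributions could be drawn as one histogram panel per predictor, as in the sketch below. The helper name `plot_predictor_distributions`, the bin count, and the axis labels are assumptions made for illustration, and the training set is assumed to be stored as `sindian_training`.

```r
library(tidyverse)

# Hypothetical helper: draw one histogram panel per predictor, that is,
# every column except the response dist_to_mrt.
plot_predictor_distributions <- function(data, bins = 30) {
  data |>
    select(-dist_to_mrt) |>
    pivot_longer(everything(), names_to = "predictor", values_to = "value") |>
    ggplot(aes(x = value)) +
    geom_histogram(bins = bins) +
    facet_wrap(vars(predictor), scales = "free") +
    labs(x = "Value", y = "Number of observations")
}

# Plot the training set, if it has been created.
if (exists("sindian_training")) {
  plot_predictor_distributions(sindian_training)
}
```

Free scales are used because the predictors have very different ranges (for example, latitude varies by hundredths of a degree while house age spans decades).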


### Methods

Using regression, we predict the distance to the nearest MRT station that a buyer can expect, based on several variables.
The variables used from the data set are as follows:
1. Transaction date (decimal year, YYYY + month/12; for example, 2013.250 is March 2013)
1. House age (years)
1. Distance to the nearest MRT station (metres)
1. Latitude (degrees)
1. Longitude (degrees)
1. House price of unit area (10,000 New Taiwan Dollars per ping, where 1 ping = 3.3 square metres)
<br>

All columns are numeric. There are 414 rows and no missing values in the data set.
To clean the data, we simplified the column names using colnames() to make analysis less tedious. We then use initial_split(data, prop = 0.75, strata = dist_to_mrt) to place 75% of the data in a training set, sindian_training, and 25% in a testing set, sindian_test. Next, we create a K-nearest neighbors regression model specification using nearest_neighbor() and a recipe that uses our predictor variables, combine them in a workflow, and fit it to the training data. Afterwards, we tune the number of neighbors over a grid, collect the metrics, and filter for "rmse" to find the K value with the lowest error. Finally, we use the tuned workflow to predict the distance to the MRT station, plot the predictions on a scatter plot, and overlay the fitted line.
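
The tuning and fitting steps above can be sketched as follows. This is an illustration under stated assumptions, not the final analysis: the helper names (`make_knn_recipe`, `tune_knn`, `fit_knn`), the grid of K values, the number of cross-validation folds, the seed, and the scaling and centering steps in the recipe are all choices made for this example and are not specified in the text. The "kknn" engine requires the kknn package to be installed.

```r
library(tidymodels)

# Recipe predicting dist_to_mrt from all other columns.
# Scaling and centering are an assumption; KNN is distance-based, so
# predictors on very different scales would otherwise dominate.
make_knn_recipe <- function(train) {
  recipe(dist_to_mrt ~ ., data = train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
}

# Hypothetical helper: cross-validate KNN regression over a grid of K
# and return the RMSE for each K, lowest first.
tune_knn <- function(train, k_values = 1:20, folds = 5, seed = 2023) {
  set.seed(seed)  # arbitrary seed for reproducible folds

  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

  knn_vfold <- vfold_cv(train, v = folds, strata = dist_to_mrt)

  workflow() |>
    add_recipe(make_knn_recipe(train)) |>
    add_model(knn_spec) |>
    tune_grid(resamples = knn_vfold, grid = tibble(neighbors = k_values)) |>
    collect_metrics() |>
    filter(.metric == "rmse") |>
    arrange(mean)
}

# Hypothetical helper: fit the workflow on the training set with a chosen K.
fit_knn <- function(train, k) {
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
    set_engine("kknn") |>
    set_mode("regression")

  workflow() |>
    add_recipe(make_knn_recipe(train)) |>
    add_model(knn_spec) |>
    fit(data = train)
}

# Usage, if the training and testing sets have been created.
if (exists("sindian_training") && exists("sindian_test")) {
  knn_results <- tune_knn(sindian_training)
  best_k <- knn_results$neighbors[1]  # K with the lowest cross-validated RMSE
  knn_fit <- fit_knn(sindian_training, best_k)
  sindian_predictions <- predict(knn_fit, sindian_test) |> bind_cols(sindian_test)
}
```

Tuning is done with cross-validation on the training set only, so the testing set stays untouched until the final evaluation.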


### Expectations, outcomes, and significance

We expect that the transaction date and the real estate units' longitude and latitude will correlate with the distance to the MRT station. We also think that houses near MRT stations will cost more. The idea is that public transportation provides opportunities: it gives people who live close to a station the freedom and flexibility to live how they choose, so we believe proximity will increase housing prices. MRTs are often affordable, connect riders easily to work and nightlife, and help them avoid parking hassles and traffic.
<br>

The results can help plan future developments. If houses near MRT stations cost more, this could influence property prices and where people choose to live. Prospective buyers could also use our analysis to look for the best house prices.
<br>

Next steps involve making models to predict MRT distance using dates, house age, location, and prices. We'll explore how house age connects to MRT distance and see if specific areas have different property values. We'll also check how changes in transportation might affect property values and if MRT accessibility is fair for everyone.