In [255]:
library(tidyverse)
library(tidymodels)
library(janitor)
library(leaps)

# STAT 306 Group C3 Project

# The Data & Goal Analysis

The data explored here is the [Real Estate Valuation Data Set](https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set), which records real-estate prices in Sindian Dist., New Taipei City, Taiwan. It consists of n = 414 observations with numerical and time-related features and was obtained from the UC Irvine Machine Learning Repository.

The variables we have are:
- X1 = The transaction date in numerical units. For instance, 2013.250 equals March 2013, where the month is expressed as a fraction of the year (3/12 = 0.250)
- X2 = The house age in years
- X3 = Distance to the nearest MRT station in metres (MRT is the metro transit system)
- X4 = Number of convenience stores within walking distance of the house (integer count)
- X5 = Latitude in degrees
- X6 = Longitude in degrees
- Y = The house price per unit area (the response variable)
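
As a small illustration of the X1 encoding described above, the sketch below splits a fractional-year date into a calendar year and a month number. The helper name `decode_date` is hypothetical, and the sketch assumes the fraction is always month / 12; a fraction of exactly 0 would yield month 0, which is left uninterpreted here.

```r
# Hypothetical helper: split a fractional-year date into year and month,
# assuming the fractional part equals month / 12 (e.g. 0.250 -> month 3).
decode_date <- function(x) {
  year <- floor(x)
  month <- as.integer(round((x - year) * 12))
  data.frame(year = as.integer(year), month = month)
}

# 2013.250 -> March 2013, 2013.500 -> June 2013, 2012.917 -> November 2012
decode_date(c(2013.250, 2013.500, 2012.917))
```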

**The primary objective of this analysis is to determine how the real estate price is influenced by various factors such as house age, proximity to transportation (MRT), convenience store availability, and geographical location.**

# Reading In the Data

In [256]:
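# Assumes `real_estate_data` has already been loaded into the session
real_estate_data <- clean_names(real_estate_data)
head(real_estate_data)

# Fix the seed so the 80/20 stratified split is reproducible (seed value is arbitrary)
set.seed(306)
df_split <- initial_split(real_estate_data, prop = 0.8, strata = y_house_price_of_unit_area)
df_train <- training(df_split)
df_test <- testing(df_split)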

| no | x1_transaction_date | x2_house_age | x3_distance_to_the_nearest_mrt_station | x4_number_of_convenience_stores | x5_latitude | x6_longitude | y_house_price_of_unit_area |
|---|---|---|---|---|---|---|---|
| &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| 1 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.5402 | 37.9 |
| 2 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.5395 | 42.2 |
| 3 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.5439 | 47.3 |
| 4 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.5439 | 54.8 |
| 5 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.5425 | 43.1 |
| 6 | 2012.667 | 7.1 | 2175.03 | 3 | 24.96305 | 121.5125 | 32.1 |


# Feature Engineering & Model Development

In [257]:
# Time based features
df_train <- df_train %>%
  mutate(
    year = as.integer(x1_transaction_date),
    # Fractional part of the date times 12 gives the month (e.g. 0.250 -> 3)
    month = as.integer(round((x1_transaction_date - year) * 12))
  )

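# Create location-based features: k-means cluster of the coordinates.
# k-means starts from random centers, so fix the seed (value is arbitrary).
# Note: cluster labels are stored as integers, so the model treats them as numeric.
set.seed(306)
df_train$neighborhood <- kmeans(df_train[, c("x5_latitude", "x6_longitude")], centers = 5)$cluster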

# Proximity-based features
df_train$distance_to_mrt_category <- cut(df_train$x3_distance_to_the_nearest_mrt_station,
                            breaks = c(-Inf, 250, 500, 750, 1000, Inf),
                            labels = c("under_250m", "250m_500m", "500m_750m", "750m_1000m", "over_1000m"))

# Drop the row index, which is an identifier rather than a predictor
model_df <- select(df_train, -no)
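
The `neighborhood` feature above is fitted on the training set only. If the same feature is later needed for `df_test`, one option is to assign each test point to its nearest training centroid. The sketch below shows that idea on synthetic coordinates; `assign_cluster` is a hypothetical helper (not part of base R or tidymodels), and the toy data and seed are assumptions for illustration only.

```r
# Hypothetical helper: label each row of `points` with the index of the
# nearest row of `centers`, using squared Euclidean distance.
assign_cluster <- function(points, centers) {
  points <- as.matrix(points)
  labels <- apply(points, 1, function(p) {
    # t(centers) has one column per cluster; p is recycled down each column
    which.min(colSums((t(centers) - p)^2))
  })
  unname(labels)
}

set.seed(1)  # arbitrary seed so the toy example is repeatable
# Two well-separated synthetic blobs of 20 points each
train_xy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.1), ncol = 2)
)
km <- kmeans(train_xy, centers = 2, nstart = 10)

# New points: one near each blob
new_xy <- rbind(c(0.05, -0.02), c(4.9, 5.1))
assign_cluster(new_xy, km$centers)
```

With the real data, the call would take the test coordinates and the `centers` component of the k-means fit, which means the fitted object would need to be kept rather than only its `$cluster` vector.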

## Model Selection

In [270]:
# Exhaustive best-subset search over all candidate predictors
s <- regsubsets(y_house_price_of_unit_area ~ ., data = model_df, method = "exhaustive", nvmax = 100)
ss <- summary(s)

# Fit criteria for the best model of each size
rss <- ss$rss
adjr2 <- ss$adjr2
bic <- ss$bic

# One row per model size, showing which terms are included
variables <- data.frame(ss$which)
variables$model <- rownames(variables)
rownames(variables) <- 1:nrow(variables)

output <- cbind(variables, RSS = rss, AdjR2 = adjr2, BIC = bic)

# Select the model size with the highest adjusted R^2
best_adjr2 <- which.max(output$AdjR2)
print(best_adjr2)
print(max(output$AdjR2))
output %>% slice(best_adjr2)

[1] 12
[1] 0.6749975


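| X.Intercept. | x1_transaction_date | x2_house_age | x3_distance_to_the_nearest_mrt_station | x4_number_of_convenience_stores | x5_latitude | x6_longitude | year | month | neighborhood | distance_to_mrt_category250m_500m | distance_to_mrt_category500m_750m | distance_to_mrt_category750m_1000m | distance_to_mrt_categoryover_1000m | model | RSS | AdjR2 | BIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;lgl&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| True | True | True | False | True | True | True | True | True | True | True | True | True | True | 12 | 17561.19 | 0.6749975 | -306.684 |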
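
The selection above picks the model size that maximizes adjusted R², which penalizes RSS by the number of predictors. As a minimal sketch of that criterion, the code below computes adjusted R² by hand and checks it against `lm()`. It uses the built-in `mtcars` data rather than the real estate data, and `adj_r2` is a hypothetical helper name.

```r
# Adjusted R^2 = 1 - (RSS / (n - p - 1)) / (TSS / (n - 1)),
# where p is the number of predictors (excluding the intercept).
adj_r2 <- function(rss, tss, n, p) {
  1 - (rss / (n - p - 1)) / (tss / (n - 1))
}

# Illustration on built-in data (an assumption; not the project's data set)
fit <- lm(mpg ~ wt + hp, data = mtcars)
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

manual <- adj_r2(rss, tss, n = nrow(mtcars), p = 2)

# Should agree with the value reported by summary()
all.equal(manual, summary(fit)$adj.r.squared)
```

Because the penalty grows with `p`, a larger model can have a lower RSS yet a lower adjusted R², which is why the largest candidate model is not automatically selected.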


## Interpretation Of The Results