In [1]:
library(tidyverse)
library(dplyr)
library(GGally)
library(readr)
getwd()

“package ‘dplyr’ was built under R version 4.3.2”
“package ‘stringr’ was built under R version 4.3.2”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
“package ‘GGally’ was built under R version 4.3.2”
Registered S3 me

### Intro

In this project, I use information about a house to predict its sale price.

Data source: https://www.kaggle.com/competitions/simple-housing-price-prediction/data

| Variable Name | Variable Type | Description                                                                                               |
|---------------|---------------|-----------------------------------------------------------------------------------------------------------|
| house_id      | Integer       | ID variable                                                                                               |
| date          | Character     | Date of sale                                                                                              |
| location      | Character     | Location of house                                                                                         |
| type          | Character     | Type of house. Options: '1 ROOM', '2 ROOM', '3 ROOM', '4 ROOM', '5 ROOM', 'EXECUTIVE', 'MULTI-GENERATION' |
| block         | Character     | The block that the house resides on                                                                       |
| street        | Character     | The street that the house resides on                                                                      |
| storey_range  | Character     | Which stories are occupied by the house                                                                   |
| area_sqm      | Double        | Floor area of the house in square meters                                                                  |
| flat_model    | Character     | Model of the flat; different letters represent different layouts, room architecture, etc.                 |
| commence_date | Integer       | When the house was put up for sale                                                                        |
| price         | Double        | Target variable: the price the house was sold for                                                         |

Through exploratory data analysis (EDA), I visually examine the relationships between these variables. I then fit an additive multiple linear regression (MLR) model to the data to estimate coefficients while accounting for potential confounding factors. Because this study focuses on prediction, I may use an automated procedure such as backward selection to find the set of variables that gives the best predictive model.
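
As a minimal sketch of the backward selection mentioned above, base R's `step()` can start from a full additive model and drop terms one at a time by AIC. The built-in `mtcars` data and the predictors chosen here are stand-ins for illustration only; they are not the housing data or the model used in this project.

```r
# Illustration only: mtcars stands in for the housing data.
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination: at each step, drop the term whose removal
# lowers AIC the most; stop when no removal improves AIC.
backward_fit <- step(full_fit, direction = "backward", trace = 0)

# Terms kept by the procedure
formula(backward_fit)
```

Selecting on AIC rewards in-sample fit with a penalty for model size, so the selected model should still be checked on held-out data before it is used for prediction.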

### EDA

In [2]:
dat <- read.csv("train.csv")
head(dat)

,house_id,date,location,type,block,street,storey_range,area_sqm,flat_model,commence_date,price
,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<int>,<dbl>
1,199577,2006-09,Raleigh,5 ROOM,107D,Agawan Court,07 TO 09,110,D,2003,313000
2,217021,2007-06,Fresno,3 ROOM,678,Cleo St,07 TO 09,64,N,1988,167000
3,308062,2010-09,Tucson,4 ROOM,5,E Pleasant View Way,10 TO 12,92,K,1976,430000
4,212465,2007-04,Austin,4 ROOM,326,Park Hollow Ln,10 TO 12,92,K,1977,303800
5,60654,2001-10,Honolulu,4 ROOM,794,Ala Puawa Place,04 TO 06,102,G,1998,212000
6,193658,2006-06,Riverside,4 ROOM,296,Jay Ct,07 TO 09,90,G,2000,248000


Because date, type, and storey_range are hard to use directly in a linear regression, I will coarsen them (keeping the year of sale, the leading characters of the type, and the upper storey of the range) and convert the character variables to factors.
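
To make the effect of the `substr()` calls in the next cell concrete, here is a small check on a few sample values taken from the data preview and the variable table above. The object names are hypothetical and used only for this illustration.

```r
# Sample values taken from the data preview and the variable table
year_part   <- substr("2006-09", 1, 4)    # "2006": year of sale
type_part   <- substr("5 ROOM", 1, 2)     # "5 ": first two characters, including the space
exec_part   <- substr("EXECUTIVE", 1, 2)  # "EX": non-numeric types keep their first two letters
storey_part <- substr("07 TO 09", 6, 8)   # " 09": upper storey, with a leading space

# trimws() removes the surrounding whitespace if cleaner labels are wanted
storey_clean <- trimws(storey_part)       # "09"

c(year_part, type_part, exec_part, storey_part, storey_clean)
```

The stray spaces do not change the number of factor levels, but trimming them gives tidier labels in model summaries.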

In [3]:
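# Coarsen date, type, and storey_range, then convert character variables to factors
dat$date <- substr(dat$date, 1, 4)                  # keep the year of sale, e.g. "2006-09" -> "2006"
dat$type <- substr(dat$type, 1, 2)                  # keep the first two characters of the type
dat$storey_range <- substr(dat$storey_range, 6, 8)  # keep the upper storey of the range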

head(dat)
dat <- select(dat, house_id, date, location, type, block, street, storey_range, area_sqm, flat_model, commence_date, price) |>
              mutate(date = as.factor(date),
                     location = as.factor(location),
                     type = as.factor(type),
                     block = as.factor(block),
                     street = as.factor(street),
                     storey_range = as.factor(storey_range),
                     flat_model = as.factor(flat_model))

date.Namelevel = nlevels(dat$date)   
location.Namelevel = nlevels(dat$location) 
type.Namelevel = nlevels(dat$type) 
block.Namelevel = nlevels(dat$block) 
street.Namelevel = nlevels(dat$street) 
storey_range.Namelevel = nlevels(dat$storey_range) 
flat_model.Namelevel = nlevels(dat$flat_model) 


levels<- data.frame(date.Namelevel, location.Namelevel, type.Namelevel, block.Namelevel, street.Namelevel, storey_range.Namelevel, flat_model.Namelevel)
levels


,house_id,date,location,type,block,street,storey_range,area_sqm,flat_model,commence_date,price
,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<int>,<dbl>
1,199577,2006,Raleigh,5,107D,Agawan Court,9,110,D,2003,313000
2,217021,2007,Fresno,3,678,Cleo St,9,64,N,1988,167000
3,308062,2010,Tucson,4,5,E Pleasant View Way,12,92,K,1976,430000
4,212465,2007,Austin,4,326,Park Hollow Ln,12,92,K,1977,303800
5,60654,2001,Honolulu,4,794,Ala Puawa Place,6,102,G,1998,212000
6,193658,2006,Riverside,4,296,Jay Ct,9,90,G,2000,248000


date.Namelevel,location.Namelevel,type.Namelevel,block.Namelevel,street.Namelevel,storey_range.Namelevel,flat_model.Namelevel
<int>,<int>,<int>,<int>,<int>,<int>,<int>
13,26,7,1984,522,14,16


I counted the levels of each categorical variable. **block** (1984 levels) and **street** (522 levels) have far too many levels to fit comfortably in a linear model, because each level beyond the first adds its own coefficient. Therefore, **block** and **street** are dropped from the analysis.
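
The cost of a high-cardinality factor can be seen in the design matrix that `lm()` builds. The following is a sketch on made-up toy data (the `toy` data frame and its values are hypothetical), assuming R's default treatment coding, under which a factor with k levels contributes k - 1 dummy columns.

```r
# Toy data: one factor with 3 levels and one numeric predictor (made-up values)
toy <- data.frame(
  street   = factor(c("A St", "B St", "C St", "A St", "B St", "C St")),
  area_sqm = c(64, 92, 110, 70, 88, 105)
)

# model.matrix() shows the design matrix lm() would build:
# intercept + (3 - 1) street dummies + area_sqm = 4 columns
X <- model.matrix(~ street + area_sqm, data = toy)
colnames(X)
ncol(X)

# Under the same coding, the level counts reported above would add
# (1984 - 1) + (522 - 1) coefficients for block and street alone
extra_coefs <- (1984 - 1) + (522 - 1)
extra_coefs
```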

In [4]:
# Final data set
dat<- select(dat, house_id, date, location, type, area_sqm, storey_range, flat_model, commence_date, price)
nrow(dat)

### Implementation of a proposed model

In [None]:
set.seed(2024)

train <- dat %>% dplyr::sample_frac(0.60)
test  <- dplyr::anti_join(dat, train, by = 'house_id')

training_dat<- train
testing_dat<- test
nrow(train)
nrow(test)



In [None]:
# Full OLS
full_OLS <- lm(price ~  date+location+type+area_sqm+storey_range+flat_model+commence_date, data = training_dat)

# Predict using the Full OLS model on the test set
testing_dat$.pred <- full_OLS %>%
    predict(testing_dat)

# calculate RMSE 
OLS_house_test <- testing_dat %>%
    mutate(pred_error = price - .pred)

OLS_house_rmse_aic <- sqrt(mean(OLS_house_test$pred_error^2)) %>%
    round(2)
OLS_house_rmse_aic
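
The RMSE computed above is the square root of the mean squared prediction error. As a small sketch, the same calculation can be wrapped in a helper; the `rmse()` function name and the numbers below are made up for illustration and are not results from the model.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Made-up prices and predictions, for illustration only
observed  <- c(313000, 167000, 430000)
predicted <- c(300000, 180000, 420000)

toy_rmse <- round(rmse(observed, predicted), 2)
toy_rmse
```

Because RMSE is in the same units as price, it can be read as a typical size of the prediction error on the test set.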

In [None]:
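# Collect the test-set predictions and write them to a CSV file
# (avoids creating an object named `predict`, which would shadow the function name)
results <- data.frame(house_id = testing_dat$house_id, price = testing_dat$.pred)
write_csv(results, 'house_price_predictions.csv')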