In [None]:
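library(tidyverse)   # also attaches ggplot2, so no separate library(ggplot2) call is needed
library(tidymodels)
library(repr)
library(GGally)

# Install kknn (the engine used for KNN models) only if it is not already available
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")
library(kknn)

# Limit how many rows are displayed when a data frame is printed
options(repr.matrix.max.rows = 6)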

**Introduction**

Vancouver is reported to be the least affordable housing market in North America (Grigoryeva & Ley, 2019). Housing prices fluctuate with the health of the economy, but they also depend on a variety of property-specific factors, such as the size of the property and the amenities installed. Some features of a home should therefore make it intrinsically more valuable than others. **Setting economic fluctuations aside, what makes one house worth more than another, and how can we predict the value of a house?**

*Goal*: Predict the price of a house based on its size, while also considering additional attributes (bedrooms, bathrooms, stories, and parking).

To address this question, we will use the Housing Price Prediction dataset from Kaggle. This dataset includes 13 attributes of residential properties, covering property size, architectural features, and the presence of amenities and infrastructure within the house. Our focus will be on the quantitative variables, such as area, bedrooms, bathrooms, stories, and parking. Analyzing these non-economic factors provides insight into their influence on price, potentially helping future homebuyers with budget planning.


**Methodology**

We will assess both K-nearest neighbours (KNN) regression and linear regression to determine which is more accurate on this data. Both models predict the price of a home from its predictor variables, the main one being the area of the home. As is standard when building a model, the dataset is first split into a training set and a testing set. For the KNN algorithm, the number of nearest neighbours must be tuned first; only afterwards do we fit the model on the training set and evaluate its accuracy by comparing the predicted prices against the actual prices in the test set.
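
The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the analysis itself: `toy_df` is synthetic stand-in data generated on the spot (it is not the Kaggle housing data), and the 5-fold cross-validation, the grid of candidate `neighbors` values, and the helper `test_rmspe_for()` are assumptions chosen for the example.

```r
library(tidymodels)

# Synthetic stand-in data (NOT the housing dataset): price rises roughly linearly with area
set.seed(2023)
toy_df <- tibble(
  area  = runif(200, 1500, 10000),
  price = 2e6 + 450 * area + rnorm(200, sd = 5e5)
)

# Split into training and testing sets
toy_split <- initial_split(toy_df, prop = 0.75, strata = price)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Shared preprocessing: KNN is distance-based, so predictors are standardized
toy_recipe <- recipe(price ~ area, data = toy_train) |>
  step_normalize(all_predictors())

# KNN regression with the number of neighbours left to be tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Tune K with 5-fold cross-validation on the training set only
toy_folds <- vfold_cv(toy_train, v = 5)
k_grid <- tibble(neighbors = seq(1, 41, by = 4))

knn_results <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = toy_folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# Pick the K with the lowest cross-validated RMSE
best_k <- knn_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit KNN with the chosen K, and fit linear regression, on the full training set
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

lm_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = toy_train)

# Hypothetical helper: RMSE of a fitted workflow on the held-out test set
test_rmspe_for <- function(fitted_wf) {
  predict(fitted_wf, toy_test) |>
    bind_cols(toy_test) |>
    rmse(truth = price, estimate = .pred) |>
    pull(.estimate)
}

test_rmspe <- c(knn = test_rmspe_for(knn_fit), lm = test_rmspe_for(lm_fit))
test_rmspe
```

The model with the lower test-set error would be preferred. Note that the test set is touched only once, at the very end, so that it remains a fair estimate of performance on unseen homes.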


In [None]:
url <- "https://raw.githubusercontent.com/Xela-debug/datasets/main/Housing.csv"
housing_df <- read_csv(url)

set.seed(2023)
housing_split <- initial_split(housing_df, prop = 0.75, strata = price)
housing_test <- testing(housing_split)
housing_train <- training(housing_split)

# Take a quick look at the first and last rows of the training set
head(housing_train)
tail(housing_train)

In [None]:
# Count the total number of NA values across the whole training set
housing_check_na <- tibble(check_for_na = sum(is.na(housing_train)))

housing_check_na
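
If the total count were ever non-zero, a per-column breakdown would show where the missing values sit. The sketch below illustrates one way to do this with `across()`; `toy_df` is a small made-up data frame used only for demonstration, not the housing data.

```r
library(dplyr)

# Made-up example data with one NA in `area` and one in `bedrooms`
toy_df <- tibble(
  area     = c(7420, NA, 9960),
  bedrooms = c(4, 4, NA),
  parking  = c(2, 3, 2)
)

# Count NA values separately in every column
na_by_column <- toy_df |>
  summarize(across(everything(), ~ sum(is.na(.x))))

na_by_column
```

The same `summarize(across(...))` call could be applied to `housing_train` in place of `toy_df`.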