In [None]:
library(tidyverse)
library(tidymodels)
library(repr)
library(ggplot2)
options(repr.matrix.max.rows = 6)
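# Install kknn only if it is not already available, then load it
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")
library(kknn)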
library(GGally)

**Introduction**

Vancouver is reported to be the least affordable housing market in North America (Grigoryeva & Ley, 2019). Housing prices fluctuate with the health of the economy, but they also depend on a variety of property-level factors, such as the size of the property and the amenities installed. Some characteristics of a home should make it intrinsically more valuable than others, regardless of market conditions. **Setting economic fluctuations aside, what makes one house worth more than another, and how can we predict the value of a house?**

*Goal*: Predict a house's price from its size, taking into account additional attributes (bedrooms, bathrooms, stories, and parking).

To address this question, we will use the Housing Price Prediction dataset from Kaggle. This data set includes 13 attributes of residential properties, covering property size, architectural features, amenities, and infrastructure within the house. Our focus will be on the quantitative variables: area, bedrooms, bathrooms, stories, and parking. Analyzing these non-economic factors provides insight into their influence on price, potentially helping future homebuyers with budget planning.


**Methodology**

We will assess both KNN and linear regression models to determine which is more accurate. Both models predict the price of a home from its predictor variables, the main one being the area of the home. As is standard when building a model, the data set is first split into training and testing sets. For the KNN algorithm, the number of nearest neighbours must first be tuned; only afterwards will we fit the model on the training set and evaluate its accuracy by comparing the predicted prices against the actual prices in the test set.
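
The workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the final analysis: it uses a small synthetic data frame (`toy`, hypothetical) with a single predictor so that it runs on its own, and the neighbour grid, fold count, and helper `test_rmse()` are arbitrary choices made for the example.

```r
library(tidymodels)
library(kknn)

set.seed(1)

# Hypothetical stand-in data: price depends roughly linearly on area, plus noise
toy <- tibble(area = runif(200, 1000, 10000)) |>
  mutate(price = 500 * area + rnorm(200, sd = 5e5))

# Split into training and testing sets, stratified on the response
toy_split <- initial_split(toy, prop = 0.75, strata = price)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Standardize the predictor so distances in KNN are on a common scale
toy_recipe <- recipe(price ~ area, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN: tune the number of neighbours with 5-fold cross-validation
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

toy_vfold <- vfold_cv(toy_train, v = 5, strata = price)

knn_results <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = toy_vfold, grid = tibble(neighbors = seq(1, 41, by = 5))) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# Pick the K with the lowest cross-validated RMSE
best_k <- knn_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit KNN with the chosen K, and fit linear regression, on the training set
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

knn_fit <- workflow() |> add_recipe(toy_recipe) |> add_model(knn_spec) |> fit(data = toy_train)
lm_fit  <- workflow() |> add_recipe(toy_recipe) |> add_model(lm_spec)  |> fit(data = toy_train)

# Hypothetical helper: RMSE of predictions against actual prices in the test set
test_rmse <- function(fitted_workflow) {
  predict(fitted_workflow, toy_test) |>
    bind_cols(toy_test) |>
    metrics(truth = price, estimate = .pred) |>
    filter(.metric == "rmse") |>
    pull(.estimate)
}

knn_rmspe <- test_rmse(knn_fit)
lm_rmspe  <- test_rmse(lm_fit)
```

Comparing `knn_rmspe` and `lm_rmspe` then indicates which model predicts unseen prices more accurately on this toy data; the same pattern would apply to the housing data with additional predictors in the recipe formula.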


In [None]:
url <- "https://raw.githubusercontent.com/Xela-debug/datasets/main/Housing.csv"
housing_df <- read_csv(url)

set.seed(2023)
housing_split <- initial_split(housing_df, prop = 0.75, strata = price)
housing_test <- testing(housing_split)
housing_train <- training(housing_split)

# Taking a glance into the training data set
head(housing_train)
tail(housing_train)

In [None]:
# Checking for NA values: count the missing entries in each column of the training set
housing_check_na <- housing_train |>
    summarize(across(everything(), ~ sum(is.na(.x))))

housing_check_na

In [None]:
# Scatterplot for price vs. area
housing_plot <- housing_train |>
    ggplot(aes(x = area, y = price)) +
    geom_point() +
    labs(x = "Area (sq ft)", y = "House Price") +
    ggtitle("House Price vs. Area (sq ft)")

housing_plot

In [None]:
# Examining the average effect of the other quantitative variables on house price:
# number of bedrooms, bathrooms, stories, and parking spots.

# Mean price of all homes that share the same number of bedrooms
bedrooms_plot <- housing_df |>
    group_by(bedrooms) |>
    summarize(mean_price_per_bedroom = mean(price)) |>
    ggplot(aes(x = bedrooms, y = mean_price_per_bedroom)) +
        geom_line() +
        labs(title = "Number of Bedrooms'\n Average Effect on House Price", x = "Number of Bedrooms", y = "Mean Price \n of all Homes with x Bedrooms")

# Mean price of all homes that share the same number of bathrooms
bathrooms_plot <- housing_df |>
    group_by(bathrooms) |>
    summarize(mean_price_per_bathroom = mean(price)) |>
    ggplot(aes(x = bathrooms, y = mean_price_per_bathroom)) +
        geom_line() +
        labs(title = "Number of Bathrooms'\n Average Effect on House Price", x = "Number of Bathrooms", y = "Mean Price \n of all Homes with x Bathrooms")

# Mean price of all homes that share the same number of stories
stories_plot <- housing_df |>
    group_by(stories) |>
    summarize(mean_price_per_stories = mean(price)) |>
    ggplot(aes(x = stories, y = mean_price_per_stories)) +
        geom_line() +
        labs(title = "Number of Stories'\n Average Effect on House Price", x = "Number of Stories", y = "Mean Price \n of all Homes with x Stories")

# Mean price of all homes that share the same number of parking spots
parking_plot <- housing_df |>
    group_by(parking) |>
    summarize(mean_price_per_parking = mean(price)) |>
    ggplot(aes(x = parking, y = mean_price_per_parking)) +
        geom_line() +
        labs(title = "Number of Parking Spots'\n Average Effect on House Price", x = "Number of Parking Spots", y = "Mean Price \n of all Homes with x Parking Spots")

# Combining the graphs for clearer visualization
if (!requireNamespace("gridExtra", quietly = TRUE)) install.packages("gridExtra")
library(gridExtra)
grid.arrange(bedrooms_plot, bathrooms_plot, stories_plot, parking_plot, nrow = 2)

**Observations:** As expected, all four graphs indicate that housing price generally increases as the quantitative variable (number of bedrooms, bathrooms, stories, or parking spots) increases. However, each trend contains interesting variations that are worth looking into:

- **Bedrooms:** peak price is observed at 5 bedrooms, followed by a decrease thereafter
- **Bathrooms:** price increases significantly with more than 3 bathrooms
- **Stories:** price rises steadily as the number of stories increases
- **Parking:** maximum price is observed at 2 parking spots
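
A group mean can rest on only a handful of homes, so it may help to tabulate how many homes sit behind each mean before reading much into a peak or dip. The sketch below shows one way to do this; `toy_homes` is a hypothetical stand-in for `housing_df`, and its values are made up purely for illustration.

```r
library(dplyr)

# Hypothetical stand-in for housing_df; values are illustrative only
toy_homes <- tibble(
  bedrooms = c(2, 2, 3, 3, 3, 4, 4, 5, 6),
  price    = c(30, 34, 40, 44, 48, 55, 61, 80, 50)
)

# Mean price and number of homes for each bedroom count
bedroom_summary <- toy_homes |>
  group_by(bedrooms) |>
  summarize(mean_price = mean(price), n_homes = n())

# The bedroom count with the highest mean price, along with its group size
peak <- bedroom_summary |>
  slice_max(mean_price, n = 1)

bedroom_summary
peak
```

In this toy data the highest mean comes from a group containing a single home, which is the kind of situation where a peak should be interpreted with caution. The same summary could be applied to `bathrooms`, `stories`, and `parking`.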

These observations mirror insights from a study of housing price determinants in the USA. That research indicates that the square footage of a property has the largest impact on housing price, followed closely by its location and its numbers of bathrooms and bedrooms (Jafari & Akhavian, 2019). Our observations are consistent with these findings. The study also indicates that neighborhood characteristics, such as distance to open spaces and shopping malls, have a comparatively weaker influence on housing prices, which further supports our choice of variables.


In [None]:
s