In [2]:
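library(tidyverse)   # attaches ggplot2, dplyr, readr, and related packages
library(tidymodels)
library(repr)
options(repr.matrix.max.rows = 6)

# Install kknn and GGally only when they are missing
for (pkg in c("kknn", "GGally")) {
    if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(kknn)
library(GGally)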



# Introduction

Vancouver is reported to be the least affordable housing market in North America (Grigoryeva & Ley, 2019). Housing prices fluctuate with the health of the economy, and they also depend on a variety of factors, such as the size of the property and the amenities installed. Even so, some features of a home should intrinsically make it more valuable than others.
**Setting aside economic fluctuations, what makes one house worth more than another, and how can we predict the value of a house?**

*Goal*: Predict a house's price based on its size, with consideration of additional attributes (bedrooms, bathrooms, stories, and parking).

To address this question, we will use the Housing Price Prediction dataset from Kaggle. This data set includes 13 key attributes of residential properties, covering property size, architectural features, amenities, and infrastructure within the house. Our focus will be on the quantitative variables: area, bedrooms, bathrooms, stories, and parking. Analyzing these non-economic factors provides deeper insight into their influence on price predictions, potentially helping future homebuyers with budget planning.


# Methodology

We will assess both KNN regression and linear regression to determine which is more accurate. Both models predict a home's price from its predictor variables, the main one being the area of the home. As is standard when building a model, the dataset is first split into a training set and a testing set. For KNN, the number of nearest neighbours is tuned first; only then do we fit the model on the training set and evaluate its accuracy by comparing its predicted prices with the actual prices in the test set.

This is an outline of the steps involved in reaching our final result: 
1. KNN regression model
2. Linear regression model
3. Compare both models and examine their accuracy
4. Examine and visualize the average effect of other quantitative variables on housing price
5. Final visualization

**Data Summary**

In [None]:
url <- "https://raw.githubusercontent.com/Xela-debug/datasets/main/Housing.csv"
housing_df <- read_csv(url)

set.seed(2023)
housing_split <- initial_split(housing_df, prop = 0.75, strata = price)
housing_test <- testing(housing_split)
housing_train <- training(housing_split)

# Taking a glance into the training data set
head(housing_train)
tail(housing_train)

In [None]:
# Checking for NA values
housing_check_na <- housing_train |>
    summarize(check_for_na = sum(is.na(housing_train)))

housing_check_na

**Initial Visualization**

In [None]:
# Scatterplot for price vs. area
housing_plot <- housing_train |>
    ggplot(aes(x = area, y = price)) +
    geom_point() +
    labs(x = "Area (sqft)", y = "House Price") +
    ggtitle("House Price vs. Area (sqft)")

housing_plot

**Initial Expectations:**
After the initial exploratory data analysis, the relationship between price and area appeared roughly linear, so we expected linear regression to be a good predictor if trained on the right variables. With such a direct relationship between price and variables such as property size, we thought a fitted linear equation would predict unknown prices well. That equation would also show how strongly variables such as property size affect price. We examine this expectation in the analysis that follows.

### 1. KNN Regression Model

In [7]:
print("insert KNN model")

[1] "insert KNN model"
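The placeholder above stands in for the KNN model. As a minimal sketch of the approach described in the Methodology, the block below tunes the number of neighbours with cross-validation and then refits on the training set. The helper name `fit_knn_price`, the default formula `price ~ area`, the grid of candidate `k_values`, and the number of folds are all assumptions for illustration, not choices taken from the original analysis. The helper also assumes the outcome column is named `price`.

```r
library(tidymodels)
library(kknn)

# Hypothetical helper (not part of the original notebook): tunes the number of
# neighbours by cross-validation, then refits the best model on all training rows.
fit_knn_price <- function(train_data,
                          model_formula = price ~ area,   # assumed predictor set
                          k_values = seq(1, 41, by = 4),  # assumed tuning grid
                          folds = 5) {                    # assumed number of CV folds
  # Scale and center predictors so distances are comparable across variables
  knn_recipe <- recipe(model_formula, data = train_data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

  knn_workflow <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec)

  cv_folds <- vfold_cv(train_data, v = folds, strata = price)

  # Cross-validated RMSE for each candidate number of neighbours
  cv_results <- knn_workflow |>
    tune_grid(resamples = cv_folds, grid = tibble(neighbors = k_values)) |>
    collect_metrics() |>
    filter(.metric == "rmse")

  best_k <- cv_results |>
    slice_min(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

  # Plug the best k into the workflow and fit on the full training set
  knn_workflow |>
    finalize_workflow(tibble(neighbors = best_k)) |>
    fit(data = train_data)
}
```

With the training split from the Data Summary, `knn_best_fit <- fit_knn_price(housing_train)` would produce the fitted workflow that the comparison step expects. Passing a wider formula, such as `price ~ area + bedrooms + bathrooms + stories + parking`, would bring in the other quantitative variables.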


### 2. Linear Regression Model

In [8]:
print("insert Linear  reg model")

[1] "insert Linear  reg model"
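The placeholder above stands in for the linear regression model. A minimal sketch of such a model with tidymodels might look like the following. The helper name `fit_lm_price` and the default formula `price ~ area` are assumptions for illustration; the original notebook does not state which predictors the final model used.

```r
library(tidymodels)

# Hypothetical helper (not part of the original notebook): fits an ordinary
# least-squares regression inside a workflow.
fit_lm_price <- function(train_data, model_formula = price ~ area) {
  lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

  # No scaling step: linear regression coefficients stay in the original units
  workflow() |>
    add_recipe(recipe(model_formula, data = train_data)) |>
    add_model(lm_spec) |>
    fit(data = train_data)
}
```

Calling `housing_fit <- fit_lm_price(housing_train)` would give the fitted object used in the comparison step, and `housing_fit |> extract_fit_parsnip() |> tidy()` would list the intercept and slope.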


### 3. Comparing KNN and Linear Regression Model

In [None]:
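# Test-set error for the KNN regression model (the "rmse" row is the RMSPE)
knn_summary <- knn_best_fit |>
    predict(housing_test) |>
    bind_cols(housing_test) |>
    metrics(truth = price, estimate = .pred) |>
    mutate(model = "knn")

# Test-set error for the linear regression model; the truth column is price, not area
lm_results <- housing_fit |>
    predict(housing_test) |>
    bind_cols(housing_test) |>
    metrics(truth = price, estimate = .pred) |>
    mutate(model = "linear")

# Stack the two metric tables so every row is labelled by model
comparison_rmspe <- bind_rows(knn_summary, lm_results)
comparison_rmspe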

### 4. Visualizations of Other Quantitative Variables vs Housing Price

In [3]:
# Examining the average effect of the other quantitative variables on house price:
# number of bedrooms, bathrooms, stories, and parking spots.
# summarize() keeps one row (the mean price) per group, so each line has one point per value.

mean_housing_df <- housing_df |>
    group_by(bedrooms) |>
    summarize(mean_price_per_bedroom = mean(price))
bedrooms_plot <- mean_housing_df |>
    ggplot(aes(x = bedrooms, y = mean_price_per_bedroom)) +
        geom_line() +
        labs(title = "Average Effect of Number of Bedrooms\n on House Price", x = "Number of Bedrooms", y = "Mean Price of Homes\n with This Many Bedrooms")

mean_housing_df <- housing_df |>
    group_by(bathrooms) |>
    summarize(mean_price_per_bathroom = mean(price))
bathrooms_plot <- mean_housing_df |>
    ggplot(aes(x = bathrooms, y = mean_price_per_bathroom)) +
        geom_line() +
        labs(title = "Average Effect of Number of Bathrooms\n on House Price", x = "Number of Bathrooms", y = "Mean Price of Homes\n with This Many Bathrooms")

mean_housing_df <- housing_df |>
    group_by(stories) |>
    summarize(mean_price_per_stories = mean(price))
stories_plot <- mean_housing_df |>
    ggplot(aes(x = stories, y = mean_price_per_stories)) +
        geom_line() +
        labs(title = "Average Effect of Number of Stories\n on House Price", x = "Number of Stories", y = "Mean Price of Homes\n with This Many Stories")

mean_housing_df <- housing_df |>
    group_by(parking) |>
    summarize(mean_price_per_parking = mean(price))
parking_plot <- mean_housing_df |>
    ggplot(aes(x = parking, y = mean_price_per_parking)) +
        geom_line() +
        labs(title = "Average Effect of Number of Parking Spots\n on House Price", x = "Number of Parking Spots", y = "Mean Price of Homes\n with This Many Parking Spots")

# Combining the graphs for clearer visualization (install gridExtra only if missing)
if (!requireNamespace("gridExtra", quietly = TRUE)) install.packages("gridExtra")
library(gridExtra)
grid.arrange(bedrooms_plot, bathrooms_plot, stories_plot, parking_plot, nrow = 2)



**Observations:** As expected, all graphs indicate that housing price increases as the quantitative variable (number of bedrooms, bathrooms, stories, or parking spots) increases. However, each trend shows variations that are worth looking into:

- **Bedrooms:** peak price is observed at 5 bedrooms, followed by a decrease thereafter
- **Bathrooms:** price increases significantly with more than 3 bathrooms
- **Stories:** price rises steadily with the number of stories
- **Parking:** maximum price observed at 2 parking spots

These observations mirror insights from a study on housing price determinants in the USA. The research indicates that the square footage of a property has the most significant impact on housing price, closely followed by its location and its numbers of bathrooms and bedrooms (Jafari & Akhavian, 2019). Our findings are consistent with these conclusions. Moreover, the study indicates that neighborhood characteristics, such as distance to open spaces and shopping malls, have a comparatively weaker influence on housing prices, which further supports our choice of variables.

### 5. Final Visualization

In [9]:
print('insert final scatter plot (KNN) ')

[1] "insert final scatter plot (KNN) "
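The placeholder above stands in for the final scatter plot. One way to draw it, sketched under the assumption that the fitted model uses `area` as its only predictor, is to overlay the model's predictions on the observed prices. The helper name `plot_price_fit` and the styling choices are hypothetical.

```r
library(tidymodels)

# Hypothetical helper (not part of the original notebook): overlays a fitted
# model's predictions on a scatter plot of price against area.
plot_price_fit <- function(fitted_model, data,
                           plot_title = "Predicted and Observed House Price vs. Area") {
  # Attach a .pred column to the original rows
  predictions <- predict(fitted_model, data) |>
    bind_cols(data)

  ggplot(predictions, aes(x = area, y = price)) +
    geom_point(alpha = 0.4) +                            # observed prices
    geom_line(aes(y = .pred), colour = "steelblue") +    # model predictions
    labs(x = "Area (sqft)", y = "House Price", title = plot_title)
}
```

For example, `plot_price_fit(knn_best_fit, housing_test)` would show how closely the KNN predictions track the test-set prices. If the model uses several predictors, the prediction line will zigzag, because rows with similar areas can differ in the other variables.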


# Discussion

### Regression Outcome 
**this is just an outline for guiding our answers**
- summarize what you found
- discuss whether this is what you expected to find?

### Impact and Limitations 

With this model, we can more accurately predict unknown housing prices and find out if listed houses are either undervalued or overvalued. This can be helpful to both sellers and buyers.

*For sellers*, this model can act as a benchmark for what price to list a house at. Even in the presence of a realtor, the model can act as a second opinion and give sellers additional peace of mind if both the realtor and model agree on a fair price. 

*For buyers*, this model can help screen for good deals. If one is looking for a house with three bedrooms and the listed price is lower than what the model outputs, it may be the best option on the market. It could also act as an indicator for buyers to do their due diligence. There’s a chance that the house is listed below its estimated value because there is something wrong with it, e.g. a poor foundation. The model acts like a pricing professional while the buyer shops for houses, supporting them throughout the process.

A limitation of the KNN model, however, is that it cannot show how much each factor affects the housing price. Had the linear regression model been a better predictor, we could have gained additional insight into which factors heavily influence the price, and used those insights to inform buyers' and sellers' decisions. For example, if the number of bedrooms significantly increased the listing price, it could be worth it for a seller to repurpose an empty room into a bedroom.
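To illustrate the kind of insight a linear model offers, the sketch below fits a regression and returns its coefficient table. The helper name `coefficient_table` and the default formula are assumptions for illustration. Each estimate is the expected change in price for a one-unit increase in that predictor with the others held fixed, so estimates for different predictors are in different units and are not directly comparable in size.

```r
library(tidymodels)

# Hypothetical helper (not part of the original notebook): fits a linear model
# and returns one row per predictor, strongest evidence (largest |t|) first.
coefficient_table <- function(train_data,
                              model_formula = price ~ area + bedrooms + bathrooms +
                                stories + parking) {
  lm_fit <- linear_reg() |>
    set_engine("lm") |>
    fit(model_formula, data = train_data)

  tidy(lm_fit) |>
    filter(term != "(Intercept)") |>
    arrange(desc(abs(statistic)))
}
```

For example, `coefficient_table(housing_train)` would show the estimated price change per extra bedroom alongside its standard error and p-value.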

### Future Questions
- Is it cheaper to buy a house with amenities included, or to buy one without and install them yourself?
- What other variables can affect a house's price (e.g. age of the house, color)?
- While the study suggests a weaker impact of neighborhood characteristics, could specific neighborhood features or amenities in Vancouver exhibit a stronger correlation with housing prices compared to broader area analysis?




# References

- Housing data set from: https://www.kaggle.com/datasets/harishkumardatalab/housing-price-prediction/data

- Grigoryeva, Idaliya, and David Ley. 2019. “The Price Ripple Effect in the Vancouver Housing Market.” Urban Geography 40 (8): 1168–90. https://doi.org/10.1080/02723638.2019.1567202.

- Jafari, Amirhosein, and Reza Akhavian. 2019. “Driving Forces for the US Residential Housing Price: A Predictive Analysis.” June 18, 2019. https://www.emerald.com/insight/content/doi/10.1108/BEPAM-07-2018-0100/full/html.