Title

Introduction

- Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
- clearly state the question you tried to answer with your project
- identify and describe the dataset that was used to answer the question

Airbnb is an online service that connects hosts who have properties with travelers interested in renting short-term homestays. The host sets the price, typically according to an array of factors such as location, view, and cleaning fee. In this report, we are interested in whether particular factors, such as the type of room offered, the capacity of the rental, the distance to the city center, and the distance to the nearest metro station, influence the price, and if so, in what way. This is the question we will try to answer: using these factors, we will attempt to predict the price of a private-room Airbnb rental in our city of choice, Paris, on any given weekday.

The dataset we will use is paris_weekdays.csv from Airbnb Prices in European Cities, posted on Kaggle at https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities?resource=download&select=paris_weekdays.csv. It is a .csv file with 3129 observations and 20 variables. We will use 4 of the variables to make our predictions: dist, metro_dist, person_capacity, and room_type will be used to predict realSum (the price). All of these variables are doubles (dbl) except room_type, which is a character (chr).


Methods & Results

- describe in written English the methods you used to perform your analysis from beginning to end, narrating the code that does the analysis.
- your report should include code which:
- loads data from the original source on the web 
- wrangles and cleans the data from its original (downloaded) format to the format necessary for the planned analysis
- performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
- creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
- performs the data analysis
- creates a visualization of the analysis 
- note: all tables and figures should have a figure/table number and a legend

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)


In [None]:
paris_dataset <- read_csv("https://raw.githubusercontent.com/chadsc79/dsci-100-2022w2-group-7-section-005/main/paris_weekdays.csv")
paris_dataset

In [None]:
#select the variables of interest, then keep only private rooms that sleep two people
paris_clean_dataset <- paris_dataset |>
  select(dist, metro_dist, person_capacity, room_type, realSum) |>
  filter(room_type == "Private room") |>
  filter(person_capacity == 2)
paris_clean_dataset

In [None]:
#find the min/max price in the cleaned dataset (computed before the train/test split)
options(digits = 4)
min_max_prices <- summarize(paris_clean_dataset,
          minimum_price = min(realSum),
          maximum_price = max(realSum))
min_max_prices

In [None]:
#bin realSum into price ranges; these ranges are the classes to be predicted
paris_clean_dataset$price_range <- cut(paris_clean_dataset$realSum, breaks = c(0, 150, 250, 500, 1000, 2000, 5000, 10000, 15000))
paris_clean_dataset
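
To make the binning step concrete, here is a minimal sketch of how `cut` assigns labels. The prices below are made-up values chosen to sit on and around bin edges, not rows from the dataset, and `toy_prices`, `toy_ranges`, and `toy_label` are hypothetical names used only for this illustration.

```r
# Hypothetical prices, chosen to sit on and around bin edges (not real data)
toy_prices <- c(100, 150, 151, 300)

# Same style of breaks as in the cell above, shortened for the example
toy_ranges <- cut(toy_prices, breaks = c(0, 150, 250, 500))

# By default each interval is open on the left and closed on the right,
# so a price of exactly 150 falls in (0,150] rather than (150,250]
as.character(toy_ranges)
# "(0,150]" "(0,150]" "(150,250]" "(250,500]"

# With the default dig.lab = 3, breaks of 1000 or more are printed in
# scientific notation in the labels; dig.lab = 5 keeps them as plain numbers
toy_label <- as.character(cut(1200, breaks = c(1000, 2000), dig.lab = 5))
toy_label
# "(1000,2000]"
```

Any price at or below the lowest break, or above the highest break, would be assigned `NA`, so the breaks need to cover the minimum and maximum prices found earlier.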

In [None]:
#find the average of each numeric variable in the cleaned dataset (before the train/test split)
options(digits=4)
summarized_paris_dataset <- summarize(paris_clean_dataset,
                                      avg_dist = mean(dist),
                                      avg_metro_dist = mean(metro_dist),
                                      avg_price = mean(realSum))
summarized_paris_dataset

In [None]:
#scatterplot of distance from the city centre against distance from the metro for private rooms, coloured by price range
options(repr.plot.width = 14, repr.plot.height = 8)
dataset_visual <- paris_clean_dataset |>
  ggplot(aes(x = dist, y = metro_dist, color = price_range)) +
  geom_point(alpha = 0.4) +
  labs(x = "Distance from the city centre", y = "Distance from the metro", color = "Price")
dataset_visual

In [None]:
#split the data into a testing set and a training set
set.seed(250)

paris_split <- initial_split(paris_clean_dataset, prop = .75, strata = price_range)  
paris_train <- training(paris_split) 

paris_test <- testing(paris_split)
paris_train
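
One way to check that a stratified split keeps the class mix similar in both sets is to compare class proportions after splitting. The sketch below does this on simulated stand-in data, not the Paris listings; `toy`, `toy_split`, `toy_train`, and `toy_test` are hypothetical names, and the class labels and probabilities are arbitrary assumptions.

```r
library(rsample)

set.seed(250)

# Simulated stand-in data (hypothetical, not the Paris listings)
toy <- data.frame(
  dist = runif(200, min = 0, max = 10),
  price_range = factor(sample(c("low", "mid", "high"), size = 200,
                              replace = TRUE, prob = c(0.5, 0.3, 0.2)))
)

# Same call pattern as above: 75% training data, stratified by the class label
toy_split <- initial_split(toy, prop = 0.75, strata = price_range)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of each class in each set; with stratification these should be close
round(prop.table(table(toy_train$price_range)), 2)
round(prop.table(table(toy_test$price_range)), 2)
```

Running the same two `prop.table` lines on `paris_train` and `paris_test` would show whether the sparsely populated high-price bins end up represented in both sets.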

In [None]:
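#pre-process the data
#only dist and metro_dist are used as predictors: realSum defines price_range (so using it
#would leak the answer), and room_type and person_capacity are constant after filtering
#and cannot be scaled
paris_recipe <- recipe(price_range ~ dist + metro_dist, data = paris_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())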

#tune classifier
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
         set_engine("kknn") |>  
         set_mode("classification")

#create a tibble of the k values to try
k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 2))

#set up 5-fold cross-validation on the training set
paris_vfold <- vfold_cv(paris_train, v = 5, strata = price_range)

#create a workflow, tune over the k values, and collect the metrics
knn_results <- workflow() |>
  add_recipe(paris_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = paris_vfold, grid = k_vals) |>
  collect_metrics()
knn_results

#keep only the accuracy estimates
accuracies <- knn_results |>
  filter(.metric == "accuracy")

# Plot of k values against their respective accuracies
cross_val_plot <- accuracies |> 
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") +
    theme(text = element_text(size = 20))
cross_val_plot
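
A possible next step after tuning is to pick the k with the highest cross-validated accuracy, refit on the full training set, and measure accuracy on the held-out test set. The sketch below runs that sequence end to end on simulated data so that it is self-contained; it is not the project's result. All `toy_*` names, `n_rows`, `best_k`, `final_spec`, `final_fit`, and the two-class labels are assumptions made for illustration, and the `kknn` package is assumed to be installed.

```r
library(tidymodels)

set.seed(250)

# Simulated stand-in for the listings (hypothetical values, not real data)
n_rows <- 300
toy_dist <- runif(n_rows, min = 0, max = 10)
toy <- tibble(
  dist = toy_dist,
  metro_dist = runif(n_rows, min = 0, max = 2),
  price_range = factor(if_else(toy_dist + rnorm(n_rows) < 4, "high", "low"))
)

toy_split <- initial_split(toy, prop = 0.75, strata = price_range)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize the two numeric predictors
toy_recipe <- recipe(price_range ~ dist + metro_dist, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

toy_vfold <- vfold_cv(toy_train, v = 5, strata = price_range)
k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 2))

accuracies <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(tune_spec) |>
  tune_grid(resamples = toy_vfold, grid = k_vals) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# k with the highest mean cross-validated accuracy (first one if tied)
best_k <- accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the whole training set with the chosen k
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(final_spec) |>
  fit(data = toy_train)

# Accuracy on the test set, which was not used for tuning
test_accuracy <- predict(final_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = price_range, estimate = .pred_class) |>
  filter(.metric == "accuracy")
test_accuracy
```

Applied to the project, the same pattern would use `paris_recipe`, `paris_train`, and `paris_test` in place of the toy objects, with `best_k` read from the `accuracies` table computed above.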

Discussion

- summarize what you found
- discuss whether this is what you expected to find?
- discuss what impact could such findings have?
- discuss what future questions could this lead to?

References

- At least 2 citations of literature relevant to the project (format is your choice, just be consistent across the references).
- Make sure to cite the source of your data as well.