**Title**

Predicting the real price of a two-person private room for a Parisian Airbnb rental on a weekday

**Introduction**

Airbnb is an online service that connects hosts who have properties for rent with travellers looking for short-term homestays. The host sets the price, typically according to an array of factors such as location, view, and cleaning service fee.

***We are interested in whether the distance to the city centre and the distance to the nearest metro station influence the price of an Airbnb rental in Paris on a weekday***. This is the question we will try to answer, using these two factors to predict the price of a private two-person room in Paris on any given weekday.

According to Jones (2022), access to public transportation and proximity to the city centre contribute greatly to convenience when travelling. Paul Swinney, writing in the Guardian, stated that proximity to the city centre drives up renting costs, given the higher productivity and economic activity associated with the city (Swinney, 2011). Because these factors matter for convenience and for overall enjoyment of the city, we chose them as the variables for analyzing the price of our rental.

To examine the relationship between these variables and the Airbnb rental price, we will analyze the paris_weekdays.csv dataset from Airbnb Prices in European Cities, posted on the Kaggle website (https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities?resource=download&select=paris_weekdays.csv) (Gyódi & Nawaro, 2021). The dataset is a .csv file with 3129 observations and twenty variables. We will use five of these variables in our analysis: two of them, the capacity of the rental property (person_capacity) and the type of room (room_type), to filter the dataset; two others, the distance to the city centre (dist) and the distance to a metro station (metro_dist), as predictors; and the price of the rental property (realSum) as the variable we predict. All of these variables are double (dbl) except room_type, which is a character (chr).

**Methods & Results**

*Methods*

In [4]:
#load in packages needed for analysis
set.seed(250)

library(tidyverse)
library(repr)
library(tidymodels)

options(repr.matrix.max.rows = 6)


In [None]:
#read in Paris dataset from website to jupyter notebook
paris_dataset <- read_csv(
    "https://raw.githubusercontent.com/chadsc79/dsci-100-2022w2-group-7-section-005/main/paris_weekdays.csv")
paris_dataset

**Table 1.** The dataset of Airbnb rentals for Paris on a weekday.

We wrangled our data to include only the columns used in our analysis: "dist," "metro_dist," "person_capacity," "room_type," and "realSum". Furthermore, we filtered it to include only "Private room" listings with a capacity of two, as this is the most prevalent capacity for the room_type of interest.

In [None]:
#select variables to predict the data and filter for private room and two-person capacity
paris_clean_dataset <- select(paris_dataset, dist, metro_dist, person_capacity, room_type, realSum)|>
    filter(room_type == "Private room") |>
    filter(person_capacity == 2)
paris_clean_dataset

**Table 2.** Paris dataset filtered for private rooms with a two-person capacity, showing distance to the city centre (dist), distance to a metro station (metro_dist), and price of the Airbnb rental (realSum).

We found the minimum and maximum prices in the filtered data so that we could divide realSum into reasonable ranges for the classification analysis.

In [None]:
#find the min/max price in the filtered dataset
options(digits = 4)

min_max_prices <- summarize(paris_clean_dataset,
          minimum_price = min(realSum),
          maximum_price = max(realSum))
min_max_prices

Based on the minimum and maximum prices found, we created a new column, "price_range," to be used for the classification.

In [2]:
#create the price ranges (the classes to be predicted) from realSum
paris_clean_dataset$price_range <- cut(paris_clean_dataset$realSum, breaks = c(0, 200, 500, 1000, 2000, 15000))
paris_clean_dataset


**Table 3.** Dataset including the new column, price range (price_range).
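
As a minimal sketch of how `cut` assigns these ranges, the example below bins a handful of made-up prices using the same breaks. The values in `example_prices` are hypothetical and are not taken from the dataset. By default the intervals are closed on the right, so a price of exactly 200 lands in (0,200]. The `dig.lab` argument is an optional extra, not used in the analysis, that keeps the larger breaks from being labelled in scientific notation.

```r
# Hypothetical prices, chosen only to illustrate the binning
example_prices <- c(150, 200, 200.01, 499, 1200, 9000)

# Same breaks as in the analysis
price_breaks <- c(0, 200, 500, 1000, 2000, 15000)

# Intervals are right-closed by default: (0,200], (200,500], ...
example_ranges <- cut(example_prices, breaks = price_breaks)

# Optional: dig.lab controls how many digits are used in the labels
example_ranges_readable <- cut(example_prices, breaks = price_breaks, dig.lab = 5)

print(data.frame(price = example_prices, price_range = example_ranges_readable))
print(table(example_ranges_readable))
```

A price that falls outside the outermost breaks (for example, above 15000) would be assigned `NA`, so the breaks need to cover the minimum and maximum prices found earlier.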

In [3]:
#find the average of each variable in the filtered dataset
options(digits = 4)

summarized_paris_dataset <- summarize(paris_clean_dataset,
                                      avg_dist = mean(dist),
                                      avg_metro_dist = mean(metro_dist),
                                      avg_price = mean(realSum))
summarized_paris_dataset


**Table 4.** Average distance to the city centre (avg_dist), the average distance to a metro station (avg_metro_dist), and average price (avg_price) of the Paris weekday dataset.

In [None]:
#scatterplot of distance from the city centre and distance from the metro for private room types, categorized by price
options(repr.plot.width = 14, repr.plot.height = 8)

dataset_visual <- paris_clean_dataset |>
    ggplot(aes(x = dist, y = metro_dist, color = price_range)) +
    geom_point(alpha = 0.4) +
    labs(x = "Distance from the city centre", y = "Distance from the metro", color = "Price range") +
    ggtitle("Distance from the city centre vs distance from the metro for Airbnbs classified by price range") +
    theme(axis.text = element_text(size = 15),
    axis.title = element_text(size = 15),
    plot.title = element_text(size = 18))
dataset_visual

**Figure 1.** Distance from the city centre (x-axis) plotted against distance from the metro (y-axis). Points are coloured by price range to show whether rentals with particular conditions fall into a cheaper price range or not. The points in the lower left corner have the conditions most travellers would consider favourable, since they are closest to both transportation and the city centre. However, even these points mostly belong to the cheaper end of the spectrum. Note that our average price is 274, which falls in the (200,500] price bracket; most of the points in the graph are green, so they lie near the average.

We split our data into training and testing sets so that the model's performance can be evaluated on data it has not seen. This guards against an overly optimistic accuracy estimate from an overfitted model and indicates how well the model generalizes to new data.

In [None]:
#split the data into a testing set and a training set

paris_split <- initial_split(paris_clean_dataset, prop = .80, strata = price_range)  

paris_test <- testing(paris_split)

paris_train <- training(paris_split)
paris_train

**Table 5.** The Paris training dataset used to tune the K-value and fit the classifier.
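
To illustrate the idea behind the stratified 80/20 split, here is a minimal base R sketch. It is not how `initial_split` is implemented; it only shows the intent of stratifying, which is to sample 80% within each class so that the class proportions are preserved in both sets. The labels in `example_labels` are hypothetical.

```r
set.seed(250)

# Hypothetical class labels: 50 "low", 30 "mid", 20 "high"
example_labels <- factor(rep(c("low", "mid", "high"), times = c(50, 30, 20)))

# Sample 80% of the row indices within each class
rows_by_class <- split(seq_along(example_labels), example_labels)
train_idx <- unname(unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
})))
test_idx <- setdiff(seq_along(example_labels), train_idx)

# Class counts in each set keep the original 50:30:20 proportions
print(table(example_labels[train_idx]))
print(table(example_labels[test_idx]))
```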

In [None]:
#pre-process the data
paris_recipe <- recipe(price_range ~ dist + metro_dist, data = paris_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

#specify a K-nearest neighbours classifier with a tunable number of neighbours
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
         set_engine("kknn") |>  
         set_mode("classification")

#create a tibble for the k values
k_vals <- tibble(neighbors = seq(from = 1, to = 40, by = 2))

#set up 10-fold cross-validation on the training set
paris_vfold <- vfold_cv(paris_train, v = 10, strata = price_range)

#run the workflow over the grid of k values and collect the metrics
knn_results <- workflow() |>
  add_recipe(paris_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = paris_vfold, grid = k_vals) |>
  collect_metrics()
knn_results

**Table 6.** The results of the workflow using 10-fold cross-validation.
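
The recipe and model specification above standardize the two predictors and then classify by an unweighted ("rectangular") vote among the nearest neighbours. The base R sketch below walks through that idea for a single new point. It is an illustration under stated assumptions, not the `kknn` implementation: the training points, the new point, and the choice of k = 3 are all hypothetical.

```r
# Hypothetical training points with a price-range label
train_example <- data.frame(
  dist = c(1.0, 1.2, 4.0, 4.5, 5.0),
  metro_dist = c(0.10, 0.15, 0.40, 0.50, 0.45),
  price_range = c("(200,500]", "(200,500]", "(0,200]", "(0,200]", "(0,200]")
)
new_point <- c(dist = 1.1, metro_dist = 0.12)

# Standardize using the training means and standard deviations only
predictors <- train_example[, c("dist", "metro_dist")]
centre <- colMeans(predictors)
spread <- apply(predictors, 2, sd)
train_scaled <- scale(predictors, center = centre, scale = spread)
new_scaled <- (new_point - centre) / spread

# Euclidean distance from the new point to every training point
distances <- sqrt(rowSums(sweep(train_scaled, 2, new_scaled)^2))

# Unweighted majority vote among the k nearest neighbours
k <- 3
nearest <- order(distances)[1:k]
votes <- table(train_example$price_range[nearest])
prediction <- names(which.max(votes))
print(prediction)
```

Standardizing matters here because dist and metro_dist are on different scales; without it, the variable with the larger spread would dominate the distance calculation.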

In [5]:
#keep only the accuracy rows from the cross-validation results
accuracies <- knn_results |>
  filter(.metric == "accuracy") 
accuracies


**Table 7.** Estimated accuracy (mean) and its standard error (std_err) for each number of neighbours, K.

In [6]:
#plot of k values against their respective accuracies
cross_val_plot <- accuracies |> 
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbours", y = "Accuracy Estimate") +
    ggtitle("Estimated accuracy versus the number of neighbours") +
    theme(axis.text = element_text(size = 15),
    axis.title = element_text(size = 15),
    plot.title = element_text(size = 18)) 
cross_val_plot


**Figure 2.** Plot of the estimated accuracy for each K value. 
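
One way to make the choice of K less dependent on reading the plot by eye is to list every K whose estimated accuracy lies within one standard error of the best estimate. This is a common heuristic and not the procedure used in this analysis. The sketch below applies it to a small table shaped like the `accuracies` table; all of the numbers in `accuracies_example` are hypothetical.

```r
# Hypothetical cross-validation results (same column names as the accuracies table)
accuracies_example <- data.frame(
  neighbors = c(1, 5, 15, 25, 35),
  mean = c(0.48, 0.53, 0.57, 0.56, 0.54),
  std_err = c(0.02, 0.02, 0.02, 0.02, 0.02)
)

# Row with the highest estimated accuracy
best_row <- accuracies_example[which.max(accuracies_example$mean), ]

# Any K within one standard error of the best is a reasonable candidate
threshold <- best_row$mean - best_row$std_err
candidates <- accuracies_example$neighbors[accuracies_example$mean >= threshold]
print(candidates)
```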

In [7]:
#specify the classifier again using the chosen k value
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 25) |>
  set_engine("kknn") |>
  set_mode("classification")

#create a workflow
paris_fit <- workflow() |>
  add_recipe(paris_recipe) |>
  add_model(knn_best_spec) |>
  fit(data = paris_train)

#get the prediction column
paris_predictions <- predict(paris_fit, paris_test) |> 
    bind_cols(paris_test)
paris_predictions


**Table 8.** Predictions of the price of the Airbnb rentals in Paris on a weekday using an optimized K value of 25.

In [None]:
#compare the accuracy of predictions to the true values in the test set
paris_acc <- paris_predictions |> 
    metrics(truth = price_range, estimate = .pred_class) |> 
    filter(.metric == "accuracy") |>
    select(.metric, .estimate)
paris_acc

**Table 9.** The accuracy on the test set of the classifier using the K value selected for analysis.
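
For reference, accuracy is the proportion of test observations whose predicted class matches the true class. The base R sketch below computes it by hand and also builds a confusion matrix, which shows where the misclassifications fall. The `truth` and `estimate` vectors are hypothetical stand-ins for the price_range and .pred_class columns.

```r
# Hypothetical true and predicted classes, sharing the same factor levels
range_levels <- c("(0,200]", "(200,500]", "(500,1e+03]")
truth <- factor(c("(0,200]", "(200,500]", "(200,500]", "(500,1e+03]", "(200,500]"),
                levels = range_levels)
estimate <- factor(c("(200,500]", "(200,500]", "(200,500]", "(200,500]", "(0,200]"),
                   levels = range_levels)

# Accuracy: share of predictions that match the truth
accuracy <- mean(truth == estimate)

# Confusion matrix: rows are true classes, columns are predicted classes
confusion <- table(truth = truth, estimate = estimate)

print(accuracy)
print(confusion)
```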

*Results*

Using ten-fold cross-validation on the Paris weekday training data, we selected a K-value of 25. The resultant accuracy of the data analysis using a K-value of 25 was found to be 0.5659. Although this K-value did not have the highest estimated accuracy, we chose it because the accuracy changes little across its neighbouring K-values (Figure 2). Occasional spikes in the plot are expected, since each cross-validation accuracy is only an estimate averaged over ten folds and carries its own standard error (Table 7). Hence, a K-value with relatively stable accuracy amongst the surrounding K-values is a safer choice than an isolated peak.
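
One way to put an accuracy such as 0.5659 in context is to compare it with a majority-class baseline, that is, a classifier that always predicts the most common price range in the training set. This comparison is not part of the analysis above; the sketch below only shows how it could be computed, using hypothetical label vectors in place of the price_range columns of the training and testing sets.

```r
# Hypothetical labels standing in for the training and testing price ranges
train_labels <- c("(0,200]", "(200,500]", "(200,500]", "(200,500]",
                  "(500,1e+03]", "(0,200]")
test_labels <- c("(200,500]", "(0,200]", "(200,500]", "(500,1e+03]")

# Most common class in the training set
majority_class <- names(which.max(table(train_labels)))

# Accuracy obtained by always predicting that class on the test set
baseline_accuracy <- mean(test_labels == majority_class)

print(majority_class)
print(baseline_accuracy)
```

If the classifier's test accuracy is not clearly above this baseline, the predictors are adding little beyond the class frequencies.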

**Discussion**

Through our analysis of predicting the price of a two-person private room in Paris on a weekday, we found that the distance from the Airbnb rental to the city centre and the distance to the nearest metro station helped predict the price range, but only to a limited degree. We expected private room rentals with a two-person capacity that are closer to the city centre and closer to the metro to be more expensive than those farther away. However, the classifier built with our chosen K-value gave us a low accuracy. We had expected a higher accuracy, which would have let us be more confident that these two distances are determinants of the price range of an Airbnb rental.

This may indicate that other factors beyond distance to the city centre and distance to the metro station affect the price of an Airbnb rental. According to Zhang et al. (2017), other factors that increase the price of an Airbnb rental include: the rental having a positive Airbnb reputation via positive reviews, the rental having many reviews, and the distance from the rental to landmarks (not just city centre and metro stations). Also, host and rental attributes, rental amenities and services, and rules regarding the rental all affect the price. Leaving out some or all of these factors likely contributes to the lower accuracy of our analysis.

The findings of this study could help Airbnb hosts decide what price to charge for a private room based on the criteria we used. They could also help renters gauge how much to expect to pay when staying in areas similar to those in the study, and help Airbnb focus its advertising on the criteria that matter to renters and hosts, creating more business and increasing its profits.

Our choice of factors for predicting the price of Airbnb rentals, and the low accuracy that resulted, may lead Airbnb, Airbnb renters or hosts, or other interested parties to ask questions such as:

Are there better price predictors for an Airbnb rental that were not used, such as cleanliness rating and overall guest satisfaction? Do these findings also occur in other Airbnb cities? Is there a difference in price for Airbnb rentals in urban areas compared to rural areas using the same criteria? Do the same criteria for weekday rental prices also apply to weekend rentals?

**References**

Gyódi, K., & Nawaro, Ł. (2021, April 8). Airbnb Prices in European Cities: Determinants of Price by Room Type, Location, Cleanliness Rating, and More. Retrieved February 20, 2023, from https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities?resource=download&select=paris_weekdays.csv

Jones, R. (2022, October 3). How to find the best location for your next Airbnb - AirHost Academy. AirHost Academy - The Airbnb Host Blog for Tips and Secrets. Retrieved April 9, 2023, from https://airhostacademy.com/how-to-find-the-best-location-for-your-next-airbnb/

Swinney, P. (2011, November 9). Statsblog: What role do city centres play in local economic growth? The Guardian. Retrieved April 9, 2023, from https://www.theguardian.com/local-government-network/statsblog/2011/nov/09/statsblog-role-of-city-centres 

Zhang, Z., Chen, R., Han, L., & Yang, L. (2017). Key Factors Affecting the Price of Airbnb Listings: A Geographically Weighted Approach. Sustainability, 9(9), 1635. https://doi.org/10.3390/su9091635