# Analysis of Superhost Status from Airbnb Data Across Amsterdam, Athens, and Berlin
##### Data sourced from:
Gyódi, K., & Nawaro, Ł. (2021). Determinants of Airbnb prices in European cities: A spatial econometrics approach (Supplementary Material) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4446043

This report uses a real-world dataset of Airbnb listings in Amsterdam, Athens, and Berlin from 2019.

The dataset records listing prices on both weekdays and weekends in three popular European cities, together with attributes such as room type, capacity, guest ratings, and location. These attributes let us examine how listing characteristics and geography relate to price and to host status.

This analysis aims to answer two distinct questions regarding Superhost status on Airbnb:

1) What factors are most associated with a host's Superhost status?

2) How does Superhost status influence the cost of a rental?

#### Question 1: What factors are most associated with a host's Superhost status?
Airbnb states the requirements for Superhost status directly (Airbnb, n.d.):

- Host at least 100 nights across 10 or more unique reservations
- Maintain a response rate of at least 90%
- Maintain a cancellation rate at or below 1%, with exceptions for valid reasons
- Maintain a rating of at least 4.8 out of 5

Even so, it is worth investigating which additional factors are associated with Superhost status. For example, the city or attraction index of a listing may shape the demographic of guests who book it, which in turn affects the host's ability to meet the requirements above. This analysis aims to identify such factors that are statistically and practically associated with Superhost status.

Additionally, the results can help travelers compare standards more uniformly across regions. If Superhost status turns out to be strongly associated with one particular criterion, guests for whom that criterion is not a priority may decide that a Superhost property is not worth a significantly higher price than a comparable non-Superhost property. This allows guests to make more informed purchasing decisions.

#### Question 2: How does Superhost status influence the cost of a rental?
Guests appear willing to pay more for Superhost accommodations in Hong Kong (Liang et al., 2017). Does this pattern also hold across European cities, which would suggest that the trend generalizes? If so, how much more are guests willing to pay? Is the difference statistically and/or practically significant? How does the distribution of rental prices compare between hosts of either status (for example, do the top 5% of non-Superhosts charge more than the average Superhost)? Which strata gain the most from acquiring a Superhost badge? The goal of this analysis is to build a comprehensive understanding of what most hosts can expect to gain by becoming a Superhost.
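
The distributional comparison raised in this question can be sketched in a few lines of base R. The example below is only an illustration on simulated prices: the data frame `toy`, its group sizes, and the log-normal parameters are all assumptions made for the sketch, and only the column names `realSum` and `host_is_superhost` mirror the real dataset.

```r
# Hypothetical toy data: simulated prices, NOT taken from the Airbnb dataset
set.seed(1)
toy <- data.frame(
  host_is_superhost = rep(c("False", "True"), times = c(700, 300)),
  realSum = c(rlnorm(700, meanlog = 5.0, sdlog = 0.6),
              rlnorm(300, meanlog = 5.1, sdlog = 0.5))
)

superhost_price <- toy$realSum[toy$host_is_superhost == "True"]
non_super_price <- toy$realSum[toy$host_is_superhost == "False"]

# Threshold for entering the top 5% of non-Superhost prices
non_super_p95 <- quantile(non_super_price, probs = 0.95, names = FALSE)

# Is even the lowest price in that top 5% above the average Superhost price?
top5_exceeds_mean <- non_super_p95 > mean(superhost_price)

# Quantile-by-quantile comparison of the two price distributions
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
quantile_table <- data.frame(
  quantile      = probs,
  non_superhost = quantile(non_super_price, probs, names = FALSE),
  superhost     = quantile(superhost_price, probs, names = FALSE)
)
quantile_table$difference <- quantile_table$superhost - quantile_table$non_superhost

print(quantile_table)
print(top5_exceeds_mean)
```

Applied to the real data, the same quantile table would show at which parts of the price distribution the two groups differ most. It describes raw differences only and does not adjust for city, room type, or other covariates.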

In [None]:
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(GGally)
library(AER)
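# Install arm only if it is missing, so the cell does not reinstall on every run
if (!requireNamespace("arm", quietly = TRUE)) install.packages("arm")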
library(arm)
library(tidymodels)
library(glmnet)
library(dplyr)
library(gridExtra)

In [None]:
# Main developer: Evan Barr

## Initial loading and wrangling. Ensure directory matches. 
amsterdam_weekdays <- read.csv("amsterdam_weekdays.csv") %>% as_tibble() %>% mutate(city = "amsterdam", day_type = "weekday")
amsterdam_weekends <- read.csv("amsterdam_weekends.csv") %>% as_tibble() %>% mutate(city = "amsterdam", day_type = "weekend")

athens_weekdays <- read.csv("athens_weekdays.csv") %>% as_tibble() %>% mutate(city = "athens", day_type = "weekday")
athens_weekends <- read.csv("athens_weekends.csv") %>% as_tibble() %>% mutate(city = "athens", day_type = "weekend")

berlin_weekdays <- read.csv("berlin_weekdays.csv") %>% as_tibble() %>% mutate(city = "berlin", day_type = "weekday")
berlin_weekends <- read.csv("berlin_weekends.csv") %>% as_tibble() %>% mutate(city = "berlin", day_type = "weekend")

airbnb <- bind_rows(amsterdam_weekdays, amsterdam_weekends,
                    athens_weekdays, athens_weekends,
                    berlin_weekdays, berlin_weekends) %>%
    mutate(room_type = as.factor(room_type), room_shared = as.factor(room_shared),
           multi = as.factor(multi), biz = as.factor(biz),
           room_private = as.factor(room_private), host_is_superhost = as.factor(host_is_superhost),
           city = as.factor(city), day_type = as.factor(day_type)) %>%
    dplyr::select(-X)  # drop the unnamed index column that read.csv labels X
head(airbnb)


In [None]:
# Main developer:

pair_plot <- airbnb %>% 
    dplyr::select(person_capacity, cleanliness_rating, guest_satisfaction_overall, bedrooms) %>%
    ggpairs()

pair_plot




In [None]:
# Main developer: Zhuo Liu

# Use vif() to flag variables whose scaled GVIF, GVIF^(1/(2*Df)), is greater than sqrt(5)

set.seed(5033)

# Since room_shared and room_private are perfectly correlated with room_type, we deselect room_shared and room_private.
airbnb_clean <- airbnb |>
    na.omit() |>
    dplyr::select(-room_shared, -room_private)
    # mutate(room_sharedYes = if_else(room_shared == "True",1,0),
    #       room_privateYes = if_else(room_private == "True",1,0),
    #       host_is_superhostYes = if_else(host_is_superhost == "True",1,0),
    #       cityathens = if_else(city == "athens",1,0),
    #       cityberlin = if_else(city == "berlin",1,0),
    #       day_typeweekend = if_else(day_type == "weekend",1,0)) |>
    # select(-room_type, -room_shared, -room_private, -host_is_superhost, -city, -day_type)

head(airbnb_clean)

# Split data for variable selection to prevent double dipping
split <- initial_split(data = airbnb_clean, prop = 0.3)
variable_selection_df <- training(split)
inference_df <- testing(split)

# Calculate GVIF (right column) for each variable
vif_res <- glm(host_is_superhost ~ ., data = variable_selection_df, family = binomial) |>
    vif() |>
    round(4)

vif_res
# X_train <- as.matrix(training_df[,-1])
# Y_train <- as.matrix(training_df[,1])

# X_test <- as.matrix(testing_df[,-1])
# Y_test <- as.matrix(testing_df[,1])

# cv_LASSO <- cv.glmnet(
#   x = X_train, y = Y_train,
#   alpha = 1,
#   lambda = exp(seq(-21, 21, 0.1))
# )

# lambda <- cv_LASSO$lambda.min
# lambda

# coef(cv_LASSO, s = "lambda.min")

# test_pred_LASSO_min <- 
#             predict(cv_LASSO, 
#             newx = X_test, 
#             s = "lambda.min")
# LASSO_prediction <- tibble(Y_test,LASSO_prediction = test_pred_LASSO_min)
# head(LASSO_prediction)

### Variable selection process
First, we remove room_shared and room_private from the whole dataset, since they are perfectly determined by room_type. Next, we split the dataset into two parts: one for variable selection and one for inference, which avoids the post-selection inference problem. We then fit a full logistic model on the variable-selection set and keep every variable whose scaled GVIF (the right column of the vif() output) is smaller than $\sqrt{5}$. Among the remaining variables, attr_index, attr_index_norm, rest_index, and rest_index_norm are correlated with each other but not with the other variables, so we keep only one of them (attr_index_norm). Likewise, lng, lat, and city are correlated with each other but not with the other variables, so we keep only city. Therefore, we drop attr_index, rest_index, rest_index_norm, lng, and lat from our inference model.
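
To make the $\sqrt{5}$ threshold concrete, the sketch below computes ordinary variance inflation factors by hand in base R on simulated predictors. The variables `x1`, `x2`, and `x3` are hypothetical and built so that two of them are strongly correlated. For a term with one degree of freedom in a linear model, the scaled value $GVIF^{1/(2 \cdot Df)}$ reduces to $\sqrt{VIF}$, so a cutoff of $\sqrt{5}$ on the scaled column corresponds to the common cutoff of 5 on the VIF itself. For the logistic model used here, vif() works from the fitted coefficients, so its values will not match this linear calculation exactly.

```r
# Hypothetical predictors: x1 and x2 are nearly collinear, x3 is independent
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# For 1-df terms, GVIF^(1/(2*Df)) equals sqrt(VIF)
gvif_scaled <- sqrt(vif_manual)

# Keep only the terms below the sqrt(5) threshold
keep <- gvif_scaled < sqrt(5)

print(round(vif_manual, 2))
print(keep)
```

In this toy setting the two collinear predictors are flagged together, which mirrors the situation described above: once a correlated group is identified, one member is retained and the rest are dropped.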

### Evaluation plan
First, we compute the GVIF of every variable retained in the inference model and compare each with $\sqrt{5}$ to check for obvious multicollinearity. Next, we refit the model with family = quasibinomial and compare the estimated dispersion parameter with 1 to check for overdispersion. Finally, we draw a binned residual plot and check whether the points fall within the error bounds.
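
The last two checks can be illustrated on simulated data with base R alone. This is a minimal sketch under stated assumptions: the outcome `y` and predictor `x` are hypothetical, and the binning into 20 equal-sized groups with bounds of two standard errors is one common convention for binned residuals, not necessarily the exact rule a plotting function applies. Note that for ungrouped 0/1 outcomes the dispersion estimate tends to sit close to 1 regardless, so it is a weak check on its own.

```r
# Hypothetical binary outcome simulated from a true logistic model
set.seed(3)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * x))

fit_binom <- glm(y ~ x, family = binomial)
fit_quasi <- glm(y ~ x, family = quasibinomial)

# Dispersion estimate: Pearson chi-square divided by the residual degrees of freedom
dispersion_manual <- sum(residuals(fit_binom, type = "pearson")^2) / df.residual(fit_binom)
dispersion_quasi  <- summary(fit_quasi)$dispersion

# Binned residuals: average the response residuals within bins of fitted values
p_hat      <- fitted(fit_binom)
resid_resp <- residuals(fit_binom, type = "response")
bins <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, by = 0.05)),
            include.lowest = TRUE)

binned <- data.frame(
  mean_fitted = as.vector(tapply(p_hat, bins, mean)),
  mean_resid  = as.vector(tapply(resid_resp, bins, mean)),
  bound       = as.vector(2 * tapply(resid_resp, bins, sd) /
                            sqrt(tapply(resid_resp, bins, length)))
)

# Share of bins whose average residual lies inside +/- 2 standard errors
share_inside <- mean(abs(binned$mean_resid) <= binned$bound)

print(c(manual = dispersion_manual, quasi = dispersion_quasi))
print(share_inside)
```

If the model is adequate, most bins should fall inside the bounds; a systematic pattern of bins outside the bounds points to a misspecified relationship for some range of fitted probabilities.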

In [None]:
# Main developer: Zhuo Liu

# Use variables whose GVIF (right column) is less than or equal to sqrt(5) to conduct a hypothesis test
# Build model for inference
inference_model_for_superhost <- glm(host_is_superhost ~ .-attr_index
                                     -rest_index
                                     -rest_index_norm
                                     -lng 
                                     -lat, data = inference_df, family = binomial)

inference_model_for_superhost_res <- tidy(inference_model_for_superhost, exponentiate = TRUE) |>
    mutate_if(is.numeric,round,4)

inference_model_for_superhost_res

# Determine whether there is multicollinearity in our inference model
vif_res_inference <- inference_model_for_superhost |>
    vif() |>
    round(4)

vif_res_inference

# Checking overdispersion problem

quasi_model_for_superhost <- glm(host_is_superhost ~ .-attr_index
                                     -rest_index
                                     -rest_index_norm
                                     -lng 
                                     -lat, data = inference_df, family = quasibinomial)

summary(quasi_model_for_superhost)

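# Drawing binned residual plots
# Binned residuals are computed on the response scale (observed minus fitted probability)

y_resid <- residuals(inference_model_for_superhost, type = "response")
x_fit <- fitted(inference_model_for_superhost)

# binnedplot() draws with base graphics, so call it directly instead of storing the result
binnedplot(x_fit, y_resid)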

summary(inference_model_for_superhost)$deviance
# Calculate test_RMSE of LASSO
# n <- nrow(LASSO_prediction)
# test_RMSE_LASSO <- LASSO_prediction |> mutate(Y_test = as.numeric(Y_test),
#                                        	LASSO_prediction = as.numeric(LASSO_prediction)) |>
#     summarize(RMSE = sqrt(sum((Y_test - LASSO_prediction) ^ 2) / n))
# test_RMSE_LASSO

In [None]:
# Main developer: Zhuo Liu

# Calculate GVIF (right column) for each variable for the MLR model in Q2 using the dataset for variable selection
vif_res_MLR <- lm(realSum ~ ., data = variable_selection_df) |>
    vif() |>
    round(4)

vif_res_MLR





# residual_plot

In [None]:
# Main developer: Zhuo Liu

# Use variables whose GVIF (right column) is less than or equal to sqrt(5) to conduct a hypothesis test for Q2
# Build model for inference for Q2

inference_model_for_price <- lm(realSum ~ .-attr_index
                                     -rest_index
                                     -rest_index_norm
                                     -lng 
                                     -lat, data = inference_df)

inference_model_for_price_res <- tidy(inference_model_for_price) |>
    mutate_if(is.numeric,round,4)

inference_model_for_price_res

# Determine whether there is multicollinearity in our inference model
vif_inference_model_for_price <- inference_model_for_price |>
    vif() |>
    round(4)

vif_inference_model_for_price

# Drawing residual plot
# Add predictions and residuals as named columns instead of relying on the auto-generated "...1" name
inference_df_with_prediction <- inference_df |>
    mutate(predicted_price = predict(inference_model_for_price, inference_df),
           residual = realSum - predicted_price)

residual_plot <- inference_df_with_prediction |>
    ggplot(aes(x = predicted_price, y = residual)) +
    geom_point() +
    labs(x = "Predicted Price", y = "Residual", title = "Residual Plot") +
    lims(x = c(-100,1000), y = c(-1000,1000))

residual_plot

#### Citations:
Gunter, U. (2018). What makes an Airbnb host a superhost? Empirical evidence from San Francisco and the Bay Area. Tourism Management. https://www.sciencedirect.com/science/article/abs/pii/S026151771730242X

Liang, C., Schuckert, M., & Law, R. (2017). Be a “Superhost”: The importance of badge systems for peer-to-peer rental accommodations. Tourism Management. https://www.sciencedirect.com/science/article/abs/pii/S0261517717300079?via%3Dihub

Airbnb. (n.d.). What's required to be a Superhost. Airbnb Help Centre. https://www.airbnb.ca/help/article/829?locale=en&_set_bev_on_new_domain=1737946113_EAODE5N2ZhNDM0NT

In [None]:
numerical_vars <- c( "person_capacity", "cleanliness_rating", "guest_satisfaction_overall", "bedrooms", "dist", "metro_dist", "attr_index_norm", "rest_index_norm", "lng", "lat")
categorical_vars <- c("room_type", "room_shared", "room_private", "multi", "biz","city", "day_type")

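# One boxplot per numerical variable, split by Superhost status
# aes_string() is deprecated in ggplot2, so the column is looked up with .data[[var]]
plots <- lapply(numerical_vars, function(var) {
  ggplot(airbnb, aes(x = host_is_superhost, y = .data[[var]])) +
    geom_boxplot() +
    labs(title = var,
         x = "Superhost Status", y = var)
})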

#plots <- lapply(categorical_vars, function(var) {
#  ggplot(airbnb, aes_string(x = "host_is_superhost", y = var)) +
#    geom_boxplot() +
#    labs(title = paste(var, "vs SuperHostStatus"),
#         x = "Superhost Status", y = var)
#})

do.call(grid.arrange, c(plots, ncol = 3))
