# Predict Vancouver's Airbnb Price
by Bill Wan
2024/01/05

In [2]:
import pandas as pd
from myst_nb import glue

## Summary

In this Airbnb price prediction project, I utilized Vancouver Airbnb data sourced from Inside Airbnb. To enhance user interaction and understanding, I developed an interactive dashboard that provides a visual representation of the dataset.

I undertook a comprehensive workflow that included exploratory data analysis (EDA), preprocessing, feature engineering, model building, feature selection, hyperparameter optimization, and model testing. The aim was to develop an effective machine learning model to predict Airbnb prices. The models considered in the project include Polynomial Ridge, LGBM, Random Forest, and variations of Random Forest models.

Noteworthy findings include the LGBM model demonstrating a mean test score of 0.6, suggesting promising predictive capabilities. Feature importance analysis and model interpretation were facilitated using SHAP (SHapley Additive exPlanations), providing insights into the contribution of different features to model predictions.

## Introduction and Understanding the data

The data comes from [Inside Airbnb](http://insideairbnb.com/get-the-data/) and captures all Airbnb listings in Vancouver up to 13 December 2023. It contains 6695 listings described by 75 columns, including information about the host, the text description, the location of the listing, and review ratings. The primary goal is to use these features to predict listing prices.

The figure below shows the distribution of key numeric features within the dataset, offering insights into the characteristics of the Airbnb listings under consideration.

```{figure} ../result/plots/numeric_features_plot.png
---
width: 800px
name: plot_of_numeric_feature
---
Distributions comparison of numeric features
```

Below is the correlation matrix of some of the numeric features.

```{figure} ../result/plots/correlation_matrix.png
---
width: 800px
name: correlation matrix
---
The correlation matrix of numeric features
```

The visualizations reveal several noteworthy observations:

1. Skewed Price Distribution: The distribution of the target variable (price) appears highly skewed, indicating a potential benefit from normalization for improved model performance.

2. Correlation in Review Scores: Review scores exhibit significant correlation with each other. To streamline the model, it is advisable to retain only one representative score.

3. Redundancy in Host Listing Count: A notable correlation exists between 'host_listing_count' and 'host_total_listing_count'. To keep the model simple, retaining only one of these features is suggested.

These insights guide our approach towards refining the dataset for optimal model performance and interpretability.
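
The pruning of correlated features described in observations 2 and 3 could be sketched as follows. This is a minimal illustration, not the project's actual code: the toy values are made up, the `drop_correlated` helper and the 0.9 threshold are hypothetical, and the review score column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy listings; the values are made up for illustration and the
# review score column names are assumptions.
listings = pd.DataFrame({
    "host_listing_count": [1, 2, 5, 10, 3],
    "host_total_listing_count": [1, 3, 6, 12, 3],
    "review_scores_rating": [4.8, 4.5, 4.9, 4.2, 4.7],
    "review_scores_value": [4.7, 4.4, 4.9, 4.1, 4.6],
    "minimum_nights": [2, 30, 1, 3, 2],
})


def drop_correlated(df, threshold=0.9):
    """Drop the later column of each pair whose absolute Pearson r exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i, first in enumerate(cols):
        if first in to_drop:
            continue  # already removed, so it cannot represent other columns
        for second in cols[i + 1:]:
            if second not in to_drop and corr.loc[first, second] > threshold:
                to_drop.add(second)
    return df.drop(columns=sorted(to_drop)), sorted(to_drop)


reduced, dropped = drop_correlated(listings)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

Keeping the first column of each correlated pair is an arbitrary rule; in practice the choice of which score or count to retain may depend on interpretability or missing values.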

In addition, I have used Tableau to create an interactive dashboard which you can also access [here](https://public.tableau.com/app/profile/bill.wan6088/viz/Dashboard_Vancouver_airbnb/Dashboard1?publish=yes).

In [None]:
%%html
<div class='tableauPlaceholder' id='viz1704569104740' style='position: relative'><noscript><a href='#'><img alt='Dashboard 1 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Da&#47;Dashboard_Vancouver_airbnb&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Dashboard_Vancouver_airbnb&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Da&#47;Dashboard_Vancouver_airbnb&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-GB' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1704569104740');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else { vizElement.style.width='100%';vizElement.style.height='1277px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

## Analysis
The project evaluates several machine learning models, encompassing Polynomial Ridge, LGBM, and various Random Forest iterations. The pipeline incorporates diverse data processing and engineering techniques:

1. Text Data Processing: Utilization of CountVectorizer for two text columns, enhancing model comprehension of textual information.

2. Feature Scaling: Application of scaling on numeric features for improved model convergence and performance.

3. One-Hot Encoding: Transformation of categorical data through one-hot encoding for effective model integration.

4. Feature Engineering: Introduction of new features based on sentiment analysis, capturing the positivity of listing descriptions and neighborhood overviews.

5. Target Variable Transformation: Log transformation applied to the target variable for improved model fitting.
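
Steps 1 to 3 above can be sketched with a scikit-learn `ColumnTransformer`. This is an illustrative sketch under stated assumptions rather than the project's pipeline: the miniature table and its column names are hypothetical, default settings are used for every transformer, and the sentiment features from step 4 are omitted because the source does not name the tool used to compute them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the listings table; column names are assumptions.
listings = pd.DataFrame({
    "description": ["cozy downtown condo", "bright garden suite", "quiet room near beach"],
    "neighborhood_overview": ["close to transit", "family friendly area", "steps from the beach"],
    "accommodates": [2, 4, 1],
    "minimum_nights": [2, 30, 1],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room"],
})

preprocessor = ColumnTransformer([
    # CountVectorizer expects a 1-D column of strings, so each text column is
    # passed as a single name (not a list) and gets its own vectorizer.
    ("description_bow", CountVectorizer(), "description"),
    ("overview_bow", CountVectorizer(), "neighborhood_overview"),
    # Numeric features are standardized to zero mean and unit variance.
    ("scale", StandardScaler(), ["accommodates", "minimum_nights"]),
    # Categories unseen during fitting are encoded as all zeros.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
])

features = preprocessor.fit_transform(listings)
print(features.shape)  # (number of listings, number of generated features)
```

Wrapping the transformers in one object means the same fitted vocabulary, scaling statistics, and categories are reused on the test set, which avoids leaking test information into preprocessing.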

The dataset is partitioned into an 80% training set and a 20% test set. LGBM emerges as the top-performing model, prompting further refinement through hyperparameter tuning. The hyperparameters are optimized using 10-fold cross-validation, with the test score serving as the evaluation metric.
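
The split and cross-validation setup can be sketched as below. Several things here are assumptions made for illustration: the data is synthetic, `RandomForestRegressor` stands in for LGBM so the example needs only scikit-learn, the target transform is taken to be `log1p` with `expm1` as its inverse, and the `random_state` values are arbitrary. With no `scoring` argument, `cross_validate` uses the estimator's default score, which is R² for scikit-learn regressors.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in data: 200 listings, 3 numeric features, right-skewed prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
price = np.exp(4.5 + 0.5 * X[:, 0] + rng.normal(scale=0.2, size=200))

# 80/20 split; random_state is an arbitrary choice for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=123
)

# Fit on log1p(price) and map predictions back to prices with expm1.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=123),
    func=np.log1p,
    inverse_func=np.expm1,
)

# 10-fold cross-validation on the training set only.
cv_results = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
for key in ("fit_time", "score_time", "test_score", "train_score"):
    print(key, round(cv_results[key].mean(), 3), round(cv_results[key].std(), 3))
```

The four keys returned by `cross_validate` match the metric columns of the results table, so a per-model mean and standard deviation of these arrays would yield rows of the same shape.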

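The table below reports the performance of every model. For each model (column `Unnamed: 0`), the `mean` and `std` rows (column `Unnamed: 1`) summarize the fit time, score time, test score, and train score.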

In [5]:
scores = pd.read_csv('../result/tables/output.csv')
glue("model_scores", scores)

| Unnamed: 0.1 | Unnamed: 0 | Unnamed: 1 | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|---|---|
| 0 | poly_ridge | mean | 0.904 | 0.03 | -210813800.0 | 0.329 |
| 1 | poly_ridge | std | 0.142 | 0.001 | 666651800.0 | 0.144 |
| 2 | lgbm | mean | 0.652 | 0.031 | 0.434 | 0.663 |
| 3 | lgbm | std | 0.027 | 0.001 | 0.222 | 0.054 |
| 4 | random_forest | mean | 1.147 | 0.045 | 0.365 | 0.573 |
| 5 | random_forest | std | 0.069 | 0.002 | 0.204 | 0.065 |
| 6 | rf_model_based | mean | 6.1 | 0.061 | 0.365 | 0.573 |
| 7 | rf_model_based | std | 2.015 | 0.014 | 0.204 | 0.065 |
| 8 | rf_rfecv | mean | 6.425 | 0.056 | 0.365 | 0.573 |
| 9 | rf_rfecv | std | 2.464 | 0.008 | 0.204 | 0.065 |


## Results & Discussion

In the final stage of the project, the LGBM model demonstrated the best performance, achieving a test score of 0.6 on the entire training dataset. The feature importance analysis highlighted the key factors influencing predicted prices, with room availability and minimum nights emerging as the top contributors. Notably, the engineered feature capturing the positivity of the description text ranked among the top 5 most influential features. These findings suggest that the model picks up meaningful patterns in Airbnb pricing and provides a solid foundation for further refinement. Further feature engineering and fine-tuning may yield more accurate predictions.


```{figure} ../result/plots/feature_importance.png
---
width: 800px
name: feature_importance
---
Top 10 features according to their importance
```

### Waterfall Analysis
To examine an individual observation, consider the fifth listing in the test set. The predicted price is 67.785 (the back-transformed value of the model's log-scale output, 4.231), which is below the average. The accompanying waterfall plot shows the factors driving this prediction: the listing accommodates 1.248 fewer people and requires an additional 0.451 minimum nights, and it is not an entire apartment. The fractional values suggest that these differences are expressed in the scaled feature units used by the model rather than in raw counts.
This breakdown shows how individual feature values push the model's prediction above or below the average.

```{figure} ../result/plots/waterfall.png
---
width: 800px
name: waterfall_analysis
---
Waterfall analysis of one listing
```
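
The back-transformation quoted in the waterfall analysis can be checked numerically. The sketch below compares two candidate inverses of the log-scale output 4.231. Which transform the project actually used is not stated in the text, so reading the result as evidence for a `log(1 + price)` transform is an inference from the quoted numbers.

```python
import math

log_prediction = 4.231  # model output on the log scale, as quoted in the text

# Candidate 1: the target was log(price), so the inverse is exp.
price_if_log = math.exp(log_prediction)

# Candidate 2: the target was log(1 + price), so the inverse is expm1.
price_if_log1p = math.expm1(log_prediction)

print(round(price_if_log, 2), round(price_if_log1p, 2))
```

The second candidate lands within rounding of the quoted 67.785, while the first is one unit higher, which is consistent with the target having been transformed with `log(1 + price)`.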