# Predict Airbnb prices in Vancouver

by Bill Wan
2024/01/05

In [1]:
import pandas as pd
from myst_nb import glue

## Summary


In this Airbnb price prediction project, I have chosen to use Vancouver Airbnb data from [Inside Airbnb](http://insideairbnb.com/get-the-data/). I have also built an interactive dashboard so that viewers can better understand the data.

I followed a workflow that included exploratory data analysis (EDA), preprocessing, feature engineering, model building, feature selection, hyperparameter optimization, and model testing. The aim was to develop an effective machine learning model to predict Airbnb prices. The models considered in the project include Polynomial Ridge, LGBM, Random Forest, and variations of Random Forest models.

Noteworthy findings include the LGBM model demonstrating a mean test score of 0.6, suggesting promising predictive capabilities. Feature importance analysis and model interpretation were facilitated using SHAP (SHapley Additive exPlanations), providing insights into the contribution of different features to model predictions.

## Introduction and Understanding the data

The data is from [Inside Airbnb](http://insideairbnb.com/get-the-data/) and captures all Airbnb listings in Vancouver up to 13 December 2023. It contains 6695 listings described by 75 columns, including information about the host, the text description, the location of the listing, and the review ratings. I aim to use these features to predict each listing's price.

Below is the distribution of some of the numeric features.

```{figure} ../result/plots/numeric_features_plot.png
---
width: 800px
name: plot_of_numeric_feature
---
Distributions of the numeric features
```

Below is the correlation matrix of some of the numeric features.

```{figure} ../result/plots/correlation_matrix.png
---
width: 800px
name: correlation matrix
---
The correlation matrix of numeric features
```

From the charts above, the following conclusions can be drawn:
1. The target variable (price) is highly skewed, so it may be better to transform it (for example, with a log transformation).
2. The review scores are highly correlated with each other, so it would be wise to keep only one of them.
3. host_listing_count and host_total_listing_count are also highly correlated, so only one of them is kept.
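
The redundancy check described in points 2 and 3 can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual code: the helper `highly_correlated_pairs`, the 0.9 threshold, and the column names in the toy frame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, |corr|) for numeric column pairs at or above the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs


# Synthetic stand-in data (hypothetical columns, not the real listings).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "review_scores_rating": base,
    "review_scores_value": base + rng.normal(scale=0.05, size=200),  # near-duplicate
    "minimum_nights": rng.normal(size=200),                          # unrelated
})

pairs = highly_correlated_pairs(demo, threshold=0.9)
to_drop = sorted({b for _, b, _ in pairs})  # keep the first column of each pair
reduced = demo.drop(columns=to_drop)
print(pairs)
print(list(reduced.columns))
```

Keeping the first column of each pair is an arbitrary choice here; which column to keep in practice depends on which one is easier to interpret.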

In addition, I have used Tableau to create an interactive dashboard.

<!-- Embed Tableau Visualization -->
<div class='tableauPlaceholder' id='viz1704512271559' style='position: relative'>
  <noscript>
    <a href='#'><img alt='Dashboard 1 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;RR&#47;RRQTYPS4R&#47;1_rss.png' style='border: none' /></a>
  </noscript>
  <object class='tableauViz' style='display:none;'>
    <param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' />
    <!-- Other params from the embed code -->
  </object>
</div>

## Methods

### Analysis
The models considered in the project include Polynomial Ridge, LGBM, Random Forest, and variations of Random Forest models. 
Several data processing and feature engineering techniques were used in this pipeline, including:
1. `CountVectorizer` for the two columns of text data
2. Scaling of the numeric features
3. One-hot encoding of the categorical features
4. Feature engineering on the two text columns, extracting the positivity of the description of the listing and of its neighbourhood
5. Log transformation of the target variable
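
The preprocessing steps listed above could be wired together roughly as in the sketch below, using scikit-learn. Several things here are assumptions rather than the project's code: the column names and toy rows are hypothetical, `Ridge` is only a lightweight stand-in for the models actually compared, `log1p`/`expm1` is one possible choice of log transformation, and the positivity feature from step 4 is omitted because the source does not say how it was computed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, for illustration only.
numeric_cols = ["accommodates", "minimum_nights"]
categorical_cols = ["room_type"]

preprocessor = ColumnTransformer([
    # A single column name (not a list) gives CountVectorizer the 1-D input it expects.
    ("desc", CountVectorizer(max_features=50), "description"),
    ("hood", CountVectorizer(max_features=50), "neighborhood_overview"),
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on log(1 + price) and map predictions back to the price scale automatically.
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocessor, Ridge(alpha=1.0)),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Tiny made-up dataset so the sketch runs end to end.
toy = pd.DataFrame({
    "description": ["bright cozy suite", "large family home", "small basement room",
                    "modern downtown condo", "quiet garden suite", "shared room near transit"],
    "neighborhood_overview": ["close to beach", "quiet residential street", "near transit",
                              "busy downtown core", "leafy and quiet", "near transit and shops"],
    "accommodates": [2, 6, 1, 4, 2, 1],
    "minimum_nights": [2, 3, 30, 1, 2, 30],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt", "Shared room"],
    "price": [150.0, 320.0, 60.0, 210.0, 140.0, 45.0],
})

X_toy = toy.drop(columns="price")
model.fit(X_toy, toy["price"])
preds = model.predict(X_toy)  # already back on the price scale
print(np.round(preds, 1))
```

Wrapping the regressor in `TransformedTargetRegressor` keeps the log transformation of the target inside the model, so cross-validation scores and predictions do not need a manual inverse step.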

The data was split into a training set (80%) and a test set (20%). LGBM appeared to be the best-performing model, so its hyperparameters were tuned. The hyperparameters were chosen using 10-fold cross-validation, with the cross-validation test score as the selection metric.
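
As a rough sketch of this evaluation setup, the snippet below performs an 80/20 split and summarises 10-fold cross-validation as a mean and standard deviation per column. The data are synthetic, and `RandomForestRegressor`, the seed, and the variable names are assumptions chosen so the example runs without extra dependencies; they are not the project's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic regression data standing in for the preprocessed listings.
rng = np.random.default_rng(123)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

# 80% training, 20% test; the test set is held out until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# 10-fold cross-validation on the training set only.
cv_results = cross_validate(
    RandomForestRegressor(n_estimators=20, random_state=123),
    X_train,
    y_train,
    cv=10,
    return_train_score=True,
)

# One row for the mean and one for the standard deviation across folds.
summary = pd.DataFrame(cv_results).agg(["mean", "std"]).round(3)
print(summary)
```

Note that `cross_validate` labels the validation-fold score `test_score`, even though the held-out test set is never touched; for scikit-learn regressors the default score is $R^2$.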

Below is the result of the performance of every model.

In [2]:
scores = pd.read_csv('../result/tables/output.csv')
glue("model_scores", scores)

| Model | Statistic | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|---|
| poly_ridge | mean | 0.904 | 0.03 | -210813800.0 | 0.329 |
| poly_ridge | std | 0.142 | 0.001 | 666651800.0 | 0.144 |
| lgbm | mean | 0.652 | 0.031 | 0.434 | 0.663 |
| lgbm | std | 0.027 | 0.001 | 0.222 | 0.054 |
| random_forest | mean | 1.147 | 0.045 | 0.365 | 0.573 |
| random_forest | std | 0.069 | 0.002 | 0.204 | 0.065 |
| rf_model_based | mean | 6.1 | 0.061 | 0.365 | 0.573 |
| rf_model_based | std | 2.015 | 0.014 | 0.204 | 0.065 |
| rf_rfecv | mean | 6.425 | 0.056 | 0.365 | 0.573 |
| rf_rfecv | std | 2.464 | 0.008 | 0.204 | 0.065 |


## Results & Discussion

In the final stage of the project, the LGBM model, trained on the entire training set, achieved a test score of 0.6. The feature importance analysis highlighted the key factors influencing predicted prices, with room availability and minimum nights as the top contributors. Notably, the engineered feature capturing the positivity of the description text ranked among the top 5 most influential features. These findings suggest that the model captures meaningful patterns in Airbnb pricing and provides a foundation for further refinement. Further feature engineering and fine-tuning may yield more accurate predictions.


```{figure} ../result/plots/feature_importance.png
---
width: 800px
name: feature_importance
---
Top 10 features according to their importance
```
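
For readers who want to reproduce a ranking of this kind, the sketch below orders features by importance on synthetic data. It deliberately swaps in a simpler technique: the impurity-based `feature_importances_` of a scikit-learn random forest, not SHAP values from the tuned LGBM model. The feature names and coefficients are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on the first feature,
# weakly on the second, and not at all on the remaining two.
rng = np.random.default_rng(0)
features = ["availability_365", "minimum_nights", "accommodates", "noise"]
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=features)
y = 3.0 * X["availability_365"] + 1.0 * X["minimum_nights"] \
    + rng.normal(scale=0.1, size=400)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort to obtain a top-k ranking.
importances = pd.Series(forest.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
top_features = importances.head(2)
print(top_features)
```

Impurity-based importances and SHAP-based rankings can disagree, so this sketch illustrates the ranking step only and is not a substitute for the SHAP analysis reported here.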

### Waterfall Analysis
To examine an individual observation, consider the fifth listing in the test set. The LGBM model predicts a price of 67.785 (the inverse log of 4.231), which is below the average. The waterfall plot shows the factors driving this prediction: the listing accommodates 1.248 fewer people and requires an additional 0.451 minimum nights. The property is also not an entire apartment, which contributes to the lower prediction.
This breakdown makes the reasoning behind an individual price prediction transparent.
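
The conversion from the log-scale prediction back to a price can be checked with a few lines. The sketch assumes the target was transformed with `log(1 + price)`, whose inverse is `expm1`; this is an inference from the two numbers quoted above rather than something the write-up states.

```python
import math

log_prediction = 4.231  # log-scale prediction quoted in the text

# Inverse of log(1 + price): matches the quoted price closely.
price_expm1 = math.expm1(log_prediction)

# Inverse of a plain log(price), shown for comparison: exactly 1 higher.
price_exp = math.exp(log_prediction)

print(round(price_expm1, 3), round(price_exp, 3))
```

Under that assumption the back-transformed value lands within rounding distance of the quoted 67.785, whereas a plain exponential would give a value about one unit higher.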

```{figure} ../result/plots/waterfall.png
---
width: 800px
name: waterfall_analysis
---
Waterfall analysis of one listing
```