
Predicting Next Booking Destinations for Airbnb Users

A Classification Project

Note: the business problem is fictitious, although both the company and the data are real.

The in-depth Python code explanation is available in this Jupyter Notebook.

1. Airbnb and Business Problem

Airbnb is an online marketplace for short-term homestays, and its business model consists of charging a commission for each booking. To better understand its customers' behavior and most desired booking locations, a Data Scientist was hired to predict the five most likely countries for a USA user's next booking. Airbnb provided data from over 200 thousand users, split into two different datasets (more information in Section 2), so that predictions could be made for around 61 thousand users. There are 12 possible outcomes for the destination country: 'USA', 'France', 'Canada', 'Great Britain', 'Spain', 'Italy', 'Portugal', 'New Zealand', 'Germany' and 'Australia', as well as 'NDF' (meaning no booking was made) and 'other countries'.

2. Data Overview

The data is split into users data and sessions data, the latter being the users' web browsing information. The initial feature descriptions are available below:

Users

Feature | Definition
id | user id
date_account_created | the date of account creation
timestamp_first_active | timestamp of the first activity
date_first_booking | date of the first booking
gender | user's gender
age | user's age
signup_method | method of signing up, e.g. facebook, google
signup_flow | the page the user came to sign up from
language | international language preference
affiliate_channel | what kind of paid marketing
affiliate_provider | where the marketing is, e.g. google, craigslist
first_affiliate_tracked | first marketing the user interacted with
signup_app | signup app, e.g. Web, Android
first_device_type | first device type used, e.g. Windows, iPhone, Android
first_browser | first browser used, e.g. Chrome, Firefox, Safari
country_destination | target variable

Sessions

Feature | Definition
user_id | same as 'id' in the users table
action | action performed, e.g. show, search_results
action_type | action type performed, e.g. view, click
action_detail | action detail, e.g. confirm_email_link
device_type | device used on each action
secs_elapsed | the time elapsed between two recorded actions

The data was collected from Kaggle.
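
As a rough illustration of how the two tables relate, the sketch below loads them with pandas and joins per-user session aggregates onto the users data; the file names are assumed to follow the Kaggle competition files, so adjust the paths as needed.

```python
import pandas as pd

# Assumed Kaggle file names; adjust paths to your local copies.
users = pd.read_csv("train_users_2.csv")
sessions = pd.read_csv("sessions.csv")

# Sessions hold one row per action; aggregate them to one row per user before joining.
sessions_agg = (
    sessions.groupby("user_id")
    .agg(n_actions=("action", "size"), total_secs=("secs_elapsed", "sum"))
)

# 'user_id' in sessions matches 'id' in users.
df = users.merge(sessions_agg, left_on="id", right_index=True, how="left")
```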

3. Assumptions

  • Out of 'action', 'action_type' and 'action_detail', only 'action_type' was kept, since the three are highly correlated and seem to represent similar events. 'action_type' was chosen because it has only 28 unique values, unlike 'action' and 'action_detail', which have hundreds; this made encoding easier later on.

  • Missing values on 'first_affiliate_tracked' were replaced with 'untracked', as it would be the most logical replacement in this instance.

  • Missing values on 'age' were replaced with the median age.

  • 'date_first_booking' was dropped since it doesn't exist in the new users dataset. A sketch implementing these cleaning assumptions follows below.
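
A minimal sketch of these cleaning assumptions applied with pandas, reusing the hypothetical 'users' and 'sessions' DataFrames from the loading sketch above; it is an illustration, not the project's exact code.

```python
# Keep only 'action_type' among the three highly correlated session columns.
sessions = sessions.drop(columns=["action", "action_detail"])

# 'first_affiliate_tracked': a missing value means the user was not tracked.
users["first_affiliate_tracked"] = users["first_affiliate_tracked"].fillna("untracked")

# 'age': fill missing values with the median age.
users["age"] = users["age"].fillna(users["age"].median())

# 'date_first_booking' does not exist for new users, so it is dropped.
users = users.drop(columns=["date_first_booking"])
```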

4. Solution Plan

4.1. How was the problem solved?

To predict the five most likely countries for a USA user's next booking, the following steps were performed:

  • Understanding the Business Problem: Understanding the main objective we are trying to achieve and planning the solution for it.

  • Collecting Data: Collecting data from Kaggle.

  • Data Cleaning: Checking data types and NaNs, as well as renaming columns, dealing with outliers, filling missing values, changing data types, etc.

  • Feature Engineering: Creating new features from the original ones, so that they could be used in the ML model. The full list of new features created, with their definitions, is available here.

  • Exploratory Data Analysis (EDA): Exploring the data to gain business knowledge, look for data inconsistencies and useful business insights, and find important features for the ML model. This process is split into Univariate, Bivariate (Checking Hypotheses) and Multivariate Analysis. The univariate analysis was done using the Pandas Profiling library; the report is available for download here. The top business insights found are available in Section 5.

  • Data Preparation: Applying Rescaling Techniques to the data, as well as Encoding Methods to deal with categorical variables.

  • Feature Selection: Selecting the best features to use in the ML model by using Random Forest (see the sketch after this list).

  • Machine Learning Modeling and Model Evaluation: Training Classification Algorithms. The best model was selected to be improved via Bayesian Optimization with Optuna. More information in Section 6.

  • Model Deployment and Results: Providing a list of the five most likely destination predictions for 61 thousand USA Airbnb users, as well as a graphical analysis of the predictions by age, gender and overall. This is the project's Data Science Product, and it can be accessed from anywhere in a Streamlit App. In addition, if new data from new users comes in, it's easy to get new predictions, as a Flask application hosted on Render Cloud was built. More information in Section 7.
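
Below is a minimal sketch of the Feature Selection step referenced above, assuming X_train and y_train are the prepared (encoded and rescaled) training data; the number of trees and the cumulative-importance cutoff are illustrative choices, not the project's exact settings.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a Random Forest only to rank features by importance.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

importances = (
    pd.Series(rf.feature_importances_, index=X_train.columns)
    .sort_values(ascending=False)
)
print(importances.head(20))

# Keep, for example, the features covering 95% of the cumulative importance.
selected = importances[importances.cumsum() <= 0.95].index.tolist()
X_train_sel = X_train[selected]
```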

4.2. Tools and techniques used:

5. Top Business Insights

  • 1st - Users take less than 2 days, on average, from their first activity on the platform to creating an account, considering all destinations.



  • 2nd - The number of accounts created goes up during the spring.



  • 3rd - Women made over 15% more bookings to countries other than the USA, compared to men.



6. Machine Learning Models

Initially, seven models were trained using cross-validation, so that predictions on the five most likely countries for a USA Airbnb user's next booking could be provided: Logistic Regression, Decision Tree, Random Forest, Extra Trees, AdaBoost, XGBoost and Light GBM.

The initial cross-validation performance of all seven algorithms is displayed below:

Model | NDCG at K
Light GBM | 0.8496 +/- 0.0006
XGBoost | 0.8482 +/- 0.0004
Random Forest | 0.8451 +/- 0.0006
AdaBoost | 0.8429 +/- 0.0019
Extra Trees | 0.8390 +/- 0.0008
Logistic Regression | 0.8377 +/- 0.0010
Decision Tree | 0.7242 +/- 0.0023

Where K is equal to 5, given our business problem.
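
For reference, a minimal sketch of how such a cross-validated NDCG@5 score could be computed for one of the models (LightGBM here); X, y and the fold settings are assumptions for illustration, not the project's exact setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import ndcg_score
from lightgbm import LGBMClassifier

# X, y: prepared features and integer-encoded destination labels (NumPy arrays), assumed here.
scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X, y):
    model = LGBMClassifier().fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[val_idx])  # columns follow model.classes_ (0..11 when labels are 0..11)

    # One-hot the true classes so ndcg_score can compare them to the class probabilities.
    relevance = np.zeros_like(proba)
    relevance[np.arange(len(val_idx)), y[val_idx]] = 1
    scores.append(ndcg_score(relevance, proba, k=5))

print(f"NDCG@5: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```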

The Light GBM was chosen as the final model, since it's fast to train and tune, whilst also being the one with the best result without any tuning. In addition, it's much better for deployment, as it's much lighter than an XGBoost or Random Forest model, for instance, especially given that we're using a free deployment cloud. More information in Section 7.

For hyperparameter tuning, instead of using cross-validation (which uses only the training dataset), we compared the model's performance on the test dataset, which had been split off before Data Preparation to avoid Data Leakage. After tuning LGBM's hyperparameters using Bayesian Optimization with Optuna, the model's performance improved, as expected:

Model | NDCG at K (before tuning) | NDCG at K (final model)
Light GBM | 0.8514 | 0.8542
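
A rough illustration of this tuning step with Optuna, assuming X_train/y_train and X_test/y_test are the prepared train and test splits with integer-encoded labels; the search space and trial count are illustrative, not the values used in the project.

```python
import numpy as np
import optuna
from sklearn.metrics import ndcg_score
from lightgbm import LGBMClassifier

def ndcg_at_5(y_true, proba):
    # One-hot the true classes (assumed integer-encoded) against the class probabilities.
    relevance = np.zeros_like(proba)
    relevance[np.arange(len(y_true)), y_true] = 1
    return ndcg_score(relevance, proba, k=5)

def objective(trial):
    # Illustrative search space, not the project's exact one.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = LGBMClassifier(**params).fit(X_train, y_train)
    return ndcg_at_5(y_test, model.predict_proba(X_test))

# Optuna's default TPE sampler performs the Bayesian optimization.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```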

Metrics Definition and Interpretation

As the goal of this project is to predict not only the most likely next booking destination for each user, but the five most likely ones, the Normalized Discounted Cumulative Gain (NDCG) at rank K was chosen as the evaluation metric.

NDCG at K “measures the performance of a recommendation system based on the graded relevance of the recommended entities. It varies from 0.0 to 1.0, with 1.0 representing the ideal ranking of the entities.” Therefore, in this instance (where K equals 5), it not only measures how well we can predict the five most likely next booking locations for each user, but also how well we can rank them from the most likely to the least likely.
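
For intuition, a minimal worked example of the metric for a single user, assuming binary relevance (one true destination per user); the country codes are purely illustrative.

```python
import numpy as np

def ndcg_at_k(ranked_predictions, true_label, k=5):
    """NDCG@k for one user with a single relevant destination (binary relevance)."""
    for rank, pred in enumerate(ranked_predictions[:k], start=1):
        if pred == true_label:
            # DCG = 1 / log2(rank + 1); the ideal DCG is 1 (relevant item at rank 1).
            return 1.0 / np.log2(rank + 1)
    return 0.0  # the true destination was not in the top k

# The true destination ('FR') is the model's second guess -> NDCG@5 ≈ 0.63.
print(ndcg_at_k(["US", "FR", "NDF", "IT", "ES"], "FR"))
```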

7. Model Deployment and Results

The model deployment was performed in three steps:

  • Step 1: The original data (both datasets in Section 2) was saved in a PostgreSQL Database from Neon.tech.

  • Step 2: A Flask application, hosted on Render Cloud, was built to extract the original data from that PostgreSQL Database, clean and transform it, load the saved ML model, create predictions for each user and write these predictions back to a different table in the same Database. Let's name this table 'df_pred' for the sake of the explanation.

  • Step 3: Streamlit retrieves the df_pred data from the Database and displays it in a filterable table, where you can find the five most likely destination predictions for the 61 thousand USA Airbnb users (a sketch of this retrieval follows below). In addition, graphical analyses of the predictions were built, split by age, gender and overall. This is the project's Data Science Product, and it can be accessed from anywhere in a Streamlit App.
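
A minimal sketch of the Streamlit retrieval step mentioned above, assuming the connection string is stored in a DATABASE_URL environment variable and that df_pred contains a 'gender' column; both are illustrative assumptions, not the project's exact code.

```python
import os
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

# Assumed: the Neon.tech PostgreSQL connection string lives in an environment variable.
engine = create_engine(os.environ["DATABASE_URL"])

@st.cache_data
def load_predictions():
    # 'df_pred' is the predictions table written by the Flask app.
    return pd.read_sql("SELECT * FROM df_pred", engine)

df_pred = load_predictions()

# Simple filter on an assumed 'gender' column, then display the table.
gender = st.selectbox("Gender", ["All"] + sorted(df_pred["gender"].dropna().unique().tolist()))
if gender != "All":
    df_pred = df_pred[df_pred["gender"] == gender]

st.dataframe(df_pred)
```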

Click on the respective icon to access the link: Streamlit App | Flask App

The Flask App is particularly useful when new data comes in, as we can get new predictions with a click of a button, so they can later be retrieved by the Streamlit App. The Streamlit App code is available here and the Flask App code can be seen here.

Because the deployment was made on a free cloud (Render Cloud), the Flask App could be slow; on the other hand, the main deployment product, the Streamlit App, should work quickly.

8. Conclusion

In this project the main objective was accomplished:

We managed to provide a list of the five most likely destination predictions for 61 thousand USA Airbnb users, as well as a graphical analysis of the predictions by age, gender and overall. This can all be found in a Streamlit App, for better visualization. Also, a Flask application was built for when new data comes in, making it possible to get new predictions easily. In addition, three interesting and useful insights were found through Exploratory Data Analysis (EDA), so that they can be properly used by Airbnb.

9. Next Steps

Further on, this solution could be improved through a few strategies:

  • Creating even more features from the existing ones.
  • Trying other classification algorithms, such as Neural Networks.
  • Using a paid cloud, such as AWS.

Contact