
Tous Ensemble - Ensemble Learning project

This project is the final assignment of the 2023 Ensemble Learning class at CentraleSupélec, part of the Master in Data Sciences & Business Analytics. It consists of two parts:

  • Predict Airbnb prices in New York City using several ensemble methods seen in class.
  • Implement a Decision Tree from scratch in Python that handles both regression and classification tasks.

Table of Contents
  1. About The Project
  2. Installation
  3. Usage
  4. Repository tree structure
  5. Contact
  6. Acknowledgments

About The Project

(Illustrations: Airbnb price prediction in New York City · Decision Tree from scratch in Python)

Airbnb has become a popular alternative to traditional hotels, allowing individuals to list their properties as rental places. However, determining the optimal price for an Airbnb listing can be challenging for hosts, especially in a large city like New York, where the number of listings is substantial. Accurate price prediction is therefore crucial to help hosts set competitive prices and improve their occupancy rates.

In this project, we predicted the price of Airbnb listings in New York City using Ensemble Learning techniques on a Kaggle dataset. We trained and tuned the hyperparameters of 14 methods (Decision Trees, Random Forest, XGBoost, etc.) and combined their predictions using Stacking and Voting, two popular ensemble techniques that we implemented ourselves and that outperformed Scikit-Learn’s implementations. We evaluated our ensemble models with several metrics, such as MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error). We also implemented our own decision tree algorithm from scratch and compared it with Scikit-Learn’s implementation, both on the Airbnb price prediction task and on 4 other datasets.

The results of this project confirmed that Stacking, Voting and Boosting are very effective ensemble techniques which, combined with proper feature engineering, can provide valuable insights to Airbnb hosts in New York City for better decision-making.
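For illustration, here is a minimal sketch of the Scikit-Learn stacking and voting baselines that our own implementations were compared against (the base learners and hyperparameters below are placeholders, not the 14 tuned models from the project):

    from sklearn.ensemble import (RandomForestRegressor, StackingRegressor,
                                  VotingRegressor)
    from sklearn.linear_model import Ridge
    from sklearn.tree import DecisionTreeRegressor

    # Placeholder base learners; the project tunes 14 models before combining them
    base_learners = [
        ('tree', DecisionTreeRegressor(max_depth=8)),
        ('forest', RandomForestRegressor(n_estimators=300)),
    ]

    # Stacking: a meta-learner (here Ridge) is fit on the base models' predictions
    stacking = StackingRegressor(estimators=base_learners, final_estimator=Ridge())

    # Voting: the base models' predictions are averaged
    voting = VotingRegressor(estimators=base_learners)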

(back to top)

Built With

  • Pandas
  • NumPy
  • Scikit-learn
  • spaCy
  • Matplotlib
  • Seaborn
  • Graphviz

(back to top)

Installation

  1. Clone the repo
    git clone https://github.com/adel-R/Ensemble2023
  2. Install the packages contained in the requirements.txt file
  • Unix/macOS
    python -m pip install -r requirements.txt
  • Windows
    py -m pip install -r requirements.txt

(back to top)

Usage

Airbnb Price prediction

Once every tuned model has been scored, the all_scores list (one dict of metrics per model) is turned into a comparison table:

    import pandas as pd

    # all_scores is a list of per-model metric dicts built earlier in the notebook
    Final_Results = pd.DataFrame(all_scores)
    Final_Results = Final_Results[['Model', 'R2', 'MAE', 'MSE', 'RMSE', 'MAPE',
                                   'error_ratio_rmse', 'error_ratio_mae']]
    Final_Results.sort_values('R2', ascending=False)
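For illustration, one entry of all_scores could be built as follows (a sketch: score_model is a hypothetical helper, not the project's code, and the error_ratio_* columns are assumed here to be the error divided by the mean price):

    import numpy as np
    from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                                 mean_squared_error, r2_score)

    def score_model(name, y_true, y_pred):
        # Hypothetical helper building one all_scores entry;
        # error_ratio_* assumed = error / mean observed price
        mse = mean_squared_error(y_true, y_pred)
        rmse = float(np.sqrt(mse))
        mae = mean_absolute_error(y_true, y_pred)
        return {
            'Model': name,
            'R2': r2_score(y_true, y_pred),
            'MAE': mae,
            'MSE': mse,
            'RMSE': rmse,
            'MAPE': mean_absolute_percentage_error(y_true, y_pred),
            'error_ratio_rmse': rmse / np.mean(y_true),
            'error_ratio_mae': mae / np.mean(y_true),
        }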

Decision Tree from scratch

The from-scratch decision tree can be tested on standard Scikit-Learn datasets.

Classification task

    from sklearn.datasets import load_digits

    # run_classification is one of the project's helpers
    # (see decision_tree_from_scratch/)
    run_classification(load_digits())

Regression task

    from sklearn.datasets import load_diabetes

    run_regression(load_diabetes())
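For intuition, the heart of such a from-scratch tree is the exhaustive search for the best split. A simplified regression version (an illustration of the technique, not the code in Decision_Tree.py) could look like:

    import numpy as np

    def best_split(X, y):
        # Search the (feature, threshold) pair minimizing the weighted
        # variance of the two children (the usual regression criterion)
        best = (None, None, np.inf)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                right = ~left
                if left.all() or right.all():
                    continue  # skip degenerate splits
                score = (left.sum() * y[left].var()
                         + right.sum() * y[right].var()) / len(y)
                if score < best[2]:
                    best = (j, t, score)
        return best  # (feature index, threshold, weighted child variance)

A classification tree swaps the variance criterion for Gini impurity or entropy.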

(back to top)

Repository tree structure

.
│   Airbnb_Price_Prediction_Project.pdf
│   global_results.ipynb
│   README.md
│   requirements.txt
│
├───.ipynb_checkpoints
│       Airbnb_Price_Prediction_Project-First_Experimentations_Amine_Zaamoun-checkpoint.ipynb
│       draft_adel - Copy-checkpoint.ipynb
│       draft_adel-checkpoint.ipynb
│       global_results-checkpoint.ipynb
│       Untitled-checkpoint.ipynb
│
├───catboost_info
│   │   catboost_training.json
│   │   learn_error.tsv
│   │   time_left.tsv
│   │
│   └───learn
│           events.out.tfevents
│
├───dataset
│       AB_NYC_2019.csv
│       airbnb-listings.csv
│       name_tsne.csv
│       text_tsne.csv
│
├───decision_tree_from_scratch
│   │   Decision_Tree.py
│   │   Test.py
│   │   Test_of_homemade_decision_tree.ipynb
│   │   tree_classification_digits_dataset.png
│   │   tree_classification_iris_dataset.png
│   │   tree_regression_Airbnb_dataset.png
│   │   tree_regression_california_housing.png
│   │   tree_regression_diabetes_dataset.png
│   │
│   ├───.ipynb_checkpoints
│   │       homemade_decision_tree-checkpoint.ipynb
│   │
│   └───__pycache__
│           Decision_Tree.cpython-39.pyc
│
├───drafts
│   │   Airbnb_Price_Prediction_Project-First_Experimentations_Amine_Zaamoun.ipynb
│   │   decision_tree_scratch_draft.ipynb
│   │   draft_adel.ipynb
│   │
│   └───.ipynb_checkpoints
│           draft_adel-checkpoint.ipynb
│
├───functions
│   │   functions.py
│   │
│   └───__pycache__
│           functions.cpython-39.pyc
│
├───img
│       .gitignore
│       Airbnb_NYC-prices.jfif
│       decision_tree_from_scratch-viz.jfif
│       New_York_City_.png
│       plot_Airbnb_Price_NYC.png
│
├───models
│   │   adaboost_tuning.ipynb
│   │   bagging_tuning.ipynb
│   │   catboost_tuning.ipynb
│   │   decision_tree_from_scratch.ipynb
│   │   decision_tree_tuning.ipynb
│   │   extremely_randomized_forest_tuning.ipynb
│   │   lgbm_tuning.ipynb
│   │   random_forest_tuning.ipynb
│   │   sk_gradient_boosting_tuning.ipynb
│   │   sk_hist_gradient_boosting_tuning.ipynb
│   │   stacking_tuning.ipynb
│   │   tree_regression_Airbnb_dataset.png
│   │   voting_tuning.ipynb
│   │   xgboost_tuning.ipynb
│   │
│   ├───.ipynb_checkpoints
│   │       adaboost_tuning-checkpoint.ipynb
│   │       bagging_tuning-checkpoint.ipynb
│   │       catboost_tuning  TO DO-checkpoint.ipynb
│   │       decision_tree_from_scratch-checkpoint.ipynb
│   │       decision_tree_from_scratch_tuning-checkpoint.ipynb
│   │       decision_tree_tuning-checkpoint.ipynb
│   │       draft_adel - Copy-checkpoint.ipynb
│   │       extremely_randomized_forest_tuning-checkpoint.ipynb
│   │       lgbm_tuning-checkpoint.ipynb
│   │       random_forest_tuning-checkpoint.ipynb
│   │       sk_gradient_boosting_tuning-checkpoint.ipynb
│   │       sk_hist_gradient_boosting_tuning-checkpoint.ipynb
│   │       stacking_tuning-checkpoint.ipynb
│   │       voting_tuning-checkpoint.ipynb
│   │       xgboost_tuning-checkpoint.ipynb
│   │
│   ├───saved_models
│   │       adaboost_params.json
│   │       bagging_params.json
│   │       catboost_params.json
│   │       decision_tree_params.json
│   │       extremely_randomized_forest_params.json
│   │       homemade_tree_params.json
│   │       lgbm_params.json
│   │       lgbm_tuned.txt
│   │       random_forest_params.json
│   │       sk_gradient_boosting_params.json
│   │       sk_hist_gradient_boosting_params.json
│   │       vote_params.json
│   │       xgb_model.json
│   │       xgb_params.json
│   │
│   └───saved_scores
│           homemade_decision_tree_score.json
│           homemade_stacking_scores.json
│           sk_stacking_scores.json
│
└───seq2vec_tsne
    │   nlp_tsne_embedding_of_texts.ipynb
    │
    └───.ipynb_checkpoints
            nlp_tsne_embedding of texts-checkpoint.ipynb
            nlp_tsne_embedding_of_texts-checkpoint.ipynb

The global_results.ipynb notebook summarizes all the results obtained and allows re-fitting all the saved models.

Each model is tuned and saved in its own notebook in the models folder.

All the notebooks rely on helper functions stored in the functions folder.

The decision tree algorithm coded from scratch, the t-SNE embeddings of the textual data, and some draft notebooks are stored in separate folders.
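For instance, re-fitting one of the tuned models from its saved hyperparameters could look like this (a sketch assuming the JSON files in models/saved_models store a plain dict of constructor keyword arguments):

    import json

    from sklearn.ensemble import RandomForestRegressor

    # Assumption: the params file holds a flat dict of RandomForestRegressor kwargs
    with open('models/saved_models/random_forest_params.json') as f:
        params = json.load(f)

    model = RandomForestRegressor(**params)
    # model.fit(X_train, y_train)  # training data prepared by the helper functions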

(back to top)

Contact

Acknowledgments

(back to top)
