This project is the final assignment of the 2023 Ensemble Learning class at CentraleSupélec, part of the Master in Data Sciences & Business Analytics. It consists of two parts:
- Predicting Airbnb prices in New York City using several ensemble methods seen in class.
- Implementing a decision tree from scratch in Python that handles both regression and classification tasks.
Airbnb has become a popular alternative to traditional hotels, allowing individuals to list their properties as rental places. However, determining the optimal price for an Airbnb listing can be challenging for hosts, especially in large cities like New York, where the number of listings is substantial. To help hosts set competitive prices and improve their occupancy rates, accurate prediction of Airbnb prices is crucial.

In this project, we predicted the price of Airbnb listings in New York City using ensemble learning techniques on a Kaggle dataset. Our goal was to train and tune the hyperparameters of 14 methods (Decision Trees, Random Forest, XGBoost, etc.) and combine their predictions using Stacking and Voting, two popular ensemble techniques, which we implemented ourselves and which outperformed Scikit-Learn's implementations. We evaluated the performance of our ensemble models using several metrics, such as MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error). We also implemented decision tree algorithms from scratch and compared them with Scikit-Learn's implementation, both on the Airbnb price prediction task and on 4 other datasets. The results of this project confirmed that Stacking, Voting and Boosting are valuable ensemble techniques which, combined with proper feature engineering, can provide useful insights to Airbnb hosts in New York City for better decision making.
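As background, a Voting ensemble for regression simply averages (optionally with weights) the predictions of the base models. The sketch below illustrates the idea only; it is not the repo's actual implementation, and the example predictions are made up:

```python
import numpy as np

def voting_predict(predictions, weights=None):
    """Combine regression predictions from several base models by
    (optionally weighted) averaging -- the core idea of a Voting regressor."""
    preds = np.asarray(predictions, dtype=float)  # shape: (n_models, n_samples)
    if weights is None:
        weights = np.ones(len(preds)) / len(preds)  # uniform weights
    return np.average(preds, axis=0, weights=np.asarray(weights, dtype=float))

# Example: three hypothetical base-model price predictions for two listings
p1, p2, p3 = [100.0, 200.0], [110.0, 190.0], [90.0, 210.0]
combined = voting_predict([p1, p2, p3])  # averages to [100., 200.]
```

Stacking goes one step further: instead of fixed weights, a meta-model is trained on the base models' out-of-fold predictions.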
- Clone the repo:

  ```shell
  git clone https://github.com/adel-R/Ensemble2023
  ```

- Install the packages listed in the `requirements.txt` file:

  - Unix/macOS:

    ```shell
    python -m pip install -r requirements.txt
    ```

  - Windows:

    ```shell
    py -m pip install -r requirements.txt
    ```
```python
# Gather all model scores into a single comparison table
Final_Results = pd.DataFrame(all_scores)
Final_Results = Final_Results[['Model', 'R2', 'MAE', 'MSE', 'RMSE',
                               'MAPE', 'error_ratio_rmse', 'error_ratio_mae']]
Final_Results.sort_values('R2', ascending=False)  # best R2 first
```
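For context, `all_scores` is presumably a list of per-model metric records collected earlier in the notebook. A self-contained sketch with hypothetical model names and numbers (not the project's actual results):

```python
import pandas as pd

# Hypothetical metric records; the real notebook collects one per tuned model.
all_scores = [
    {"Model": "XGBoost", "R2": 0.61, "MAE": 0.31, "MSE": 0.19, "RMSE": 0.44,
     "MAPE": 0.07, "error_ratio_rmse": 0.09, "error_ratio_mae": 0.06},
    {"Model": "Random Forest", "R2": 0.58, "MAE": 0.33, "MSE": 0.21, "RMSE": 0.46,
     "MAPE": 0.08, "error_ratio_rmse": 0.10, "error_ratio_mae": 0.07},
]
Final_Results = pd.DataFrame(all_scores)[
    ["Model", "R2", "MAE", "MSE", "RMSE", "MAPE",
     "error_ratio_rmse", "error_ratio_mae"]]
Final_Results = Final_Results.sort_values("R2", ascending=False)
```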
```python
run_classification(load_digits())
run_regression(load_diabetes())
```
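These helpers exercise the homemade tree on scikit-learn's toy datasets. As a point of comparison, a hypothetical baseline (`run_regression_sklearn` is not part of the repo) doing the same with scikit-learn's own tree might look like this:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def run_regression_sklearn(dataset, max_depth=5, seed=0):
    """Fit scikit-learn's DecisionTreeRegressor on a dataset bunch
    and report RMSE on a held-out 20% split."""
    X_train, X_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=0.2, random_state=seed)
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=seed)
    tree.fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, tree.predict(X_test)))

rmse = run_regression_sklearn(load_diabetes())
```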
```
.
│   Airbnb_Price_Prediction_Project.pdf
│   global_results.ipynb
│   README.md
│   requirements.txt
│
├───.ipynb_checkpoints
│       Airbnb_Price_Prediction_Project-First_Experimentations_Amine_Zaamoun-checkpoint.ipynb
│       draft_adel - Copy-checkpoint.ipynb
│       draft_adel-checkpoint.ipynb
│       global_results-checkpoint.ipynb
│       Untitled-checkpoint.ipynb
│
├───catboost_info
│   │   catboost_training.json
│   │   learn_error.tsv
│   │   time_left.tsv
│   │
│   └───learn
│           events.out.tfevents
│
├───dataset
│       AB_NYC_2019.csv
│       airbnb-listings.csv
│       name_tsne.csv
│       text_tsne.csv
│
├───decision_tree_from_scratch
│   │   Decision_Tree.py
│   │   Test.py
│   │   Test_of_homemade_decision_tree.ipynb
│   │   tree_classification_digits_dataset.png
│   │   tree_classification_iris_dataset.png
│   │   tree_regression_Airbnb_dataset.png
│   │   tree_regression_california_housing.png
│   │   tree_regression_diabetes_dataset.png
│   │
│   ├───.ipynb_checkpoints
│   │       homemade_decision_tree-checkpoint.ipynb
│   │
│   └───__pycache__
│           Decision_Tree.cpython-39.pyc
│
├───drafts
│   │   Airbnb_Price_Prediction_Project-First_Experimentations_Amine_Zaamoun.ipynb
│   │   decision_tree_scratch_draft.ipynb
│   │   draft_adel.ipynb
│   │
│   └───.ipynb_checkpoints
│           draft_adel-checkpoint.ipynb
│
├───functions
│   │   functions.py
│   │
│   └───__pycache__
│           functions.cpython-39.pyc
│
├───img
│       .gitignore
│       Airbnb_NYC-prices.jfif
│       decision_tree_from_scratch-viz.jfif
│       New_York_City_.png
│       plot_Airbnb_Price_NYC.png
│
├───models
│   │   adaboost_tuning.ipynb
│   │   bagging_tuning.ipynb
│   │   catboost_tuning.ipynb
│   │   decision_tree_from_scratch.ipynb
│   │   decision_tree_tuning.ipynb
│   │   extremely_randomized_forest_tuning.ipynb
│   │   lgbm_tuning.ipynb
│   │   random_forest_tuning.ipynb
│   │   sk_gradient_boosting_tuning.ipynb
│   │   sk_hist_gradient_boosting_tuning.ipynb
│   │   stacking_tuning.ipynb
│   │   tree_regression_Airbnb_dataset.png
│   │   voting_tuning.ipynb
│   │   xgboost_tuning.ipynb
│   │
│   ├───.ipynb_checkpoints
│   │       adaboost_tuning-checkpoint.ipynb
│   │       bagging_tuning-checkpoint.ipynb
│   │       catboost_tuning TO DO-checkpoint.ipynb
│   │       decision_tree_from_scratch-checkpoint.ipynb
│   │       decision_tree_from_scratch_tuning-checkpoint.ipynb
│   │       decision_tree_tuning-checkpoint.ipynb
│   │       draft_adel - Copy-checkpoint.ipynb
│   │       extremely_randomized_forest_tuning-checkpoint.ipynb
│   │       lgbm_tuning-checkpoint.ipynb
│   │       random_forest_tuning-checkpoint.ipynb
│   │       sk_gradient_boosting_tuning-checkpoint.ipynb
│   │       sk_hist_gradient_boosting_tuning-checkpoint.ipynb
│   │       stacking_tuning-checkpoint.ipynb
│   │       voting_tuning-checkpoint.ipynb
│   │       xgboost_tuning-checkpoint.ipynb
│   │
│   ├───saved_models
│   │       adaboost_params.json
│   │       bagging_params.json
│   │       catboost_params.json
│   │       decision_tree_params.json
│   │       extremely_randomized_forest_params.json
│   │       homemade_tree_params.json
│   │       lgbm_params.json
│   │       lgbm_tuned.txt
│   │       random_forest_params.json
│   │       sk_gradient_boosting_params.json
│   │       sk_hist_gradient_boosting_params.json
│   │       vote_params.json
│   │       xgb_model.json
│   │       xgb_params.json
│   │
│   └───saved_scores
│           homemade_decision_tree_score.json
│           homemade_stacking_scores.json
│           sk_stacking_scores.json
│
└───seq2vec_tsne
    │   nlp_tsne_embedding_of_texts.ipynb
    │
    └───.ipynb_checkpoints
            nlp_tsne_embedding of texts-checkpoint.ipynb
            nlp_tsne_embedding_of_texts-checkpoint.ipynb
```
The `global_results.ipynb` notebook summarizes all the results obtained and allows re-fitting all the saved models. The experimented models are tuned and saved in separate notebooks in the `models` folder. All the notebooks rely on helper functions stored in the `functions` folder. The decision tree algorithm coded from scratch, the t-SNE embeddings computed on textual data, and some draft notebooks are stored in separate folders.
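For illustration, embedding listing texts with t-SNE can be sketched as follows. The toy listing names, the TF-IDF vectorization step, and the parameter choices are assumptions for the example, not the repo's actual pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Hypothetical listing names; the repo embeds the real "name"/"text" columns.
names = ["Cozy room in Brooklyn", "Luxury loft in Manhattan",
         "Quiet studio near the park", "Sunny apartment in Queens",
         "Charming Brooklyn brownstone"]

# Vectorize the texts, then project to 2D so the coordinates can be
# saved (e.g. as name_tsne.csv) and used as numeric features.
vectors = TfidfVectorizer().fit_transform(names).toarray()
tsne = TSNE(n_components=2, perplexity=2, init="random", random_state=0)
coords = tsne.fit_transform(vectors)  # shape: (5, 2)
```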