![Screenshot 2024-04-13 at 14 43 50](https://private-user-images.githubusercontent.com/101648535/322206245-ac35c087-15f2-48b9-9df2-e71d4850948b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE5ODU5NDYsIm5iZiI6MTcyMTk4NTY0NiwicGF0aCI6Ii8xMDE2NDg1MzUvMzIyMjA2MjQ1LWFjMzVjMDg3LTE1ZjItNDhiOS05ZGYyLWU3MWQ0ODUwOTQ4Yi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNlQwOTIwNDZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wNTI2NWQ3NDU5NTU1ODI2M2YyNWVjN2EzM2ZkODQwY2YyZDIyODU1NGMxZWQ4YTUxMTljMWZjM2VkYTQ2NzFlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.J3UZQxSCZB3Ttr1lIWGY_gR3Tmx7XZZ-LTQIJCYGX2k)
The goal of this project is to build a classifier and measure how accurately it can predict song genres. The dataset comes from Spotify [Pandya, 2022], which already uses machine learning algorithms for this purpose. This helps assess whether the resulting model would suit a large-scale business or is more appropriate for a smaller player in the audio streaming market.
The following machine learning algorithms were explored:
- SVM
- Decision Tree
- Gaussian Naive Bayes
- k-NN
- MLP
- Multinomial Naive Bayes
- Nearest Centroid
- Random Forest
- XGBoost
The contents of the repository are the following:
- data/ → datasets used for this project
  - spotify_data: the original Spotify Tracks Dataset
  - spotify_clean: dataset with only one genre assigned to each song (generated using the data-cleaning notebook)
  - spotify_simplified: dataset with only 18 unique genres in total (generated using the clustering notebook)
  - data_report: exploratory data analysis for the original dataset
- figures/ → figures generated for the presentation and report (generated using the plots notebook)
- ml_methods/ → notebooks with different machine learning algorithms explored for the project
  - baseline → implements the majority and rule-based baselines
  - clustering → reduces the number of genres in the dataset to 18 via a combination of agglomerative clustering and manual input
  - data-cleaning → chooses a single genre for every song that appeared in the dataset with multiple genres
  - data-exploration → visualizes the features of the dataset and proposes preprocessing steps
  - hyperparemter-optimization → hyperparameter optimization implemented using GridSearchCV
  - plots → generates plots for the report and presentation
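
To illustrate what a majority baseline measures, here is a minimal sketch in plain Python. It is not the code from the baseline notebook: the function names (`fit_majority_baseline`, `accuracy`) and the toy genre labels are assumptions made up for this example. The idea is that every song is predicted to belong to the most frequent genre in the training data, which gives a floor that any real classifier should beat.

```python
from collections import Counter


def fit_majority_baseline(train_labels):
    """Return the most frequent label in the training labels (hypothetical helper)."""
    if not train_labels:
        raise ValueError("train_labels must not be empty")
    # most_common(1) returns [(label, count)]; ties go to the label seen first.
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy labels invented for illustration; they are NOT taken from the Spotify dataset.
train_genres = ["rock", "pop", "rock", "jazz", "rock", "pop"]
test_genres = ["rock", "pop", "rock", "jazz"]

majority_genre = fit_majority_baseline(train_genres)

# The baseline ignores all audio features and always predicts the same genre.
predictions = [majority_genre] * len(test_genres)
baseline_accuracy = accuracy(test_genres, predictions)

print(f"majority genre: {majority_genre}, baseline accuracy: {baseline_accuracy:.2f}")
```

With scikit-learn installed, `DummyClassifier(strategy="most_frequent")` provides the same kind of baseline behind the usual `fit`/`predict` interface.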
- Activate your virtual environment
- Run the following command to install all the dependencies needed for this project: `pip install -r requirements.txt`
- Inspect the code for the different algorithms that were explored (stored under ml_methods/)
Team 1
- Elizaveta Nosova (1983805)
- Miguel Samaniego (1980439)
- Nico Sharei (1986818)
- Julian Ament (1981511)
- Artem Bisliouk (1978986)
- Jannik Kranz (1981766)