Music Mood Classification Using the Million Song Dataset
If you want to see a quick summary of the methods & results, here are the slides.
A detailed technical report is available as a PDF.
- Clone/download ZIP from https://github.com/bhavika/JoyDivision.git
- cd JoyDivision
- If you want to create a virtual environment, run virtualenv <the name of the environment, say 'tekwani'>
- To begin using the virtual environment, you must activate it. $ source tekwani/bin/activate
- Now install the packages specified in requirements.txt with pip install -r requirements.txt. (pip freeze > requirements.txt does the opposite: it writes the current state of the environment to the file, so it is not needed for installation.)
Running the solution
Download the dataset. I used to host it on Google Drive but can no longer do so, so please contact me for the files.
Move the downloaded files to tekwani/data and check that it contains the following files:
- fullset.pkl
- train.pkl
- test.pkl
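A quick way to confirm the three files listed above are in place is a small check like the one below. This is a minimal sketch and not part of the repository: the function name missing_data_files is hypothetical, and the tekwani/data path is simply the one mentioned above.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = ("fullset.pkl", "train.pkl", "test.pkl")


def missing_data_files(data_dir):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Path as given in the instructions above; adjust if your layout differs.
    missing = missing_data_files("tekwani/data")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All dataset files are in place.")
```

An empty list means all three files were found.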
The folder JoyDivision/explore contains plots, grid search output for the different estimators, scripts to handle h5 files, and a list of getter methods for h5 files (hdf5_getters.txt).
evaluation.py is the final file that generates results as shown in the report.
It will run 6 models for 3 feature combinations - audio and descriptive, descriptive only, audio only.
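To make the size of that evaluation concrete, the sketch below enumerates the 6 x 3 = 18 model and feature-set pairings. It is illustrative only and does not reproduce evaluation.py: the model names are assumed to be the six estimators named in this README for the grid search, and evaluation_runs is a hypothetical helper.

```python
from itertools import product

# Assumption: the six models are the estimators this README names for the
# hyperparameter search; evaluation.py may name or configure them differently.
MODELS = (
    "AdaBoostClassifier",
    "ExtraTreesClassifier",
    "GradientBoostingClassifier",
    "SVM",
    "KNearestNeighbour",
    "RandomForestClassifier",
)

# The three feature combinations, expressed as groups of feature families.
FEATURE_COMBINATIONS = {
    "audio and descriptive": ("audio", "descriptive"),
    "descriptive only": ("descriptive",),
    "audio only": ("audio",),
}


def evaluation_runs():
    """Return every (feature combination, model) pair, grouped by combination."""
    return list(product(FEATURE_COMBINATIONS, MODELS))


if __name__ == "__main__":
    for combination, model in evaluation_runs():
        print(f"{model} on {combination} features")
```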
- create_dataset.py builds the dataset and randomly splits it 60-40 into train and test sets.
- models.py evaluates feature importance for ensemble estimators and performs cross validation for all estimators.
- read_h5.py is used to pull data out of HDF5 files. To run this, you need to download the Million Song Subset and place it in
- spotify.py searches Spotify's database for Track IDs for the songs we have labeled.
- spotify_audio_features.py fetches Danceability, Energy, Speechiness, Acousticness, Valence and Instrumentalness for all the track IDs we were able to get from
- rfe.py gets the ranking of features. I use the number of optimal features (n) obtained here to select the top n features in feature_importance.py for the estimators.
- feature_combinations.py evaluates the importance of features relative to the groups they're combined with, e.g. descriptive features paired with timbre, audio features paired with descriptive, etc.
- get_train_test.py serves the train and test sets (*.pkl) to any file that imports it.
- scratch\labels.csv contains the full list of songs and the labels I assigned to them.
- scratch\models.out contains the output of models.py. These are only cross validation results.
- learning_curve.py plots the training score and cross validation score for an SVM with a linear kernel.
- *.out files are output files.
- hdf5_getters.py is an interface provided along with the Million Song Dataset by LabROSA (Columbia University). It is used to read HDF5 files, the initial form of the Million Song Dataset.
- All files named gridsearch_.py are used to run a hyperparameter search for the models used. These models are AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier, SVM, KNearestNeighbour and RandomForestClassifier.
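The random 60-40 split described for create_dataset.py, and the way get_train_test.py hands the pickled sets to other scripts, can be sketched roughly as follows. This is a standard-library illustration under stated assumptions, not the repository's code: the function names split_60_40, save_split and load_split are hypothetical, rows are plain Python objects, and the actual scripts may well rely on other libraries.

```python
import pickle
import random
from pathlib import Path


def split_60_40(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) in a 60-40 ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = len(shuffled) * 6 // 10          # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def save_split(train, test, data_dir):
    """Pickle the two sets as train.pkl and test.pkl inside data_dir."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, part in (("train.pkl", train), ("test.pkl", test)):
        with open(data_dir / name, "wb") as fh:
            pickle.dump(part, fh)


def load_split(data_dir):
    """Load (train, test) back from data_dir, as an importing script would."""
    data_dir = Path(data_dir)
    parts = []
    for name in ("train.pkl", "test.pkl"):
        with open(data_dir / name, "rb") as fh:
            parts.append(pickle.load(fh))
    return tuple(parts)
```

Fixing the seed keeps the split reproducible across runs, which matters when cross validation and final evaluation results are compared between scripts.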