Hidden Hits
-
Hidden Hits 1 - Data Collection (written by Mir Adnan Mahmood) contains code used to generate a dataset containing information on songs that appeared in the Billboard Hot 100's year end list from 2010 to 2020. The dataset contains the following features:
- Information on Title, Artists, Year of appearance on Hot 100 and Rank in Hot 100 are scraped through Billboard's website
- Information on Lyrics is scraped (using information on song title name and artist names) through Genius.com
The final output for this notebook is the dataset
hot_100_lyrics.csv
file.
-
Hidden Hits 2 - Spotify-Kaggle Dataset Clean (written by Mir Adnan Mahmood) contains code that extracts song features for songs that appear on the Billboard Hot 100s Year End lists from the Kaggle Dataset created by Yamac Eren Ay found here. The code first creates a subset of the Hot 100's data
tracks_hot100.csv
and supplements it with another dataset that contains song features for songs that did not appear on the Hot 100s list. This dataset is found intracks_not100_sub.csv
. -
Hidden Hits 3 - Hot 100 Classification (written by Khyati Malik and JongHoon Shin) contains the code to run the classification algorithm that predicts whether or not a song is a Hot 100 or a Not 100 song using a Random Forest Classifier. Results for different models and features are published in the
NewUpdate
folder. Note: This file is the same as thehot_100_classification_v1
notebook in theNewUpdate
folder. -
Hidden Hits 4 - TF-IDF Measure and Clustering (written by Mike Pierce and Ritwik Banerji) contains code to quantize the space of lyrics using TF-IDF measure, then group songs using KMeans clustering.
All old notebooks and data files are in the scraps
folder.
The .gitignore
file ensures that (1) we don't upload any of the .ipynb_checkpoints
files,
and (2) we don't upload a directory named data
where we should keep our large data sets
because GitHub won't let us upload too large of files.
- Mir Adnan Mahmood — Data Collection and Cleaning
- Ritwik Banerji — Unsupervised Learning: classification of songs via lyrics
- Mike Pierce — Unsupervised Learning: classification of songs via lyrics
- Khyati Malik — Supervised Learning: Classification of Hot-or-Not songs based on sonic features
- JongHoon Shin — Supervised Learning: Classification of Hot-or-Not songs based on sonic features
- Rabail Chandio — Post-processing and consolidation
This project was created as part of the Erdös Institute's DataScience Bootcamp 2021.