# Predicting Top 10 Billboard Hot 100 Hits

## Intro

The Billboard Hot 100 chart has tracked the most popular songs in the United States since its inception in 1958. This project centers on predicting songs' chart performance by leveraging Spotify's audio features and various popularity metrics sourced from Spotify. Through in-depth analysis of datasets that include audio characteristics such as tempo, danceability, and energy, along with Spotify's popularity scores and streaming data, the project aims to create a model capable of forecasting a song's potential success on the Hot 100 chart. By studying these relationships and identifying key factors that drive a song's charting potential, the project seeks to uncover patterns and insights that illustrate the evolving landscape of popular music over the years.

This project continues my previous data science project, a thorough analysis of the songs on the Billboard Hot 100 chart. That analysis looked closely at trends over the years, the performance of different genres, word usage, correlations between features, lyric sentiment, and more.

The goal of this project is to predict which songs can climb into the top 10 of the Billboard Hot 100 and to find out whether their audio features contribute to their chart position, while also taking Spotify popularity metrics into account. It also examines whether chart position can be predicted effectively at all, or whether it depends on unpredictable outside factors.

## Workflow

### 1. Gathering artist information

Using the Spotify API, I collected each artist's Spotify popularity and follower count. These became two new features describing the main artist of each song, since an artist's popularity often affects chart performance.
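As a rough illustration of this step, the sketch below reads the `popularity` and `followers.total` fields from an artist object returned by the Spotify Web API's "Get Artist" endpoint. The function names, the output keys, and the sample payload are hypothetical and are not taken from `artist_data.py`; the network call is defined but not executed, since it needs a valid access token.

```python
import json
import urllib.request

SPOTIFY_ARTIST_URL = "https://api.spotify.com/v1/artists/{artist_id}"


def extract_artist_features(artist_payload):
    """Pull the two popularity metrics out of a Spotify artist object."""
    return {
        # Hypothetical output keys; Spotify reports popularity on a 0-100 scale.
        "artist_popularity": artist_payload["popularity"],
        "artist_followers": artist_payload["followers"]["total"],
    }


def fetch_artist_features(artist_id, access_token):
    """Request one artist from the Web API (needs a valid OAuth access token)."""
    request = urllib.request.Request(
        SPOTIFY_ARTIST_URL.format(artist_id=artist_id),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_artist_features(json.load(response))


# Offline demonstration with a hand-written payload (made-up values).
sample_payload = {
    "name": "Example Artist",
    "popularity": 87,
    "followers": {"href": None, "total": 1234567},
}
sample_features = extract_artist_features(sample_payload)
print(sample_features)
```

Keeping the parsing separate from the HTTP request makes it possible to test the feature extraction without network access.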

### 2. Data analysis

In this part, I explored how the Spotify audio features, the popularity metrics, and chart performance correlate with one another. I also looked at how the features evolved over the years and at their distributions.
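A minimal sketch of this kind of exploration with pandas is shown here. The data is random and the column names (`peak_position`, `popularity`, and so on) are assumptions, so the printed numbers mean nothing; only the mechanics are of interest.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_songs = 200

# Synthetic stand-in data; column names and values are illustrative only.
songs = pd.DataFrame({
    "year": rng.integers(1960, 2021, n_songs),
    "danceability": rng.uniform(0, 1, n_songs),
    "energy": rng.uniform(0, 1, n_songs),
    "tempo": rng.normal(120, 25, n_songs),
    "popularity": rng.integers(0, 101, n_songs),
    "peak_position": rng.integers(1, 101, n_songs),
})

# Pairwise Pearson correlations between all numeric columns.
correlations = songs.corr()
print(correlations["peak_position"].sort_values())

# Average audio features per decade, to see how they evolve over time.
decade = songs["year"] // 10 * 10
decade_means = songs.groupby(decade)[["danceability", "energy"]].mean()
print(decade_means)
```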

### 3. Feature engineering

I expanded the dataset with a total of three additional columns. One represents the total number of songs that the artist of the corresponding song has on the chart. The second one is a category indicating a very popular artist, having a high number of total songs and a high percentage of top 10 songs. The third one was engineered through separating the data in clusters based on just the audio features and evaluating the percentage of top 10 songs that each cluster contains.
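The three engineered columns could be built roughly as follows. This is a sketch on synthetic data: the column names, the thresholds for a "very popular" artist, the use of k-means, and the number of clusters are all assumptions rather than the values used in the notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_songs = 300

# Synthetic stand-in data; column names and values are illustrative only.
songs = pd.DataFrame({
    "artist": rng.choice(["A", "B", "C", "D", "E"], n_songs),
    "danceability": rng.uniform(0, 1, n_songs),
    "energy": rng.uniform(0, 1, n_songs),
    "tempo": rng.normal(120, 25, n_songs),
    "is_top_10": rng.random(n_songs) < 0.15,
})

# 1) Total number of charting songs per artist.
songs["artist_song_count"] = songs.groupby("artist")["artist"].transform("count")

# 2) Flag artists with many songs and a high share of top 10 hits.
#    Both thresholds are arbitrary placeholders.
artist_hit_share = songs.groupby("artist")["is_top_10"].transform("mean")
songs["is_popular_artist"] = (songs["artist_song_count"] >= 50) & (artist_hit_share >= 0.15)

# 3) Cluster on audio features only, then attach each cluster's top 10 share.
audio_columns = ["danceability", "energy", "tempo"]
scaled_audio = StandardScaler().fit_transform(songs[audio_columns])
songs["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled_audio)
songs["cluster_hit_rate"] = songs.groupby("cluster")["is_top_10"].transform("mean")

print(songs.head())
```

Because the second and third features are derived from the target, computing them on the training split only and then mapping them onto the test split avoids leaking label information into the evaluation.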

### 4. Data preprocessing

The features in the initial data had different ranges and very uneven distributions with many outliers. For that reason, some of the features were transformed using a square root or logarithm and then clipped to upper and lower bounds, and the data was scaled.
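The following sketch shows one way to chain these steps for a single skewed feature. The percentile bounds and the min-max scaling are assumptions; the notebooks may use other bounds or a different scaler, and `np.sqrt` can replace `np.log1p` for milder skew.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic, heavily right-skewed feature standing in for something like follower counts.
followers = rng.lognormal(mean=10, sigma=2, size=1000)


def transform_clip_scale(values, lower_pct=1, upper_pct=99):
    """Log-transform, clip to percentile bounds, then min-max scale to [0, 1]."""
    transformed = np.log1p(values)                  # compress the long right tail
    lower, upper = np.percentile(transformed, [lower_pct, upper_pct])
    clipped = np.clip(transformed, lower, upper)    # pull remaining outliers to the bounds
    return (clipped - lower) / (upper - lower)      # rescale to a common range


scaled_followers = transform_clip_scale(followers)
print(scaled_followers.min(), scaled_followers.max())
```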

### 5. Top 10 Hit Classification

Different models were tested, leading to the conclusion that the Random Forest Classifier performs the best. The target classes are imbalanced, which hinders the model's performance. After oversampling the data and optimizing the hyperparameters, the model's performance improved, but it still leaves a lot to be desired.
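A compact sketch of this pipeline with scikit-learn is given here. It assumes plain random oversampling of the minority class and a small grid search scored on F1; the oversampling method, the parameter grid, and the scoring metric used in the notebook may differ, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(3)

# Synthetic imbalanced data: roughly 10% positives ("top 10 hits").
features = rng.normal(size=(600, 5))
labels = (features[:, 0] + rng.normal(scale=1.0, size=600) > 1.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

# Oversample the minority class in the training split only, so the test
# split keeps the original class balance.
minority = X_train[y_train == 1]
majority = X_train[y_train == 0]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
X_balanced = np.vstack([majority, minority_upsampled])
y_balanced = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_upsampled), dtype=int),
])

# Small illustrative grid; F1 balances precision and recall on the hit class.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 8]},
    scoring="f1",
    cv=3,
)
search.fit(X_balanced, y_balanced)

predictions = search.predict(X_test)
print(search.best_params_)
print(classification_report(y_test, predictions, zero_division=0))
```

Oversampling before cross-validation puts copies of the same song in different folds, so the cross-validated scores tend to be optimistic; the held-out test split gives the more honest picture.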

### 6. Spotify Popularity Prediction

After the top 10 classification, I also tried predicting Spotify popularity, to see whether it is easier to predict. This led to a model that is good, but still not great.
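For comparison, a regression attempt could look like the sketch below, evaluated with the $R^2$ score. The choice of a random forest regressor is an assumption (the regression model is not named here), and the data is synthetic, so the score it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Synthetic features and a noisy popularity target on Spotify's 0-100 scale.
features = rng.uniform(0, 1, size=(500, 4))
popularity = 40 + 30 * features[:, 0] + 15 * features[:, 1] + rng.normal(scale=8, size=500)
popularity = np.clip(popularity, 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    features, popularity, test_size=0.25, random_state=0
)

# Assumed model choice for illustration only.
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X_train, y_train)

test_predictions = regressor.predict(X_test)
test_r2 = r2_score(y_test, test_predictions)
print(f"R^2 on held-out songs: {test_r2:.2f}")
```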


## Structure

### Notebooks

- `gathering_data.ipynb` - find data for artist Spotify popularity and followers using the Spotify API.
- `data_analysis.ipynb` - analysis of the available data, exploring relations, distributions and trends.
- `data_manipulation.ipynb` - includes preprocessing and feature engineering.
- `predicting_top_10.ipynb` - creating the prediction model for classifying top 10 hits and optimizing it.
- `predicting_popularity.ipynb` - a quick side attempt at predicting the Spotify popularity.

### Modules

- `artist_data.py` - functions used for interacting with the Spotify API.
- `utils.py` - functions for printing quick model evaluations.

### Data

- `filled_spotify.csv` - the initial dataset, containing songs, their artists, chart performance, and audio features.
- `filled_artists_info.csv` - the dataset with two additional columns showing artists' popularity and followers on Spotify.
- `preprocessed_data.csv` - the dataset after preprocessing.

## Conclusion

After a long battle with choosing the right parameters, the model reached a decent state, where it misclassifies fewer songs as big hits while still identifying a good portion of the actual hits. The imbalanced classes hurt performance, since top 10 songs are harder to identify than, for example, top 50 songs. A quick attempt at predicting Spotify popularity gave a decent result, but the $R^2$ score was not that high, meaning there is still room for better predictions.

Before optimization, the model easily identified most of the top 10 songs, but at the cost of also misclassifying a **lot** more of the non-hit songs. Optimizing it led to fewer hits being identified, but also to **far** fewer misclassifications, which is the desirable result.
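This trade-off between finding every hit (recall) and avoiding false alarms (precision) can be made concrete with a toy example. The labels and scores below are hand-made, and moving a decision threshold is only one possible lever, not necessarily the one used in the notebook.

```python
def precision_recall(true_labels, scores, threshold):
    """Precision and recall for the positive class at a given score threshold."""
    predicted = [score >= threshold for score in scores]
    true_pos = sum(p and t for p, t in zip(predicted, true_labels))
    false_pos = sum(p and not t for p, t in zip(predicted, true_labels))
    false_neg = sum((not p) and t for p, t in zip(predicted, true_labels))
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hand-made toy data: 1 marks a real top 10 hit, 0 marks any other song.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.4, 0.35, 0.2, 0.1]

# A loose threshold catches every hit but flags many non-hits as hits.
loose = precision_recall(true_labels, scores, threshold=0.3)
# A strict threshold misses some hits but makes far fewer false alarms.
strict = precision_recall(true_labels, scores, threshold=0.65)

print("loose  (precision, recall):", loose)
print("strict (precision, recall):", strict)
```

With the loose threshold all four hits are found, but half of the flagged songs are not hits; with the strict threshold only two hits are found, but two of the three flagged songs are real hits.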

Overall, the project fulfilled its goal of finding out whether songs can be classified based on their audio features. The answer is... kind of. Clustering the songs on their audio features produced clusters with different percentages of hit songs. While the audio features do contribute to a song's position, the model put much more weight on the song's Spotify popularity, which proved to be far more important than the artist's popularity or followers. Of course, these are just metrics from Spotify. There are many other streaming services that can also contribute differently to the chart position.

In the end, the conclusion is that how a song sounds isn't the be-all and end-all of its popularity. It does give some insight, but overall the song's chart performance is just as dependent on marketing, social media trends, and even some luck.


## Future Improvements

The project laid solid groundwork for predicting chart performance, and there are many ways to expand it. First of all, not all approaches to model optimization were tested, so the Random Forest Classifier can still be tuned to perform better with different parameters and differently preprocessed data. It is even possible that another model could outperform it, given proper hyperparameter optimization. The most important continuation of this project is to include popularity metrics from other platforms such as TikTok, YouTube, and X. They would likely improve the top 10 hit classification.

There is also the possibility of remaking this project with a different approach - trying to predict which songs can enter the Billboard chart, using a dataset containing songs that are in and out of it. The goal is again similar, but the results could be very different.

## Used Resources

https://www.tandfonline.com/doi/full/10.1080/23270012.2023.2239824#d1e165 - Predicting song popularity based on Spotify's audio features

https://medium.com/@tariq.tq/how-to-handle-categorical-variables-in-clustering-1daa3b05bf25 - Song Popularity Predictor

https://tomasezequielrau.medium.com/clustering-forecasting-spotify-songs-audio-features-5b2c21f0a6b9 - Clustering & Forecasting | Spotify Songs Audio Features

https://medium.com/@tariq.tq/how-to-handle-categorical-variables-in-clustering-1daa3b05bf25 - How to Handle Categorical Variables in Clustering

https://web.stanford.edu/class/cs168/l/l4.pdf - CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

https://www.geeksforgeeks.org/machine-learning-outlier/ - How to Detect Outliers in Machine Learning

https://www.youtube.com/watch?v=jmAuVP_UOn0 - The A to Z of dealing with Outliers | Data Preprocessing | Data Science

https://medium.com/sfu-cspmp/surviving-in-a-random-forest-with-imbalanced-datasets-b98b963d52eb - Surviving in a Random Forest with Imbalanced Datasets

https://scikit-learn.org/stable/index.html - The scikit-learn library with documentation for the used models, metrics and preprocessing techniques.

https://developer.spotify.com/documentation/web-api - Spotify API documentation

https://www.v7labs.com/blog/precision-vs-recall-guide - Precision vs. Recall: Differences, Use Cases & Evaluation

