# Predicting a Song's Popularity and Genre Based on Its Audio Features

##### by Christian, Shria, Ricky, Amelia, Julia, Laura

In this report, we present our data science project:
"Predicting a Song's Popularity and Genre Based on Audio Features."

Our goal was to use audio features provided by the Spotify API to predict a song's popularity and genre. 

We collected data on songs that have been processed through the Spotify API, cleaned that data, tested several machine learning models, evaluated the models, and used visualizations to gain insights into the relationships between audio features and popularity/genre.

Finally, we built an interactive script that lets you search for a song by title and artist and see how our models evaluate its popularity and genre.

# Data Collection and Cleaning:

The data for our project was collected from Kaggle.com.

We found a dataset named: "Spotify - All Time Top 2000s Mega Dataset"
https://www.kaggle.com/datasets/iamsumat/spotify-top-2000s-mega-dataset

The data in the dataset appears to be sourced from the Spotify API and includes the artist's "Top Genre" as the genre of the song. The Spotify API has no genre data for individual songs; you can only find an artist's overall genre.

Here's what the data looks like:

In [21]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import seaborn as sns
import os
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from youtubesearchpython import VideosSearch

In [22]:
df = pd.read_csv('Spotify-2000.csv')
df

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,adult standards,2004,157,30,53,-14,11,68,201,94,3,71
1,2,Black Night,Deep Purple,album rock,2000,135,79,50,-11,17,81,207,17,7,39
2,3,Clint Eastwood,Gorillaz,alternative hip hop,2001,168,69,66,-9,7,52,341,2,17,69
3,4,The Pretender,Foo Fighters,alternative metal,2007,173,96,43,-4,3,37,269,0,4,76
4,5,Waitin' On A Sunny Day,Bruce Springsteen,classic rock,2002,106,82,58,-5,10,87,256,1,3,59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1989,1990,Heartbreak Hotel,Elvis Presley,adult standards,1958,94,21,70,-12,11,72,128,84,7,63
1990,1991,Hound Dog,Elvis Presley,adult standards,1958,175,76,36,-8,76,95,136,73,6,69
1991,1992,Johnny B. Goode,Chuck Berry,blues rock,1959,168,80,53,-9,31,97,162,74,7,74
1992,1993,Take Five,The Dave Brubeck Quartet,bebop,1959,174,26,45,-13,7,60,324,54,4,65


# Now to clean the data:

### Notice there are too many genres for us to train a model with

Having too many genres would have made it more difficult for the machine learning model to accurately predict the genre of a song. By reducing the number of genres to a smaller, more manageable set, the model can better learn the patterns in the data and make more accurate predictions.

In [23]:
print(df['Top Genre'].unique())

['adult standards' 'album rock' 'alternative hip hop' 'alternative metal'
 'classic rock' 'alternative pop rock' 'pop' 'modern rock'
 'detroit hip hop' 'alternative rock' 'dutch indie' 'garage rock'
 'dutch cabaret' 'permanent wave' 'classic uk pop' 'dance pop'
 'modern folk rock' 'dutch pop' 'dutch americana' 'alternative dance'
 'german pop' 'afropop' 'british soul' 'irish rock' 'disco' 'big room'
 'art rock' 'danish pop rock' 'neo mellow' 'britpop' 'boy band'
 'carnaval limburg' 'arkansas country' 'latin alternative' 'british folk'
 'celtic' 'chanson' 'celtic rock' 'hip pop' 'east coast hip hop'
 'dutch rock' 'blues rock' 'electro' 'australian pop' 'belgian rock'
 'downtempo' 'reggae fusion' 'british invasion' 'finnish metal'
 'canadian pop' 'bow pop' 'dutch hip hop' 'dutch metal' 'soft rock'
 'acoustic pop' 'acid jazz' 'dutch prog' 'candy pop' 'operatic pop'
 'trance' 'scottish singer-songwriter' 'mellow gold' 'alternative pop'
 'dance rock' 'atl hip hop' 'eurodance' 'blues' 'canad

## After looking through the genres, we chose the main ones we wanted to look at and labeled all others as "unknown"

The logic is: If the name 'pop' occurs in the song's specific genre, we will classify it as a pop song.

The same goes for all the other major genres that we noticed occurring.

These are the only genres we want to look at out of all of these.



In [24]:
genres = ['country', 'dance', 'electro', 'folk', 'hip hop', 'jazz', 'metal', 'other', 'pop', 'reggae', 'rock', 'soul']

# Map each specific genre to the first broad genre whose name it contains;
# anything that matches none of them becomes 'unknown'.
# Assigning through df.loc avoids pandas' SettingWithCopyWarning.
for i in range(len(df['Top Genre'])):
    genre = df['Top Genre'][i].lower()
    for g in genres:
        if g in genre:
            df.loc[i, 'Top Genre'] = g
            break
    else:
        df.loc[i, 'Top Genre'] = 'unknown'


## Here is the dataset now. Notice the "Top Genre" column has different data than it did before

In [25]:
df

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,unknown,2004,157,30,53,-14,11,68,201,94,3,71
1,2,Black Night,Deep Purple,rock,2000,135,79,50,-11,17,81,207,17,7,39
2,3,Clint Eastwood,Gorillaz,hip hop,2001,168,69,66,-9,7,52,341,2,17,69
3,4,The Pretender,Foo Fighters,metal,2007,173,96,43,-4,3,37,269,0,4,76
4,5,Waitin' On A Sunny Day,Bruce Springsteen,rock,2002,106,82,58,-5,10,87,256,1,3,59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1989,1990,Heartbreak Hotel,Elvis Presley,unknown,1958,94,21,70,-12,11,72,128,84,7,63
1990,1991,Hound Dog,Elvis Presley,unknown,1958,175,76,36,-8,76,95,136,73,6,69
1991,1992,Johnny B. Goode,Chuck Berry,rock,1959,168,80,53,-9,31,97,162,74,7,74
1992,1993,Take Five,The Dave Brubeck Quartet,unknown,1959,174,26,45,-13,7,60,324,54,4,65
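The relabeling rule described earlier (keep the first broad genre whose name appears in the specific genre, otherwise "unknown") can also be written as a small function and applied to the whole column at once. This is only a sketch: `to_broad_genre` and the sample values are names we made up for illustration, and the genre list is copied from the cell above.

```python
import pandas as pd

GENRES = ['country', 'dance', 'electro', 'folk', 'hip hop', 'jazz',
          'metal', 'other', 'pop', 'reggae', 'rock', 'soul']


def to_broad_genre(specific, genres=GENRES):
    """Return the first broad genre contained in `specific`, else 'unknown'."""
    specific = specific.lower()
    return next((g for g in genres if g in specific), 'unknown')


# A few genre strings taken from the unique-genre listing above.
sample = pd.Series(['adult standards', 'album rock', 'alternative hip hop',
                    'dance pop', 'alternative pop rock'])
broad = sample.map(to_broad_genre)
print(broad.tolist())
```

Because the first match wins, the order of the list matters: "dance pop" becomes "dance" and "alternative pop rock" becomes "pop", since both come before "rock" in the list.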


## Now we need to do a little bit of cleaning on the "Popularity" part of the data

Notice that popularity is a number between 0 and 100.

The Spotify API documentation describes "Popularity" as follows:

"The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past."

We want to be able to distinguish a highly popular song from an averagely popular one. Below we work out what level of popularity we should treat as the boundary between "High" and "Average".

In [26]:
print(len(df))
middle = (df['Popularity'] > 70).sum()

print(middle)
print(middle/len(df))


1994
507
0.2542627883650953
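Instead of trying candidate cutoffs by hand, the score that separates the top quarter can be read off directly as the 75th percentile. The sketch below uses a handful of made-up popularity scores, not the real column, and labels a song "High" when it is at or above that percentile.

```python
import numpy as np
import pandas as pd

# Made-up popularity scores, for illustration only.
scores = pd.Series([39, 59, 63, 65, 69, 69, 71, 74, 76, 83])

# Score below which 75% of the songs fall (linear interpolation by default).
cutoff = scores.quantile(0.75)

# Label each song relative to the cutoff.
labels = np.where(scores >= cutoff, 'High', 'Average')
share_high = (labels == 'High').mean()
print(cutoff, share_high)
```

On the real data the same two lines would use `df['Popularity']` in place of `scores`.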


## Cleaning the data for popularity now

For this dataset, since roughly the top 25% of songs (507 of 1,994) have a popularity rating above 70, we use 70 as the cutoff between "High" and "Average" popularity.

We chose the 25% figure ourselves. It seemed appropriate because the dataset consists of the "Top 2000 songs on Spotify": a song in roughly the top quarter of those by popularity is classified as "High", and a popularity score of 70 is approximately that cutoff. Note that the cell below also labels a score of exactly 70 as "High", so the "High" class ends up with 553 songs (about 28%) rather than the 507 counted above with `> 70`.

In [27]:
df['Popularity'] = np.where(df['Popularity'] < 70, 'Average', 'High')
df

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,unknown,2004,157,30,53,-14,11,68,201,94,3,High
1,2,Black Night,Deep Purple,rock,2000,135,79,50,-11,17,81,207,17,7,Average
2,3,Clint Eastwood,Gorillaz,hip hop,2001,168,69,66,-9,7,52,341,2,17,Average
3,4,The Pretender,Foo Fighters,metal,2007,173,96,43,-4,3,37,269,0,4,High
4,5,Waitin' On A Sunny Day,Bruce Springsteen,rock,2002,106,82,58,-5,10,87,256,1,3,Average
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1989,1990,Heartbreak Hotel,Elvis Presley,unknown,1958,94,21,70,-12,11,72,128,84,7,Average
1990,1991,Hound Dog,Elvis Presley,unknown,1958,175,76,36,-8,76,95,136,73,6,Average
1991,1992,Johnny B. Goode,Chuck Berry,rock,1959,168,80,53,-9,31,97,162,74,7,High
1992,1993,Take Five,The Dave Brubeck Quartet,unknown,1959,174,26,45,-13,7,60,324,54,4,Average


## Some durations contain commas, so we strip them and convert the column to float so that scikit-learn can process it

In [28]:
df['Length (Duration)'] = df['Length (Duration)'].str.replace(',', '').astype(float)
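To make the effect of this step concrete, here is a small sketch on made-up duration strings (the values are ours, not taken from the dataset). A duration over 999 seconds written with a thousands separator cannot be cast to a number until the comma is removed.

```python
import pandas as pd

# Made-up durations in seconds; the second one uses a thousands separator.
durations = pd.Series(['201', '1,412', '341'])

# Remove the literal comma, then convert the strings to floats.
cleaned = durations.str.replace(',', '', regex=False).astype(float)
print(cleaned.tolist())
```

An alternative we did not use is to pass `thousands=','` to `pd.read_csv`, which parses such columns as numbers when the file is loaded.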

## Now the data is completely preprocessed and ready for splitting and training a model

In [29]:
# Popularity: drop the non-feature columns, then separate the target
pop = df.drop(columns=['Title', 'Artist', 'Top Genre', 'Year'])
pop_y_data = pop['Popularity']
pop_x_data = pop.drop(columns='Popularity')

# Genres: same idea, with "Top Genre" as the target
gen = df.drop(columns=['Title', 'Artist', 'Year', 'Popularity'])
gen_y_data = gen['Top Genre']
gen_x_data = gen.drop(columns='Top Genre')

## Here is what our x_data looks like at the moment, cleaned and ready to split up

In [30]:
x_data = gen_x_data.drop('Index',axis=1)

x_data

Unnamed: 0,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness
0,157,30,53,-14,11,68,201.0,94,3
1,135,79,50,-11,17,81,207.0,17,7
2,168,69,66,-9,7,52,341.0,2,17
3,173,96,43,-4,3,37,269.0,0,4
4,106,82,58,-5,10,87,256.0,1,3
...,...,...,...,...,...,...,...,...,...
1989,94,21,70,-12,11,72,128.0,84,7
1990,175,76,36,-8,76,95,136.0,73,6
1991,168,80,53,-9,31,97,162.0,74,7
1992,174,26,45,-13,7,60,324.0,54,4


## Here is what our Popularity y data looks like at the moment, cleaned and ready to split up

In [31]:
pop_y_data

0          High
1       Average
2       Average
3          High
4       Average
         ...   
1989    Average
1990    Average
1991       High
1992    Average
1993    Average
Name: Popularity, Length: 1994, dtype: object

## Here is what our Genre y data looks like at the moment, cleaned and ready to split up

In [32]:
gen_y_data

0       unknown
1          rock
2       hip hop
3         metal
4          rock
         ...   
1989    unknown
1990    unknown
1991       rock
1992    unknown
1993    unknown
Name: Top Genre, Length: 1994, dtype: object

## Time to split and train each model.

We decided to use a Random Forest Classifier because we found it useful for datasets that are somewhat unbalanced. With about 72% "Average" and 28% "High" popularity songs, we considered our data imbalanced and thought a classifier that copes with this would be useful.

Our genre data is also quite unbalanced, so the Random Forest Classifier again seems useful.

You can see the balance of our data below:

In [33]:
print(df['Top Genre'].value_counts())

rock       788
unknown    506
pop        303
dance      154
metal       93
soul        45
folk        32
hip hop     29
country     17
electro     12
reggae      12
jazz         3
Name: Top Genre, dtype: int64


In [34]:
print(pop_y_data.value_counts())

Average    1441
High        553
Name: Popularity, dtype: int64


In [35]:
# Train/test split for popularity classification
X_train_pop, X_test_pop, y_train_pop, y_test_pop = train_test_split(x_data, pop_y_data, test_size=0.2, random_state=42)

# Train/test split for genre classification
X_train_genre, X_test_genre, y_train_genre, y_test_genre = train_test_split(x_data, gen_y_data, test_size=0.2, random_state=42)


# Model for popularity classification
popularity_classifier = RandomForestClassifier()
popularity_classifier.fit(X_train_pop, y_train_pop)
y_pred_pop = popularity_classifier.predict(X_test_pop)

# Model for genre classification
genre_classifier = RandomForestClassifier()
genre_classifier.fit(X_train_genre, y_train_genre)
y_pred_genre = genre_classifier.predict(X_test_genre)


## Now we can evaluate each model

## Popularity classification

In [36]:
accuracy_pop = accuracy_score(y_test_pop, y_pred_pop)
precision_pop = precision_score(y_test_pop, y_pred_pop, average='weighted')
recall_pop = recall_score(y_test_pop, y_pred_pop, average='weighted')
f1_pop = f1_score(y_test_pop, y_pred_pop, average='weighted')

print("Popularity Classification:")
print(f"Accuracy={accuracy_pop}, Precision={precision_pop}, Recall={recall_pop}, F1 Score={f1_pop}")

print('\n\n')

cv_scores_pop = cross_val_score(popularity_classifier, x_data, pop_y_data, cv=5)
print(f"Cross-validation scores for Popularity Classification: {cv_scores_pop}")
print(f"Average cross-validation score for Popularity Classification: {cv_scores_pop.mean():.2f}")

print('\n\n')

cm_pop = confusion_matrix(y_test_pop, y_pred_pop)

# Confusion matrix for popularity classification
pop_cm_df = pd.DataFrame(cm_pop, columns=np.unique(pop_y_data), index=np.unique(pop_y_data))
print("Pop Confusion Matrix:")
print(pop_cm_df)


Popularity Classification:
Accuracy=0.7393483709273183, Precision=0.6998230342882295, Recall=0.7393483709273183, F1 Score=0.6772438198535359



Cross-validation scores for Popularity Classification: [0.71679198 0.69924812 0.72180451 0.72681704 0.71105528]
Average cross-validation score for Popularity Classification: 0.72



Pop Confusion Matrix:
         Average  High
Average      280    12
High          92    15


The metrics shown above are evaluation metrics used to assess the performance of our popularity classification model. 

- Accuracy: This measures the proportion of correct predictions made by the model. Here the accuracy is 0.7393, which means the model correctly predicts the popularity class of 73.93% of the songs in the test set.
- Precision: Precision measures the fraction of predictions for a class that are actually correct. Because we use `average='weighted'`, the reported 0.6998 is the average of the per-class precisions weighted by class size, not the precision for "High" alone. From the confusion matrix, only 15 of the 27 songs predicted as "High" (about 56%) really were.
- Recall: Recall measures the fraction of a class's songs that the model correctly detects. The weighted recall of 0.7393 equals the accuracy by construction. For the "High" class alone, the model finds only 15 of 107 songs (about 14%).
- F1 Score: The F1 score is the harmonic mean of precision and recall, here computed per class and averaged with the same weights. The value of 0.6772 is pulled below the accuracy by the weak results on the "High" class.

Cross-validation is a technique used to assess the performance of a model by dividing the data into multiple folds, using each fold in turn as the test set and the remaining folds as the training set. In this case, the average cross-validation score is 0.72, close to the accuracy on our single test split, so that result was not a fluke of one split. However, 0.72 is also about the share of "Average" songs in the data (1,441 of 1,994), so the model barely improves on always predicting "Average".
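One way to put a cross-validation score in context is to compare it with a majority-class baseline. The sketch below does this on made-up data (150 "Average" and 50 "High" labels with random features), using scikit-learn's `DummyClassifier`; the data and variable names are ours and are not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up data: uninformative features and a 75/25 class split.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array(['Average'] * 150 + ['High'] * 50)

# A "model" that always predicts the most frequent training class.
baseline = DummyClassifier(strategy='most_frequent')

# For classifiers, cv=5 uses stratified folds, so each fold keeps the 75/25 split.
baseline_scores = cross_val_score(baseline, X, y, cv=5)
print(baseline_scores.mean())
```

A real classifier is only learning something useful about the minority class if it scores clearly above this number.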

The confusion matrix provides a more detailed view of the model's performance (rows are the actual classes, columns are the predicted classes). The model correctly predicted 280 songs as having average popularity and incorrectly predicted 12 average songs as high. On the other hand, it correctly predicted only 15 songs as having high popularity and incorrectly predicted 92 high-popularity songs as average.
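The per-class and weighted figures can be checked by hand from the confusion matrix printed above. The sketch below copies those four counts into an array and recomputes the metrics with plain NumPy; the variable names are ours.

```python
import numpy as np

# Counts copied from the printed matrix: rows = actual, columns = predicted.
cm = np.array([[280, 12],    # actual Average
               [92, 15]])    # actual High
labels = ['Average', 'High']

support = cm.sum(axis=1)     # number of actual songs per class
predicted = cm.sum(axis=0)   # number of predictions per class
correct = np.diag(cm)        # correctly classified songs per class

recall = correct / support
precision = correct / predicted
f1 = 2 * precision * recall / (precision + recall)

accuracy = correct.sum() / cm.sum()
weights = support / support.sum()
weighted_precision = (weights * precision).sum()
weighted_recall = (weights * recall).sum()
weighted_f1 = (weights * f1).sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
print(f"accuracy={accuracy:.4f}, weighted precision={weighted_precision:.4f}, "
      f"weighted F1={weighted_f1:.4f}")
```

The weighted recall always equals the accuracy, which is why the two printed values in the cell output are identical.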

To improve the performance of the model, we could try to balance the dataset so that it has equal numbers of high and average popularity songs. We could also try other feature selection and feature engineering techniques to create more meaningful features for the model to learn from, or try other models and compare their performance with the random forest classifier to see if they do better on this dataset.
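As one possible way to account for the imbalance without changing the data, scikit-learn lets us keep the class proportions equal in both splits (`stratify`) and weight the classes inversely to their frequency (`class_weight='balanced'`). The sketch below shows both on made-up data; we did not run this on the Spotify dataset, so its effect there is untested.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up data: random features and a 75/25 class split.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array(['Average'] * 75 + ['High'] * 25)

# stratify=y keeps the 75/25 proportion in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# class_weight='balanced' gives errors on the rarer class more weight.
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)

high_share_test = (y_test == 'High').mean()
print(high_share_test)
```

Fixing `random_state` on the classifier also makes the reported metrics reproducible from one run to the next.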

## Genre classification

In [37]:
# Evaluating genre classification
accuracy_genre = accuracy_score(y_test_genre, y_pred_genre)
precision_genre = precision_score(y_test_genre, y_pred_genre, average='weighted')
recall_genre = recall_score(y_test_genre, y_pred_genre, average='weighted')
f1_genre = f1_score(y_test_genre, y_pred_genre, average='weighted')
print()
print("Genre Classification:")
print(f"Accuracy={accuracy_genre}, Precision={precision_genre}, Recall={recall_genre}, F1 Score={f1_genre}")

print('\n\n')

# Cross-validation for genre classification
cv_scores_genre = cross_val_score(genre_classifier, x_data, gen_y_data, cv=5)
print(f"Cross-validation scores for Genre Classification: {cv_scores_genre}")
print(f"Average cross-validation score for Genre Classification: {cv_scores_genre.mean():.2f}")

print('\n\n')

# Confusion matrix for genre classification
genre_classes = np.unique(y_test_genre)
genre_cm = confusion_matrix(y_test_genre, y_pred_genre, labels=genre_classes)
genre_cm_df = pd.DataFrame(genre_cm, columns=genre_classes, index=genre_classes)
print("Genre Confusion Matrix:")
print(genre_cm_df)





Genre Classification:
Accuracy=0.42105263157894735, Precision=0.34717827984480704, Recall=0.42105263157894735, F1 Score=0.35872826035969296



Cross-validation scores for Genre Classification: [0.36090226 0.36842105 0.42857143 0.40601504 0.39949749]
Average cross-validation score for Genre Classification: 0.39



Genre Confusion Matrix:
         country  dance  electro  folk  hip hop  metal  pop  rock  soul  \
country        0      0        0     0        0      0    0     1     0   
dance          0      1        0     0        0      0    5    20     0   
electro        0      0        0     0        0      0    1     0     0   
folk           0      0        0     0        0      0    0     1     0   
hip hop        0      0        0     0        2      0    1     0     0   
metal          0      0        0     0        0      1    1    13     0   
pop            0      3        0     0        0      0    5    36     0   
rock           0      0        0     0        0      3    9 

The evaluation metrics for genre classification are lower than those for popularity classification. The accuracy of 0.4211 means that 42.11% of the test songs were assigned the correct genre. The weighted precision of 0.3472 is the per-genre precision (the share of songs predicted as a genre that actually belong to it), averaged with weights proportional to the size of each genre. The weighted recall of 0.4211 is the per-genre recall (the share of a genre's songs that the model found), averaged the same way, which is why it equals the accuracy. The F1 score of 0.3587 combines precision and recall through their harmonic mean, computed per genre and weighted in the same way.

The cross-validation scores for genre classification are also lower than those for popularity classification, with an average cross-validation score of 0.39. The cross-validation score measures the model's ability to generalize to new, unseen data, and a lower score indicates that the model does not perform as well on new data.

The confusion matrix shows the number of songs that were correctly and incorrectly classified for each genre. The model performs poorly for most genres, with songs from the smaller genres largely being assigned to the big classes such as "rock" and "unknown" (in the rows shown, 20 dance, 13 metal, and 36 pop songs are predicted as rock). It does quite well for rock songs, though, because we had significantly more rock songs in our dataset than any other genre. The imbalance hurt the overall accuracy of our model, but it helped us understand that a model finds it much easier to predict a class it has seen far more examples of.
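With classes of very different sizes, raw counts are hard to compare across rows. A row-normalized confusion matrix turns each row into fractions, so the diagonal is the recall for each genre. The sketch below uses a tiny made-up set of labels, not our test set, to show the idea.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ['jazz', 'pop', 'rock']

# Made-up example: a model that predicts 'rock' almost everywhere.
y_true = ['rock'] * 6 + ['pop'] * 3 + ['jazz']
y_pred = ['rock'] * 6 + ['rock', 'rock', 'pop'] + ['rock']

# normalize='true' divides each row by the number of actual songs in that class.
cm_norm = confusion_matrix(y_true, y_pred, labels=labels, normalize='true')
per_class_recall = np.diag(cm_norm)

for name, value in zip(labels, per_class_recall):
    print(f"{name}: recall={value:.2f}")
```

Here the overall accuracy is 70% even though one class is never predicted correctly, which is the same pattern we see with rock dominating our genre results.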

## Tying it all together

Below is an interactive script that lets you search for a song by title and artist and see how our models evaluate its popularity and genre.

Please give it a run and see what you think! Be sure to spell the name of the song and artist correctly.

In [38]:
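# Spotipy reads the credentials from the SPOTIPY_CLIENT_ID and
# SPOTIPY_CLIENT_SECRET environment variables. Set them before running
# this cell instead of hard-coding them in the notebook.
missing = [v for v in ('SPOTIPY_CLIENT_ID', 'SPOTIPY_CLIENT_SECRET') if v not in os.environ]
if missing:
    raise RuntimeError(f"Set these environment variables first: {missing}")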

spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

song_name = input('Enter a song name: ')
artist_name = input('Enter the artist of the song: ')
query = f"track:{song_name} artist:{artist_name}"
result = spotify.search(query, type='track', limit=1)


track = result['tracks']['items'][0]
song_id = track['id']
popularity = track['popularity']


audio_features = spotify.audio_features([song_id])[0]

nrgy = audio_features['energy']
dnce = audio_features['danceability']
BPM = audio_features['tempo']
dB = audio_features['loudness']
live = audio_features['liveness']
val = audio_features['valence']
dur = audio_features['duration_ms']
acous = audio_features['acousticness']
spch = audio_features['speechiness']

final_features = [BPM, nrgy, dnce, dB, live, val, dur, acous, spch]
feature_names = ['BPM', 'Energy', 'Danceability', 'Loudness', 'Liveness', 'Valence', 'Duration (ms)', 'Acousticness', 'Speechiness']


print(f"Audio features for {song_name} by {artist_name}:\n")
for feature, value in zip(feature_names, final_features):
    print(f"{feature}: {value}")
    
print('\n\n\n')

final_features_2d = np.array(final_features).reshape(1, -1)

popularity_prediction = popularity_classifier.predict(final_features_2d)
genre_prediction = genre_classifier.predict(final_features_2d)

print(f"Popularity Prediction: {popularity_prediction[0]}")
print(f"Spotify's Actual Popularity Rating (<70 we consider average popularity): {popularity}")
print(f"Genre Prediction: {genre_prediction[0].title()}")


videosSearch = VideosSearch(query, limit=1)
data = videosSearch.result()
link = data['result'][0]['link']

print('\n\nSONG LINK')
print(link)

Enter a song name: rocket man
Enter the artist of the song: elton john
Audio features for rocket man by elton john:

BPM: 136.571
Energy: 0.532
Danceability: 0.601
Loudness: -9.119
Liveness: 0.0925
Valence: 0.341
Duration (ms): 281613
Acousticness: 0.432
Speechiness: 0.0286




Popularity Prediction: Average
Spotify's Actual Popularity Rating (<70 we consider average popularity): 83
Genre Prediction: Pop






SONG LINK
https://www.youtube.com/watch?v=DtVBCG6ThDk
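One thing to watch in the script above is units. The features printed for the looked-up song are on the API's scale (energy 0.532, duration 281613 ms), while the rows in our training table look like percentages and seconds (Energy 30, Length 201). The sketch below converts an API response to the training scale before prediction. It assumes that the Kaggle columns are simply the API values multiplied by 100 and rounded, with duration in seconds; we inferred this from the sample rows rather than from any documentation. `to_training_scale` is a helper name we made up.

```python
def to_training_scale(audio_features):
    """Convert Spotify API audio features to the units assumed for the CSV.

    Assumption: the 0-1 features are stored as rounded percentages and the
    duration is stored in seconds. Column order matches our x_data.
    """
    def pct(key):
        return round(audio_features[key] * 100)

    return [
        round(audio_features['tempo']),               # Beats Per Minute (BPM)
        pct('energy'),                                # Energy
        pct('danceability'),                          # Danceability
        round(audio_features['loudness']),            # Loudness (dB)
        pct('liveness'),                              # Liveness
        pct('valence'),                               # Valence
        round(audio_features['duration_ms'] / 1000),  # Length (Duration), seconds
        pct('acousticness'),                          # Acousticness
        pct('speechiness'),                           # Speechiness
    ]


# Values copied from the "rocket man" output above.
rocket_man = {
    'tempo': 136.571, 'energy': 0.532, 'danceability': 0.601,
    'loudness': -9.119, 'liveness': 0.0925, 'valence': 0.341,
    'duration_ms': 281613, 'acousticness': 0.432, 'speechiness': 0.0286,
}
print(to_training_scale(rocket_man))
```

Passing this list (reshaped to one row) to the classifiers instead of the raw API values would put the input on the same scale the models were trained on. We have not re-run the predictions this way, so the output above is unchanged.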


## Overall, this project helped us to gain insight into creating our own machine learning models from scratch to predict things we find interesting.

We see potential for a project where an artist can create a new song, get its audio features, and run them through our model to see its predicted "Popularity" along with the genre it would be classified into on Spotify.

This could be a future use of a model like this.

## This could be done better than we did it

To improve the accuracy of our classifiers, we can consider the following options:

1. Collect more data: Having more data, especially for underrepresented genres, can help the model to learn better and improve its accuracy.

2. Feature engineering: Adding or modifying features that are more relevant to the problem can help the model to learn better.

3. Model selection: Trying out different models and comparing their performance can help to find a better model for the problem.

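4. Balancing the data more: Having more even numbers of songs per class can keep the model from favoring the largest class, as it did with rock.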
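As a starting point for the model selection option, several classifiers can be compared under the same cross-validation setup. The sketch below does this on synthetic data from `make_classification` with a 75/25 class split. The dataset, the choice of candidate models, and the use of balanced accuracy as the score are our assumptions for illustration, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for our nine audio features with an imbalanced target.
X, y = make_classification(n_samples=300, n_features=9, n_informative=4,
                           weights=[0.75, 0.25], random_state=0)

candidates = {
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'gradient_boosting': GradientBoostingClassifier(random_state=0),
    'logistic_regression': LogisticRegression(max_iter=1000),
}

# Balanced accuracy averages the recall of each class, so a model cannot
# score well just by always predicting the majority class.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring='balanced_accuracy').mean()
    for name, model in candidates.items()
}

for name, score in sorted(mean_scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

On the real data, `X` and `y` would be replaced by `x_data` and either `pop_y_data` or `gen_y_data`.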