# Analyze Spotify Genre
##In this notebook, I fed the data into a multi-class classification algorithm to predict the genre of a song based on the audible attributes. For the purpose of limiting the genres to classify, I filtered for songs within just the top 7 most popular genres according to the genre analysis done in previous sections. 

##With some optimizations, the algorithm was able to predict a song's genre with about 80% accuracy in 1 prediction and with about 89% accuracy with 2 predictions. I was extremely satisfied with this result given the difficulty of 7-way classification, especially considering that I started with an accuracy of around 58% before model optimization and data changes.

## Data Preprocessing:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from google.colab import files
uploaded = files.upload()

In [None]:
track_table = pd.read_csv("cleaned_tracks_both.csv")
track_table.dropna(inplace=True)

A quick reminder about what genres we're trying to classify and the number of tracks per genre:

In [None]:
track_table.groupby("master_popular_genre").track_id.count()

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
data_numerics = pd.concat([track_table.select_dtypes(include=[np.number]),track_table['master_popular_genre']],axis=1, sort=False)

In [None]:
# One-hot encode the categorical audio features and drop loudness
track_table_dummies = pd.get_dummies(data_numerics, columns=['key', 'time_signature'], drop_first=True)
track_table_dummies = track_table_dummies.drop('loudness', axis=1)
# Move genre to the front so that it sits in column 0
genre = track_table_dummies['master_popular_genre']
track_table_dummies.drop('master_popular_genre', axis=1, inplace=True)
track_table_dummies.insert(0, 'master_popular_genre', genre)

One more look at the data we're using to train the model:

In [None]:
track_table_dummies.head()

In [None]:
X = track_table_dummies.iloc[:,1:]
y = track_table_dummies.iloc[:,0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Training the model:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
genre_order = ['country', 'hiphop', 'house', 'indie', 'pop', 'r&b', 'rock']

(Creating a few helper methods for printing model results before building the models)

In [None]:
def print_accuracy(genre_index, conf_matrix):
  print("Accuracy predicting", genre_order[genre_index], ":", conf_matrix[genre_index,genre_index]/(sum(conf_matrix[genre_index,:])))

In [None]:
def print_grid_results(grid, x_test, y_test):
  # Predict once and reuse the result; use the x_test argument, not the global X_test
  y_pred = grid.predict(x_test)
  conf_matrix = confusion_matrix(y_test, y_pred, labels=genre_order)
  print("The best score is {}".format(grid.best_score_))
  print("The best hyper parameter setting is {}".format(grid.best_params_))
  print("Model Accuracy:", accuracy_score(y_test, y_pred))
  print()
  for i in range(0,len(genre_order)):
    print_accuracy(i,conf_matrix)
  # Divide each row by its total so cells show the share of each true genre
  fig, ax = plt.subplots(figsize=(12,10))
  sns.heatmap(conf_matrix.astype('float') / conf_matrix.sum(axis=1)[:, np.newaxis], annot=True,fmt='.2%', cmap='Blues',xticklabels=genre_order,yticklabels=genre_order, ax=ax)
  ax.set(xlabel='Predicted Genre', ylabel='True Genre')
  ax.set_title("Proportional Genre Confusion Matrix")


### Model Selection

Piggybacking off the success of the Gradient Boost from the previous exploration, I initially tried using a Gradient Boost classifier for the multi-class classification. However, the Gradient Boost took 5-10 minutes per iteration. With cross-validation and a grid search of 3-4 different variables, this optimization took hours and yielded a poor testing accuracy of around 58%.

From there I trained several other models using different classifiers like logistic regressions, SVMs, different boosting methods, and tree methods. I found random forest to be by far the most accurate in its performance, so I started with just a rudimentary Random Forest model with 500 estimators.

In [None]:
X = track_table_dummies.iloc[:,1:]
y = track_table_dummies.iloc[:,0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

rf_param_grid = {
       #'n_estimators': range(350, 700, 100),
       'n_estimators': [500]
}
grid_rf = GridSearchCV(RandomForestClassifier(), rf_param_grid, cv=3, verbose=2).fit(X_train, y_train)

Let's look at a confusion matrix of how this crude Random Forest model predicted our genres and how the model performed overall:

In [None]:
conf_matrix = confusion_matrix(y_test, grid_rf.predict(X_test), labels=genre_order)
fig, ax = plt.subplots(figsize=(10, 10))
print(plot_confusion_matrix(X=X_test,y_true=y_test, labels=genre_order,estimator=grid_rf, ax=ax,values_format = 'd'))

Based on the above confusion matrix, we can see a few things:

1. Pop understandably dominates the dataset. This somewhat obscures the confusion matrix because there are so many more pop rows than any other genre. The lightly shaded vertical band in the pop column (column index 4) shows that our model is just blindly predicting pop in many cases where it's unsure. This poses a bit of an issue that I address later.
2. Conversely, R&B is significantly underrepresented in the dataset. Only 769 tracks are predicted correctly as R&B, while 532 R&B tracks are incorrectly labeled as pop.
3. While the absolute number of tracks predicted in each square of the confusion matrix is interesting, it would be more useful to see the proportion of true tracks in each square. For example, I want to see the percentage of true R&B tracks that are predicted as R&B, rather than the absolute number of tracks, so the matrix is more balanced. Thus in the following block, I outputted the accuracy scores as well as the proportional confusion matrix.
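
To make the idea in point 3 concrete, here is a minimal sketch of turning raw counts into row-wise proportions. The 3-genre matrix and its counts are made up for illustration and are not taken from the model's output.

```python
import numpy as np

# Toy confusion matrix: rows = true genre, columns = predicted genre.
# These counts are invented purely for illustration.
toy_genres = ["pop", "r&b", "rock"]
toy_conf = np.array([
    [900, 40, 60],
    [50, 30, 20],
    [80, 10, 110],
])

# Divide each row by its total so that every row sums to 1.
row_totals = toy_conf.sum(axis=1, keepdims=True)
toy_prop = toy_conf / row_totals

# The diagonal now holds the share of each true genre that was predicted correctly.
per_genre_recall = dict(zip(toy_genres, np.diag(toy_prop)))
print(per_genre_recall)
```

Read this way, a small genre that is mostly misclassified stands out even though its raw counts are tiny next to pop.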

In [None]:
print_grid_results(grid_rf,X_test,y_test)

The rudimentary RF model gives us a baseline performance of 68% in predicting the correct genre of a track.

This proportional confusion matrix is a bit more readable. It also further emphasizes the model's tendency to overpredict pop due to its prevalence in the dataset. In fact, we can see this by looking at which genre each true genre is most often mistaken for:

In [None]:
for i, row in enumerate(conf_matrix):
  errors = row.copy()  # copy so that conf_matrix itself is not modified
  errors[i] = 0        # drop the correct predictions on the diagonal
  incorrect_guess = genre_order[errors.argmax()]
  print("True Genre:", genre_order[i])
  print("Most Common Incorrect Prediction:", incorrect_guess)
  print("Share of All Incorrect Predictions as", incorrect_guess, ":", errors.max() / errors.sum())
  print()

Clearly, pop is dominating the dataset and overpowering the importance of less frequently occurring genres like R&B and house. Let's address that...

## Balancing the Dataset

Before optimizing the model's hyperparameters, I wanted to try to even out the distribution of tracks of each genre. I chose to test several outcomes and monitor how the model performed with each:
1. Only downsampling pop tracks
2. Only upsampling less common genres and leaving pop tracks as they are
3. Downsampling pop tracks AND upsampling less common genres
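
The three options can be sketched with one small helper. This is an illustration only: `rebalance` is a hypothetical function, the toy frame is invented, and it uses pandas' `DataFrame.sample` rather than the `sklearn.utils.resample` calls used in the cells below.

```python
import pandas as pd

def rebalance(frame, label_col, target, upsample=True, downsample=True, seed=42):
    """Resample each class toward `target` rows (hypothetical helper)."""
    parts = []
    for _, group in frame.groupby(label_col):
        if len(group) > target and downsample:
            # Too many rows: keep a random subset without replacement
            group = group.sample(n=target, replace=False, random_state=seed)
        elif len(group) < target and upsample:
            # Too few rows: draw with replacement, which duplicates some rows
            group = group.sample(n=target, replace=True, random_state=seed)
        parts.append(group)
    return pd.concat(parts)

# Invented toy data: 8 pop, 3 rock, 1 r&b
toy = pd.DataFrame({
    "genre": ["pop"] * 8 + ["rock"] * 3 + ["r&b"] * 1,
    "tempo": range(12),
})

down_only = rebalance(toy, "genre", target=4, upsample=False)   # option 1
up_only = rebalance(toy, "genre", target=4, downsample=False)   # option 2
both = rebalance(toy, "genre", target=4)                        # option 3

print(both["genre"].value_counts().to_dict())
```

Note that upsampling only repeats existing rows, so it changes how much weight a genre gets without adding any new information about it.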

### 1. Downsampling pop to 10,000 tracks



In [None]:
from sklearn.utils import resample

In [None]:
df_nonpop = track_table_dummies[track_table_dummies['master_popular_genre']!='pop']
df_pop = track_table_dummies[track_table_dummies['master_popular_genre']=='pop']
samp = 10000
# Downsample majority class
df_pop_downsampled = resample(df_pop,
                              replace=False,     # sample without replacement
                              n_samples=samp,    # target number of pop tracks
                              random_state=42)   # reproducible results
 
# Combine minority class with downsampled majority class
df_pop_downsampled = pd.concat([df_pop_downsampled, df_nonpop])
 
# Display new class counts
df_pop_downsampled.master_popular_genre.value_counts()

In [None]:
X = df_pop_downsampled.iloc[:,1:]
y = df_pop_downsampled.iloc[:,0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

rf_param_grid = {
       #'n_estimators': range(350, 700, 100),
       'n_estimators': [650]
}
grid_rf_under = GridSearchCV(RandomForestClassifier(), rf_param_grid, cv=3, verbose=2).fit(X_train, y_train)

In [None]:
print("Undersampling pop tracks:")
print_grid_results(grid_rf_under,X_test,y_test)

While the confusion matrix looks a bit more balanced and there isn't a strong overprediction in the pop genre, the accuracy did not improve much -- it's still 68%.

### 2. Upsampling r&b, house, indie, and rock to 10,000 tracks
Upsampling these genres added 1000, 2500, 3500, and 4500 to the rock, indie, house, and r&b genres respectively, to bring them all to 10000 tracks each.

In [None]:
def oversample(frame, genre, numTracks):
  df = frame[frame['master_popular_genre']==genre]
  df_upsampled = resample(df, replace=True, n_samples=numTracks)
  return df_upsampled

In [None]:
samp = 10000

df_rb = oversample(track_table_dummies,"r&b", samp)
df_house = oversample(track_table_dummies,"house", samp)
df_indie = oversample(track_table_dummies,"indie", samp)
df_rock = oversample(track_table_dummies,"rock", samp)

# Keep the genres that were not upsampled (pop, country, hiphop) as they are
upsampled_genres = ['r&b', 'house', 'indie', 'rock']
df_rest = track_table_dummies[~track_table_dummies['master_popular_genre'].isin(upsampled_genres)]

# Combine the untouched genres with the upsampled minority genres
df_upsampled = pd.concat([df_rest, df_rb, df_house, df_indie, df_rock])
 
# Display new class counts
df_upsampled.master_popular_genre.value_counts()

In [None]:
X = df_upsampled.iloc[:,1:]
y = df_upsampled.iloc[:,0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

rf_param_grid = {
       #'n_estimators': range(350, 700, 100),
       'n_estimators': [650]
}
grid_rf_up = GridSearchCV(RandomForestClassifier(), rf_param_grid, cv=3, verbose=2).fit(X_train, y_train)

In [None]:
print("Oversampling less frequent genres:")
print_grid_results(grid_rf_up,X_test,y_test)

Wow! With only upsampling the less common genres, the accuracy of the model jumps all the way to nearly 80%. This is a huge improvement from the 68% test accuracy of the original model on the imbalanced dataset.

### 3. Downsampling and upsampling all genres to ~10000 tracks

In [None]:
df_nonpop = df_upsampled[df_upsampled['master_popular_genre']!='pop']
df_pop = df_upsampled[df_upsampled['master_popular_genre']=='pop']
samp = 10000
# Downsample majority class
df_downsampled = resample(df_pop,
                          replace=False,     # sample without replacement
                          n_samples=samp,    # target number of pop tracks
                          random_state=42)   # reproducible results
 
# Combine minority class with downsampled majority class
df_both = pd.concat([df_downsampled, df_nonpop])
 
# Display new class counts
df_both.master_popular_genre.value_counts()

In [None]:
X = df_both.iloc[:,1:]
y = df_both.iloc[:,0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

rf_param_grid = {
       #'n_estimators': range(350, 700, 100),
       'n_estimators': [650]
}
grid_rf_balanced = GridSearchCV(RandomForestClassifier(), rf_param_grid, cv=3, verbose=2).fit(X_train, y_train)

In [None]:
print("Undersampling pop tracks and oversampling uncommon genres:")
print_grid_results(grid_rf_balanced,X_test,y_test)

Combining both upsampling and downsampling surprisingly lowers the accuracy slightly overall, but we can see that the predictions are a bit more balanced. The model does not default to pop as often, but the accuracy in predicting true pop tracks clearly suffered.

## Fully tuning the hyperparameters with the balanced dataset

Now that the dataset is better balanced, I wanted to leverage a grid search to tune the hyperparameters of the random forest model to improve performance as much as possible.

One thing to consider here is which of the above datasets to use to train the final model. There are really only 2 options of what to use, as the first method of only undersampling pop was significantly less accurate. The following two had roughly the same test accuracy (~80%).
* Method #2, in which underrepresented genres were upsampled
* Method #3, in which underrepresented genres were upsampled AND pop was downsampled

The benefit of using Method #2 is that the model has been trained on a higher proportion of pop relative to the other tracks. Given the prevalence of pop music in general, one might consider this a more accurate representation of the data that this model would typically see if given a real-world sample of tracks on Spotify. Because there are more pop tracks than any other, pop tracks would be more likely to occur, and this model would be trained under similar circumstances. So if the use case of the model was to intake random tracks on Spotify to categorize them by genre, this approach would make sense.

The benefit of using Method #3 is that this model likely has a better pure understanding of what audible attributes make up each genre, besides pop. Because it was fed a roughly equal number of tracks of each genre, it won't have the same bias toward pop as Method #2. However, as visualized by the confusion matrix in the Method #3 section, the model is fairly poor at predicting pop tracks successfully, yielding only 48% accuracy on pop compared to around 70% from Method #2. Since it's more accurate in every other genre, the overall accuracy is about the same. If the goal is to build a model that can take any 1 track, equally likely to be of any genre, this model will typically perform better. However, if fed a real-world distribution of Spotify tracks, it would likely perform worse given the prevalence of pop.

I chose Method #3 to build the model, as my goal was to most accurately predict a single song's genre without considering the likelihood that a randomly selected song would be pop. I think the approach of balancing all input tracks to around 10,000 makes the most sense for this purpose.
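
The trade-off above comes down to simple arithmetic: overall accuracy is the per-genre accuracy weighted by how often each genre shows up. The sketch below illustrates that under stated assumptions. Only the 0.48 figure for pop comes from the text; the 0.85 values for the other genres and the pop-heavy mix are placeholders I made up.

```python
import numpy as np

genres = ["country", "hiphop", "house", "indie", "pop", "r&b", "rock"]

# Per-genre accuracy for a Method #3 style model.
# Only the pop value (0.48) is from the text; the rest are hypothetical.
recall = np.array([0.85, 0.85, 0.85, 0.85, 0.48, 0.85, 0.85])

# Two hypothetical mixes of incoming tracks
balanced_mix = np.full(len(genres), 1 / len(genres))
pop_heavy_mix = np.array([0.10, 0.10, 0.10, 0.10, 0.40, 0.10, 0.10])

def expected_accuracy(recall, mix):
    """Overall accuracy = sum over genres of P(genre) * accuracy on that genre."""
    return float(np.dot(recall, mix))

balanced_acc = expected_accuracy(recall, balanced_mix)
pop_heavy_acc = expected_accuracy(recall, pop_heavy_mix)
print(round(balanced_acc, 3), round(pop_heavy_acc, 3))
```

With these placeholder numbers the same model scores noticeably lower on the pop-heavy mix, which is the effect described for Method #3.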

In [None]:
X = df_both.iloc[:,1:]
y = df_both.iloc[:,0]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

rf_param_grid = {
          #'bootstrap': [True, False],
          'bootstrap': [False],
          # 'max_depth': [10, 50, 100, None],
          'max_depth': [100],
          #'max_features': ['auto', 'sqrt'],
          'max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
          #'n_estimators': [600, 1000, 1400, 1800, 2000]
          'n_estimators': [1400]
}
grid_rf_opt = GridSearchCV(RandomForestClassifier(), rf_param_grid, cv=5).fit(X_train, y_train)

In [None]:
print("Undersampling pop tracks and oversampling uncommon genres:")
print_grid_results(grid_rf_opt,X_test,y_test)

After running the grid search to tune each hyperparameter, performance barely improved. Accuracy on pop did improve to about 52%, but overall accuracy is still about 80%. Regardless, improving from a 58% Gradient Boost, to a 68% Random Forest, to an 80% tuned Random Forest is still a great improvement and a satisfying result for a 7-way genre classification.
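
For a sense of what the search costs, here is a rough count of model fits, assuming every candidate value in the commented-out lines of the grid above was searched jointly with 5-fold cross-validation. The notebook may well have tuned the parameters in smaller batches, so treat this as an upper-bound sketch.

```python
from math import prod

# Candidate values copied from the commented-out lines of the tuning grid
full_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 50, 100, None],
    "max_features": ["auto", "sqrt"],
    "n_estimators": [600, 1000, 1400, 1800, 2000],
}
cv_folds = 5

# Every combination of values is fit once per cross-validation fold
n_combinations = prod(len(values) for values in full_grid.values())
n_fits = n_combinations * cv_folds

print(n_combinations, "combinations x", cv_folds, "folds =", n_fits, "fits")
```

Each of those fits trains a forest of 600 to 2000 trees, which is why narrowing the grid to the best values found keeps the cell rerunnable.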

## Let's see some predictions

Now, let's analyze the performance a bit more and take a look at what tracks the model is right and wrong about. We can even listen to some of them on Spotify and see how we would classify them ourselves.

In [None]:
preds = []
for prediction in grid_rf_opt.predict_proba(X_test):
  pred_ind = []
  prediction = list(prediction)
  pred_ind.append(max(prediction))
  pred_ind.append(prediction.index(max(prediction)))
  preds.append(pred_ind)

for pred in preds:
  pred[1] = genre_order[pred[1]]

In [None]:
in_list = []
for i in range(0,len(preds)):
  in_list.append(int(list(y_test)[i] == preds[i][1]))

In [None]:
prediction_conf = []
predicted_genre = []

for i in range(0,len(y_test)):
  prediction_conf.append(preds[i][0])
  predicted_genre.append(preds[i][1])


prediction_frame = pd.DataFrame({"Prediction Confidence":prediction_conf,
                                 "Predicted Correctly":in_list,
                                 "True Genre": y_test,
                                 "Predicted Genre": predicted_genre,
                                 "Track Name": track_table.track_name[y_test.index],
                                 "Artist Name": track_table.art_name[y_test.index]})

Let's first look at what songs the model was confident in. There are many songs that the RF outputted a predicted probability of 1, meaning that the model was very confident in these predictions.
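
The confidence value here is just the largest entry in each row of `predict_proba`. As a minimal sketch, the same table can be built without explicit loops. The probabilities and labels below are invented, and `toy_classes` stands in for the fitted model's `classes_` attribute.

```python
import numpy as np
import pandas as pd

toy_classes = np.array(["country", "hiphop", "pop"])  # stand-in for model.classes_
toy_proba = np.array([
    [1.00, 0.00, 0.00],
    [0.20, 0.70, 0.10],
    [0.35, 0.25, 0.40],
])
toy_true = ["country", "hiphop", "country"]

# Column index of the most probable class in each row
best = toy_proba.argmax(axis=1)

toy_frame = pd.DataFrame({
    "Prediction Confidence": toy_proba.max(axis=1),
    "Predicted Genre": toy_classes[best],
    "True Genre": toy_true,
})
toy_frame["Predicted Correctly"] = (
    toy_frame["Predicted Genre"] == toy_frame["True Genre"]
).astype(int)

print(toy_frame.sort_values("Prediction Confidence", ascending=False))
```

Mapping column indices through `classes_` rather than a hand-written list avoids depending on the two orderings happening to match.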

In [None]:
prediction_frame.drop_duplicates().sort_values(by="Prediction Confidence", ascending=False).head(20)

Here is the track by Jorja Smith that the model correctly predicted as R&B: [The One - Jorja Smith](https://open.spotify.com/track/1Ahp4PZ1vzdbzBCedUrsqI)

Here is a track by George Strait that the model correctly predicted as Country: [Easy Come, Easy Go - George Strait](https://open.spotify.com/track/0hqXlHVE94CTwXXWRdikbY?si=e9e81cb391b24512)

These both make sense, as they sound quintessentially R&B and country. More interesting will be looking at what tracks the model predicted *incorrectly*. Let's take a look at a few of those.

In [None]:
prediction_frame[prediction_frame['Predicted Correctly'] == 0].drop_duplicates().sort_values(by="Prediction Confidence", ascending=False).head(20)

Unsurprisingly, a ton of these missed predictions are true-genre pop. This is clearly the model's worst category at predicting, though this was expected after we went with the Method #3 approach. Let's listen to a few of these in particular:

The first one in the list sounds a lot like house to me, despite its genre label being pop: [J'ai Envie De Toi - Armin van Buuren](https://open.spotify.com/track/1Foo16rQ7mTzEk2Fb0CIOv?si=935e925f56eb4554) It makes sense why the model would have chosen house, as there are no lyrics and it has tons of elements of house music.

Let's look at a track that isn't true-genre pop: [Tee Pees 1-12 - Father John Misty](https://open.spotify.com/track/0iOtvXw6nRQmtUBiZm9YY6?si=71ff0f491e83441c) Again, no surprise here! The song has a number of country elements like the violin and tempo. This is another case where I would arguably side more with the classification than with the "True Genre", although the line between indie and country can certainly be fuzzy at times.

Finally, how about the last row above, in which the model predicted a rock track as house -- that seems like an unexpected confusion: [Pharaohs & Pyramids - Cut Copy](https://open.spotify.com/track/5KtZsj1eaI4uZBBYUOw3zC?si=2807bf8b32fc4540). Honestly, this is a pretty unusual song. I imagine if we allowed more genres, this may fall under some techno category, as it is certainly a mix of house and pop elements. Again, though, I don't quite see this as rock, so perhaps a lot of these errors are due to the erroneous labeling of the true values.

## Predict 2 Genres
As a last little experiment, I wanted to see how accurate the model would be if it was able to make 2 predictions for a track -- the 2 genres which it views as the most likely for that track.
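
This "2 guesses" measure is usually called top-k accuracy. Here is a minimal NumPy sketch of it on invented probabilities. `top_k_accuracy` is a hypothetical helper, not part of the notebook, and it assumes the columns of the probability matrix follow the order of `classes`.

```python
import numpy as np

def top_k_accuracy(proba, y_true, classes, k=2):
    """Share of rows whose true label is among the k most probable classes."""
    proba = np.asarray(proba)
    classes = np.asarray(classes)
    # argsort is ascending, so the last k columns hold the k largest probabilities
    top_k_idx = np.argsort(proba, axis=1)[:, -k:]
    top_k_labels = classes[top_k_idx]
    hits = (top_k_labels == np.asarray(y_true)[:, None]).any(axis=1)
    return float(hits.mean())

# Invented example: 3 tracks, 3 genres
toy_classes = ["hiphop", "pop", "rock"]
toy_proba = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
toy_true = ["hiphop", "rock", "hiphop"]

top1 = top_k_accuracy(toy_proba, toy_true, toy_classes, k=1)
top2 = top_k_accuracy(toy_proba, toy_true, toy_classes, k=2)
print(top1, top2)
```

scikit-learn also provides `sklearn.metrics.top_k_accuracy_score`, which should give the same kind of number when passed the `predict_proba` output along with the model's class labels.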

In [26]:
# Add prediction of 2nd highest probability genre
two_preds = []
for prediction in grid_rf_opt.predict_proba(X_test):
  prediction = list(prediction)
  two_maxes = []
  two_maxes.append(prediction.index(max(prediction)))
  prediction[prediction.index(max(prediction))] = 0
  two_maxes.append(prediction.index(max(prediction)))
  two_preds.append(two_maxes)

In [27]:
# Convert probability to genre name
for pred in two_preds:
  pred[0] = genre_order[pred[0]]
  pred[1] = genre_order[pred[1]]

In [28]:
# Create list to evaluate accuracy
in_list = []
for i in range(0,len(two_preds)):
  in_list.append(int(list(y_test)[i] in two_preds[i]))

In [29]:
print("Accuracy:" , sum(in_list) / len(in_list))

Accuracy: 0.8898551966412476


With 2 predictions, the model's accuracy improves by about 9 percentage points to 89%. That's pretty awesome, considering there are 7 different genres to choose from in this multiclass problem. I wonder which ~11% of tracks are still mislabeled here. These would be tracks where the model really missed badly.

In [30]:
pred_correctly = []
true_genre = []
predicted_genre = []

for i in range(0,len(y_test)):
  pred_correctly.append(in_list[i])
  true_genre.append(list(y_test)[i])
  predicted_genre.append(two_preds[i])

prediction_frame = pd.DataFrame({"Predicted Correctly":pred_correctly,
                                 "True Genre": true_genre,
                                 "Predicted Genres": predicted_genre,
                                 "Track Name": track_table.track_name[y_test.index],
                                 "Artist Name": track_table.art_name[y_test.index]})

In [31]:
display(prediction_frame[prediction_frame["Predicted Correctly"] == 0])

| | Predicted Correctly | True Genre | Predicted Genres | Track Name | Artist Name |
|---|---|---|---|---|---|
| 16294 | 0 | pop | [hiphop, r&b] | Comb | Skizzy Mars |
| 28261 | 0 | house | [rock, pop] | Miracle | Madeon |
| 64738 | 0 | rock | [country, hiphop] | Take a Sip | Skrizzly Adams |
| 16204 | 0 | hiphop | [pop, house] | Salute | Future |
| 31629 | 0 | indie | [hiphop, pop] | Devil Don't You Fool Me | Josh Farrow |
| ... | ... | ... | ... | ... | ... |
| 15727 | 0 | hiphop | [house, indie] | Boogieman | Childish Gambino |
| 66194 | 0 | pop | [house, rock] | Wild Life | OneRepublic |
| 27208 | 0 | house | [pop, country] | Light Me Up | RL Grime |
| 719 | 0 | country | [pop, rock] | Remind Me (with Carrie Underwood) | Brad Paisley |


Let's check some of these out:
*   [Take a Sip - Skrizzly Adams](https://open.spotify.com/track/3zgzbGes5o5X0ExPhj0zpl?si=49562c0c03cc47c4) - This one makes sense as a country/hiphop track, as it's a bit of a fusion between rap and country rock
*   [Salute - Future](https://open.spotify.com/track/1tjpoAROSHmr9QLb7Ibqoq?si=8b03d220e0a74389) - Not really sure what happened on this one. This is about as hiphop as you can get with the trap drums and rapping. This was a pretty bad miss. The house prediction was especially confusing.
*   [Boogieman - Childish Gambino](https://open.spotify.com/track/0SunFlwqT44E0BU0yrgM7u?si=5c52a9b7638143e7) - This one makes sense as just an extremely difficult song to predict. I would personally call this funk, perhaps? Of our genres, I think hiphop is a fair true label, but I'm not surprised the model missed this, given its unconventional sound.



## How to improve the model in the future
It seems that the best place to focus for future improvement of the genre-predicting model would be the true labeling of the tracks. As detailed in part 1, every track / artist outputted by the Spotify API has a number of genre tags, but there is not a single definitive genre for one track. To allow us to build a genre-predicting algorithm, the track genre tags were pooled into a dictionary and I chose the most frequently occurring genre to determine the True Genre for that track. This is somewhat imperfect and likely led to some faulty labeling of true genre. However, this was the best option available until Spotify labels the genre of the track themselves, or I come up with some other NLP improvements for the tag text.
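
For readers who skipped part 1, the majority-vote labeling can be sketched roughly as follows. The real pooling logic lives in part 1 and is not shown here, so `TAG_TO_MASTER` and `majority_genre` are hypothetical names, and the tag mapping is invented for illustration.

```python
from collections import Counter

# Hypothetical mapping from raw genre tags to the 7 master genres
TAG_TO_MASTER = {
    "dance pop": "pop",
    "pop rap": "hiphop",
    "trap": "hiphop",
    "atl hip hop": "hiphop",
    "indie folk": "indie",
}

def majority_genre(tags):
    """Return the most frequent master genre among a track's tags, or None."""
    counts = Counter(TAG_TO_MASTER[t] for t in tags if t in TAG_TO_MASTER)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Two tags vote hiphop and one votes pop, so the track is labeled hiphop
print(majority_genre(["trap", "pop rap", "dance pop"]))
```

A track whose tags split almost evenly still gets a single label under this scheme, which is one way a "True Genre" can end up disagreeing with how the song actually sounds.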