# Project Part 2: Training the Song Popularity Score Prediction Model

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/cdinh92/CS39AA-project/blob/main/project_part2.ipynb)

Welcome to the data science project undertaken for the CS39AA NLP Machine Learning class at MSU Denver. This project explores the music industry and investigates whether a predictive model can forecast a song's success, as measured by its popularity score. The analysis focuses on 8 key song features: danceability, energy, mode, loudness, speechiness, instrumentalness, tempo, and valence.

**Check the Project Part 1 [here](https://github.com/cdinh92/CS39AA-Project/blob/main/project_part1.ipynb)**

## 1. Introduction

After exploring the Spotify top songs dataset by Joakim Arvidsson on [Kaggle](https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs), I reduced it to nearly 15,000 songs in the clean file named [filtered_spotify_songs.csv](https://github.com/cdinh92/CS39AA-Project/blob/main/filtered_spotify_songs.csv). Now, in the training phase, three models, **Random Forest**, **Decision Tree**, and **Linear Regression**, are trained and compared.

**Initial Prediction:**
I expect **Random Forest** to perform best, since it tends to do well on large datasets in regression problems. However, this model poses a major challenge: it cannot extrapolate beyond the range of target values it saw during training. We'll dive deeper into this challenge later.
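
To make the extrapolation limit concrete, here is a minimal sketch on toy data. The data (`y = 2x` for `x` between 0 and 10) is made up for illustration and is not the Spotify dataset. A random forest averages target values stored in its leaves, so its prediction for a far-away input should stay within the range of training targets, while a linear model continues the trend.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical, not the Spotify dataset): y = 2x for x in [0, 10]
X_train = np.arange(0, 10.5, 0.5).reshape(-1, 1)
y_train = 2 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# Query a point far outside the training range
X_new = np.array([[20.0]])
forest_pred = forest.predict(X_new)[0]
linear_pred = linear.predict(X_new)[0]

print(f"Largest training target: {y_train.max():.1f}")
print(f"Random forest at x=20:   {forest_pred:.1f}")
print(f"Linear model at x=20:    {linear_pred:.1f}")
```

The forest's answer is capped by the largest target it was trained on, whereas the linear model follows the line out to `x = 20`.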

**Alternative Approach:**
_Recognizing that predictive models may not capture everything that makes a song successful, this project may also seek to identify common features among trending songs, in case the predictive results fall far short of expectations._

## 2. Training the prediction models

Let's explore the dataset and see if we can trim it by eliminating irrelevant columns.

In [23]:
# import all of the python modules/packages
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split
# Check the filtered spotify songs csv file
data = pd.read_csv("filtered_spotify_songs.csv")
# data = pd.read_csv("https://raw.githubusercontent.com/cdinh92/CS39AA-Project/main/filtered_spotify_songs.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14925 entries, 0 to 14924
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  14925 non-null  object 
 1   track_name                14923 non-null  object 
 2   track_artist              14923 non-null  object 
 3   track_popularity          14925 non-null  int64  
 4   track_album_id            14925 non-null  object 
 5   track_album_name          14923 non-null  object 
 6   track_album_release_date  14925 non-null  object 
 7   playlist_name             14925 non-null  object 
 8   playlist_id               14925 non-null  object 
 9   playlist_genre            14925 non-null  object 
 10  playlist_subgenre         14925 non-null  object 
 11  danceability              14925 non-null  float64
 12  energy                    14925 non-null  float64
 13  key                       14925 non-null  int64  
 14  loudne

**1. Random Forest**

In [36]:

features = ['track_popularity','danceability','energy','speechiness','loudness','mode','instrumentalness','valence','tempo']
data1 = data[features]
training_data, testing_data = train_test_split(data1, test_size=0.2, random_state=25)

print(f"No. of training examples: {training_data.shape[0]}")
print(f"No. of testing examples: {testing_data.shape[0]}")

# testing_data.info()

# Separate the popularity scores (target) from the test features
testing_data_score = testing_data['track_popularity']
testing_data = testing_data.drop('track_popularity', axis=1)
testing_data.head()

# Split the training data into features (x_train) and target (y_train)
x_train = training_data.drop('track_popularity', axis=1)
y_train = training_data['track_popularity']

# Shapes of x_train, y_train and the test features
x_train.shape, y_train.shape, testing_data.shape

# Random Forest regression
rf_model = RandomForestRegressor(n_estimators=50)
rf_model.fit(x_train, y_train)
# Note: this is R^2 on the training data, not on the held-out test set
print(rf_model.score(x_train, y_train))

# Make predictions on the test set
rf_predict = rf_model.predict(testing_data)

# 'Track ID' here is the DataFrame row index, not the Spotify track_id column
rf_output = pd.DataFrame({'Track ID': testing_data.index, 'Predicted Popularity Score': rf_predict, 'Actual Popularity Score': testing_data_score})
print(rf_output)

No. of training examples: 11940
No. of testing examples: 2985
0.8486015261934492
       Track ID  Predicted Popularity Score  Actual Popularity Score
9789       9789                       38.38                       47
5112       5112                       40.99                       57
7623       7623                       42.05                       57
12546     12546                       33.34                       15
11338     11338                       41.32                        2
...         ...                         ...                      ...
14542     14542                       33.10                       42
10268     10268                       39.98                       53
14339     14339                       48.18                       14
6036       6036                       43.44                       36
5596       5596                       43.76                       61

[2985 rows x 3 columns]
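
The score printed above is R² computed on the training data, so it says little about how well the model predicts unseen songs. As a rough illustration, the sketch below computes the mean absolute error (MAE) by hand on the ten test rows visible in the output. These are only the displayed rows (with predictions rounded to two decimals), not the full test set, so the number is indicative at best. The helper name `mean_abs_error` is my own.

```python
# First five and last five rows of the rf_output printed above
# (only the visible rows, not the full 2,985-row test set)
predicted = [38.38, 40.99, 42.05, 33.34, 41.32, 33.10, 39.98, 48.18, 43.44, 43.76]
actual = [47, 57, 57, 15, 2, 42, 53, 14, 36, 61]

def mean_abs_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

mae = mean_abs_error(actual, predicted)
print(f"MAE on the ten visible rows: {mae:.2f} popularity points")
```

On the full test set, the same quantity could be obtained with `mean_absolute_error(testing_data_score, rf_predict)` from `sklearn.metrics`.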


In [22]:
# Alternative approach: separate the target (y) and the features (X) before splitting
# Create target object and call it y
y = data.track_popularity
# Create X
features = ['danceability','energy','speechiness','loudness','mode','instrumentalness','valence','tempo']
X = data[features]

#Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define the model
rf_model = RandomForestRegressor(n_estimators=50)

# fit your model
rf_model.fit(train_X, train_y)

print(rf_model.score(train_X,train_y))

# Calculate the mean absolute error of the Random Forest model on the validation data
val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, val_predictions)



0.8466100035132159
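
The cell above prints only the training R² and stores `rf_val_mae` without displaying it. One way to see training and validation performance side by side is a small helper like the one sketched below. The `evaluate` function is hypothetical (not part of the original notebook), and the data comes from `make_regression` as a synthetic stand-in with 8 features, so the printed numbers say nothing about the Spotify results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def evaluate(model, train_X, train_y, val_X, val_y):
    """Return training R^2, validation R^2, and validation MAE for a fitted regressor."""
    return {
        "train_r2": model.score(train_X, train_y),
        "val_r2": model.score(val_X, val_y),
        "val_mae": mean_absolute_error(val_y, model.predict(val_X)),
    }

# Synthetic stand-in data (hypothetical): 8 noisy features, mirroring the 8 song features
X, y = make_regression(n_samples=500, n_features=8, noise=25.0, random_state=1)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(n_estimators=50, random_state=1).fit(train_X, train_y)
report = evaluate(model, train_X, train_y, val_X, val_y)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

With the notebook's own variables, the call would be `evaluate(rf_model, train_X, train_y, val_X, val_y)`. A large gap between the training and validation R² would suggest the forest is fitting noise in the training songs.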
