# Project Part 1: Predicting Song Popularity Score

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/cdinh92/CS39AA-project/blob/main/project_part1.ipynb)

Welcome to the data science project undertaken for the CS39AA NLP Machine Learning class at MSU Denver. The aim of this exploration is to investigate whether a predictive model can forecast a song's success, measured by its popularity score. The analysis focuses on 8 key song features: danceability, energy, mode, loudness, speechiness, instrumentalness, tempo, and valence.

## 1. Introduction/Background

In this project, I use a dataset of roughly 30,000 songs collected from the Spotify API and published by Joakim Arvidsson on [Kaggle](https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs). It is a 23-column table with 32,833 popular songs, mostly from the 2010s through 2020. I will use only the 8 features listed above, each of which corresponds to a column in the dataset, to train the model and test the accuracy of its results.

**8 Key Feature Definitions:**
_(Using the same column names as the dataset)_

| Feature | Description |
| --- | ----------- |
| **track_popularity** | Song popularity score (0-100); higher means more popular. This is the prediction target. |
| mode | Modality of a track: major is represented by 1 and minor by 0. |
| danceability | How suitable a track is for dancing, from 0.0 (least danceable) to 1.0 (most danceable). |
| energy | Scale from 0.0 to 1.0. Energetic tracks feel fast, loud, and noisy. |
| loudness | Overall loudness of a track in decibels (dB), typically between -60 and 0 dB. |
| speechiness | Presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words. |
| instrumentalness | Values above 0.5 are intended to represent instrumental tracks. |
| tempo | Overall estimated tempo of a track in beats per minute (BPM). |
| valence | Higher values indicate a more positive mood for the song. |

**Initial Prediction:**
_Anticipating the factors that contribute to a song's popularity is a complex task. I expect that danceability, energy, tempo, and particularly valence will emerge as key components influencing a song's popularity score. However, it is crucial to acknowledge the broader context of the music industry, where artist names and strategic marketing by big companies often have a much greater influence. The dynamics of the industry mean that independent artists may face unique challenges in breaking into mainstream charts._

**Alternative Approach:**
_While recognizing that predictive models may not capture every factor behind a song's success, this project may also seek to identify common features among trending songs, in case the predictive results fall far short of expectations._

**About the Predictive Model:**

_The model follows the concept of **regression**, in which a model learns the relationship between a set of input features and a numeric outcome. In this case, the dataset consists of input features (valence, mode, energy, ...) and corresponding output targets (the actual popularity scores of the songs). The goal is to train the model to make accurate predictions on new inputs._

_The model will then be used to predict the popularity scores in another dataset: Spotify top 50 songs in 2021 on [Kaggle](https://www.kaggle.com/datasets/equinxx/spotify-top-50-songs-in-2021). This will be the primary test of how accurately the model performs on recent musical tastes._
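
The regression idea described above can be sketched in a few lines. This is only an illustration on made-up numbers, not the model used in this project: the toy (valence, popularity) pairs, the single-feature fit, and the helper names `fit_line` and `mean_absolute_error` are all assumptions chosen to keep the example short.

```python
# Minimal one-feature regression sketch on toy data (hypothetical helpers).

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of x-y co-deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up training pairs: (valence, popularity)
train_x = [0.1, 0.3, 0.5, 0.7, 0.9]
train_y = [20, 35, 50, 65, 80]

# Made-up held-out pairs, standing in for a separate test dataset
test_x = [0.2, 0.8]
test_y = [30, 70]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
mae = mean_absolute_error(test_y, predictions)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, MAE={mae:.2f}")
```

The same pattern (fit on one set of songs, measure the error on songs the model has not seen) carries over when all 8 features are used instead of one.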

## 2. Exploratory Data Analysis

Let's explore the dataset and see whether we can trim it by eliminating irrelevant columns.

In [None]:
# import all of the python modules/packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# read the csv file
raw_df = pd.read_csv("https://raw.githubusercontent.com/cdinh92/CS39AA-Project/main/spotify_songs.csv")
raw_df.describe()

**1. Modify the dataset**

There are 32,833 songs in the file. I will check for duplicate entries and null values, and drop some columns.

In [None]:
# Count null values in each column
raw_df.isna().sum()

Three columns, ***track_name***, ***track_artist***, and ***track_album_name***, contain null values. I will leave them as they are because these features are not significant for the model (I will drop them all at the end).

In [None]:
# Add a 'year' column holding each song's release year
def extract_year(date_str):
    # format='mixed' (pandas 2.0+) infers the format of each date string individually
    return pd.to_datetime(date_str, format='mixed').year

raw_df['year'] = raw_df['track_album_release_date'].apply(extract_year)
raw_df1 = raw_df.drop(columns=['track_album_release_date'])

# Remove duplicated songs by 'track_id', keeping only the first occurrence.
# Chain from raw_df1 (not raw_df) so the column drop above is preserved.
raw_df1 = raw_df1.drop_duplicates(subset=['track_id'], keep='first')
raw_df1.describe()

The dataset is reduced to 28,356 songs after removing duplicates. However, I want to use only ***songs released from 2015 to 2020*** to reflect modern music taste, because the model trained on this dataset will be used to predict popular songs in the 2021 Spotify dataset.

In [None]:
# Keep recent songs only. Note: `> 2015` keeps 2016 onward; use `>= 2015` to include 2015 as well.
raw_df2 = raw_df1[raw_df1['year'] > 2015]
raw_df2.describe()

Now the dataset has nearly 15,000 songs and is ready for further steps in the **Training Model** part.
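
The cleaning steps above (year extraction, de-duplication on `track_id`, and the year filter) can be checked on a small frame before running them on the full file. The sketch below is an illustration on made-up rows, not the project's data: the `toy` frame and its values are hypothetical. It also assumes every release date starts with a four-digit year, so the year is taken by string slicing rather than `format='mixed'`. That is a swapped-in alternative which does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical rows: one duplicated track_id and dates in mixed formats
toy = pd.DataFrame({
    "track_id": ["a1", "b2", "a1", "c3"],
    "track_album_release_date": ["2019-06-14", "2012", "2019-06-14", "2016-03"],
    "track_popularity": [66, 40, 66, 71],
})

# Assumption: each date string begins with the four-digit year
toy["year"] = toy["track_album_release_date"].str[:4].astype(int)

# Keep the first occurrence of each track_id
deduped = toy.drop_duplicates(subset=["track_id"], keep="first")

# Same comparison as the notebook cell: years strictly after 2015
recent = deduped[deduped["year"] > 2015]

print(f"{len(toy)} rows -> {len(deduped)} after de-duplication -> {len(recent)} after the year filter")
```

Printing the row count after each step makes it easy to confirm that no step silently discards the work of an earlier one.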

In [None]:
# Export the filtered DataFrame so it can be downloaded
from IPython.display import FileLink

# Specify the file name
csv_file_name = 'filtered_spotify_songs.csv'
# Export the DataFrame to a CSV file
raw_df2.to_csv(csv_file_name, index=False)
# Create a link to download the CSV file
FileLink(csv_file_name)

In [None]:
# Check the filtered spotify songs csv file
data = pd.read_csv("https://raw.githubusercontent.com/cdinh92/CS39AA-Project/main/filtered_spotify_songs.csv")
data.describe()

**2. Dataset Visualization**

***Figures: 8 key features vs. Popularity***

In [None]:
# The 8 input features to compare against popularity
columns_to_plot = ['danceability', 'energy', 'mode', 'loudness', 'speechiness', 'instrumentalness', 'tempo', 'valence']

# A 3x3 grid gives 9 panels; only the first 8 are used
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.flatten()

for i, column in enumerate(columns_to_plot):
    plt.sca(axes[i])
    plt.bar(data[column], data['track_popularity'], color='blue')
    plt.xlabel(column, fontsize=12)
    plt.ylabel('Popularity', fontsize=12)
    plt.title(f'{column} vs. Popularity', fontsize=14)
    plt.grid(axis='y')

# Hide the unused ninth panel
for ax in axes[len(columns_to_plot):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()
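
A numeric summary can complement the plots above. The sketch below computes the Pearson correlation between each feature and the popularity score with pandas' `DataFrame.corrwith`. It runs on a made-up frame, so the `toy` values and the two-feature subset are assumptions for illustration only. They say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical values: danceability rises with popularity, energy falls
toy = pd.DataFrame({
    "danceability": [0.2, 0.4, 0.6, 0.8],
    "energy": [0.9, 0.7, 0.5, 0.3],
    "track_popularity": [10, 30, 50, 70],
})

features = ["danceability", "energy"]

# Pearson correlation of each feature column with the target column
correlations = toy[features].corrwith(toy["track_popularity"])

print(correlations.sort_values(ascending=False))
```

On the project data, the same call could be made with `data[columns_to_plot].corrwith(data['track_popularity'])`. Values near 0 would suggest that a feature has little linear relationship with popularity.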

***Figure: Top 10 Tracks by Popularity***

In [None]:
# Sort the DataFrame by 'track_popularity' in descending order and keep the top 10 rows
top_10_tracks = data.sort_values(by='track_popularity', ascending=False).head(10)

# Create a bar plot using Seaborn
plt.figure(figsize=(12, 6))
sns.barplot(x='track_popularity', y='track_name', data=top_10_tracks, palette='viridis')

# Customize the plot display
plt.title('Top 10 Tracks by Popularity')
plt.xlabel('Popularity Score')
plt.ylabel('Track Name')

# Show the plot
plt.show()