# **Spotify Popularity Score Forecasting Based on Song Characteristics**

This notebook explores the relationship between song characteristics and Spotify popularity scores. Digital streaming platforms have made music more accessible than ever, and Spotify has become one of the leading platforms for discovering and listening to music.

First, we will perform Exploratory Data Analysis (EDA) to gain a deeper understanding of our dataset. By visualizing and analyzing the relationships between various song features and the corresponding popularity scores, we will uncover any existing trends or correlations that could serve as valuable predictors.

**The questions we seek to answer in the Exploratory Data Analysis portion are:**
1. What is the distribution of Spotify popularity scores in the dataset? Are there any patterns or trends?
2. What are the most significant song characteristics provided in the dataset? How do they vary across popular and less popular songs?
3. Is there a correlation between specific song features (such as tempo, energy, danceability, etc.) and Spotify popularity scores? Which features show the strongest correlation?
4. Are there any outliers or unusual data points in terms of song characteristics or popularity scores? What might be the reasons behind these outliers?
5. How do different genres or artist categories relate to Spotify popularity scores? Are there specific genres or artists that tend to have higher popularity scores?
6. Are there any temporal trends in Spotify popularity scores? Do certain periods or years show higher or lower average popularity scores?
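As a rough illustration of question 3, the sketch below ranks a few features by the strength of their linear correlation with popularity. It uses only the five preview rows shown later in this notebook, which is far too few to draw conclusions from; the variable names `preview_df` and `corr_with_popularity` are made up for this example, and the same calls would apply to the full `track_df`.

```python
import pandas as pd

# Five rows copied from the dataset preview; a toy sample, not the full dataset.
preview_df = pd.DataFrame({
    "popularity":   [68, 50, 57, 58, 54],
    "danceability": [0.483, 0.572, 0.409, 0.392, 0.43],
    "energy":       [0.303, 0.454, 0.234, 0.251, 0.791],
    "tempo":        [133.406, 140.182, 139.832, 204.961, 171.864],
})

# Pearson correlation of every feature with popularity,
# ordered from strongest to weakest absolute correlation.
corr_with_popularity = (
    preview_df.corr()["popularity"]
    .drop("popularity")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_popularity)
```

Pearson correlation only captures linear relationships, so a feature with a weak coefficient may still matter to a non-linear model.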

Once we have a solid grasp of our data, we will train and evaluate a machine learning model. Using standard algorithms such as regression or classification models, we will fit the model to our dataset and tune its hyperparameters to optimize performance. Our aim is to build a predictive model that estimates the popularity score of a song from its characteristics, giving us insight into what makes a track resonate with listeners on Spotify.

# **Setup**
The next cell imports all the Python libraries needed for the project.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme()
plt.rcParams['font.family'] = 'sans-serif'

# **Import dataset**

The dataset contains the artist name, track name, id, popularity, year, genre, and song characteristics.

In [2]:
# Load the dataset and preview the first 5 rows

track_df = pd.read_csv("../data/spotify_data.csv", index_col=0)
track_df.head()

Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,68,2012,acoustic,0.483,0.303,4,-10.058,1,0.0429,0.694,0.0,0.115,0.139,133.406,240166,3
1,Jason Mraz,93 Million Miles,1s8tP3jP4GZcyHDsjvw218,50,2012,acoustic,0.572,0.454,3,-10.286,1,0.0258,0.477,1.4e-05,0.0974,0.515,140.182,216387,4
2,Joshua Hyslop,Do Not Let Me Go,7BRCa8MPiyuvr2VU3O9W0F,57,2012,acoustic,0.409,0.234,3,-13.711,1,0.0323,0.338,5e-05,0.0895,0.145,139.832,158960,4
3,Boyce Avenue,Fast Car,63wsZUhUZLlh1OsyrZq7sz,58,2012,acoustic,0.392,0.251,10,-9.845,1,0.0363,0.807,0.0,0.0797,0.508,204.961,304293,4
4,Andrew Belle,Sky's Still Blue,6nXIYClvJAfi6ujLiKqEq8,54,2012,acoustic,0.43,0.791,6,-5.419,0,0.0302,0.0726,0.0193,0.11,0.217,171.864,244320,4
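The `index_col=0` argument matters here, presumably because the CSV was saved with its row index as an unnamed first column. The sketch below shows the difference on a tiny in-memory CSV; `csv_text` is a made-up stand-in for the real file, reusing two rows from the preview.

```python
import io
import pandas as pd

# Stand-in for a CSV written with its index: note the empty first header field.
csv_text = ",artist_name,popularity\n0,Jason Mraz,68\n1,Jason Mraz,50\n"

# Without index_col, pandas keeps the index as a data column named "Unnamed: 0".
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(without_index.columns))
print(list(with_index.columns))
```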


In [3]:
track_df.describe()

Unnamed: 0,popularity,year,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
count,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0,1159764.0
mean,18.38312,2011.955,0.5374382,0.6396699,5.287778,-8.981353,0.6346533,0.09281477,0.321537,0.2523489,0.2230189,0.4555636,121.3771,249561.8,3.885879
std,15.88554,6.803901,0.184478,0.2705009,3.555197,5.682215,0.4815275,0.1268409,0.3549872,0.3650731,0.2010707,0.268519,29.77975,149426.2,0.4676967
min,0.0,2000.0,0.0,0.0,0.0,-58.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2073.0,0.0
25%,5.0,2006.0,0.413,0.454,2.0,-10.829,0.0,0.0371,0.0064,1.05e-06,0.0979,0.226,98.797,181091.0,4.0
50%,15.0,2012.0,0.55,0.694,5.0,-7.45,1.0,0.0507,0.147,0.00176,0.134,0.438,121.931,225744.0,4.0
75%,29.0,2018.0,0.677,0.873,8.0,-5.276,1.0,0.089,0.64,0.614,0.292,0.674,139.903,286913.5,4.0
max,100.0,2023.0,0.993,1.0,11.0,6.172,1.0,0.971,0.996,1.0,1.0,1.0,249.993,6000495.0,5.0
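The summary already hints at unusual values: `tempo` and `time_signature` both have a minimum of 0, and `duration_ms` ranges from about 2 seconds to 100 minutes. One common way to flag such points, sketched here as an assumption rather than the method used later in this notebook, is the 1.5 × IQR rule. The numbers are copied from the `duration_ms` column of the summary.

```python
# Quartiles and extremes copied from the describe() output for duration_ms.
duration_q1, duration_q3 = 181091.0, 286913.5
duration_min, duration_max = 2073.0, 6000495.0

# Interquartile range and the conventional 1.5 * IQR fences.
iqr = duration_q3 - duration_q1
lower_fence = duration_q1 - 1.5 * iqr
upper_fence = duration_q3 + 1.5 * iqr

print(f"IQR fences for duration_ms: [{lower_fence}, {upper_fence}]")
print("Shortest track flagged as outlier:", duration_min < lower_fence)
print("Longest track flagged as outlier:", duration_max > upper_fence)
```

Under this rule, both the shortest and the longest track fall outside the fences, so track duration is a reasonable place to start when looking for outliers.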


# **Data Preprocessing**

Check the dataset for missing values and apply any transformations the data needs.

In [5]:
# Count the missing values in each column of the dataset.

track_df.isna().sum()

artist_name         0
track_name          0
track_id            0
popularity          0
year                0
genre               0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
dtype: int64
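Every column reports zero missing values, so no imputation or row removal is needed here. For reference, the sketch below shows what the same check and a simple cleanup could look like on a toy frame with deliberately introduced gaps. The frame `toy_df` and the derived column `duration_min` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy frame (NOT the real dataset): one missing track_name and one missing tempo.
toy_df = pd.DataFrame({
    "track_name": ["Fast Car", None, "Do Not Let Me Go"],
    "tempo": [204.961, 140.182, None],
    "duration_ms": [304293, 216387, 158960],
})

# Same check as above: number of missing values per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# One possible cleanup: drop rows with gaps in the columns we rely on.
clean_df = toy_df.dropna(subset=["track_name", "tempo"]).copy()

# Example transformation: express duration in minutes instead of milliseconds.
clean_df["duration_min"] = clean_df["duration_ms"] / 60_000
print(clean_df)
```

Dropping rows is only one option; filling gaps with a median or a sentinel value may be preferable when many rows are affected.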