# **K-Pop Spotify Popularity Score Forecasting Based on Song Characteristics**

In this notebook, we explore the relationship between song characteristics and Spotify popularity scores for K-pop tracks. Digital streaming platforms have made music more accessible than ever, and Spotify has become one of the leading platforms for discovering and listening to music.

First, we will perform Exploratory Data Analysis (EDA) to gain a deeper understanding of our dataset. By visualizing and analyzing the relationships between various song features and the corresponding popularity scores, we will uncover any existing trends or correlations that could serve as valuable predictors.

**The questions we seek to answer in the Exploratory Data Analysis portion are:**
1. How are Spotify popularity scores distributed in the dataset? Are there any patterns or trends?
2. What are the most significant song characteristics provided in the dataset? How do they vary across popular and less popular songs?
3. Is there a correlation between specific song features (such as tempo, energy, danceability, etc.) and Spotify popularity scores? Which features show the strongest correlation?
4. Are there any outliers or unusual data points in terms of song characteristics or popularity scores? What might be the reasons behind these outliers?
5. Are there any temporal trends in Spotify popularity scores? Do certain periods or years show higher or lower average popularity scores?
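As a rough illustration of how question 3 could be approached, the sketch below ranks numeric columns by the strength of their Pearson correlation with `popularity`. The helper name `popularity_correlations` and the toy values are assumptions made up for this example; they are not part of the dataset or of the analysis that follows.

```python
import pandas as pd


def popularity_correlations(df, target="popularity"):
    """Correlate every numeric column with the target, strongest first."""
    # Keep only numeric columns; text columns such as track_name are skipped.
    numeric = df.select_dtypes(include="number")
    # Pearson correlation of each feature with the target column.
    corr = numeric.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations rank high too.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Toy frame with made-up values, only to show the call (not real data).
toy_df = pd.DataFrame({
    "popularity": [10, 20, 30, 40, 50],
    "energy": [0.1, 0.2, 0.3, 0.4, 0.5],
    "tempo": [120.0, 90.0, 130.0, 100.0, 110.0],
    "track_name": ["a", "b", "c", "d", "e"],
})
ranked = popularity_correlations(toy_df)
print(ranked)
```

Once the real data is loaded, the same call could be made on that DataFrame instead of the toy frame. Pearson correlation only captures linear relationships, so a weak value does not rule out a non-linear one.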

Once we have a solid grasp of the data, we will train and evaluate a machine learning model. We will fit regression or classification models to the dataset and tune their parameters to optimize performance. Our aim is a reliable predictive model that estimates the popularity score of a song, giving us insight into what makes a track resonate with listeners on Spotify.

# **Setup**
The next cell imports all the Python libraries needed for the project.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme()
plt.rcParams['font.family'] = 'sans-serif'

# **Import dataset**

The dataset contains the artist name, track name, track ID, popularity, year, genre, and audio characteristics (danceability, energy, tempo, and so on) of each song.

In [2]:
# Load the dataset, using the first CSV column as the index
kpop_df = pd.read_csv("../data/kpop_spotify_data.csv", index_col=0)

# Preview the first 5 rows of the dataset
kpop_df.head()

Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
35488,Block B,닐리리맘보,4fZYGHiRcKxwVSnF498uaE,59,2012,k-pop,0.756,0.837,2,-2.711,0,0.0524,0.0131,0.0,0.0943,0.537,109.937,205280,4
35489,Block B,난리나,7dgO6H1BxXeXfxDkiDN8E9,55,2012,k-pop,0.78,0.956,8,-3.69,1,0.0882,0.0523,0.0,0.0675,0.921,114.971,202606,4
35490,SHINee,Sherlock (Clue + Note),2sVtrcj32v3fR8mLjqWziv,54,2012,k-pop,0.64,0.937,1,-2.112,1,0.0839,0.0069,0.0,0.393,0.316,107.013,239187,4
35491,G-DRAGON,Without You (Feat. ROSE),3V375E3xldRPEEcIKiw83l,61,2012,k-pop,0.6,0.678,11,-6.894,1,0.0372,0.0311,0.0,0.338,0.326,123.991,242677,4
35492,J Boog,See Her Again,3Ibxs1OxL9wH3jBwpIQGid,47,2012,k-pop,0.734,0.502,2,-6.792,1,0.0484,0.768,0.0,0.0584,0.936,62.023,189677,4


In [3]:
kpop_df.describe().round(3)

Unnamed: 0,popularity,year,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
count,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0,20004.0
mean,27.736,2011.487,0.622,0.685,5.328,-6.02,0.591,0.08,0.295,0.024,0.197,0.551,120.565,238646.271,3.943
std,16.622,6.951,0.139,0.2,3.563,3.069,0.492,0.082,0.276,0.121,0.158,0.234,27.907,126413.575,0.323
min,0.0,2000.0,0.102,0.0,0.0,-27.274,0.0,0.022,0.0,0.0,0.011,0.033,43.223,22040.0,0.0
25%,14.0,2005.0,0.532,0.541,2.0,-7.509,0.0,0.034,0.052,0.0,0.096,0.363,99.912,199586.75,4.0
50%,27.0,2012.0,0.64,0.719,5.0,-5.43,1.0,0.048,0.199,0.0,0.134,0.562,119.987,223493.0,4.0
75%,40.0,2018.0,0.724,0.852,8.0,-3.877,1.0,0.088,0.5,0.0,0.265,0.74,136.575,258628.0,4.0
max,92.0,2023.0,0.968,0.999,11.0,1.22,1.0,0.95,0.996,0.988,0.986,0.99,229.829,4051696.0,5.0
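The summary above shows a minimum `time_signature` of 0 and a maximum `duration_ms` of 4,051,696 (about 67.5 minutes), both of which look unusual for a song. A minimal sketch for pulling out such rows is given below. The helper name `flag_suspicious_rows`, the 15-minute cutoff, and the toy values are assumptions chosen for illustration, not thresholds taken from this project.

```python
import pandas as pd


def flag_suspicious_rows(df, max_minutes=15.0):
    """Return rows with a time_signature of 0 or an unusually long duration."""
    # Convert milliseconds to minutes for a more readable threshold.
    duration_min = df["duration_ms"] / 60_000
    # A row is flagged if either condition holds.
    mask = (df["time_signature"] == 0) | (duration_min > max_minutes)
    return df.loc[mask]


# Toy frame with made-up values, only to show the call (not real data).
toy_df = pd.DataFrame({
    "track_name": ["ok", "no_meter", "very_long"],
    "duration_ms": [210_000, 200_000, 4_000_000],
    "time_signature": [4, 0, 4],
})
flagged = flag_suspicious_rows(toy_df)
print(flagged)
```

Calling `flag_suspicious_rows(kpop_df)` would list the candidate rows for manual inspection. Whether to drop, cap, or keep them is a judgment call that depends on what the flagged tracks turn out to be.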


# **Data Preprocessing**

Check for missing values and transform the data where needed.

In [4]:
# Checking if there are any missing values in the dataset.

kpop_df.isna().sum()

artist_name         0
track_name          0
track_id            0
popularity          0
year                0
genre               0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
dtype: int64
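No column has missing values, so no imputation is needed. Two further steps that might be worth considering are dropping rows that repeat a `track_id` and expressing duration in minutes. The sketch below is only one possible way to do this: the helper name `basic_preprocess` and the `duration_min` column are hypothetical, and whether the dataset contains duplicate IDs at all is not established here.

```python
import pandas as pd


def basic_preprocess(df):
    """Drop repeated track_id rows and add a duration_min column."""
    # Keep the first occurrence of each track_id; copy to leave the input intact.
    out = df.drop_duplicates(subset="track_id", keep="first").copy()
    # Duration in minutes is easier to read than milliseconds.
    out["duration_min"] = out["duration_ms"] / 60_000
    return out


# Toy frame with made-up values, only to show the call (not real data).
toy_df = pd.DataFrame({
    "track_id": ["a", "b", "a"],
    "duration_ms": [180_000, 240_000, 180_000],
})
cleaned = basic_preprocess(toy_df)
print(cleaned)
```

Because the function returns a new frame, the original DataFrame stays available for comparison before and after cleaning.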