# STAT 315 Final Project  
## Spotify Track Popularity Analysis  
Group Members: Ayushi Sharma, Bernardo Gonzalez Guerra, Mohnish Bandari  
STAT 315 - Fall 2025

# Research Questions
1. What audio features are the strongest predictors of a song being popular on Spotify?
2. Can we accurately classify whether a song falls into a “high popularity” vs. “low popularity” group based solely on its audio features?
3. How well can we predict a song’s exact popularity score using a regression model, and which features contribute most to this prediction?
4. What correlations and relationships exist among audio features (e.g., danceability, energy, valence, loudness), and how might these influence model performance?
5. Do certain genres or artists tend to have higher average popularity, and how do their audio profiles compare to less popular tracks?
6. How much uncertainty exists in our predictions, as quantified by cross-validation and bootstrap confidence intervals?

# Data Collection

## Our dataset
For this project, we are using the Spotify Tracks Dataset from Kaggle, which contains over 230,000 songs pulled from Spotify’s Web API. Each row represents a single track, and the dataset includes a mixture of audio features, metadata, and Spotify’s own popularity score. The audio features (such as danceability, energy, and acousticness) are numerical values generated by Spotify’s signal processing models that describe different musical characteristics of each track. Because the dataset is large, mostly numeric, and relatively clean, it is well suited for exploratory data analysis and predictive modeling.

## Dataset Source
This dataset is publicly available on Kaggle at:
https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db

The original data was collected through Spotify’s Web API, which provides track-level audio analysis values and popularity scores.

## Variable Descriptions
Below is a list of the key variables we will be using:
- genre: Genre label associated with the track.
- artist_name: Name of the performing artist.
- track_name: Name of the song.
- track_id: Unique Spotify ID for the track.
- popularity: Spotify-defined popularity score (0–100).
- acousticness: Confidence measure of whether a track is acoustic (0.0–1.0).
- danceability: How suitable a track is for dancing (0.0–1.0).
- duration_ms: Track duration in milliseconds.
- energy: Perceived intensity and activity of the track (0.0–1.0).
- instrumentalness: Likelihood the track contains no vocals (0.0–1.0).
- key: Musical key (0–11).
- liveness: Probability the track was performed live (0.0–1.0).
- loudness: Overall loudness of the track (in decibels).
- mode: Modal quality of the track (0 = minor, 1 = major).
- speechiness: Presence of spoken words in a track (0.0–1.0).
- tempo: Estimated tempo in beats per minute (BPM).
- time_signature: Estimated time signature (e.g., 3, 4, 5).
- valence: Musical “positiveness” conveyed by the track (0.0–1.0).

## Spotify Popularity Score
Spotify’s popularity metric ranges from 0 to 100, where higher values indicate a track that is streamed more often and more recently relative to others. The score is not based on audio features; instead, it is derived from listener behavior such as stream counts, recency, playlist appearances, and user engagement. Because it reflects real-world streaming activity, this popularity score serves as a meaningful target variable for both regression and classification tasks.

In [None]:
# Load data
import pandas as pd
import numpy as np

df = pd.read_csv('../data/SpotifyFeatures.csv')

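# A notebook cell only auto-displays its last expression,
# so print the earlier result explicitly.
print(df.head())
df.info()
df.describe()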


### Dataset Summary

The dataset contains **232,725 rows** and **18 columns**. Most of the variables are numeric
features such as danceability, energy, valence, loudness, tempo, and other audio-related
attributes generated by Spotify's API. A few variables, such as `track_name`, `artist_name`,
and `genre`, are stored as object/string types.

Numeric columns include float-based audio features (danceability, energy, valence, acousticness,
speechiness, instrumentalness, etc.) as well as integer-based metadata (popularity, key,
time_signature). The object columns contain track metadata.
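
The numeric/object split described above can be read off programmatically rather than by eye. The following is a minimal sketch on a made-up toy frame (the column values are invented for illustration); on the real data, the same two `select_dtypes` calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are invented for illustration.
toy = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "genre": ["Pop", "Jazz", "Rock"],
    "popularity": [10, 55, 80],
    "danceability": [0.4, 0.7, 0.9],
})

# Split columns by dtype instead of listing them by hand.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Comparing these lists against the variable descriptions is a quick way to confirm that each column is stored with the type the analysis expects.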

# 4. Data Cleaning & Preparation

In this section, we clean the dataset by handling missing values, removing duplicates, 
performing feature transformations, and preparing variables for modeling. These steps 
ensure that the data is reliable, consistent, and suitable for regression and classification.


In [None]:
# Check missing values
df.isna().sum()

### Missing Values Summary

The missing values check shows that the dataset is nearly complete, with only
one missing value found in the `track_name` column. Since every other column has zero
missing entries and the dataset contains over 232,000 rows, removing this single row
is the simplest and most appropriate approach. This ensures data consistency without
any meaningful loss of information.
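
Before dropping anything, it can help to look at the affected rows directly. The sketch below uses a small invented frame with one missing `track_name`; the same pattern should carry over to `df`.

```python
import pandas as pd

# Toy frame with one missing track name (values are invented).
toy = pd.DataFrame({
    "track_name": ["a", None, "c"],
    "popularity": [10, 55, 80],
})

# True for every row that has at least one missing value.
has_missing = toy.isna().any(axis=1)
print(toy[has_missing])  # inspect the rows that would be removed

cleaned = toy.dropna()
print(len(toy) - len(cleaned), "row(s) removed")
```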


In [None]:
# Drop rows with missing values
df = df.dropna()
df.shape

After removing the single row containing a missing value, the dataset remains large 
and complete. This prepares the data for further preprocessing and analysis.

In [None]:
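# Remove exact duplicate rows and report how many were dropped
rows_before = len(df)
df = df.drop_duplicates()
print(rows_before - len(df), "duplicate rows removed")
df.shape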


### Duplicate Check

We checked for duplicate rows using `df.drop_duplicates()`. The shape of the dataset
remained the same before and after this operation, indicating that no two rows are
identical in every column. Note that this check only catches exact duplicates: it would
not flag the same `track_id` appearing in more than one row with a different value in
another column, such as `genre`.
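
The difference between an exact-row check and a per-track check can be sketched as follows. The toy data is invented, and whether the real dataset contains repeated `track_id` values is an assumption to verify, not something established here.

```python
import pandas as pd

# Toy frame: track "t1" appears twice under different genres (invented data).
toy = pd.DataFrame({
    "track_id": ["t1", "t1", "t2"],
    "genre": ["Pop", "Dance", "Jazz"],
    "popularity": [70, 70, 30],
})

# Rows identical in every column.
exact_dupes = int(toy.duplicated().sum())
# Rows whose track_id has already appeared, regardless of other columns.
id_dupes = int(toy.duplicated(subset="track_id").sum())

print("exact duplicates:", exact_dupes)
print("repeated track_id:", id_dupes)
```

If the second count is nonzero on the real data, the same track would contribute more than one row to any model fit on `df`.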


# 5. Creating New Features

### 1. Converting duration from milliseconds to minutes:
We convert track duration from milliseconds to minutes for better interpretability 
in visualizations and modeling.


In [None]:
df['duration_min'] = df['duration_ms'] / 60000


### 2. Creating the `high_popularity` variable:
To create a roughly balanced classification target, we label a track as "high popularity" if
its popularity score is at or above the median of the dataset (the code uses `>=`). Because
popularity is an integer score with many ties, every track that sits exactly at the median falls into the high class, so the split is approximately, not exactly, 50/50.
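
How far ties can push the split away from 50/50 is easy to see on a small example. The scores below are made up so that several values tie at the median; this is an illustration of the effect, not a statement about the real class balance.

```python
import pandas as pd

# Invented integer scores with four values tied at the median.
scores = pd.Series([0, 10, 20, 20, 20, 20, 30, 40])

threshold = scores.median()
high = (scores >= threshold).astype(int)

print("threshold:", threshold)
# Share of tracks in each class; ties at the median all land in class 1.
print(high.value_counts(normalize=True))
```

Running `value_counts(normalize=True)` on the real `high_popularity` column gives the actual class shares.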


In [None]:
threshold = df['popularity'].median()
df['high_popularity'] = (df['popularity'] >= threshold).astype(int)
threshold


# 6. Selecting Features for Modeling

In [None]:
features = [
    'danceability', 'energy', 'valence', 'loudness', 'tempo',
    'acousticness', 'instrumentalness', 'speechiness', 'liveness',
    'duration_min'
]

X = df[features]
y_reg = df['popularity']
y_clf = df['high_popularity']


We define our feature matrix `X` using key audio characteristics that are numeric 
and suitable for regression and classification. Our regression target is the continuous 
popularity score, and our classification target is the binary high_popularity variable.


# 7. Exploratory Data Analysis


In this section, we explore relationships among the audio features and popularity. 
We examine correlations, feature distributions, differences between high and low 
popularity tracks, and genre-level patterns. This helps us understand the structure 
of the dataset before building predictive models.

### Correlation Heatmap
The correlation heatmap helps identify which audio features are most closely 
associated with popularity. For example, features like loudness, energy, or 
danceability may show moderate positive correlation with popularity, while others 
such as acousticness may show weaker or negative associations.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.heatmap(df[features + ['popularity']].corr(), cmap='viridis', annot=False)
plt.title('Correlation Heatmap of Audio Features and Popularity')
plt.show()


### Interpretation of Correlation Heatmap

The heatmap shows that popularity is only weakly correlated with most audio features. 
The strongest positive correlations occur with loudness, energy, danceability, and 
valence, suggesting that louder, energetic, danceable, and more positive-sounding 
tracks tend to be slightly more popular. Acousticness shows a moderate negative 
correlation with popularity, indicating that highly acoustic tracks are generally 
less popular.

Energy and loudness are very strongly correlated with each other, meaning these two 
features capture similar musical traits. Acousticness is strongly negatively correlated 
with both energy and loudness, which reflects the contrast between quiet acoustic songs 
and louder, more energetic tracks.
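
Strongly correlated feature pairs such as these can also be listed directly from the correlation matrix. This is a sketch on invented values; the 0.8 cutoff is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy audio features (invented values).
toy = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8, 0.9],
    "loudness": [-20.0, -14.0, -9.0, -6.0, -4.0],
    "acousticness": [0.9, 0.8, 0.4, 0.2, 0.1],
    "tempo": [120.0, 90.0, 130.0, 100.0, 110.0],
})

corr = toy.corr()
# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the (arbitrary) 0.8 cutoff.
strong = pairs[pairs.abs() > 0.8]
print(strong.round(2))
```

Pairs flagged this way are candidates for multicollinearity in a linear model.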

Several other features, such as speechiness, tempo, liveness, and duration, show almost
no linear relationship with popularity, implying that these factors are not strongly
associated with how popular a track becomes.

Overall, the heatmap suggests that while audio features are associated with popularity to
some extent, no single feature strongly explains it. This is consistent with the idea that
popularity is driven by listener behavior and external factors rather than purely musical
characteristics, which is important to keep in mind as we move into regression and
classification modeling.
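
Since the heatmap is drawn without annotations, a ranked list of each feature's correlation with popularity makes the comparison easier to read. The sketch below uses invented values; with the real data, the same expression would be applied to `df[features + ['popularity']]`.

```python
import pandas as pd

# Toy data (invented values).
toy = pd.DataFrame({
    "loudness": [-20.0, -12.0, -8.0, -6.0, -4.0],
    "acousticness": [0.9, 0.7, 0.4, 0.3, 0.1],
    "tempo": [120.0, 90.0, 130.0, 100.0, 110.0],
    "popularity": [10, 30, 45, 60, 70],
})

# Correlation of every feature with the target, target itself removed.
corr_with_target = toy.corr()["popularity"].drop("popularity")
# Sort by absolute strength so strong negative correlations rank high too.
ranked = corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```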


### Popularity Distribution
Popularity scores range from 0–100, with many tracks clustered at lower values. 
This indicates that most songs on Spotify are relatively less popular, and only 
a smaller subset achieves high popularity.



In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['popularity'], kde=True, bins=30)
plt.title("Distribution of Popularity Scores")
plt.xlabel("Popularity")
plt.ylabel("Count")
plt.show()


### Audio Feature Distribution
These distributions show how Spotify's audio attributes vary across tracks. Some
features (such as danceability and energy) are spread fairly evenly across their range,
while others (such as acousticness or instrumentalness) are heavily skewed.
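
Skewness can be quantified alongside the histograms with pandas' sample skewness. A minimal sketch on invented values: a symmetric column gives a value near zero, while a column with a long right tail gives a large positive value.

```python
import pandas as pd

# Invented values: one symmetric column, one with a long right tail.
toy = pd.DataFrame({
    "danceability": [0.2, 0.35, 0.5, 0.65, 0.8],
    "instrumentalness": [0.0, 0.0, 0.0, 0.01, 0.9],
})

# Sample skewness per column.
skewness = toy.skew()
print(skewness.round(2))
```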


In [None]:
cols_to_plot = ['danceability', 'energy', 'valence', 'acousticness', 'loudness']

plt.figure(figsize=(12, 8))
for i, col in enumerate(cols_to_plot, 1):
    plt.subplot(2, 3, i)
    sns.histplot(df[col], kde=True)
    plt.title(col)
plt.tight_layout()
plt.show()


### Scatterplots vs. Popularity
Scatterplots allow us to visually check whether relationships appear linear or 
nonlinear. Some features may show mild positive relationships with popularity, 
while others show little structure.
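
One numeric complement to the scatterplots is to compare Pearson correlation (linear association) with Spearman correlation (monotone association). The sketch below computes Spearman as Pearson on ranks, on an invented curved relationship; a Spearman value well above the Pearson value hints that the relationship is monotone but not linear.

```python
import pandas as pd

# Invented monotone but strongly curved relationship.
x = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
y = x ** 4

pearson = x.corr(y)                 # linear association
spearman = x.rank().corr(y.rank())  # Pearson on ranks, i.e. Spearman

print("pearson:", round(pearson, 3))
print("spearman:", round(spearman, 3))
```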


In [None]:
# Create a new figure for each feature so every plot gets the same size
# (a single plt.figure call only applies to the first plot).
for col in ['danceability', 'energy', 'valence']:
    plt.figure(figsize=(12, 8))
    sns.scatterplot(x=col, y='popularity', data=df, alpha=0.3)
    plt.title(f"{col.capitalize()} vs Popularity")
    plt.show()


### High vs. Low Popularity Comparisons
These boxplots compare the distribution of audio features between high and low 
popularity groups. Features like loudness and energy often show clear differences, 
suggesting they may play a role in predicting popularity.
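
The group differences visible in the boxplots can be summarized as per-group medians. This is a sketch on invented values, assuming a binary `high_popularity` column like the one defined earlier.

```python
import pandas as pd

# Toy data with a binary class label (invented values).
toy = pd.DataFrame({
    "high_popularity": [0, 0, 0, 1, 1, 1],
    "loudness": [-15.0, -12.0, -10.0, -8.0, -6.0, -5.0],
    "energy": [0.3, 0.5, 0.4, 0.7, 0.6, 0.8],
})

# Median of each feature within each popularity group.
group_medians = toy.groupby("high_popularity")[["loudness", "energy"]].median()
# Difference in medians: high group minus low group.
gap = group_medians.loc[1] - group_medians.loc[0]

print(group_medians)
print(gap)
```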


In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='high_popularity', y='danceability', data=df)
plt.title("Danceability: High vs Low Popularity")
plt.xlabel("High Popularity (0 = No, 1 = Yes)")
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(x='high_popularity', y='energy', data=df)
plt.title("Energy: High vs Low Popularity")
plt.xlabel("High Popularity (0 = No, 1 = Yes)")
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(x='high_popularity', y='loudness', data=df)
plt.title("Loudness: High vs Low Popularity")
plt.xlabel("High Popularity (0 = No, 1 = Yes)")
plt.show()


### Genre-Level Popularity
Some genres tend to have higher average popularity than others. This analysis helps 
identify genre-level trends and supports further questions about how musical style 
relates to popularity.
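
Genre averages are easier to interpret next to the number of tracks behind each one, since a mean based on very few tracks is less stable. A minimal sketch on invented data:

```python
import pandas as pd

# Toy data: "Opera" has a high mean but only one track (invented values).
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Pop", "Jazz", "Jazz", "Opera"],
    "popularity": [70, 65, 60, 30, 40, 90],
})

# Mean popularity and track count per genre, highest mean first.
genre_stats = (
    toy.groupby("genre")["popularity"]
       .agg(["mean", "count"])
       .sort_values("mean", ascending=False)
)
print(genre_stats)
```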


In [None]:
genre_pop = df.groupby('genre')['popularity'].mean().sort_values(ascending=False).head(15)

plt.figure(figsize=(12, 6))
sns.barplot(x=genre_pop.values, y=genre_pop.index)
plt.title("Top 15 Genres by Average Popularity")
plt.xlabel("Average Popularity")
plt.ylabel("Genre")
plt.show()
