# STAT 315 Final Project  
## Spotify Track Popularity Analysis  
Group Members: Ayushi Sharma, Bernardo Gonzalez Guerra, Mohnish Bandari  
STAT 315 - Fall 2025

# Research Questions
1. What audio features are the strongest predictors of a song being popular on Spotify?
2. Can we accurately classify whether a song falls into a “high popularity” vs. “low popularity” group based solely on its audio features?
3. How well can we predict a song’s exact popularity score using a regression model, and which features contribute most to this prediction?
4. What correlations and relationships exist among audio features (e.g., danceability, energy, valence, loudness), and how might these influence model performance?
5. Do certain genres or artists tend to have higher average popularity, and how do their audio profiles compare to less popular tracks?
6. How much uncertainty exists in our predictions, as quantified by cross-validation and bootstrap confidence intervals?

# Data Collection

## Our dataset
For this project, we are using the Spotify Tracks Dataset from Kaggle, which contains over 230,000 songs pulled from Spotify’s Web API. Each row represents a single track, and the dataset includes a mixture of audio features, metadata, and Spotify’s own popularity score. The audio features (such as danceability, energy, and acousticness) are numerical values generated by Spotify’s signal processing models that describe different musical characteristics of each track. Because the dataset is large, mostly numeric, and relatively clean, it is well suited for exploratory data analysis and predictive modeling.

## Dataset Source
This dataset is publicly available on Kaggle at:
https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db

The original data was collected through Spotify’s Web API, which provides track-level audio analysis values and popularity scores.

## Variable Descriptions
Below is a list of the key variables we will be using:
- track_name – Name of the song.
- artist_name – Name of the performing artist.
- genre – Genre label associated with the track.
- track_id – Unique Spotify ID for the track.
- popularity – Spotify-defined popularity score (0–100).
- acousticness – Confidence measure of whether a track is acoustic (0.0–1.0).
- danceability – How suitable a track is for dancing (0.0–1.0).
- duration_ms – Track duration in milliseconds.
- energy – Perceived intensity and activity of the track (0.0–1.0).
- instrumentalness – Likelihood the track contains no vocals (0.0–1.0).
- key – Musical key (0–11).
- liveness – Probability the track was performed live (0.0–1.0).
- loudness – Overall loudness of the track (in decibels).
- mode – Modal quality of the track (0 = minor, 1 = major).
- speechiness – Presence of spoken words in a track (0.0–1.0).
- tempo – Estimated tempo in beats per minute (BPM).
- time_signature – Estimated time signature (e.g., 3, 4, 5).
- valence – Musical “positiveness” conveyed by the track (0.0–1.0).
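
The ranges listed above can be checked against the data before modeling. The following is a minimal sketch, assuming the columns carry the names listed above; the helper `check_documented_ranges` and the toy rows are hypothetical and used only for illustration. Missing values are counted as violations, because `between` returns `False` for them.

```python
import pandas as pd

# Features documented above as lying in the closed interval [0.0, 1.0].
UNIT_INTERVAL_FEATURES = [
    "acousticness", "danceability", "energy", "instrumentalness",
    "liveness", "speechiness", "valence",
]


def check_documented_ranges(frame: pd.DataFrame) -> pd.Series:
    """Count, per column, the rows outside the documented range (hypothetical helper)."""
    violations = {}
    for col in UNIT_INTERVAL_FEATURES:
        if col in frame.columns:
            # between() is inclusive on both ends; NaN counts as out of range.
            violations[col] = int((~frame[col].between(0.0, 1.0)).sum())
    if "popularity" in frame.columns:
        violations["popularity"] = int((~frame["popularity"].between(0, 100)).sum())
    return pd.Series(violations, dtype="int64")


# Toy rows with made-up values, only to show the call; not taken from the dataset.
toy = pd.DataFrame({
    "popularity": [55, 101, 0],
    "danceability": [0.7, 0.4, 1.2],
    "energy": [0.9, 0.5, 0.1],
})
range_report = check_documented_ranges(toy)
print(range_report)
```

On the real data the same call would be `check_documented_ranges(df)`; any nonzero count points to rows worth inspecting before fitting a model.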

## Spotify Popularity Score
Spotify’s popularity metric ranges from 0 to 100, where higher values indicate a track that is streamed more often and more recently relative to others. The score is not based on audio features; instead, it is derived from listener behavior such as stream counts, recency, playlist appearances, and user engagement. Because it reflects real-world streaming activity, this popularity score serves as a meaningful target variable for both regression and classification tasks.
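
For the classification task, the 0–100 score has to be turned into a binary label. The text above does not say how the "high" and "low" groups are defined, so the sketch below assumes a median split; the function name `add_popularity_label` and the column name `high_popularity` are hypothetical, and the toy scores are made up.

```python
import pandas as pd


def add_popularity_label(frame, threshold=None):
    """Return a copy of `frame` with a 0/1 `high_popularity` column.

    Assumption: a track is "high popularity" when its score is strictly
    above the threshold, which defaults to the median popularity.
    """
    if threshold is None:
        threshold = frame["popularity"].median()
    labeled = frame.copy()  # leave the caller's DataFrame untouched
    labeled["high_popularity"] = (labeled["popularity"] > threshold).astype(int)
    return labeled


# Toy scores with made-up values, only to show the call.
toy = pd.DataFrame({"popularity": [10, 35, 50, 72, 90]})
labeled = add_popularity_label(toy)
print(labeled)
```

A median split keeps the two classes roughly balanced, which makes accuracy easier to interpret. A fixed cutoff can be passed through `threshold` if a different definition of "popular" is preferred. The regression task uses `popularity` directly and needs no such transformation.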

In [None]:
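import pandas as pd

# Load the Spotify tracks CSV (path is relative to this notebook)
df = pd.read_csv('../data/spotify_data.csv')

# A notebook cell displays only its last expression, so print each summary explicitly
print(df.head())      # first five rows
df.info()             # column dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics for the numeric columns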
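
After loading, a quick quality check can confirm how clean the data is. The sketch below counts missing values per column and repeated `track_id` values. It is an illustration under assumptions: the helper `summarize_quality` is hypothetical, the toy rows are made up, and whether the real file contains missing values or repeated IDs is something the check would reveal, not something stated here.

```python
import pandas as pd


def summarize_quality(frame, id_col="track_id"):
    """Return row count, missing values per column, and repeated IDs (hypothetical helper)."""
    return {
        "rows": len(frame),
        "missing_by_column": frame.isna().sum().to_dict(),
        # duplicated() marks every occurrence after the first as True
        "duplicate_ids": int(frame[id_col].duplicated().sum()),
    }


# Toy rows with made-up values, only to show the call.
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2", "c3"],
    "genre": ["Pop", "Dance", "Rock", None],
    "popularity": [60, 60, 41, 12],
})
quality = summarize_quality(toy)
print(quality)
```

On the real data the call would be `summarize_quality(df)`. If the same `track_id` appears in several rows, for example under different genre labels, that should be handled before modeling so that one track is not counted more than once.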