# Spotify Tracks Analysis
**Author:** Ankita De  
**Dataset:** spotify_merged.xlsx

This notebook performs data cleaning, exploratory data analysis, audio-feature analysis, clustering (track segments), a simple popularity model, and saves artifacts for GitHub submission.


## Problem Statement

Spotify hosts millions of tracks with diverse audio features, artist information, and popularity scores. However, identifying what makes a track popular, which artists dominate, how music trends evolve over time, and how tracks can be grouped into meaningful segments is not straightforward.

This project analyzes the Spotify dataset (`spotify_merged.xlsx`) to:

- Explore patterns in audio features, popularity, and temporal trends.
- Identify top artists, tracks, and the features most correlated with popularity.
- Segment tracks into distinct clusters based on audio characteristics.
- Build a simple predictive model to understand which features drive popularity.
- Provide actionable recommendations for playlist curation, marketing, and music production strategies.

## 1. Introduction – Setting the Stage

Spotify has become the world’s largest music streaming platform, with millions of songs. But what makes some songs popular while others remain unnoticed? Can we use data to uncover the secret recipe of a hit track?

**Code:** Import libraries, load the dataset, and preview it.

## 2. Understanding the Dataset – Meet the Data

Before we dive into trends, let’s understand what we have. Each track has metadata (artist, name, year) and audio features (danceability, energy, valence, tempo, etc.).

**Code:** `df.info()`, missing-value check, summary statistics.

## 3. Popularity Analysis – Who Rules the Charts?

Let’s find out which artists and tracks dominate Spotify, and how popularity is distributed.

- Plot the distribution of popularity scores
- Show the top 10 most popular tracks
- Show the top 10 most popular artists

**Storyline:** Most songs cluster at low-to-medium popularity, while a few super-hits dominate. Top artists consistently push songs above a popularity score of 80.
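
The "top 10 most popular artists" step can be read in more than one way. The sketch below ranks artists by mean track popularity and assumes the column names `artist_name` and `track_popularity` that Cell 4 standardizes. The helper name, the `min_tracks` filter, and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def top_artists_by_popularity(df, min_tracks=2, n=10):
    """Rank artists by mean track popularity, ignoring artists with few tracks."""
    stats = (
        df.groupby("artist_name")["track_popularity"]
          .agg(track_count="count", mean_popularity="mean")
    )
    # Require a minimum number of tracks so a single hit cannot top the list.
    stats = stats[stats["track_count"] >= min_tracks]
    return stats.sort_values("mean_popularity", ascending=False).head(n)

# Made-up rows, only to show the call; this is not real Spotify data.
toy = pd.DataFrame({
    "artist_name": ["A", "A", "B", "B", "B", "C"],
    "track_popularity": [90, 80, 40, 50, 60, 99],
})
ranking = top_artists_by_popularity(toy, min_tracks=2)
print(ranking)
```

Artist "C" is dropped here because it has a single track. Whether such a filter is appropriate depends on how the dataset was assembled.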

## 4. Trends Over Time – Evolution of Music

Has the sound of music changed over the years? Do modern songs follow a different formula compared to older ones?

- Popularity trends by year
- Line plots of danceability and energy over time

**Storyline:** In recent years, tracks have become more energetic and danceable, reflecting a global shift toward upbeat streaming hits.
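
One possible way to prepare the danceability and energy trend lines is to average each feature per release year. This is a minimal sketch that assumes the `year_released` column created in Cell 5. The function name, the `min_tracks` parameter, and the toy values are hypothetical.

```python
import pandas as pd

def yearly_feature_means(df, features=("danceability", "energy"), min_tracks=1):
    """Mean of each audio feature per release year, skipping sparse years."""
    cols = [c for c in features if c in df.columns]
    grouped = df.dropna(subset=["year_released"]).groupby("year_released")
    means = grouped[cols].mean()
    counts = grouped.size()
    # Years with very few tracks give noisy averages, so allow filtering them out.
    return means[counts >= min_tracks].sort_index()

# Made-up values for illustration only.
toy = pd.DataFrame({
    "year_released": [2000, 2000, 2010, 2010, 2020],
    "danceability": [0.4, 0.6, 0.6, 0.7, 0.8],
    "energy": [0.5, 0.5, 0.6, 0.8, 0.9],
})
trend = yearly_feature_means(toy)
print(trend)
```

In a notebook, calling `trend.plot()` on the result should draw one line per feature.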

## 5. Audio Features Deep Dive – The DNA of a Hit

Every track has a musical DNA: tempo, danceability, energy, valence. Let’s see which features matter most for popularity.

- Histograms of audio features
- Correlation heatmap with popularity
- Scatter plots (energy vs valence, tempo vs danceability)

**Storyline:** Highly popular songs usually balance danceability and energy, while extreme values (for example, a very slow or very fast tempo) reduce mass appeal.
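
The correlation heatmap listed above could be drawn with plain matplotlib, as in this sketch. The function name and the random toy data are assumptions for illustration. On the real data, `cols` would be the numeric audio features plus `track_popularity`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_corr_heatmap(df, cols):
    """Draw a Pearson correlation heatmap for the given numeric columns."""
    corr = df[cols].corr()
    fig, ax = plt.subplots(figsize=(5, 4))
    # Fix the colour scale to [-1, 1] so colours are comparable across plots.
    im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(cols)))
    ax.set_xticklabels(cols, rotation=45, ha="right")
    ax.set_yticks(range(len(cols)))
    ax.set_yticklabels(cols)
    fig.colorbar(im, ax=ax, label="Pearson r")
    fig.tight_layout()
    return corr, fig

# Random numbers standing in for real features; illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.random((50, 3)),
    columns=["danceability", "energy", "track_popularity"],
)
corr, fig = plot_corr_heatmap(toy, list(toy.columns))
```

Pearson correlation only captures linear relationships, so a weak value does not rule out a non-linear effect.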

## 6. Clustering Songs – Grouping by Vibes

Not all songs are meant to be chartbusters. Some are for parties, others for relaxation. Can we group songs into natural clusters?

- Standardize features
- KMeans clustering (find K with the elbow method)
- PCA visualization

**Storyline:** We discover distinct clusters:

- Cluster 1: High energy, party tracks
- Cluster 2: Calm, acoustic songs
- Cluster 3: Balanced, mainstream pop

This shows how Spotify can curate mood-based playlists.
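
The elbow plot can be ambiguous, so a second check on K may help. The sketch below uses the silhouette score, a different technique from the elbow method named above, on synthetic blobs. The function name and the toy data are assumptions, and the notebook itself does not compute this score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def silhouette_by_k(X, k_values=range(2, 7), random_state=42):
    """Return {k: mean silhouette score} for KMeans on standardized features."""
    X_scaled = StandardScaler().fit_transform(X)
    scores = {}
    for k in k_values:
        labels = KMeans(
            n_clusters=k, n_init=10, random_state=random_state
        ).fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    return scores

# Three well-separated synthetic blobs; not real audio features.
rng = np.random.default_rng(0)
toy = np.vstack(
    [rng.normal(loc=c, scale=0.1, size=(40, 2)) for c in (0.0, 3.0, 6.0)]
)
scores = silhouette_by_k(toy, k_values=range(2, 6))
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Higher silhouette values indicate tighter, better-separated clusters. Real audio features are unlikely to separate as cleanly as these blobs.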

## 7. Predicting Popularity – Can We Guess a Hit?

Given only the audio features, can we predict whether a track will be popular?

- Train/test split
- Regression or classification model
- Evaluation metrics

**Storyline:** Our simple model explains part of the variance, but popularity also depends on external factors like marketing and artist reputation.
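
The notebook's model cell takes the regression route. For the classification option, a minimal sketch could label tracks as hits above a popularity threshold and fit a logistic regression. The threshold of 70, the function name, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

HIT_THRESHOLD = 70  # assumed cut-off for a "hit"; not taken from the notebook

def fit_hit_classifier(df, features, threshold=HIT_THRESHOLD, random_state=42):
    """Fit a logistic regression that predicts popularity >= threshold."""
    data = df[features + ["track_popularity"]].dropna()
    X = data[features]
    y = (data["track_popularity"] >= threshold).astype(int)
    # Stratify so both classes appear in the train and test splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    return clf, auc

# Synthetic data where popularity is driven by energy by construction.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.random((400, 3)), columns=["danceability", "energy", "valence"]
)
toy["track_popularity"] = 100 * toy["energy"] + rng.normal(0, 5, size=400)
clf, auc = fit_hit_classifier(toy, ["danceability", "energy", "valence"])
print(f"ROC AUC on the toy data: {auc:.2f}")
```

The high score on the toy data reflects how it was generated and says nothing about real tracks, where hits are rare and the signal is much weaker.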

## 8. Key Insights – What We Learned

- Popularity is skewed: only a few songs dominate.
- Danceability, energy, and valence are positively associated with popularity.
- Trends over the years show a rising preference for upbeat tracks.
- Clusters reveal natural groupings (party, acoustic, mainstream).
- Prediction is possible but limited: success isn’t just about audio features.

## 9. Recommendations – Turning Data into Action

Based on our analysis, here’s how Spotify and artists can benefit:

- Spotify: Improve playlist recommendations by leveraging clusters.
- Artists: Focus on energy/danceability balance for mainstream hits.
- Industry: Track evolving trends to align with listener moods.

## 10. Conclusion – Wrapping Up the Story

Music is more than numbers, but data reveals powerful patterns. Spotify’s dataset shows us that the recipe for a hit lies in the balance of energy, danceability, and emotional connection.

In [None]:
# Cell 1 - Imports and config
import os
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# adjust backend if needed
# %matplotlib inline

OUT_DIR = Path('./spotify_analysis_outputs')
OUT_DIR.mkdir(exist_ok=True)
DATA_PATH = 'spotify_merged.xlsx'   # update path if needed


In [None]:
# Cell 2 - Load dataset and quick inspection
try:
    df = pd.read_excel(DATA_PATH, engine='openpyxl')
except Exception as e:
    print(f"Error reading with openpyxl: {e}")
    print("Could not read the Excel file with 'openpyxl' engine. Please check if the file is a valid and uncorrupted Excel file.")


if 'df' in locals():
    print("Shape:", df.shape)
    print("Columns:", df.columns.tolist())
    display(df.head(10))

In [None]:
# Check the first few bytes of the file.
# A valid .xlsx file is a ZIP archive, so the header should start with b'PK'.
try:
    with open(DATA_PATH, 'rb') as f:
        header = f.read(10)
    print(f"File header: {header}")
except Exception as e:
    print(f"Error reading file header: {e}")

In [None]:
# Confirm that the data file exists at DATA_PATH
if not Path(DATA_PATH).exists():
    print(f"Error: File not found at {DATA_PATH}")
else:
    print(f"File found at {DATA_PATH}")

In [None]:
# Cell 3 - Basic info and missing values
df.info()  # prints dtypes and non-null counts directly
print("\nMissing values (top 30):")
print(df.isnull().sum().sort_values(ascending=False).head(30))


In [None]:
# Cell 4 - Standardize column names we commonly expect
rename_map = {}
if 'popularity' in df.columns and 'track_popularity' not in df.columns:
    rename_map['popularity'] = 'track_popularity'
if 'artists' in df.columns and 'artist_name' not in df.columns:
    rename_map['artists'] = 'artist_name'
if rename_map:
    df = df.rename(columns=rename_map)
    print("Renamed:", rename_map)


In [None]:
# Cell 5 - Convert release_date to datetime (if exists) and extract year
if 'release_date' in df.columns:
    df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
    df['year_released'] = df['release_date'].dt.year
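    print("Release dates parsed. Years from:", df['year_released'].min(), "to", df['year_released'].max())
    # errors='coerce' turns unparseable dates into NaT, so report how many rows were affected
    print("Rows without a usable release date:", int(df['release_date'].isna().sum()))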
else:
    print("No release_date column found.")


In [None]:
# Cell 6 - Numeric summary
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric cols:", numeric_cols)
df[numeric_cols].describe().round(3)


In [None]:
# Cell 7 - Correlations across all numeric columns
corr = df[numeric_cols].corr()
# show correlations sorted by absolute correlation with popularity (if present)
if 'track_popularity' in corr.columns:
    corr_pop = corr['track_popularity'].drop('track_popularity', errors='ignore').sort_values(key=lambda x: x.abs(), ascending=False)
    print("Top correlations with track_popularity:\n", corr_pop.head(10))
# display correlation table (small)
corr.round(3)


In [None]:
# Cell 8 - Key distributions & scatter (small samples for performance)
def plot_hist(col, bins=30, figsize=(6,3)):
    plt.figure(figsize=figsize)
    plt.hist(df[col].dropna(), bins=bins)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.tight_layout()
    plt.show()

# Popularity histogram
if 'track_popularity' in df.columns:
    plot_hist('track_popularity')

# audio features histogram (select common spotify features if present)
for feat in ['danceability','energy','speechiness','acousticness','instrumentalness','liveness','valence','tempo','duration_ms']:
    if feat in df.columns:
        plot_hist(feat)

# energy vs valence scatter (sample for speed)
if {'energy', 'valence'}.issubset(df.columns):
    pairs = df[['energy', 'valence']].dropna()
    # cap the sample size at the number of complete rows so sample() cannot raise
    sample = pairs.sample(min(3000, len(pairs)), random_state=1)
    plt.figure(figsize=(6, 4))
    plt.scatter(sample['energy'], sample['valence'], s=6)
    plt.xlabel('energy'); plt.ylabel('valence'); plt.title('Energy vs Valence (sample)')
    plt.tight_layout(); plt.show()


In [None]:
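# Cell 9 - Most frequent artists (by track count) and top tracks by popularity
if 'artist_name' in df.columns:
    # value_counts ranks artists by number of tracks in the dataset, not by popularity
    top_artists = df['artist_name'].value_counts().head(20)
    display(top_artists)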
if set(['track_name','artist_name','track_popularity']).issubset(df.columns):
    top_tracks = df[['track_name','artist_name','track_popularity']].drop_duplicates().sort_values('track_popularity', ascending=False).head(30)
    display(top_tracks)


In [None]:
# Cell 10 - Temporal analysis (tracks & avg popularity per year)
if 'year_released' in df.columns:
    yearly_counts = df.groupby('year_released').size().rename('track_count').reset_index().sort_values('year_released')
    display(yearly_counts.tail(20))
    plt.figure(figsize=(8,3)); plt.plot(yearly_counts['year_released'], yearly_counts['track_count']); plt.title('Tracks per year'); plt.tight_layout(); plt.show()
    if 'track_popularity' in df.columns:
        yearly_pop = df.groupby('year_released')['track_popularity'].mean().reset_index().sort_values('year_released')
        display(yearly_pop.tail(20))
        plt.figure(figsize=(8,3)); plt.plot(yearly_pop['year_released'], yearly_pop['track_popularity']); plt.title('Avg popularity per year'); plt.tight_layout(); plt.show()


In [None]:
# Cell 11 - Clustering (KMeans) to find audio-style segments
# NOTE: clustering can be heavy on full data. We'll sample up to 5000 rows with complete audio features for speed.
cluster_features = [c for c in ['danceability','energy','speechiness','acousticness','instrumentalness','liveness','valence','tempo','duration_ms'] if c in df.columns]
print("Using cluster features:", cluster_features)

cluster_df = df[cluster_features].dropna()
print("Rows avail for clustering:", len(cluster_df))
sample = cluster_df.sample(min(5000, len(cluster_df)), random_state=42)

# scale, elbow, kmeans
scaler = StandardScaler()
X = scaler.fit_transform(sample)
inertias = []
Ks = range(1,7)
for k in Ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.figure(figsize=(6,3)); plt.plot(list(Ks), inertias, marker='o'); plt.title('Elbow (sample)'); plt.xlabel('k'); plt.ylabel('inertia'); plt.tight_layout(); plt.show()

# pick k (=4 recommended by visual elbow or domain knowledge)
k = 4
km = KMeans(n_clusters=k, random_state=42, n_init=20).fit(X)
labels = km.labels_
sample_labeled = sample.copy(); sample_labeled['segment'] = labels
# cluster centers (inverse transform)
centers = scaler.inverse_transform(km.cluster_centers_)
centers_df = pd.DataFrame(centers, columns=cluster_features).round(3)
centers_df['cluster'] = centers_df.index
display(centers_df)

# PCA to visualize clusters
pca = PCA(n_components=2, random_state=42)
XY = pca.fit_transform(X)
plt.figure(figsize=(6,4)); plt.scatter(XY[:,0], XY[:,1], c=labels, s=8); plt.title('PCA cluster scatter'); plt.tight_layout(); plt.show()

# cluster means of popularity (if popularity present)
if 'track_popularity' in df.columns:
    # sample keeps df's original index, so the labels can be attached via df.loc
    # (this assumes df's index values are unique)
    sample_idx = sample.index
    df_sample_labeled = df.loc[sample_idx].copy()
    df_sample_labeled['segment'] = labels
    seg_means = df_sample_labeled.groupby('segment')[['track_popularity'] + cluster_features].mean().round(3)
    display(seg_means)


In [None]:
# Cell 12 - Save CSV outputs to include in GitHub repo
# Save the first 20,000 rows of the cleaned data and the clustered sample
cleaned_path = OUT_DIR / 'spotify_cleaned_sample.csv'
df.head(20000).to_csv(cleaned_path, index=False)
print("Saved cleaned sample to:", cleaned_path)

# Save cluster sample (if computed)
if 'sample_labeled' in locals():
    cluster_sample_path = OUT_DIR / 'spotify_clusters_sample.csv'
    sample_labeled.to_csv(cluster_sample_path, index=True)  # keep original index to map back
    print("Saved cluster sample to:", cluster_sample_path)

# create a quick README file
readme_text = """
Project: Spotify Tracks Analysis

Files:
- spotify_cleaned_sample.csv
- spotify_clusters_sample.csv

Run the notebook to regenerate visualizations and artifacts.
"""
(OUT_DIR / 'README_quick.txt').write_text(readme_text)
print("Saved quick README.")


In [None]:
# Cell 13 - Quick popularity prediction baseline (optional)
# We'll fit a small random forest to predict track_popularity from audio features
if 'track_popularity' in df.columns and len(cluster_features) >= 3:
    model_df = df[cluster_features + ['track_popularity']].dropna()
    # sample to speed training
    model_sample = model_df.sample(min(15000, len(model_df)), random_state=1)
    X = model_sample[cluster_features]
    y = model_sample['track_popularity']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    preds = rf.predict(X_test)
    print("RMSE:", mean_squared_error(y_test, preds)**0.5) # Calculate RMSE manually
    print("R2:", r2_score(y_test, preds))
    # feature importance
    fi = pd.Series(rf.feature_importances_, index=cluster_features).sort_values(ascending=False)
    display(fi)
else:
    print("Skipping popularity model - missing popularity or audio features.")

## Insights for Spotify Dataset Analysis

### 1. Dataset Overview

- The dataset contains thousands of tracks with metadata (artist, album, year) and audio features (danceability, energy, valence, tempo, etc.).
- Some missing values and duplicates may exist, but overall data quality is good.

### 2. Popularity Analysis

- The popularity distribution is right-skewed: most songs score between 20 and 40, while only a few exceed 80.
- Top artists dominate the charts, appearing repeatedly among the most popular tracks.
- A small fraction of tracks captures most of the attention, while the majority remain less known.
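
The skew described above can be quantified with a few summary numbers. This is a small sketch on made-up scores. The function name and the cut-off of 80 mirror the text but are otherwise illustrative.

```python
import pandas as pd

def popularity_skew_summary(pop, high=80):
    """Summarize how lopsided a popularity distribution is."""
    pop = pop.dropna()
    return {
        "median": float(pop.median()),
        "mean": float(pop.mean()),
        # Positive skewness means a long tail toward high popularity.
        "skewness": float(pop.skew()),
        "share_at_or_above_high": float((pop >= high).mean()),
    }

# Made-up scores for illustration only.
toy = pd.Series([10, 20, 20, 30, 30, 30, 40, 40, 85, 95])
summary = popularity_skew_summary(toy)
print(summary)
```

A mean above the median together with positive skewness is the usual signature of a few very popular tracks pulling the average up.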

### 3. Trends Over Time

- Recent years show higher average popularity than older songs. This is expected in part because Spotify's popularity score weights recent plays more heavily than older ones.
- Danceability and energy have increased over time, reflecting a trend toward upbeat, party-oriented tracks.
- Older tracks score higher on acousticness and instrumentalness than modern, electronic-heavy music.

### 4. Audio Features & Popularity

- Danceability, energy, and valence (musical positivity) correlate positively with popularity.
- Instrumentalness and acousticness often correlate with lower popularity; mainstream hits are usually vocal-heavy.
- Tempo correlates only weakly with popularity; speed alone doesn’t guarantee a hit.
- Balanced songs (not extreme in any one feature) tend to perform better.
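
A linear correlation cannot show whether mid-range values do better than the extremes. One way to check the "balanced songs" claim is to bin a feature and compare mean popularity per bin, as in this sketch. The bin edges, the function name, and the toy rows are assumptions.

```python
import pandas as pd

def popularity_by_feature_bin(df, feature, bins=(0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Track count and mean popularity within each bin of a 0-1 audio feature."""
    data = df[[feature, "track_popularity"]].dropna()
    binned = pd.cut(data[feature], bins=list(bins), include_lowest=True)
    return data.groupby(binned, observed=True)["track_popularity"].agg(
        ["count", "mean"]
    )

# Made-up rows where the middle bin happens to score highest.
toy = pd.DataFrame({
    "danceability": [0.1, 0.3, 0.5, 0.5, 0.7, 0.9],
    "track_popularity": [20, 40, 70, 60, 50, 10],
})
table = popularity_by_feature_bin(toy, "danceability")
print(table)
```

If the highest means sit in the middle bins on the real data, that would support the claim. The `count` column shows how much data backs each bin.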

### 5. Clustering (Grouping Songs by Vibe)

The dataset naturally forms 3–5 clusters when grouped by audio features:

- Cluster 1: Party Hits – high energy, high danceability.
- Cluster 2: Calm/Acoustic Songs – low energy, high acousticness.
- Cluster 3: Mainstream Pop – balanced features, most popular.
- (Optional) Cluster 4: Speech-heavy tracks – high speechiness (rap, podcasts).

These clusters can be used for playlist curation (e.g., “Chill Vibes,” “Workout Mix”).

### 6. Predictive Modeling (if included)

- A regression model explains part of the variation in popularity, but not all.
- Popularity is not just about audio features; it also depends on marketing, artist fame, social media, and trends.
- Still, features like danceability, energy, and valence consistently improve prediction accuracy.
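
To check how much the audio features actually add, the model can be compared against a predict-the-mean baseline, with permutation importance measured on held-out data. This is a sketch under assumptions: the function name and the synthetic data are hypothetical, and permutation importance is a different measure from the impurity-based `feature_importances_` used in Cell 13.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def evaluate_with_baseline(X, y, random_state=42):
    """Compare a random forest with a mean baseline and rank features."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    model = RandomForestRegressor(
        n_estimators=50, random_state=random_state
    ).fit(X_train, y_train)
    # Shuffle one feature at a time on the test set and measure the score drop.
    perm = permutation_importance(
        model, X_test, y_test, n_repeats=5, random_state=random_state
    )
    importances = pd.Series(
        perm.importances_mean, index=X.columns
    ).sort_values(ascending=False)
    return {
        "baseline_r2": r2_score(y_test, baseline.predict(X_test)),
        "model_r2": r2_score(y_test, model.predict(X_test)),
        "permutation_importance": importances,
    }

# Synthetic data where only energy matters, by construction.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    rng.random((300, 3)), columns=["danceability", "energy", "valence"]
)
y_toy = 60 * X_toy["energy"] + rng.normal(0, 5, size=300)
result = evaluate_with_baseline(X_toy, y_toy)
print(result["baseline_r2"], result["model_r2"])
print(result["permutation_importance"])
```

On the real dataset, a model R² only slightly above the baseline would be consistent with the point that audio features explain a limited share of popularity.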

### 7. Key Insights (Summary)

- Popularity is skewed: only a small percentage of songs go viral.
- Danceability + Energy + Valence = recipe for mainstream success.
- Modern tracks are more upbeat than older tracks.
- Natural clusters of songs exist, which Spotify can use to improve playlist recommendations.
- Popularity prediction is possible but limited: music success also depends on external factors.