# Spotify Data Analysis

## Setup

### Downloading your own account's listening history

To get your account's data, go to [Spotify's Privacy page](https://www.spotify.com/in-en/account/privacy/) and, under the `Download your data` section, select `Extended streaming history` and request it.


### Generating your own Analysis

1. Clone this repo, or download the ZIP and extract it to a folder

```bash
git clone https://github.com/d1vij/spotify-data-analysis/
```

2. Install dependencies

```bash
# If using uv
uv sync

# Using pip
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```


3. Start the Jupyter server

```bash
uv run jupyter lab
# OR 
jupyter lab
```

4. Open and run all cells of this notebook (_project.ipynb_)

## About the project

The project focuses on extracting and analyzing Spotify’s Extended Streaming History, a multi-file dataset provided by Spotify under GDPR data access requests. 

Unlike the standard one-year streaming history, the extended dataset lists every item (e.g. songs, videos, and podcasts) listened to or watched during the lifetime of the account.

---

## About the data

The ZIP file provided by Spotify has the listening history split across multiple JSON files, each around 12KB in size.

![Manual Unzipping](https://github.com/d1vij/spotify-data-analysis/blob/main/images/manual_unzipping.png?raw=true)

_Although this could've been unzipped and processed manually, the script `utils.process_zip` does it automatically for us :P._

---

### Data Compilation & DataFrame Construction Workflow

#### 1. Asking the User for the ZIP File Path
The program begins by prompting the user for the file path of Spotify’s **Extended Streaming History** ZIP archive, which is usually named _my_spotify_data.zip_.

#### 2. Unzipping and Reading JSON Files
After extraction, the application scans the directory and reads every JSON file with the prefix: _Streaming_History_Audio_

Each of these files contains an **array of listening-event objects**, shaped like:

```json
{
    "ts": "2023-04-09T14:24:49Z",
    "platform": "android",
    "ms_played": 185733,
    "conn_country": "IN",
    "ip_addr": "152.57.221.62",
    "master_metadata_track_name": "Here Comes The Sun - Remastered 2009",
    "master_metadata_album_artist_name": "The Beatles",
    "master_metadata_album_album_name": "Abbey Road",
    "spotify_track_uri": "spotify:track:6dGnYIeXmHdcikdzNNDMm2",
    "episode_name": null,
    "episode_show_name": null,
    "spotify_episode_uri": null,
    "audiobook_title": null,
    "audiobook_uri": null,
    "audiobook_chapter_uri": null,
    "audiobook_chapter_title": null,
    "reason_start": "playbtn",
    "reason_end": "trackdone",
    "shuffle": false,
    "skipped": false,
    "offline": false,
    "offline_timestamp": 1681050047,
    "incognito_mode": false
}
```
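
The reading step above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's `utils.process_zip` implementation; the function name `read_streaming_events` is hypothetical, and the sketch reads the archive directly instead of extracting it to disk first.

```python
import json
import zipfile
from pathlib import PurePosixPath

# File-name prefix described in the text above
PREFIX = "Streaming_History_Audio"


def read_streaming_events(zip_source):
    """Collect listening events from every matching JSON file in the archive.

    `zip_source` may be a file path or a binary file-like object.
    """
    events = []
    with zipfile.ZipFile(zip_source) as archive:
        # Sort the member names so files are processed in a stable order
        for name in sorted(archive.namelist()):
            filename = PurePosixPath(name).name
            if filename.startswith(PREFIX) and filename.endswith(".json"):
                # Each file holds a JSON array of listening-event objects
                events.extend(json.loads(archive.read(name).decode("utf-8")))
    return events
```

Calling `read_streaming_events("./data/my_spotify_data.zip")` would return one flat list of event dictionaries, assuming the archive sits at that path.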
#### 3. Filtering and Reducing to Required Fields

Although Spotify provides a wide variety of metrics (see _utils.TrackInfoClasses.SongAttributes_), this project focuses on a minimal, analysis-oriented subset:

1. ts – Timestamp when the track was played
1. ms_played – Total listening duration in milliseconds
1. master_metadata_track_name – Track name
1. master_metadata_album_artist_name – Primary artist
1. master_metadata_album_album_name – Album name

The final compiled dataset is stored as a large JSON (or CSV) file (≈ 11 MB in my case), containing simplified objects of _utils.TrackInfoClasses.FilteredTrackInfo_, like:

```csv
...
2025-11-29T09:31:18Z,952,She's A Rainbow,The Rolling Stones,Forty Licks
2025-11-29T09:31:33Z,14674,Changes - 2015 Remaster,David Bowie,Hunky Dory
2025-11-29T09:31:34Z,934,Going to California - Remaster,Led Zeppelin,Led Zeppelin IV
2025-11-29T09:31:36Z,1372,You Can't Always Get What You Want,The Rolling Stones,Let It Bleed
2025-11-29T09:32:24Z,48475,Going to California - Remaster,Led Zeppelin,Led Zeppelin IV
2025-11-29T09:32:25Z,0,Broken Hearts Are For Assholes,Frank Zappa,Sheik Yerbouti
2025-11-29T09:34:05Z,99907,Broken Hearts Are For Assholes,Frank Zappa,Sheik Yerbouti
2025-11-29T09:34:10Z,5228,Flakes,Frank Zappa,Sheik Yerbouti
2025-11-29T09:34:14Z,3911,Baba O'Riley,The Who,Who's Next
2025-11-29T09:34:20Z,0,Misirlou,Dick Dale,Surf With Me Now!
2025-11-29T09:39:04Z,215618,Dazed and Confused - 1990 Remaster,Led Zeppelin,Led Zeppelin
2025-11-29T12:53:56Z,44476,Why Does It Hurt When I Pee?,Frank Zappa,"Joe's Garage Acts I, II & III"
2025-11-29T13:04:48Z,8634,Why Does It Hurt When I Pee?,Frank Zappa,"Joe's Garage Acts I, II & III"
2025-11-29T13:04:59Z,10778,Catholic Girls,Frank Zappa,"Joe's Garage Acts I, II & III"
2025-11-29T13:05:03Z,2223,The Soft Parade,The Doors,The Soft Parade
2025-11-29T13:08:42Z,218185,I Talk To The Wind,King Crimson,In The Court Of The Crimson King (Expanded & Remastered Original Album Mix)
2025-11-29T13:08:53Z,10708,1983...(A Merman I Should Turn to Be),Jimi Hendrix,Electric Ladyland
2025-11-29T13:13:13Z,258944,"Careful with That Axe, Eugene - Live",Pink Floyd,Ummagumma
2025-11-29T13:25:55Z,100453,Wish You Were Here,Pink Floyd,Wish You Were Here
2025-11-29T14:13:52Z,3073,Wah-Wah (2014 Remaster),George Harrison,All Things Must Pass
2025-11-29T14:14:17Z,939,Taxman - Remastered 2009,The Beatles,Revolver
...
```
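
As a rough sketch of the filtering step, the snippet below reduces one event to the five fields listed above and writes it as a CSV row. It does not reproduce `utils.TrackInfoClasses.FilteredTrackInfo`; the names `filter_event` and `events_to_csv` are hypothetical, the sample event is trimmed for illustration, and omitting a header row is an assumption.

```python
import csv
import io

REQUIRED_FIELDS = [
    "ts",
    "ms_played",
    "master_metadata_track_name",
    "master_metadata_album_artist_name",
    "master_metadata_album_album_name",
]


def filter_event(event):
    """Keep only the five fields used for analysis."""
    return {field: event.get(field) for field in REQUIRED_FIELDS}


def events_to_csv(events):
    """Serialise filtered events as CSV text (no header row, by assumption)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for event in events:
        row = filter_event(event)
        writer.writerow([row[field] for field in REQUIRED_FIELDS])
    return buffer.getvalue()


# Illustrative event: extra keys such as "platform" are dropped by the filter
sample = {
    "ts": "2025-11-29T13:13:13Z",
    "platform": "android",
    "ms_played": 258944,
    "master_metadata_track_name": "Careful with That Axe, Eugene - Live",
    "master_metadata_album_artist_name": "Pink Floyd",
    "master_metadata_album_album_name": "Ummagumma",
    "shuffle": False,
}
print(events_to_csv([sample]), end="")
```

The `csv` module quotes any field containing a comma, which is why track and album names with commas appear in double quotes in the compiled file.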

---

### External Data

An [additional CSV*](https://www.kaggle.com/datasets/harshdprajapati/worldwide-music-artists-dataset-with-image) file listing artists along with their country and genres is cleaned and used to provide further metadata about each artist.

_*https://www.kaggle.com/datasets/harshdprajapati/worldwide-music-artists-dataset-with-image_

```csv
...
David Bowie,"art rock, classic rock, glam rock, permanent wave, rock",United Kingdom
The black eyed peas karaok...,hip hop,United States
The Strokes,"alternative rock, garage rock, modern rock, permanent wave, rock",United States
Britney Spears,"dance pop, pop",United States
Guns N' Roses,"glam metal, hard rock, rock",United States
Franz Ferdinand,"alternative rock, dance rock, indie rock, modern rock, rock, scottish rock",United Kingdom
The Doors,"acid rock, album rock, classic rock, hard rock, psychedelic rock, rock",United States
JAY-Z,"east coast hip hop, gangster rap, hip hop, pop rap, rap",United States
Madonna,"dance pop, pop",United States
Pink Floyd,"album rock, art rock, classic rock, progressive rock, psychedelic rock, rock, symphonic rock",United Kingdom
Weezer,"alternative rock, modern power pop, modern rock, permanent wave, rock",United States
Snow Patrol,"irish rock, modern rock, neo mellow, permanent wave, pop rock",United Kingdom
blink-182,"alternative metal, modern rock, pop punk, punk, rock, socal pop punk",United States
The White Stripes,"alternative rock, blues rock, detroit rock, garage rock, modern blues rock, permanent wave, punk blues, rock",United States
The Cure,"new wave, permanent wave, rock, uk post-punk",United Kingdom
Led Zeppelin,"album rock, classic rock, hard rock, rock",United Kingdom
Becky G,"latin pop, latin viral pop, rap latina, reggaeton, trap latino, urbano latino",United States
Aerosmith,"album rock, classic rock, hard rock, rock",United States
The Offspring,"alternative metal, permanent wave, post-grunge, punk, rock, skate punk, socal pop punk",United States
...
```
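
One way such rows could be turned into a lookup table is sketched below. The column order (artist, genres, country) is inferred from the excerpt above, and `load_artist_metadata` is a hypothetical helper, not part of the project's `utils` module.

```python
import csv
import io

# Two rows copied from the excerpt above
RAW = '''David Bowie,"art rock, classic rock, glam rock, permanent wave, rock",United Kingdom
Led Zeppelin,"album rock, classic rock, hard rock, rock",United Kingdom
'''


def load_artist_metadata(text):
    """Map artist name -> {"genres": [...], "country": ...}."""
    lookup = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            # Skip blank or malformed lines
            continue
        name, genres, country = row
        lookup[name] = {
            # The genre column is itself a comma-separated list
            "genres": [g.strip() for g in genres.split(",") if g.strip()],
            "country": country,
        }
    return lookup


artists = load_artist_metadata(RAW)
print(artists["Led Zeppelin"]["genres"])
```

Keying the table by artist name lets each listening event be enriched with a country and a genre list through a single dictionary lookup.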

---

### Processing and Analysis Tools Used

The project implements wrappers around external libraries, which hide the underlying workings from the end user and simplify development.

Major external libraries used:
1. Pandas -- data manipulation and aggregation for analysis
1. Matplotlib and Seaborn -- plots for visual analysis
1. Fizzbuzz -- fuzzy searcher for user queries
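
To illustrate what fuzzy searching a user query means here, the sketch below uses the standard library's `difflib` as a stand-in. It is not the library the project uses, and `fuzzy_find` is a hypothetical helper.

```python
import difflib


def fuzzy_find(query, candidates, limit=3, cutoff=0.6):
    """Return up to `limit` candidates that approximately match `query`."""
    # Compare lower-cased names so matching is case-insensitive,
    # while remembering the original spelling of each candidate
    lowered = {}
    for candidate in candidates:
        lowered.setdefault(candidate.lower(), candidate)
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[m] for m in matches]


artists = ["Led Zeppelin", "Pink Floyd", "The Beatles", "Frank Zappa"]
print(fuzzy_find("led zepelin", artists))
```

A misspelled query such as `"led zepelin"` still resolves to the intended artist, which is the behaviour an interactive per-artist search needs.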

---

All the helper functions and utilities are bundled into a unified module, `utils`; each of its submodules documents its own purpose and workings.

All generic plots live in the `utils.plot_sources.plotters` module, whereas the functions that use them to generate the analysis are in the `utils.plot_sources.analysis_plots` module.

---

# Imports and configurations

In [None]:
from os import path
import pandas as pd

# Prevent wrapping of dataframes when printed
pd.set_option("display.expand_frame_repr", False)


# Suppress all warnings (mainly noise generated by matplotlib)
import warnings

warnings.filterwarnings("ignore")


from utils.generate_frame import generate_frame

# Plot imports
from utils.plot_sources.analysis_plots.top_n import top_analysis
from utils.plot_sources.analysis_plots.daily_tracks_graph import daily_tracks_graph
from utils.plot_sources.analysis_plots.daily_listening_activity import (
    daily_listening_activity,
)
from utils.plot_sources.analysis_plots.track_playtime_kde_dist import (
    track_playtime_kde_dist,
)
from utils.plot_sources.analysis_plots.per_artist_analysis import (
    interactive_per_artist_analysis,
)
from utils.plot_sources.analysis_plots.artist_corelation_plot import (
    artist_correlation_plot,
)

## Track info dataframe

In [None]:
df = generate_frame("./data/my_spotify_data.zip")
print(df.tail(20))

## Artist info dataframe 

In [None]:
artists_df = pd.read_csv(path.abspath("./ext_data/global_music_artists.csv"))
print(artists_df.head(20))

## Analysis

### Top analysis

Analyzing the top artists, tracks, albums and genres the user has been listening to.


In [None]:
top_analysis(df, artists_df)
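
As a rough idea of the aggregation behind a "top artists" ranking, the sketch below sums `ms_played` per artist over a few rows taken from the compiled CSV excerpt. Ranking by total playtime (rather than play count) is an assumption, and `top_artists_by_playtime` is a hypothetical helper, not the project's `top_analysis`.

```python
from collections import Counter

# Rows follow the compiled CSV layout: (ts, ms_played, track, artist, album)
rows = [
    ("2025-11-29T09:34:05Z", 99907, "Broken Hearts Are For Assholes", "Frank Zappa", "Sheik Yerbouti"),
    ("2025-11-29T09:39:04Z", 215618, "Dazed and Confused - 1990 Remaster", "Led Zeppelin", "Led Zeppelin"),
    ("2025-11-29T12:53:56Z", 44476, "Why Does It Hurt When I Pee?", "Frank Zappa", "Joe's Garage Acts I, II & III"),
    ("2025-11-29T13:13:13Z", 258944, "Careful with That Axe, Eugene - Live", "Pink Floyd", "Ummagumma"),
    ("2025-11-29T13:25:55Z", 100453, "Wish You Were Here", "Pink Floyd", "Wish You Were Here"),
]


def top_artists_by_playtime(rows, n=3):
    """Return the `n` artists with the highest total playtime in milliseconds."""
    totals = Counter()
    for _ts, ms_played, _track, artist, _album in rows:
        totals[artist] += ms_played
    return totals.most_common(n)


print(top_artists_by_playtime(rows, n=2))
# [('Pink Floyd', 359397), ('Led Zeppelin', 215618)]
```

The same pattern applies to tracks or albums by changing which column is used as the counter key.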

In [None]:
daily_tracks_graph(df)

In [None]:
daily_listening_activity(df)
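
A per-day view of listening time could be derived roughly as below. This is one plausible reading of "daily listening activity", not the project's `daily_listening_activity` implementation; `minutes_per_day` is a hypothetical helper, and days are taken in UTC because the `ts` values end in `Z`.

```python
from collections import defaultdict
from datetime import datetime

# Timestamp layout used by the `ts` field, e.g. 2025-11-29T09:39:04Z
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def minutes_per_day(events):
    """Sum ms_played per calendar day (UTC) and convert to minutes."""
    totals = defaultdict(int)
    for ts, ms_played in events:
        day = datetime.strptime(ts, TS_FORMAT).date()
        totals[day] += ms_played
    # Sort by date so the result is ready for plotting on a time axis
    return {day.isoformat(): ms / 60000 for day, ms in sorted(totals.items())}


events = [
    ("2025-11-29T09:39:04Z", 215618),
    ("2025-11-29T13:08:42Z", 218185),
    ("2025-11-30T00:05:00Z", 120000),  # made-up row to show a second day
]
result = minutes_per_day(events)
print(result)
```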

In [None]:
track_playtime_kde_dist(df)

### Correlation between listened artists

In [1]:
artist_correlation_plot(df)

### Interactive per artist analysis

In [None]:
interactive_per_artist_analysis(df, artists_df)

## Exporting 

This Jupyter notebook is converted into a ready-to-print HTML file saved at /build/project.html

In [None]:
%run ./scripts/export.py

In [None]:
%run ./utils/jupyter_configs/loadcell.py
%run ./utils/jupyter_configs/imagestyling.py

## Source code

All source files required for this project are compiled into a single cell below. This does not include any of the external data.

The source code is best viewed directly in the repo.

In [None]:
# Run the script to compile all python files
%run ./scripts/bake.py

# Paste the content of the compiled file into this cell
%loadnext ./builds/built.pie