Spotify Data Pipeline

Data pipeline that extracts a user's song listening history from the Spotify API using Python, PostgreSQL, dbt, Metabase, Airflow, and Docker

Objective

Deep dive into a user's song listening history to retrieve information about top artists, top tracks, top genres, and more. This is a personal side project for fun that recreates Spotify Wrapped at a more frequent cadence, giving quicker and more detailed insights. The pipeline calls the Spotify API every hour during hours 0-6 and 14-23 UTC (basically whenever I'm awake) to extract the user's song listening history, loads the responses into a database, applies transformations, and visualizes the metrics in a dashboard. Since the dataset is small and the pipeline doesn't need to run 24/7, everything is built with open-source tools and hosted locally to avoid any cost.
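
For reference, that cadence maps to a single cron expression in the Airflow DAG definition. A minimal sketch, assuming Airflow 2.x; the DAG id, task id, and run_pipeline callable are illustrative, not the repo's actual names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from main import run_pipeline  # hypothetical wrapper around main.py

with DAG(
    dag_id="spotify_pipeline",
    start_date=datetime(2023, 1, 1),
    # minute 0 of hours 0-6 and 14-23 UTC, i.e. "whenever I'm awake"
    schedule_interval="0 0-6,14-23 * * *",
    catchup=False,
) as dag:
    extract_and_load = PythonOperator(
        task_id="extract_and_load",
        python_callable=run_pipeline,
    )
```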

Tools & Technologies

  • Orchestration: Apache Airflow
  • Containerization: Docker
  • Database: PostgreSQL
  • Transformations: dbt
  • Dashboard: Metabase
  • Alerting & reporting: Slack
  • Language: Python

Architecture

(Architecture diagram: spotify.drawio)

Data Flow

  1. The main.py script is triggered every hour (during hours 0-6 and 14-23 UTC) via Airflow. It refreshes the access token, connects to the Postgres database to look up the latest listened-at time, and calls the Spotify API to retrieve the most recently played songs and their corresponding genres (see the sketch after this list).
  2. Responses are saved as CSV files named 'YYYY-MM-DD.csv' on the local file system. These act as the replayable source, since the Spotify API only returns the 50 most recently played songs and no historical data. Each file keeps getting appended with the most recently played songs for its respective date.
  3. Data is copied into the Postgres database into the respective tables, spotify_songs and spotify_genres.
  4. The dbt run task is triggered to run transformations on top of the staging data, producing analytical and reporting tables/views.
  5. dbt test runs after dbt run completes successfully to ensure all tests pass.
  6. Tables/views are fed into Metabase, and the metrics are visualized in a dashboard.
  7. A Slack subscription is set up in Metabase to send a weekly summary every Monday.
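
The extraction in step 1 boils down to two Spotify Web API calls: refreshing the access token, then fetching the recently played tracks after the latest listened-at timestamp. A rough sketch, assuming credentials live in environment variables (the repo's main.py may structure this differently):

```python
import base64
import os

import requests


def refresh_access_token() -> str:
    """Exchange the long-lived refresh token for a fresh access token."""
    auth = base64.b64encode(
        f"{os.environ['SPOTIFY_CLIENT_ID']}:{os.environ['SPOTIFY_CLIENT_SECRET']}".encode()
    ).decode()
    resp = requests.post(
        "https://accounts.spotify.com/api/token",
        headers={"Authorization": f"Basic {auth}"},
        data={
            "grant_type": "refresh_token",
            "refresh_token": os.environ["SPOTIFY_REFRESH_TOKEN"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


def get_recently_played(access_token: str, after_unix_ms: int) -> dict:
    """Fetch up to 50 tracks played after the latest listened time stored in Postgres."""
    resp = requests.get(
        "https://api.spotify.com/v1/me/player/recently-played",
        headers={"Authorization": f"Bearer {access_token}"},
        params={"limit": 50, "after": after_unix_ms},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```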

Throughout this entire process, if any Airflow task fails, an automatic Slack alert is sent to a custom Slack channel.
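
A simple way to implement this is an on_failure_callback attached to every task. The sketch below posts straight to a Slack incoming-webhook URL kept in an environment variable; the actual implementation may use Airflow's Slack provider instead:

```python
import os

import requests


def slack_failure_alert(context: dict) -> None:
    """on_failure_callback that posts the failing task's details to Slack."""
    ti = context["task_instance"]
    message = (
        ":red_circle: Airflow task failed.\n"
        f"*DAG*: {ti.dag_id}\n"
        f"*Task*: {ti.task_id}\n"
        f"*Logical date*: {context.get('logical_date')}\n"
        f"*Log URL*: {ti.log_url}"
    )
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": message}, timeout=10)


# Attach it to every task, e.g. via default_args:
# default_args = {"on_failure_callback": slack_failure_alert}
```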

DAG

(Screenshot of the Airflow DAG)

Sample Slack Alert

(Screenshot of a sample Slack alert)

Dashboard

(Dashboard screenshots)

Setup

  1. Get Spotify API Access
  2. Build Docker Containers for Airflow
  3. Set Up Airflow Connection to Postgres
  4. Install dbt Core
  5. Enable Airflow Slack Notifications
  6. Install Metabase
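
Once dbt Core is installed (step 4), the dbt run and dbt test steps from the data flow can be wired into the DAG as plain BashOperator tasks. A sketch, with an assumed project path and illustrative task ids:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt"  # assumed location of the dbt project inside the container

with DAG(
    dag_id="spotify_pipeline_dbt_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    # dbt test only runs after a successful dbt run
    dbt_run >> dbt_test
```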

Further Improvements (Work In Progress)

  • Create a BranchPythonOperator that first checks whether the API payload is empty: if it is, proceed directly to the end task; otherwise continue to the downstream tasks (see the sketch after this list).
  • Implement data quality checks to catch any potential errors in the dataset
  • Create unit tests to ensure the pipeline is running as intended
  • Include CI/CD
  • Create more visualizations to uncover further insights once Spotify sends back my entire song listening history from 10+ years ago to the current date (this had to be requested separately, since the current API only allows requesting the 50 most recently played tracks)
  • If and when Spotify allows requesting historical data, implement backfill capability
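
For the first item above, the branch could look roughly like this, assuming the extract task pushes the API payload (or its row count) to XCom; task ids are illustrative:

```python
from airflow.operators.python import BranchPythonOperator


def choose_branch(**context) -> str:
    """Skip the load/transform tasks when the Spotify API returned no new plays."""
    payload = context["ti"].xcom_pull(task_ids="extract_and_load", key="payload")
    if not payload or not payload.get("items"):
        return "end"           # nothing new: jump straight to the end task
    return "load_to_postgres"  # otherwise continue to the downstream tasks


# Inside the `with DAG(...)` block:
# check_payload = BranchPythonOperator(task_id="check_payload", python_callable=choose_branch)
# extract_and_load >> check_payload >> [load_to_postgres, end]
```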
