Spotify Data Pipeline

Data pipeline that extracts a user's song listening history from the Spotify API using Python, PostgreSQL, dbt, Metabase, Airflow, and Docker

Objective

Deep dive into a user's song listening history to retrieve information about top artists, top tracks, top genres, and more. This is a personal side project for fun that recreates Spotify Wrapped at a more frequent cadence, giving quicker and more detailed insights. The pipeline calls the Spotify API every hour during hours 0-6 and 14-23 UTC (basically whenever I'm awake) to extract the user's song listening history, loads the responses into a database, applies transformations, and visualizes the metrics in a dashboard. Since the dataset is small and the pipeline doesn't need to run 24/7, everything is built with open-source tools and hosted locally to avoid any cost.
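
For reference, that cadence maps to a single cron expression in the Airflow DAG definition. A minimal sketch, assuming Airflow 2.x; the DAG id, task id, and run_pipeline callable are illustrative, not the repo's actual names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from main import run_pipeline  # hypothetical wrapper around main.py

with DAG(
    dag_id="spotify_pipeline",
    start_date=datetime(2023, 1, 1),
    # minute 0 of hours 0-6 and 14-23 UTC, i.e. "whenever I'm awake"
    schedule_interval="0 0-6,14-23 * * *",
    catchup=False,
) as dag:
    extract_and_load = PythonOperator(
        task_id="extract_and_load",
        python_callable=run_pipeline,
    )
```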

Tools & Technologies

  • Orchestration: Apache Airflow
  • Containerization: Docker
  • Database: PostgreSQL
  • Transformations: dbt
  • Dashboard: Metabase
  • Alerting & reporting: Slack
  • Language: Python

Architecture

(Architecture diagram: spotify.drawio)

Data Flow

  1. The main.py script is triggered every hour (during hours 0-6 and 14-23 UTC) via Airflow. It refreshes the access token, connects to the Postgres database to look up the latest listened-at time, and calls the Spotify API to retrieve the most recently played songs and their corresponding genres (see the sketch after this list).
  2. Responses are saved as CSV files named 'YYYY-MM-DD.csv' on the local file system. These act as the replayable source, since the Spotify API only returns the 50 most recently played songs and no historical data. Each file keeps getting appended with the most recently played songs for its respective date.
  3. Data is copied into the Postgres database into the respective tables, spotify_songs and spotify_genres.
  4. The dbt run task is triggered to run transformations on top of the staging data, producing analytical and reporting tables/views.
  5. dbt test runs after dbt run completes successfully to ensure all tests pass.
  6. Tables/views are fed into Metabase, and the metrics are visualized in a dashboard.
  7. A Slack subscription is set up in Metabase to send a weekly summary every Monday.
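
The extraction in step 1 boils down to two Spotify Web API calls: refreshing the access token, then fetching the recently played tracks after the latest listened-at timestamp. A rough sketch, assuming credentials live in environment variables (the repo's main.py may structure this differently):

```python
import base64
import os

import requests


def refresh_access_token() -> str:
    """Exchange the long-lived refresh token for a fresh access token."""
    auth = base64.b64encode(
        f"{os.environ['SPOTIFY_CLIENT_ID']}:{os.environ['SPOTIFY_CLIENT_SECRET']}".encode()
    ).decode()
    resp = requests.post(
        "https://accounts.spotify.com/api/token",
        headers={"Authorization": f"Basic {auth}"},
        data={
            "grant_type": "refresh_token",
            "refresh_token": os.environ["SPOTIFY_REFRESH_TOKEN"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


def get_recently_played(access_token: str, after_unix_ms: int) -> dict:
    """Fetch up to 50 tracks played after the latest listened time stored in Postgres."""
    resp = requests.get(
        "https://api.spotify.com/v1/me/player/recently-played",
        headers={"Authorization": f"Bearer {access_token}"},
        params={"limit": 50, "after": after_unix_ms},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```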

Throughout this entire process, if any Airflow task fails, an automatic Slack alert is sent to a custom Slack channel.
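
A simple way to implement this is an on_failure_callback attached to every task. The sketch below posts straight to a Slack incoming-webhook URL kept in an environment variable; the actual implementation may use Airflow's Slack provider instead:

```python
import os

import requests


def slack_failure_alert(context: dict) -> None:
    """on_failure_callback that posts the failing task's details to Slack."""
    ti = context["task_instance"]
    message = (
        ":red_circle: Airflow task failed.\n"
        f"*DAG*: {ti.dag_id}\n"
        f"*Task*: {ti.task_id}\n"
        f"*Logical date*: {context.get('logical_date')}\n"
        f"*Log URL*: {ti.log_url}"
    )
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": message}, timeout=10)


# Attach it to every task, e.g. via default_args:
# default_args = {"on_failure_callback": slack_failure_alert}
```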

DAG

(Screenshot of the Airflow DAG)

Sample Slack Alert

(Screenshot of a sample Slack alert)

Dashboard

(Dashboard screenshots)

Setup

  1. Get Spotify API Access
  2. Build Docker Containers for Airflow
  3. Set Up Airflow Connection to Postgres
  4. Install dbt Core
  5. Enable Airflow Slack Notifications
  6. Install Metabase
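
Once dbt Core is installed (step 4), the dbt run and dbt test steps from the data flow can be wired into the DAG as plain BashOperator tasks. A sketch, with an assumed project path and illustrative task ids:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt"  # assumed location of the dbt project inside the container

with DAG(
    dag_id="spotify_pipeline_dbt_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    # dbt test only runs after a successful dbt run
    dbt_run >> dbt_test
```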

Further Improvements (Work In Progress)

  • Create a BranchPythonOperator that first checks whether the API payload is empty: if it is, proceed directly to the end task; otherwise continue to the downstream tasks (see the sketch after this list).
  • Implement data quality checks to catch any potential errors in the dataset
  • Create unit tests to ensure the pipeline is running as intended
  • Include CI/CD
  • Create more visualizations to uncover further insights once Spotify sends back my entire song listening history from 10+ years ago to the current date (this had to be requested separately, since the current API only allows requesting the 50 most recently played tracks)
  • If and when Spotify allows requesting historical data, implement backfill capability
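
For the first item above, the branch could look roughly like this, assuming the extract task pushes the API payload (or its row count) to XCom; task ids are illustrative:

```python
from airflow.operators.python import BranchPythonOperator


def choose_branch(**context) -> str:
    """Skip the load/transform tasks when the Spotify API returned no new plays."""
    payload = context["ti"].xcom_pull(task_ids="extract_and_load", key="payload")
    if not payload or not payload.get("items"):
        return "end"           # nothing new: jump straight to the end task
    return "load_to_postgres"  # otherwise continue to the downstream tasks


# Inside the `with DAG(...)` block:
# check_payload = BranchPythonOperator(task_id="check_payload", python_callable=choose_branch)
# extract_and_load >> check_payload >> [load_to_postgres, end]
```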
