<a href="https://colab.research.google.com/github/bmill42/streaming-data/blob/main/Getting_track_genres_from_Spotify_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

Install spotipy, import the libraries we need, and set up the credentials to get data from the Spotify API. **Make sure to insert your own `client_id` and `client_secret`.**

Then load the listening data. Again, replace the filepath with the path to your own data.

In [None]:
!pip install spotipy

In [None]:
import pandas as pd
from google.colab import drive
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

drive.mount('/content/drive')

client_id = ''
client_secret = ''

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [None]:
df = pd.read_json('/content/drive/MyDrive/COMPFOR 304/Data/BAM - Streaming_History_Audio_2013-2024.json')

# Getting genres

Spotify assigns genres only to *artists*, not to tracks, but our listening data only describes the tracks we've listened to: it contains track URIs but not artist URIs.

This means that it's a multi-step process to get from track data to genre data:

1. Use track URIs in our listening data to retrieve the rest of the metadata for each track, including artist URIs
2. Use artist URIs to retrieve the rest of the metadata for each artist, including genre labels
3. Attach the genre labels back to the track URIs via the artist URIs
4. Attach the genre labels back to the original dataset via the track URIs

Spotify only lets us request a limited number of items at once via the API (up to 50 tracks or 50 artists per request), so if we want genre info for many tracks we need some extra code to split our data into small chunks, request each chunk's info one by one, and then put all the data back together again.

In [None]:
import time

def get_tracks_info(uris, sp_object, batch_size=50):
    """Fetch track metadata in batches of at most `batch_size` URIs."""
    uris = list(uris)  # Accept a pandas Series or any other iterable
    all_tracks = []
    for i in range(0, len(uris), batch_size):
        batch_uris = uris[i:i + batch_size]  # Get a batch of URIs
        tracks_batch = sp_object.tracks(batch_uris)  # Fetch track info for the batch
        all_tracks.extend(tracks_batch["tracks"])  # Add batch results to the main list
        time.sleep(0.5)  # Pause briefly between requests to avoid rate limiting
    return all_tracks

def get_artists_info(uris, sp_object, batch_size=50):
    """Fetch artist metadata in batches of at most `batch_size` URIs."""
    uris = list(uris)  # Accept a pandas Series or any other iterable
    all_artists = []
    for i in range(0, len(uris), batch_size):
        batch_uris = uris[i:i + batch_size]  # Get a batch of URIs
        artists_batch = sp_object.artists(batch_uris)  # Fetch artist info for the batch
        all_artists.extend(artists_batch["artists"])  # Add batch results to the main list
        time.sleep(0.5)  # Pause briefly between requests to avoid rate limiting
    return all_artists
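
To see the batching logic without calling the API, here is a minimal sketch using a stand-in client. `chunked` and `FakeClient` are hypothetical names invented for this illustration and are not part of spotipy. The only thing borrowed from the real client is the response shape the functions above rely on: a dict with a `"tracks"` key.

```python
def chunked(items, batch_size=50):
    """Yield successive slices of at most batch_size items."""
    items = list(items)
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


class FakeClient:
    """Hypothetical stand-in for the spotipy client (not part of spotipy)."""

    def __init__(self):
        self.calls = []  # Records the size of each batch requested

    def tracks(self, uris):
        self.calls.append(len(uris))
        return {"tracks": [{"uri": u} for u in uris]}


fake_uris = [f"spotify:track:{n}" for n in range(120)]  # Made-up URIs
client = FakeClient()

results = []
for batch in chunked(fake_uris, batch_size=50):
    results.extend(client.tracks(batch)["tracks"])

print(client.calls)   # [50, 50, 20]
print(len(results))   # 120
```

With 120 URIs and a batch size of 50, the data goes out in three requests, and the combined result has one entry per URI.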


### Narrowing down the data

We'll test this on a subset of the rows using `head()` so we don't request genres for all 14k+ tracks.

**IF USING THIS CODE WITH YOUR OWN DATA:** You won't want to just request genre info for the first 200 tracks. But I also highly recommend NOT asking for genres for every single track you've ever listened to, as this may take a long time or cause errors with the API.

Instead, you should filter your data down to a more manageable subset first, whether that's "all the tracks I listened to while walking to class" or "all the tracks I listened to in the past year," etc. Replace `tracks_subset` with that data when doing this on your own.

In [None]:
tracks_subset = df.head(200)
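
As one illustration of a more meaningful filter, the sketch below keeps only plays on or after a cutoff date. It runs on a small made-up table rather than the real data, and it assumes the `ts` column holds ISO 8601 timestamp strings in UTC. The cutoff date is arbitrary.

```python
import pandas as pd

# Toy stand-in for the listening history; all values are made up.
toy = pd.DataFrame({
    "ts": ["2022-03-01T08:15:00Z", "2023-11-20T17:40:00Z", "2024-02-05T09:05:00Z"],
    "spotify_track_uri": ["spotify:track:aaa", "spotify:track:bbb", "spotify:track:ccc"],
})

# Parse the timestamp strings so they can be compared as dates.
toy["ts"] = pd.to_datetime(toy["ts"], utc=True)

# Keep only plays on or after the (arbitrary) cutoff date.
cutoff = pd.Timestamp("2023-06-01", tz="UTC")
recent = toy[toy["ts"] >= cutoff]

print(len(recent))  # 2
```

A table filtered this way could then take the place of `df.head(200)` when building `tracks_subset`.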

The `uris` variable stores the Spotify URIs for the tracks in the dataframe. Since many tracks will appear multiple times in the full dataset, we drop duplicates here so we only request each song's data once. We also drop missing values, since rows without a track URI (podcast episodes, for example) can't be looked up as tracks.

In [None]:
uris = tracks_subset.spotify_track_uri.dropna().drop_duplicates()

## Step 1: Get track metadata

Now we ask for the track info from the API, which will contain the artists for each track.

In [None]:
tracks = get_tracks_info(uris, sp)

Then we put together a table that contains only the track URIs and the artist URIs, so we don't lose track of which ones go together.

In [None]:
artist_lists = [d['artists'] for d in tracks]
track_uris = [d['uri'] for d in tracks]
artist_ids = []

for tn in range(len(artist_lists)):
    for a in artist_lists[tn]:
        artist_ids.append({'track_uri': track_uris[tn], 'artist_uri': a['uri']})

track_artist_uris = pd.DataFrame(artist_ids)

Notice how we now have multiple copies of some track URIs, for any tracks with multiple artists:

In [None]:
track_artist_uris

## Step 2: Get artist metadata

Now we can ask the API for the info for each artist using the `artist_uri` column.

In [None]:
artist_info = get_artists_info(track_artist_uris.artist_uri.drop_duplicates(), sp)

Next we pull the genres out of the API data and create a new table that links artist URIs to their lists of genres.

In [None]:
artists_genres = [{'artist_uri': a['uri'], 'genres': a['genres']} for a in artist_info]
genres_table = pd.DataFrame(artists_genres).drop_duplicates('artist_uri')

The resulting table has exactly one row for each artist, each paired with a list of genres:

In [None]:
genres_table

## Step 3: Attach genre info to tracks

Now we **merge** the genres onto the track URIs. In other words, we take our `track_artist_uris` table from above and use the artist URI to connect the genres from the `genres_table` back to the track URIs.

In [None]:
tracks_with_genres = pd.merge(track_artist_uris, genres_table, how='left', on='artist_uri')

Notice here that we're back to having multiple rows for some tracks if they had multiple artists:

In [None]:
tracks_with_genres

Since we're trying to associate genres with tracks, it would make more sense to have a single list of genres for each track, rather than separate genre lists for each artist.

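Using `groupby`, we can add together (that is, concatenate) all the genre lists associated with each individual track. The result will be a table with one row per track containing the track URI and its genre list. Note that if two artists on a track share a genre, that genre will appear more than once in the combined list.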

In [None]:
tracks_with_genres = tracks_with_genres.groupby('track_uri').agg({'genres': 'sum'}).reset_index()
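
Summing lists concatenates them, so a genre shared by two artists on the same track shows up twice. If that matters for your analysis, one possible variation is to combine the lists while keeping only the first occurrence of each genre. The sketch below uses a made-up table, and `combine_unique` is a hypothetical helper written for this example.

```python
import pandas as pd

# Toy version of the track/artist/genre table; values are made up.
toy = pd.DataFrame({
    "track_uri": ["t1", "t1", "t2"],
    "genres": [["indie pop", "dream pop"], ["dream pop"], ["jazz"]],
})


def combine_unique(genre_lists):
    """Concatenate genre lists, keeping the first occurrence of each genre."""
    seen = []
    for genres in genre_lists:
        for g in genres:
            if g not in seen:
                seen.append(g)
    return seen


combined = toy.groupby("track_uri")["genres"].agg(combine_unique).reset_index()
print(combined)
```

Here track `t1` ends up with `['indie pop', 'dream pop']` rather than a list that repeats `dream pop`.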

In [None]:
tracks_with_genres

## Step 4: Attach genres back to the full dataset

From here, we can merge the genres back into the larger dataset. All we need to do is line up the track URIs from the original table with the ones in our new `tracks_with_genres` table.

Note that the column name for the track URI is different in the two tables even though they contain the same information, so we use the `left_on` and `right_on` arguments for the merge.

The resulting table looks like our original dataset with a `genres` column added (plus a redundant `track_uri` column carried over from the merge).

In [None]:
full_table = pd.merge(tracks_subset, tracks_with_genres, left_on='spotify_track_uri', right_on='track_uri', how='left')
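
When the key columns have different names, a merge keeps both of them in the result. The sketch below shows this on a small made-up pair of tables and then drops the duplicate key. Whether to drop it in the real data is a matter of preference.

```python
import pandas as pd

# Toy listening history and genre table; values are made up.
plays = pd.DataFrame({
    "ts": ["2024-01-01T10:00:00Z", "2024-01-01T10:04:00Z", "2024-01-02T09:00:00Z"],
    "spotify_track_uri": ["t1", "t2", "t1"],
})
genres = pd.DataFrame({
    "track_uri": ["t1", "t2"],
    "genres": [["indie pop"], ["jazz"]],
})

merged = pd.merge(plays, genres, left_on="spotify_track_uri", right_on="track_uri", how="left")

# Both key columns survive the merge; drop the redundant one.
merged = merged.drop(columns="track_uri")

print(merged.columns.tolist())  # ['ts', 'spotify_track_uri', 'genres']
print(len(merged))              # 3
```

Because this is a left merge, every row of the listening history is kept, including repeat plays of the same track.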

In [None]:
full_table

# Moving forward: making the data easier to work with

There is still a small problem with this dataset, which is that the genre column contains *lists* of genres. How we engage with this information will depend on exactly what questions we're trying to answer, but let's assume that we want to examine our listening history purely by genre.

We can start by cutting our table down to just the timestamp and the genre list:

In [None]:
df_genre_ts = full_table[['ts', 'genres']]

We can interpret this table as saying, "at X time and date, I was listening to genres [A, B, C], etc."

In [None]:
df_genre_ts

But it'll be even easier to work with this for most purposes if we expand those genre lists out so that each genre gets its own row.

The pandas `explode` method will do exactly that:

In [None]:
df_genre_ts = df_genre_ts.explode('genres', ignore_index=True)
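
One detail worth knowing: `explode` turns an empty list into a single row with a missing value, which matters here because some artists have no genres listed. The sketch below demonstrates this on made-up data and shows one optional way to discard those rows.

```python
import pandas as pd

# Toy table: the second play belongs to a track with no genres.
toy = pd.DataFrame({
    "ts": ["2024-01-01T10:00:00Z", "2024-01-01T10:04:00Z"],
    "genres": [["indie pop", "dream pop"], []],
})

exploded = toy.explode("genres", ignore_index=True)

print(len(exploded))                    # 3
print(exploded["genres"].isna().sum())  # 1

# Optionally drop the rows that have no genre at all.
cleaned = exploded.dropna(subset=["genres"])
print(len(cleaned))                     # 2
```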

Looking at the resulting data, we now have a column of timestamps and a column of *individual* genres. From here, we can easily apply the kinds of timestamp manipulations and filters that we covered previously.

In [None]:
df_genre_ts
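
As one example of where this can go, the sketch below counts how often each genre appears per year. It uses a small made-up table in the same shape as `df_genre_ts` before exploding, and the `year` column is a name chosen for this illustration.

```python
import pandas as pd

# Toy table of timestamps and genre lists; values are made up.
toy = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2023-05-01T10:00:00Z", "2023-06-01T10:00:00Z", "2024-01-01T10:00:00Z"],
        utc=True,
    ),
    "genres": [["indie pop", "dream pop"], ["indie pop"], ["jazz"]],
})

exploded = toy.explode("genres", ignore_index=True)

# Extract the year from each timestamp, then count genres within each year.
exploded["year"] = exploded["ts"].dt.year
genre_counts = exploded.groupby("year")["genres"].value_counts()

print(genre_counts)
```

In this toy data, `indie pop` is counted twice for 2023, while `dream pop` and `jazz` each appear once in their respective years.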