<a href="https://colab.research.google.com/github/bmill42/streaming-data/blob/main/Getting_track_genres_from_Spotify_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

Install spotipy, import the libraries we need, and set up the credentials to get data from the Spotify API. **Make sure to insert your own `client_id` and `client_secret`.**

Then load the listening data. Again, replace the filepath with the path to your own data.

In [None]:
!pip install spotipy

In [None]:
import pandas as pd
from google.colab import drive
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

drive.mount('/content/drive')

client_id = ''
client_secret = ''

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

**NOTE:** if you're using Apple Music data, skip this cell and move on to "Getting Features for Apple Music Data" below.

In [None]:
df = pd.read_json('/content/drive/MyDrive/COMPFOR 304/Data/BAM - Streaming_History_Audio_2013-2024.json')

# Getting genres

We'll test this on a subset of the rows using `head()` so we don't request 14k+ tracks' genres.

The `uri_subset` variable stores the Spotify URIs for the tracks in the dataframe. Since many tracks appear multiple times in the full dataset, we drop duplicates here so we only request each track's info once.

In [None]:
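tracks_subset = df.head() # replace df.head() with your own data (e.g. df) once this works
# Drop rows with no track URI (the API call fails on missing values), then deduplicate
uri_subset = tracks_subset.spotify_track_uri.dropna().drop_duplicates()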
uri_subset
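
To see what the deduplication step does, here is a minimal sketch on a made-up listening history. The URIs and timestamps are invented for illustration, and the `dropna()` step assumes your history may contain rows without a track URI (for example, entries that are not music tracks).

```python
import pandas as pd

# Made-up listening history: one track played twice, one row with no track URI.
toy_history = pd.DataFrame({
    'ts': ['2024-01-01T10:00:00Z', '2024-01-01T10:04:00Z',
           '2024-01-02T09:00:00Z', '2024-01-02T09:30:00Z'],
    'spotify_track_uri': ['spotify:track:AAA', 'spotify:track:BBB',
                          'spotify:track:AAA', None],
})

# Remove missing URIs first, then keep one copy of each remaining URI.
toy_uris = toy_history.spotify_track_uri.dropna().drop_duplicates()
print(toy_uris.tolist())  # ['spotify:track:AAA', 'spotify:track:BBB']
```

Four plays turn into two API lookups here; on a full history the saving is usually much larger.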

We need to get the artists for each track, then get the genres for each artist, then merge that information back to the track URIs.

First we ask for the track info from the API, which will contain the artists for the track.

In [None]:
tracks = sp.tracks(uri_subset)
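
The call above works for a handful of tracks, but Spotify's several-tracks and several-artists endpoints accept at most 50 IDs per request, so a full dataset has to be requested in batches. The sketch below shows one way to do that. `chunked`, `fetch_in_batches`, and `fake_tracks` are hypothetical helpers written for this example (they are not part of spotipy), and `fake_tracks` stands in for `sp.tracks` so the code runs without credentials.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(fetch, uris, key, size=50):
    """Call `fetch` once per batch and concatenate the lists stored under `key`."""
    results = []
    for batch in chunked(uris, size):
        results.extend(fetch(batch)[key])
    return results


def fake_tracks(batch):
    """Stand-in for sp.tracks: returns a dict shaped like {'tracks': [...]}."""
    return {'tracks': [{'uri': uri} for uri in batch]}


demo_uris = [f'spotify:track:{n}' for n in range(120)]
demo_tracks = fetch_in_batches(fake_tracks, demo_uris, key='tracks')
print(len(demo_tracks))  # 120, fetched in batches of 50, 50, and 20
```

With a real client, something like `fetch_in_batches(sp.tracks, uri_subset, key='tracks')` should return a plain list equivalent to `tracks['tracks']`, and the same pattern should apply to `sp.artists` with `key='artists'`.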

Now we put together a table that contains only the track URIs and the artist URIs, so we don't lose track of which ones go together.

In [None]:
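# Each entry in tracks['tracks'] describes one track, including its list of artists
artist_lists = [d['artists'] for d in tracks['tracks']]
track_uris = [d['uri'] for d in tracks['tracks']]
artist_ids = []

# Build one row per (track, artist) pair so multi-artist tracks keep all their artists
for track_uri, artists in zip(track_uris, artist_lists):
    for a in artists:
        artist_ids.append({'track_uri': track_uri, 'artist_uri': a['uri']})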

In [None]:
track_artist_uris = pd.DataFrame(artist_ids)
track_artist_uris
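
If the nested structure of the API response is hard to picture, this sketch builds the same track/artist pairs from a hand-written stand-in. `toy_tracks` only imitates the two fields used here (`uri` and `artists`); a real response contains many more, and all the URIs are invented.

```python
# Hand-written stand-in for the shape of the sp.tracks() response.
toy_tracks = {
    'tracks': [
        {'uri': 'spotify:track:AAA',
         'artists': [{'uri': 'spotify:artist:X'}]},
        {'uri': 'spotify:track:BBB',
         'artists': [{'uri': 'spotify:artist:X'}, {'uri': 'spotify:artist:Y'}]},
    ]
}

# One dict per (track, artist) pair; a two-artist track produces two dicts.
toy_pairs = [
    {'track_uri': track['uri'], 'artist_uri': artist['uri']}
    for track in toy_tracks['tracks']
    for artist in track['artists']
]

for pair in toy_pairs:
    print(pair)
```

Two tracks produce three pairs, because the second track has two artists.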

Now we can ask the API for the info for each artist using the `artist_uri` column.

In [None]:
# An artist can appear on several tracks, so request each artist only once
artist_info = sp.artists(track_artist_uris.artist_uri.drop_duplicates())

Next we pull the genres out of the API data and create a new table that links artist URIs to their lists of genres.

In [None]:
artists_genres = [{'artist_uri': a['uri'], 'genres': a['genres']} for a in artist_info['artists']]
genres_table = pd.DataFrame(artists_genres).drop_duplicates('artist_uri')
genres_table

Now we just merge the genres onto the track URIs using the artist URIs.

In [None]:
tracks_with_genres = pd.merge(track_artist_uris, genres_table, how='left', on='artist_uri')
tracks_with_genres
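
Here is a small sketch of what the left merge does, using invented URIs and genres. Every row of the track/artist table is kept, and each one picks up the genre list of its artist.

```python
import pandas as pd

# Invented track/artist pairs: track BBB has two artists.
toy_track_artists = pd.DataFrame({
    'track_uri': ['spotify:track:AAA', 'spotify:track:BBB', 'spotify:track:BBB'],
    'artist_uri': ['spotify:artist:X', 'spotify:artist:X', 'spotify:artist:Y'],
})

# Invented genre lookup: artist Y has no genres listed.
toy_genres = pd.DataFrame({
    'artist_uri': ['spotify:artist:X', 'spotify:artist:Y'],
    'genres': [['indie rock', 'shoegaze'], []],
})

# how='left' keeps every track/artist row, even if an artist had no match.
toy_merged = pd.merge(toy_track_artists, toy_genres, how='left', on='artist_uri')
print(toy_merged)
```

Artist X's genres are repeated on both of X's rows, while artist Y's row carries an empty list.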

# Finalizing the table and cleaning up the genre data

From here, we can merge the genres back into the larger dataset. Note that the column name for the track URI is different in the two tables even though they contain the same information, so we pass the `left_on` and `right_on` arguments to the merge.

The resulting table looks like our original dataset, but with artist URIs and genres added (plus a redundant `track_uri` column that duplicates `spotify_track_uri`)!

**Note**, however, that we now have multiple copies of tracks that had multiple artists. You may want to keep only the first instance of each timestamped entry, which would keep only one artist per track. Whether you want to do this depends on your goals.

In [None]:
full_table = pd.merge(tracks_subset, tracks_with_genres, left_on='spotify_track_uri', right_on='track_uri')
# Preview with one row per play; this does not change full_table unless you assign the result
full_table.drop_duplicates('ts')
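
The effect of multi-artist tracks on the row count is easy to check on toy data. The sketch below uses invented plays and genres, and the names `toy_plays`, `toy_track_genres`, `toy_full`, and `toy_first` exist only for this example.

```python
import pandas as pd

# Two invented plays; the second track has two artists.
toy_plays = pd.DataFrame({
    'ts': ['2024-01-01T10:00:00Z', '2024-01-01T10:04:00Z'],
    'spotify_track_uri': ['spotify:track:AAA', 'spotify:track:BBB'],
})
toy_track_genres = pd.DataFrame({
    'track_uri': ['spotify:track:AAA', 'spotify:track:BBB', 'spotify:track:BBB'],
    'artist_uri': ['spotify:artist:X', 'spotify:artist:X', 'spotify:artist:Y'],
    'genres': [['indie rock'], ['indie rock'], ['dream pop']],
})

toy_full = pd.merge(toy_plays, toy_track_genres,
                    left_on='spotify_track_uri', right_on='track_uri')
print(len(toy_full))   # 3: the two-artist play now occupies two rows

# Keep the first row for each timestamp, i.e. one artist per play.
toy_first = toy_full.drop_duplicates('ts')
print(len(toy_first))  # 2: back to one row per play
```

Dropping duplicates discards the second artist's genres for that play, which is why the choice depends on what you plan to count.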

Another issue is that the genres are in the form of a Python list. You may want to pull out just the first genre from each list. Artists with no genres listed have an empty list, so their `primary_genre` will be `NaN`:

In [None]:
full_table['primary_genre'] = full_table.genres.str[0]
full_table
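
As a quick check of how `.str[0]` behaves on list values, here is a sketch on invented genre lists. It also shows `explode`, a standard pandas method that offers one possible alternative if you would rather keep every genre as its own row; whether that suits your analysis is a judgment call.

```python
import pandas as pd

# Invented genre lists, including an artist with no genres.
toy = pd.DataFrame({'genres': [['indie rock', 'shoegaze'], [], ['dream pop']]})

# .str[0] takes the first element of each list; empty lists give NaN.
toy['primary_genre'] = toy.genres.str[0]
print(toy.primary_genre.tolist())

# Alternative: one row per genre (an empty list becomes a single NaN row).
toy_long = toy.explode('genres')
print(len(toy_long))  # 4 rows: 2 + 1 + 1
```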