# Track Data Enriching

**Main objective:** include the artist's `main genre` and the `album image URI` in the dataset, expanding the analysis possibilities. Also, the `platform` column is normalized, and the detailed platform information is removed from the dataset for privacy and security reasons.

For the *data enriching* steps, I used the [Spotipy](https://github.com/spotipy-dev/spotipy) library as the interface to [Spotify's Web API](https://developer.spotify.com/documentation/web-api).

Regarding the data, I used the **Extended streaming history** download, which can be found under the *Privacy* settings of your Spotify account. For more information on how to get yours, here is a [simple tutorial from Quora](https://www.quora.com/How-can-I-download-my-Spotify-data).

*Disclaimer*: this is my own data, thus, my own music taste and Spotify usage. I am not responsible if your musical taste is not as good as mine 😂

## Imports and initializations

In [1]:
import pandas as pd
import spotipy
import time

from spotipy.oauth2 import SpotifyClientCredentials

In [2]:
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())
sp

<spotipy.client.Spotify at 0x11e3b0f50>
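
`SpotifyClientCredentials()` is called without arguments above, so Spotipy looks for the app credentials in the `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` environment variables. A small pre-flight check, sketched below with a hypothetical helper name, can fail early with a clearer message if they are missing:

```python
import os

REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")

def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    `missing_credentials` is a hypothetical helper, not part of Spotipy.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Usage: run before creating the client
# missing = missing_credentials()
# if missing:
#     raise RuntimeError(f"Set these environment variables first: {missing}")
```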

In [3]:
tracks_history = pd.read_csv('consolidated_spotify_data_2024.csv')
tracks_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189780 entries, 0 to 189779
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   timestamp_play     189780 non-null  object
 1   platform           189780 non-null  object
 2   ms_played          189780 non-null  int64 
 3   track_name         189780 non-null  object
 4   artist_name        189780 non-null  object
 5   album_name         189780 non-null  object
 6   spotify_track_uri  189780 non-null  object
 7   reason_start       189713 non-null  object
 8   reason_end         151097 non-null  object
dtypes: int64(1), object(8)
memory usage: 13.0+ MB


In [4]:
tracks_history.head(5)

Unnamed: 0,timestamp_play,platform,ms_played,track_name,artist_name,album_name,spotify_track_uri,reason_start,reason_end
0,2022-01-22T12:59:36Z,"Android OS 11 API 30 (Xiaomi, M2101K6G)",2567,GALOPA,PEDRO SAMPAIO,GALOPA,spotify:track:2wG1R0uDFwyobcWzVssC1J,clickrow,endplay
1,2022-01-22T13:00:24Z,"Android OS 11 API 30 (Xiaomi, M2101K6G)",47653,A QUEDA,Gloria Groove,A QUEDA,spotify:track:2s9BO8c0co0PmgBiUoTT17,clickrow,endplay
2,2022-01-22T13:04:16Z,"Android OS 11 API 30 (Xiaomi, M2101K6G)",218223,Convite de Casamento,Luan Santana,Confraternização Família Santana 2,spotify:track:3BDnZJC9yaj01jtqpyDYzG,clickrow,endplay
3,2022-01-22T13:05:17Z,"Android OS 11 API 30 (Xiaomi, M2101K6G)",61025,Acordando o Prédio,Luan Santana,Acordando o Prédio,spotify:track:40bK2uosUmAS92c17n98xd,clickrow,endplay
4,2022-01-22T13:05:41Z,"Android OS 11 API 30 (Xiaomi, M2101K6G)",23375,Juntos,Paula Fernandes,Juntos,spotify:track:2PIWldP9oUhb9iqe3EJh1e,clickrow,endplay


## Data Enriching

There are some interesting facts about this part, as I am filling in the notebook after completing the development:

I found out, the hard way, that APIs have *request rate limits*. That is, on the first try, I sent around 4k requests to **Spotify's API**, which is impractical and could have gotten my API access cut (luckily, this did not happen).

**Datasets are unique**, so knowing their *structure and context* is fundamental. This is an **extended streaming history**, and knowing my own musical habits, I know that most tracks here were played hundreds of times during the last 10 years (with a focus on Linkin Park and Green Day). That is, instead of sending a request for every play in the list (about 190k rows), I can deduplicate them and send only one request per unique track. Of course, it is probably still a big number, but already better than the initial one.
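
To make the idea concrete, here is a toy sketch (the URIs are made-up placeholders, not rows from my dataset) showing how repeated plays collapse into a much shorter list of unique tracks to request:

```python
from collections import Counter

# Hypothetical play log: each entry is the URI of one play
plays = [
    "spotify:track:AAA",
    "spotify:track:BBB",
    "spotify:track:AAA",
    "spotify:track:AAA",
    "spotify:track:CCC",
    "spotify:track:BBB",
]

play_counts = Counter(plays)

# dict.fromkeys keeps first-seen order while dropping duplicates
unique_uris = list(dict.fromkeys(plays))

print("Plays:", len(plays))
print("Unique tracks:", len(unique_uris))
print("Most played:", play_counts.most_common(1))
```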

In [5]:
unique_artists = tracks_history.groupby('artist_name')[['spotify_track_uri']].first().reset_index()
print('Artists:', len(unique_artists))
unique_artists.sample(10)

Artists: 9480


Unnamed: 0,artist_name,spotify_track_uri
4355,Kilavista,spotify:track:436tA4Teo7CPNadBzUihMV
7348,Sickly,spotify:track:7GXO38sbsOtZcL4C2HCEMa
1537,Chon,spotify:track:2FDSwTY9PHc5dS8jWzXUq0
5442,Melvin Taylor,spotify:track:1pXIOp6Vl3QMvvIPFCjj2s
8562,Toploader,spotify:track:1FHNctV68GUNLgXclG2DtR
2682,F4ST,spotify:track:7tNAeLkrsnOqKXWTmd8cQp
2507,Elephanz,spotify:track:6gj7kH97i7Iy1JQWXk7vS8
4021,Joker & Sequence,spotify:track:2SBahFRVnfbLkhaP7SsSTN
2980,GNR,spotify:track:1nnFZioZrw6CRfiUk8KIEE
4855,Luciano Pavarotti,spotify:track:3KEx71XWs0qpcqChWzhRyn


Now, instead of sending a request for each track as I first tried with the [Get Track endpoint](https://developer.spotify.com/documentation/web-api/reference/get-track), I discovered I can actually send a batch of track IDs to [Get Several Tracks](https://developer.spotify.com/documentation/web-api/reference/get-several-tracks) at once. *Isn't this beautiful?*

[Spotify's documentation](https://developer.spotify.com/documentation/web-api/reference/get-several-tracks) says it has a limit of 100 IDs per request. However, [Spotipy's documentation](https://spotipy.readthedocs.io/en/2.16.1/#spotipy.client.Spotify.tracks) says the limit is 50 IDs. I tried both, but only 50 worked, so let's stay within this limit.

Now, with the `batch_size` decided, the next step is to create the batches for later use.

In [6]:
unique_tracks = tracks_history[['spotify_track_uri']].drop_duplicates()
unique_tracks_ids = unique_tracks['spotify_track_uri'].tolist()

# Spotipy's documentation says the limit is 50 IDs
batch_size = 50

track_uris_batches = []
for i in range(0, len(unique_tracks_ids), batch_size):
    track_uris_batches.append(unique_tracks_ids[i:i+batch_size])

print('Batches:', len(track_uris_batches))

Batches: 690
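
As a cross-check on the number above: 690 batches of at most 50 IDs is consistent with the 34,488 unique tracks that show up later in the enriched DataFrame, since ceil(34488 / 50) = 690. The same slicing can be wrapped in a small reusable helper; `chunked` is a hypothetical name, not a pandas or Spotipy function:

```python
import math

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder IDs, same count as the unique tracks in this notebook
ids = [f"id_{n}" for n in range(34488)]
batches = list(chunked(ids, 50))

print(len(batches), math.ceil(len(ids) / 50))
print(len(batches[-1]))  # size of the last, partially filled batch
```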


Here is the logic to get the information needed. First, we take each *batch* and make an API request. With the *response tracks* in hand, we can go through each track and then extract the information intended.

1. **Album cover URI:** this is already included in the *track* JSON. As Spotify's response includes several URLs for different image sizes, we just take the first one, which should be the largest one available, since Spotify lists images widest first (if any image is included).

2. **Artist:** unfortunately, the *artist's genre* is not included in the *track* JSON. We only have access to *artist's ID* here, so this will be used later to make another request.
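
The image objects in the response also carry `width` and `height` fields, so a size can be picked explicitly instead of relying on list position. A minimal sketch, using a made-up response fragment (the URLs and the `pick_image` helper are illustrative assumptions):

```python
def pick_image(images, smallest=True):
    """Return the URL of the smallest (or largest) image, or None if there are none."""
    if not images:
        return None
    # `width` can be missing or null, so treat it as 0 in that case
    by_width = sorted(images, key=lambda img: img.get("width") or 0)
    chosen = by_width[0] if smallest else by_width[-1]
    return chosen["url"]

# Hypothetical album fragment shaped like the `album` part of a track response
album = {
    "images": [
        {"url": "https://example.com/cover_640.jpg", "width": 640, "height": 640},
        {"url": "https://example.com/cover_300.jpg", "width": 300, "height": 300},
        {"url": "https://example.com/cover_64.jpg", "width": 64, "height": 64},
    ]
}

print(pick_image(album["images"]))                  # smallest cover
print(pick_image(album["images"], smallest=False))  # largest cover
print(pick_image([]))                               # no images available
```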

As I mentioned before, Spotify's API has a rate limit based on a 30-second window. I was not able to find exactly what this limit is, but I thought it was a good idea to keep my requests under 4 per second (one every 250 ms). That is why I added a `request_delay` variable that controls the `time.sleep` call after each request.
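
A fixed delay keeps the request rate low, but it does not react if the API still answers with HTTP 429 (Too Many Requests), which Spotify normally sends along with a `Retry-After` header. The sketch below is one possible way to handle that, not what this notebook does: `RateLimited` and `call_with_retry` are hypothetical names, and the wrapped function stands in for a call such as `sp.tracks(batch)`. Spotipy's client also appears to expose its own retry-related options, so it is worth checking its documentation before adding a wrapper like this.

```python
import time

class RateLimited(Exception):
    """Hypothetical error carrying the server's suggested wait time in seconds."""
    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after

def call_with_retry(fn, max_attempts=3, sleep=time.sleep):
    """Call `fn`, waiting `retry_after` seconds whenever it raises RateLimited."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            sleep(err.retry_after)

# Demo with a fake request that is rejected twice before succeeding
attempts = {"count": 0}

def fake_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited(retry_after=1)
    return {"tracks": []}

waits = []
# Record the waits in a list instead of actually sleeping
result = call_with_retry(fake_request, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter is only there to make the demo instant; in real use the default `time.sleep` would apply.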

In [7]:
tracks_enriched = []
counter = 0
request_delay = 0.25 # Delay (in seconds) between requests to avoid API rate limitation

for batch in track_uris_batches:
    tracks = sp.tracks(batch)['tracks']
    
    for track in tracks:
        artist_id = track['artists'][0]['id']
        album = track['album']

        # Get the album cover URL from the first image (Spotify lists images widest first)
        album_cover_url = album['images'][0]['url'] if album['images'] else None
        
        tracks_enriched.append({
            'spotify_track_uri': track['uri'],
            'track_id': track['id'],
            'artist_id': artist_id,
            'album_cover_url': album_cover_url
        })
        
    # Counting and printing for progress tracking
    counter += 1
    
    if counter % 10 == 0:
        print(f'Progress: {counter} out of {len(track_uris_batches)} ({int(counter / len(track_uris_batches) * 100)}%)')
    
    time.sleep(request_delay)

print(f'Processed a total of {counter} batches.')

Progress: 10 out of 690 (1%)
Progress: 20 out of 690 (2%)
Progress: 30 out of 690 (4%)
Progress: 40 out of 690 (5%)
Progress: 50 out of 690 (7%)
Progress: 60 out of 690 (8%)
Progress: 70 out of 690 (10%)
Progress: 80 out of 690 (11%)
Progress: 90 out of 690 (13%)
Progress: 100 out of 690 (14%)
Progress: 110 out of 690 (15%)
Progress: 120 out of 690 (17%)
Progress: 130 out of 690 (18%)
Progress: 140 out of 690 (20%)
Progress: 150 out of 690 (21%)
Progress: 160 out of 690 (23%)
Progress: 170 out of 690 (24%)
Progress: 180 out of 690 (26%)
Progress: 190 out of 690 (27%)
Progress: 200 out of 690 (28%)
Progress: 210 out of 690 (30%)
Progress: 220 out of 690 (31%)
Progress: 230 out of 690 (33%)
Progress: 240 out of 690 (34%)
Progress: 250 out of 690 (36%)
Progress: 260 out of 690 (37%)
Progress: 270 out of 690 (39%)
Progress: 280 out of 690 (40%)
Progress: 290 out of 690 (42%)
Progress: 300 out of 690 (43%)
Progress: 310 out of 690 (44%)
Progress: 320 out of 690 (46%)
Progress: 330 out of 690 (47%)
[output truncated]

In [8]:
df_tracks_enriched = pd.DataFrame(tracks_enriched)
df_tracks_enriched

Unnamed: 0,spotify_track_uri,track_id,artist_id,album_cover_url
0,spotify:track:2wG1R0uDFwyobcWzVssC1J,2wG1R0uDFwyobcWzVssC1J,5wbf52LA6kcaboHSN6NEF1,https://i.scdn.co/image/ab67616d0000b27380dadd...
1,spotify:track:2s9BO8c0co0PmgBiUoTT17,2s9BO8c0co0PmgBiUoTT17,7rXMvXRnWHaSwnVvPeUUfw,https://i.scdn.co/image/ab67616d0000b273ea1781...
2,spotify:track:3BDnZJC9yaj01jtqpyDYzG,3BDnZJC9yaj01jtqpyDYzG,3qvcCP2J0fWi0m0uQDUf6r,https://i.scdn.co/image/ab67616d0000b273e70918...
3,spotify:track:40bK2uosUmAS92c17n98xd,40bK2uosUmAS92c17n98xd,3qvcCP2J0fWi0m0uQDUf6r,https://i.scdn.co/image/ab67616d0000b273d393a0...
4,spotify:track:2PIWldP9oUhb9iqe3EJh1e,2PIWldP9oUhb9iqe3EJh1e,1nca3OA1kKCpP6aPJcBL92,https://i.scdn.co/image/ab67616d0000b2733c7178...
...,...,...,...,...
34484,spotify:track:1n108NGnovKSwv2z0J9Vmw,1n108NGnovKSwv2z0J9Vmw,53XhwfbYqKCa1cC15pYq2q,https://i.scdn.co/image/ab67616d0000b273dee648...
34485,spotify:track:5QbWc5ipIn02XePjT0Tvaf,5QbWc5ipIn02XePjT0Tvaf,7gOdHgIoIKoe4i9Tta6qdD,https://i.scdn.co/image/ab67616d0000b27338f63e...
34486,spotify:track:1xHexNjWpStUnoFCqNLanp,1xHexNjWpStUnoFCqNLanp,3h7RaVXBvdSNa7LXQtVYqH,https://i.scdn.co/image/ab67616d0000b2737027fe...
34487,spotify:track:2bQ9UIx9KQkOybSpuTHaL8,2bQ9UIx9KQkOybSpuTHaL8,5Pwc4xIPtQLFEnJriah9YJ,https://i.scdn.co/image/ab67616d0000b2734c4052...


Ok, so the last enriching step is to add the *artist's genre* to the dataset.

Considering that I already have every `artist_id` from the dataset in the *tracks* DataFrame, I just need to deduplicate them exactly the same way as I did with the tracks.

For the same reason, it is natural to think that most of my listening history is based on a small set of artists/bands, and that only a few artists appear just once in the dataset (probably from LoFi playlists with weird names).

The same logic applies here, as there is also a [Get Several Artists](https://developer.spotify.com/documentation/web-api/reference/get-multiple-artists) endpoint, so I can group the IDs in batches and then make the requests.

In [9]:
artists_unique = df_tracks_enriched['artist_id'].drop_duplicates()
unique_artist_ids = artists_unique.tolist()

artist_id_batches = []
for i in range(0, len(unique_artist_ids), batch_size):
    artist_id_batches.append(unique_artist_ids[i:i+batch_size])

print('Batches:', len(artist_id_batches))

Batches: 190


The `genres` field from Spotify can have multiple values. For instance, one band can be considered both *Hard Rock* and *Classic Rock*, like AC/DC. For the scope of this work, I will only consider the first **genre** available as the "most important" one.

This can be controversial, as an artist/band may be known for a certain genre by most people while Spotify lists a different genre first. I don't know at this point how much this will impact the analysis, but I will keep this simple solution and see what happens later.

Also, another piece of information available here is the artist image URL, so I will extract it as well to use in the visualizations.
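
Here is a toy sketch of this rule, plus an alternative that keeps every genre in case the first one turns out to be a poor summary. The artist dictionaries and both helper names are illustrative assumptions, shaped like the fields used in this notebook rather than copied from a real response:

```python
def main_genre(artist):
    """First listed genre, or None when the list is empty or missing."""
    genres = artist.get("genres") or []
    return genres[0] if genres else None

def all_genres(artist, sep="; "):
    """Every listed genre joined into one string, or None when there are none."""
    genres = artist.get("genres") or []
    return sep.join(genres) if genres else None

# Made-up artists: one with two genres, one with none
artists = [
    {"id": "artist_a", "genres": ["hard rock", "classic rock"]},
    {"id": "artist_b", "genres": []},
]

for artist in artists:
    print(artist["id"], main_genre(artist), all_genres(artist))
```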

In [10]:
artists_enriched = []
counter = 0
request_delay = 0.25 # Delay (in seconds) between requests to avoid API rate limitation

for batch in artist_id_batches:
    artists = sp.artists(batch)['artists']
    
    for artist in artists:

        # Check if there are genres (handle cases where it might be empty)
        artist_genre = artist['genres'][0] if artist['genres'] else None
        artist_image_url = artist['images'][0]['url'] if artist['images'] else None
        
        artists_enriched.append({
            'artist_id': artist['id'],
            'artist_genre': artist_genre,
            'artist_image_url': artist_image_url
        })
        
    # Counting and printing for progress tracking
    counter += 1
    
    if counter % 10 == 0:
        print(f'Progress: {counter} out of {len(artist_id_batches)} ({int(counter / len(artist_id_batches) * 100)}%)')
    
    time.sleep(request_delay)

print(f'Processed a total of {counter} batches.')

Progress: 10 out of 190 (5%)
Progress: 20 out of 190 (10%)
Progress: 30 out of 190 (15%)
Progress: 40 out of 190 (21%)
Progress: 50 out of 190 (26%)
Progress: 60 out of 190 (31%)
Progress: 70 out of 190 (36%)
Progress: 80 out of 190 (42%)
Progress: 90 out of 190 (47%)
Progress: 100 out of 190 (52%)
Progress: 110 out of 190 (57%)
Progress: 120 out of 190 (63%)
Progress: 130 out of 190 (68%)
Progress: 140 out of 190 (73%)
Progress: 150 out of 190 (78%)
Progress: 160 out of 190 (84%)
Progress: 170 out of 190 (89%)
Progress: 180 out of 190 (94%)
Progress: 190 out of 190 (100%)
Processed a total of 190 batches.


In [11]:
df_artists_enriched = pd.DataFrame(artists_enriched)
df_artists_enriched.sample(10)

Unnamed: 0,artist_id,artist_genre,artist_image_url
1993,0vg08N1z9G9LrGLkG1nNDS,lo-fi,https://i.scdn.co/image/ab6761610000e5eb71ee25...
8219,7wbkl3zgDZEoZer357mVIw,bedroom pop,https://i.scdn.co/image/ab6761610000e5eb710f70...
1006,50TqpjP2iRI4hR1wCfVj3w,reggaeton,https://i.scdn.co/image/ab6761610000e5ebee235a...
1452,5sUrlPAHlS9NEirDB8SEbF,latin pop,https://i.scdn.co/image/ab6761610000e5eb8f09f1...
1073,4g3JlT9j18DeWH7gr1uF6L,metalcore,https://i.scdn.co/image/ab6761610000e5ebeec36c...
353,5hgOjk7FGUwJvo1J3oDK9R,dark ambient,https://i.scdn.co/image/ab6761610000e5eb57a1db...
4948,6kndrupH2JaLYqh1wBKGar,,https://i.scdn.co/image/ab6761610000e5eb9ca6be...
4977,1VBflYyxBhnDc9uVib98rw,,https://i.scdn.co/image/ab6761610000e5ebc66438...
3403,7ChbI909duz2evHDqsYsSa,,https://i.scdn.co/image/ab6761610000e5ebf828dd...
3735,6JL8zeS1NmiOftqZTRgdTz,,https://i.scdn.co/image/ab6761610000e5eb365984...


Having both DataFrames already done with the intended fields added, *now we are two **JOINs** away from the objective*.

In [12]:
df_enriched = df_tracks_enriched.merge(
    df_artists_enriched,
    how='left',
    on='artist_id'
)

df_enriched.sample(10)

Unnamed: 0,spotify_track_uri,track_id,artist_id,album_cover_url,artist_genre,artist_image_url
6559,spotify:track:7eQHxigpuDJjCG50JyzU8v,7eQHxigpuDJjCG50JyzU8v,1bqxdqvUtPWZri43cKHac8,https://i.scdn.co/image/ab67616d0000b2734bb6dc...,,https://i.scdn.co/image/ab6761610000e5eb23e614...
4932,spotify:track:0XRlRULMnFbsm9EhJDHGYW,0XRlRULMnFbsm9EhJDHGYW,3qm84nBOXUEQ2vnTfUTTFC,https://i.scdn.co/image/ab67616d0000b273d2c9d6...,rock,https://i.scdn.co/image/ab6761610000e5eb50defa...
1776,spotify:track:2Q0z8XDKkmk5NDprCwE6FF,2Q0z8XDKkmk5NDprCwE6FF,6xTk3EK5T9UzudENVvu9YB,https://i.scdn.co/image/ab67616d0000b2732fe235...,punk,https://i.scdn.co/image/ab6761610000e5eb9ef133...
32024,spotify:track:531t4BCKZCGCcOine3oGkc,531t4BCKZCGCcOine3oGkc,4l9ufn9GC6LLYXvIanDlLd,https://i.scdn.co/image/ab67616d0000b273e5d41c...,skate punk,https://i.scdn.co/image/ab6761610000e5eb8a3907...
9321,spotify:track:2EPi4anyxVjKTcJwPtntjM,2EPi4anyxVjKTcJwPtntjM,4S2yOnmsWW97dT87yVoaSZ,https://i.scdn.co/image/ab67616d0000b273b00c51...,punk,https://i.scdn.co/image/ab6772690000c46cddcf65...
15829,spotify:track:6ho0GyrWZN3mhi9zVRW7xi,6ho0GyrWZN3mhi9zVRW7xi,1VJ0briNOlXRtJUAzoUJdt,https://i.scdn.co/image/ab67616d0000b2739367c1...,tech house,https://i.scdn.co/image/ab6761610000e5ebcd3fb8...
31987,spotify:track:1ryLj8yxVclPISoSQ1gL6w,1ryLj8yxVclPISoSQ1gL6w,7cU3dIRkY6xry3D6GYqeOg,https://i.scdn.co/image/ab67616d0000b273bda8f6...,,https://i.scdn.co/image/ab6761610000e5eb951bd4...
33572,spotify:track:01MqjMMr7HAbGCSmpXPOlc,01MqjMMr7HAbGCSmpXPOlc,0q74f5UdR3j14Lis4AKKxq,https://i.scdn.co/image/ab67616d0000b273de5f1c...,pop punk,https://i.scdn.co/image/ab6761610000e5ebf1310b...
19284,spotify:track:7Ft93N7aXgfVBgw34LpuiF,7Ft93N7aXgfVBgw34LpuiF,6zXHaJc2ZqAmSyxT606ccM,https://i.scdn.co/image/ab67616d0000b2737f73b5...,skate punk,https://i.scdn.co/image/a9d338fd0941e37f58f605...
1672,spotify:track:6hwQ69v7VbPhTTR2fOtYX7,6hwQ69v7VbPhTTR2fOtYX7,5LfGQac0EIXyAN8aUwmNAQ,https://i.scdn.co/image/ab67616d0000b2732d4c59...,punk,https://i.scdn.co/image/ab6761610000e5eb580938...


*Quick note:* I am using `sample` instead of `head` or `tail` just to make the results more interesting when taking a look at the dataset (and avoid judgement - or not).

In [13]:
tracks_history_enriched = tracks_history.merge(
    df_enriched,
    how='left',
    on='spotify_track_uri'
)

tracks_history_enriched.sample(10)

Unnamed: 0,timestamp_play,platform,ms_played,track_name,artist_name,album_name,spotify_track_uri,reason_start,reason_end,track_id,artist_id,album_cover_url,artist_genre,artist_image_url
15226,2022-10-10T09:44:45Z,"Android OS 12 API 31 (Xiaomi, M2101K6G)",259158,Just a Gigolo / I Ain't Got Nobody - 45 Version,David Lee Roth,Just a Gigolo/I Ain't Got Nobody (45 Version) ...,spotify:track:0QcIphdyZIVgFvW7ijEiPX,trackdone,trackdone,0QcIphdyZIVgFvW7ijEiPX,0KyCXNSa7ZMb5LydfKbLG3,https://i.scdn.co/image/ab67616d0000b273101502...,glam metal,https://i.scdn.co/image/ab6761610000e5eb7b3b79...
105161,2019-01-23T02:01:06Z,Linux [x86-64 0],216226,Desde Quando Você Se Foi,Fresno,Redenção,spotify:track:3Xx8DDgvddsYPnxTH2LXsd,trackdone,trackdone,3Xx8DDgvddsYPnxTH2LXsd,2sFXe6NbmT3k7Qy4N8fE7f,https://i.scdn.co/image/ab67616d0000b273a5c306...,emocore,https://i.scdn.co/image/ab6761610000e5eba4e56a...
56625,2021-08-22T04:44:04Z,"Android OS 11 API 30 (Xiaomi, M2101K6G)",225188,No More Sad Songs (feat. Machine Gun Kelly),Little Mix,Glory Days,spotify:track:0U372D2pbraSnrTFxSYlNj,trackdone,trackdone,0U372D2pbraSnrTFxSYlNj,3e7awlrlDSwF3iM0WBjGMp,https://i.scdn.co/image/ab67616d0000b2733042c5...,,https://i.scdn.co/image/ab6761610000e5eb08cd53...
103539,2018-12-07T15:38:02Z,Partner public_js-sdk harmony-chrome.71-linux,13436,I Write Sins Not Tragedies,Panic! At The Disco,A Fever You Can't Sweat Out,spotify:track:5cY8y2XgOfkAh4kSWLFKkz,playbtn,endplay,5cY8y2XgOfkAh4kSWLFKkz,20JZFwl6HVl6yg8a4H3ZqK,https://i.scdn.co/image/ab67616d0000b273e8b923...,emo,https://i.scdn.co/image/ab6761610000e5ebb256ae...
35530,2017-02-17T13:47:40Z,Windows 10 (10.0.14393; x64),314702,No Sleep Til Cleveland - Live,Prophets Of Rage,No Sleep Til Cleveland,spotify:track:6nfeBDDUAYqcjvIQnERl3p,trackdone,,6nfeBDDUAYqcjvIQnERl3p,1fSzW5cXBmquli5laFnoGY,https://i.scdn.co/image/ab67616d0000b2731ac9ce...,rap metal,https://i.scdn.co/image/ab6761610000e5eb0f23cf...
39526,2017-04-13T20:44:33Z,"Android OS 7.0 API 24 (motorola, Moto G (4))",93121,912 Passos,Dead Fish,Vitória,spotify:track:3RDZTFpR9yUQK5i7388iDf,fwdbtn,,3RDZTFpR9yUQK5i7388iDf,7Lvg39k5XgXevGR767ikYI,https://i.scdn.co/image/ab67616d0000b273e17de8...,brazilian rock,https://i.scdn.co/image/ab6761610000e5ebdcde01...
27329,2023-04-17T18:22:24Z,osx,100383,Darker Than The Light That Never Bleeds - Ches...,Linkin Park,Darker Than The Light That Never Bleeds,spotify:track:7FBaqIBcIWVh0bAWlaQaGa,trackdone,endplay,7FBaqIBcIWVh0bAWlaQaGa,6XyY86QOPPrYVGvF9ch6wz,https://i.scdn.co/image/ab67616d0000b273f25331...,nu metal,https://i.scdn.co/image/ab6761610000e5ebc7e6bd...
9266,2022-06-19T18:33:39Z,OS X 12.4.0 [arm 2],101863,Cigarette Duet,Princess Chelsea,Lil' Golden Book,spotify:track:5zZt5BzzSrUXwytlRatvc2,trackdone,fwdbtn,5zZt5BzzSrUXwytlRatvc2,6SrA4711bML5NvPO13Tr6t,https://i.scdn.co/image/ab67616d0000b27368a2b4...,baroque pop,https://i.scdn.co/image/ab6761610000e5eba23699...
43929,2017-05-16T15:13:07Z,"Android OS 7.0 API 24 (motorola, Moto G (4))",1290,Superbeast,Rob Zombie,Hellbilly Deluxe,spotify:track:2lJvciTEU834w2Yl7VVueJ,fwdbtn,,2lJvciTEU834w2Yl7VVueJ,3HVdAiMNjYrQIKlOGxoGh5,https://i.scdn.co/image/ab67616d0000b273ae17cb...,industrial metal,https://i.scdn.co/image/ab6761610000e5eb40a6cf...
81965,2020-04-14T18:46:31Z,OS X 10.15.4 [x86 8],244364,Fora da Regra,Perna Leiga,Soca Porva,spotify:track:7CCeWIg1hK68RU5Wm6Ng5T,trackdone,trackdone,7CCeWIg1hK68RU5Wm6Ng5T,4xadLLBCKTecYQRPoUhAQ0,https://i.scdn.co/image/ab67616d0000b273a058a9...,brazilian rock,https://i.scdn.co/image/ab6761610000e5eb29bd79...


Now, my last concern is about the `platform` field available in the dataset. As I am doing an overview analysis of my musical taste, I am not interested in knowing which version of Android or Windows I was using when I listened to a certain track. I just want to know, in general terms, whether it was on a PC, my phone or any other device.

So, now it's time to take care of this information. First I created a mapping to normalize the names, and then applied it to the `platform` field, creating the `simplified_platform` column. Then, I dropped the original `platform` column and renamed the new column to `platform`, just to simplify the dataset (confusing?).

*Note:* dropping the original `platform` column is also a matter of privacy and security, as the dataset will be publicly available on Tableau Public. Not that anyone is interested in my devices and their versions, but I prefer not to take the risk.

In [14]:
tracks_history_enriched.groupby('platform')['spotify_track_uri'].count().sort_values(ascending=False)

platform
Android OS 7.0 API 24 (motorola, Moto G (4))                 19275
osx                                                          17941
Android OS 9 API 28 (Xiaomi, Redmi Note 5)                   16788
Android OS 11 API 30 (Xiaomi, M2101K6G)                      13137
Linux [x86-64 0]                                             10031
                                                             ...  
web_player windows 10;chrome 74.0.3729.169;desktop               1
web_player windows 10;chrome 75.0.3770.90;desktop                1
web_player windows 10;chrome 88.0.4324.104;desktop               1
web_player windows 10;chrome 89.0.4389.90;desktop                1
web_player osx 10.15.5;microsoft edge 83.0.478.58;desktop        1
Name: spotify_track_uri, Length: 103, dtype: int64

In [15]:
platform_mapping = {
    'android': 'Android',
    'osx': 'macOS',
    'os x': 'macOS',
    'windows': 'Windows',
    'web_player': 'Web Browser',
    'linux': 'Linux',
    'ps4': 'PS4',
    'xbox': 'Xbox',
    'tv': 'Smart TV',
    'cast': 'Cast'
}

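# Function to simplify platform names.
# The first matching key wins, so the order of platform_mapping matters
# (e.g. a 'web_player windows 10;...' string is labeled 'Windows').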
def simplify_platform(platform):
    for key, value in platform_mapping.items():
        if key in platform.lower():
            return value

    return 'Others'

# Apply the simplify_platform function to the "platform" column
tracks_history_enriched['simplified_platform'] = tracks_history_enriched['platform'].apply(simplify_platform)

# Removing the complete platform info for security reasons
tracks_history_enriched.drop(columns=['platform'], inplace=True)
tracks_history_enriched.rename(columns={'simplified_platform': 'platform'}, inplace=True)

tracks_history_enriched.groupby('platform')['track_id'].count().sort_values(ascending=False)

platform
Android        71015
macOS          67018
Windows        38019
Linux          10046
PS4             1484
Cast             909
Xbox             544
Others           347
Smart TV         291
Web Browser      107
Name: track_id, dtype: int64
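
One thing worth knowing about this approach: `simplify_platform` returns on the first key that matches, so the order of `platform_mapping` decides how strings matching several keys are labeled. The sketch below replays a trimmed copy of the mapping on a few platform strings from the listings above. The `simplify` helper with an explicit `mapping` argument and the reordered `web_first` mapping are my own illustrative variations, not part of the pipeline.

```python
def simplify(platform, mapping):
    """Return the label of the first key found in the lowercased platform string."""
    lowered = platform.lower()
    for key, label in mapping.items():
        if key in lowered:
            return label
    return "Others"

# Same relative order as platform_mapping in the notebook (trimmed)
original_order = {
    "android": "Android",
    "osx": "macOS",
    "os x": "macOS",
    "windows": "Windows",
    "web_player": "Web Browser",
    "linux": "Linux",
}

# Variation: check "web_player" before the operating systems
web_first = {"web_player": "Web Browser", **original_order}

samples = [
    "Android OS 11 API 30 (Xiaomi, M2101K6G)",
    "OS X 12.4.0 [arm 2]",
    "web_player windows 10;chrome 74.0.3729.169;desktop",
    "Partner public_js-sdk harmony-chrome.71-linux",
]

for sample in samples:
    print(sample, "|", simplify(sample, original_order), "|", simplify(sample, web_first))
```

With the original order, web player sessions on Windows or macOS are counted under the operating system, which may be why `Web Browser` ends up with so few plays. Whether that is desirable depends on the question being asked.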

In [16]:
len(tracks_history_enriched)

189780
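
The length still matches the 189,780 rows of the original history, so the joins did not duplicate any plays. pandas can also enforce this at merge time through the `validate` argument of `merge`. A minimal sketch on toy data (the frames and values below are made up for illustration):

```python
import pandas as pd

# Toy play history: the same track can appear many times
plays = pd.DataFrame({
    "spotify_track_uri": ["spotify:track:AAA", "spotify:track:BBB", "spotify:track:AAA"],
    "ms_played": [1000, 2000, 3000],
})

# Toy lookup table: one row per track
lookup = pd.DataFrame({
    "spotify_track_uri": ["spotify:track:AAA", "spotify:track:BBB"],
    "artist_genre": ["punk", None],
})

# many_to_one: raises pandas.errors.MergeError if the lookup has duplicate keys
merged = plays.merge(lookup, how="left", on="spotify_track_uri", validate="many_to_one")

# A left join against unique keys preserves the number of rows
print(len(plays), len(merged))
```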

In [19]:
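# Keep a single cover per (artist, album): take the first non-null URL in each group
# and join it back, so every track of the same album shares the same image
unique_album_covers = tracks_history_enriched.groupby(['artist_name', 'album_name'], as_index=False)[['album_cover_url']].first()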

tracks_history_enriched = tracks_history_enriched.merge(unique_album_covers, how='left', on=['artist_name', 'album_name'], suffixes=('_1', '_2'))
tracks_history_enriched.drop(columns='album_cover_url_1', inplace=True)
tracks_history_enriched.rename(columns={'album_cover_url_2': 'album_cover_url'}, inplace=True)

tracks_history_enriched.sample(10)

Unnamed: 0,timestamp_play,ms_played,track_name,artist_name,album_name,spotify_track_uri,reason_start,reason_end,track_id,artist_id,artist_genre,artist_image_url,platform,album_cover_url
186053,2018-01-04T13:34:19Z,153600,I'm Shipping Up To Boston,Dropkick Murphys,The Warrior's Code,spotify:track:7rSERmjAT38lC5QhJ8hnQc,trackdone,,7rSERmjAT38lC5QhJ8hnQc,7w9jdhcgHNdiPeNPUoFSlx,celtic rock,https://i.scdn.co/image/ab6761610000e5eb2b3032...,Linux,https://i.scdn.co/image/ab67616d0000b273030915...
171960,2020-11-07T16:43:20Z,205280,Numb / Encore,JAY-Z,Collision Course,spotify:track:7dyluIqv7QYVTXXZiMWPHW,trackdone,trackdone,7dyluIqv7QYVTXXZiMWPHW,3nFkdlSjzX9mRTtwJOzDYB,hip hop,https://i.scdn.co/image/ab6761610000e5ebc75afc...,Android,https://i.scdn.co/image/ab67616d0000b273d3acd0...
107113,2019-03-12T22:24:41Z,74288,Só você,Hori,Hori Versão + Fã,spotify:track:2OE0yPXbN6ejT5cJSlKgwz,trackdone,fwdbtn,2OE0yPXbN6ejT5cJSlKgwz,4cn93QHq3HY7wAgnV7bSZ9,emocore,https://i.scdn.co/image/ab67616d0000b273462eac...,Android,https://i.scdn.co/image/ab67616d0000b273462eac...
48007,2017-07-08T02:33:25Z,268426,Losing My Religion,R.E.M.,Out Of Time,spotify:track:2wVXSc0IrQVodHESUxVUJA,trackdone,,2wVXSc0IrQVodHESUxVUJA,4KWTAlx2RvbpseOGMEmROg,jangle pop,https://i.scdn.co/image/ab6761610000e5eb6334ab...,Windows,https://i.scdn.co/image/ab67616d0000b273e2dd4e...
2056,2022-02-21T21:58:02Z,5222,Crazy Nando,The Black Mamba,Crazy Nando,spotify:track:5Im0tMPrcyFbbTDMdUaspx,clickrow,endplay,5Im0tMPrcyFbbTDMdUaspx,2huzXUiBGRUJ8iRFD4NHc7,,https://i.scdn.co/image/ab6761610000e5eb5e8c20...,Android,https://i.scdn.co/image/ab67616d0000b27344c62e...
42638,2017-05-07T22:39:39Z,281332,The Black,Asking Alexandria,The Black,spotify:track:4ORXSxr4tV5H6gH5KHAiAh,trackdone,,4ORXSxr4tV5H6gH5KHAiAh,1caBfBEapzw8z2Qz9q0OaQ,metalcore,https://i.scdn.co/image/ab6761610000e5ebd07f0b...,Android,https://i.scdn.co/image/ab67616d0000b27309c75d...
118572,2015-05-15T23:48:04Z,2959,Trap Queen,Fetty Wap,Trap Queen,spotify:track:3twQx3psUMJKj4wna5d1zU,fwdbtn,fwdbtn,3twQx3psUMJKj4wna5d1zU,6PXS4YHDkKvl1wkIl4V8DL,,https://i.scdn.co/image/ab6761610000e5eb8ffb1b...,Windows,https://i.scdn.co/image/ab67616d0000b2737f2d80...
53733,2021-07-08T12:00:09Z,171998,Agora é Tudo Meu,DENNIS,Agora é Tudo Meu,spotify:track:5yH8pYxNckOU1cxfPsMIaz,trackdone,trackdone,5yH8pYxNckOU1cxfPsMIaz,6xlRSRMLgZbsSNd0BMobwy,funk,https://i.scdn.co/image/ab6761610000e5eb84e963...,Android,https://i.scdn.co/image/ab67616d0000b273a58013...
167269,2020-08-24T00:23:17Z,188849,No Diggity,Campsite Dream,No Diggity,spotify:track:79T9m6YkRCgP0955CE4UCd,trackdone,trackdone,79T9m6YkRCgP0955CE4UCd,69VkQLf4DH7GJ68BCDOPKL,tropical house,https://i.scdn.co/image/ab6761610000e5eb998363...,macOS,https://i.scdn.co/image/ab67616d0000b273e45756...
5195,2022-04-14T22:41:58Z,220905,Whatchu Thinkin',Red Hot Chili Peppers,Unlimited Love,spotify:track:37o0LxdGxiF3chjxjwJwBD,trackdone,trackdone,37o0LxdGxiF3chjxjwJwBD,0L8ExT028jH3ddEcZwqJJ5,funk rock,https://i.scdn.co/image/ab6761610000e5ebc33cc1...,Android,https://i.scdn.co/image/ab67616d0000b27397a52e...
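
The cell above makes every row of the same artist and album share one cover URL (the first non-null one in the group), presumably so that the same album does not show up with different images. A shorter way that should give the same result when the grouping columns have no missing values, sketched here on made-up rows, is `groupby(...).transform('first')`:

```python
import pandas as pd

# Made-up rows: one album with a missing cover and two different cover URLs
df = pd.DataFrame({
    "artist_name": ["Band A", "Band A", "Band A", "Band B"],
    "album_name": ["Album X", "Album X", "Album X", "Album Y"],
    "album_cover_url": [
        None,
        "https://example.com/x1.jpg",
        "https://example.com/x2.jpg",
        "https://example.com/y.jpg",
    ],
})

# Broadcast the first non-null URL of each (artist, album) group to all of its rows
df["album_cover_url"] = (
    df.groupby(["artist_name", "album_name"])["album_cover_url"].transform("first")
)

print(df)
```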


Objectives accomplished. Now it is time to export the enriched dataset.

In [20]:
tracks_history_enriched.to_csv('consolidated_spotify_data_2024_enriched.csv', index=False)
print('File saved!')

File saved!
