# An Analysis of My Spotify Listening History 2013-2023

## Background
This is pretty much just meant to be a way to keep my Python skills up to date. I would like to do an R version as well just to show off :)

### Processing Data

We start by getting the data into a format we can work with. The export comes as several JSON files, so we'll need to append them together into one workable dataframe.

In [None]:
# the libraries we need
import json
import pandas as pd
import glob
import matplotlib.pyplot as plt
import numpy as np

# I heard this is like dplyr, so I wanted to try it
from siuba import _, group_by, summarize, arrange, head, filter, count

Now I'll load in one JSON file to get a sense of what the data looks like. I've taken the liberty of dropping the included IP address because I don't want that info out there.

In [4]:
listening_history = pd.read_json("data/Streaming_History_Audio_2012-2016_0.json")
listening_history.drop('ip_addr_decrypted',axis=1,inplace=True)
listening_history.head()

Unnamed: 0,ts,username,platform,ms_played,conn_country,user_agent_decrypted,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,episode_name,episode_show_name,spotify_episode_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode
0,2012-06-08T02:20:16Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],47713,US,,Spain,US Air Force Airmen Of Note,Somewhere Out There,spotify:track:6taXUSSvSQjzuyNzkykKm1,,,,remote,popup,False,1.0,False,0,False
1,2012-06-08T02:21:26Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],70248,US,,Little Sunflower,US Air Force Airmen Of Note,Invitation,spotify:track:6HvdyLX3iee3YXjIoxj8hz,,,,popup,popup,False,1.0,False,0,False
2,2012-06-08T02:23:42Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],137423,US,,Purple Porpoise Parkway,The Tom Kubis Big Band,"""Slightly Off The Ground""",spotify:track:5d12h6FdL6NbqaFogjjh6I,,,,popup,popup,False,1.0,False,0,False
3,2012-06-08T02:23:45Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],3993,US,,Exactly Like This,The Tom Kubis Big Band,"""Slightly Off The Ground""",spotify:track:3YE9TMQgo1CDYbWHqASeiI,,,,popup,popup,False,1.0,False,0,False
4,2012-06-08T02:25:18Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],93337,US,,Who Can I Turn To,The Tom Kubis Big Band,"""Slightly Off The Ground""",spotify:track:7Gh3zYrCFIYSvFrMcL7P55,,,,popup,popup,False,1.0,False,0,False


This data seems to distinguish between music and podcasts by filling in either the track name or the episode name. Now that I can see how this is working, I'll merge the 8 files I have into one dataframe.

In [None]:

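# sort the paths so the files are always read in the same order
spotify_files = sorted(glob.glob('data/*.json'))

# read each file into its own dataframe, then concatenate once at the end
# (calling pd.concat inside the loop re-copies the growing dataframe every time)
frames = []
for file in spotify_files:
    with open(file, encoding='utf-8') as f:
        frames.append(pd.DataFrame(json.load(f)))

listen_hist = pd.concat(frames, ignore_index=True)

# drop the ip address, same as before
listen_hist = listen_hist.drop(columns='ip_addr_decrypted')

display(listen_hist.head())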


Unnamed: 0,ts,username,platform,ms_played,conn_country,user_agent_decrypted,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,episode_name,episode_show_name,spotify_episode_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode
0,2012-06-08T02:20:16Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],47713,US,,Spain,US Air Force Airmen Of Note,Somewhere Out There,spotify:track:6taXUSSvSQjzuyNzkykKm1,,,,remote,popup,False,True,False,0,False
1,2012-06-08T02:21:26Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],70248,US,,Little Sunflower,US Air Force Airmen Of Note,Invitation,spotify:track:6HvdyLX3iee3YXjIoxj8hz,,,,popup,popup,False,True,False,0,False
2,2012-06-08T02:23:42Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],137423,US,,Purple Porpoise Parkway,The Tom Kubis Big Band,"""Slightly Off The Ground""",spotify:track:5d12h6FdL6NbqaFogjjh6I,,,,popup,popup,False,True,False,0,False
3,2012-06-08T02:23:45Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],3993,US,,Exactly Like This,The Tom Kubis Big Band,"""Slightly Off The Ground""",spotify:track:3YE9TMQgo1CDYbWHqASeiI,,,,popup,popup,False,True,False,0,False
4,2012-06-08T02:25:18Z,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],93337,US,,Who Can I Turn To,The Tom Kubis Big Band,"""Slightly Off The Ground""",spotify:track:7Gh3zYrCFIYSvFrMcL7P55,,,,popup,popup,False,True,False,0,False


Now let's clean up some of this data to make it a little more usable. I'll start with the `ts` field, which I'd like to convert into a datetime in my own time zone.

In [50]:
# convert the ts strings to an actual datetime type (they are in UTC)
listen_hist['ts'] = pd.to_datetime(listen_hist['ts'])

# convert to US Eastern time
listen_hist['ts_local'] = listen_hist['ts'].dt.tz_convert('US/Eastern').dt.strftime('%Y-%m-%d %H:%M:%S')
listen_hist['ts_date'] = listen_hist['ts'].dt.tz_convert('US/Eastern').dt.strftime('%Y-%m-%d')
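One thing to keep in mind: `strftime` turns the converted timestamps into plain strings, so the `.dt` accessors are no longer available on `ts_local` and `ts_date`. Here is a minimal sketch of an alternative that keeps a timezone-aware column around. It uses two timestamps from the sample rows, `America/New_York` (the canonical name for the same zone as `US/Eastern`), and the column names `ts_eastern`, `year` and `hour`, which are my own and not part of the export.

```python
import pandas as pd

# Two UTC timestamps copied from the sample rows in this notebook.
sample = pd.DataFrame({"ts": ["2012-06-08T02:20:16Z", "2023-12-29T23:35:33Z"]})

# Parse the ISO 8601 strings as timezone-aware UTC datetimes.
sample["ts"] = pd.to_datetime(sample["ts"], utc=True)

# Keep a real datetime column in Eastern time instead of a formatted string,
# so year/hour/day-of-week lookups stay available later.
# (ts_eastern, year and hour are hypothetical column names.)
sample["ts_eastern"] = sample["ts"].dt.tz_convert("America/New_York")
sample["year"] = sample["ts_eastern"].dt.year
sample["hour"] = sample["ts_eastern"].dt.hour

print(sample[["ts_eastern", "year", "hour"]])
```

The first row lands at 22:20:16 on 2012-06-07 (daylight time, UTC-4) and the second at 18:35:33 on 2023-12-29 (standard time, UTC-5), which lines up with the `ts_local` values in the output further down.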

Next, let's look at the `ms_played` field. This represents how much of the track was listened to, in milliseconds. Let's convert this into seconds played.

In [51]:
#converting to seconds played
listen_hist['seconds_played'] = listen_hist['ms_played']/1000

Finally, I mentioned before that podcasts are included in this dataset. I'm not particularly interested in that data today, so we'll drop the rows and columns related to podcasts.

In [52]:
# keep only the rows without an episode uri; .copy() makes music_hist an independent
# dataframe, so changing it later doesn't trigger a SettingWithCopyWarning
music_hist = listen_hist[listen_hist['spotify_episode_uri'].isna()].copy()

music_hist = music_hist.drop(columns=['spotify_episode_uri','episode_name','episode_show_name'])

music_hist

Unnamed: 0,ts,username,platform,ms_played,conn_country,user_agent_decrypted,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,ts_local,seconds_played,ts_date
0,2012-06-08 02:20:16+00:00,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],47713,US,,Spain,US Air Force Airmen Of Note,Somewhere Out There,spotify:track:6taXUSSvSQjzuyNzkykKm1,remote,popup,False,True,False,0,False,2012-06-07 22:20:16,47.713,2012-06-07
1,2012-06-08 02:21:26+00:00,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],70248,US,,Little Sunflower,US Air Force Airmen Of Note,Invitation,spotify:track:6HvdyLX3iee3YXjIoxj8hz,popup,popup,False,True,False,0,False,2012-06-07 22:21:26,70.248,2012-06-07
2,2012-06-08 02:23:42+00:00,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],137423,US,,Purple Porpoise Parkway,The Tom Kubis Big Band,"""Slightly Off The Ground""",spotify:track:5d12h6FdL6NbqaFogjjh6I,popup,popup,False,True,False,0,False,2012-06-07 22:23:42,137.423,2012-06-07
3,2012-06-08 02:23:45+00:00,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],3993,US,,Exactly Like This,The Tom Kubis Big Band,"""Slightly Off The Ground""",spotify:track:3YE9TMQgo1CDYbWHqASeiI,popup,popup,False,True,False,0,False,2012-06-07 22:23:45,3.993,2012-06-07
4,2012-06-08 02:25:18+00:00,1233293719,Windows 7 (Unknown Ed) SP0 [x86 0],93337,US,,Who Can I Turn To,The Tom Kubis Big Band,"""Slightly Off The Ground""",spotify:track:7Gh3zYrCFIYSvFrMcL7P55,popup,popup,False,True,False,0,False,2012-06-07 22:25:18,93.337,2012-06-07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125099,2023-12-29 23:35:33+00:00,1233293719,android,176720,US,unknown,Black Heart,Carly Rae Jepsen,Emotion,spotify:track:6PDPy2HspcDoiSAxRSmhug,trackdone,trackdone,True,False,False,1703892759,False,2023-12-29 18:35:33,176.720,2023-12-29
125100,2023-12-29 23:39:32+00:00,1233293719,android,237906,US,unknown,POOL,tricot,T H E,spotify:track:6qL7sccU4FQS6zUgKCzNu6,trackdone,trackdone,True,False,False,1703892936,False,2023-12-29 18:39:32,237.906,2023-12-29
125101,2023-12-29 23:44:38+00:00,1233293719,android,305350,US,unknown,Dang! (feat. Anderson .Paak),Mac Miller,The Divine Feminine,spotify:track:5iUQMwxUPdJBFeGkePtM66,trackdone,trackdone,True,False,False,1703893175,False,2023-12-29 18:44:38,305.350,2023-12-29
125102,2023-12-29 23:48:25+00:00,1233293719,android,226400,US,unknown,Sunshine,Zitten,SAVe tHE AiR GREEN CONCERT,spotify:track:6D4ZkbtBd4DshKMzStOf8x,trackdone,trackdone,True,False,False,1703893481,False,2023-12-29 18:48:25,226.400,2023-12-29
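Before throwing the podcast rows away, it can be worth checking how the split actually looks. The sketch below labels each row by whichever URI column is filled in. It runs on a tiny made-up dataframe (the URIs and play times are invented), and `content_type` is a hypothetical column name, not something in the export.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same URI columns as the export.
toy = pd.DataFrame({
    "spotify_track_uri": ["spotify:track:aaa", None, "spotify:track:bbb"],
    "spotify_episode_uri": [None, "spotify:episode:xyz", None],
    "ms_played": [47713, 1800000, 3993],
})

# A row counts as a podcast when the episode URI is present, music otherwise.
toy["content_type"] = np.where(toy["spotify_episode_uri"].notna(), "podcast", "music")

# Filter to music; .copy() gives an independent dataframe to keep working on.
toy_music = toy[toy["content_type"] == "music"].copy()

print(toy["content_type"].value_counts())
```

On the real data, the same `value_counts()` call on `listen_hist` would show how many rows the podcast filter removes.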


I'm feeling pretty good about how this data is looking now. Let's start slicing and dicing this data!

### Data Analysis

Let's start by looking at which artists I've listened to the most.

In [None]:
(music_hist
    >> filter(_.master_metadata_album_artist_name.notna())
    >> group_by(_.master_metadata_album_artist_name)
    >> summarize(hits = _.shape[0],
                 hours_listened = _.seconds_played.sum()/60/60) #divide by 60 seconds in a minute and then 60 minutes in an hour
    >> arrange(-_.hits)
    >> head(10)
    )

Unnamed: 0,master_metadata_album_artist_name,hits,hours_listened
1517,Maroon 5,4690,110.424699
1764,OK Go,2452,72.429674
1097,Jason Mraz,2291,65.425588
586,Daryl Hall & John Oates,2015,56.791697
421,Carly Rae Jepsen,1719,44.217813
469,Childish Gambino,1546,29.170275
1921,RADWIMPS,1541,29.618808
2922,tricot,1367,45.647691
281,Billy Joel,1361,50.056616
138,Andy Grammer,1317,32.994738
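For comparison, here is roughly what the same aggregation looks like in plain pandas. This is a sketch on made-up rows (artists "A" and "B" and their play times are invented), not a drop-in replacement I have run against the full history.

```python
import pandas as pd

# Made-up plays; the column names match the export, the values do not.
plays = pd.DataFrame({
    "master_metadata_album_artist_name": ["A", "A", "B", None, "A", "B"],
    "seconds_played": [3600.0, 1800.0, 7200.0, 100.0, 1800.0, 900.0],
})

top_artists = (
    plays
    # same role as filter(_.master_metadata_album_artist_name.notna())
    .dropna(subset=["master_metadata_album_artist_name"])
    .groupby("master_metadata_album_artist_name")
    # hits = number of rows per artist, hours_listened starts as total seconds
    .agg(hits=("seconds_played", "size"), hours_listened=("seconds_played", "sum"))
    # 3600 seconds in an hour
    .assign(hours_listened=lambda d: d["hours_listened"] / 3600)
    .sort_values("hits", ascending=False)
    .head(10)
    .reset_index()
)

print(top_artists)
```

In this toy data artist A comes out on top with 3 hits and 2.0 hours, ahead of B with 2 hits and 2.25 hours, which also shows that ranking by hits and ranking by hours can disagree.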


Turns out I'm a big Maroon 5 fan - over 100 hours in the last 10 years! Let's see what my overall most listened to songs were.

In [31]:
(music_hist
    >> filter(_.master_metadata_track_name.notna())
    >> group_by(_.master_metadata_track_name,_.master_metadata_album_artist_name)
    >> summarize(hits = _.shape[0],
                 hours_listened = _.seconds_played.sum()/60/60) #divide by 60 seconds in a minute and then 60 minutes in an hour
    >> arrange(-_.hits)
    >> head(20)
    )

Unnamed: 0,master_metadata_track_name,master_metadata_album_artist_name,hits,hours_listened
2099,Feelings,Maroon 5,490,15.219947
6578,The Dynamo of Volition,Jason Mraz,460,19.416056
6194,Sugar,Maroon 5,457,16.444629
4139,Make It Mine,Jason Mraz,444,10.211573
1513,Dancing In the Moonlight,King Harvest,436,6.763429
263,Ain't It Fun,Paramore,419,12.847124
4149,Makes Me Wonder,Maroon 5,411,8.889034
3072,I Really Like You,Carly Rae Jepsen,405,6.770289
2734,Heartbeat,Childish Gambino,402,7.245315
5906,Sober,Childish Gambino,400,10.149967


So I've listened to Feelings by Maroon 5 for a total of 15.2 hours over 490 listens. It is interesting to note that this song is about 3.5 minutes long, but if you look at the average time listened for this song (`hours_listened * 60 / hits`), you'll see that I'm only listening to it for an average of 1.86 minutes! Luckily, the data also records whether I skipped the song, so let's go ahead and add that in now.

In [53]:
(music_hist
    >> filter(_.master_metadata_track_name.notna())
    >> group_by(_.master_metadata_track_name,_.master_metadata_album_artist_name)
    >> summarize(hits = _.shape[0]
                 , completed_listens = _.shape[0] - _.skipped.sum()                  
                 , hours_listened = _.seconds_played.sum()/60/60 #divide by 60 seconds in a minute and then 60 minutes in an hour
    )
    >> arrange(-_.hits)
    >> head(20)
    )

Unnamed: 0,master_metadata_track_name,master_metadata_album_artist_name,hits,completed_listens,hours_listened
2099,Feelings,Maroon 5,490,420,15.219947
6578,The Dynamo of Volition,Jason Mraz,460,404,19.416056
6194,Sugar,Maroon 5,457,392,16.444629
4139,Make It Mine,Jason Mraz,444,361,10.211573
1513,Dancing In the Moonlight,King Harvest,436,381,6.763429
263,Ain't It Fun,Paramore,419,345,12.847124
4149,Makes Me Wonder,Maroon 5,411,322,8.889034
3072,I Really Like You,Carly Rae Jepsen,405,292,6.770289
2734,Heartbeat,Childish Gambino,402,288,7.245315
5906,Sober,Childish Gambino,400,316,10.149967
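The average-listen arithmetic from earlier (`hours_listened * 60 / hits`) can be put next to the share of plays that were not skipped. A small sketch, using the first three rows copied from the table above; `avg_minutes_per_play` and `completion_rate` are column names I made up for illustration.

```python
import pandas as pd

# First three rows copied from the output table above.
top_songs = pd.DataFrame({
    "track": ["Feelings", "The Dynamo of Volition", "Sugar"],
    "hits": [490, 460, 457],
    "completed_listens": [420, 404, 392],
    "hours_listened": [15.219947, 19.416056, 16.444629],
})

# Average minutes per play: total hours -> minutes, divided by number of plays.
top_songs["avg_minutes_per_play"] = top_songs["hours_listened"] * 60 / top_songs["hits"]

# Share of plays that were not flagged as skipped.
top_songs["completion_rate"] = top_songs["completed_listens"] / top_songs["hits"]

print(top_songs.round(2))
```

For Feelings this gives about 1.86 minutes per play and a completion rate of about 0.86, so the skipped flag alone doesn't seem to explain the short average listen time.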


Now I'd like to spice things up and see how my favorite artists have changed over time.
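One way this could be approached is to pull the year out of the timestamp, count plays per artist per year, and keep the top artist for each year. The sketch below is an assumption about how to do that, not the analysis itself: the rows are made up, and `year`, `per_year` and `top_per_year` are hypothetical names.

```python
import pandas as pd

# Made-up plays spread over two years; artists "A" and "B" are placeholders.
plays = pd.DataFrame({
    "ts": pd.to_datetime([
        "2013-03-01T15:00:00Z", "2013-07-04T15:00:00Z", "2013-09-09T15:00:00Z",
        "2014-01-15T15:00:00Z", "2014-02-20T15:00:00Z", "2014-11-30T15:00:00Z",
    ], utc=True),
    "master_metadata_album_artist_name": ["A", "A", "B", "B", "B", "A"],
    "seconds_played": [200.0, 180.0, 240.0, 210.0, 190.0, 60.0],
})

# Year in local (Eastern) time, so late-night plays on Dec 31 land in the right year.
plays["year"] = plays["ts"].dt.tz_convert("America/New_York").dt.year

# Plays and hours per artist per year.
per_year = (
    plays
    .groupby(["year", "master_metadata_album_artist_name"])
    .agg(hits=("seconds_played", "size"), hours_listened=("seconds_played", "sum"))
    .assign(hours_listened=lambda d: d["hours_listened"] / 3600)
    .reset_index()
)

# Sort so the most-played artist comes first within each year, then keep one row per year.
top_per_year = (
    per_year
    .sort_values(["year", "hits"], ascending=[True, False])
    .groupby("year")
    .head(1)
    .reset_index(drop=True)
)

print(top_per_year)
```

Swapping `head(1)` for `head(5)` would give a top five per year instead of a single favorite.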

