# Playing With My Spotify Data

I've been using Spotify for about 6 years now. I have always been fascinated by the end-of-year "Spotify Wrapped," which gives Spotify users a snapshot of their listening behavior from the past year: for example, your most frequently played artists, how many minutes you spent listening to music versus podcasts, your top 5 genres, and more. I recently learned that I could access my own listening data, and I requested it immediately! To see how to request your Spotify data, see this web page: https://thenextweb.com/tech/2020/06/19/a-simple-guide-to-visualising-your-spotify-listening-data-badass-ly/.

After a couple of days, I received an email with a zip file containing a few different JSON files. These included information about my streaming history, my playlists, my search queries, and even information about my family plan and payment history. To see what data is included, see this page: https://support.spotify.com/us/article/understanding-my-data/.

For now, I don't have a specific idea for how I'm going to use my data. Rather, I would like to use my Spotify data to practice various data science techniques like EDA or regression analyses. This notebook is my workflow.


## Playlist DataFrame

The data is given in JSON format, so before I can do any analysis I need to get the JSON file into a dataframe. The file I am most excited about is Playlist1.json, which has information on each song in each of my playlists. I import the pandas and json packages in order to load the data.

In [5]:
import pandas as pd
import json

with open("/Users/ericaschultz/Desktop/My_Projects/DATA/spotify/MyData/Playlist1.json") as file:
    data2 = json.load(file)   
playlists = pd.DataFrame.from_dict(data2['playlists'])

playlists.head()

Unnamed: 0,name,lastModifiedDate,items,description,numberOfFollowers
0,BPM <130,2021-01-26,[{'track': {'trackName': 'Invitation (feat. Ko...,,0
1,BPM 165+,2021-01-06,[],,0
2,BPM 145-165,2021-01-26,"[{'track': {'trackName': 'STARGAZING', 'artist...",,0
3,BPM 130-145,2021-01-26,"[{'track': {'trackName': 'Heaven On Earth', 'a...",,0
4,under the *covers*,2021-01-13,[{'track': {'trackName': 'I'm Not the Only One...,This could also be called acoustic sunrise...,0
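If you don't have your own export yet, here is a minimal sketch of the same loading step on a tiny in-memory sample. The `sample_json` string is a hypothetical miniature that I wrote to mirror the structure shown above; it is not the real file, and the real export may contain more keys.

```python
import json
import pandas as pd

# Hypothetical miniature of the export; only the keys visible above are used.
sample_json = """
{
  "playlists": [
    {
      "name": "BPM <130",
      "lastModifiedDate": "2021-01-26",
      "items": [
        {"track": {"trackName": "Movements",
                   "artistName": "Pham",
                   "albumName": "Movements - Single"},
         "episode": null,
         "localTrack": null}
      ],
      "numberOfFollowers": 0
    },
    {
      "name": "BPM 165+",
      "lastModifiedDate": "2021-01-06",
      "items": [],
      "numberOfFollowers": 0
    }
  ]
}
"""

# json.loads parses a string; json.load (used above) parses an open file.
sample_data = json.loads(sample_json)

# Each dictionary in the "playlists" list becomes one row.
sample_playlists = pd.DataFrame(sample_data["playlists"])
print(sample_playlists[["name", "lastModifiedDate", "numberOfFollowers"]])
```

The `items` column still holds a raw Python list per playlist, which is why it needs unpacking later.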


In [6]:
playlists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   name               82 non-null     object
 1   lastModifiedDate   82 non-null     object
 2   items              82 non-null     object
 3   description        14 non-null     object
 4   numberOfFollowers  82 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 3.3+ KB


In [7]:
playlists['items'].iloc[0]

[{'track': {'trackName': 'Invitation (feat. Kodie Shane)',
   'artistName': 'Ashnikko',
   'albumName': 'Unlikeable'},
  'episode': None,
  'localTrack': None},
 {'track': {'trackName': 'Hit It (feat. I-Ez)',
   'artistName': 'Deorro',
   'albumName': 'Boombox'},
  'episode': None,
  'localTrack': None},
 {'track': {'trackName': 'Movements',
   'artistName': 'Pham',
   'albumName': 'Movements - Single'},
  'episode': None,
  'localTrack': None}]

As you can see, each entry in the items column is a list of dictionaries, one per song, and each dictionary describes that song's attributes. I need to unpack this list so that I have a column for each attribute. My goal is to get a dataframe where each observation is a song in a specific playlist. Since the playlists dataframe has one row per playlist, I create a temporary dataframe with the songs of each playlist and append it to a copy of the playlists dataframe. I will show this step by step first and then show the loop I created to automate the process.

I first unpack this list to make it into the dataframe I want:

In [8]:
#Create a dataframe from the list of dictionaries stored in the items column at position 0.
test = pd.DataFrame(playlists['items'].iloc[0])
test

Unnamed: 0,track,episode,localTrack
0,{'trackName': 'Invitation (feat. Kodie Shane)'...,,
1,"{'trackName': 'Hit It (feat. I-Ez)', 'artistNa...",,
2,"{'trackName': 'Movements', 'artistName': 'Pham...",,


In [9]:
#Take the dictionary in each row and create columns in the dataframe
test = pd.concat([test.drop(['track'], axis = 1), test['track'].apply(pd.Series)], axis = 1)
test

Unnamed: 0,episode,localTrack,trackName,artistName,albumName
0,,,Invitation (feat. Kodie Shane),Ashnikko,Unlikeable
1,,,Hit It (feat. I-Ez),Deorro,Boombox
2,,,Movements,Pham,Movements - Single
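As an aside, `apply(pd.Series)` builds one Series per row, which can get slow on large playlists. A minimal sketch of an alternative, assuming every row's `track` value is a dictionary (rows where `track` is `None` would need handling first), is to build the new columns from the list of dictionaries in one go. The names `demo`, `track_cols`, and `demo_flat` are just illustrative.

```python
import pandas as pd

# The three items from the first playlist, copied from the output above.
items = [
    {"track": {"trackName": "Invitation (feat. Kodie Shane)",
               "artistName": "Ashnikko",
               "albumName": "Unlikeable"},
     "episode": None, "localTrack": None},
    {"track": {"trackName": "Hit It (feat. I-Ez)",
               "artistName": "Deorro",
               "albumName": "Boombox"},
     "episode": None, "localTrack": None},
    {"track": {"trackName": "Movements",
               "artistName": "Pham",
               "albumName": "Movements - Single"},
     "episode": None, "localTrack": None},
]

demo = pd.DataFrame(items)

# Build the track columns from the list of dicts in a single constructor call,
# reusing demo's index so the rows line up when concatenating.
track_cols = pd.DataFrame(demo["track"].tolist(), index=demo.index)
demo_flat = pd.concat([demo.drop(columns="track"), track_cols], axis=1)
print(demo_flat)
```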


In [10]:
#Add a column with the playlist name corresponding to the index in the original dataframe
test['playlist_name'] = playlists['name'].iloc[0]
test

Unnamed: 0,episode,localTrack,trackName,artistName,albumName,playlist_name
0,,,Invitation (feat. Kodie Shane),Ashnikko,Unlikeable,BPM <130
1,,,Hit It (feat. I-Ez),Deorro,Boombox,BPM <130
2,,,Movements,Pham,Movements - Single,BPM <130


Great! Now I have a dataframe with every song in the playlist titled "BPM <130". Next I want to make a deep copy of the original playlists dataframe to append these individual playlist dataframes to. I'm going to outer merge the dataframes on the playlist name and drop the unnecessary columns.

In [11]:
play_copy = playlists.copy()
play_copy = pd.merge(play_copy, test, how = 'outer', left_on = 'name', right_on = 'playlist_name')
play_copy = play_copy.drop(["name",'items'], axis = 1)
play_copy.head()

Unnamed: 0,lastModifiedDate,description,numberOfFollowers,episode,localTrack,trackName,artistName,albumName,playlist_name
0,2021-01-26,,0,,,Invitation (feat. Kodie Shane),Ashnikko,Unlikeable,BPM <130
1,2021-01-26,,0,,,Hit It (feat. I-Ez),Deorro,Boombox,BPM <130
2,2021-01-26,,0,,,Movements,Pham,Movements - Single,BPM <130
3,2021-01-06,,0,,,,,,
4,2021-01-26,,0,,,,,,


Great! Now we have a dataframe to append to. Unfortunately, the next playlist is empty, so for the purposes of demonstrating the process I am going to append one of my favorite playlists instead.

In [12]:
playlists[playlists['name'] == 'boy']

Unnamed: 0,name,lastModifiedDate,items,description,numberOfFollowers
63,boy,2021-01-14,[{'track': {'trackName': 'Come Together - Rema...,,0


I am going to do the same three steps from before.

In [13]:
test2 = pd.DataFrame(playlists['items'].iloc[63])
test2 = pd.concat([test2.drop(['track'], axis = 1), test2['track'].apply(pd.Series)], axis = 1)
test2['playlist_name'] = playlists['name'].iloc[63]
test2.head()

Unnamed: 0,episode,localTrack,trackName,artistName,albumName,playlist_name
0,,,Come Together - Remastered 2009,The Beatles,Abbey Road,boy
1,,,Roxanne - Remastered 2003,The Police,Outlandos D'Amour,boy
2,,,With Or Without You - Remastered,U2,The Joshua Tree,boy
3,,,Baba O'Riley,The Who,Who's Next,boy
4,,,Sympathy For The Devil,The Rolling Stones,Beggars Banquet,boy


Now my first idea was to outer merge this dataframe with the play_copy dataframe. But when I do that, new suffixed columns are created to differentiate between the left and right dataframes, because both dataframes share column names (such as trackName) that are not part of the merge key. I don't want to do an inner or right merge because I would lose everything that's not in the current playlist, and I don't think I want to do a left merge because I want all of the tracks to be kept. So instead of merging, I will stack the rows. In order to not lose any information from the playlists dataframe, I need to create columns and copy that information. Specifically, I need to copy the values in the "lastModifiedDate", "description", and "numberOfFollowers" columns.

In [14]:
test2['lastModifiedDate'] = playlists['lastModifiedDate'].iloc[63]
test2['description'] = playlists['description'].iloc[63]
test2['numberOfFollowers'] = playlists['numberOfFollowers'].iloc[63]
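As a side note on why the merge was set aside, here is a minimal sketch of the suffix behavior on two tiny hypothetical dataframes (`left` and `right` are made-up stand-ins for play_copy and test2). When both sides share a non-key column, `pd.merge` keeps both copies with `_x` and `_y` suffixes, while `pd.concat` simply stacks the rows.

```python
import pandas as pd

# Hypothetical stand-ins: same columns, different playlists.
left = pd.DataFrame({"playlist_name": ["BPM <130"], "trackName": ["Movements"]})
right = pd.DataFrame({"playlist_name": ["boy"], "trackName": ["Baba O'Riley"]})

# Outer merge on the key: the shared non-key column is split in two.
merged = pd.merge(left, right, how="outer", on="playlist_name")
print(merged.columns.tolist())

# Row-wise concatenation keeps a single trackName column.
stacked = pd.concat([left, right], ignore_index=True)
print(stacked.columns.tolist())
```

This is why copying the playlist-level columns onto test2 and then stacking is the simpler route here.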

Next I can simply append test2 to the play_copy dataframe.

In [15]:
#DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
play_copy = pd.concat([play_copy, test2])
play_copy

Unnamed: 0,lastModifiedDate,description,numberOfFollowers,episode,localTrack,trackName,artistName,albumName,playlist_name
0,2021-01-26,,0,,,Invitation (feat. Kodie Shane),Ashnikko,Unlikeable,BPM <130
1,2021-01-26,,0,,,Hit It (feat. I-Ez),Deorro,Boombox,BPM <130
2,2021-01-26,,0,,,Movements,Pham,Movements - Single,BPM <130
3,2021-01-06,,0,,,,,,
4,2021-01-26,,0,,,,,,
...,...,...,...,...,...,...,...,...,...
125,2021-01-14,,0,,,Fanfare for the Common Man,Aaron Copland,Copland: Super Hits,boy
126,2021-01-14,,0,,,Rio - 2009 Remaster,Duran Duran,Rio,boy
127,2021-01-14,,0,,,Spill The Wine,Eric Burdon,Boogie Nights / Music From The Original Motion...,boy
128,2021-01-14,,0,,,Long Distance Runaround - 2003 Remaster,Yes,Fragile,boy


Now I can do this for each index in the playlists dataframe. In the following cell I loop through playlists and create the play_df dataframe consisting of every song in each of my playlists.

In [24]:
play_df = playlists.copy()

for i in playlists.index:
    #Create a new df from the items list in row i. Each row is one song, with its attributes stored as a dictionary in the track column
    temp = pd.DataFrame(playlists['items'].iloc[i])
    
    #If the playlist is empty, temp has no rows. Assigning scalars below only adds empty columns, so this playlist contributes no rows to play_df
    if temp.empty:
        temp['playlist_name'] = playlists['name'].iloc[i]
        temp['lastModifiedDate'] = playlists['lastModifiedDate'].iloc[i]
        temp['description'] = playlists['description'].iloc[i]
        temp['numberOfFollowers'] = playlists['numberOfFollowers'].iloc[i]
        temp['episode'] = None
        temp['localTrack'] = None
        temp['trackName'] = None
        temp['artistName'] = None
        temp['albumName'] = None
        play_df = pd.concat([play_df, temp])
        continue
    
    #Take the dictionary in each row and expand it into columns that are concatenated to the df itself.
    temp = pd.concat([temp.drop(['track'], axis = 1), temp['track'].apply(pd.Series)], axis = 1)
    
    #Make a column specifying the playlist name using the location in the playlists df
    temp['playlist_name'] = playlists['name'].iloc[i]
    
    
    #If this is the first iteration, then we create the dataframe with the appropriate columns
    if i == 0:
        play_df = pd.merge(play_df, temp, how = 'outer', left_on = 'name', right_on = 'playlist_name')
        play_df = play_df.drop(["name",'items'], axis = 1)
        play_df.dropna(subset = ['trackName'],inplace = True)
        
    #If it's not the first iteration then we append to play_df
    else:
        temp['lastModifiedDate'] = playlists['lastModifiedDate'].iloc[i]
        temp['description'] = playlists['description'].iloc[i]
        temp['numberOfFollowers'] = playlists['numberOfFollowers'].iloc[i]
        play_df = pd.concat([play_df, temp])
        
play_df.head(20)

Unnamed: 0,lastModifiedDate,description,numberOfFollowers,episode,localTrack,trackName,artistName,albumName,playlist_name
0,2021-01-26,,0,,,Invitation (feat. Kodie Shane),Ashnikko,Unlikeable,BPM <130
1,2021-01-26,,0,,,Hit It (feat. I-Ez),Deorro,Boombox,BPM <130
2,2021-01-26,,0,,,Movements,Pham,Movements - Single,BPM <130
0,2021-01-26,,0,,,STARGAZING,Travis Scott,ASTROWORLD,BPM 145-165
0,2021-01-26,,0,,,Heaven On Earth,Kid Cudi,Man On The Moon III: The Chosen,BPM 130-145
1,2021-01-26,,0,,,White Iverson,Post Malone,White Iverson,BPM 130-145
2,2021-01-26,,0,,,Mercy,Kanye West,Mercy,BPM 130-145
0,2021-01-13,This could also be called acoustic sunrise...,0,,,"I'm Not the Only One - Live from Spotify, London",Dua Lipa,Spotify Session,under the *covers*
1,2021-01-13,This could also be called acoustic sunrise...,0,,,Love On The Brain - Los Feliz Blvd,Cold War Kids,Love On The Brain,under the *covers*
2,2021-01-13,This could also be called acoustic sunrise...,0,,,No One - Recorded At Spotify Studios NYC,Cold War Kids,Spotify Singles,under the *covers*
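For comparison, the loop above could probably be collapsed into a single call to `pd.json_normalize`. This is only a sketch under assumptions, not what I ran on my data: `sample` is a small hypothetical list shaped like `data2['playlists']`, the description text is made up, and I assume every item's `track` is a dictionary.

```python
import pandas as pd

# Hypothetical sample shaped like data2['playlists'].
sample = [
    {
        "name": "BPM <130",
        "lastModifiedDate": "2021-01-26",
        "numberOfFollowers": 0,  # note: no "description" key here
        "items": [
            {"track": {"trackName": "Hit It (feat. I-Ez)",
                       "artistName": "Deorro",
                       "albumName": "Boombox"},
             "episode": None, "localTrack": None},
            {"track": {"trackName": "Movements",
                       "artistName": "Pham",
                       "albumName": "Movements - Single"},
             "episode": None, "localTrack": None},
        ],
    },
    {
        "name": "BPM 165+",
        "lastModifiedDate": "2021-01-06",
        "numberOfFollowers": 0,
        "items": [],  # empty playlist
    },
    {
        "name": "under the *covers*",
        "lastModifiedDate": "2021-01-13",
        "description": "sample description (hypothetical)",
        "numberOfFollowers": 0,
        "items": [
            {"track": {"trackName": "Love On The Brain - Los Feliz Blvd",
                       "artistName": "Cold War Kids",
                       "albumName": "Love On The Brain"},
             "episode": None, "localTrack": None},
        ],
    },
]

# record_path: one output row per element of each playlist's "items" list.
# meta: playlist-level fields repeated on every row.
# errors="ignore": a missing meta key (e.g. description) becomes NaN.
flat = pd.json_normalize(
    sample,
    record_path="items",
    meta=["name", "lastModifiedDate", "description", "numberOfFollowers"],
    errors="ignore",
)

# Nested track keys come back as "track.<key>"; rename to match play_df.
flat = flat.rename(columns={
    "name": "playlist_name",
    "track.trackName": "trackName",
    "track.artistName": "artistName",
    "track.albumName": "albumName",
})
print(flat)
```

As with the loop, an empty playlist produces no rows, so it drops out of the result.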


In [25]:
len(play_df.playlist_name.unique())

81

In [26]:
len(playlists.name.unique())

82

In [27]:
list(set(playlists.name.unique()) - set(play_df.playlist_name.unique()))

['BPM 165+']

The one playlist that did not make it into the final dataframe is an empty playlist, so it is irrelevant to my analyses.
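One way to double-check that claim is to list the playlists whose items list has length zero and compare that against the missing names. Here is a minimal sketch on a hypothetical two-row stand-in for the playlists dataframe (`demo_playlists` is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the playlists dataframe.
demo_playlists = pd.DataFrame({
    "name": ["BPM <130", "BPM 165+"],
    "items": [
        [{"track": {"trackName": "Movements",
                    "artistName": "Pham",
                    "albumName": "Movements - Single"},
          "episode": None, "localTrack": None}],
        [],  # empty playlist
    ],
})

# Count the songs in each playlist, then keep the names with zero songs.
n_songs = demo_playlists["items"].map(len)
empty_names = demo_playlists.loc[n_songs == 0, "name"].tolist()
print(empty_names)
```

On the real data, the same two lines applied to `playlists` should return the same names as the set difference above if empty playlists are the only ones missing.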