## Collecting Data

For this project, I will extract data from the Spotify API. Luckily, the API has an easy-to-use Python wrapper called "Spotipy", which simplifies navigating it. Throughout this notebook I will make calls using the Spotipy wrapper.

In [1]:
#Import necessary packages
import json
import config
import sys
import pandas as pd

#Import spotipy wrapper 
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials


In [2]:
#Define function to retrieve API keys from a JSON file
def get_keys(path):
    with open(path) as f:
        return json.load(f)

In [3]:
#Retrieve personal keys for Spotify API 
keys = get_keys("/Users/adinasteinman/.secret/spotify_api.json")
client_id = keys['client_id']
client_secret = keys['client_secret']

In [4]:
#Access the Spotipy wrapper with client id and client secret credentials 
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=client_id,
                                                           client_secret=client_secret))
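
As a minimal sketch of the key-loading step, the helper below reads the same kind of JSON file but fails early with a clear message if one of the two keys is missing. The name `load_credentials` and the required-key check are assumptions made for illustration; they are not part of Spotipy.

```python
import json

# Key names used in this notebook's JSON file
REQUIRED_KEYS = ("client_id", "client_secret")

def load_credentials(path):
    """Read a JSON key file and return (client_id, client_secret)."""
    with open(path) as f:
        keys = json.load(f)
    # Report every missing key at once instead of failing on the first lookup
    missing = [k for k in REQUIRED_KEYS if k not in keys]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return keys["client_id"], keys["client_secret"]
```

The returned pair could then be passed to `SpotifyClientCredentials` exactly as in the cell above.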

Before we extract our dataset, we will investigate ways to search for artists, songs, albums, etc. through the Spotify API.

In [5]:
#Use the sp.search method to look up tracks matching "The Weeknd" and return the first two results
search_str = 'The Weeknd'
result = sp.search(search_str, limit=2)
result

{'tracks': {'href': 'https://api.spotify.com/v1/search?query=The+Weeknd&type=track&offset=0&limit=2',
  'items': [{'album': {'album_type': 'album',
     'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/1Xyo4u8uXC1ZmMpatF05PJ'},
       'href': 'https://api.spotify.com/v1/artists/1Xyo4u8uXC1ZmMpatF05PJ',
       'id': '1Xyo4u8uXC1ZmMpatF05PJ',
       'name': 'The Weeknd',
       'type': 'artist',
       'uri': 'spotify:artist:1Xyo4u8uXC1ZmMpatF05PJ'}],
     'available_markets': ['AD',
      'AE',
      'AL',
      'AR',
      'AT',
      'AU',
      'BA',
      'BE',
      'BG',
      'BH',
      'BO',
      'BR',
      'BY',
      'CA',
      'CH',
      'CL',
      'CO',
      'CR',
      'CY',
      'CZ',
      'DE',
      'DK',
      'DO',
      'DZ',
      'EC',
      'EE',
      'EG',
      'ES',
      'FI',
      'FR',
      'GB',
      'GR',
      'GT',
      'HK',
      'HN',
      'HR',
      'HU',
      'ID',
      'IE',
      'IL',
      'IN',
      '

In [6]:
#Create a query that looks at tracks released in 2020
track_results = sp.search(q='year:2020', type='track', limit=3)
track_results

{'tracks': {'href': 'https://api.spotify.com/v1/search?query=year%3A2020&type=track&offset=0&limit=3',
  'items': [{'album': {'album_type': 'single',
     'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/7tYKF4w9nC0nq9CsPZTHyP'},
       'href': 'https://api.spotify.com/v1/artists/7tYKF4w9nC0nq9CsPZTHyP',
       'id': '7tYKF4w9nC0nq9CsPZTHyP',
       'name': 'SZA',
       'type': 'artist',
       'uri': 'spotify:artist:7tYKF4w9nC0nq9CsPZTHyP'}],
     'available_markets': ['AD',
      'AE',
      'AL',
      'AR',
      'AT',
      'AU',
      'BA',
      'BE',
      'BG',
      'BH',
      'BO',
      'BR',
      'BY',
      'CA',
      'CH',
      'CL',
      'CO',
      'CR',
      'CY',
      'CZ',
      'DE',
      'DK',
      'DO',
      'DZ',
      'EC',
      'EE',
      'EG',
      'ES',
      'FI',
      'FR',
      'GB',
      'GR',
      'GT',
      'HK',
      'HN',
      'HR',
      'HU',
      'ID',
      'IE',
      'IL',
      'IN',
      'IS',


Look at the genre seeds that Spotify makes available using the recommendation_genre_seeds call.

In [7]:
#Use the recommendation_genre_seeds method to extract genres
genres = sp.recommendation_genre_seeds()

In [8]:
#Print the list of genres to see which genres exist 
genres['genres']

['acoustic',
 'afrobeat',
 'alt-rock',
 'alternative',
 'ambient',
 'anime',
 'black-metal',
 'bluegrass',
 'blues',
 'bossanova',
 'brazil',
 'breakbeat',
 'british',
 'cantopop',
 'chicago-house',
 'children',
 'chill',
 'classical',
 'club',
 'comedy',
 'country',
 'dance',
 'dancehall',
 'death-metal',
 'deep-house',
 'detroit-techno',
 'disco',
 'disney',
 'drum-and-bass',
 'dub',
 'dubstep',
 'edm',
 'electro',
 'electronic',
 'emo',
 'folk',
 'forro',
 'french',
 'funk',
 'garage',
 'german',
 'gospel',
 'goth',
 'grindcore',
 'groove',
 'grunge',
 'guitar',
 'happy',
 'hard-rock',
 'hardcore',
 'hardstyle',
 'heavy-metal',
 'hip-hop',
 'holidays',
 'honky-tonk',
 'house',
 'idm',
 'indian',
 'indie',
 'indie-pop',
 'industrial',
 'iranian',
 'j-dance',
 'j-idol',
 'j-pop',
 'j-rock',
 'jazz',
 'k-pop',
 'kids',
 'latin',
 'latino',
 'malay',
 'mandopop',
 'metal',
 'metal-misc',
 'metalcore',
 'minimal-techno',
 'movies',
 'mpb',
 'new-age',
 'new-release',
 'opera',
 'pagode',

In [9]:
#Extract specific information from the "acoustic" genre: find the name of the fourth song (index 3)
acoustics = sp.search(q='genre: acoustic', limit=5, type='track')['tracks']['items']
acoustics[3]['name']

'Come On Get Higher'

In [10]:
#Create a variable called 'hugeplaylist' that extracts information from a specific playlist using its playlist id (offset=100 skips the first 100 tracks)
hugeplaylist = sp.user_playlist_tracks(playlist_id="54nv8jbrm4JoHEZ49Qvjgl", offset=100)["items"]

In [47]:
len(hugeplaylist)

100

The Spotipy call user_playlist_tracks returns at most 100 tracks per request. From looking at this playlist in the Spotify app, I know that it actually contains thousands of songs. Before I paginate through all the songs in the playlist, I will look at what sort of information I can extract from it.

In [11]:
#Find the first song's track name 
hugeplaylist[0]['track']['name']

'Turning Tables'

In [12]:
#Find the first song's release date 
hugeplaylist[0]['track']['album']['release_date']

'2011-01-19'

According to the Spotify API documentation, the popularity of a track is defined as follows:

"*The popularity of the track. The value will be between 0 and 100, with 100 being the most popular.
The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.
Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.*"

This 'popularity' measure could potentially be useful in future modeling when determining the ratings of our songs. For now, let's see how we can extract the popularity metric for the song "Turning Tables" from our playlist.

In [13]:
#Find the first track's popularity score
hugeplaylist[0]['track']['popularity']

40

It also may be interesting to see if a track is explicit or not. Let's analyze this feature as well.

In [14]:
#See if the first song is explicit or not 
hugeplaylist[0]['track']['explicit']

False

I will now extract the data from one playlist on Spotify. The playlist was picked arbitrarily, apart from one criterion: its large size (approximately 10,000 songs). I will use this playlist as the dataset for the remainder of my analysis. The playlist data can be requested from the API by passing the playlist's ID to the user_playlist_tracks method.

In [15]:
#Page through the playlist in steps of 100 (the per-request maximum) to extract approximately 10,000 songs 
#from the chosen Spotify playlist 

#Store the playlist ID once so it is not repeated in every call
playlist_id = "54nv8jbrm4JoHEZ49Qvjgl"
#Start with offset=0
offset = 0 
#Create an empty list to hold the playlist items
playlist = []

#Request one page per iteration and append its items until offsets up to 10,000 have been covered
while offset<10000:
    page = sp.user_playlist_tracks(playlist_id=playlist_id, offset=offset)
    playlist.extend(page["items"])
    offset+=100

In [16]:
#Print the length of the playlist
len(playlist)

9964
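
The pagination above relies on knowing the approximate playlist size in advance. A minimal sketch of an alternative, assuming only that each page is a dictionary with an `items` list, is to keep requesting pages until one comes back empty. The names `fetch_all`, `fake_fetch` and `fake_playlist` are hypothetical, and the fake fetcher stands in for the API so the example runs offline.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect items page by page until a page comes back empty.

    fetch_page(offset) must return a dict with an "items" list.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page["items"]:
            break
        items.extend(page["items"])
        offset += page_size
    return items

# Offline stand-in for the API: a 250-item "playlist" served 100 at a time
fake_playlist = list(range(250))

def fake_fetch(offset):
    return {"items": fake_playlist[offset:offset + 100]}

print(len(fetch_all(fake_fetch)))  # 250
```

With the client from this notebook, the same helper could be called as `fetch_all(lambda offset: sp.user_playlist_tracks(playlist_id="54nv8jbrm4JoHEZ49Qvjgl", offset=offset))`.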

I now have all 9,964 songs in my playlist extracted from the API, and can use this information as the full dataset. I will again take a look at some of the information that I can extract.

In [17]:
#Find the second song's track ID 
playlist[1]['track']['id']

'1Jx69b09LKTuBQxkEiFfVX'

In [18]:
#Find the first song's artist ID 
playlist[0]['track']['artists'][0]['id']

'6jJ0s89eD6GaHleKKya26X'

## Putting Playlist into DataFrame

In [20]:
#Create a function to extract features of our playlist and return them as a DataFrame 
def df_playlist(playlist_items):
    
    # Set column names for the final dataframe
    playlist_features_list = ["artist","artist_id", "album","track_name", "release_date", "popularity", "is_explicit", "track_id"]
    
    # Collect one dictionary of features per track
    rows = []
    
    # Loop through every item in the playlist and extract the relevant features
    for item in playlist_items:
        track = item["track"]
        # Skip items that have no track data
        if track is None:
            continue
        rows.append({
            # Note: the artist name is taken from the album's first artist,
            # while artist_id is taken from the track's first artist
            "artist": track["album"]["artists"][0]["name"],
            "artist_id": track["artists"][0]["id"],
            "album": track["album"]["name"],
            "track_name": track["name"],
            "release_date": track["album"]["release_date"],
            "popularity": track["popularity"],
            "is_explicit": track["explicit"],
            "track_id": track["id"],
        })
    
    # Build the dataframe once from all rows instead of concatenating row by row
    playlist_df = pd.DataFrame(rows, columns = playlist_features_list)
    
    #Return final playlist dataframe
    return playlist_df

In [21]:
#Apply function to the playlist that was previously extracted and set equal to a new dataframe called df 
df = df_playlist(playlist)

In [22]:
#Look at shape of new DataFrame 
df.shape

(9963, 8)
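
The DataFrame has one row fewer than the 9,964 items collected earlier because `df_playlist` skips any item whose `track` entry is `None`. As a small sketch, such items can be counted directly; the helper name `count_missing_tracks` and the sample items are hypothetical.

```python
def count_missing_tracks(items):
    """Count playlist items whose "track" entry is None."""
    return sum(1 for item in items if item["track"] is None)

# Illustrative items only; real playlist items carry many more fields
sample_items = [
    {"track": {"id": "a"}},
    {"track": None},
    {"track": {"id": "b"}},
]

print(count_missing_tracks(sample_items))  # 1
```

Calling `count_missing_tracks(playlist)` on the extracted playlist should account for the difference between the two counts.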

In [48]:
#Look at the new dataframe with the chosen features 
df.head()

| | artist | artist_id | album | track_name | release_date | popularity | is_explicit | track_id |
|---|---|---|---|---|---|---|---|---|
| 0 | Katy Perry | 6jJ0s89eD6GaHleKKya26X | Katy Perry - Teenage Dream: The Complete Confe... | Firework | 2012-03-12 | 73 | False | 4lCv7b86sLynZbXhfScfm2 |
| 1 | OneRepublic | 5Pwc4xIPtQLFEnJriah9YJ | Dreaming Out Loud | All We Are | 2007-01-01 | 39 | False | 1Jx69b09LKTuBQxkEiFfVX |
| 2 | Amy Winehouse | 6Q192DXotxtaysaqNPy5yR | Back To Black | Wake Up Alone | 2006-01-01 | 0 | False | 4u83mwF5tUuWlXS86UOXdu |
| 3 | The Script | 3AQRLZ9PuTAozP28Skbq8V | The Script | The Man Who Can't Be Moved | 2008-09-08 | 57 | False | 4Musyaro0NM5Awx8b5c627 |
| 4 | Adele | 4dpARuHxo51G3z768sgnrY | 21 | Rolling in the Deep | 2011-01-19 | 51 | False | 1CkvWZme3pRgbzaxZnTl5X |


In [23]:
#Export new DataFrame to csv file 
df.to_csv('playlistdf')

## Audio Features

The Spotify API also allows you to extract certain audio features for tracks, such as danceability, energy, loudness and acousticness. Here is an example of the response returned when this call is made on a specific track.

In [24]:
#Use the API to extract the audio features from one track from our above playlist (random track ID was chosen)
sp.audio_features(tracks=['76N7FdzCI9OsiUnzJVLY2m'])

[{'danceability': 0.618,
  'energy': 0.753,
  'key': 7,
  'loudness': -5.05,
  'mode': 0,
  'speechiness': 0.0451,
  'acousticness': 0.637,
  'instrumentalness': 0,
  'liveness': 0.0905,
  'valence': 0.557,
  'tempo': 120.041,
  'type': 'audio_features',
  'id': '76N7FdzCI9OsiUnzJVLY2m',
  'uri': 'spotify:track:76N7FdzCI9OsiUnzJVLY2m',
  'track_href': 'https://api.spotify.com/v1/tracks/76N7FdzCI9OsiUnzJVLY2m',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/76N7FdzCI9OsiUnzJVLY2m',
  'duration_ms': 221427,
  'time_signature': 4}]

I will need to use the API to extract the audio features on all the songs in the playlist. Since the API takes in a list of track IDs, I will begin by extracting the track ids from our existing playlist.

In [25]:
#Create a list of just the track ids for each song in the playlist 
trackid_list = []
for track in playlist:
    if track["track"] is not None:
        trackid_list.append(track['track']['id'])

In [26]:
#Check the length of the track id list 
len(trackid_list)

9963

The sp.audio_features call only accepts up to 100 track IDs at a time. Therefore, I will need a loop that requests the audio features for all 9,963 songs in my playlist in batches of 100.

In [27]:
#Create an empty list called "audiofeatures"
audiofeatures = []

#Initialize a counter 
counter = 0 

#Create a loop that adds the audio features for each batch of 100 track IDs to the above list 
while counter<len(trackid_list):
    audiofeatures.extend(sp.audio_features(tracks=trackid_list[counter:counter+100]))
    counter+=100

In [28]:
#Check the length of the new array to see that it matches the length of the track id list 
len(audiofeatures)

9963
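
The batching idea used above can be sketched as a small reusable generator. The name `batches_of` is hypothetical, and the placeholder IDs below only mimic the length of `trackid_list` so that the example runs without calling the API.

```python
def batches_of(ids, size=100):
    """Yield consecutive slices of ids with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Stand-in IDs with the same count as the playlist's track ID list
fake_ids = [f"id{i}" for i in range(9963)]
batches = list(batches_of(fake_ids))

print(len(batches), len(batches[-1]))  # 100 63
```

Each batch could then be passed to `sp.audio_features(tracks=batch)`. Because the range stops at the length of the list, no request is made with an empty batch.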

In [29]:
#Transform result to a DataFrame and Print the first 5 rows 
audio_data = pd.DataFrame(audiofeatures)
audio_data.head()

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | type | id | uri | track_href | analysis_url | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.638 | 0.826 | 8 | -4.968 | 1 | 0.0479 | 0.139 | 0.0 | 0.0803 | 0.649 | 124.072 | audio_features | 4lCv7b86sLynZbXhfScfm2 | spotify:track:4lCv7b86sLynZbXhfScfm2 | https://api.spotify.com/v1/tracks/4lCv7b86sLyn... | https://api.spotify.com/v1/audio-analysis/4lCv... | 227880 | 4 |
| 1 | 0.397 | 0.817 | 2 | -5.495 | 1 | 0.042 | 0.0966 | 8e-06 | 0.316 | 0.416 | 158.004 | audio_features | 1Jx69b09LKTuBQxkEiFfVX | spotify:track:1Jx69b09LKTuBQxkEiFfVX | https://api.spotify.com/v1/tracks/1Jx69b09LKTu... | https://api.spotify.com/v1/audio-analysis/1Jx6... | 266227 | 4 |
| 2 | 0.384 | 0.527 | 11 | -5.294 | 0 | 0.0269 | 0.542 | 0.0 | 0.156 | 0.222 | 100.73 | audio_features | 4u83mwF5tUuWlXS86UOXdu | spotify:track:4u83mwF5tUuWlXS86UOXdu | https://api.spotify.com/v1/tracks/4u83mwF5tUuW... | https://api.spotify.com/v1/audio-analysis/4u83... | 221200 | 3 |
| 3 | 0.609 | 0.629 | 10 | -5.024 | 1 | 0.0264 | 0.425 | 0.0 | 0.0978 | 0.325 | 99.955 | audio_features | 4Musyaro0NM5Awx8b5c627 | spotify:track:4Musyaro0NM5Awx8b5c627 | https://api.spotify.com/v1/tracks/4Musyaro0NM5... | https://api.spotify.com/v1/audio-analysis/4Mus... | 241467 | 4 |
| 4 | 0.729 | 0.756 | 8 | -5.119 | 1 | 0.0294 | 0.131 | 0.0 | 0.0527 | 0.522 | 104.945 | audio_features | 1CkvWZme3pRgbzaxZnTl5X | spotify:track:1CkvWZme3pRgbzaxZnTl5X | https://api.spotify.com/v1/tracks/1CkvWZme3pRg... | https://api.spotify.com/v1/audio-analysis/1Ckv... | 228293 | 4 |


There are many features that are relevant to a track's audio data; I will export these values to a csv file and include them in our data exploration and modeling.

In [30]:
#Save results to a csv file 
audio_data.to_csv('dfaudio')

## Genres

Another relevant feature of a song is its genre. Within Spotipy, the sp.artists() call is the endpoint where genre information is available, so genres are attached to artists rather than to individual tracks. I can extract genre information by passing in the artist IDs for every artist in the playlist.

In [31]:
#Create list of artist IDs
artistid_list = []
for track in playlist:
    if track["track"] is not None:
        artistid_list.append(track["track"]["artists"][0]["id"])

In [32]:
#Preview the first two items in the artistid list 
artistid_list[0:2]

['6jJ0s89eD6GaHleKKya26X', '5Pwc4xIPtQLFEnJriah9YJ']

In [33]:
#Check for any NA values 
pd.Series(artistid_list).isna().sum()

0

In [34]:
#Extract the information from the sp.artists call for the first two artist IDs 
music = sp.artists(artists=['6jJ0s89eD6GaHleKKya26X', '5Pwc4xIPtQLFEnJriah9YJ'])

In [35]:
#Print the full output for the first artist 
music['artists'][0]

{'external_urls': {'spotify': 'https://open.spotify.com/artist/6jJ0s89eD6GaHleKKya26X'},
 'followers': {'href': None, 'total': 17542631},
 'genres': ['dance pop', 'pop', 'pop dance', 'post-teen pop'],
 'href': 'https://api.spotify.com/v1/artists/6jJ0s89eD6GaHleKKya26X',
 'id': '6jJ0s89eD6GaHleKKya26X',
 'images': [{'height': 640,
   'url': 'https://i.scdn.co/image/ecebace064c7a48b7ae4a611b82887aa79163c1e',
   'width': 640},
  {'height': 320,
   'url': 'https://i.scdn.co/image/bfba3057841ac48bebdd5dc3549647a8473d5ad9',
   'width': 320},
  {'height': 160,
   'url': 'https://i.scdn.co/image/1369660012d3d32372ce05da3e7268c9bf426030',
   'width': 160}],
 'name': 'Katy Perry',
 'popularity': 88,
 'type': 'artist',
 'uri': 'spotify:artist:6jJ0s89eD6GaHleKKya26X'}

In [36]:
#Print the genres of the first artist
music['artists'][0]['genres']

['dance pop', 'pop', 'pop dance', 'post-teen pop']

As we see from this artist, the genres listed are closely related (dance pop and pop dance are essentially the same thing). As a result, for the purposes of this analysis, I will only retrieve the first genre listed.

Next, I will apply the sp.artists method to all the artists in our playlist.

In [37]:
#Create an empty list called "genrefeatures"
genrefeatures = []

#Initialize a counter 
counter = 0 

#Create a loop that adds the artist results (including genres) for each batch of 10 artist IDs to the above list 
while counter<10000:
    try:
        genrefeatures.append(sp.artists(artists=artistid_list[counter:counter+10]))
        counter+=10
    except Exception:
        #Print the starting index of any batch that fails 
        #(the list holds 9,963 IDs, so batches starting at 9970 or later are empty) 
        print("Bad URL")
        print(counter)
        counter+=10

HTTP Error for GET to https://api.spotify.com/v1/artists/?ids= returned 400 due to invalid id
HTTP Error for GET to https://api.spotify.com/v1/artists/?ids= returned 400 due to invalid id
HTTP Error for GET to https://api.spotify.com/v1/artists/?ids= returned 400 due to invalid id


Bad URL
9970
Bad URL
9980
Bad URL
9990
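
The three errors come from batches that start past the end of the list: there are 9,963 artist IDs, so the slices starting at 9970, 9980 and 9990 are empty and the request is sent with no IDs. The sketch below reproduces that arithmetic offline; `fake_artist_ids` is a hypothetical stand-in with the same length as `artistid_list`.

```python
# Stand-in list with the same length as artistid_list (9,963 entries)
fake_artist_ids = [f"artist{i}" for i in range(9963)]

# Batch starts used by the loop above: 0, 10, ..., 9990
starts = range(0, 10000, 10)

# An empty slice means the request would carry no IDs at all
empty_starts = [s for s in starts if not fake_artist_ids[s:s + 10]]

print(empty_starts)                     # [9970, 9980, 9990]
print(len(starts) - len(empty_starts))  # 997
```

Iterating over `range(0, len(artistid_list), 10)` instead would skip those three requests while still producing the same 997 non-empty batches.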


In [38]:
#Look at the length of our genrefeatures
len(genrefeatures)

997

Let's look more into the items in "genrefeatures".

In [39]:
#Look at the length of each item in our list 
len(genrefeatures[0]['artists'])

10

It appears that each item in genrefeatures contains up to 10 artists. Let's pull all of these artists out into one flat list so that we are working with an easier dataset.

In [40]:
#Create empty list called genres 
genres = []
#Pull out the information for all artists within each batch 
for i in genrefeatures:
    genres.extend(i['artists'])

In [41]:
#Check length of genres list 
len(genres)

9963

Now that all of the artists and their genre information have their own index in one list, we can go ahead and extract the genre information for all of the artists in our playlist.

In [42]:
genres[0]['name']

'Katy Perry'

In [43]:
#Name which features will be extracted from dataset 
genres_features_list = ["artist_id", "artist", "genres"]

#Collect one dictionary of genre features per artist 
genre_rows = []

#Loop through each item in genres and extract the artist ID, the artist name and the first genre listed 
for genre in genres:
    genre_features = {}
    genre_features['artist_id'] = genre["id"]
    genre_features['artist'] = genre["name"]
    #For any artist that doesn't have genres, write "unknown"
    if len(genre["genres"])==0:
        genre_features['genres'] = "unknown"
    #For remaining artists, use the first genre in their genre list 
    else:
        genre_features['genres'] = genre["genres"][0]
    genre_rows.append(genre_features)

#Build the dataframe once, with genres_features_list as the column names 
genre_df = pd.DataFrame(genre_rows, columns = genres_features_list)

In [44]:
#Investigate new genres dataframe 
genre_df.head()

| | artist_id | artist | genres |
|---|---|---|---|
| 0 | 6jJ0s89eD6GaHleKKya26X | Katy Perry | dance pop |
| 1 | 5Pwc4xIPtQLFEnJriah9YJ | OneRepublic | dance pop |
| 2 | 6Q192DXotxtaysaqNPy5yR | Amy Winehouse | british soul |
| 3 | 3AQRLZ9PuTAozP28Skbq8V | The Script | celtic rock |
| 4 | 4dpARuHxo51G3z768sgnrY | Adele | british soul |


In [45]:
#Look at unique number of genres in dataset 
genre_df['genres'].nunique()

575

It appears that we have 575 unique genres in our dataset! Later on in the data cleaning process we could look to bin these genres into broader categories.
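
One possible way to bin the genres, sketched here under the assumption that a simple keyword match is good enough, is to map each detailed label to a broad category. The keyword list, the bin names and the function `bin_genre` are all hypothetical choices made for illustration, not the binning used later in this project.

```python
# Hypothetical keyword-to-bin mapping; order matters because the first match wins
GENRE_BINS = [
    ("hip hop", "hip hop"),
    ("rap", "hip hop"),
    ("rock", "rock"),
    ("soul", "soul"),
    ("pop", "pop"),
]

def bin_genre(genre):
    """Map a detailed genre label to a broad bin, or "other" if nothing matches."""
    for keyword, broad in GENRE_BINS:
        if keyword in genre:
            return broad
    return "other"

# Labels taken from the first rows of the genres dataframe, plus the "unknown" filler
for label in ["dance pop", "british soul", "celtic rock", "unknown"]:
    print(label, "->", bin_genre(label))
```

Applied to the dataframe, this could look like `genre_df['genres'].map(bin_genre)`. Substring matching is crude (for example, "trap" would match "rap"), so the keyword list would need reviewing against the actual 575 labels.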

Lastly, we will export this genres dataframe to a csv file:

In [46]:
#Save genre results to a csv file 
genre_df.to_csv('dfgenres')

I now have all the data required to conduct my EDA. The datasets from the API calls for playlist information, audio features and genres will be combined and explored in the next notebook.