# Dataframe Creation from Kaggle Dataset using Spotipy

**Author:** Dermot O'Brien
***

## Overview
In this notebook, I walk through how I created the dataframe I used for modeling. I first gathered a list of track IDs from Kaggle's [Every Noise at Once Dataset](https://www.kaggle.com/datasets/nikitricky/every-noise-at-once/discussion?select=songs.csv) and passed them through Spotipy - a Python library that wraps Spotify's Web API. See Spotipy's documentation [here](https://spotipy.readthedocs.io/en/latest/).

## Import Standard Packages

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import base64
import requests
import datetime

%matplotlib inline

## Kaggle Data
To start, I created a list of track IDs from the Kaggle data to later pass through Spotipy. Note that I am only keeping tracks released between Jan 2020 and Sept 2021.

In [13]:
# Turn Kaggle Data into dataframe
df = pd.read_csv('./Data/songs.csv')

In [14]:
# Check Shape
df.shape

(544401, 17)

In [15]:
# Convert the date column ('Release') from string to datetime
df['Release'] = pd.to_datetime(df['Release'])

In [16]:
# Look at only current songs (released on or after Jan 1, 2020; no upper bound is applied here)
new_df = df.loc[df['Release'] >= '2020/01/01']
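
The filter above only sets a lower bound on the release date. If you want the date window to be explicit on both ends, something like the sketch below should work. It uses a small made-up dataframe (the IDs and dates are hypothetical); the column names `Id` and `Release` simply mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for songs.csv
toy_df = pd.DataFrame({
    "Id": ["a", "b", "c", "d"],
    "Release": ["2019-12-31", "2020-01-01", "2021-09-30", "2021-10-01"],
})

# Parse the release dates; errors="coerce" turns unparseable values into NaT
toy_df["Release"] = pd.to_datetime(toy_df["Release"], errors="coerce")

# Keep rows released between Jan 1, 2020 and Sept 30, 2021 (both ends inclusive)
start = pd.Timestamp("2020-01-01")
end = pd.Timestamp("2021-09-30")
recent_df = toy_df.loc[toy_df["Release"].between(start, end)]

# Collect the surviving IDs as a plain Python list
recent_ids = recent_df["Id"].tolist()
print(recent_ids)
```

Rows with missing or unparseable dates become `NaT` and drop out of the window automatically, since `NaT` never compares as inside a range.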

In [17]:
# Check shape of new dataframe
new_df.shape

(101986, 17)

In [18]:
# Create a list of ID's from the Id column to pass through Spotipy
id_list = new_df.Id.values.tolist()

## Spoti(py)

To gain access to Spotify's API, you'll first need to be authorized. Authorization steps can be found [here](https://developer.spotify.com/documentation/general/guides/authorization/). For this project, you will need the "Client credentials" authorization flow - a more basic authorization, since we are not pulling user data. You will also need to create a new app, which will grant you a client id and client secret; both are needed for pulling the data. App setup can be found [here](https://developer.spotify.com/documentation/general/guides/authorization/app-settings/). Once you have a client id and client secret, create variables for them as shown below. For demonstration purposes, I've replaced my actual strings with fake strings so as not to give away my information.

In [2]:
# Create variables for client id and client secret to create token
cid = 'fe9b61ddthisisafakeclientid095bb66a'
secret = '8544bfthisisafakeclientsecret032d9617e'
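
An alternative to pasting the strings into the notebook is to read them from environment variables, so they never end up in a saved file. The sketch below is one way to do that. The helper name `load_spotify_credentials` is hypothetical; the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones Spotipy's documentation uses, but any names would work with this helper.

```python
import os

def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    # Fail early with a clear message instead of sending empty credentials to the API
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET before running this notebook."
        )
    return client_id, client_secret

# Demonstration with a fake mapping; in the notebook you would call
# load_spotify_credentials() with no arguments to read the real environment.
fake_env = {
    "SPOTIPY_CLIENT_ID": "thisisafakeclientid",
    "SPOTIPY_CLIENT_SECRET": "thisisafakeclientsecret",
}
cid, secret = load_spotify_credentials(fake_env)
```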

### Audio Features
Next, we're going to pull our audio features using the list of track IDs. Once spotipy and SpotifyClientCredentials are imported (see imported packages above), we can set up our connection to the API. See code below:

In [3]:
# Spotipy authorization steps
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

The code below shows how to get audio features for a short test list of tracks. Spotify's API has many endpoints that you can call depending on your data needs. See a reference to their endpoints and available data [here](https://developer.spotify.com/documentation/web-api/reference/#/).

In [103]:
# Get audio features for the tracks in test_list
# (test_list is a short list of track IDs; its definition is not shown in this notebook)
audio_features = spotify.audio_features(test_list)
audio_features

[{'danceability': 0.666,
  'energy': 0.796,
  'key': 10,
  'loudness': -6.967,
  'mode': 0,
  'speechiness': 0.103,
  'acousticness': 0.0492,
  'instrumentalness': 0,
  'liveness': 0.0442,
  'valence': 0.61,
  'tempo': 110.108,
  'type': 'audio_features',
  'id': '1058fW9H3fZA6QjYCdOBad',
  'uri': 'spotify:track:1058fW9H3fZA6QjYCdOBad',
  'track_href': 'https://api.spotify.com/v1/tracks/1058fW9H3fZA6QjYCdOBad',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/1058fW9H3fZA6QjYCdOBad',
  'duration_ms': 164842,
  'time_signature': 4},
 {'danceability': 0.386,
  'energy': 0.426,
  'key': 3,
  'loudness': -6.642,
  'mode': 1,
  'speechiness': 0.0363,
  'acousticness': 0.807,
  'instrumentalness': 0,
  'liveness': 0.14,
  'valence': 0.261,
  'tempo': 180.104,
  'type': 'audio_features',
  'id': '5ajjAnNRh8bxFvaVHzpPjh',
  'uri': 'spotify:track:5ajjAnNRh8bxFvaVHzpPjh',
  'track_href': 'https://api.spotify.com/v1/tracks/5ajjAnNRh8bxFvaVHzpPjh',
  'analysis_url': 'https://api.spotif

Spotify only allows you to pull audio features for up to 100 tracks per request, so the code below iterates through the entire list of track IDs in batches of 100.

In [153]:
# Get audio features for entire dataset
features_list = []
for i in range(0, len(id_list), 100):
    audio_features = spotify.audio_features(id_list[i:i+100])
    features_list += audio_features
features_list

[{'danceability': 0.666,
  'energy': 0.796,
  'key': 10,
  'loudness': -6.967,
  'mode': 0,
  'speechiness': 0.103,
  'acousticness': 0.0492,
  'instrumentalness': 0,
  'liveness': 0.0442,
  'valence': 0.61,
  'tempo': 110.108,
  'type': 'audio_features',
  'id': '1058fW9H3fZA6QjYCdOBad',
  'uri': 'spotify:track:1058fW9H3fZA6QjYCdOBad',
  'track_href': 'https://api.spotify.com/v1/tracks/1058fW9H3fZA6QjYCdOBad',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/1058fW9H3fZA6QjYCdOBad',
  'duration_ms': 164842,
  'time_signature': 4},
 {'danceability': 0.386,
  'energy': 0.426,
  'key': 3,
  'loudness': -6.642,
  'mode': 1,
  'speechiness': 0.0363,
  'acousticness': 0.807,
  'instrumentalness': 0,
  'liveness': 0.14,
  'valence': 0.261,
  'tempo': 180.104,
  'type': 'audio_features',
  'id': '5ajjAnNRh8bxFvaVHzpPjh',
  'uri': 'spotify:track:5ajjAnNRh8bxFvaVHzpPjh',
  'track_href': 'https://api.spotify.com/v1/tracks/5ajjAnNRh8bxFvaVHzpPjh',
  'analysis_url': 'https://api.spotif
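
The batching pattern above can also be written as a small reusable helper. The sketch below is only an illustration: `chunked`, `fetch_in_batches`, and `fake_audio_features` are hypothetical names, and the stub replaces the real `spotify.audio_features` call so the example runs without credentials. It also skips `None` entries, on the assumption that the API returns `None` for tracks it has no features for.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_in_batches(ids, fetch, batch_size=100):
    """Call `fetch` on each batch of IDs and collect the non-empty results."""
    results = []
    for batch in chunked(ids, batch_size):
        response = fetch(batch) or []
        # Drop None placeholders so later dataframe code only sees dictionaries
        results.extend(item for item in response if item is not None)
    return results

def fake_audio_features(batch):
    """Stub standing in for spotify.audio_features; returns None for some IDs."""
    return [
        None if track_id.endswith("7") else {"id": track_id, "tempo": 120.0}
        for track_id in batch
    ]

# 250 made-up IDs split into batches of 100, 100, and 50
fake_ids = [f"track{i}" for i in range(250)]
batch_sizes = [len(batch) for batch in chunked(fake_ids, 100)]
features = fetch_in_batches(fake_ids, fake_audio_features, batch_size=100)
```

In the notebook, the call would look like `fetch_in_batches(id_list, spotify.audio_features, batch_size=100)`.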

In [157]:
# Create a dataframe from the audio features list
features_df = pd.DataFrame.from_records(features_list)

In [160]:
# Save features_df to a csv
features_df.to_csv("./Data/features_df", index=False)

I saved this dataframe as a separate csv file so that I do not need to repeat the API calls every time I rerun my work. This also allowed me to have a cleaner modeling notebook.
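
One way to make this "fetch once, reuse later" idea explicit is a small load-or-build helper. The sketch below is a sketch under assumptions, not the code I actually ran: `load_or_build` and `fake_build` are hypothetical names, and the stub builder stands in for the slow API loop.

```python
import os
import tempfile

import pandas as pd

def load_or_build(csv_path, build):
    """Return the dataframe saved at csv_path, building and saving it first if missing."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = build()
    df.to_csv(csv_path, index=False)
    return df

build_calls = []

def fake_build():
    """Stub standing in for the API loop; records each time it is called."""
    build_calls.append(1)
    return pd.DataFrame({"id": ["a", "b"], "tempo": [110.108, 180.104]})

# Use a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features_df.csv")
    first = load_or_build(path, fake_build)   # builds and saves
    second = load_or_build(path, fake_build)  # reads the saved csv instead
```

The second call never touches the builder, which is the behavior you want when the builder is a long series of API requests.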

### Target Feature - Popularity
Let's do the same thing to get the popularity scores from the list of track IDs.

In [32]:
# Get popularity for a single track
popularity = spotify.tracks(['02MWAaffLxlfxAUY7c5dvx'])['tracks']
popularity

[{'album': {'album_type': 'album',
   'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4yvcSjfu4PC0CYQyLy4wSq'},
     'href': 'https://api.spotify.com/v1/artists/4yvcSjfu4PC0CYQyLy4wSq',
     'id': '4yvcSjfu4PC0CYQyLy4wSq',
     'name': 'Glass Animals',
     'type': 'artist',
     'uri': 'spotify:artist:4yvcSjfu4PC0CYQyLy4wSq'}],
   'available_markets': ['AD',
    'AE',
    'AG',
    'AL',
    'AM',
    'AO',
    'AR',
    'AT',
    'AU',
    'AZ',
    'BA',
    'BB',
    'BD',
    'BE',
    'BF',
    'BG',
    'BH',
    'BI',
    'BJ',
    'BN',
    'BO',
    'BR',
    'BS',
    'BT',
    'BW',
    'BY',
    'BZ',
    'CA',
    'CD',
    'CG',
    'CH',
    'CI',
    'CL',
    'CM',
    'CO',
    'CR',
    'CV',
    'CW',
    'CY',
    'CZ',
    'DE',
    'DJ',
    'DK',
    'DM',
    'DO',
    'DZ',
    'EC',
    'EE',
    'EG',
    'ES',
    'ET',
    'FI',
    'FJ',
    'FM',
    'FR',
    'GA',
    'GB',
    'GD',
    'GE',
    'GH',
    'GM',
    'GN',
 

Next is another loop to iterate through our entire list of track IDs. This specific endpoint only allows you to pull 50 tracks at a time. The popularity score is also buried inside each track's dictionary under the response's 'tracks' key, which requires a nested for loop.

In [126]:
# Get popularity for all tracks in id_list into a dictionary
popularityDict = {}
# Step through the whole list in batches of 50 (the last batch may be shorter)
for i in range(0, len(id_list), 50):
    result = spotify.tracks(id_list[i:i+50])
    if result:
        tracks = result['tracks']
        for track in tracks:
            # Skip IDs the API could not find (returned as None)
            if track:
                popularityDict[track['id']] = track['popularity']
popularityDict

{'1058fW9H3fZA6QjYCdOBad': 75,
 '5ajjAnNRh8bxFvaVHzpPjh': 79,
 '079Ey5uxL04AKPQgVQwx5h': 74,
 '6FZDfxM3a3UCqtzo5pxSLZ': 80,
 '2Oycxb8QbPkpHTo8ZrmG0B': 79,
 '1CMa9Rxlky3HsjkB8oYuL0': 34,
 '0z8hI3OPS8ADPWtoCjjLl6': 80,
 '45bE4HXI0AwGZXfZtMp8JR': 83,
 '03B2SfXuvDh1m9F4tqrX07': 68,
 '6J2LdBN97cDWn0MLxYh9HB': 80,
 '46u5B2WN4wryYLZuMAOmI4': 72,
 '5Z0AM9HW78XIyZqF2BPasr': 68,
 '6nfqlFOMiWthaOEa53uU0v': 65,
 '7eQHxigpuDJjCG50JyzU8v': 74,
 '49dFIRQCQxPWgoH0m38XQ5': 66,
 '4qefHyLKbyW3yeqk5Jrjey': 63,
 '6PZpNMstpIiRenGK5UyG5D': 60,
 '77MdvMx9L4ZQuLhhn3o21h': 74,
 '2vXgyN14LX2zl7JEASw242': 72,
 '0ZXdzaT1k688dkpNeEgQiV': 68,
 '6KL88T4Ma4ABXqzgUoEwkd': 75,
 '35mvY5S1H3J2QZyna3TFe0': 84,
 '5VmpLtRycwbA54XsTffKq4': 1,
 '2D0FX6WiP1GKGL3yCdXxs7': 65,
 '59qrUpoplZxbIZxk6X0Bm3': 77,
 '3D2H0RZzOXziswr9UHbpyb': 64,
 '1is8gU4RVcN4J8xItxWoOY': 77,
 '3n5te2xbUAPjzAnhLgA42z': 46,
 '69HzZ3ti9DLwb0GdWCGYSo': 77,
 '3p2wS6G159mBIU50xl7uvc': 59,
 '7CH8J4ulT49UfZwSDSkSZA': 62,
 '3DTqHfTGj1c6y2gDXsTez4': 63,
 '73QyyUM

In [132]:
# Create a dataframe from the dictionary
pop_df = pd.DataFrame(list(popularityDict.items()))

In [142]:
# Rename columns
pop_df.rename(columns={0: "id", 1: "popularity"}, inplace=True)

In [144]:
# Save df to a csv
pop_df.to_csv("./Data/target_df", index=False)
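
Since both saved files share an `id` column, they can later be joined into a single table. The sketch below shows one possible way to do that with toy data: the first two rows reuse values from the outputs above, while the third ID and its value are made up to show what happens to a track with no popularity score. The name `model_df` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for the saved features and popularity tables
toy_features = pd.DataFrame({
    "id": ["1058fW9H3fZA6QjYCdOBad", "5ajjAnNRh8bxFvaVHzpPjh", "hypothetical_id"],
    "danceability": [0.666, 0.386, 0.5],
})
toy_popularity = pd.DataFrame({
    "id": ["1058fW9H3fZA6QjYCdOBad", "5ajjAnNRh8bxFvaVHzpPjh"],
    "popularity": [75, 79],
})

# Inner join keeps only tracks present in both tables;
# validate raises an error if an id appears more than once on either side
model_df = toy_features.merge(
    toy_popularity, on="id", how="inner", validate="one_to_one"
)
print(model_df)
```

An inner join drops tracks that are missing from either table, which may or may not be what you want; `how="left"` would keep them with a missing popularity instead.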

And that's it! I hope this was helpful.