# Collecting Information and Audio Features into a Dataframe

## Importing Libraries

First we import the libraries needed to use the Spotify Web API. Fortunately, there is a Python wrapper called `spotipy` that lets us interact with the API at a high level.

We have two options for authorizing with the API: the Authorization Code flow and the Client Credentials flow. The Authorization Code flow is used when we need a Spotify user's personal information, such as their playlists and liked songs. The Client Credentials flow is used to pull general information about Spotify's tracks, playlists, and so on. We will use the _Client Credentials_ flow.

In [2]:
import os
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from dotenv import load_dotenv, dotenv_values

The `dotenv` library loads environment variables from a `.env` file.

We have created a `.env` file on our local machine that holds our API credentials: `CLIENT_ID`, `CLIENT_SECRET` and `REDIRECT_URL`. You can create a similar file in the same directory as this notebook, using the template below, and save it as `.env`. The Client Credentials flow used here only needs the first two values.

```
CLIENT_ID="your_client_id_here"
CLIENT_SECRET="your_client_secret_here"
REDIRECT_URL="your_redirect_url_here"
```

## Loading environment variables and storing them

In [3]:
load_dotenv()

True

In [4]:
client_id = os.getenv("CLIENT_ID")
client_secret = os.getenv("CLIENT_SECRET")
redirect_url = os.getenv("REDIRECT_URL")
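Because `os.getenv` returns `None` for a variable that is not set, a missing entry in `.env` only shows up later as an authentication error. A minimal sketch of an earlier check follows. The helper name `require_env` is hypothetical (it is not part of `dotenv` or `spotipy`), and a plain dictionary stands in for the real environment:

```python
import os


def require_env(names, env=os.environ):
    """Return the requested variables, raising if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Demonstration with a plain dict standing in for the real environment.
demo_env = {"CLIENT_ID": "abc", "CLIENT_SECRET": "xyz"}
creds = require_env(["CLIENT_ID", "CLIENT_SECRET"], env=demo_env)
print(sorted(creds))
```

In the notebook, the same helper could be called with the default `env` right after `load_dotenv()`.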

## Creating an API session

In [5]:
auth_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(auth_manager=auth_manager)

## Extracting information from the API

- First, we build a dataframe to train our model on. We do this by taking curated Spotify playlists that are built around a mood and labeling their tracks with our own mood labels.

- We will use 4 mood labels: `Happy`, `Sad`, `Energetic` and `Calm`.

- Second, we pull songs from a top-hits playlist to apply our model to. The goal of the project is thus to classify the mood of what the general public likes.

### Creating our DataFrame for building our model

In [6]:
# We are using playlists from two categories.
# The IDs below were taken from the URIs of mood-based playlists.
happy_playlists = [
    '37i9dQZF1DX9XIFQuFvzM4', '37i9dQZF1DX8Dc28snyWrn', '37i9dQZF1DWYBO1MoTDhZI', '37i9dQZF1DX1BzILRveYHb']
sad_playlists = [
    '37i9dQZF1DWSqBruwoIXkA', '37i9dQZF1DX9LT7r8qPxfa', '37i9dQZF1DWZrBs4FjpxlE', '37i9dQZF1DWW2hj3ZtMbuO']
calm_playlists = [
    '37i9dQZF1DX889U0CL85jj', '37i9dQZF1DX6tTW0xDxScH', '37i9dQZF1DWUJrRlgpYslH', '37i9dQZF1DX17TxDoLeXxl']
energetic_playlists = [
    '37i9dQZF1DXa2PvUpywmrr', '37i9dQZF1DXaXB8fQg7xif', '37i9dQZF1DXa90jZU6E5GN', '37i9dQZF1DX0u6E5bcKUrV']

Because the Spotify Web API limits the number of requests, it is better to first collect all the track IDs in a single list.
We can then partition it into sublists of 100 track IDs each (audio features can be fetched for at most 100 IDs at a time), so that a single request covers multiple tracks.

In [7]:
# Extract the tracks of one playlist and return three parallel lists:
# track IDs, track names and artist lists.
def get_playlist_tracks(playlist_id):
    track_ids = []
    track_names = []
    track_artists = []
    # Only the 'items' of the first response page are used here.
    tracks = sp.playlist_tracks(playlist_id)['items']
    for track in tracks:
        track_ids.append(track['track']['id'])
        track_names.append(track['track']['name'])
        track_artists.append(track['track']['artists'])
    return track_ids, track_names, track_artists

# Extract the tracks of every playlist in a list of playlist IDs
def extract_track_ids(playlist_lists):
    all_track_ids = []
    all_track_names = []
    all_track_artists = []
    for playlist in playlist_lists:
        track_ids, track_names, track_artists = get_playlist_tracks(playlist)
        all_track_ids.extend(track_ids)
        all_track_names.extend(track_names)
        all_track_artists.extend(track_artists)

    return all_track_ids, all_track_names, all_track_artists

# Split a list into consecutive sublists of at most `size` elements
def partition(lst, size):
    split = [lst[i:i+size] for i in range(0, len(lst), size)]
    return split
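`get_playlist_tracks` reads only the `items` of the first response, so a playlist longer than one page would be cut short. A minimal sketch of following the paging links is shown below. It assumes a client that exposes `playlist_items` and `next`, as `spotipy.Spotify` does. The `FakeClient` class is a hypothetical stand-in so the example runs without credentials, and it uses an integer offset in the `next` field where the real API returns a URL:

```python
def collect_all_items(client, playlist_id):
    """Follow the paging links until every playlist item has been collected."""
    page = client.playlist_items(playlist_id)
    items = list(page["items"])
    while page.get("next"):
        page = client.next(page)
        items.extend(page["items"])
    return items


class FakeClient:
    """Hypothetical stand-in that serves 250 items in pages of 100."""

    def __init__(self, total=250, page_size=100):
        self.total = total
        self.page_size = page_size

    def _page(self, offset):
        end = min(offset + self.page_size, self.total)
        return {
            "items": [{"track": {"id": f"id{i}"}} for i in range(offset, end)],
            "next": end if end < self.total else None,
        }

    def playlist_items(self, playlist_id):
        return self._page(0)

    def next(self, page):
        return self._page(page["next"])


all_items = collect_all_items(FakeClient(), "any_playlist_id")
print(len(all_items))
```

With the real client, `collect_all_items(sp, playlist_id)` would be called instead. Note that this would change the track counts reported in this notebook if any of the playlists has more items than fit on one page.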

   


In [8]:
happy_track_ids, happy_track_names, happy_track_artists = extract_track_ids(happy_playlists)
sad_track_ids, sad_track_names, sad_track_artists = extract_track_ids(sad_playlists)
energetic_track_ids, energetic_track_names, energetic_track_artists = extract_track_ids(energetic_playlists)
calm_track_ids, calm_track_names, calm_track_artists = extract_track_ids(calm_playlists)

In [10]:
# Since we can only collect audio features for 100 tracks at a time, we check the length of the lists that we have
print("happy:", len(happy_track_names))
print('sad:', len(sad_track_names))
print("calm:", len(calm_track_names))
print('energetic:', len(energetic_track_names))

happy: 305
sad: 325
calm: 400
energetic: 390
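To see what batching buys us, the counts above can be run through the same partitioning logic. The sketch below redefines `partition` so it is self-contained and uses placeholder integers instead of real track IDs:

```python
import math


def partition(lst, size):
    """Split a list into consecutive sublists of at most `size` elements."""
    return [lst[i:i + size] for i in range(0, len(lst), size)]


counts = {"happy": 305, "sad": 325, "calm": 400, "energetic": 390}

# Batch sizes per mood, using placeholder integers in place of track IDs.
batch_sizes = {
    mood: [len(batch) for batch in partition(list(range(n)), 100)]
    for mood, n in counts.items()
}
requests_needed = {mood: math.ceil(n / 100) for mood, n in counts.items()}

print(batch_sizes["happy"])
print(sum(requests_needed.values()))
```

Each mood needs 4 requests, so 16 requests cover all 1420 tracks instead of one request per track.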


In [12]:
def get_features(track_ids):
    # Fetch audio features in batches of at most 100 track IDs per request
    part = partition(track_ids, 100)
    audio_features = []
    for p in part:
        audio_features.extend(sp.audio_features(p))
    return audio_features

In [13]:
happy_features = get_features(happy_track_ids)
sad_features = get_features(sad_track_ids)
calm_features = get_features(calm_track_ids)
energetic_features = get_features(energetic_track_ids)
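One thing to watch for: as far as we know, the audio-features endpoint can return an empty (`None`) entry for a track it has no features for, which would break the dataframe construction later. The sketch below drops such tracks while keeping the parallel lists aligned. The helper name `drop_missing_features` and the sample values are hypothetical, and it is not clear this was needed for the playlists used here:

```python
def drop_missing_features(track_ids, track_names, features):
    """Keep only tracks whose feature entry is not None, preserving order."""
    kept = [
        (tid, name, feat)
        for tid, name, feat in zip(track_ids, track_names, features)
        if feat is not None
    ]
    if not kept:
        return [], [], []
    ids, names, feats = zip(*kept)
    return list(ids), list(names), list(feats)


# Made-up sample data: the second track has no features.
sample_ids = ["a1", "b2", "c3"]
sample_names = ["Song A", "Song B", "Song C"]
sample_features = [{"energy": 0.6}, None, {"energy": 0.3}]

clean_ids, clean_names, clean_features = drop_missing_features(
    sample_ids, sample_names, sample_features
)
print(clean_ids)
```

If tracks are dropped this way, the artist and mood lists would need the same filtering so that all columns stay the same length.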

_We also need to parse the artists so that we only include the first artist._

In [16]:
total_artists = (happy_track_artists + sad_track_artists + calm_track_artists + energetic_track_artists)

In [17]:
# Keep only the name of the first artist listed for each track
first_artist = [artists[0]['name'] for artists in total_artists]
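To make the shape of the data concrete: each entry in `total_artists` is a list of artist dictionaries, one per credited artist. The sample below is made up and keeps only the `name` field, whereas real artist objects carry more fields:

```python
# Made-up sample: one track with two credited artists, one with a single artist.
sample_artists = [
    [{"name": "Marvin Gaye"}, {"name": "Tammi Terrell"}],
    [{"name": "Bill Withers"}],
]

# Take the first artist's name for every track.
first_names = [artists[0]["name"] for artists in sample_artists]
print(first_names)
```

This indexing assumes every track has at least one artist. An empty artist list would raise an `IndexError`.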

In [21]:
#combining everything into singular lists
track_name = happy_track_names + sad_track_names + calm_track_names + energetic_track_names
track_id = happy_track_ids + sad_track_ids + calm_track_ids + energetic_track_ids
audio_features = happy_features + sad_features + calm_features + energetic_features

In [19]:
# Label the data. The order must match the order in which the lists were concatenated (happy, sad, calm, energetic).
mood = ['happy']*len(happy_track_ids) + ['sad']*len(sad_track_ids) + ['calm']*len(calm_track_ids) + ['energetic']*len(energetic_track_ids)
mood

['happy',
 'happy',
 'happy',
 ...
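Since the labels are only correct if they follow the same order as the concatenated track lists, one way to reduce the risk of a mismatch is to build both from a single mapping. This is a sketch with made-up IDs, not the approach used in this notebook:

```python
# Made-up IDs grouped by mood; dicts preserve insertion order in Python 3.7+.
ids_by_mood = {
    "happy": ["h1", "h2", "h3"],
    "sad": ["s1", "s2"],
    "calm": ["c1"],
    "energetic": ["e1", "e2"],
}

track_id_demo = []
mood_demo = []
for label, ids in ids_by_mood.items():
    # Extend both lists in the same loop so they cannot drift apart.
    track_id_demo.extend(ids)
    mood_demo.extend([label] * len(ids))

print(list(zip(track_id_demo, mood_demo))[:4])
```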


In [23]:
#checking if all lengths are same
print("No of artists: ", len(first_artist))
print("No. of track names: ", len(track_name))
print("No of track ids: ", len(track_id))
print("No. of tracks with audio features: ", len(audio_features))
print("No. of mood labels: ", len(mood))


No of artists:  1420
No. of track names:  1420
No of track ids:  1420
No. of tracks with audio features:  1420
No. of mood labels:  1420
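The same check can be made to fail loudly instead of relying on a visual comparison of the printed numbers. A minimal sketch, with a hypothetical helper named `assert_same_length` and made-up lists:

```python
def assert_same_length(**columns):
    """Raise if the named lists differ in length, else return the shared length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()))


# Made-up lists standing in for the notebook's columns.
n_rows = assert_same_length(
    track_id=["a1", "b2"],
    track_name=["Song A", "Song B"],
    mood=["happy", "sad"],
)
print(n_rows)
```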


All the lists have the same length, so we can now collect them into a pandas DataFrame.

## Collecting extracted information in a Pandas dataframe

In [24]:
import pandas as pd
import numpy as np

df = pd.DataFrame(audio_features)
df.head()


Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.663,0.6,7,-10.87,1,0.032,0.43,0.0,0.184,0.8,129.991,audio_features,7tqhbajSfrz2F7E1Z75ASX,spotify:track:7tqhbajSfrz2F7E1Z75ASX,https://api.spotify.com/v1/tracks/7tqhbajSfrz2...,https://api.spotify.com/v1/audio-analysis/7tqh...,151667,4
1,0.527,0.415,4,-11.451,0,0.122,0.457,1.7e-05,0.117,0.515,78.169,audio_features,1k1Bqnv2R0uJXQN4u6LKYt,spotify:track:1k1Bqnv2R0uJXQN4u6LKYt,https://api.spotify.com/v1/tracks/1k1Bqnv2R0uJ...,https://api.spotify.com/v1/audio-analysis/1k1B...,125093,4
2,0.572,0.418,0,-10.738,1,0.0349,0.635,0.0,0.0961,0.694,104.566,audio_features,745H5CctFr12Mo7cqa1BMH,spotify:track:745H5CctFr12Mo7cqa1BMH,https://api.spotify.com/v1/tracks/745H5CctFr12...,https://api.spotify.com/v1/audio-analysis/745H...,165000,4
3,0.769,0.367,2,-11.226,1,0.0312,0.684,1.6e-05,0.081,0.535,103.621,audio_features,3zBhihYUHBmGd2bcQIobrF,spotify:track:3zBhihYUHBmGd2bcQIobrF,https://api.spotify.com/v1/tracks/3zBhihYUHBmG...,https://api.spotify.com/v1/audio-analysis/3zBh...,163756,4
4,0.65,0.306,9,-9.443,1,0.0393,0.57,7e-06,0.0707,0.605,118.068,audio_features,3SdTKo2uVsxFblQjpScoHy,spotify:track:3SdTKo2uVsxFblQjpScoHy,https://api.spotify.com/v1/tracks/3SdTKo2uVsxF...,https://api.spotify.com/v1/audio-analysis/3SdT...,180056,4


In [25]:
#adding other columns
df.insert(loc=0, column='track_id', value=track_id)
df.insert(loc=1, column='track_name', value=track_name)
df.insert(loc=2, column='first_artist', value=first_artist)
df['mood'] = mood
df.head()

Unnamed: 0,track_id,track_name,first_artist,danceability,energy,key,loudness,mode,speechiness,acousticness,...,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,mood
0,7tqhbajSfrz2F7E1Z75ASX,Ain't No Mountain High Enough,Marvin Gaye,0.663,0.6,7,-10.87,1,0.032,0.43,...,0.8,129.991,audio_features,7tqhbajSfrz2F7E1Z75ASX,spotify:track:7tqhbajSfrz2F7E1Z75ASX,https://api.spotify.com/v1/tracks/7tqhbajSfrz2...,https://api.spotify.com/v1/audio-analysis/7tqh...,151667,4,happy
1,1k1Bqnv2R0uJXQN4u6LKYt,Ain't No Sunshine,Bill Withers,0.527,0.415,4,-11.451,0,0.122,0.457,...,0.515,78.169,audio_features,1k1Bqnv2R0uJXQN4u6LKYt,spotify:track:1k1Bqnv2R0uJXQN4u6LKYt,https://api.spotify.com/v1/tracks/1k1Bqnv2R0uJ...,https://api.spotify.com/v1/audio-analysis/1k1B...,125093,4,happy
2,745H5CctFr12Mo7cqa1BMH,My Girl,The Temptations,0.572,0.418,0,-10.738,1,0.0349,0.635,...,0.694,104.566,audio_features,745H5CctFr12Mo7cqa1BMH,spotify:track:745H5CctFr12Mo7cqa1BMH,https://api.spotify.com/v1/tracks/745H5CctFr12...,https://api.spotify.com/v1/audio-analysis/745H...,165000,4,happy
3,3zBhihYUHBmGd2bcQIobrF,(Sittin' On) the Dock of the Bay,Otis Redding,0.769,0.367,2,-11.226,1,0.0312,0.684,...,0.535,103.621,audio_features,3zBhihYUHBmGd2bcQIobrF,spotify:track:3zBhihYUHBmGd2bcQIobrF,https://api.spotify.com/v1/tracks/3zBhihYUHBmG...,https://api.spotify.com/v1/audio-analysis/3zBh...,163756,4,happy
4,3SdTKo2uVsxFblQjpScoHy,Stand by Me,Ben E. King,0.65,0.306,9,-9.443,1,0.0393,0.57,...,0.605,118.068,audio_features,3SdTKo2uVsxFblQjpScoHy,spotify:track:3SdTKo2uVsxFblQjpScoHy,https://api.spotify.com/v1/tracks/3SdTKo2uVsxF...,https://api.spotify.com/v1/audio-analysis/3SdT...,180056,4,happy


### Dropping irrelevant columns

In [26]:
df.columns

Index(['track_id', 'track_name', 'first_artist', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri',
       'track_href', 'analysis_url', 'duration_ms', 'time_signature', 'mood'],
      dtype='object')

In [27]:
df.drop(['id','track_href','analysis_url'],axis=1,inplace=True)
df.head()

Unnamed: 0,track_id,track_name,first_artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,uri,duration_ms,time_signature,mood
0,7tqhbajSfrz2F7E1Z75ASX,Ain't No Mountain High Enough,Marvin Gaye,0.663,0.6,7,-10.87,1,0.032,0.43,0.0,0.184,0.8,129.991,audio_features,spotify:track:7tqhbajSfrz2F7E1Z75ASX,151667,4,happy
1,1k1Bqnv2R0uJXQN4u6LKYt,Ain't No Sunshine,Bill Withers,0.527,0.415,4,-11.451,0,0.122,0.457,1.7e-05,0.117,0.515,78.169,audio_features,spotify:track:1k1Bqnv2R0uJXQN4u6LKYt,125093,4,happy
2,745H5CctFr12Mo7cqa1BMH,My Girl,The Temptations,0.572,0.418,0,-10.738,1,0.0349,0.635,0.0,0.0961,0.694,104.566,audio_features,spotify:track:745H5CctFr12Mo7cqa1BMH,165000,4,happy
3,3zBhihYUHBmGd2bcQIobrF,(Sittin' On) the Dock of the Bay,Otis Redding,0.769,0.367,2,-11.226,1,0.0312,0.684,1.6e-05,0.081,0.535,103.621,audio_features,spotify:track:3zBhihYUHBmGd2bcQIobrF,163756,4,happy
4,3SdTKo2uVsxFblQjpScoHy,Stand by Me,Ben E. King,0.65,0.306,9,-9.443,1,0.0393,0.57,7e-06,0.0707,0.605,118.068,audio_features,spotify:track:3SdTKo2uVsxFblQjpScoHy,180056,4,happy


In [28]:
df.drop(['type'],inplace=True,axis=1)
df.head()

Unnamed: 0,track_id,track_name,first_artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,uri,duration_ms,time_signature,mood
0,7tqhbajSfrz2F7E1Z75ASX,Ain't No Mountain High Enough,Marvin Gaye,0.663,0.6,7,-10.87,1,0.032,0.43,0.0,0.184,0.8,129.991,spotify:track:7tqhbajSfrz2F7E1Z75ASX,151667,4,happy
1,1k1Bqnv2R0uJXQN4u6LKYt,Ain't No Sunshine,Bill Withers,0.527,0.415,4,-11.451,0,0.122,0.457,1.7e-05,0.117,0.515,78.169,spotify:track:1k1Bqnv2R0uJXQN4u6LKYt,125093,4,happy
2,745H5CctFr12Mo7cqa1BMH,My Girl,The Temptations,0.572,0.418,0,-10.738,1,0.0349,0.635,0.0,0.0961,0.694,104.566,spotify:track:745H5CctFr12Mo7cqa1BMH,165000,4,happy
3,3zBhihYUHBmGd2bcQIobrF,(Sittin' On) the Dock of the Bay,Otis Redding,0.769,0.367,2,-11.226,1,0.0312,0.684,1.6e-05,0.081,0.535,103.621,spotify:track:3zBhihYUHBmGd2bcQIobrF,163756,4,happy
4,3SdTKo2uVsxFblQjpScoHy,Stand by Me,Ben E. King,0.65,0.306,9,-9.443,1,0.0393,0.57,7e-06,0.0707,0.605,118.068,spotify:track:3SdTKo2uVsxFblQjpScoHy,180056,4,happy


## Packaging it into a CSV file

In [29]:
df.to_csv('spotifytrackinfo.csv')
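By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference on a small made-up frame, writing to in-memory buffers instead of files:

```python
import io

import pandas as pd

demo = pd.DataFrame({"track_id": ["a1", "b2"], "mood": ["happy", "sad"]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Either passing `index=False` when writing or `index_col=0` when reading avoids the extra column. Which one fits depends on how the file is loaded later.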

### Following the same steps, we create a top-list dataset to apply our model to

Here we are looking at the Billboard Year-End Hot 100 chart. There is a Spotify playlist built around this chart, which we will use.

In [35]:
ids, names, artists = get_playlist_tracks('7CxLCjNsxPdZbSmc5gbYJN')
# Keep only the name of the first artist listed for each track
first_artist_ye = [track_artists[0]['name'] for track_artists in artists]
audio_features_fe = sp.audio_features(ids)


In [37]:
year_end_df = pd.DataFrame(audio_features_fe)
year_end_df.insert(loc=0, column='track_id', value=ids)
year_end_df.insert(loc=1, column='track_name', value=names)
year_end_df.insert(loc=2, column='first_artist', value=first_artist_ye)
year_end_df.head()


Unnamed: 0,track_id,track_name,first_artist,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,7K3BhSpAxZBznislvUMVtn,Last Night,Morgan Wallen,0.492,0.675,6,-5.456,1,0.0389,0.467,...,0.142,0.478,203.759,audio_features,7K3BhSpAxZBznislvUMVtn,spotify:track:7K3BhSpAxZBznislvUMVtn,https://api.spotify.com/v1/tracks/7K3BhSpAxZBz...,https://api.spotify.com/v1/audio-analysis/7K3B...,163855,4
1,0yLdNVWF3Srea0uzk55zFn,Flowers,Miley Cyrus,0.707,0.681,0,-4.325,1,0.0668,0.0632,...,0.0322,0.646,117.999,audio_features,0yLdNVWF3Srea0uzk55zFn,spotify:track:0yLdNVWF3Srea0uzk55zFn,https://api.spotify.com/v1/tracks/0yLdNVWF3Sre...,https://api.spotify.com/v1/audio-analysis/0yLd...,200455,4
2,3OHfY25tqY28d16oZczHc8,Kill Bill,SZA,0.644,0.728,8,-5.75,1,0.0351,0.0543,...,0.161,0.43,88.993,audio_features,3OHfY25tqY28d16oZczHc8,spotify:track:3OHfY25tqY28d16oZczHc8,https://api.spotify.com/v1/tracks/3OHfY25tqY28...,https://api.spotify.com/v1/audio-analysis/3OHf...,153947,4
3,0V3wPSX9ygBnCm8psDIegu,Anti-Hero,Taylor Swift,0.637,0.643,4,-6.571,1,0.0519,0.13,...,0.142,0.533,97.008,audio_features,0V3wPSX9ygBnCm8psDIegu,spotify:track:0V3wPSX9ygBnCm8psDIegu,https://api.spotify.com/v1/tracks/0V3wPSX9ygBn...,https://api.spotify.com/v1/audio-analysis/0V3w...,200690,4
4,2dHHgzDwk4BJdRwy9uXhTO,Creepin' (with The Weeknd & 21 Savage),Metro Boomin,0.715,0.62,1,-6.005,0,0.0484,0.417,...,0.0822,0.172,97.95,audio_features,2dHHgzDwk4BJdRwy9uXhTO,spotify:track:2dHHgzDwk4BJdRwy9uXhTO,https://api.spotify.com/v1/tracks/2dHHgzDwk4BJ...,https://api.spotify.com/v1/audio-analysis/2dHH...,221520,4


In [38]:
year_end_df.to_csv('bbyearend.csv')