<a href="https://colab.research.google.com/github/Zhiyang123/SC1015_Hit_Song_Predictor/blob/main/Audio_Features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preparation for Project (Spotify API) (NEW)

Since our group intends to compare the audio features of Spotify songs with their popularity, we first had to build our dataset with the help of the Spotify API. This Python notebook walks you through how we obtained and cleaned our data.

In [None]:
#importing of essential libraries
import pandas as pd
import numpy as np 
import seaborn as sb 
import matplotlib.pyplot as plt
sb.set()

Thankfully, there is Spotipy, a lightweight Python library for the Spotify Web API. With Spotipy, we get full access to all of the music data provided by the Spotify platform.

In [None]:
#importing the Spotipy python library and API key
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

#individual API key to gain access to Spotify data
cid = '6b2418c9674f4f7c9f5e2809aa2b3678'
secret = '781801fad3b646c6b7844583a170957c'
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
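
Keys written directly into a shared notebook are visible to anyone who opens it. As a minimal sketch of an alternative (not what we did above), the keys could be read from environment variables. `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the variable names Spotipy looks up by default; the helper `load_spotify_credentials` and the stand-in values are hypothetical.

```python
import os

def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables."""
    # Hypothetical helper: falls back to the real environment when no mapping is given
    env = os.environ if env is None else env
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Stand-in mapping so the sketch runs without real keys
cid, secret = load_spotify_credentials(
    {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
)
```

The returned pair can then be passed to `SpotifyClientCredentials(client_id=cid, client_secret=secret)` exactly as in the cell above.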

First, we extracted the following data for each track:

1. Artist Name
2. Track Name
3. Track ID
4. Popularity

While extracting the data, we found that the API only allowed us to retrieve 50 songs per request, and a total of 1000 songs for each year.

Since we needed a sufficiently large dataset for any meaningful machine learning (n ≈ 20000), we used two nested for loops to work around this limit. The outer loop iterates through the 20 years from 2000 to 2019. <br>
The inner loop iterates through the 1000 songs that can be extracted for each year, 50 at a time.
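
The paging arithmetic can be sketched on its own, without calling the API. This is only an illustration: `search_pages` is a hypothetical helper, and the 20 years starting at 2000 are assumed from the loop bounds used in the extraction cell.

```python
def search_pages(first_year, n_years, page_size=50, per_year_cap=1000):
    """Yield (query, offset) pairs covering every page of every year."""
    for year in range(first_year, first_year + n_years):
        for offset in range(0, per_year_cap, page_size):
            yield f"year:{year}", offset

pages = list(search_pages(2000, 20))
print(len(pages))            # 400 requests: 20 years x 20 pages
print(pages[0], pages[-1])   # ('year:2000', 0) ('year:2019', 950)
```

With 50 songs per request, 400 requests give the 20000 songs we were aiming for.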

In [None]:
#Initialising of list to hold necessary data
artist_name = []
track_name = []
popularity = []
track_id = []

#Creating a list of the years 2000 to 2019
years = [str(year) for year in range(2000, 2020)]


#Using API to extract 20000 songs from Spotify (20 years x 20 pages x 50 songs)
for year in years:
    for page in range(20):
        track_results = sp.search(q='year:{}'.format(year), type='track', limit=50, offset=page*50)
        for t in track_results['tracks']['items']:
            artist_name.append(t['artists'][0]['name'])
            track_name.append(t['name'])
            track_id.append(t['id'])
            popularity.append(t['popularity'])

Since the extraction of songs was effectively random, there was a possibility of duplicates within our data. Our group initially decided to remove duplicates based on songs with the same name.

In [None]:
#Converting list to Dataframe to check unique song names
df = pd.DataFrame(track_name, columns = ['Track name'])
len(df['Track name'].unique())

17617

However, it became apparent to us that there were multiple songs with the same name that were not by the same artist. Hence, we decided to filter out duplicates using the unique track ID that each track is labelled with.

In [None]:
#Creating Dataframe to hold all data that we extracted using Spotify API
df = pd.DataFrame({'track_id': track_id, 'track_name': track_name, 'artist_name': artist_name, 'popularity': popularity})

#Comparing the unique number of track IDs and track names
print(len(df['track_id'].unique()))
print(len(df['track_name'].unique()))

19419
17617


In [None]:
df_no_duplicate = df.drop_duplicates(subset=['track_id'])
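
The difference between the two approaches can be seen on a toy example. The four rows below are invented for illustration and are not taken from our data.

```python
import pandas as pd

# Invented rows: two different songs called "Hello", one of them downloaded twice
toy = pd.DataFrame({
    "track_id": ["a1", "b2", "a1", "c3"],
    "track_name": ["Hello", "Hello", "Hello", "Sparks"],
    "artist_name": ["Artist A", "Artist B", "Artist A", "Artist C"],
})

by_name = toy.drop_duplicates(subset=["track_name"])
by_id = toy.drop_duplicates(subset=["track_id"])

print(len(by_name))  # 2 rows: Artist B's "Hello" is dropped as well
print(len(by_id))    # 3 rows: only the repeated download is dropped
```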

After removing all the duplicate track IDs, we used the remaining track IDs to extract the corresponding audio features of the songs.

Once again, the Spotify API imposed a limit: it only allowed us to extract the audio features of 100 songs at a time. Hence, we had to use a for loop to iterate through our data in batches. Pardon the little bit of hard coding we had to do.

In [None]:
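#Creating a list with just the track IDs that remain after removing duplicates
ls = list(df_no_duplicate['track_id'])

#Creating list to hold audio features data
audio_features = []

#Requesting audio features for the first 19400 track IDs, 100 at a time
for ids in range(0, 19400, 100):
    audio_features.append(sp.audio_features(ls[ids:ids+100]))

#Requesting audio features for the remaining 19 track IDs
audio_features.append(sp.audio_features(ls[19400:19419]))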
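
The batching could also be written without fixed numbers, so that it works for any number of track IDs. A minimal sketch, where `chunks` is a hypothetical helper and the IDs are stand-ins for the real ones:

```python
def chunks(items, size=100):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[start:start + size] for start in range(0, len(items), size)]

ids = [f"id{n}" for n in range(19419)]   # stand-in for the real track IDs
batches = chunks(ids)
print(len(batches), len(batches[-1]))    # 195 19
```

Each element of `batches` could then be passed to `sp.audio_features` in turn, giving the same 194 full batches plus one final batch of 19.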

Since the audio features were returned as a nested list (one inner list per batch), we created an individual list for each audio feature and filled it up by going through the main list.

We soon discovered that for a handful of songs, the audio_features function returned None even though no error was raised. We decided to assign an extreme placeholder value to the audio features of these songs, so that they would be easy to identify and remove once the audio features are combined with the main dataset. We settled on -10000, since no audio feature takes a value anywhere near it.
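
For comparison, a placeholder is not the only option. The sketch below is an alternative and not what we did: it pairs each track ID with its result and skips the missing ones right away. The IDs and values are invented for illustration.

```python
# Invented example: the second song has no audio features
track_ids = ["a1", "b2", "c3"]
fetched = [{"danceability": 0.9}, None, {"danceability": 0.4}]

# Keep each ID together with its features, dropping songs that returned None
kept = [(tid, feats) for tid, feats in zip(track_ids, fetched) if feats is not None]

print([tid for tid, _ in kept])  # ['a1', 'c3']
```

Because each ID travels with its own features, the remaining rows stay aligned without a later clean-up step.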

In [None]:
features = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
            'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']

#Placeholder value assigned to every audio feature of a song for which the API returned None
MISSING = -10000

#One list per audio feature, keyed by feature name
columns = {name: [] for name in features}

#Going through every batch and every song within it, in the original order
for batch in audio_features:
    for track in batch:
        for name in features:
            columns[name].append(track[name] if track is not None else MISSING)

#Naming the individual lists for the next step
danceability = columns['danceability']
energy = columns['energy']
key = columns['key']
loudness = columns['loudness']
mode = columns['mode']
speechiness = columns['speechiness']
acousticness = columns['acousticness']
instrumentalness = columns['instrumentalness']
liveness = columns['liveness']
valence = columns['valence']
tempo = columns['tempo']
duration = columns['duration_ms']
time_signature = columns['time_signature']

In [None]:
#Creating of dataframe consisting of all audio features
audio_features = pd.DataFrame({'danceability': danceability, 'energy': energy, 
                               'key': key, 'loudness': loudness,
                              'mode': mode, 'speechiness': speechiness,
                              'acousticness': acousticness, 'instrumentalness': instrumentalness,
                              'liveness': liveness, 'valence': valence,
                              'tempo': tempo, 'duration': duration,
                              'time_signature': time_signature})

In [None]:
#Audio feature dataframe
audio_features

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration,time_signature
0,0.949,0.661,5,-4.244,0,0.0572,0.03020,0.000000,0.0454,0.760,104.504,284200,4
1,0.325,0.768,2,-7.510,1,0.0491,0.04260,0.000004,0.2700,0.454,176.600,247787,4
2,0.654,0.810,4,-6.260,0,0.0288,0.00719,0.002510,0.1650,0.661,114.623,200307,4
3,0.206,0.990,10,-3.565,1,0.1300,0.00176,0.000081,0.3350,0.666,187.433,144627,4
4,0.371,0.268,1,-10.506,1,0.0281,0.74800,0.051700,0.1040,0.165,102.617,227093,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
19414,0.625,0.554,10,-5.531,1,0.0618,0.41700,0.000000,0.3050,0.846,111.911,213254,4
19415,0.569,0.996,5,-0.450,1,0.2240,0.19800,0.011400,0.7070,0.461,100.020,201452,3
19416,0.917,0.657,8,-5.716,1,0.0921,0.31800,0.000004,0.0754,0.818,131.030,165733,4
19417,0.568,0.787,4,-6.765,1,0.0462,0.11600,0.003050,0.5220,0.695,108.015,308477,4


In [None]:
#resetting the index of main dataframe; please ignore the extra index column
data = df_no_duplicate.reset_index()
data

Unnamed: 0,index,track_id,track_name,artist_name,popularity
0,0,3yfqSUWxFvZELEM4PmlwIR,The Real Slim Shady,Eminem,91
1,1,3LMpZcOhaz2CUX5rfoCNRs,Anthem for the Year 2000,Silverchair,50
2,2,2MLHyLy5z5l5YRp7momlgw,Island In The Sun,Weezer,81
3,3,6pM25DLzJb5oWj74d3ElXI,2000 Light Years Away,Green Day,49
4,4,7D0RhFcb3CrfPuTJ0obrod,Sparks,Coldplay,86
...,...,...,...,...,...
19414,19995,2DR9a7bO5PKoVZil3iGsBv,2019 new year Halosui bag,cocone,0
19415,19996,3eeyKNIBrt5TH4y0y3QkOu,Hot Now,YoungBoy Never Broke Again,68
19416,19997,4hHj4CxhjyEGnG9aaIXRro,Happy New Year 2019 Special,Neha Raj,0
19417,19998,18Gcl5bxdd8bpBhWOj1rCR,HAHA,Lil Darkie,70


In [None]:
#Combining the audio features dataframe and the main dataframe
database = pd.concat([data, audio_features], axis = 1)
database

Unnamed: 0,index,track_id,track_name,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration,time_signature
0,0,3yfqSUWxFvZELEM4PmlwIR,The Real Slim Shady,Eminem,91,0.949,0.661,5,-4.244,0,0.0572,0.03020,0.000000,0.0454,0.760,104.504,284200,4
1,1,3LMpZcOhaz2CUX5rfoCNRs,Anthem for the Year 2000,Silverchair,50,0.325,0.768,2,-7.510,1,0.0491,0.04260,0.000004,0.2700,0.454,176.600,247787,4
2,2,2MLHyLy5z5l5YRp7momlgw,Island In The Sun,Weezer,81,0.654,0.810,4,-6.260,0,0.0288,0.00719,0.002510,0.1650,0.661,114.623,200307,4
3,3,6pM25DLzJb5oWj74d3ElXI,2000 Light Years Away,Green Day,49,0.206,0.990,10,-3.565,1,0.1300,0.00176,0.000081,0.3350,0.666,187.433,144627,4
4,4,7D0RhFcb3CrfPuTJ0obrod,Sparks,Coldplay,86,0.371,0.268,1,-10.506,1,0.0281,0.74800,0.051700,0.1040,0.165,102.617,227093,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19414,19995,2DR9a7bO5PKoVZil3iGsBv,2019 new year Halosui bag,cocone,0,0.625,0.554,10,-5.531,1,0.0618,0.41700,0.000000,0.3050,0.846,111.911,213254,4
19415,19996,3eeyKNIBrt5TH4y0y3QkOu,Hot Now,YoungBoy Never Broke Again,68,0.569,0.996,5,-0.450,1,0.2240,0.19800,0.011400,0.7070,0.461,100.020,201452,3
19416,19997,4hHj4CxhjyEGnG9aaIXRro,Happy New Year 2019 Special,Neha Raj,0,0.917,0.657,8,-5.716,1,0.0921,0.31800,0.000004,0.0754,0.818,131.030,165733,4
19417,19998,18Gcl5bxdd8bpBhWOj1rCR,HAHA,Lil Darkie,70,0.568,0.787,4,-6.765,1,0.0462,0.11600,0.003050,0.5220,0.695,108.015,308477,4


In [None]:
#Removal of rows which had no audio features; these rows contain the extreme value of -10000 that we assigned
features = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
            'instrumentalness', 'liveness', 'valence', 'tempo', 'duration', 'time_signature']

#Keeping only the rows in which no audio feature holds the placeholder value
test_database = database[(database[features] != -10000).all(axis=1)].copy()


test_database.reset_index(inplace=True)
test_database

Unnamed: 0,level_0,index,track_id,track_name,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration,time_signature
0,0,0,3yfqSUWxFvZELEM4PmlwIR,The Real Slim Shady,Eminem,91,0.949,0.661,5,-4.244,0,0.0572,0.03020,0.000000,0.0454,0.760,104.504,284200,4
1,1,1,3LMpZcOhaz2CUX5rfoCNRs,Anthem for the Year 2000,Silverchair,50,0.325,0.768,2,-7.510,1,0.0491,0.04260,0.000004,0.2700,0.454,176.600,247787,4
2,2,2,2MLHyLy5z5l5YRp7momlgw,Island In The Sun,Weezer,81,0.654,0.810,4,-6.260,0,0.0288,0.00719,0.002510,0.1650,0.661,114.623,200307,4
3,3,3,6pM25DLzJb5oWj74d3ElXI,2000 Light Years Away,Green Day,49,0.206,0.990,10,-3.565,1,0.1300,0.00176,0.000081,0.3350,0.666,187.433,144627,4
4,4,4,7D0RhFcb3CrfPuTJ0obrod,Sparks,Coldplay,86,0.371,0.268,1,-10.506,1,0.0281,0.74800,0.051700,0.1040,0.165,102.617,227093,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19411,19414,19995,2DR9a7bO5PKoVZil3iGsBv,2019 new year Halosui bag,cocone,0,0.625,0.554,10,-5.531,1,0.0618,0.41700,0.000000,0.3050,0.846,111.911,213254,4
19412,19415,19996,3eeyKNIBrt5TH4y0y3QkOu,Hot Now,YoungBoy Never Broke Again,68,0.569,0.996,5,-0.450,1,0.2240,0.19800,0.011400,0.7070,0.461,100.020,201452,3
19413,19416,19997,4hHj4CxhjyEGnG9aaIXRro,Happy New Year 2019 Special,Neha Raj,0,0.917,0.657,8,-5.716,1,0.0921,0.31800,0.000004,0.0754,0.818,131.030,165733,4
19414,19417,19998,18Gcl5bxdd8bpBhWOj1rCR,HAHA,Lil Darkie,70,0.568,0.787,4,-6.765,1,0.0462,0.11600,0.003050,0.5220,0.695,108.015,308477,4


In [None]:
#Removing of unnecessary columns
test_database.drop(['index', 'level_0'], axis=1, inplace = True)
test_database

Unnamed: 0,track_id,track_name,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration,time_signature
0,3yfqSUWxFvZELEM4PmlwIR,The Real Slim Shady,Eminem,91,0.949,0.661,5,-4.244,0,0.0572,0.03020,0.000000,0.0454,0.760,104.504,284200,4
1,3LMpZcOhaz2CUX5rfoCNRs,Anthem for the Year 2000,Silverchair,50,0.325,0.768,2,-7.510,1,0.0491,0.04260,0.000004,0.2700,0.454,176.600,247787,4
2,2MLHyLy5z5l5YRp7momlgw,Island In The Sun,Weezer,81,0.654,0.810,4,-6.260,0,0.0288,0.00719,0.002510,0.1650,0.661,114.623,200307,4
3,6pM25DLzJb5oWj74d3ElXI,2000 Light Years Away,Green Day,49,0.206,0.990,10,-3.565,1,0.1300,0.00176,0.000081,0.3350,0.666,187.433,144627,4
4,7D0RhFcb3CrfPuTJ0obrod,Sparks,Coldplay,86,0.371,0.268,1,-10.506,1,0.0281,0.74800,0.051700,0.1040,0.165,102.617,227093,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19411,2DR9a7bO5PKoVZil3iGsBv,2019 new year Halosui bag,cocone,0,0.625,0.554,10,-5.531,1,0.0618,0.41700,0.000000,0.3050,0.846,111.911,213254,4
19412,3eeyKNIBrt5TH4y0y3QkOu,Hot Now,YoungBoy Never Broke Again,68,0.569,0.996,5,-0.450,1,0.2240,0.19800,0.011400,0.7070,0.461,100.020,201452,3
19413,4hHj4CxhjyEGnG9aaIXRro,Happy New Year 2019 Special,Neha Raj,0,0.917,0.657,8,-5.716,1,0.0921,0.31800,0.000004,0.0754,0.818,131.030,165733,4
19414,18Gcl5bxdd8bpBhWOj1rCR,HAHA,Lil Darkie,70,0.568,0.787,4,-6.765,1,0.0462,0.11600,0.003050,0.5220,0.695,108.015,308477,4


In [None]:
#Converting our duration audio feature from milliseconds to minutes
test_database['duration'] = test_database['duration'].div(60000).round(2)
test_database

Unnamed: 0,track_id,track_name,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration,time_signature
0,3yfqSUWxFvZELEM4PmlwIR,The Real Slim Shady,Eminem,91,0.949,0.661,5,-4.244,0,0.0572,0.03020,0.000000,0.0454,0.760,104.504,4.74,4
1,3LMpZcOhaz2CUX5rfoCNRs,Anthem for the Year 2000,Silverchair,50,0.325,0.768,2,-7.510,1,0.0491,0.04260,0.000004,0.2700,0.454,176.600,4.13,4
2,2MLHyLy5z5l5YRp7momlgw,Island In The Sun,Weezer,81,0.654,0.810,4,-6.260,0,0.0288,0.00719,0.002510,0.1650,0.661,114.623,3.34,4
3,6pM25DLzJb5oWj74d3ElXI,2000 Light Years Away,Green Day,49,0.206,0.990,10,-3.565,1,0.1300,0.00176,0.000081,0.3350,0.666,187.433,2.41,4
4,7D0RhFcb3CrfPuTJ0obrod,Sparks,Coldplay,86,0.371,0.268,1,-10.506,1,0.0281,0.74800,0.051700,0.1040,0.165,102.617,3.78,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19411,2DR9a7bO5PKoVZil3iGsBv,2019 new year Halosui bag,cocone,0,0.625,0.554,10,-5.531,1,0.0618,0.41700,0.000000,0.3050,0.846,111.911,3.55,4
19412,3eeyKNIBrt5TH4y0y3QkOu,Hot Now,YoungBoy Never Broke Again,68,0.569,0.996,5,-0.450,1,0.2240,0.19800,0.011400,0.7070,0.461,100.020,3.36,3
19413,4hHj4CxhjyEGnG9aaIXRro,Happy New Year 2019 Special,Neha Raj,0,0.917,0.657,8,-5.716,1,0.0921,0.31800,0.000004,0.0754,0.818,131.030,2.76,4
19414,18Gcl5bxdd8bpBhWOj1rCR,HAHA,Lil Darkie,70,0.568,0.787,4,-6.765,1,0.0462,0.11600,0.003050,0.5220,0.695,108.015,5.14,4
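
Note that the converted duration is in decimal minutes, so 4.74 means about 4 minutes 44 seconds, not 4 minutes 74 seconds. A small sketch of the arithmetic, using the first row of the table (284200 ms); both helper names are hypothetical.

```python
def ms_to_minutes(duration_ms):
    """Convert milliseconds to decimal minutes, rounded to 2 places."""
    return round(duration_ms / 60000, 2)

def ms_to_clock(duration_ms):
    """Convert milliseconds to a minutes:seconds string."""
    minutes, seconds = divmod(duration_ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"

print(ms_to_minutes(284200))  # 4.74
print(ms_to_clock(284200))    # 4:44
```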


In [None]:
#Exporting our dataframe as CSV to be used for our project
test_database.to_csv('spotify.csv')