# Predicting Hit Songs

## Introduction

Major record labels are a powerful asset for an up-and-coming artist. Most artists on the Billboard Hot 100 chart are signed to one of the three major record labels: Universal Music Group, Sony Music Entertainment, and Warner Music Group. The average listener knows the song and the artist, but the production of the music goes much deeper. Record labels work behind the scenes to produce, market, and optimize songs in order to generate profit from their content.

Many factors go into creating a hit song. The right artist has to be matched with the song, the recording must be excellent, the production team must promote the song effectively, and the release date also plays an important role.

For this project, I will build a machine learning model that classifies whether or not a song will land on the Billboard Hot 100.

In [1]:
# Uncomment to install the required packages
#!pip install billboard.py
#!pip install spotipy

In [1]:
import pandas as pd
import requests
import numpy as np
import matplotlib.pyplot as plt
import json
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials

## Importing Spotify Data

In [2]:
# auth_url = 'https://accounts.spotify.com/api/token'

client_id = '09bdcc4a9eb04b11ac939e95f0ab8325'
client_secret = 'ff9ccbe0edd44f0980eca844049f92cd'

client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
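
Credentials like the client ID and secret above are usually kept out of the notebook itself. The following is a minimal sketch, assuming they are stored in environment variables named `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` (the names spotipy looks for by default when no arguments are passed). The helper `load_spotify_credentials` is hypothetical and not part of spotipy.

```python
import os


def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to try out.
    """
    env = os.environ if env is None else env
    names = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")

    # Fail early with a readable message if a variable is missing or empty
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]


# Example with a stand-in mapping instead of the real environment
client_id, client_secret = load_spotify_credentials(
    {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
)
```

The returned pair could then be passed to `SpotifyClientCredentials` in place of the literal strings.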

In [3]:
# Quick check: fetch one playlist and read the popularity of its second track
playlist = sp.playlist_tracks('37i9dQZF1DXe2bobNYDtW8')
playlist['items'][1]['track']['popularity']

84

The following function takes a playlist URI and a number of iterations and returns a dataframe of track metadata and audio features. Each iteration fetches at most 100 songs.

In [4]:
def get_attributes(uri, iterations):
    """Return a dataframe of metadata and audio features for a playlist.

    Reads `iterations` pages of up to 100 tracks each from the playlist `uri`.
    """
    # Metadata columns come first (positions 0-5), followed by the audio features
    playlist_features_list = ["artist", "album", "release_date", "track_name", "track_id", "popularity", 
                             "danceability", "energy", "key", "loudness", "mode", "speechiness",
                             "instrumentalness", "liveness", "valence", "tempo", "duration_ms", "time_signature"]
    # Create an empty dataframe that already has every column
    playlist_df = pd.DataFrame(columns = playlist_features_list)

    offset = 0
    for iteration in range(iterations):
        # Fetch the next page of up to 100 playlist entries
        playlist = sp.playlist_tracks(playlist_id=uri, offset = offset)["items"]
        for track in playlist:
            # Skip entries whose track is no longer available
            if track["track"] is None:
                continue

            # Get metadata for each track (the artist is the album's first listed artist)
            d = {}
            d["artist"] = track["track"]["album"]["artists"][0]["name"]
            d["album"] = track["track"]["album"]["name"]
            d["release_date"] = track["track"]["album"]["release_date"]
            d["track_name"] = track["track"]["name"]
            d["track_id"] = track["track"]["id"]
            d["popularity"] = track["track"]["popularity"]

            # Get audio features; skip the track if none are returned
            audio_features = sp.audio_features(d["track_id"])[0]
            if audio_features is None:
                continue
            for feature in playlist_features_list[6:]:
                d[feature] = audio_features[feature]

            # Append this track to the playlist dataframe as a new row
            track_df = pd.DataFrame(d, index = [0])
            playlist_df = pd.concat([playlist_df, track_df], ignore_index = True)

        # Move on to the next page of 100 tracks
        offset += 100

    return playlist_df
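
The function above reads a fixed number of pages and requests audio features one track at a time. As a rough sketch of an alternative, the two helpers below page through a source until it is exhausted and split track IDs into batches of at most 100. Both `fetch_all_items` and `chunked` are hypothetical names, and `fetch_page` is a stand-in for whatever call returns one page of items, so the example runs without network access.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all_items(fetch_page, page_size=100):
    """Collect every item from an offset-paginated source.

    `fetch_page(offset)` must return the list of items starting at `offset`
    and an empty list once the source is exhausted.
    """
    items, offset = [], 0
    while True:
        page = fetch_page(offset)
        if not page:
            break
        items.extend(page)
        offset += page_size
    return items


# Stand-in for a 250-track playlist served 100 tracks at a time
fake_playlist = [f"track_{i}" for i in range(250)]


def fetch_page(offset):
    return fake_playlist[offset:offset + 100]


all_tracks = fetch_all_items(fetch_page)
batches = list(chunked(all_tracks, 100))
print(len(all_tracks), [len(batch) for batch in batches])  # 250 [100, 100, 50]
```

With spotipy, `fetch_page` could wrap `sp.playlist_tracks(playlist_id=uri, offset=offset)["items"]`, and each batch of IDs could be passed to `sp.audio_features`, which accepts a list of track IDs. This would reduce the number of requests considerably for a 10,000-song playlist.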

Thank you to the following Spotify users for creating the year-end Billboard Hot 100 playlists: Rudy Evan (2000 - 2009), martinh0205 (2010), Courtney Micky Ericks (2011), 11165570549 (2012), gisi0722 (2013), cegomez12 (2014, 2015), wickeddreamer96 (2016, 2017), whe1998 (2018), 21ksdt3ca2433ykekswnkdmi (2019), and Oscar Zheng (2020).

In [5]:
uris = ['0JnJKrKjU6BHVm8D2AyLvm','6qsTClrBMf59rUNnD3fzWc','2Inm8T8QcA90nbOGshxHLo','1yI0s6n02tAYVVl94vS621','36QMGKn8SlbkIOa9ni9pMd','5PQypC2evM3DNjdGiHyuCQ','3JbWD8OGutoTKUbR3RvR8u','2XPEN88QyrPQ9zGqS8uS2x','6J4j6rDdlbUD021gMr06Sl','5lO5SKSvwbLetiBt6k7wNX','1WBljFutuk7uLQtfqfmjWV']

In [6]:
# Fetch the first 100 tracks of each year-end playlist
hit_frames = [get_attributes(uri, 1) for uri in uris]

hits = pd.concat(hit_frames)

# Add a label column marking every song from these playlists as a hit
hits['hit'] = 1

hits.head()

Unnamed: 0,artist,album,release_date,track_name,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,hit
0,Kesha,Animal + Cannibal (Deluxe Edition),2010-11-19,TiK ToK,4srpHYFHKjVGRDNHqRpWxi,0,0.755,0.832,2,-2.741,0,0.116,1.27e-06,0.291,0.735,120.032,199693,4,1
1,Lady A,Need You Now,2010-01-01,Need You Now,11EX5yhxr9Ihl3IN1asrfK,70,0.587,0.622,4,-5.535,1,0.0303,0.000636,0.2,0.231,107.943,277573,4,1
2,Train,"Hey, Soul Sister",2009-08-06,"Hey, Soul Sister",0KpfYajJVVGgQ32Dby7e9i,1,0.675,0.885,1,-4.432,0,0.0436,0.0,0.086,0.768,97.03,216667,4,1
3,Katy Perry,California Gurls (feat. Snoop Dogg),2010,California Gurls - feat. Snoop Dogg,0uslfroVWg7qGWhdduwC0K,0,0.791,0.76,0,-3.785,1,0.0503,0.0,0.127,0.444,125.04,236253,4,1
4,Usher,Raymond v Raymond,2010-03-30,OMG,3r04p85xiJh9Wqk59YDYdc,0,0.781,0.745,4,-5.81,0,0.0332,1.14e-05,0.36,0.326,129.998,269493,4,1
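
The `release_date` values above come in different precisions: some rows hold a full date (`2010-11-19`) and others only a year (`2010`). If the release year is needed later as a feature, one option is to take the part before the first hyphen. This is only a sketch; `release_year` is a hypothetical helper and assumes every value starts with a four-digit year.

```python
def release_year(release_date):
    """Return the year from a 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' string."""
    return int(release_date.split("-")[0])


# Values taken from the rows shown above
dates = ["2010-11-19", "2010-01-01", "2009-08-06", "2010", "2010-03-30"]
years = [release_year(date) for date in dates]
print(years)  # [2010, 2010, 2009, 2010, 2010]
```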


The following songs were compiled from Spotify's public playlists across several genres: classical, rock, hip-hop, EDM, country, and Latin.

In [13]:
uris_2 = ['37i9dQZF1DX2TRYkJECvfC','37i9dQZF1DX9wa6XirBPv8','37i9dQZF1DX4UtSsGT1Sbe','0CFuMybe6s77w6QQrJjW7d','37i9dQZF1DXa6YOhGMjjgx','65zr5tIVNr5spXyjsX77Gp','7ppwau1XzaZvvW6qnBbjuE','37i9dQZF1DWWEJlAGA9gs0','37i9dQZF1DX10zKzsJ2jva','37i9dQZF1DXbTxeAdrVG2l','37i9dQZF1DWVA1Gq4XHa6U','37i9dQZF1DXa9xHlDa5fc6','37i9dQZF1DX4sWSpwq3LiO']

Gathering more songs: 

In [14]:
# Fetch up to 400 tracks (4 pages of 100) from each genre playlist
genre_frames = [get_attributes(uri, 4) for uri in uris_2]

playlist_1 = pd.concat(genre_frames)
playlist_1

Unnamed: 0,artist,album,release_date,track_name,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,BLOND:ISH,Waves (feat. Grace Tither),2021-03-19,Waves (feat. Grace Tither),6ymVxYG0UHqIjXmclbE1cu,59,0.631,0.7630,9,-9.774,0,0.0352,0.299,0.0875,0.7150,121.008,202501,4
1,DJ Qness,L'owe L'owe,2020-12-23,L'owè L'owè,2kvJDwV4Mfm2ItUUMmkzSx,44,0.800,0.6170,5,-9.875,0,0.0539,0.634,0.1350,0.0728,122.010,449149,4
2,&ME,Discoteca,2021-03-26,Discoteca,0ENV8cY0bwun9qSQkh195f,58,0.772,0.4850,11,-14.388,0,0.0611,0.847,0.0913,0.1560,120.000,385103,4
3,Teen Daze,Peaceful Groove,2020-09-18,Peaceful Groove,5JP4iOnbCRu7zfYd95oG7V,22,0.652,0.6320,11,-12.653,0,0.0415,0.901,0.1140,0.7020,102.023,211765,4
4,Braxton,Indigo EP,2021-04-13,Indigo,0riJCeCbLqBwoVSnG3FMXM,33,0.480,0.7650,1,-6.246,0,0.0345,0.891,0.0889,0.3950,124.015,372897,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
308,Lummus Park,Ocean And 17th,2018-05-06,Ocean And 17th,2IUuk04dZVG8KRBJpcrpzt,57,0.502,0.0441,0,-22.503,1,0.0384,0.943,0.0992,0.0888,77.476,189402,4
309,Ricki Westberg,When We First Met,2019-08-24,When We First Met,0AACyZj1JuHVFtG19P8GTC,60,0.383,0.0180,10,-28.209,1,0.0345,0.95,0.1530,0.3880,89.114,112750,4
310,Mae Ji-Yoon,Vibrations,2017-09-10,Vibrations,1COWV6U3LKqCBpiHPthbiB,58,0.671,0.2000,8,-23.978,1,0.0365,0.939,0.1140,0.5300,107.517,173719,3
311,Novo Talos,Carry On,2018-02-08,Carry On,2e6uDYgJsa7rgIYzbNVoCi,56,0.289,0.0176,8,-24.931,1,0.0360,0.85,0.1110,0.0477,73.598,115485,4


A playlist of 10,000 songs:

In [16]:
uri_longest_playlist = '6FKDzNYZ8IW1pvYVF4zUN2'

# 100 iterations of up to 100 tracks each cover the full 10,000-song playlist
playlist_df = get_attributes(uri_longest_playlist, 100)

In [17]:
playlist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist            10000 non-null  object 
 1   album             10000 non-null  object 
 2   release_date      10000 non-null  object 
 3   track_name        10000 non-null  object 
 4   track_id          10000 non-null  object 
 5   popularity        10000 non-null  object 
 6   danceability      10000 non-null  float64
 7   energy            10000 non-null  float64
 8   key               10000 non-null  object 
 9   loudness          10000 non-null  float64
 10  mode              10000 non-null  object 
 11  speechiness       10000 non-null  float64
 12  instrumentalness  10000 non-null  float64
 13  liveness          10000 non-null  float64
 14  valence           10000 non-null  float64
 15  tempo             10000 non-null  float64
 16  duration_ms       10000 non-null  object 
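
The summary above lists `popularity`, `key`, `mode`, and `duration_ms` with dtype `object` even though they hold numbers. A minimal sketch of converting such columns with `pd.to_numeric` follows. The small `sample` frame is made up for illustration, with values borrowed from rows shown earlier.

```python
import pandas as pd

# Made-up sample whose numeric columns are stored as generic Python objects
sample = pd.DataFrame(
    {
        "popularity": [70, 0],
        "key": [4, 2],
        "mode": [1, 0],
        "duration_ms": [277573, 199693],
    },
    dtype=object,
)

numeric_columns = ["popularity", "key", "mode", "duration_ms"]

# Convert one column at a time so each gets a proper numeric dtype
for column in numeric_columns:
    sample[column] = pd.to_numeric(sample[column])

print(sample.dtypes)
```

Converting before modeling avoids surprises with estimators and summary statistics that expect numeric input.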

In [18]:
df = pd.concat([hits, playlist_df,playlist_1], ignore_index = True)
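
After this concatenation, only the rows that came from `hits` have a value in the `hit` column; the rest are `NaN`. The same song can also appear both in a hit playlist and in one of the other playlists. The sketch below shows one way to handle both, assuming that every song outside the hit playlists should be labeled 0 and that `track_id` identifies a song. The tiny frames are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins: track "b" appears in both sources
hits_sample = pd.DataFrame({"track_id": ["a", "b"], "popularity": [70, 65], "hit": [1, 1]})
others_sample = pd.DataFrame({"track_id": ["b", "c"], "popularity": [65, 40]})

combined = pd.concat([hits_sample, others_sample], ignore_index=True)

# Rows without a label did not come from a hit playlist
combined["hit"] = combined["hit"].fillna(0).astype(int)

# Hits come first in the concatenation, so keep="first" preserves the hit label
combined = combined.drop_duplicates(subset="track_id", keep="first").reset_index(drop=True)

print(combined)
```

Without the deduplication step, a song could end up in the data twice with contradictory labels.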

In [20]:
df.to_csv('data/songs.csv', index=False)
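
Writing to `data/songs.csv` only works if the `data` directory already exists, since `to_csv` does not create missing directories. A small sketch of creating the directory first is shown below. It writes into a temporary directory and uses a made-up one-row frame so that it can be run safely anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for the full song dataset
songs = pd.DataFrame({"track_name": ["OMG"], "hit": [1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "songs.csv"

    # Create the parent directory if it is missing, then write the file
    out_path.parent.mkdir(parents=True, exist_ok=True)
    songs.to_csv(out_path, index=False)

    # Read the file back to confirm it was written
    round_trip = pd.read_csv(out_path)

print(round_trip)
```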