# Getting the billboard data

## Description
This notebook explains the functionality for querying the data and processing it into the dataset used later by the models. The documentation has two major parts: retrieving the data from APIs and processing it into a ready-to-use dataset.<br />

All functionality used here is stored in the **data** folder, where you can check out the source (types are in **data/types/** to give more insight into the structure of the variables used). <br />An overview of each function used is provided here.<br />

### Index
[Query](#query)
 - Kaggle API
     - [Billboard Top 100 dataset](#billboard)
 - Spotify API
     - [Spotify song IDs for Billboard songs](#ids)
     - [Spotify song features for Billboard songs with song IDs](#features)
     - [Not hit songs with Billboard song album IDs](#notHits)
     
[Process](#Process)


[Requirements](#requirements)

In [1]:
# First set up the environment. The code sources are in folders located in the parent folder of this notebook.
import sys; sys.path.insert(0, '..') # add the parent folder to the path so its modules can be imported

# Install the Kaggle API package (not included in the Docker image) and spotipy
!pip install kaggle spotipy



## <a id='query'></a>Query
In this part, data is fetched from different sources through their APIs.

In [3]:
# All imports + initialize the Spotify API
from data.query.util import initializeSpotifyAPI, saveJson
api = initializeSpotifyAPI()

from data.query.billboard import getBillboardData
from data.query.spotify_api import getSpotifyDataFromBillboardSongs, getSpotifyAudioFeatures, getSongsWithAlbums

### <a id='billboard'></a>Billboard Top 100 dataset from Kaggle

### Dataset
First, download the dataset from Kaggle: https://www.kaggle.com/dhruvildave/billboard-the-hot-100-songs

**downloadBillboardData** is quite straightforward: it takes the name of the dataset as a string, handles the Kaggle credentials and calls **downloadKaggleDataset**, which downloads the dataset to the defined path.

By default the path is set to **data/datasets/billboard/**. If the folders for this path do not exist, they are created.

*Credentials can also be passed to the function directly (check the source).
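
The folder-creation behaviour described above can be sketched as follows. This is a minimal illustration, not the project's code: `ensure_dataset_dir` is a hypothetical helper name, and only the default path is taken from the text.

```python
from pathlib import Path


def ensure_dataset_dir(path="data/datasets/billboard/"):
    """Create the target folder and any missing parents, then return it.

    Hypothetical helper: the real download function may do this differently.
    """
    target = Path(path)
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Calling the helper twice with the same path is harmless, which matters when a notebook cell is rerun.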

In [3]:
from data.query.billboard import downloadBillboardData

# Download the kaggle billboard dataset
datasetName = 'dhruvildave/billboard-the-hot-100-songs'
downloadBillboardData(datasetName)

Now the Billboard songs can be read with the **getBillboardData** function. It reads the zip file that contains the dataset, takes the file given as input (default charts.csv) and parses the required data. Finally, it returns a list of Billboard songs.
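
Reading one CSV file straight out of a zip archive could look roughly like the sketch below. It is an assumption about the general approach, not the source of **getBillboardData**: `read_chart_rows` is a hypothetical name, and UTF-8 encoding is assumed.

```python
import csv
import io
import zipfile


def read_chart_rows(zip_path, member="charts.csv"):
    """Return the rows of one CSV file inside a zip archive as a list of dicts."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # ZipFile.open yields bytes, so wrap it for the text-based csv module.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Materialise the rows before the file handles are closed.
            return list(csv.DictReader(text))
```

Each row comes back as a dict keyed by the CSV header, with all values as strings, which is consistent with the string ranks seen in the printed output.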

In [5]:
# Fetch billboard song data from zip file (kaggle ds)
billboardTracks = getBillboardData('../data/datasets/billboard/billboard-the-hot-100-songs.zip')

In [5]:
# View the first 10 songs
print(billboardTracks[:10])

[{'rank': '1', 'song': 'Butter', 'artist': 'BTS', 'last-week': '1', 'peak-rank': '1', 'weeks-on-board': '7', 'date': '2021-07-17'}, {'rank': '2', 'song': 'Good 4 U', 'artist': 'Olivia Rodrigo', 'last-week': '2', 'peak-rank': '1', 'weeks-on-board': '8', 'date': '2021-07-17'}, {'rank': '3', 'song': 'Levitating', 'artist': 'Dua Lipa Featuring DaBaby', 'last-week': '4', 'peak-rank': '2', 'weeks-on-board': '40', 'date': '2021-07-17'}, {'rank': '4', 'song': 'Kiss Me More', 'artist': 'Doja Cat Featuring SZA', 'last-week': '3', 'peak-rank': '3', 'weeks-on-board': '13', 'date': '2021-07-17'}, {'rank': '5', 'song': 'Montero (Call Me By Your Name)', 'artist': 'Lil Nas X', 'last-week': '8', 'peak-rank': '1', 'weeks-on-board': '15', 'date': '2021-07-17'}, {'rank': '6', 'song': 'Bad Habits', 'artist': 'Ed Sheeran', 'last-week': '5', 'peak-rank': '5', 'weeks-on-board': '2', 'date': '2021-07-17'}, {'rank': '7', 'song': 'Leave The Door Open', 'artist': 'Silk Sonic (Bruno Mars & Anderson .Paak)', 'last-

### <a id='ids'></a>Spotify song ids


Now that the Billboard song data has been fetched from Kaggle, the songs need to be mapped to songs in the Spotify database.

Unfortunately, the dataset doesn't include Spotify IDs that could be used directly to query the song information.

To get the IDs, the songs first have to be searched for by song and artist name. **getSpotifyDataFromBillboardSongs** will do the trick.

Querying the songs is not totally straightforward. Song names can differ between the Billboard data and the Spotify data, and a song may not be available on Spotify at all (for market, artist or other reasons).

Search results are matched against the Billboard data as follows. First, only the song name is used as the query string. If the song names match and the Billboard artist is in the list of artists on the Spotify result, the song is added to the data.
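
The direct-match check can be sketched as below. `is_direct_match` is a hypothetical name, and comparing case-insensitively is an assumption, not something the text confirms.

```python
def is_direct_match(billboard_song, billboard_artist, spotify_name, spotify_artists):
    """Return True when the titles agree and the Billboard artist is listed on the result.

    Assumption: names are compared case-insensitively after trimming whitespace.
    """
    same_title = billboard_song.strip().lower() == spotify_name.strip().lower()
    listed = {artist.strip().lower() for artist in spotify_artists}
    return same_title and billboard_artist.strip().lower() in listed
```

A Billboard artist string such as 'Dua Lipa Featuring DaBaby' fails this check against a Spotify artist list of 'Dua Lipa' and 'DaBaby', which is the kind of case that needs a looser comparison.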

For songs that do not match directly, a new query is made with the song and artist name as the query string. If the tokenized sets of the query string and of the Spotify result's song + artists string match closely enough, the song is added to the data. This mainly collects songs with unknown characters in the song or artist names, or songs whose name contains, for example, "feat. 'some artist'": in the Spotify data the featured artists are put in the artist list, so the song names do not match.
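
One way to compare tokenized sets is sketched below using only the standard library. The overlap measure (shared tokens divided by all tokens, as a percentage) and the threshold of 60 are assumptions chosen for illustration; the project may use a different ratio or a fuzzy-matching library.

```python
import re


def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_set_ratio(query, candidate):
    """Percentage of shared tokens relative to all tokens in either string."""
    a, b = tokens(query), tokens(candidate)
    if not a or not b:
        return 0
    return round(100 * len(a & b) / len(a | b))


def is_close_enough(query, candidate, min_ratio=60):
    """Accept the candidate when the overlap reaches the (assumed) threshold."""
    return token_set_ratio(query, candidate) >= min_ratio
```

Because the comparison works on sets, word order and repeated words do not matter, so 'Butter BTS' and 'BTS Butter' score 100.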

This approach raises a second problem: Spotify also has remix, instrumental and karaoke versions. These songs are not wanted in the data, so blacklisting is implemented. All song names containing these terms are blacklisted and not added (with a few exceptions that are whitelisted).
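
A blacklist with whitelisted exceptions could be sketched like this. The three blacklisted terms come from the text; the whitelist entry and the function name `is_blacklisted` are hypothetical placeholders.

```python
BLACKLIST = ("remix", "instrumental", "karaoke")
# Hypothetical whitelist entry, only here to show how an exception would work.
WHITELIST = {"example song (remix)"}


def is_blacklisted(song_name, blacklist=BLACKLIST, whitelist=WHITELIST):
    """Return True when the name contains a blacklisted term and is not whitelisted."""
    name = song_name.lower()
    if name in whitelist:
        return False
    return any(term in name for term in blacklist)
```

Checking the whitelist first keeps the exceptions explicit and leaves the substring test simple.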

The Billboard data has the top 100 songs per week, so a song that is on the list two weeks in a row would be queried twice. During querying, all duplicates are ignored.
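
Skipping duplicates can be sketched as follows, assuming a song is identified by its (song, artist) pair from the chart rows. `unique_songs` is a hypothetical name and the real function may key on something else.

```python
def unique_songs(tracks):
    """Keep the first chart entry for each (song, artist) pair, preserving order."""
    seen = set()
    unique = []
    for track in tracks:
        key = (track["song"], track["artist"])
        if key in seen:
            continue  # already handled in an earlier week
        seen.add(key)
        unique.append(track)
    return unique
```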

Finally, all search results are stored in a JSON file.

*"Total number of songs to query" prints the total number of entries in the Billboard dataset, but each song is queried only once.

*Querying the full dataset takes about 2-3 hours.

In [5]:
# Query song information from spotify with song names
queryResultPath = '../data/datasets/spotify/query_results.json'
billboardSpotifyTrackData = getSpotifyDataFromBillboardSongs(api, billboardTracks, savePath=queryResultPath)

Total number of songs to query:  328487
Queried songs:  0
Actual queries:  0
Queried songs:  1000
Actual queries:  231
Queried songs:  2000
Actual queries:  338
Timed out trying again...
Queried songs:  3000
Actual queries:  474
Queried songs:  4000
Actual queries:  606
Queried songs:  5000
Actual queries:  705
Queried songs:  6000
Actual queries:  828
Queried songs:  7000
Actual queries:  976
Queried songs:  8000
Actual queries:  1114
Queried songs:  9000
Actual queries:  1206
Queried songs:  10000
Actual queries:  1320
Queried songs:  11000
Actual queries:  1405
Queried songs:  12000
Actual queries:  1501
Queried songs:  13000
Actual queries:  1584
Queried songs:  14000
Actual queries:  1684
Queried songs:  15000
Actual queries:  1801
Queried songs:  16000
Actual queries:  1922
Queried songs:  17000
Actual queries:  2032
Queried songs:  18000
Actual queries:  2127
Queried songs:  19000
Actual queries:  2204
Queried songs:  20000
Actual queries:  2266
Queried songs:  21000
Actual quer

Queried songs:  183000
Actual queries:  12049
Queried songs:  184000
Actual queries:  12118
Queried songs:  185000
Actual queries:  12182
Queried songs:  186000
Actual queries:  12232
Queried songs:  187000
Actual queries:  12301
Queried songs:  188000
Actual queries:  12374
Queried songs:  189000
Actual queries:  12434
Queried songs:  190000
Actual queries:  12495
Timed out trying again...
Queried songs:  191000
Actual queries:  12552
Queried songs:  192000
Actual queries:  12621
Queried songs:  193000
Actual queries:  12685
Queried songs:  194000
Actual queries:  12752
Queried songs:  195000
Actual queries:  12821
Queried songs:  196000
Actual queries:  12898
Queried songs:  197000
Actual queries:  12958
Queried songs:  198000
Actual queries:  13045
Queried songs:  199000
Actual queries:  13114
Queried songs:  200000
Actual queries:  13185
Queried songs:  201000
Actual queries:  13251
Queried songs:  202000
Actual queries:  13308
Queried songs:  203000
Actual queries:  13372
Queried 

### <a id='features'></a>Spotify song features for billboard songs
The Spotify ID information for the Billboard songs has now been queried, so the Spotify features can be fetched.

These features are used in the model.

TODO some info about features

**getSpotifyAudioFeatures** queries the audio features from Spotify. The results are stored in a JSON file.
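
Fetching features for many songs is usually done in batches. The sketch below shows the batching only, under the assumption that the endpoint accepts a limited number of IDs per request (100 is assumed here). `fetch_features` and `fetch_batch` are hypothetical names, and the actual request is passed in as a function so the sketch stays independent of any client.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_features(song_ids, fetch_batch, batch_size=100):
    """Call `fetch_batch` once per chunk of IDs and collect all results in order."""
    features = []
    for batch in chunked(song_ids, batch_size):
        features.extend(fetch_batch(batch))
    return features
```

If `api` is a `spotipy.Spotify` client (an assumption about what `initializeSpotifyAPI` returns), its `audio_features` method could be passed as `fetch_batch`.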

In [8]:
# Query features for billboard songs
hitSongPath = '../data/datasets/spotify/hit_song.json'
billboardHitFeatures = getSpotifyAudioFeatures(api, billboardSpotifyTrackData)
# Save the results
saveJson(billboardHitFeatures, hitSongPath)

In [8]:
print(billboardHitFeatures[:5])

[{'info': {'spotifyData': {'name': 'Butter', 'songID': '2bgTY4UwhfBYhGT4HUYStN', 'artists': [{'name': 'BTS', 'artistID': '3Nrfpe0tUJi4K4DXYWgMUX'}], 'album': {'name': 'Butter (Hotter, Sweeter, Cooler)', 'albumID': '1HnJKmB4P6Z8RBdLMWx18w', 'totalTracks': 5, 'releaseDate': '2021-06-04'}}, 'searchQuery': 'Butter', 'minMatchingRatioUsed': 100, 'originalData': {'rank': '1', 'song': 'Butter', 'artist': 'BTS', 'last-week': '1', 'peak-rank': '1', 'weeks-on-board': '7', 'date': '2021-07-17'}}, 'features': {'timeSignature': 4, 'durationMS': 164442, 'key': 8, 'mode': 1, 'acousticness': 0.00323, 'danceability': 0.759, 'energy': 0.459, 'instrumentalness': 0, 'liveness': 0.0906, 'loudness': -5.187, 'speechiness': 0.0948, 'valence': 0.695, 'tempo': 109.997}}, {'info': {'spotifyData': {'name': 'good 4 u', 'songID': '4ZtFanR9U6ndgddUvNcjcG', 'artists': [{'name': 'Olivia Rodrigo', 'artistID': '1McMsnEElThX1knmY4oliG'}], 'album': {'name': 'SOUR', 'albumID': '6s84u2TUpR3wdUv4NgKA2j', 'totalTracks': 11, '

### <a id='notHits'></a>Random songs with album ids & features for not hit songs


To use supervised machine learning methods, the model needs labeled examples. To find the difference between a Billboard song (considered a hit) and a non-Billboard song (considered not a hit), the other part of the data is fetched next.

The model needs the not-hit samples too in order to make any sense of the difference (in theory). Random songs that are not on the Billboard lists could simply be fetched from the Spotify API and used. The problem is: what to search for?

The solution is to use songs that share an album with a hit song, since we do have the album information. Also, because the artist is usually the same, the difference in Spotify audio features should be smaller than when comparing against a random song from a random artist. The line between hit and not hit would therefore potentially be more accurate.

**getSongsWithAlbums** implements all of this querying.

First, it takes the album ID of every Billboard song element and queries the album information.

Next, a random sample is taken from the tracks in the album query results.

If a song is the same as the song used for querying, or its name has blacklisted elements, it is ignored.

The song information is parsed, and the results are considered NOT hit songs.

Finally, the results are stored in a JSON file.
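
The sampling step can be sketched as below. Note one deliberate difference from the description: this sketch filters out the hit song and blacklisted names first and samples afterwards, so it returns up to the requested number of tracks. `sample_not_hits` is a hypothetical name; the `songID` and `name` keys follow the printed Spotify data.

```python
import random


def sample_not_hits(album_tracks, hit_song_id, num_songs,
                    is_blacklisted=lambda name: False, rng=random):
    """Pick up to `num_songs` random album tracks that are neither the hit nor blacklisted."""
    candidates = [
        track for track in album_tracks
        if track["songID"] != hit_song_id and not is_blacklisted(track["name"])
    ]
    # An album may have fewer usable tracks than requested.
    count = min(num_songs, len(candidates))
    return rng.sample(candidates, count)
```

Passing a seeded `random.Random` instance as `rng` makes the sample reproducible between runs.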

With 5 random songs per album, and an album for every unique Billboard song (about 20k), this is going to take a long time.

In [1]:
import sys; sys.path.insert(0, '..') # add the parent folder to the path so its modules can be imported

# All imports + initialize the Spotify API
from data.query.util import initializeSpotifyAPI, saveJson
api = initializeSpotifyAPI()

from data.query.billboard import getBillboardData
from data.query.spotify_api import getSpotifyDataFromBillboardSongs, getSpotifyAudioFeatures, getSongsWithAlbums

# Fetch billboard song data from zip file (kaggle ds)
billboardTracks = getBillboardData('../data/datasets/billboard/billboard-the-hot-100-songs.zip')

# Query song information from spotify with song names
queryResultPath = '../data/datasets/spotify/query_results.json'
billboardSpotifyTrackData = getSpotifyDataFromBillboardSongs(api, billboardTracks, savePath=queryResultPath)

In [2]:
# Query non hit songs with album information
numSongsFromAlbum = 5
notHitSongPath = '../data/datasets/spotify/not_hit_song.json'
billboardNOTHitFeatures = getSongsWithAlbums(api, billboardSpotifyTrackData, numSongsFromAlbum)
saveJson(billboardNOTHitFeatures, notHitSongPath)

Query number:
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600
3700
3800
3900
4000
4100
4200
4300
4400
4500
4600
4700
4800
4900
5000
5100
5200
5300
5400
5500
5600
5700
5800
5900
6000
6100
6200
6300
6400
6500
6600
6700
6800
6900
7000
7100
7200
7300
7400
7500
7600
7700
7800
7900
8000
8100
8200
8300
8400
8500
8600
8700
8800
8900
9000
9100
9200
9300
9400
9500
9600
9700
9800
9900
10000
10100
10200
10300
10400
10500
10600
10700
10800
10900
11000
11100
11200
11300
11400
11500
11600
11700
11800
11900
12000
12100
12200
12300
12400
12500
12600
12700
12800
12900
13000
13100
13200
13300
13400
13500
13600
13700
13800
13900
14000
14100
14200
14300
14400
14500
14600
14700
14800
14900
15000
15100
15200
15300
15400
15500
15600
15700
15800
15900
16000
16100
16200
16300
16400
16500
16600
16700
16800
16900
17000
17100
17200
17300
17400
17500
17600
17700
17800
17900
18000
18100
18200


In [3]:
# Sneak peek at all the data we have
print("Original Billboard data: ")
for info in billboardTracks[:10]:
    print(info)
    
print("\nSong meta data after query: ")
for info in billboardSpotifyTrackData[:10]:
    print(info)
    
print("\nHit Song features: ")
for info in billboardHitFeatures[:10]:
    print(info)
    
print("\nNOT Hit Song features: ")
for info in billboardNOTHitFeatures[:10]:
    print(info)

Original Billboard data: 
{'rank': '1', 'song': 'Butter', 'artist': 'BTS', 'last-week': '1', 'peak-rank': '1', 'weeks-on-board': '7', 'date': '2021-07-17'}
{'rank': '2', 'song': 'Good 4 U', 'artist': 'Olivia Rodrigo', 'last-week': '2', 'peak-rank': '1', 'weeks-on-board': '8', 'date': '2021-07-17'}
{'rank': '3', 'song': 'Levitating', 'artist': 'Dua Lipa Featuring DaBaby', 'last-week': '4', 'peak-rank': '2', 'weeks-on-board': '40', 'date': '2021-07-17'}
{'rank': '4', 'song': 'Kiss Me More', 'artist': 'Doja Cat Featuring SZA', 'last-week': '3', 'peak-rank': '3', 'weeks-on-board': '13', 'date': '2021-07-17'}
{'rank': '5', 'song': 'Montero (Call Me By Your Name)', 'artist': 'Lil Nas X', 'last-week': '8', 'peak-rank': '1', 'weeks-on-board': '15', 'date': '2021-07-17'}
{'rank': '6', 'song': 'Bad Habits', 'artist': 'Ed Sheeran', 'last-week': '5', 'peak-rank': '5', 'weeks-on-board': '2', 'date': '2021-07-17'}
{'rank': '7', 'song': 'Leave The Door Open', 'artist': 'Silk Sonic (Bruno Mars & Ander

NameError: name 'billboardHitFeatures' is not defined

All the needed data is now stored in JSON files. These are processed in the next section to create the datasets that the model will use.

## <a id='Process'></a>Processing the data into dataset
In this part the fetched data is processed into a dataset ready to be consumed by the models.


### <a id='requirements'></a>Requirements
There are requirements for using the Kaggle and Spotify APIs. If you want to actually run the data queries, credentials need to be set. Information on how to set the credentials is below. **To just read through what is done in this document you do not need the credentials!**

#### env file
API credentials are read directly from the env.ini file in the config folder. You need to create the env.ini file there. Check the Readme in the config folder to see the structure of the env.ini file.
##### Kaggle
Information on how to create a Kaggle API token: https://www.kaggle.com/docs/api <br /> After creating the token, check your kaggle.json for the token.
>If you are using the Kaggle CLI tool, the tool will look for this token at `~/.kaggle/kaggle.json` on Linux, OSX, and other UNIX-based operating systems, and at `C:\Users\<Windows-username>\.kaggle\kaggle.json` on Windows. If the token is not there, an error will be raised. Hence, once you’ve downloaded the token, you should move it from your Downloads folder to this folder.
    
##### Spotify
Spotify credentials can be created at https://developer.spotify.com/dashboard/ 
1. It requires a Spotify account; if you have one, just sign in.
2. After signing in, go to **Dashboard** and click **Create an app**, fill in the info and click **Create**.
3. The **Dashboard** should now display your app. Click your app to find **CLIENT_ID** and **CLIENT_SECRET**.
4. Put the **CLIENT_ID** and **CLIENT_SECRET** hashes (weird strings of numbers and letters) into your **env.ini** file in the config folder, as 'userId' and 'userKey'.
<br />Done.
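
Reading the two Spotify values from an ini file could look roughly like this sketch. The key names 'userId' and 'userKey' come from the steps above; the section name `spotify` and the function name are assumptions, so check the Readme in the config folder for the real structure.

```python
import configparser


def read_spotify_credentials(ini_text, section="spotify"):
    """Parse ini-formatted text and return the (userId, userKey) pair.

    Assumption: both keys live in one section; the section name is hypothetical.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # configparser lowercases option names, so lookups are case-insensitive.
    return parser[section]["userId"], parser[section]["userKey"]
```

To read from disk instead, the file contents can be loaded first and passed in as `ini_text`.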