# Main index

[Collect the data](#collect)
<br />[Analyse the gathered data](#analysis)
<br />[Preprocess the data](#preprocess)
<br />[Analyse the processed data](#pro_analysis)
<br />[Model the data](#modelling)


Environment setup

In [None]:
# First set up the environment. The source code lives in folders under the parent folder of this notebook's scope.
#import sys; sys.path.insert(0, '..') # add the parent folder to the path so its modules can be imported

# Install Kaggle api package (not included in the docker image) and spotipy
!pip install kaggle spotipy pandas

# Install packages for fuzzywuzzy
# This is used to calculate the similarity ratio between two strings of unequal length
%conda install -c conda-forge python-levenshtein
%pip install fuzzywuzzy

# <a id='collect'></a>Collecting the data

### Index

[Kaggle and billboard data](#kaggle)

[Spotify and audio features](#spotify)
<br />[Spotify song query and matching](#spotify_matching)
<br />[Spotify song features for billboard songs](#spotify_features)
<br />[Spotify collecting the not hits](#spotify_not_hits)
<br />[Spotify filter hit and not hit songs](#spotify_filter)




Credential requirements are explained [here](notes/data.ipynb#dataset_requirements)
<br />The credentials go [here](config/env.ini)

In [None]:
# All imports + initialize the Spotify API
from data.query.util import initializeSpotifyAPI, saveJson
api = initializeSpotifyAPI()

from data.query.billboard import getBillboardData
from data.query.spotify_api import (
    getSpotifyDataFromBillboardSongs, 
    getSpotifyAudioFeatures, 
    getSongsWithAlbums
)

## <a id='kaggle'></a>Kaggle and billboard data

First query the dataset from Kaggle: https://www.kaggle.com/dhruvildave/billboard-the-hot-100-songs
[**downloadBillboardData**](data/query/billboard.py) (line 29) is quite straightforward: it takes the name of the dataset as a string, handles the Kaggle credentials and calls [**downloadKaggleDataset**](data/util.py) (line 52), which downloads the dataset to the defined path.

Default path to store the dataset is **data/datasets/billboard/** and it is created during the data loading.

In [None]:
from data.query.billboard import downloadBillboardData

# Download the kaggle billboard dataset
datasetName = 'dhruvildave/billboard-the-hot-100-songs'
downloadBillboardData(datasetName)

Now the billboard songs can be queried with the [**getBillboardData**](data/query/billboard.py) (line 10) function. It reads the zip file that contains the dataset, takes the file that is given as input (default charts.csv) and parses the required data. Finally it returns a list of billboard songs.
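As a rough illustration of reading a CSV file straight out of a zip archive, here is a minimal sketch using only the standard library. The function name `readSongsFromZip` and the column names `song` and `artist` are assumptions made for this example; the actual parsing in `getBillboardData` may differ.

```python
import csv
import io
import zipfile


def readSongsFromZip(zipPath, fileName="charts.csv"):
    """Return one dict per chart row, keeping only the song and artist fields."""
    with zipfile.ZipFile(zipPath) as archive:
        with archive.open(fileName) as rawFile:
            # ZipFile.open yields bytes, so wrap it to get text for the csv module
            reader = csv.DictReader(io.TextIOWrapper(rawFile, encoding="utf-8"))
            return [{"song": row["song"], "artist": row["artist"]} for row in reader]


# Build a tiny in-memory archive to demonstrate the call (made-up data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("charts.csv", "date,rank,song,artist\n2021-01-02,1,Song A,Artist A\n")

demoSongs = readSongsFromZip(buffer)
print(demoSongs)
```

Reading from the archive directly avoids extracting the dataset to disk first.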

In [None]:
# Fetch billboard song data from zip file (kaggle ds)
billboardTracks = getBillboardData('data/datasets/billboard/billboard-the-hot-100-songs.zip')

In [None]:
# View 10 first songs
print(billboardTracks[:10]) # Change to Pandas !!!

## <a id='spotify'></a>Spotify and audio features

Now that the billboard song data has been fetched from Kaggle, the songs need to be mapped to songs in the Spotify database.

Unfortunately the Kaggle billboard dataset doesn't include Spotify IDs that could be used to query the song information directly.

To get the IDs, the songs first have to be searched by song and artist name.<br />[**getSpotifyDataFromBillboardSongs**](data/query/spotify_api.py) (line 146) will do the trick.

### Finding the right song

Querying the songs is not entirely straightforward. The song names can differ between the billboard data and the Spotify data, and a song may not be available on Spotify at all (for market, artist or other reasons).

Matching the results against the billboard data works as follows: first, only the song name is used as the query string. If the song names match and the billboard artist is in the list of artists that Spotify lists for the song, the song is added to the data.<br />[**songMatching**](data/query/spotify_api.py) (line 120)

For the songs that do not match directly, a new query is made with both the song and artist name as the query string. If the tokenized sets of the query string and of the Spotify result's song + artists string match closely enough, the song is added to the data. This mainly collects songs with unknown characters in the song or artist names, or songs whose billboard name contains for example "feat. 'some artist' " while Spotify puts the featured artists in the artist list, so the song names do not match.<br />[**getSpotifyDataFromBillboardSongs**](data/query/spotify_api.py) (line 224)
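The token-set comparison described above can be sketched roughly as follows. This is not the project's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy, and the function names and the `threshold` value are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def tokenSetSimilarity(queryString, resultString):
    """Compare the sorted token sets of both strings, returning a value in [0, 1]."""
    queryTokens = " ".join(sorted(tokenize(queryString)))
    resultTokens = " ".join(sorted(tokenize(resultString)))
    return SequenceMatcher(None, queryTokens, resultTokens).ratio()


def isCloseEnough(queryString, resultString, threshold=0.8):
    # threshold is a hypothetical cut-off, not the value used in spotify_api.py
    return tokenSetSimilarity(queryString, resultString) >= threshold


# Billboard name carries "feat.", Spotify lists the featured artist separately
billboardQuery = "Stay feat. Justin Bieber The Kid LAROI"
spotifyResult = "STAY The Kid LAROI Justin Bieber"
print(isCloseEnough(billboardQuery, spotifyResult))
```

Sorting the tokens before comparing makes the score independent of word order, which is what lets a "feat." title line up with a separate artist list.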

### Song version blacklist

This approach raises a second problem: Spotify also contains remix, instrumental and karaoke versions. These songs are not wanted in the data, so blacklisting is implemented. All song names containing these terms are blacklisted and not added (with a few exceptions that are whitelisted).<br />
[**checkIfBlackListed**](data/query/spotify_api.py) (line 50, called in songMatching)
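A minimal sketch of such a blacklist check with a whitelist override might look like this. The term list comes from the text above, while the whitelist entry and the function name `isBlacklisted` are hypothetical; the real lists live in `checkIfBlackListed`.

```python
# Terms named in the text above
BLACKLISTED_TERMS = ("remix", "instrumental", "karaoke")
# Hypothetical exception: a title that contains a blacklisted term but is wanted anyway
WHITELISTED_NAMES = {"ignition (remix)"}


def isBlacklisted(songName):
    """Return True when the song name contains a blacklisted term and is not whitelisted."""
    name = songName.lower()
    if name in WHITELISTED_NAMES:
        return False
    return any(term in name for term in BLACKLISTED_TERMS)


print(isBlacklisted("Hello - Karaoke Version"))
print(isBlacklisted("Ignition (Remix)"))
```

Checking the whitelist first keeps the exceptions explicit and easy to extend.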

### Ignoring duplicates

The billboard data has the top 100 songs per week, so a song that is on the list two weeks in a row would be queried twice. During querying all such duplicates are ignored.<br />[**fetchSongsByNameFromSpotify**](data/query/spotify_api.py) (line 162, implemented via unMatchedIndexes variable)
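One simple way to skip repeated chart entries is to remember which (song, artist) pairs have already been seen. The sketch below is an assumption for illustration only; the project tracks this differently, via the unMatchedIndexes variable, and the name `uniqueSongs` is made up.

```python
def uniqueSongs(chartEntries):
    """Keep the first occurrence of every (song, artist) pair, preserving order."""
    seen = set()
    unique = []
    for entry in chartEntries:
        key = (entry["song"].lower(), entry["artist"].lower())
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


weeklyCharts = [
    {"song": "Song A", "artist": "Artist A"},  # week 1
    {"song": "Song B", "artist": "Artist B"},  # week 1
    {"song": "Song A", "artist": "Artist A"},  # week 2, same song again
]
songsToQuery = uniqueSongs(weeklyCharts)
print(len(weeklyCharts), len(songsToQuery))
```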

Finally all search results are stored in a JSON file.

*Total number songs to query prints the total number of songs in the billboard dataset, but each song is queried only once

*Takes about 2-3 hours to query the full dataset


### <a id='spotify_matching'></a>Spotify song query and matching

In [None]:
# Query song information from spotify with song names
queryResultPath = '../data/datasets/spotify/query_results.json'
billboardSpotifyTrackData = getSpotifyDataFromBillboardSongs(
    api, 
    billboardTracks, 
    savePath=queryResultPath
)

### <a id='spotify_features'></a>Spotify song features for billboard songs
The Spotify IDs for the billboard songs have been queried, so now the Spotify audio features can be fetched.

[**getSpotifyAudioFeatures**](data/query/spotify_api.py) (line 236) queries the audio features from Spotify. The results are stored in a JSON file.

In [None]:
# Query features for billboard songs
hitSongPath = 'data/datasets/spotify/hit_song.json'
billboardHitFeatures = getSpotifyAudioFeatures(api, billboardSpotifyTrackData)
# Save the results
saveJson(billboardHitFeatures, hitSongPath)

In [None]:
print(billboardHitFeatures[:5]) # Change to pandas

This set of songs will act as hit songs in this work.

### <a id='spotify_not_hits'></a>Spotify collecting the not hits

To use supervised machine learning methods, the model needs labelled examples of the data. The goal is to find the difference between a billboard song (considered a hit) and a non-billboard song (considered a not hit).

The model needs the not-hit samples to compare against. Random songs that are not on the billboard lists could be fetched from the Spotify API and used. The problem is: what to search for?

The solution is to use songs that share an album with a hit. All songs collected from the Spotify API carry information about the album they were released on (this is not always the original album and in some cases links to a hit collection).

Also, since the artist is usually the same, the difference in Spotify audio features should be smaller than when comparing against a random song from a random artist. The hit and not-hit songs are therefore closer to each other in audio feature space, which makes finding a line to separate them harder, but yields a model that answers a more specific question: which audio features distinguish the hits from the other songs of artists featured on the billboard list?

[**getSongsWithAlbums**](data/query/spotify_api.py) (line 303) implements this querying.

First it takes the album id of every billboard song element and queries the album information.<br />[**fetchAlbumTracks**](data/query/spotify_api.py) (line 280 onwards)

Next a random sample of tracks is taken from the album query results.<br />[**fetchAlbumTracks**](data/query/spotify_api.py) (line 288)

If the random song is not the hit song used to query the album and it has no blacklisted elements, it is collected.<br />[**fetchAlbumTracks**](data/query/spotify_api.py) (line 293)

Song information is parsed and the audio features are fetched for the collected random songs.<br />[**getSongsWithAlbums**](data/query/spotify_api.py) (line 312)

Finally the results are stored in a JSON file.

With 5 random songs per album and an album for every unique billboard song (about 20k), this is going to take a long time.
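The sampling and filtering steps above can be sketched without the Spotify API, using plain dictionaries as stand-ins for album tracks. The function name `pickNotHitCandidates`, the track fields `id` and `name`, and the blacklist callback are assumptions for this example, not the signatures used in `fetchAlbumTracks`.

```python
import random


def pickNotHitCandidates(albumTracks, hitSongId, numSongs, isBlacklisted, rng=random):
    """Sample tracks from one album and keep those usable as not-hit songs."""
    sampleSize = min(numSongs, len(albumTracks))
    sampledTracks = rng.sample(albumTracks, sampleSize)
    return [
        track for track in sampledTracks
        if track["id"] != hitSongId and not isBlacklisted(track["name"])
    ]


# Made-up album: one hit, one karaoke version and two ordinary tracks
albumTracks = [
    {"id": "t1", "name": "The Hit"},
    {"id": "t2", "name": "Album Track"},
    {"id": "t3", "name": "The Hit - Karaoke Version"},
    {"id": "t4", "name": "Closing Song"},
]
candidates = pickNotHitCandidates(
    albumTracks,
    hitSongId="t1",
    numSongs=4,
    isBlacklisted=lambda name: "karaoke" in name.lower(),
    rng=random.Random(0),
)
print([track["id"] for track in candidates])
```

Because filtering happens after sampling, an album can contribute fewer than `numSongs` tracks, which matches the order of the steps listed above.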

In [None]:
# Query non hit songs with album information
numSongsFromAlbum = 5
notHitSongPath = '../data/datasets/spotify/not_hit_song.json'
billboardNOTHitFeatures = getSongsWithAlbums(api, billboardSpotifyTrackData, numSongsFromAlbum)
saveJson(billboardNOTHitFeatures, notHitSongPath)

### <a id='spotify_filter'></a>Spotify filter hit and not hit songs

Finally the data can be checked for duplicates and for entries with no audio features.

Here a duplicate is a hit song that also appears in the not-hit dataset. These will be removed.

As a sanity check, any song that does not have Spotify audio features is also removed from the final data.

In [None]:
from data.query.util import loadJson

hitSongPath = 'data/datasets/spotify/hit_song.json'
notHitSongPath = 'data/datasets/spotify/not_hit_song.json'

# First load the saved data
hits = loadJson(hitSongPath)
notHits = loadJson(notHitSongPath)

print(f"Hit songs: {len(hits)}")
print(f"Not Hit songs: {len(notHits)}")

In [None]:
# Check if there are songs without features and create id list
removeFromHits = []
# Check if not hit songs share hit song ids
hitSpotifyIds = set()
for i, hit in enumerate(hits):
    hitSpotifyIds.add(hit['info']['spotifyData']['songID'])
    
    # If the features are None or all the feature values are zero, the song is added to the removal list
    if hit['features'] is None or all(value == 0 for value in hit['features'].values()):
        removeFromHits.append(i)

print(f"Empty features: {len(removeFromHits)}")

Just two empty feature sets among all the hit songs. Not bad. These will be removed in a bit.

The previous cell created the set of all hit song Spotify ids, so now the not-hit songs can be checked to make sure they are not actually hits.

Not-hit songs whose features are null are also added to the removal list, as was done for the hits.

In [None]:
removeFromNotHits = []
for i, notHit in enumerate(notHits):
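    songId = notHit['info']['spotifyData']['songID']
    features = notHit['features']
    # Remove the song if it is actually a hit, or if its features are missing or all zeros
    if songId in hitSpotifyIds or features is None or all(value == 0 for value in features.values()):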
        removeFromNotHits.append(i)

print(f"Empty features or actually hits: {len(removeFromNotHits)}")

10214 songs! A high number was expected, as the songs were randomly taken from albums. If a hit song is registered to an album full of top hits, there is a good chance of picking songs that also appear on the billboard list.

Now the empty feature or duplicate songs will be removed.

In [None]:
# Delete empty feature hits, reverse so the order is not messed up
for deleteId in list(reversed(removeFromHits)):
    hits.pop(deleteId)
    
# same for not hits
for deleteId in list(reversed(removeFromNotHits)):
    notHits.pop(deleteId)

print(f"Hit songs: {len(hits)}")
print(f"Not Hit songs: {len(notHits)}")

Now just store the data like before; the previous files will be overwritten.

In [None]:
saveJson(hits, hitSongPath)
saveJson(notHits, notHitSongPath)

All the needed data is now stored in JSON files. These are processed in the next section to create the datasets that the model will use.

# <a id='analysis'></a> Analyse the gathered data

Index


The collection part left us with a dataset that looks like this from the perspective of labels:

In [None]:
from data.plotting.util import makeHistogram

# Here is the number of hits and not hits
makeHistogram([0, 1], [len(notHits), len(hits)])

Number of songs by year

In [None]:
# hitsDataset is the DataFrame built in the preprocessing section below
for year in sorted(hitsDataset['year'].unique()):
    print(f"Year: {year} Number of songs: {len(hitsDataset[hitsDataset['year'] == year])}")

# <a id='preprocess'></a> Song data preprocessing

Index

[Load the data](#process_load)
<br />[Add labels](#process_label)
<br />[Sample by year](#process_sample)
<br />[Fit transformation scaler](#process_fit_transform)


### <a id='process_load'></a>Load the gathered data

First step is to load the previously gathered data into memory.

In [None]:
# Fetch the downloaded data
from data.query.util import loadJson

notHitSongPath = '../data/datasets/spotify/not_hit_song.json'
hitSongPath = '../data/datasets/spotify/hit_song.json'

# Load the data
hits = loadJson(hitSongPath)
notHits = loadJson(notHitSongPath)

The hits data will be collected and stored in a pandas dataframe for further sampling and processing.

In [None]:
from pandas import DataFrame
from data.process.util import parseYearFromDate

features = []
for hit in hits:
    year = parseYearFromDate(hit['info']['spotifyData']['album']['releaseDate'])
    features.append({**hit['features'], 'year': year[2:]})

hitsDataset = DataFrame.from_records(features)
hitsDataset

Now the same for the not hits.

In [None]:
features = []
for notHit in notHits:
    year = parseYearFromDate(notHit['info']['spotifyData']['album']['releaseDate'])
    features.append({**notHit['features'], 'year': year[2:]})

notHitsDataset = DataFrame.from_records(features)
notHitsDataset

### <a id='process_label'></a>Add labels

The originally collected datasets of hits (**hitsDataset**) and not hits (**notHitsDataset**) are loaded and ready for further manipulation.

As a first step, the labels can be added to the datasets as a feature.

That is easy: all hits get a label value of 1 and all not hits a label value of 0.

In [None]:
# Now labels in place
hitsDataset['label'] = 1
notHitsDataset['label'] = 0

# Draft notes, kept as comments so the cell runs:
# eghtyFiveSampleSize = 250
# if int(year) > 85 or int(year) < 22  <- 85 logic

### <a id='process_sample'></a>Sample the data by year

Dataset is balanced with [**sampleByYears**](data/process/balance.py).

All unique years in the dataset are looped over, and the first step is to check whether the currently processed year is within the wanted range. Only songs between 1965 and 2021 are of interest, so years outside that range are skipped.

Next, songs are picked at random and added to the balanced dataset.
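Since the years are stored as two-digit strings, a range such as 65 to 21 wraps around the turn of the century. The sketch below shows one way such a check and a per-year random sample could work. It is an assumption about the behaviour, not the code of `sampleByYears`, and the names `yearInRange` and `sampleSongsByYear` are made up.

```python
import random


def yearInRange(twoDigitYear, firstYear=65, lastYear=21):
    """Check a two-digit year against a range that may wrap past 99."""
    year = int(twoDigitYear)
    if firstYear <= lastYear:
        return firstYear <= year <= lastYear
    # The range wraps around the century: 65..99 and 00..21
    return year >= firstYear or year <= lastYear


def sampleSongsByYear(songs, sampleSize, firstYear=65, lastYear=21, seed=0):
    """Take at most sampleSize random songs from every year inside the range."""
    rng = random.Random(seed)
    songsByYear = {}
    for song in songs:
        if yearInRange(song["year"], firstYear, lastYear):
            songsByYear.setdefault(song["year"], []).append(song)
    sampled = []
    for year in sorted(songsByYear):
        yearSongs = songsByYear[year]
        sampled.extend(rng.sample(yearSongs, min(sampleSize, len(yearSongs))))
    return sampled


# Made-up songs: three from 1970, one from 2005 and one from 1940
demoSongs = [{"year": "70"}, {"year": "70"}, {"year": "70"}, {"year": "05"}, {"year": "40"}]
balanced = sampleSongsByYear(demoSongs, sampleSize=2)
print(len(balanced))
```

Capping every year at the same sample size keeps any single year from dominating the dataset.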

In [1]:
from data.process.balance import sampleByYears

sampleSize = 192
balancedHitDataset, balancedNotHitDataset = sampleByYears(
                                                hitsDataset, 
                                                notHitsDataset,
                                                sampleSize,
                                                65,
                                                21
                                            )

NameError: name 'Dataset' is not defined

### <a id='process_fit_transform'></a>Apply transformations

The data needs some transformations.

Some features are scaled to be between 0 and 1.

Categorical features are one-hot encoded.

And finally the scaler/transformer is fitted with the data.
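To make the two transformations concrete, here is a plain-Python sketch of what min-max scaling and one-hot encoding do to a single column. The helper names and the sample values are made up for illustration; in the notebook the work is done by scikit-learn's `MinMaxScaler` and `OneHotEncoder`.

```python
def minMaxScale(values):
    """Map values linearly so the smallest becomes 0 and the largest becomes 1."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    # A constant column has no spread, so map it to 0.0 instead of dividing by zero
    return [(value - lowest) / span if span else 0.0 for value in values]


def oneHotEncode(values):
    """Turn every category into its own 0/1 column, ordered by category value."""
    categories = sorted(set(values))
    return [[1 if value == category else 0 for category in categories] for value in values]


tempos = [60.0, 120.0, 180.0]  # made-up tempo values
modes = [1, 0, 1]              # made-up mode values
scaledTempos = minMaxScale(tempos)
encodedModes = oneHotEncode(modes)
print(scaledTempos)
print(encodedModes)
```

One-hot encoding avoids implying an order between categories such as key or time signature, which a single integer column would.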

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

from pandas import concat

# First the balanced datasets need to be concatenated
dataset = concat([balancedHitDataset, balancedNotHitDataset], ignore_index=True)

# Initialize the transformer
transformer = ColumnTransformer([ 
    ("scale", MinMaxScaler(), ['durationMS', 'loudness', 'tempo', 'year']),
    ("onehot", OneHotEncoder(), ['timeSignature', 'key', 'mode'])
    ], remainder='passthrough')

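# Fit the transformer with the whole sampled dataset, leaving out the label
# so the columns match the ones passed to transform() later
transformer.fit(dataset.drop(['label'], axis=1))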

# <a id='pro_analysis'></a> Processed data analysis


# <a id='modelling'></a> Modelling

Index

[Initial stuff](#model_init)
<br />[Logistic Regression](#logreg)
<br />[Support Vector Machine](#svm)
<br />[Multi-Layer Perceptron](#MLP)
<br />[Random Forest](#rf)


### <a id='model_init'></a>Initial stuff

Most of the models need hyperparameters to be set. As good values for them are not known in advance, it is common to try several and keep the model that scores best on some performance measure.

Scikit-learn's GridSearchCV will be used to search for good hyperparameters. It takes as parameters the hyperparameter values to test, then creates and fits the model once for every combination of them. The best-performing model is then available as the result.
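How a list of parameter grids expands into individual candidates can be sketched with `itertools.product`. This is only an illustration of the idea; the helper `expandGrid` and the demo grid are made up here, and scikit-learn performs the expansion internally.

```python
from itertools import product


def expandGrid(paramGrids):
    """Expand a list of {name: [values]} grids into one dict per combination."""
    combinations = []
    for grid in paramGrids:
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            combinations.append(dict(zip(names, values)))
    return combinations


# Made-up grid: 2 * 3 combinations from the first dict plus 1 * 2 from the second
demoGrid = [
    {"kernel": ["linear", "rbf"], "C": [2.0, 3.0, 10.0]},
    {"kernel": ["poly"], "degree": [2, 3]},
]
candidates = expandGrid(demoGrid)
print(len(candidates))
```

Each candidate is fitted once per cross-validation fold, so the number of fits grows quickly with every value added to a grid.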

Another instrument in play here is scikit-learn's GroupKFold object. It is used to apply cross-validation during training.
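The idea behind group-wise folds is that all samples sharing a group end up in the same fold, so a group never appears in both the training and the validation split. The sketch below assigns groups to folds round-robin, which is a simplification: scikit-learn's GroupKFold also balances the fold sizes. The function name `groupFolds` and the example groups are assumptions.

```python
def groupFolds(groups, numFolds):
    """Return numFolds lists of sample indexes, keeping each group in one fold."""
    uniqueGroups = sorted(set(groups))
    if numFolds > len(uniqueGroups):
        raise ValueError("numFolds cannot exceed the number of distinct groups")
    # Simplified round-robin assignment of whole groups to folds
    foldOfGroup = {group: i % numFolds for i, group in enumerate(uniqueGroups)}
    folds = [[] for _ in range(numFolds)]
    for index, group in enumerate(groups):
        folds[foldOfGroup[group]].append(index)
    return folds


# Made-up groups, for example one letter per album
sampleGroups = ["a", "a", "b", "c", "c", "d"]
folds = groupFolds(sampleGroups, numFolds=2)
print(folds)
```

scikit-learn's GroupKFold likewise rejects a split count larger than the number of distinct groups, so the grouping column needs at least as many distinct values as folds.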

Maybe more info here....

The data also needs to be transformed. The transformer/scaler that was fitted in [this section](#process_fit_transform) is used here.

In [None]:
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.metrics import classification_report

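groups = dataset.label
# Note: group_kfold is not passed as cv= to the searches below, so GridSearchCV
# falls back to its default 5-fold cross-validation
group_kfold = GroupKFold(n_splits=5)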

In [None]:
transformed_data = transformer.transform(dataset.drop(['label'], axis=1))
labels = dataset.label

### <a id='logreg'></a>Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

params = [
    {
        'solver': ['liblinear'], 
        'penalty': ['l1', 'l2'],  
        'multi_class': ['ovr'],
        'max_iter': [100, 250, 500, 1000],
        'C': [0.01, 0.1, 1]
    }, 
    {
        'solver': ['saga'], 
        'penalty': ['elasticnet'], 
        'multi_class': ['ovr'], 
        'n_jobs': [-1], 
        'warm_start': [True],
        'max_iter': [1000, 10000, 100000],
        'C': [0.01, 0.1, 1],
        'l1_ratio': [0.4, 0.5, 0.6]
    }, 
]

# Classifier trained with dataset One
logReg = GridSearchCV(LogisticRegression(), params, n_jobs=-1, scoring='accuracy')
logReg.fit(transformed_data, labels, groups=groups)

print(f"Best results: {logReg.best_score_}")


### <a id='svm'></a>Support Vector Machine

In [None]:
from sklearn.svm import SVC

params = [
    {
        'kernel': ['linear', 'rbf'], 
        'class_weight': [None, 'balanced'],
        'max_iter': [-1],
        'C': [2.0, 3.0, 10.0],
        'gamma': [0.1, 'auto']
    },
]

# Classifier trained with dataset One
svm = GridSearchCV(SVC(), params, n_jobs=-1, scoring='accuracy')
svm.fit(transformed_data, labels, groups=groups)

print(f"Best results: {svm.best_score_}")


### <a id='MLP'></a>Multi-Layer Perceptron

In [None]:
from sklearn.neural_network import MLPClassifier

params = [
    {
        'hidden_layer_sizes': [
            (10,),
            (20,),
        ], 
        'activation': ['relu', 'logistic'],  
        'solver': ['lbfgs', 'adam'],
        'max_iter': [10000],
        'alpha': [0.0001, 0.001],
        'early_stopping': [True]
    }, 
]

# Classifier trained with dataset One
MLP = GridSearchCV(MLPClassifier(), params, n_jobs=-1, scoring='accuracy')
MLP.fit(transformed_data, labels, groups=groups)

print(f"Best results: {MLP.best_score_}")


### <a id='rf'></a>Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

params = [
    {
        'n_estimators': [800, 1600],
        'n_jobs': [-1],
        'max_features': [8, 10, 'sqrt']  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    },
]

# Classifier trained with dataset One
forest = GridSearchCV(RandomForestClassifier(), params, n_jobs=-1, scoring='accuracy')
forest.fit(transformed_data, labels, groups=groups)


print(f"Best results: {forest.best_score_}")
