# DATASET CREATION
*Creating a ``pandas`` DataFrame containing all the songs listed in [Songs used in commercials](https://www.songfacts.com/category/songs-used-in-commercials), together with their Spotify IDs and their acoustic features.*

### Importing Libraries and Connecting to *Spotify API*

*Importing ``pandas``, ``numpy`` and the ``spotify.py`` module used to access the* Spotify *API.*

In [None]:
from spotify import SpotifyAPI
import pandas as pd
import numpy as np

*Accessing* Spotify API *with user credentials.*

In [None]:
client_id = '_CLIENT_ID_' # Add your client ID here
client_secret = '_SECRET_ID_' # Add your client secret here
spoti = SpotifyAPI(client_id, client_secret)

## 1. Reading Songs and Collecting Their Spotify ID

### 1.1 Reading Songs

*Reading songs from ``songs_from_songfacts.txt`` and building a clean Python list of ``[title, artist]`` pairs.*

In [None]:
with open('songs_from_songfacts.txt') as f:
    data = f.readlines()
# Each line is expected to look like 'Title - Artist'; blank lines are skipped
data = [line.strip().split(' - ') for line in data if line.strip()]
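
*A line whose title or artist itself contains ``' - '`` splits into more than two parts, which later breaks the three-column DataFrame. The sketch below is one possible way to handle that case; it assumes the artist follows the last separator on the line. ``parse_song_line`` and the sample lines are hypothetical and not part of the original notebook.*

```python
def parse_song_line(line):
    """Split a 'Title - Artist' line into [title, artist].

    Assumption: the artist is whatever follows the last ' - ' on the line.
    """
    text = line.strip()
    title, sep, artist = text.rpartition(' - ')
    if not sep:
        # No separator found: keep the whole line as the title
        return [text, '']
    return [title, artist]

# Made-up sample lines, for illustration only
sample = ['Song A - Artist A\n', 'Song B - Live - Artist B\n']
parsed = [parse_song_line(line) for line in sample]
print(parsed)
```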

### 1.2 Retrieving *Spotify* IDs

*Retrieving* Spotify *IDs for each song by searching title and artist.*

In [None]:
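for el in data:
    title = el[0]
    artist = el[1]
    res = spoti.search(title + ' ' + artist) # search func returns a result dict
    try:
        # Take the first (best-ranked) track of the search result
        song = res['tracks']['items'][0]
        el.append(song['id'])
    except (KeyError, IndexError, TypeError):
        # No usable result: store an empty ID so every row keeps three fields
        el.append('')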
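
*The lookup of the first hit can be isolated in a small helper, which makes the empty-result case easy to try without calling the API. This is a minimal sketch: ``first_track_id`` is a hypothetical name, and the dictionaries below are made-up stand-ins that only mimic the ``res['tracks']['items']`` layout used in the loop above.*

```python
def first_track_id(res):
    """Return the ID of the first track in a search result dict, or '' if there is none."""
    try:
        return res['tracks']['items'][0]['id']
    except (KeyError, IndexError, TypeError):
        # Missing keys, an empty item list, or a None response
        return ''

found = {'tracks': {'items': [{'id': 'abc123'}]}}  # made-up result with one hit
empty = {'tracks': {'items': []}}                  # made-up result with no hits
print(repr(first_track_id(found)), repr(first_track_id(empty)))
```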

### 1.3 DataFrame Creation and Exporting

*Creating a ``pandas`` DataFrame.*

In [None]:
songs_df = pd.DataFrame(data, columns = ['TITLE', 'ARTIST', 'ID'])

*A manual check showed that some of the retrieved IDs were wrong. Each ID in ``wrong_ids`` is replaced with the ID at the same position in ``right_ids``.*

In [None]:
wrong_ids = [
        '6I0tz1wUfI3ibXHhjBdsv1',
        '3RTTs0HDk9rX2jqQa1BM1m',
        '7rgEfw0lhGuUM2Yrk950fJ',
        '50WEQ7sN1QT8xOEBSkLCgV',
        '4QFoJ4MirBaeDulxhifTEU',
        '5Tl3mhKpEAmso5QitemwJn',
        '5ATmWQHi5cAZgqBUzf8qS9',
        '57kR5SniQIbsbVoIjjOUDa'
        ]

right_ids = [
        '7bTlID6vzqECC5Vq61mysd',
        '5vVZaiK2mIL9WE1GWikOE6',
        '0H1GWNCOu7BcrxufcFgo9Z',
        '0DbcdVCzmY1IjilPTPaSOe',
        '2razPef7w7IoxNvG8plxsC',
        '0iC9wa4ARcVk0oV1zXpwjv',
        '3BfYDQC9SKBN1QCWVGb7C4',
        '7K6J5sDjzyON20FjMCW92Z'
        ]

songs_df['ID'] = songs_df['ID'].replace(wrong_ids, right_ids)
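
*Because the two lists are matched by position, a quick consistency check before replacing can catch a missing or duplicated entry. The sketch below is only an illustration: ``build_id_mapping`` is a hypothetical helper and the IDs are placeholders.*

```python
def build_id_mapping(wrong, right):
    """Pair each wrong ID with its replacement after checking the lists are consistent."""
    if len(wrong) != len(right):
        raise ValueError('wrong and right ID lists differ in length')
    if len(set(wrong)) != len(wrong):
        raise ValueError('duplicate entries among the wrong IDs')
    return dict(zip(wrong, right))

# Placeholder IDs, for illustration only
mapping = build_id_mapping(['id_a', 'id_b'], ['id_x', 'id_y'])
print(mapping)
```

*``Series.replace`` also accepts a dictionary, so ``songs_df['ID'].replace(mapping)`` would be an equivalent way to apply the corrections.*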

*Exporting the DataFrame as a ``csv`` file.*

In [None]:
songs_df.to_csv('songs.csv', index = False)

## 2. Extracting Acoustic Features

*Loading ``songs.csv`` from the working directory.*

In [None]:
songs_df = pd.read_csv('songs.csv')

### 2.1 Defining Features

*Defining the list of features to retrieve and creating a Python dictionary with an empty list for each feature.*

In [None]:
features = ['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
feat_dict = {f:[] for f in features}

### 2.2 Retrieving Features

*Populating the feature dictionary with values retrieved from the* Spotify API.\
***Note:*** *``max_ids`` is the maximum number of IDs that* Spotify *accepts in a single request (i.e. 100), so the IDs are sent in batches of at most that size.*

In [None]:
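IDs = list(songs_df['ID'])
max_ids = 100
# Split points at every multiple of max_ids below len(IDs), so no empty batch is produced
cuts = list(range(max_ids, len(IDs), max_ids))
IDs = np.split(IDs, cuts)

for idx in IDs:
    idx = idx.tolist()
    res = spoti.get_features(idx)['audio_features']
    for feat in features:
        for song in res:
            # An entry may be None when an ID cannot be resolved; store NaN to keep rows aligned
            feat_dict[feat].append(song[feat] if song else np.nan)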
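
*The batching step can also be written with plain list slicing. The sketch below shows the idea on placeholder IDs; ``chunk`` is a hypothetical helper, not part of the original notebook or of ``spotify.py``.*

```python
def chunk(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

ids = [f'id_{i}' for i in range(250)]  # placeholder IDs
batches = list(chunk(ids, 100))
print([len(b) for b in batches])  # [100, 100, 50]
```

*Slicing never yields an empty trailing batch, even when the number of IDs is an exact multiple of the batch size.*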

### 2.3 Updating DataFrame and Exporting

In [None]:
feat_df = pd.DataFrame(feat_dict, columns = features)
feat_df = pd.concat([songs_df, feat_df], axis = 1)
feat_df.to_csv('features.csv', index = False)