<div style="text-align: center; background-color: #750E21; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  FINAL PROJECT: RESEARCHING MUSIC TASTE WORLDWIDE 📌
</div>

<div style="text-align: center; background-color: #0766AD; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Stage 01 - Collecting Data (Extension)📌
</div>

## **PURPOSE** 🚀

🌟 While exploring Spotify's data, we found an interesting set of per-track attributes called `audio_features`.\
🌟 Hence, in this notebook, we continue collecting data with the `Spotify API`.\
🌟 We intend to use this data in the model-building stage. We believe it will give meaningful insight into the musical features that succeed in the music market, and thus help in writing songs that score high in both popularity and core quality.

## **IMPORT LIBRARY** 🎄

In [1]:
import pickle
import pandas as pd
import time
import retry
import spotipy
import threading
from concurrent.futures import ThreadPoolExecutor
from spotipy.oauth2 import SpotifyOAuth
from spotipy.exceptions import SpotifyException


<div style="text-align: left; font-family: 'Trebuchet MS', Arial, sans-serif; color: #FF90BC; padding: 20px; font-size: 38px; font-weight: bold; border-radius: 0 0 0 0">
  STEP 1: Get Song Id on Spotify with Spotipy 🔥
</div>

🔴 Create the API credentials used to call the service through the `spotipy` library.

In [2]:
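import os

# Read the credentials from environment variables rather than hard-coding secrets in the notebook.
SPOTIPY_CLIENT_ID = os.environ['SPOTIPY_CLIENT_ID']
SPOTIPY_CLIENT_SECRET = os.environ['SPOTIPY_CLIENT_SECRET']
SPOTIPY_REDIRECT_URI = 'http://localhost:8888/callback'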

In [3]:
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=SPOTIPY_CLIENT_ID,
                                                        client_secret=SPOTIPY_CLIENT_SECRET,
                                                        redirect_uri=SPOTIPY_REDIRECT_URI,
                                                        scope='user-library-read'))

🟠 In the previous notebook, we created files that record the `song title` and `artist name` of each song. These help us look up each song's `id` on Spotify before extracting its audio features.

In [4]:
spotify_data = pd.read_csv('../data/raw/raw_spotify_data.csv')
kworb_data = pd.read_csv('../data/raw/raw_kworb_data.csv')

song_name = spotify_data['Song Name'].tolist()
artist_name = spotify_data['Artist Name'].tolist()
youtube_title = kworb_data['Title'].tolist()

🟡 We write a function that gets the `track id` by calling the Spotify API, then run it over all songs.

In [None]:
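def get_track_id(song_name, artist_name, youtube_title):
    """Return the Spotify id of the first match for a song, or None if not found."""
    try:
        # First try a structured query on artist and track name.
        search_query = f'artist:{artist_name} track:{song_name}'
        result = sp.search(search_query)
        return result['tracks']['items'][0]['id']
    except Exception:
        try:
            # Fall back to a free-text search on the YouTube title.
            result = sp.search(youtube_title)
            return result['tracks']['items'][0]['id']
        except Exception as e:
            print(f'Error: {e}')
            print(f'Could not find {song_name} by {artist_name}')
            return None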


track_ids = []
with ThreadPoolExecutor(max_workers=10) as executors:
    futures = [executors.submit(get_track_id, song_name[i], artist_name[i], youtube_title[i]) for i in range(len(song_name))]
    for future in futures:
        track_ids.append(future.result())
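
The lookup above indexes `result['tracks']['items'][0]` directly, so an empty search result only shows up as an `IndexError` caught by the `except`. A minimal sketch of a more explicit alternative, assuming the search response keeps the `tracks` → `items` → `id` shape used above. `first_track_id` is a hypothetical helper name, not part of `spotipy`, and the two responses are stand-ins rather than real API output.

```python
def first_track_id(result):
    """Return the id of the first track in a search response, or None.

    Assumes a dict shaped like {'tracks': {'items': [{'id': ...}, ...]}},
    which is the shape get_track_id relies on.
    """
    items = (result or {}).get('tracks', {}).get('items', [])
    return items[0]['id'] if items else None


# Stand-in responses (not real API output) covering both cases.
found = {'tracks': {'items': [{'id': '03UrZgTINDqvnUMbbIMhql'}]}}
empty = {'tracks': {'items': []}}

print(first_track_id(found))  # id of the first item
print(first_track_id(empty))  # None, because no track matched
```

With a helper like this, the fallback to the YouTube title could be triggered by a plain `None` check instead of an exception.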

🟢 Now we save the ids to a CSV file for later use. Requests to the Spotify API are rate-limited per API key, so if the kernel crashes and restarts, we don't have to request the `id`s again.\
🟢 Saved file name: `id_features.csv`.

In [25]:
audio_features = pd.DataFrame(track_ids)
audio_features.to_csv("../data/raw/id_features.csv", index=False)
audio_features

Unnamed: 0,0
0,03UrZgTINDqvnUMbbIMhql
1,2tpWsVSb9UEmDRxAl1zhX1
2,7qiZfU4dY1lWllzX7mPBI3
3,34gCuhDGsG4bRPIf9bb02f
4,6F5c58TMEs1byxUstkzVeM
...,...
2495,0mjwunHon4ve2wM4wPyZC4
2496,1DDbHdoDBDDk4ftBXrV0KP
2497,6I3mqTwhRpn34SLVafSH7G
2498,5oPcQjaNQjqCW7CV4SjDDt
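
The idea of saving the ids so that a restarted kernel does not repeat the requests can be sketched as a small "load or build" helper. This is only an illustration under our own assumptions: `load_or_build_ids` and `fake_build` are hypothetical names, the single `id` column header is made up, and the sketch uses the standard `csv` module and a temporary folder so that it runs without the project data.

```python
import csv
import tempfile
from pathlib import Path


def load_or_build_ids(path, build_ids):
    """Return cached ids from `path` if it exists, else build and cache them."""
    path = Path(path)
    if path.exists():
        with path.open(newline='') as f:
            rows = list(csv.reader(f))
        # Skip the header row; an empty cell marks a failed lookup.
        return [row[0] or None for row in rows[1:]]

    ids = build_ids()  # the expensive step, e.g. the threaded search
    with path.open('w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id'])
        writer.writerows([[track_id or ''] for track_id in ids])
    return ids


calls = []

def fake_build():
    """Stand-in for the API lookups; records how often it is called."""
    calls.append(1)
    return ['a1', None, 'c3']

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'id_features.csv'
    first = load_or_build_ids(cache, fake_build)   # builds and writes the file
    second = load_or_build_ids(cache, fake_build)  # reads the cached file

print(first, second, len(calls))
```

The second call never reaches `fake_build`, which is the behaviour we want after a kernel restart.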


<div style="text-align: left; font-family: 'Trebuchet MS', Arial, sans-serif; color: #FF90BC; padding: 20px; font-size: 38px; font-weight: bold; border-radius: 0 0 0 0">
  STEP 2: Get the Song Features of each song 🔥
</div>

🔵 We get the `song id`s by reading the file saved in the previous step.\
🔵 Then we convert the column of `song id`s into a list for later use.

In [4]:
audio_features_id = pd.read_csv("../data/raw/id_features.csv")
track_ids = audio_features_id.iloc[:,0]


In [7]:
track_ids = track_ids.to_list()
type(track_ids)


list

🟤 In the following code, we write the function for extracting audio features, named `get_audio_features`.\
🟤 Because the number of samples is quite large, fetching them all requires many requests in a short time, which leads to errors. Hence, we decide to divide the data into batches of 500.

In [5]:
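def get_audio_features(track_id):
    """Return the audio-features dict for one track, or None on failure."""
    try:
        return sp.audio_features(track_id)[0]
    except Exception as e:
        print(f'Error: {e}')
        return None


def audio_features_with_batch(track_ids, start, batch_size):
    """Fetch audio features for track_ids[start:start + batch_size]."""
    # Slicing stops at the end of the list, so a short final batch cannot raise IndexError.
    return [get_audio_features(track_id)
            for track_id in track_ids[start:start + batch_size]]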
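
Since `get_audio_features` sends one request per track, a batch of 500 means 500 calls. A possible way to reduce this, assuming `sp.audio_features` is given a list of ids (Spotify's Web API documents a cap of 100 ids per request for this endpoint), is to send the ids in chunks. The sketch below is not the method used in this notebook: `fetch_in_chunks` and `fake_fetch` are hypothetical names, and the fetch function is injected so the example runs offline.

```python
import time


def fetch_in_chunks(track_ids, fetch, chunk_size=100, pause=0.0):
    """Call `fetch` on consecutive chunks of ids and flatten the results.

    `fetch` is any callable that takes a list of ids and returns one
    result per id (in the notebook this could be `sp.audio_features`).
    """
    results = []
    for start in range(0, len(track_ids), chunk_size):
        chunk = track_ids[start:start + chunk_size]
        results.extend(fetch(chunk))
        if pause and start + chunk_size < len(track_ids):
            time.sleep(pause)  # space out requests between chunks
    return results


# Stand-in for the API call; it records the size of every request.
requests_made = []

def fake_fetch(ids):
    requests_made.append(len(ids))
    return [{'id': track_id} for track_id in ids]

features = fetch_in_chunks([f'id{i}' for i in range(250)], fake_fetch)
print(len(features), requests_made)
```

Here 250 ids are covered by three requests instead of 250. Ids that are missing because the search failed would need to be filtered out before a real call.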

🟨 Working on **`BATCH 1: 0-500`**.\
🟨 Save to the file `audio_features_1.csv`.


In [28]:
audio_features_1 = audio_features_with_batch(track_ids, 0, 500)

In [29]:
audio_features_1 = pd.DataFrame(audio_features_1, columns = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness','instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'])
audio_features_1

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.727,0.937,11,-2.871,0,0.2860,0.004170,0.000000,0.0910,0.749,132.067,audio_features,03UrZgTINDqvnUMbbIMhql,spotify:track:03UrZgTINDqvnUMbbIMhql,https://api.spotify.com/v1/tracks/03UrZgTINDqv...,https://api.spotify.com/v1/audio-analysis/03Ur...,219493,4
1,0.664,0.705,1,-4.972,0,0.0382,0.065400,0.000000,0.1180,0.477,122.016,audio_features,2tpWsVSb9UEmDRxAl1zhX1,spotify:track:2tpWsVSb9UEmDRxAl1zhX1,https://api.spotify.com/v1/tracks/2tpWsVSb9UEm...,https://api.spotify.com/v1/audio-analysis/2tpW...,257267,4
2,0.825,0.652,1,-3.183,0,0.0802,0.581000,0.000000,0.0931,0.931,95.977,audio_features,7qiZfU4dY1lWllzX7mPBI3,spotify:track:7qiZfU4dY1lWllzX7mPBI3,https://api.spotify.com/v1/tracks/7qiZfU4dY1lW...,https://api.spotify.com/v1/audio-analysis/7qiZ...,233713,4
3,0.781,0.445,2,-6.061,1,0.0295,0.474000,0.000000,0.1840,0.591,78.998,audio_features,34gCuhDGsG4bRPIf9bb02f,spotify:track:34gCuhDGsG4bRPIf9bb02f,https://api.spotify.com/v1/tracks/34gCuhDGsG4b...,https://api.spotify.com/v1/audio-analysis/34gC...,281560,4
4,0.554,0.772,7,-4.821,0,0.0418,0.004870,0.000007,0.3540,0.455,179.984,audio_features,6F5c58TMEs1byxUstkzVeM,spotify:track:6F5c58TMEs1byxUstkzVeM,https://api.spotify.com/v1/tracks/6F5c58TMEs1b...,https://api.spotify.com/v1/audio-analysis/6F5c...,223546,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0.697,0.648,7,-7.123,1,0.1280,0.425000,0.000000,0.1910,0.313,186.125,audio_features,6DPp7xKJ4WS71HldGA038M,spotify:track:6DPp7xKJ4WS71HldGA038M,https://api.spotify.com/v1/tracks/6DPp7xKJ4WS7...,https://api.spotify.com/v1/audio-analysis/6DPp...,199813,4
496,0.949,0.661,5,-4.244,0,0.0572,0.030200,0.000000,0.0454,0.760,104.504,audio_features,3yfqSUWxFvZELEM4PmlwIR,spotify:track:3yfqSUWxFvZELEM4PmlwIR,https://api.spotify.com/v1/tracks/3yfqSUWxFvZE...,https://api.spotify.com/v1/audio-analysis/3yfq...,284200,4
497,0.765,0.339,8,-8.965,1,0.0365,0.278000,0.000000,0.1310,0.864,123.950,audio_features,1l77YWrGUp3qX3NS1rz7lq,spotify:track:1l77YWrGUp3qX3NS1rz7lq,https://api.spotify.com/v1/tracks/1l77YWrGUp3q...,https://api.spotify.com/v1/audio-analysis/1l77...,202560,4
498,0.726,0.652,1,-7.764,0,0.1100,0.004510,0.000033,0.2260,0.803,163.879,audio_features,6HGoVbCUr63SgU3TjxEVj6,spotify:track:6HGoVbCUr63SgU3TjxEVj6,https://api.spotify.com/v1/tracks/6HGoVbCUr63S...,https://api.spotify.com/v1/audio-analysis/6HGo...,180056,4


In [30]:
audio_features_1.to_csv("../data/raw/audio_features_1.csv", index=False)

🟩 Working on **`BATCH 2: 500-1000`**.\
🟩 Save to the file `audio_features_2.csv`.

In [7]:
audio_features_2 = audio_features_with_batch(track_ids, 500, 500)

In [9]:
audio_features_2 = pd.DataFrame(audio_features_2, columns = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness','instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'])
audio_features_2.to_csv("../data/raw/audio_features_2.csv", index=False)

🟦 Working on **`BATCH 3: 1000-1500`**.\
🟦 Save to the file `audio_features_3.csv`.

In [10]:
audio_features_3 = audio_features_with_batch(track_ids, 1000, 500)

In [11]:
audio_features_3 = pd.DataFrame(audio_features_3, columns = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness','instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'])
audio_features_3.to_csv("../data/raw/audio_features_3.csv", index=False)

🟪 Working on **`BATCH 4: 1500-2000`**.\
🟪 Save to the file `audio_features_4.csv`.

In [6]:
audio_features_4 = audio_features_with_batch(track_ids, 1500, 500)

In [7]:
audio_features_4 = pd.DataFrame(audio_features_4, columns = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness','instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'])
audio_features_4.to_csv("../data/raw/audio_features_4.csv", index=False)

🟫 Working on **`BATCH 5: 2000-2500`**.\
🟫 Save to the file `audio_features_5.csv`.

In [8]:
audio_features_5 = audio_features_with_batch(track_ids, 2000, 500)

In [9]:
audio_features_5 = pd.DataFrame(audio_features_5, columns = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness','instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'])
audio_features_5.to_csv("../data/raw/audio_features_5.csv", index=False)
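
The five batches above repeat the same two cells with different start positions and file names. As a rough sketch, the ranges and file names could be generated in a loop. `batch_plan` is a hypothetical helper, and the total of 2500 is taken from the five ranges above; the actual API calls are left as comments so the example runs on its own.

```python
def batch_plan(total, batch_size=500):
    """Return (batch_number, start, stop) tuples covering `total` items."""
    plan = []
    for number, start in enumerate(range(0, total, batch_size), start=1):
        stop = min(start + batch_size, total)  # the last batch may be shorter
        plan.append((number, start, stop))
    return plan


for number, start, stop in batch_plan(2500):
    filename = f'../data/raw/audio_features_{number}.csv'
    print(number, start, stop, filename)
    # In the notebook, each step would be roughly:
    #   rows = audio_features_with_batch(track_ids, start, stop - start)
    #   pd.DataFrame(rows, columns=...).to_csv(filename, index=False)
```

Running the batches one cell at a time, as we did, still has the advantage that a failed batch can be retried by hand without touching the others.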

<div style="text-align: left; font-family: 'Trebuchet MS', Arial, sans-serif; color: #FF90BC; padding: 20px; font-size: 38px; font-weight: bold; border-radius: 0 0 0 0">
  STEP 3: Gather data from 5 batches. 🔥
</div>

🔶🔶🔶 We gather the data saved from the 5 batches and concatenate it.\
🔶🔶🔶 The final file produced is named `audio_features.csv`.\
🔶🔶🔶 This file is used in later phases.

In [11]:
auf_1 = pd.read_csv("../data/raw/audio_features_1.csv")
auf_2 = pd.read_csv("../data/raw/audio_features_2.csv")
auf_3 = pd.read_csv("../data/raw/audio_features_3.csv")
auf_4 = pd.read_csv("../data/raw/audio_features_4.csv")
auf_5 = pd.read_csv("../data/raw/audio_features_5.csv")

In [15]:
# Concatenate all audio features
audio_features = pd.concat([auf_1, auf_2, auf_3, auf_4, auf_5])
audio_features.to_csv("../data/raw/audio_features.csv", index=False)
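
A small illustration of what the concatenation does, using two made-up frames in place of the five batch files (the ids and tempo values are invented for the example). It also shows two optional checks we could add: `ignore_index=True` renumbers the rows of the combined frame, and counting missing ids shows how many tracks failed to return features.

```python
import pandas as pd

# Tiny stand-in frames (made-up values) in place of the five batch files.
part_1 = pd.DataFrame({'id': ['a', 'b'], 'tempo': [120.0, 95.5]})
part_2 = pd.DataFrame({'id': ['c', None], 'tempo': [101.0, None]})

# ignore_index=True renumbers rows 0..n-1 instead of restarting at 0 per batch.
combined = pd.concat([part_1, part_2], ignore_index=True)

# Rows where the request failed have no id; count them before modelling.
missing = int(combined['id'].isna().sum())
print(len(combined), missing, list(combined.index))
```

Because the notebook saves with `index=False`, the repeated index does not reach `audio_features.csv`; it only matters when the in-memory frame is reused.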