# Extracting Features From the Original Dataset

This notebook extracts additional features for the songs in the original dataset, which on its own gives us limited information about them. We use the "ari.py" script to fetch a set of features for each song, including the popularity of both the artist and the song itself, as well as the genres.

In [1]:
#Import from the other file
from scripts.ari import ari_to_features
import pandas as pd
from tqdm import tqdm
import re

In [2]:
#Load the raw_data from the repo
dataPath = '../data/dataset.csv'
df = pd.read_csv(dataPath)
df.head(5)
df.columns

Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
       'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre'],
      dtype='object')

In [None]:
# #Edit the track-uris to a more usable format
# df["track_uri"] = df["track_uri"].apply(lambda x: re.findall(r'\w+$', x)[0])
# df["track_uri"]
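
The commented-out cell above refers to a `track_uri` column, which the dataset loaded here does not have (it already provides bare IDs in `track_id`). As a minimal sketch of what that regex does, assuming URIs of the form `spotify:track:<id>`; `uri_to_id` is a hypothetical helper name, not part of the notebook:

```python
import re

def uri_to_id(uri: str) -> str:
    """Return the trailing run of word characters, i.e. the bare track ID."""
    match = re.search(r"\w+$", uri)
    if match is None:
        raise ValueError(f"no trailing ID found in {uri!r}")
    return match.group(0)

# Example URI built from an ID that appears later in this notebook
example_id = uri_to_id("spotify:track:1G20khK4puqSUcMUhQ3m9U")
print(example_id)
```

A string that is already a bare ID passes through unchanged, so the helper would be safe to apply to a column that mixes both forms.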

In [3]:
testDF = df
#feature = ari_to_features(df["track_uri"])
#feature_df_test = pd.DataFrame(feature)
#feature_df_test.head()

## Included Features

The code cell below gives an example of the features extracted from each track, showing the kind of information that is used to cluster the data further on.

In [None]:
#Test the feature extraction script, and display features
ari_to_features(df["track_id"][:5])

## Extraction

Below, we extract features for each track using the Spotify API and the track ID. Because this process has an extremely long runtime, it is done in batches, and we build a DataFrame containing the results.

We only take the first 10,000 songs. The cell below covers unique track IDs 5000 to 9999, split into five batches of 1,000:

In [4]:
# full 114k songs
# first_half = df["track_id"].unique()[:19000]
# second_half = df["track_id"].unique()[19000:38000]
# third_half = df["track_id"].unique()[38000:57000]
# fourth_half = df["track_id"].unique()[57000:76000]
# fifth_half = df["track_id"].unique()[76000:95000]
# sixth_half = df["track_id"].unique()[95000:]
# dataLIST = [first_half,second_half,third_half,fourth_half,fifth_half,sixth_half]
# Compute the unique IDs once instead of once per slice
unique_ids = df["track_id"].unique()
first_half = unique_ids[5000:6000]
second_half = unique_ids[6000:7000]
third_half = unique_ids[7000:8000]
fourth_half = unique_ids[8000:9000]
fifth_half = unique_ids[9000:10000]
dataLIST = [first_half, second_half, third_half, fourth_half, fifth_half]
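
The slices above all follow one pattern, so the batching could also be written generically. A minimal sketch, assuming the same 1,000-ID batch size; `make_batches` is a hypothetical helper, and the toy IDs stand in for `df["track_id"].unique()`:

```python
import numpy as np

def make_batches(ids, start, stop, batch_size):
    """Split ids[start:stop] into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    window = ids[start:stop]
    return [window[i:i + batch_size] for i in range(0, len(window), batch_size)]

# Toy stand-in for df["track_id"].unique()
toy_ids = np.array([f"id{i}" for i in range(10_000)])
batches = make_batches(toy_ids, start=5000, stop=10_000, batch_size=1000)
print(len(batches), len(batches[0]), batches[0][0], batches[-1][-1])
```

Changing `start`, `stop`, or `batch_size` then replaces editing several slice bounds by hand.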

In [5]:
featureLIST = []

In [None]:
try:
    featureLIST.append(ari_to_features(first_half))
except Exception as e:
    print(f"Error processing track ID: {e}")

In [6]:
# Split `fifth_half` into chunks of 100 IDs each
chunks = [fifth_half[i:i+100] for i in range(0, len(fifth_half), 100)]
# Call `ari_to_features` for each chunk of 100 IDs
for chunk in tqdm(chunks):
    try:
        featureLIST.extend(ari_to_features(chunk))
    except Exception as e:
        print(f"An error occurred: {e}")


100%|██████████| 8/8 [11:09<00:00, 83.72s/it]
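
The loop above prints an error and moves on, so the IDs in a failed chunk are not recorded anywhere. A minimal sketch of one way to keep track of them for a later retry; `extract_in_chunks` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `ari_to_features`, which is assumed here to return one record per ID:

```python
def extract_in_chunks(ids, fetch, chunk_size=100):
    """Call fetch on each chunk; return (records, failed_chunks)."""
    records, failed = [], []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        try:
            records.extend(fetch(chunk))
        except Exception as exc:  # keep going, but remember what failed
            print(f"Chunk starting at {start} failed: {exc}")
            failed.append(chunk)
    return records, failed

def fake_fetch(chunk):
    """Stand-in for ari_to_features: fails on any chunk containing 'bad'."""
    if "bad" in chunk:
        raise RuntimeError("simulated API error")
    return [{"id": track_id} for track_id in chunk]

ids = [f"id{i}" for i in range(250)]
ids[120] = "bad"
records, failed = extract_in_chunks(ids, fake_fetch, chunk_size=100)
print(len(records), len(failed))
```

The chunks collected in `failed` could be passed through the same function again once the cause of the error (for example, a rate limit) has cleared.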


In [None]:
for i in tqdm(dataLIST[1]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:
        # Skip tracks whose features could not be fetched
        continue

In [None]:
for i in tqdm(dataLIST[2]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:
        # Skip tracks whose features could not be fetched
        continue

In [7]:
#Preview the DataFrame
featureDF = pd.DataFrame(featureLIST)
featureDF

| | id | artist_pop | artist_name | artist_uri | track_name | album_uri | album_name | image_uri | track_pop | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1G20khK4puqSUcMUhQ3m9U | 75 | Ludovico Einaudi | spotify:artist:2uFUBdaVGtyMqckSeCl0Qj | Le Onde | spotify:album:0jMze9qUrWL2ZCy3FQlYRi | Sleep: 111 Pieces Of Classical Music For Bedtime | https://i.scdn.co/image/ab67616d0000b2734b4e05... | 0 | compositional_ambient neo-classical |
| 1 | 7wP9WrSAnNgJqS48KRPyWs | 51 | Dustin O'Halloran | spotify:artist:6UEYawMcp2M4JFoXVOtZEq | Departures n1 | spotify:album:7eqmPHPSp9xMENCaHAmt7l | Like Crazy (Original Motion Picture Score) | https://i.scdn.co/image/ab67616d0000b273eea9df... | 20 | compositional_ambient |
| 2 | 5NJiqDbIe4bb0BIkQzkJu3 | 58 | Alexis Ffrench | spotify:artist:58R31AvN8JMHM7xkNpVLjX | Viva Vida Amor - Solo Piano Version | spotify:album:5EMB8WAOcT4BqrRZFsW0aO | Truth - The Solo Piano Collection | https://i.scdn.co/image/ab67616d0000b273905c5a... | 26 | neo-classical |
| 3 | 4cmCWVR8RlLYyASMarxU0e | 42 | Karen Dalton | spotify:artist:5O5V29YvM6AzAQ0rNt59fy | Something on Your Mind | spotify:album:36WryGURqME9Y2URzzzmio | In My Own Time (50th Anniversary Edition) | https://i.scdn.co/image/ab67616d0000b273fc2f52... | 52 | ambient_folk drone_folk folk native_american_t... |
| 4 | 5KwLjZ0oJ5kNl7jGtdiIOC | 62 | Jóhann Jóhannsson | spotify:artist:3IpQziA6YwD53PQ5xbwgLF | The Beast - From "Sicario" | spotify:album:04FRFSqcTfN9zfmFfzhbHn | Sicario (Original Motion Picture Soundtrack) | https://i.scdn.co/image/ab67616d0000b27397877b... | 45 | compositional_ambient experimental_classical i... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 4QYGss1JbmGTxte5nA4JNX | 75 | Mrs. GREEN APPLE | spotify:artist:4QvgGvpgzgyUOo8Yp8LDm9 | 私 | spotify:album:4seExhof6lZ2yg5dZfengb | TWELVE | https://i.scdn.co/image/ab67616d0000b273566489... | 46 | anime_rock j-pop j-rock |
| 796 | 7pqcJ1n6HiGx7vOiDcb1Ml | 46 | Adrian Barba | spotify:artist:5KK1FO30lzYPqnPYyS9bu5 | Sola Nunca Estarás | spotify:album:2b16yWqku8m9us1Y2zkEPI | Somos Una Sola Mente | https://i.scdn.co/image/ab67616d0000b273ce69fe... | 32 | anime_latino |
| 797 | 0ijeNNh5BfPrZb2mPIUuR2 | 36 | The Covers Duo | spotify:artist:0vlbXMsO1PRqmfJv5tAJ8G | Atraparlos Ya! (Pokemon) | spotify:album:2MWOz473PC86eJ4uasKHHK | Anime Openings 1 | https://i.scdn.co/image/ab67616d0000b273fab27a... | 35 | anime_latino |
| 798 | 2ccfTK4zy5LEZWsWmdPush | 44 | Doblecero | spotify:artist:6qqvdLm9ZVjxCgHxGcL5ZW | Rap de Buda | spotify:album:2b2J2SRfyGCgNsU2KEI8gB | Rap de Buda | https://i.scdn.co/image/ab67616d0000b2732af3f6... | 33 | latin_viral_rap rap_anime |


## Finalising and Export

Finally, we merge the feature DataFrame with the original dataset on the track ID, since the original also contains useful information, such as the artist name and track name. The result is then exported as our processed data.

In [8]:
new_df = pd.merge(testDF, featureDF, left_on="track_id", right_on="id")
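
`pd.merge` defaults to an inner join, so tracks without extracted features drop out of `new_df` without warning. A minimal sketch of how the unmatched rows could be inspected with the `indicator` option; the two toy frames are made up and only mimic the `track_id` and `id` key columns:

```python
import pandas as pd

# Toy stand-ins for testDF and featureDF
tracks = pd.DataFrame({"track_id": ["a", "b", "c"], "tempo": [120.0, 98.5, 140.2]})
features = pd.DataFrame({"id": ["a", "c"], "artist_pop": [75, 51]})

# A left join keeps every original row; `_merge` records where each row matched
checked = pd.merge(tracks, features, how="left",
                   left_on="track_id", right_on="id", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "track_id"].tolist()
print(missing)

# Keeping only matched rows reproduces the inner-join result
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(len(matched))
```

The IDs in `missing` are the ones that would need another extraction pass before the export is complete.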

Write to processed_data.csv for the first time:

In [None]:
new_df.to_csv('../data/processed_data.csv')

Append to processed_data.csv on subsequent runs:

In [9]:
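# Continue the row index after the rows already in the file.
# The offset 5965 is hard-coded; it presumably equals the number of rows written by earlier runs.
new_df.index = pd.Index(range(5965, 5965 + len(new_df)))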

In [10]:
new_df.to_csv('../data/processed_data.csv', mode='a', header=False, index=True)
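
The two export cells above differ only in whether the file already exists, and the append cell relies on a hard-coded index offset. A minimal sketch of a single helper that covers both cases; `append_csv` is a hypothetical name, the index is dropped rather than renumbered (a change from the cells above), and the demo writes to a temporary directory instead of `../data/processed_data.csv`:

```python
import os
import tempfile
import pandas as pd

def append_csv(frame: pd.DataFrame, path: str) -> None:
    """Create the CSV with a header on first use, then append rows without one."""
    file_exists = os.path.exists(path)
    frame.to_csv(path, mode="a" if file_exists else "w",
                 header=not file_exists, index=False)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "processed_data.csv")
    # First call creates the file, second call appends to it
    append_csv(pd.DataFrame({"id": ["a", "b"], "track_pop": [0, 20]}), demo_path)
    append_csv(pd.DataFrame({"id": ["c"], "track_pop": [26]}), demo_path)
    combined = pd.read_csv(demo_path)

print(combined.shape)
```

Appending this way assumes every batch has the same columns in the same order, since no header is written after the first call.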