# Extracting Features From the Original Dataset

This notebook extracts features from the original dataset, which on its own gives only limited information about each song. We use the "ari.py" script to extract a set of audio features for each song, together with the popularity of the artist, the popularity of the song itself, and its genres.

In [1]:
# Import the feature-extraction helper from scripts/ari.py
from scripts.ari import ari_to_features
import pandas as pd
from tqdm import tqdm
import re

In [2]:
#Load the raw_data from the repo
dataPath = '../data/raw_data.csv'
df = pd.read_csv(dataPath)
df.head()

Unnamed: 0.1,Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name,name
0,0,0,Missy Elliott,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks
1,1,1,Britney Spears,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Toxic,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800,In The Zone,Throwbacks
2,2,2,Beyoncé,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Crazy In Love,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933,Dangerously In Love (Alben für die Ewigkeit),Throwbacks
3,3,3,Justin Timberlake,spotify:track:1AWQoqb9bSvzTjaLralEkT,spotify:artist:31TPClRtHm23RisEBtV3X7,Rock Your Body,spotify:album:6QPkyl04rXwTGlGlcYaRoW,267266,Justified,Throwbacks
4,4,4,Shaggy,spotify:track:1lzr43nnXAijIGYnCT8M8H,spotify:artist:5EvFsr3kj42KNv97ZEnqij,It Wasn't Me,spotify:album:6NmFmPX56pcLBOFMhIiKvF,227600,Hot Shot,Throwbacks


In [3]:
# Strip the "spotify:track:" prefix, keeping only the track ID at the end of each URI
df["track_uri"] = df["track_uri"].apply(lambda x: re.findall(r'\w+$', x)[0])
df["track_uri"]

0        0UaMYEvWZi0ZqiDOoHU3YI
1        6I9VzXrHxO9rA9A5euc8Ak
2        0WqIKmW4BTrj3eJFmnCKMv
3        1AWQoqb9bSvzTjaLralEkT
4        1lzr43nnXAijIGYnCT8M8H
                  ...          
67498    5uCax9HTNlzGybIStD3vDh
67499    0P1oO2gREMYUCoOkzYAyFu
67500    2oM4BuruDnEvk59IvIXCwn
67501    4Ri5TTUgjM96tbQZd5Ua7V
67502    5RVuBrXVLptAEbGJdSDzL5
Name: track_uri, Length: 67503, dtype: object
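The regular expression above keeps the final run of word characters in each URI. As a minimal sketch of an equivalent approach, assuming every URI has the `spotify:track:<id>` shape seen in the preview, the ID can also be taken as everything after the last colon. The helper name `uri_to_id` is hypothetical and not part of the original notebook.

```python
def uri_to_id(uri: str) -> str:
    """Return the part of a Spotify URI after the last colon."""
    # rsplit with maxsplit=1 splits once from the right, so the last piece is the ID
    return uri.rsplit(":", 1)[-1]

# Two URIs copied from the preview above
sample_uris = [
    "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI",
    "spotify:track:6I9VzXrHxO9rA9A5euc8Ak",
]
sample_ids = [uri_to_id(u) for u in sample_uris]
print(sample_ids)
```

On the real DataFrame this could be applied with `df["track_uri"].map(uri_to_id)`. Unlike the regex, it does not fail with an `IndexError` when a value contains no word characters; it simply returns the input unchanged if there is no colon.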

In [4]:
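# Keep a handle on the playlist data for the final merge (this is an alias of df, not a copy)
testDF = df
# Earlier experiment, left disabled
#feature = ari_to_features(df["track_uri"])
#feature_df_test = pd.DataFrame(feature)
#feature_df_test.head()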

## Included Features

The cell below runs the extraction script on a single track to show which features are returned. These are the features used later to cluster the data.

In [5]:
#Test the feature extraction script, and display features
ari_to_features(df["track_uri"][0])

{'danceability': 0.904,
 'energy': 0.813,
 'key': 4,
 'loudness': -7.105,
 'mode': 0,
 'speechiness': 0.121,
 'acousticness': 0.0311,
 'instrumentalness': 0.00697,
 'liveness': 0.0471,
 'valence': 0.81,
 'tempo': 125.461,
 'type': 'audio_features',
 'id': '0UaMYEvWZi0ZqiDOoHU3YI',
 'uri': 'spotify:track:0UaMYEvWZi0ZqiDOoHU3YI',
 'track_href': 'https://api.spotify.com/v1/tracks/0UaMYEvWZi0ZqiDOoHU3YI',
 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/0UaMYEvWZi0ZqiDOoHU3YI',
 'duration_ms': 226864,
 'time_signature': 4,
 'artist_pop': 74,
 'genres': 'dance_pop hip_hop hip_pop pop pop_rap r&b rap urban_contemporary virginia_hip_hop',
 'track_pop': 69}
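The returned dictionary mixes numeric values with identifiers and free text. The sketch below shows one way to separate them; it is only an illustration, since which columns are actually used for clustering is decided later and is not stated here. The dictionary is a trimmed copy of the output above.

```python
# A trimmed copy of the dictionary returned for the first track
features = {
    "danceability": 0.904,
    "energy": 0.813,
    "key": 4,
    "type": "audio_features",
    "id": "0UaMYEvWZi0ZqiDOoHU3YI",
    "artist_pop": 74,
    "genres": "dance_pop hip_hop hip_pop pop pop_rap r&b rap urban_contemporary virginia_hip_hop",
    "track_pop": 69,
}

# Numeric values can be fed to a model directly; strings are identifiers or need encoding
numeric = {k: v for k, v in features.items() if isinstance(v, (int, float))}
text = {k: v for k, v in features.items() if isinstance(v, str)}

# Genres arrive as one space-separated string, with underscores inside multi-word names
genre_list = features["genres"].split()

print(sorted(numeric))
print(sorted(text))
print(len(genre_list), "genres")
```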

## Extraction

Below, we extract features for each unique track using the Spotify API and the track's URI. Because the process takes several hours in total, it is run in three batches. The results are collected into a single DataFrame.

In [6]:
# Split the unique track IDs into three batches
unique_uris = df["track_uri"].unique()
first_batch = unique_uris[:10000]
second_batch = unique_uris[10000:20000]
third_batch = unique_uris[20000:]
dataLIST = [first_batch, second_batch, third_batch]
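The slicing above hard-codes the batch boundaries. As a hedged alternative, a small helper can cut any sequence into fixed-size batches. `make_batches` is a hypothetical name, and `range(34443)` merely stands in for the unique track IDs (10000 + 10000 + 14443, the totals shown by the progress bars).

```python
def make_batches(items, size):
    """Split a sequence into consecutive batches of at most `size` items."""
    return [items[start:start + size] for start in range(0, len(items), size)]

# Stand-in for the unique track IDs
stand_in_ids = list(range(34443))
batches = make_batches(stand_in_ids, 10000)
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

Note that this yields four batches rather than the three used in the notebook, because the last 14443 IDs no longer share a single batch.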

In [7]:
featureLIST = []

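# Batch 1: skip any track whose lookup raises an error
for i in tqdm(dataLIST[0]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        continue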


100%|██████████████████████████████████████████████████████████████████████████| 10000/10000 [1:22:58<00:00,  2.01it/s]


In [8]:
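# Batch 2: skip any track whose lookup raises an error
for i in tqdm(dataLIST[1]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        continue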

100%|██████████████████████████████████████████████████████████████████████████| 10000/10000 [1:23:07<00:00,  2.00it/s]


In [9]:
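# Batch 3: skip any track whose lookup raises an error
for i in tqdm(dataLIST[2]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        continue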

  4%|███▎                                                                        | 622/14443 [05:09<3:37:43,  1.06it/s]
HTTP Error for GET to https://api.spotify.com/v1/tracks/656TZlNdVe90zHvmebFt9U with Params: {'market': None} returned 404 due to non existing id
 61%|█████████████████████████████████████████████▊                             | 8830/14443 [1:13:22<18:50,  4.96it/s]
HTTP Error for GET to https://api.spotify.com/v1/tracks/5GiU7GOYjDH2yp7fMf9w9j with Params: {'market': None} returned 404 due to non existing id
100%|██████████████████████████████████████████████████████████████████████████| 14443/14443 [2:00:09<00:00,  2.00it/s]
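The loops above silently skip any track whose lookup fails, so the only record of a failure is the HTTP error printed to the log. A minimal sketch of a loop that also records which IDs failed is given below. `extract_batch` and `fake_extract` are hypothetical; `fake_extract` stands in for `ari_to_features` so the example runs without API credentials.

```python
def extract_batch(uris, extract):
    """Run `extract` on every URI and return (features, failures)."""
    features, failures = [], []
    for uri in uris:
        try:
            features.append(extract(uri))
        except Exception as err:  # Exception, not a bare except, so Ctrl-C still works
            failures.append((uri, repr(err)))
    return features, failures

# Hypothetical stand-in for ari_to_features
def fake_extract(uri):
    if uri == "656TZlNdVe90zHvmebFt9U":  # one of the IDs that returned 404 in the log
        raise ValueError("non existing id")
    return {"id": uri}

features, failures = extract_batch(
    ["0UaMYEvWZi0ZqiDOoHU3YI", "656TZlNdVe90zHvmebFt9U"], fake_extract
)
print(len(features), "succeeded;", len(failures), "failed")
```

Keeping the failed IDs makes it possible to retry them later or to report how many tracks are missing from the final dataset.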


In [10]:
#Preview the DataFrame
featureDF = pd.DataFrame(featureLIST)
featureDF

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,type,id,uri,track_href,analysis_url,duration_ms,time_signature,artist_pop,genres,track_pop
0,0.904,0.813,4,-7.105,0,0.1210,0.03110,0.006970,0.0471,0.810,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
1,0.774,0.838,5,-3.914,0,0.1140,0.02490,0.025000,0.2420,0.924,...,audio_features,6I9VzXrHxO9rA9A5euc8Ak,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,https://api.spotify.com/v1/tracks/6I9VzXrHxO9r...,https://api.spotify.com/v1/audio-analysis/6I9V...,198800,4,84,dance_pop pop post-teen_pop,83
2,0.664,0.758,2,-6.583,0,0.2100,0.00238,0.000000,0.0598,0.701,...,audio_features,0WqIKmW4BTrj3eJFmnCKMv,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,https://api.spotify.com/v1/tracks/0WqIKmW4BTrj...,https://api.spotify.com/v1/audio-analysis/0WqI...,235933,4,86,dance_pop pop r&b,25
3,0.892,0.714,4,-6.055,0,0.1410,0.20100,0.000234,0.0521,0.817,...,audio_features,1AWQoqb9bSvzTjaLralEkT,spotify:track:1AWQoqb9bSvzTjaLralEkT,https://api.spotify.com/v1/tracks/1AWQoqb9bSvz...,https://api.spotify.com/v1/audio-analysis/1AWQ...,267267,4,82,dance_pop pop,79
4,0.853,0.606,0,-4.596,1,0.0713,0.05610,0.000000,0.3130,0.654,...,audio_features,1lzr43nnXAijIGYnCT8M8H,spotify:track:1lzr43nnXAijIGYnCT8M8H,https://api.spotify.com/v1/tracks/1lzr43nnXAij...,https://api.spotify.com/v1/audio-analysis/1lzr...,227600,4,75,pop_rap reggae_fusion,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34435,0.669,0.228,2,-12.119,1,0.0690,0.79200,0.065000,0.0944,0.402,...,audio_features,3uCHI1gfOUL5j5swEh0TcH,spotify:track:3uCHI1gfOUL5j5swEh0TcH,https://api.spotify.com/v1/tracks/3uCHI1gfOUL5...,https://api.spotify.com/v1/audio-analysis/3uCH...,189184,4,47,unknown,27
34436,0.493,0.727,1,-5.031,1,0.2170,0.08730,0.000000,0.1290,0.289,...,audio_features,0P1oO2gREMYUCoOkzYAyFu,spotify:track:0P1oO2gREMYUCoOkzYAyFu,https://api.spotify.com/v1/tracks/0P1oO2gREMYU...,https://api.spotify.com/v1/audio-analysis/0P1o...,263680,4,39,australian_r&b,37
34437,0.702,0.524,7,-10.710,1,0.0793,0.33200,0.055300,0.2980,0.265,...,audio_features,2oM4BuruDnEvk59IvIXCwn,spotify:track:2oM4BuruDnEvk59IvIXCwn,https://api.spotify.com/v1/tracks/2oM4BuruDnEv...,https://api.spotify.com/v1/audio-analysis/2oM4...,189213,4,55,canadian_contemporary_r&b modern_alternative_rock,49
34438,0.509,0.286,8,-14.722,1,0.1230,0.40200,0.000012,0.1310,0.259,...,audio_features,4Ri5TTUgjM96tbQZd5Ua7V,spotify:track:4Ri5TTUgjM96tbQZd5Ua7V,https://api.spotify.com/v1/tracks/4Ri5TTUgjM96...,https://api.spotify.com/v1/audio-analysis/4Ri5...,194720,4,4,unknown,16


## Finalising and Export

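Finally, we merge the feature DataFrame with the original dataset, which still holds useful information such as the artist name and track name. `pd.merge` performs an inner join by default, so rows for tracks whose features could not be retrieved are dropped. The result is exported as our processed data.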

In [11]:
# duration_ms exists in both frames, so pandas suffixes it as duration_ms_x / duration_ms_y
new_df = pd.merge(testDF, featureDF, left_on="track_uri", right_on="id")
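To see which playlist rows have no matching features, a left join with `indicator=True` can be compared against the default inner join. The sketch below uses small made-up frames rather than the real data; the column values are purely illustrative.

```python
import pandas as pd

# Toy stand-ins: track "c" has no extracted features
playlist = pd.DataFrame({"track_uri": ["a", "b", "b", "c"], "pos": [0, 1, 2, 3]})
feats = pd.DataFrame({"id": ["a", "b"], "energy": [0.8, 0.7]})

# Default inner join: rows without a match disappear
inner = pd.merge(playlist, feats, left_on="track_uri", right_on="id")

# Left join with an indicator column: unmatched rows are kept and labelled
checked = pd.merge(
    playlist, feats, left_on="track_uri", right_on="id", how="left", indicator=True
)
missing = checked.loc[checked["_merge"] == "left_only", "track_uri"].tolist()

print(len(inner), "rows after inner join")
print("tracks without features:", missing)
```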

In [12]:
new_df.to_csv('../data/processed_data.csv')
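By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The `Unnamed` columns in the raw data preview may stem from this kind of round trip, though that is an assumption. The sketch below shows the difference on a toy frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"track_name": ["Toxic", "Crazy In Love"]})

# Round trip with the index written out (the default)
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Round trip with index=False
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` changes the layout of the exported file, so any later notebook that reads `processed_data.csv` would need to be checked before adopting it.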