# Data Collection - Gathering audio data and features

This notebook collects both the audio data and the audio features for the songs found in the `gather_nonwinning_tabular_data.ipynb` and `gather_winning_tabular_data.ipynb` files. I decided to split the Award Show Winners and the Non-winners, even though the two follow similar processes. I feel this improves the notebook's readability and organization, and it preserves the order in which I obtained the data (*the Non-winners data was obtained at a later date*).

In [208]:
import os
import pandas as pd
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Suppress TensorFlow logs

In [209]:
from datetime import datetime 
def parse_date(date_str):
    for fmt in ('%B %d, %Y', '%Y-%m-%d'):
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return pd.NaT  # Return NaT if parsing fails
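
As a quick check of the fallback loop above, the sketch below runs the same two-format parsing on one date written both ways, as the two source tables do (`January 9, 2022` in the winners table, `2022-01-09` in the all-placements table). To stay dependency-free it returns `None` instead of `pd.NaT` on failure; that substitution and the name `parse_date_stdlib` are assumptions made for illustration only.

```python
from datetime import datetime

def parse_date_stdlib(date_str):
    """Try each known format in turn; return None if none of them match."""
    for fmt in ('%B %d, %Y', '%Y-%m-%d'):
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None  # stand-in for pd.NaT in this dependency-free sketch

long_form = parse_date_stdlib('January 9, 2022')
iso_form = parse_date_stdlib('2022-01-09')
unparsed = parse_date_stdlib('9/1/2022')  # neither format matches
```

Both spellings resolve to the same `datetime`, which is what later allows the winners' dates to be compared against the all-placements dates.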

## Initializing Model Class

In [210]:
from essentia_models import EssentiaPredictor

predictor = EssentiaPredictor()

INFO:essentia_models:Initializing models in batches...
INFO:essentia_models:Initializing model 1: vggish_embedding_model...
INFO:essentia_models:Model 1 (vggish_embedding_model) initialized in 0.49 seconds.
INFO:essentia_models:Initializing model 2: effnet_embedding_model...
INFO:essentia_models:Model 2 (effnet_embedding_model) initialized in 0.25 seconds.
INFO:essentia_models:Initializing model 3: vggish_dance_model...
INFO:essentia_models:Model 3 (vggish_dance_model) initialized in 0.22 seconds.
INFO:essentia_models:Initializing model 4: vggish_party_model...
INFO:essentia_models:Model 4 (vggish_party_model) initialized in 0.21 seconds.
INFO:essentia_models:Initializing model 5: vggish_happy_model...
INFO:essentia_models:Model 5 (vggish_happy_model) initialized in 0.22 seconds.
INFO:essentia_models:Initializing model 6: vggish_sad_model...
INFO:essentia_models:Model 6 (vggish_sad_model) initialized in 0.20 seconds.
INFO:essentia_models:Initializing model 7: effnet_party_model...
INFO

# Award Show Non-winners

In [211]:
nonwinning_df = pd.read_csv('data/tables/all_award_show_all_place.csv', keep_default_na=False, na_values=[''])
nonwinning_df.head(5)

Unnamed: 0,Show,Date,Placement,Artist,Song,Total
0,Inkigayo,2022-01-09,1,IVE,ELEVEN,8533
1,Inkigayo,2022-01-16,1,IVE,ELEVEN,6583
2,Inkigayo,2022-01-23,1,IVE,ELEVEN,5927
3,Inkigayo,2022-01-30,1,GOT the beat,Step Back,5612
4,Inkigayo,2022-02-20,1,GOT the beat,Step Back,7224


In [212]:
nonwinning_df['Artist_Song'] = nonwinning_df.Artist + " " + nonwinning_df.Song
nonwinning_df['Date'] = pd.to_datetime(nonwinning_df['Date'], format='%Y-%m-%d')

# nonwinning_unique_queries = nonwinning_df['Artist_Song'].unique().tolist()
# nonwinning_unique_queries.sort()
# nonwinning_unique_queries

In [213]:
%%capture download_non_wins

from download_audio import process_songs_dataframe
nonwinning_linked_df = process_songs_dataframe(nonwinning_df)

In [214]:
import json

non_winning_feature_filepath = 'data/tables/award_non_winning_features.json'

if os.path.exists(non_winning_feature_filepath):
    with open(non_winning_feature_filepath, 'r') as f:
        link_to_non_winning_features = json.load(f)
else:
    link_to_non_winning_features = dict()    

In [215]:
%%capture process_non_wins

unique_non_winning_file_paths = nonwinning_linked_df.file_path.unique().tolist()
for query in unique_non_winning_file_paths:
    if query not in link_to_non_winning_features:
        link_to_non_winning_features[query] = predictor.predict_all(query)  
    print(f"Processed {query} with features: {link_to_non_winning_features[query]}")

In [216]:
import json
with open('data/tables/award_non_winning_features.json', 'w') as f:
    json.dump(link_to_non_winning_features, f, indent=4)

In [217]:
# feature_list = ['vggish_dance', 'vggish_party', 'vggish_happy', 'vggish_sad', 'effnet_party', 'effnet_happy', 'effnet_sad', 'effnet_approachability', 'effnet_engagement', 'effnet_timbre_bright', 'tempo']
# plot_features_subplots({feature: [x[feature] for x in link_to_non_winning_features.values()] for feature in feature_list})

# Award Show Winners

Upon reflection on how to use both datasets, I noticed that the initial Award Show Winning dataset contained many inconsistencies. For example, 'TOMORROW X TOGETHER' was referenced both as 'TXT' and 'TOMORROW X TOGETHER'. Additionally, some published dates didn't correspond to when an episode happened or would happen, which stands out because these music shows run on a consistent weekly schedule. Therefore, I decided to merge in only the winner data that isn't found on the KPOP Database website. The information on the Wikipedia pages appears to be updated more frequently and includes certain episodes of a program that weren't covered in the database. This could be due to episodes not being aired while a winner was still announced for the week.
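
One way to surface the off-schedule dates mentioned above is to compare each entry's weekday against the weekday that show most often airs on. The sketch below is a minimal, standard-library illustration of that idea, not the check used in this notebook; the function name `flag_off_cadence` is hypothetical, and the last row is an invented entry added only to show what gets flagged.

```python
from collections import Counter, defaultdict
from datetime import date

def flag_off_cadence(rows):
    """Return (show, date) rows whose weekday differs from that show's most common weekday."""
    weekdays_by_show = defaultdict(Counter)
    for show, day in rows:
        weekdays_by_show[show][day.weekday()] += 1

    # The most frequent weekday per show is treated as its regular air day.
    usual_weekday = {
        show: counts.most_common(1)[0][0]
        for show, counts in weekdays_by_show.items()
    }
    return [(show, day) for show, day in rows if day.weekday() != usual_weekday[show]]

rows = [
    ("Inkigayo", date(2022, 1, 9)),
    ("Inkigayo", date(2022, 1, 16)),
    ("Inkigayo", date(2022, 1, 23)),
    ("Inkigayo", date(2022, 1, 25)),  # hypothetical off-cadence entry, not from the data
]
suspicious = flag_off_cadence(rows)
```

A flagged row is not necessarily wrong, since specials and rescheduled broadcasts can fall on other days, so the result is better read as a list to inspect by hand.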

In [218]:
import pandas as pd

win_df = pd.read_csv('data/tables/updated_all_award_show_winners.csv', index_col=0)
win_df.head(5)

Unnamed: 0,Episode,Date,Artist,Song,Points,Award Show,Search Query
0,1121,"January 9, 2022",Ive,Eleven,8533,Inkigayo,track:Eleven artist:Ive year:2000-2022
1,1122,"January 16, 2022",Ive,Eleven,6583,Inkigayo,track:Eleven artist:Ive year:2000-2022
2,1123,"January 23, 2022",Ive,Eleven,5927,Inkigayo,track:Eleven artist:Ive year:2000-2022
3,1124,"January 30, 2022",Got the Beat,Step Back,5612,Inkigayo,track:Step Back artist:Got the Beat year:2000-...
4,1125,"February 20, 2022",Got the Beat,Step Back,7224,Inkigayo,track:Step Back artist:Got the Beat year:2000-...


In [219]:
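# Winner dates are written out ('January 9, 2022'); parse_date converts them to datetimes
win_df['Date'] = win_df['Date'].apply(parse_date)
win_df['Date'] = pd.to_datetime(win_df['Date'], format='%Y-%m-%d')

# Keep only the winner rows whose date never appears in the KPOP DB table
dates_only_in_winner = win_df[~win_df['Date'].isin(nonwinning_df['Date'])].reset_index(drop=False)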

dates_only_in_winner.head(5)

Unnamed: 0,index,Episode,Date,Artist,Song,Points,Award Show,Search Query
0,143,1264,2025-04-20,Le Sserafim,Hot,4556,Inkigayo,track:Hot artist:Le Sserafim year:2000-2025
1,288,882,2025-04-17,J-Hope,Mona Lisa,7643,M Countdown,track:Mona Lisa artist:J-Hope year:2000-2025
2,338,—,2022-12-16,Kara,When I Move,5618,Music Bank,track:When I Move artist:Kara year:2000-2022
3,339,—,2022-12-23,Younha,Event Horizon,4233,Music Bank,track:Event Horizon artist:Younha year:2000-2022
4,340,—,2022-12-30,NCT Dream,Candy,10528,Music Bank,track:Candy artist:NCT Dream year:2000-2022


In [220]:
import pandas as pd
from thefuzz import process, fuzz

# Match artist names from the Wikipedia dataset to the names used in the KPOP DB
def fuzzy_match_artists(winners, df_full, artist_col="Artist", score_cutoff=85):
    full_artist_list = df_full[artist_col].dropna().unique().tolist()
    results = []

    for artist in winners:
        match_result = process.extractOne(artist, full_artist_list, score_cutoff=score_cutoff)

        if match_result is not None:
            match, score = match_result
            results.append({
                "original_artist": artist,
                "matched_artist": match,
                "match_score": score
            })
        else:
            results.append({
                "original_artist": artist,
                "matched_artist": artist,
                "match_score": 100
            })

    return pd.DataFrame(results).sort_values('match_score', ascending=False).drop_duplicates()

# Match combined "Artist Song" strings so that each song is matched together with its artist
def fuzzy_match_artist_song(winners, df_full, score_cutoff=85):
    # Map each "Artist Song" key in the reference data to its song title
    artist_song_to_song = df_full.set_index('Artist_Song')['Song'].to_dict()

    full_artist_song_list = list(artist_song_to_song.keys())

    results = []

    for artist_song, song in winners:
        match_result = process.extractOne(artist_song, full_artist_song_list, score_cutoff=score_cutoff)

        if match_result is not None:
            matched_artist_song, score = match_result
        else:
            matched_artist_song, score = artist_song, 100

        # Look up actual Song title
        matched_song = artist_song_to_song.get(matched_artist_song, song)

        results.append({
            "original_artist_song": artist_song,
            "matched_artist_song": matched_artist_song,
            "original_song": song,
            "matched_song": matched_song,
            "match_score": score
        })

    return pd.DataFrame(results).sort_values('match_score', ascending=False).drop_duplicates()
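
String similarity copes well with casing differences such as 'Le Sserafim' versus 'LE SSERAFIM', but an abbreviation like 'TXT' shares few characters with 'TOMORROW X TOGETHER', so similarity scoring alone may not link the two. One possible complement is a small alias table applied before any fuzzy matching. The sketch below assumes such a table; `ARTIST_ALIASES` and `canonical_artist` are hypothetical names, and the table would need to be filled in by hand.

```python
# Hypothetical alias table: keys are case-folded variants, values are canonical names.
ARTIST_ALIASES = {
    "txt": "TOMORROW X TOGETHER",
}

def canonical_artist(name, aliases=ARTIST_ALIASES):
    """Return the canonical name for a known alias, otherwise the name unchanged."""
    return aliases.get(name.strip().casefold(), name)

resolved = [canonical_artist(n) for n in ["TXT", "TOMORROW X TOGETHER", "Kara"]]
```

Names without an alias pass through untouched, so this step could run ahead of `fuzzy_match_artists` without changing its inputs for any other artist.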

In [221]:
matched_df = fuzzy_match_artists(dates_only_in_winner.Artist, nonwinning_df, score_cutoff=95)
match_dict = dict(zip(matched_df.original_artist, matched_df.matched_artist))
dates_only_in_winner['Artist'] = dates_only_in_winner['Artist'].map(match_dict).combine_first(dates_only_in_winner['Artist'])

dates_only_in_winner['Artist_Song'] = dates_only_in_winner['Artist'] + " " + dates_only_in_winner['Song']

In [222]:
art_song_matched_df = fuzzy_match_artist_song(list(zip(dates_only_in_winner.Artist_Song, dates_only_in_winner.Song)), nonwinning_df, score_cutoff=100)
artist_song_map = dict(zip(art_song_matched_df.original_artist_song, art_song_matched_df.matched_artist_song))
song_map = dict(zip(art_song_matched_df.matched_artist_song, art_song_matched_df.matched_song))

dates_only_in_winner['Artist_Song'] = dates_only_in_winner['Artist_Song'].map(artist_song_map).combine_first(dates_only_in_winner['Artist_Song'])
dates_only_in_winner['Song'] = dates_only_in_winner['Artist_Song'].map(song_map).combine_first(dates_only_in_winner['Song'])
dates_only_in_winner

Unnamed: 0,index,Episode,Date,Artist,Song,Points,Award Show,Search Query,Artist_Song
0,143,1264,2025-04-20,LE SSERAFIM,HOT,4556,Inkigayo,track:Hot artist:Le Sserafim year:2000-2025,LE SSERAFIM HOT
1,288,882,2025-04-17,J-Hope,MONA LISA,7643,M Countdown,track:Mona Lisa artist:J-Hope year:2000-2025,J-Hope MONA LISA
2,338,—,2022-12-16,Kara,When I Move,5618,Music Bank,track:When I Move artist:Kara year:2000-2022,Kara When I Move
3,339,—,2022-12-23,YOUNHA,Event Horizon,4233,Music Bank,track:Event Horizon artist:Younha year:2000-2022,YOUNHA Event Horizon
4,340,—,2022-12-30,NCT DREAM,Candy,10528,Music Bank,track:Candy artist:NCT Dream year:2000-2022,NCT DREAM Candy
5,393,1189,2024-01-05,NCT 127,Be There For Me,7392,Music Bank,track:Be There For Me artist:NCT 127 year:2000...,NCT 127 Be There For Me
6,394,1190,2024-01-12,NCT 127,Be There For Me,6690,Music Bank,track:Be There For Me artist:NCT 127 year:2000...,NCT 127 Be There For Me
7,395,1191,2024-01-19,ITZY,UNTOUCHABLE,10679,Music Bank,track:Untouchable artist:Itzy year:2000-2024,ITZY UNTOUCHABLE
8,444,—[note 1],2024-12-27,STRAY KIDS,Walkin On Water,5599,Music Bank,track:Walkin on Water artist:Stray Kids year:2...,STRAY KIDS Walkin On Water
9,459,1239,2025-04-11,CLOSE YOUR EYES,All My Poetry,5507,Music Bank,track:All My Poetry artist:Close Your Eyes yea...,CLOSE YOUR EYES All My Poetry


In [223]:
# df['Artist_Song'] = df.Artist + " " + df.Song

# unique_queries = df['Artist_Song'].unique().tolist()
# unique_queries.sort()
# unique_queries

In [224]:
%%capture download_wins

dates_only_in_winner = process_songs_dataframe(dates_only_in_winner)

In [225]:
len(dates_only_in_winner), len(dates_only_in_winner.file_path.unique())

(20, 16)

In [226]:
import json
winning_feature_filepath = 'data/tables/award_winning_features.json'

if os.path.exists(winning_feature_filepath):
    with open(winning_feature_filepath, 'r') as f:
        link_to_features = json.load(f)
else:
    link_to_features = dict()    


In [227]:
%%capture process_wins

unique_file_paths = dates_only_in_winner.file_path.unique().tolist()

for query in unique_file_paths:
    if query not in link_to_features:
        link_to_features[query] = predictor.predict_all(query)  
    print(f"Processed {query} with features: {link_to_features[query]}")

In [228]:
# import matplotlib.pyplot as plt
# import numpy as np

# def plot_features(features, title):
#     plt.figure(figsize=(10, 6))
#     plt.hist(features, color='blue', alpha=0.7)
#     plt.title(title)
#     plt.xlabel('Value')
#     plt.ylabel('Frequency')
#     plt.grid(axis='y', alpha=0.75)
#     plt.show()

# def plot_features_subplots(features_dict):
#     num_features = len(features_dict)
#     fig, axs = plt.subplots(num_features, 1, figsize=(10, 6 * num_features))
#     fig.tight_layout(pad=5.0)

#     for i, (title, features) in enumerate(features_dict.items()):
#         axs[i].hist(features, color='blue', alpha=0.7)
#         axs[i].set_title(title)
#         axs[i].set_xlabel('Value')
#         axs[i].set_ylabel('Frequency')
#         axs[i].grid(axis='y', alpha=0.75)

#     plt.show()

In [229]:
# tempo_values = [x['tempo'] for x in link_to_features.values()]
# plot_features(tempo_values, "Tempo Distribution")

In [230]:
# appr_values = [x['effnet_approachability'] for x in link_to_features.values()]
# plot_features(appr_values, "Approachability Distribution")

In [231]:
# feature_list = ['vggish_dance', 'vggish_party', 'vggish_happy', 'vggish_sad', 'effnet_party', 'effnet_happy', 'effnet_sad', 'effnet_approachability', 'effnet_engagement', 'effnet_timbre_bright', 'tempo']
# plot_features_subplots({feature: [x[feature] for x in link_to_features.values()] for feature in feature_list})


As seen in the histograms above, some of the model output distributions appear to be either right- or left-skewed. Therefore, we will need to normalize the values, but only once both datasets, award-winning and non-award-winning, have been combined.
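
The reason for waiting is that scaling parameters computed on one dataset alone would put the two groups on different scales. The sketch below illustrates that ordering with min-max scaling fitted on the combined values. The notebook does not say which normalization will be used, so min-max is only an assumed stand-in (it rescales values but does not remove skew), and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(winning, non_winning):
    """Scale both groups to [0, 1] using the min and max of the combined values."""
    combined = list(winning) + list(non_winning)
    low, high = min(combined), max(combined)
    span = high - low

    def scale(values):
        if span == 0:
            # All values identical: map everything to 0.0 rather than divide by zero.
            return [0.0 for _ in values]
        return [(v - low) / span for v in values]

    return scale(winning), scale(non_winning)

# A few tempo values taken from this notebook's output tables.
winning_tempo = [89.0, 105.0, 138.0]
non_winning_tempo = [122.0, 130.0, 124.0]
scaled_win, scaled_non = min_max_scale(winning_tempo, non_winning_tempo)
```

Because the minimum and maximum come from the combined list, a tempo of 122 maps to the same scaled value regardless of which table it came from.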

In [232]:
import json
with open('data/tables/award_winning_features.json', 'w') as f:
    json.dump(link_to_features, f, indent=4)

# Merging Dataframes

In [233]:
win_feature_df = pd.DataFrame.from_dict(link_to_features, orient='index')
dates_only_in_winner = dates_only_in_winner.join(win_feature_df, on='file_path')

In [234]:
dates_only_in_winner = dates_only_in_winner.reset_index(drop=True)
dates_only_in_winner['Placement'] = 1
dates_only_in_winner.columns

Index(['index', 'Episode', 'Date', 'Artist', 'Song', 'Points', 'Award Show',
       'Search Query', 'Artist_Song', 'file_path', 'vggish_dance',
       'vggish_party', 'vggish_happy', 'vggish_sad', 'effnet_party',
       'effnet_happy', 'effnet_sad', 'effnet_approachability',
       'effnet_engagement', 'effnet_timbre_bright', 'tempo', 'Placement'],
      dtype='object')

In [235]:
# Drop unnecessary columns and rename for clarity and compatibility with non-winning data
win_linked_df = dates_only_in_winner.copy().drop(columns=['Episode', 'Search Query', 'Artist_Song', 'index'])
win_linked_df.rename(columns={"Points": "Total", "Award Show": "Show"}, inplace=True)

# Reorder columns to match the non-winning data
columns_reorder = ["Show", "Date", "Artist", "Song", "Total", "Placement"]
win_linked_df = pd.concat([win_linked_df[columns_reorder], win_linked_df.drop(columns=columns_reorder)], axis=1)
win_linked_df.head()

Unnamed: 0,Show,Date,Artist,Song,Total,Placement,file_path,vggish_dance,vggish_party,vggish_happy,vggish_sad,effnet_party,effnet_happy,effnet_sad,effnet_approachability,effnet_engagement,effnet_timbre_bright,tempo
0,Music Bank,2025-04-11,CLOSE YOUR EYES,All My Poetry,5507,1,data/audio/all_my_poetry_close_your_eyes.flac,94.772673,39.468187,58.600849,39.521444,25.972188,28.459117,65.364033,94.837797,89.03749,52.617425,89.0
1,Music Bank,2024-01-19,ITZY,UNTOUCHABLE,10679,1,data/audio/untouchable_itzy.flac,97.282368,8.796155,81.554157,85.176849,0.169356,39.911935,93.669564,94.967508,94.734287,50.575477,105.0
2,M Countdown,2025-04-17,J-Hope,MONA LISA,7643,1,data/audio/mona_lisa_j-hope.flac,97.054577,17.678125,60.819304,67.917317,10.585602,39.631873,87.287062,87.912464,92.784697,45.539203,138.0
3,Show! Music Core,2025-04-12,JENNIE,like JENNIE,6095,1,data/audio/like_jennie_jennie.flac,96.556515,10.664071,75.223869,90.521652,1.441297,64.432132,94.390357,60.059088,98.381501,45.949566,130.0
4,Music Bank,2022-12-16,Kara,When I Move,5618,1,data/audio/when_i_move_kara.flac,93.176335,19.549541,76.299649,78.993183,3.005806,41.917771,88.828254,91.262579,92.842674,53.098851,116.0


In [236]:
win_linked_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Show                    20 non-null     object        
 1   Date                    20 non-null     datetime64[ns]
 2   Artist                  20 non-null     object        
 3   Song                    20 non-null     object        
 4   Total                   20 non-null     int64         
 5   Placement               20 non-null     int64         
 6   file_path               20 non-null     object        
 7   vggish_dance            20 non-null     float64       
 8   vggish_party            20 non-null     float64       
 9   vggish_happy            20 non-null     float64       
 10  vggish_sad              20 non-null     float64       
 11  effnet_party            20 non-null     float64       
 12  effnet_happy            20 non-null     float64     

In [None]:
lose_feature_df = pd.DataFrame.from_dict(link_to_non_winning_features, orient='index')
nonwinning_linked_df = nonwinning_linked_df.join(lose_feature_df, on='file_path')

In [238]:
nonwinning_to_merge_df = nonwinning_linked_df.copy().drop(['Artist_Song'], axis=1).reset_index(drop=True)
nonwinning_to_merge_df = pd.concat([nonwinning_to_merge_df[columns_reorder], nonwinning_to_merge_df.drop(columns=columns_reorder)], axis=1)
nonwinning_to_merge_df.head(5)

Unnamed: 0,Show,Date,Artist,Song,Total,Placement,file_path,vggish_dance,vggish_party,vggish_happy,vggish_sad,effnet_party,effnet_happy,effnet_sad,effnet_approachability,effnet_engagement,effnet_timbre_bright,tempo
0,Inkigayo,2024-02-18,(G)I-DLE,Super Lady,6996,2,data/audio/super_lady_(g)i-dle.flac,93.854207,11.460681,79.334843,90.879661,3.159283,58.017582,90.699446,83.888745,96.587032,49.879,122.0
1,Show! Music Core,2023-06-03,(G)I-DLE,Queencard,8133,1,data/audio/queencard_(g)i-dle.flac,92.974675,14.018901,81.472057,88.088566,1.055446,87.547749,94.290841,80.356991,98.429632,48.344669,130.0
2,Show! Music Core,2023-05-27,(G)I-DLE,Queencard,7757,1,data/audio/queencard_(g)i-dle.flac,92.974675,14.018901,81.472057,88.088566,1.055446,87.547749,94.290841,80.356991,98.429632,48.344669,130.0
3,The Show,2022-03-22,(G)I-DLE,TOMBOY,7970,1,data/audio/tomboy_(g)i-dle.flac,93.242282,15.568084,78.373891,86.487371,3.316267,63.539046,92.144006,76.364082,96.781135,49.883446,124.0
4,M Countdown,2024-02-01,(G)I-DLE,Wife,6305,2,data/audio/wife_(g)i-dle.flac,96.563458,8.666032,76.261991,91.656339,1.623682,72.608387,95.349288,66.247165,98.272377,48.901996,124.0


In [239]:
merged_df = pd.concat([win_linked_df, nonwinning_to_merge_df], ignore_index=True)
merged_df = merged_df.reset_index(drop=True).sort_values(by=['Date', 'Placement'])
merged_df.head(5)

Unnamed: 0,Show,Date,Artist,Song,Total,Placement,file_path,vggish_dance,vggish_party,vggish_happy,vggish_sad,effnet_party,effnet_happy,effnet_sad,effnet_approachability,effnet_engagement,effnet_timbre_bright,tempo
940,Music Bank,2022-01-07,NCT U,Universe (Let's Play Ball),5930,1,data/audio/universe_(let's_play_ball)_nct_u.flac,90.834355,24.404429,69.484723,76.007283,1.9501,47.70259,92.827344,84.677964,96.71334,50.498027,90.0
511,Show! Music Core,2022-01-08,IVE,ELEVEN,6408,1,data/audio/eleven_ive.flac,94.969481,14.487433,81.253177,86.618626,2.623723,44.954056,88.753748,91.413623,92.12752,51.92278,120.0
547,Inkigayo,2022-01-09,IVE,ELEVEN,8533,1,data/audio/eleven_ive.flac,94.969481,14.487433,81.253177,86.618626,2.623723,44.954056,88.753748,91.413623,92.12752,51.92278,120.0
752,M Countdown,2022-01-13,Kep1er,WA DA DA,6500,1,data/audio/wa_da_da_kep1er.flac,96.806103,8.892013,78.541768,93.300325,1.427681,66.826451,95.882249,71.226579,97.789687,49.050614,126.0
746,Music Bank,2022-01-14,Kep1er,WA DA DA,3678,1,data/audio/wa_da_da_kep1er.flac,96.806103,8.892013,78.541768,93.300325,1.427681,66.826451,95.882249,71.226579,97.789687,49.050614,126.0


In [240]:
merged_df.to_csv('data/tables/merged_award_show_winners.csv', index=False)