# Distance!
In this notebook, I try out different approaches to building a demo that uses distance formulae to determine music similarity. Most of the experimentation is with **normalization** and the choice of **distance equation**. Later on I may also play around with **weight adjustments** and **dimensionality reduction**.

### Imports and Setup

In [50]:
import numpy as np
import pandas as pd
from pprint import pprint
from copy import deepcopy
from typing import Callable
from sklearn.preprocessing import StandardScaler, MinMaxScaler, normalize

DATASET_PATH = '/Users/nico/Desktop/CIIC/CAPSTONE/essentia_demo/03_29_25_test_features_full.csv'
base_df      = pd.read_csv(DATASET_PATH)
features     = ['popularity', 'mfcc_peak_energy', 'mfcc_avg_energy', 'pitch_salience', 'bpm', 'energy', 'integrated_loudness', 'loudness_range',
                'danceable_effnet', 'aggressive_effnet', 'happy_effnet', 'party_effnet', 'relaxed_effnet', 'sad_effnet', 'acoustic_effnet',
                'electronic_effnet', 'instrumental_effnet', 'female_effnet', 'tonal_effnet', 'bright_effnet', 'bright_nsynth_effnet',
                'approachable_effnet', 'approachability_effnet', 'engaging_effnet', 'engagement_effnet']

### Normalization
Here we create three dataframes from `base_df`, each using a different normalization method. In previous tests, the one that worked best was `z_score_df`, which uses **z-score normalization**.

In [51]:
z_score_df  = deepcopy(base_df)
min_max_df  = deepcopy(base_df)
unit_vec_df = deepcopy(base_df)

z_score_df[features]  = StandardScaler().fit_transform(z_score_df[features])
min_max_df[features]  = MinMaxScaler(feature_range=(0,1)).fit_transform(min_max_df[features])
unit_vec_df[features] = normalize(unit_vec_df[features], norm = 'l2') # Scales each row (track) to unit L2 norm, not each column,
                                                                      # so features with large raw values can still dominate.
                                                                      # This one gives really bad results.
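
To make the difference between the three methods concrete, here is a minimal sketch on made-up toy data (the array `toy` and its values are assumptions for illustration, not taken from the dataset). It reproduces in plain NumPy what the three scikit-learn calls compute with their default settings: `StandardScaler` and `MinMaxScaler` work per column (feature), while `normalize` works per row (track).

```python
import numpy as np

# Toy data (hypothetical): 3 tracks (rows) x 2 features (columns) on very different scales.
toy = np.array([[100.0, 0.1],
                [120.0, 0.5],
                [140.0, 0.9]])

# Z-score, per column: subtract the column mean, divide by the population std (ddof=0).
z_scored = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Min-max, per column: rescale each feature to the range [0, 1].
min_maxed = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

# Unit vector, per row: divide each track by its own L2 norm.
unit_rows = toy / np.linalg.norm(toy, axis=1, keepdims=True)

print(z_scored)
print(min_maxed)
print(unit_rows)  # the first column stays close to 1 for every track
```

In the last result the large-valued first feature takes up almost the whole unit vector, so the small-valued second feature barely contributes to any distance. That may be part of why the unit-vector dataframe performs poorly here.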

### Defining Distance Formulae
Here I define the distance formulae that I believe will be useful for our project. I'm leaving out Chebyshev because it only looks at the single largest per-feature difference, which I don't think fits our use case.
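
For reference, these are the standard textbook definitions of the four distances. The notation is my own assumption: $x$ is the input track, $y$ is the comparison track, $n$ is the number of features, and $S$ is the covariance matrix of the features estimated from the dataset.

$$d_{\text{manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

$$d_{\text{euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$d_{\text{cosine}}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert_2 \, \lVert y \rVert_2}$$

$$d_{\text{mahalanobis}}(x, y) = \sqrt{(x - y)^{\top} S^{-1} (x - y)}$$

Cosine distance is commonly defined as one minus the cosine similarity, so that, like the others, a smaller value means a more similar track.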

In [54]:
def dist_helper(compare_df : pd.DataFrame, dist_func : Callable, track_index : int = 0):
  result_dict = {}
  input_track = compare_df.iloc[track_index]
  print(f"Selected Track : {input_track['ARTIST'], input_track['TITLE']}")


  for ix, curr_track in compare_df.iterrows():
    if ix == track_index: continue
    track_name = "_".join([curr_track['ARTIST'], curr_track['TITLE']])

    dist_score = dist_func(features, input_track, curr_track)
    result_dict[track_name] = dist_score

  return result_dict


def print_top(distance_dict : dict[str, float], top_ix : int = 15):
  for ix, items in enumerate(distance_dict.items()):
    if ix >= top_ix: break
    key, val = items
    print(key, val)


def manhattan_dist(feature_list : list[str], input_track, comparison_track):
  """Same as taxicab or L1 Norm. It's all the same, it's like breathing air."""
  dist_score = 0
  for feature in feature_list:
    input_feat   = input_track[feature]
    compare_feat = comparison_track[feature]
    dist_score   += abs(input_feat - compare_feat)

  return float(dist_score)


def euclidean_dist(feature_list : list[str], input_track, comparison_track):
  """Squared Euclidean (L2) distance. The square root is omitted, which leaves the ranking unchanged."""
  dist_score = 0
  for feature in feature_list:
    input_feat   = input_track[feature]
    compare_feat = comparison_track[feature]
    dist_score   += (input_feat - compare_feat)**2

  return float(dist_score)


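def cosine_dist(feature_list : list[str], input_track, comparison_track):
  """Cosine distance: 1 - cosine similarity, so that smaller means more similar."""
  # Cast to float arrays, since a row taken from a mixed-type dataframe has object dtype
  input_feat   = np.asarray(input_track[feature_list], dtype=float)
  compare_feat = np.asarray(comparison_track[feature_list], dtype=float)
  similarity   = np.dot(input_feat, compare_feat) / (np.linalg.norm(input_feat) * np.linalg.norm(compare_feat))
  return float(1 - similarity)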


def mahalanobis_dist(compare_df : pd.DataFrame, feature_list: list[str], track_index : int = 0):
    """
    Computes the Mahalanobis distance from one track to every other track in the dataframe.

    Parameters:
      compare_df: DataFrame with 'ARTIST' and 'TITLE' columns plus the feature columns.
      feature_list: List of feature names to include in the distance calculation.
      track_index: Row index of the input track.

    Returns:
      dict[str, float]: Maps "ARTIST_TITLE" of every other track to its Mahalanobis
      distance from the input track. The covariance matrix is estimated from compare_df.
    """

    # Set up basics
    result_dict    = {}
    input_track    = compare_df.iloc[track_index]
    print(f"Selected Track : {input_track['ARTIST'], input_track['TITLE']}")

    # Acquire necessary values for mahalanobis
    cov_matrix     = np.cov(compare_df[feature_list].values, rowvar=False)
    inv_cov_matrix = np.linalg.inv(cov_matrix)

    for ix, comp_track in compare_df.iterrows():
      if ix == track_index : continue
      track_name     = "_".join([comp_track['ARTIST'], comp_track['TITLE']])


      # Create the feature vectors from the track data
      input_vector   = np.array([input_track[feature] for feature in feature_list])
      compare_vector =  np.array([comp_track[feature] for feature in feature_list])

      # Calculate the difference vector
      diff_vector    = input_vector - compare_vector

      # Calculate the Mahalanobis distance explicitly
      dist_score     = np.sqrt(diff_vector.T @ inv_cov_matrix @ diff_vector)
      result_dict[track_name] = float(dist_score)

    return result_dict
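
The functions above loop over rows one at a time, which is easy to read but slow on larger datasets. As a rough sketch of an alternative, the same four distances can be computed for all tracks at once with NumPy. The function name `distances_from_row`, the toy matrix `X`, and the use of `np.linalg.pinv` (to tolerate a singular covariance matrix) are all assumptions for illustration and not part of the code above. The sketch also assumes no row is all zeros, since cosine distance is undefined there.

```python
import numpy as np

def distances_from_row(X: np.ndarray, row: int) -> dict[str, np.ndarray]:
    """Distances from X[row] to every row of X (hypothetical helper, no Python loops)."""
    diff = X - X[row]                                  # shape: (n_tracks, n_features)

    manhattan = np.abs(diff).sum(axis=1)
    euclidean = np.sqrt((diff ** 2).sum(axis=1))

    norms  = np.linalg.norm(X, axis=1)
    cosine = 1.0 - (X @ X[row]) / (norms * norms[row])

    # Assumption: pinv instead of inv, so highly correlated features do not raise an error.
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    squared = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    mahalanobis = np.sqrt(np.maximum(squared, 0.0))    # clip tiny negative rounding errors

    return {'manhattan': manhattan, 'euclidean': euclidean,
            'cosine': cosine, 'mahalanobis': mahalanobis}

# Toy feature matrix (hypothetical): 4 tracks x 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [4.0, 1.0],
              [0.0, 3.0]])

dists = distances_from_row(X, row=0)
for name, values in dists.items():
    print(name, values)
```

With a dataframe, something like `distances_from_row(z_score_df[features].to_numpy(), 118)` would return arrays aligned with the dataframe rows. The entry for the input track itself is 0 and would need to be skipped when ranking.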

### Testing

In [None]:
output = mahalanobis_dist(base_df, features, track_index=118)
output = dict(sorted(output.items(), key=lambda item: item[1]))
print_top(output)



Selected Track : ('yeule', 'Pretty Bones')
Lorde_Supercut 3.643982692806503
MIKE_Iz u Stupid 3.72531128162808
A. G. Cook_Lucifer 3.7782813505846775
ESPRIT 空想_Secret 3.805614566300338
Ecco2k_Cc 3.809434024842212
yeule_inferno 3.9260747736054578
Burial_Ghost Hardware 3.9787762489986145
Vince Staples_Homage 4.031025355948976
Billie Eilish_wish you were gay 4.0401741096443935
Yung Lean_Yellowman 4.107697950793843
Eartheater_High Tide 4.116632213917167
yeule_Pocky Boy 4.121100919504808
Della Zyr_여름: 모호함 속의 너 / 2악장 / 놓아줄 때가 되면 놓아주기 4.123151573162734
Ecco2k_Peroxide 4.19165551412244
Draag Me_There Is a Party Where I'm Going 4.235121789091024


In [53]:
output = dist_helper(z_score_df, euclidean_dist, track_index=118)
output = dict(sorted(output.items(), key=lambda item: item[1]))

print_top(output)


Selected Track : ('yeule', 'Pretty Bones')
yeule_inferno 8.85066056859401
James Blake_I Want You To Know 10.070945155183141
yeule_Pixel Affection 10.572584561745277
ESPRIT 空想_Secret 10.687195588811813
George Clanton_Everything I Want 10.930947828455755
Ecco2k_Cc 11.2141818748849
yeule_Pocky Boy 12.095229612444989
Burial_Ghost Hardware 12.209164880124579
James Blake_Fire The Editor 13.020157766179357
Sam Gellaitry_Embark 13.312151063170084
Yung Lean_Yellowman 13.40018212836539
Della Zyr_여름: 모호함 속의 너 / 2악장 / 놓아줄 때가 되면 놓아주기 13.496214046590266
Burial_Archangel 14.343680736848539
MIKE_Iz u Stupid 14.4292352741993
Sam Gellaitry_Fall 14.602567326887867
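
Since the two runs above return different top-15 lists, one simple way to compare methods is to count how many tracks their rankings share. This is only a sketch: `top_k_overlap` is a hypothetical helper, and the two small dictionaries are made-up values rather than real results.

```python
def top_k_overlap(dist_a: dict[str, float], dist_b: dict[str, float], k: int = 15) -> set[str]:
    """Tracks that appear in the k nearest results of both distance dictionaries."""
    top_a = sorted(dist_a, key=dist_a.get)[:k]   # k smallest distances in dist_a
    top_b = sorted(dist_b, key=dist_b.get)[:k]   # k smallest distances in dist_b
    return set(top_a) & set(top_b)

# Made-up distances for four hypothetical tracks under two different methods.
method_a = {"track_1": 0.1, "track_2": 0.5, "track_3": 0.9, "track_4": 0.2}
method_b = {"track_1": 3.0, "track_2": 1.0, "track_3": 2.0, "track_4": 0.5}

shared = top_k_overlap(method_a, method_b, k=2)
print(shared)
```

The same helper could be called on two `output` dictionaries from the cells above, as long as each one is saved under its own variable name first.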
