# Similarity Investigation

In [1]:
__import__('sys').path.append('../scripts/'); __import__('notebook_utils').table_of_contents('similarity_investigation.ipynb')

<h3>Table of contents</h3>


[Similarity Investigation](#Similarity-Investigation)
- [Setup](#Setup)
- [How others compute similarity](#How-others-compute-similarity)
- [Attempt 1. Check for common tracks in FMA & MSD](#Attempt-1.-Check-for-common-tracks-in-FMA-&-MSD)
- [Attempt 2. Use UPF Essentia](#Attempt-2.-Use-UPF-Essentia)
- [Attempt 3. Use data available in FMA](#Attempt-3.-Use-data-available-in-FMA)
- [Other investigations done:](#Other-investigations-done:)

## Setup

In [9]:
# IMPORTS
import json
import os
import numpy
import pandas
import sys

from notebook_utils import md, h3, h4, h5
import FMA_code.utils

In [None]:
# PATHS
class paths():
    # General
    DATA_F = 'data/'

    # MSD files
    MSD_F = DATA_F + 'MSD/'
    LASTFM_SUBSET_F = MSD_F + 'lastfm_subset/'
    LASTFM_TRAIN_f = MSD_F + 'lastfm_train/'
    LASTFM_TEST_F = MSD_F + 'lastfm_test/'

    # FMA files
    FMA_F = DATA_F + 'FMA/'
    FMA_METADATA_F = FMA_F + 'fma_metadata/'
    FMA_SMALL_F = FMA_F + 'fma_small/'


## How others compute similarity

**SPOTIFY**


**[SPOTALIKE](https://spotalike.com/en)**<br>
They use the **Last.fm dataset** to provide similar songs within the MSD.


**[COSINE.CLUB](https://cosine.club/)**<br>
For the machine learning heads: it's using a contrastive learning model from the Music Technology Group at [UPF](https://essentia.upf.edu/models.html#discogs-effnet) that has been trained on triplets of mel-spectrograms of tracks to learn associations between a positive pair with a negative sample. Then by creating vector embeddings of each track the cosine similarity between the vectors can be used to find the most similar in the index. [[ref](https://www.reddit.com/r/TheOverload/comments/1csqg0j/i_built_a_music_search_engine_to_help_you_find/)]
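
The retrieval step described above (embed every track, then rank by cosine similarity) can be sketched as follows. This is a minimal illustration, not cosine.club's actual code: the function name `top_k_similar` and the toy 2-D vectors are assumptions, and real embeddings would come from a model such as Discogs-EffNet.

```python
import numpy as np

def top_k_similar(embeddings: np.ndarray, query_idx: int, k: int = 3) -> list[tuple[int, float]]:
    """Return the k tracks most similar to `query_idx` by cosine similarity."""
    # L2-normalise each row so that a dot product equals the cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_idx]
    scores[query_idx] = -np.inf  # exclude the query track itself
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]

# Toy 2-D "embeddings" (hypothetical values, for illustration only)
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(top_k_similar(toy, query_idx=0, k=2))
```

For a large index, an approximate nearest-neighbour library would probably replace the brute-force matrix product, but the ranking criterion stays the same.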

**[Essentia - UPF](https://essentia.upf.edu/models.html#discogs-effnet):** 

## Attempt 1. Check for common tracks in FMA & MSD
Idea: use MSD similarity as a gold standard to fine-tune FMA similarity.<br>
<span style="color:red">**Failed attempt**: 0 pairs of similar MSD songs are present in FMA :')</span>

In [22]:
# SHOW METADATA FOR ONE EXAMPLE OF LASTFM SUBSET
lastfm_ex_path = paths.LASTFM_SUBSET_F + 'A/A/A/TRAAAAW128F429D538.json'
with open(lastfm_ex_path) as f:
    lastfm_ex = json.load(f)

for key in lastfm_ex.keys():
    print(f'{key:<15}', lastfm_ex[key][:2] + ['...'] if isinstance(lastfm_ex[key], list) else lastfm_ex[key])

artist          Casual
timestamp       2011-08-02 20:13:25.674526
similars        [['TRABACN128F425B784', 0.871737], ['TRIAINV12903CB4943', 0.751301], '...']
tags            [['Bay Area', '100'], ['hieroglyiphics', '100'], '...']
track_id        TRAAAAW128F429D538
title           I Didn't Mean To


In [None]:
# GET TRACKS PRESENT IN FMA
# Get title & artist tuple set for FMA
tracks = FMA_code.utils.load(paths.FMA_METADATA_F + 'tracks.csv')
FMA_title_artist_set = set(zip(tracks[('track', 'title')], tracks[('artist', 'name')]))

# Get list of tracks with similar tracks (one MSD track ID per line).
# NOTE: paths.SIMILAR_TRACKS is not defined in the paths class above.
with open(paths.SIMILAR_TRACKS) as f:
    tracks_with_similar = f.read().splitlines()

# Get MSD tracks that match FMA & save them in similars_dict
similars_dict = {}

for track_id in tracks_with_similar:
    rel_track_path = f'{track_id[2]}/{track_id[3]}/{track_id[4]}/{track_id}.json'

    # Check the two possible folders the track can be in (train & test)
    for supfolder in [paths.LASTFM_TEST_F, paths.LASTFM_TRAIN_f]:
        track_path = supfolder + rel_track_path
        if not os.path.exists(track_path):
            continue

        # Load the Last.fm metadata for this track
        with open(track_path) as f:
            track_dict = json.load(f)

        # Keep the track if its exact (title, artist) tuple also appears in FMA;
        # 'similars' has the format [[track_id, score], ...]
        if (track_dict['title'], track_dict['artist']) in FMA_title_artist_set:
            similars_dict[track_id] = {
                'similars': track_dict['similars'],
                'artist': track_dict['artist'],
                'title': track_dict['title']
            }
        break  # Track found, no need to check the other folder

print(f"FMA and MSD match in {len(similars_dict)} tracks where MSD has similar tracks")

The Durks Chandeliers
Reggae War Zone Foot Village
Sailin' On The Red Thread
It Killed Mom Thee Oh Sees
Roka Calexico
When He Comes Holly Golightly & The Brokeoffs
Two Silver Trees Calexico
The Anvil Will Fall Harvey Milk
Vernon Jackson The Brought Low
Please Listen To My Demo EPMD
Mango Tree Chandeliers
OK Higgins
Better Be Good The Real Kids
Red F Dan Deacon
Plug Me In The Unsacred Hearts
Picking Scabs Thomas Function
War Harvey Milk
You Changed My Life Kevin Blechdom
When My Ship Comes In Clockcleaner
Victor Jara's Hands Calexico
River Matteah Baim
Forgotten Lovers Gary Wilson
Address Book Misty Roses
Feel Anymore Captain Ahab
1978 The Unsacred Hearts
Motionless Sian Alice Group
The Most Excruciating Vibe Larkin Grimm
Canker Ut
Wall of Dumb Higgins
The Language Clan Destined
There He Is Higgins
Bound Feet & Feathered Mia Doi Todd
Microcastle Deerhunter
Want Me Ariel Pink's Haunted Graffiti
Armour of the Shroud Bobb Trimble
One Big Holiday My Morning Jacket
Plan B Clan Destined
All A
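
The matching above relies on exact `(title, artist)` string equality, so differences in case, accents, or punctuation would cause misses. A minimal sketch of a more forgiving key, assuming both datasets are normalised the same way before comparison (`normalize_key` is a hypothetical helper, not part of `FMA_code.utils`):

```python
import re
import unicodedata

def normalize_key(title: str, artist: str) -> tuple[str, str]:
    """Build a case-, accent- and punctuation-insensitive (title, artist) key."""
    def norm(text: str) -> str:
        # Decompose accented characters, then drop the combining marks
        text = unicodedata.normalize('NFKD', text)
        text = ''.join(c for c in text if not unicodedata.combining(c))
        text = text.casefold()
        # Remove punctuation, then collapse runs of whitespace
        text = re.sub(r'[^\w\s]', '', text)
        return ' '.join(text.split())
    return norm(title), norm(artist)

print(normalize_key("Victor Jara's Hands", "Calexico"))
```

Normalising increases recall but can also merge distinct tracks that share a title and artist, so matches would still need a spot check.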

In [23]:
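# CHECK IF SIMILAR TRACKS ARE IN FMA
# Get the IDs of all MSD tracks similar to a track in FMA
fma_similar_tracks = {t[0] for v in similars_dict.values() for t in v['similars']}

# Get the subset of those similar tracks that are also in FMA
# NOTE: fma_similar_tracks holds MSD track IDs (strings), while FMA_title_artist_set
# holds (title, artist) tuples, so this intersection is empty by construction.
# The IDs must first be resolved to (title, artist) for the check to be meaningful.
fma_similar_tracks = fma_similar_tracks.intersection(FMA_title_artist_set)
print(f"Similar tracks present in FMA: {len(fma_similar_tracks)}")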


Similar tracks present in FMA: 0
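
The cell above compares MSD track IDs with `(title, artist)` tuples, so a count of 0 is guaranteed regardless of the data. A sketch of the check with the IDs resolved first is shown below. It assumes a lookup table `msd_title_artist` (hypothetical, it would have to be built from the Last.fm JSON files), and the track IDs in the toy data are made up.

```python
def similar_pairs_in_fma(similars_dict, msd_title_artist, fma_title_artist_set):
    """Return (track_id, similar_id, score) triples whose similar track is in FMA.

    similars_dict:        {msd_track_id: {'similars': [[similar_id, score], ...], ...}}
    msd_title_artist:     {msd_track_id: (title, artist)}  (hypothetical lookup table)
    fma_title_artist_set: {(title, artist), ...}
    """
    pairs = []
    for track_id, info in similars_dict.items():
        for similar_id, score in info['similars']:
            # Resolve the ID to (title, artist) before testing membership
            key = msd_title_artist.get(similar_id)
            if key is not None and key in fma_title_artist_set:
                pairs.append((track_id, similar_id, score))
    return pairs

# Toy data with made-up track IDs
toy_similars = {'TR1': {'similars': [['TR2', 0.9], ['TR3', 0.5]]}}
toy_lookup = {'TR2': ('Roka', 'Calexico'), 'TR3': ('Other', 'Someone')}
toy_fma = {('Roka', 'Calexico')}
print(similar_pairs_in_fma(toy_similars, toy_lookup, toy_fma))
```

Since every key of `similars_dict` already matched FMA, each returned triple is a pair of similar tracks that are both present in FMA.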


**Info**<br>
Use [Discogs-EffNet](https://essentia.upf.edu/models.html#discogs-effnet), the model behind the [cosine.club](https://cosine.club/) website.<br>
Audio embedding models trained with classification and contrastive learning objectives using an in-house dataset annotated with Discogs metadata. The classification model was trained to predict music style labels. The contrastive learning models were trained to learn music similarity capable of grouping audio tracks coming from the same artist, label (record label), release (album), or **segments of the same track itself (self-supervised learning)**. Additionally, **the `multi` model was trained on multiple similarity targets simultaneously**.

**Note:** cosine.club has [positive reviews on Reddit](https://www.reddit.com/r/TheOverload/comments/1csqg0j/i_built_a_music_search_engine_to_help_you_find/), which suggests it works well.

## Attempt 2. Use UPF Essentia
<span>Idea: Use Essentia python library to get similarity between audio tracks.</span><br>

<span style="color: orange">
ABANDONED
<ul>
  <li>Works only with Python<=3.10</li>
  <li>Could not reproduce the steps in their instructions</li>
  <li>It would require us to understand how it works, convert mp3 to "wav" format, and then compute similarity -> very computationally expensive (?)</li>
  <li>Not sure whether the "similarity" it is trained on is the type of similarity we are interested in</li>
</ul>
</span>

In [2]:
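# INSTALL ESSENTIA
# Does not work on python 3.12. Installed in a python 3.10 environment.
# Each `!` line runs in its own subshell, so `!conda activate` does not carry over
# to the following lines; `conda run -n` targets the environment explicitly.
!conda create -n essentia_env python=3.10 -y
!conda run -n essentia_env pip install essentia ipykernel "numpy<2"  # numpy<2 is required for essentia
# Afterwards, select essentia_env as the kernel for this notebook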

Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/cdt_wsl/miniconda3/envs/essentia_env

  added / updated specs:
    - python=3.10


The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main 
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 
  bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 
  ca-certificates    pkgs/main/linux-64::ca-certificates-2024.12.31-h06a4308_0 
  ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 
  libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 
  libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 
  libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0 
  ncur

In [2]:
!pip freeze | grep essentia

essentia==2.1b6.dev1110


In [9]:
import essentia

In [4]:
from essentia.standard import MonoLoader, TensorflowPredictEffnetDiscogs

audio = MonoLoader(filename="audio.wav", sampleRate=16000, resampleQuality=4)()
model = TensorflowPredictEffnetDiscogs(graphFilename="discogs_track_embeddings-effnet-bs64-1.pb", output="PartitionedCall:1")
embeddings = model(audio)

ModuleNotFoundError: No module named 'essentia.standard'

## Attempt 3. Use data available in FMA

#### Possible idea:
Train models on features & lyrics to predict genre.


#### Works worth looking at:
**Medium: [Content-Based Music Recommendation System](https://medium.com/@dibyendu19034/content-based-music-recommendation-system-74f30bccc239)**<br>
They combine FMA & MSD with different weights to get their final model. Steps:
1. Librosa extracts the MFCC features (7 features x 20 dimensions), and a bagged random forest converts the MFCCs into a genre-level feature vector
2. TF-IDF of the lyrics
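
A rough sketch of how an audio-based and a lyrics-based similarity could be blended with a weight. This is not the article's implementation: `audio_feats` stands in for the genre-level vectors from step 1, the TF-IDF is a simple smoothed variant written by hand, and the linear blend with `w_audio` is an assumption.

```python
import math
from collections import Counter

import numpy as np

def tfidf_matrix(docs: list[str]) -> np.ndarray:
    """Compute a simple TF-IDF matrix (rows: documents, columns: vocabulary)."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenised for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    doc_freq = Counter(w for doc in tokenised for w in set(doc))
    n = len(docs)
    mat = np.zeros((n, len(vocab)))
    for row, doc in enumerate(tokenised):
        for word, count in Counter(doc).items():
            idf = math.log((1 + n) / (1 + doc_freq[word])) + 1  # smoothed IDF
            mat[row, index[word]] = (count / len(doc)) * idf
    return mat

def cosine_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)
    return unit @ unit.T

def combined_similarity(audio_feats: np.ndarray, lyrics: list[str], w_audio: float = 0.5) -> np.ndarray:
    """Blend audio and lyrics similarity with weight w_audio (assumed linear blend)."""
    return w_audio * cosine_matrix(audio_feats) + (1 - w_audio) * cosine_matrix(tfidf_matrix(lyrics))

# Toy inputs (hypothetical values, for illustration only)
toy_audio = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
toy_lyrics = ["love love night", "love night city", "rain on the road"]
toy_sim = combined_similarity(toy_audio, toy_lyrics, w_audio=0.5)
print(toy_sim.round(3))
```

The weight would have to be tuned against some ground truth, which is exactly what is missing for FMA.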


**[GitHub Repo](https://github.com/MiningMyBusiness/ExploringFreeMusicArchiveDataset): ExploringFreeMusicArchiveDataset**<br>
* **Motivation:** to make music recommendation fair to smaller or less renowned artists and more useful to consumers looking for music, we must see whether the music itself (the audio signal) can provide the features needed for classification and categorization.
* **Observations:**
  * **Genre classification:** Some genres occur far more often than others in the dataset, so there is a large genre imbalance.

**Kaggle: [Unsupervised ML project: Music Clustering](https://www.kaggle.com/code/shabanamir/unsupervised-ml-project-music-clustering)**


**Jupyter notebook: Ideas for [song similarity using the audio features](https://colab.research.google.com/github/jo-cho/genre_classification/blob/main/GTZAN/song_similarity.ipynb#scrollTo=5a912b81)**


## Other investigations done:

Comments:
* I saw this paper: [Music4All-Onion -- A Large-Scale Multi-faceted Content-Centric Music Recommendation Dataset](https://dl.acm.org/doi/10.1145/3511808.3557656), but it is based on MSD
* [This blog post](http://millionsongdataset.com/blog/12-1-2-matching-errors-taste-profile-and-msd/) talks about matching errors between songs and tracks. [List of song-track pairs that shouldn't be trusted](http://millionsongdataset.com/blog/12-2-12-fixing-matching-errors/)

**How is similarity computed for the MSD:**
* Collaborative information: if users A and B both like bands X and Y, then X and Y are likely to be similar
* Spotify, by contrast, uses algorithms that measure and map songs on the basis of tempo, progression, and frequency
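
The collaborative idea in the first bullet can be illustrated with a small sketch: two items are similar when largely the same users like them. The Jaccard index used here is an assumption chosen for simplicity, not the formula Last.fm actually uses, and the user and band names are placeholders.

```python
from itertools import combinations

def jaccard_item_similarity(likes: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Score each pair of items by the overlap between the users who like them."""
    # Invert user -> items into item -> users
    fans: dict[str, set[str]] = {}
    for user, items in likes.items():
        for item in items:
            fans.setdefault(item, set()).add(user)

    sims = {}
    for a, b in combinations(sorted(fans), 2):
        shared = fans[a] & fans[b]
        union = fans[a] | fans[b]
        sims[(a, b)] = len(shared) / len(union)
    return sims

# Users A and B both like bands X and Y; user C likes X and Z
toy_likes = {'A': {'X', 'Y'}, 'B': {'X', 'Y'}, 'C': {'X', 'Z'}}
toy_item_sims = jaccard_item_similarity(toy_likes)
print(toy_item_sims)
```

Here X and Y get the highest score because two of the three users who like X also like Y, whereas Y and Z share no listeners at all.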

**Similarity in Spotify:**
The audio feature data available through the Spotify API consists of 12 metrics.

Yet, these audio features are just the first component of Spotify's audio analysis system. In addition to the audio feature extraction, a separate algorithm will also analyze the track's temporal structure and split the audio into different segments of varying granularity: from sections (defined by significant shifts in the song timbre or rhythm, that highlight transitions between key parts of the track such as verse, chorus, bridge, solo, etc.) down to tatums (representing the smallest cognitively meaningful subdivision of the main beat).


Reference: [Inside Spotify’s Recommender System](https://www.music-tomorrow.com/blog/how-spotify-recommendation-system-works-a-complete-guide-2022)
