# Cross-dataset correlation
Many platforms ship their own in-house classifiers for high-level features. However, as the AcousticBrainz stability testing showed (see the other notebook), a classifier may not model exactly what we hope it does, and it is not always stable on unseen data.

The goal is to look at features from several platforms (AcousticBrainz, MSD, Spotify, Deezer, Musixmatch, etc.) to see:

- Whether the classifier outputs are highly correlated (they should be, since they are different models all trained for the same task)
- If not, how and why the classifiers produce different results
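
As a minimal sketch of the comparison in the first bullet, assume two classifier outputs have already been joined on a shared track identifier. The column names and values below are invented for illustration. Rank correlation is one possible way to quantify agreement, since it ignores differences in output scale between platforms:

```python
import pandas as pd

# Hypothetical toy data: one row per track, one column per platform's
# estimate of the same high-level feature (all values are made up).
features = pd.DataFrame(
    {
        "essentia_danceability": [0.10, 0.40, 0.35, 0.80, 0.95],
        "spotify_danceability": [0.20, 0.30, 0.50, 0.70, 0.90],
    }
)

# Pearson correlation measures linear agreement between the raw scores.
pearson = features.corr(method="pearson")

# Spearman correlation is the Pearson correlation of the ranks, so it only
# asks whether both platforms order the tracks the same way.
spearman = features.rank().corr(method="pearson")

rho = spearman.loc["essentia_danceability", "spotify_danceability"]
print(f"Spearman rho: {rho:.2f}")
```

A low value on real data would not by itself say which classifier is wrong, only that the two disagree on the ordering of tracks.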


Ideally, we will compare the following:

[Essentia (acousticbrainz)](https://acousticbrainz.org/datasets/accuracy) - [Echonest](http://millionsongdataset.com/pages/example-track-description/) - [Spotify](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/) + more if found

First, we need to check the overlap between AcousticBrainz and MSD to be able to fetch the Echonest features. We use the mapping provided by AcousticBrainz: https://labs.acousticbrainz.org/million-song-dataset-mapping/

In [89]:
import pandas as pd
from pathlib import Path

# Load in acousticbrainz features and the mapping from MBID to MSD
acousticbrainz = pd.read_hdf(Path.cwd() / 'datasets' / 'acousticbrainz.h5')
mapping = pd.read_csv(Path.cwd() / 'datasets' / 'mappings' / 'msd-mbid.csv', names=['msd', 'mbid', 'title', 'artist'])
mbids = acousticbrainz.index.get_level_values(0)

# Filter out entries that do not have acousticbrainz features
mapping = mapping[mapping['mbid'].isin(mbids)]    
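
The filtering step above can be illustrated on toy data. All identifiers below are invented, and the second index level (`submission`) is an assumption about how the feature table is indexed. The sketch only shows how `isin` keeps the mapping rows whose MBID appears in the feature index, and how the size of the overlap could be reported:

```python
import pandas as pd

# Hypothetical feature table indexed by (mbid, submission); values are made up.
toy_features = pd.DataFrame(
    {"danceable": [0.9, 0.8, 0.1]},
    index=pd.MultiIndex.from_tuples(
        [("mbid-a", 0), ("mbid-a", 1), ("mbid-b", 0)],
        names=["mbid", "submission"],
    ),
)

# Hypothetical MSD-to-MBID mapping; "mbid-c" has no features.
toy_mapping = pd.DataFrame(
    {"msd": ["TR1", "TR2", "TR3"], "mbid": ["mbid-a", "mbid-b", "mbid-c"]}
)

# Unique MBIDs taken from the first index level.
toy_mbids = toy_features.index.get_level_values(0).unique()

# Keep only the mapping rows that have features.
overlap = toy_mapping[toy_mapping["mbid"].isin(toy_mbids)]
print(f"{len(overlap)} of {len(toy_mapping)} mapped tracks have features")
```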

Next, we fetch the Spotify IDs for these MBIDs. We have a [mapping from MSD song IDs to Spotify IDs](https://labs.acousticbrainz.org/million-song-dataset-echonest-archive/), but our mapping contains MSD track IDs, so we first need to convert the track IDs to song IDs, and then the song IDs to Spotify IDs.

First step: going from track_id to song_id, using [track_metadata.db](http://millionsongdataset.com/pages/getting-dataset/) (under additional files).

In [90]:
# Load mapping from MSD track ID to MSD song ID
import sqlite3
db = sqlite3.connect(Path.cwd() / 'datasets' / 'track_metadata.db')
cursor = db.cursor()

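# Add the song ID to the mapping, one lookup per track ID
msd_songid = []
for msd in mapping['msd']:
    # Parameterized query: the driver handles quoting of the track ID
    cursor.execute("SELECT song_id FROM songs WHERE track_id=?", (msd,))
    matches = cursor.fetchall()
    # Assumes every track ID in the mapping is present in the database
    msd_songid.append(matches[0][0])
cursor.close()
db.close()
mapping['msd_songid'] = msd_songid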
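
Looking up one track per query works, but it issues a separate query for every row. A possible alternative, sketched here against an in-memory database with invented rows, is to read the two columns once and merge them into the mapping. The table and column names follow the query above; everything else is an assumption for illustration:

```python
import sqlite3

import pandas as pd

# Build a tiny stand-in for track_metadata.db (rows are made up).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE songs (track_id TEXT PRIMARY KEY, song_id TEXT)")
db.executemany(
    "INSERT INTO songs VALUES (?, ?)",
    [("TR1", "SO1"), ("TR2", "SO2"), ("TR3", "SO3")],
)

# Hypothetical mapping; "TR9" does not exist in the database.
toy_mapping = pd.DataFrame({"msd": ["TR1", "TR3", "TR9"]})

# Read the lookup table once, then join on the track ID.
lookup = pd.read_sql_query("SELECT track_id, song_id FROM songs", db)
db.close()

merged = toy_mapping.merge(
    lookup.rename(columns={"track_id": "msd", "song_id": "msd_songid"}),
    on="msd",
    how="left",  # keep unmatched tracks; their msd_songid becomes NaN
)
print(merged)
```

With a left join, tracks that are missing from the database end up with a missing `msd_songid` instead of raising an error, so they can be inspected or dropped explicitly.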

However, the mapping from track ID to song ID in MSD is [known to contain some errors](http://millionsongdataset.com/blog/12-2-12-fixing-matching-errors/), so we use the list of mismatches (`sid_mismatches.txt`) to filter out the entries for which the mapping is incorrect:

In [91]:
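import re
mismatches_path = Path.cwd() / 'datasets' / 'sid_mismatches.txt'
# Regex to get all song IDs with an ERROR in matching; \S+ keeps the match
# inside the angle brackets even if the rest of the line contains '>'
mismatches = re.findall(r'ERROR: <(\S+) \S+>', mismatches_path.read_text(encoding='utf-8'))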

# Filter out the mismatches
mapping = mapping[~mapping['msd_songid'].isin(mismatches)]
mapping = mapping.reset_index(drop=True)
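
To show what the extraction does, here is a small sketch on invented lines. It assumes each line has the form `ERROR: <song_id track_id> description`. The IDs and titles are made up, and the pattern is a variant that captures both IDs:

```python
import re

# Assumed line format: "ERROR: <song_id track_id> description".
# The IDs and titles below are invented for illustration.
sample = (
    "ERROR: <SOAAAAA12AB0000001 TRAAAAA128F0000001> "
    "Artist A - Song > A != Artist B - Song B\n"
    "ERROR: <SOBBBBB12AB0000002 TRBBBBB128F0000002> "
    "Artist C - Song C != Artist D - Song D\n"
)

# \S+ cannot cross the space between the two IDs, and the pattern requires
# '>' directly after the second ID, so a '>' later in the line (as in the
# first title) does not affect what is captured.
pairs = re.findall(r"ERROR: <(\S+) (\S+)>", sample)
bad_song_ids = {song_id for song_id, _track_id in pairs}
print(sorted(bad_song_ids))
```

Collecting the IDs in a set also makes the later `isin` filter independent of any duplicate lines in the file.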

Now we can use https://labs.acousticbrainz.org/million-song-dataset-echonest-archive/ to look up the Spotify ID for each song ID:

In [92]:
import json
rosetta = Path.cwd() / 'datasets' / 'millionsongdataset_echonest'

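def get_spotify_id(msd_songid):
    """Return the Spotify track ID for an MSD song ID, or NaN if none is found."""
    # Archive files are sharded by the third and fourth characters of the song ID
    path = rosetta / msd_songid[2:4] / (msd_songid + '.json')
    f = json.loads(path.read_text(encoding='utf-8'))
    songs = f['response']['songs']
    if len(songs) == 0:
        return float('nan')
    # Only the first song in the response is considered
    tracks = songs[0]['tracks']
    for track in tracks:
        if track['catalog'] == 'spotify':
            return track['foreign_id'].split(':')[-1]

    return float('nan')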

mapping['spotify'] = mapping['msd_songid'].apply(get_spotify_id)
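
The lookup above relies on a particular layout of the archived Echonest responses. The toy payloads below mirror only the keys that `get_spotify_id` reads. The values are invented, the `spotify:track:<id>` shape of `foreign_id` is an assumption, and the helper name `spotify_id_from_response` is hypothetical:

```python
import math


def spotify_id_from_response(payload):
    """Return the first Spotify track ID in a parsed response, or NaN."""
    songs = payload["response"]["songs"]
    if not songs:
        return float("nan")
    for track in songs[0]["tracks"]:
        if track["catalog"] == "spotify":
            # Keep only the last colon-separated part of the foreign ID.
            return track["foreign_id"].split(":")[-1]
    return float("nan")


# Invented payload with one non-Spotify and one Spotify track entry.
toy_payload = {
    "response": {
        "songs": [
            {
                "tracks": [
                    {"catalog": "other", "foreign_id": "other:track:123"},
                    {"catalog": "spotify", "foreign_id": "spotify:track:abc123"},
                ]
            }
        ]
    }
}
# Invented payload for a song the archive has no data for.
empty_payload = {"response": {"songs": []}}

found = spotify_id_from_response(toy_payload)
missing = spotify_id_from_response(empty_payload)
print(found, math.isnan(missing))
```

Working on the parsed dictionary instead of a file path keeps the parsing logic testable without the archive on disk.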


Not every ID can be mapped to a Spotify ID, because the archived Echonest response is sometimes empty or contains no Spotify track. We filter these entries out and tidy up the mapping:

In [93]:
mapping = mapping.dropna()[['title', 'artist', 'msd', 'msd_songid', 'mbid', 'spotify']].reset_index(drop=True)
mapping

| | title | artist | msd | msd_songid | mbid | spotify |
|---|---|---|---|---|---|---|
| 0 | No One Could Ever | Hudson Mohawke | TRMMMRX128F93187D9 | SOGTUKN12AB017F4F1 | 7f9264e2-be8e-4e8e-a58e-04026d096a85 | 41RpZW2lxAdnqDd2nMBzLQ |
| 1 | (Looking for) The Heart of Saturday Night | Shawn Colvin | TRMMMUT128F42646E8 | SOBARPM12A8C133DFF | d3e506fc-574a-4b0a-8c5d-9618c1644164 | 5SM86TB7dU5n9Y23wLgcBY |
| 2 | Ethos of Coercion | Dying Fetus | TRMMMQY128F92F0EA3 | SOKOVRQ12A8C142811 | ae7be227-5112-4d21-adbb-662e4c23a90d | 0ghgsfOnoXJT7jsS63U8et |
| 3 | Nervous | Nicolette | TRMMMPN128F426610E | SOGFWVT12A8C137C64 | f4fddf93-9e1d-4fcc-843a-e4f11ae69a28 | 6mxDT6y9Sdp2802sieuwOA |
| 4 | Fuckin Ethic People (999) | DJ Craze | TRMMMTG128F426B5BB | SOEPAIN12A8C1396A7 | 498496ae-cc88-4ddf-98c8-fd4bb65194b4 | 4qvYeGamP4p4wwMwb7EilZ |
| ... | ... | ... | ... | ... | ... | ... |
| 143952 | Dans la légende | Fonky Family | TRYYYDF128F9322908 | SOLQZKM12AB017F838 | 7c897794-964f-46c8-9110-483899c13436 | 3v2L4aWnkYNKDMkrQvk93V |
| 143953 | A.I.M. | The Cooper Temple Clause | TRYYYEE128F930BA5F | SOREAOZ12A58A79F33 | c62a6f3d-b9f1-42e9-a273-358b556dc026 | 3fi7HlTctbdQRrwsH8tdQz |
| 143954 | Parkhaus | Captain PlanET | TRYYYWE12903CE4FF9 | SOIQVTG12AC468E9C0 | 22e405a0-deed-42d2-8a53-1642feda73c9 | 348lQoXFGBYde9jC66XlsX |
| 143955 | Inequality Street | Skyclad | TRYYYZM128F428E804 | SOBODSE12A8C13EBD6 | 3ab76d62-a51a-4a51-8d72-36a19e3ea519 | 2QA63aCrxabOQbMZqMxZhW |


In [94]:
# Write ID of spotify data that will need to be fetched
mapping['spotify'].drop_duplicates().to_csv(Path.cwd() / 'datasets' / 'spotify-tofetch.csv', index=False, header=False)
# Write MSD data that will need to be fetched
mapping['msd'].drop_duplicates().to_csv(Path.cwd() / 'datasets' / 'msd-tofetch.csv', index=False, header=False)