# Neural Network project, 2022
## - Artist similarity with Graph Neural Networks (re-implementation)
- Andrea Giuseppe Di Francesco, 1836928
- Giuliano Giampietro, 2024160

* In this notebook we present a complete re-implementation of the paper ['Artist similarity with Graph Neural Networks'](https://arxiv.org/abs/2107.14541).
* Since the authors released only the [dataset](https://gitlab.com/fdlm/olga) and not the project's code, we attempted to reproduce the experiments described in the paper. We also tried 4 additional GNN architectures, provided by the PhD student Indro Spinelli.


## - Importing the libraries

In the following cell we import the libraries used to extract the dataset and to carry out our experiments. Since we worked with Graph Neural Networks (GNNs), the [PyTorch Geometric library](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html) was very helpful: it provides classes that implement the most widely used GNN architectures.

In [1]:
# !pip install plotly
# !pip install musicbrainzngs
# !pip install torchmetrics
# !pip install spacy-sentence-bert
import os
import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)
# !pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html
# !pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
# !pip install -q git+https://github.com/pyg-team/pytorch_geometric.git

1.12.1




In [78]:
import json
import math
import random
import time
from collections import Counter
from random import choice, randrange

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import requests                        # used by DatasetOlga.get_mapping and get_artist_name
import torch
import torch.nn as nn
import torch.nn.functional as F
from requests_html import HTMLSession  # used by DatasetOlga.get_GraphDict
from sklearn.neighbors import NearestNeighbors
from torch.optim import lr_scheduler
from torchmetrics.functional import pairwise_euclidean_distance
# from torch_geometric.nn import GCNConv, GraphConv, GATConv, SAGEConv

import musicbrainzngs as mbr
import spacy_sentence_bert
from utils import *

random_seed=80085

random.seed(random_seed)
np.random.seed(random_seed)
torch.manual_seed(random_seed)
sentence_selector = spacy_sentence_bert.load_model('en_stsb_roberta_base')


## Loading of the dataset Olga

 - The next cell loads the raw dataset provided by the authors of **'Artist similarity with Graph Neural Networks'**.
 - Its columns contain the [musicbrainz_id](https://musicbrainz.org/) of each artist, the partition the artist belongs to (train, val, test), and the artist's low-level [AcousticBrainz](https://acousticbrainz.org/) features, taken from a sample of 25 songs.
 - Unfortunately, despite our best efforts, we could not recover all the information described in the [paper](https://arxiv.org/pdf/2107.14541.pdf). Olga contains 17,673 artists, whereas we were able to extract [AllMusic](https://www.allmusic.com/) ids for only 11,261 of them.



In [79]:
olga=pd.read_csv('olga.csv')
# train 0-14138, val 14139-15905, test 15906-17672 (row indices, 17,673 artists in total)
# olga[:30]

# mbr.auth('<username>', '<password>')
# mbr.set_useragent(
#     "python-musicbrainzngs-example",
#     "0.1",
#     "https://github.com/alastair/python-musicbrainzngs/",
# )

### How do we retrieve the Graph topology?

- Thanks to the musicbrainz_id of each artist, we can get the link to the artist's AllMusic profile, and from there the information about its related artists. Each AllMusic link corresponds to a unique artist and ends with a 12-character identifier.
- After obtaining the AllMusic link for each artist in the dataset (if it exists), we associate each artist with its related ones. We do this only for related artists that can be mapped back to the dataset's musicbrainz_ids, because those are the only ones for which we have track features.
- This step is probably very lossy as well: an artist may have many similar artists according to AllMusic, but we don't have the feature vectors for all of them.
- The following class contains methods that extract this information for all the artists in the dataset, at a high computational cost. For this reason it was run just once, and the results were stored in separate files, such as 'MsbMapped.json', 'graphSimilarities.json', 'dizofartist.json'.
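
The two steps described above (reading the identifier from a profile link, then keeping only the relations between artists of the dataset) can be sketched offline as follows. This is a minimal illustration and not the code used to build our files: the URLs and ids are invented, `allmusic_key` and `build_edges` are hypothetical helper names, and treating the relations as undirected edges is an assumption made for the example.

```python
def allmusic_key(url):
    """Return the trailing 12-character AllMusic identifier of a profile URL."""
    return url.rstrip('/')[-12:]


def build_edges(mapped, related):
    """Keep only the relations whose endpoints are both artists of the dataset.

    mapped:  dict, AllMusic key -> dataset row index
    related: dict, AllMusic key -> list of related AllMusic keys
    Returns a sorted list of undirected edges (i, j) with i < j.
    """
    edges = set()
    for key, neighbours in related.items():
        if key not in mapped:
            continue
        for other in neighbours:
            if other in mapped and other != key:
                i, j = mapped[key], mapped[other]
                edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Hypothetical profile URLs (the ids are invented for the example).
urls = {
    0: 'https://www.allmusic.com/artist/artist-a-mn0000000001',
    1: 'https://www.allmusic.com/artist/artist-b-mn0000000002',
}
mapped = {allmusic_key(u): idx for idx, u in urls.items()}

# 'mn0000000003' is not in the dataset, so the relation pointing to it is dropped.
related = {
    'mn0000000001': ['mn0000000002', 'mn0000000003'],
    'mn0000000002': ['mn0000000001'],
}
edges = build_edges(mapped, related)
print(edges)  # [(0, 1)]
```

The dropped relation in the example is exactly the loss mentioned above: a related artist without a feature vector cannot become a node of the graph.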

In [80]:
occurrencies = load_data('occurrencies_real.json')
art_genre = load_data('artist_genres.json')
metadata = load_data('metadata.json')
genres = metadata['genre']
transitions = metadata['transitions']
data_genres = load_data('dataset_genres.json')



##############################################
# ol.get_the_Kmost_rated_genre(occurrencies, 150)

In [81]:
class DatasetOlga(): #In this class, we obtain through different methods the main characteristics of the graph of artists
                    # thanks to the available information in the olga dataset
    def __init__(self,olga):
        self.olga=olga
        self.mb=olga.musicbrainz_id
        self.artists={} #Needed for obtaining the mapping from musicbrainz to the allmusic ids
        self.l=len(self.mb)
        self.d={}       #Needed for obtaining a dict. where keys are artists, and values are the artists similar to them, based on self.artists
        self.NI={}      #Dict. that will contain the artist's features
    
    def get_mapping(self, i):
        # Returns the AllMusic page of the artist at dataset index i, "Not found" if
        # MusicBrainz lists none, or None if the request failed.
        response = requests.get(f'https://musicbrainz.org/ws/2/artist/{str(self.mb[i])}?inc=url-rels&fmt=json')
        if response.ok:
            data = response.json()
            refs = [r['url']['resource'] for r in data['relations'] if r['type'] == 'allmusic']
            return refs[0] if len(refs) != 0 else "Not found"
        return None

    def get_artist_name(self, i):
        # Returns the name of the artist at dataset index i, or None if the request failed.
        response = requests.get(f'https://musicbrainz.org/ws/2/artist/{str(self.mb[i])}?inc=url-rels&fmt=json')
        if response.ok:
            data = response.json()
            return data['name']
        return None



    def get_mappingList(self, init, end, increm=500):
        # Maps the artists with dataset index in [init, end) to their AllMusic ids.
        # For the purpose of this NN task we take all of them into consideration.
        length = len(self.mb[init:end])
        c = 0
        for i in range(init, init + length):  # absolute dataset indices, so init != 0 works too
            mapp = self.get_mapping(i)
            while mapp is None:               # the request failed: wait a moment and retry
                time.sleep(1)
                mapp = self.get_mapping(i)

            if mapp != "Not found":   # Some ids have no corresponding AllMusic id, so we lose that information
                mapp = str(mapp)      # mapp is the link, as a string
                key = mapp[-12:]
                self.artists[key] = i
            c += 1
            if c % increm == 0 or c == 30:
                print("{}/{} artists were processed".format(c, length))  # Keeps track of the processed artists

        # This method takes a lot of time, so its result is saved to 'MsbMapped1.json'.
        # save_data is assumed to be provided by utils, like load_data.
        save_data(self.artists, 'MsbMapped1.json')
        return self.artists
    
    
    def get_GraphDict(self, name='MsbMapped.json', increm=500):
        session = HTMLSession()
        c = 0  # Counter
        artID = load_data(name)  # Mapped artists (AllMusic id -> dataset index), computed beforehand by get_mappingList
        length = len(artID)
        for k in artID:
            if k is not None:
                # k is the unique code that identifies the artist in the AllMusic link
                url = 'https://www.allmusic.com/artist/' + k + '/related'
                r = session.get(url)
                # The related artists are extracted from the HTML of the AllMusic "related" page
                body = r.html.find('body', first=True)
                divn = body.find('.overflow-container')[0]
                divn = divn.find('.content-container')[0]
                divn = divn.find('.content')[0]
                divn = divn.find('section', first=True)
                if divn is None:
                    self.d[artID[k]] = []  # This artist has no related artists (or the information is missing)
                    continue
                artistL = []
                for art in divn.find('li'):
                    links = art.find('a')                    # Links of the artists related to k
                    link = list(links[0].absolute_links)[0]  # absolute_links is a one-element set
                    link = str(link)[-12:]                   # Keep only the AllMusic code
                    if link in artID:  # Some related artists may not be among the mapped musicbrainz_ids
                        artistL.append(artID[link])
                self.d[artID[k]] = artistL
                c += 1
                if c % increm == 0 or c == 30:
                    print("{}/{} artists were processed".format(c, length))
        # This method also takes some time, so its result can be found in 'graphSimilarities.json'.
        # save_data is assumed to be provided by utils, like load_data.
        save_data(self.d, 'graphSimilarities1.json')
        print("Done...")
        return self.d

    def get_genres(self,name='MsbMapped.json',increm=500):
        
        c=0 #Counter
        artID=self.load_data(name) #We load the mapped artists (between MusicBrainz Ids, and AllMusic Ids)
        valid_ids = sorted(list(artID.values()))
        self.occurrencies = {}
        self.artist_genre = {}
        print('Num of artists: {}'.format(len(valid_ids)))
        self.lost = 0
        for artist_num in range(len(valid_ids)):
          print("n° of genres: ", len(self.occurrencies))
          print('Artist {}/{}'.format(artist_num, len(valid_ids)))
          art_id = self.olga.iloc[valid_ids[artist_num], 1]
          result = mbr.get_artist_by_id(art_id,
              includes=['tags'])
          
          if 'tag-list' in result['artist']:
            genre_list = [genre['name'] for genre in result['artist']['tag-list']]
            genre_count = [{genre['count']: genre['name']} for genre in result['artist']['tag-list']]
            self.artist_genre[artist_num] = genre_count, result['artist']['name']
            

            for genre in genre_list:
              if genre not in self.occurrencies:
                self.occurrencies[genre] = 1
              else:
                self.occurrencies[genre] += 1
          else:
            self.lost += 1
            print('Lost artist {}'.format(self.lost))
            self.artist_genre[artist_num] = [], result['artist']['name']


        return 
    def get_artist_genres(self, art_genre, genres):
        self.artist_genre = {}
        for key in art_genre:
            if art_genre[key][0] == []:
                self.artist_genre[int(key)] = []
                # print(art_genre[key])
            else:
                # print()
                genre_c = art_genre[key][0]
                genre_c = list(map(self.trans_function, genre_c))
                importance_list = [int(list(genre_c[element].keys())[0]) for element in range(len(genre_c))]
                new = sorted(range(len(importance_list)), key = lambda k: importance_list[k], reverse = True)
                c = False
                for genre_idx in new:
                    current_g = list(genre_c[genre_idx].values())[0]
                    if current_g in genres:
                        print(current_g)
                        self.artist_genre[int(key)] = current_g
                        c = True
                        break
                if c:
                    continue
                
                candidate_diz = {}
                idx_diz = {}
                for element in range(len(new)):
                    points = {most_c : sentence_selector(list(genre_c[new[element]].values())[0] + ' is the genre played by '+ art_genre[key][1]).similarity(sentence_selector(most_c + ' is the genre played by '+ art_genre[key][1])) for most_c in genres}
                    candidate_diz[max(points, key = points.get)] = max(list(points.values()))
                    idx_diz[max(points, key = points.get)] = element
                self.artist_genre[int(key)] = max(candidate_diz, key = candidate_diz.get)
                print("Assigned to {}, the {} genre from {}....".format(art_genre[key][1], max(candidate_diz, key = candidate_diz.get), list(genre_c[idx_diz[max(candidate_diz, key = candidate_diz.get)]].values())[0]))
    
    def fill_unassigned_artists(self, data_genres_copy, adj):

        for idx in data_genres_copy:
            if data_genres_copy[idx] == []:
                ciao = list(map(int, list(torch.nonzero(adj[int(idx), :]))))
                l = []
                for el in ciao:
                    if data_genres_copy[str(el)] != []:
                        l.append(data_genres_copy[str(el)])
                        
                if len(l) == 0:
                    data_genres_copy[idx] = random.sample(genres, 1)[0]
                else:
                    data_genres_copy[idx] = self.most_common_policy(l)
        return data_genres_copy
    
    
    @staticmethod
    def get_the_Kmost_rated_genre(diz, k):
        l = []
        for i in range(k):
            key = max(diz, key = diz.get)
            l.append({key: diz[key]})
            diz.pop(key)
        return l
    @staticmethod
    def trans_function(el):
        k = list(el.keys())[0]
        v = list(el.values())[0]
    
        if v in transitions:
        
            el[k] = transitions[v]
        return el
    @staticmethod
    def most_common_policy(l):
        # Take the most common among the neighbors. If there are two alternative take the more specific name (The one with less count in the occurrencies)
        counter_diz = Counter(l).most_common()
        maximum = counter_diz[0][1]
        genre_c = counter_diz[0][0]
        minimum = occurrencies[genre_c]
        for el in counter_diz:
            if el[1] < maximum:
                break
            if occurrencies[el[0]] < minimum:
                genre_c = el[0]
                minimum = occurrencies[el[0]]
        return genre_c

ol = DatasetOlga(olga)
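
The rule used by `most_common_policy` (and therefore by `fill_unassigned_artists`) can be looked at in isolation: an artist without a genre takes the most frequent genre among its neighbours, and a tie is resolved in favour of the more specific genre, meaning the one with the lower global count. The sketch below is a standalone restatement with invented counts; `majority_genre` is a hypothetical name and is not part of the class.

```python
from collections import Counter


def majority_genre(neighbour_genres, global_counts):
    """Pick the most frequent genre; break ties with the globally rarest one."""
    ranked = Counter(neighbour_genres).most_common()
    top = ranked[0][1]
    tied = [genre for genre, count in ranked if count == top]
    return min(tied, key=lambda genre: global_counts[genre])


# Invented global occurrences, for illustration only.
global_counts = {'rock': 900, 'hard rock': 120, 'jazz': 1000}

# 'rock' and 'hard rock' both appear twice: the rarer 'hard rock' wins.
print(majority_genre(['rock', 'hard rock', 'rock', 'hard rock', 'jazz'], global_counts))

# No tie here: 'jazz' wins even though it is the most common genre overall.
print(majority_genre(['jazz', 'jazz', 'rock'], global_counts))
```

When an artist has no labelled neighbour at all, the class falls back to a random genre instead, which this sketch does not cover.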

In [None]:
ol.get_artist_genres(art_genre, genres)
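
The first stage of `get_artist_genres` can be summarised as: rank the artist's tags by vote count, rename them through the `transitions` table, and keep the first one that belongs to the list of retained genres. Below is a minimal sketch of that stage only, with invented tags and a hypothetical `pick_genre` helper; the sentence-similarity fallback used when no tag matches is left out.

```python
def pick_genre(tags, allowed, aliases):
    """Return the highest-count tag that is an allowed genre after renaming.

    tags:    list of (count, name) pairs
    allowed: set of retained genre names
    aliases: dict mapping alternative spellings to retained names
    Returns None when no tag matches.
    """
    for _, name in sorted(tags, key=lambda tag: tag[0], reverse=True):
        name = aliases.get(name, name)
        if name in allowed:
            return name
    return None


# Invented data, for illustration only.
allowed = {'rock', 'jazz', 'hip hop'}
aliases = {'hip-hop': 'hip hop'}
tags = [(2, 'rock'), (5, 'hip-hop'), (7, 'seen live')]

print(pick_genre(tags, allowed, aliases))                # hip hop
print(pick_genre([(1, 'seen live')], allowed, aliases))  # None
```

In the example the most voted tag, 'seen live', is not a genre, so the next tag is considered and renamed before being accepted.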

In [98]:
diz = Counter(occurrencies).most_common()
label_diz = {diz[idx][0] : idx for idx in range(len(diz))}


In [107]:
label_diz

{'jazz': 0,
 'hip hop': 1,
 'rock': 2,
 'indie': 3,
 'pop': 4,
 'alternative rock': 5,
 'metal': 6,
 'folk': 7,
 'pop rock': 8,
 'electronic': 9,
 'r&b': 10,
 'punk': 11,
 'dance': 12,
 'hard rock': 13,
 'soul': 14,
 'blues': 15,
 'country': 16,
 'classical': 17,
 'latin': 18,
 'progressive rock': 19,
 'new wave': 20,
 'funk': 21,
 'reggae': 22,
 'ambient': 23,
 'psychedelic rock': 24,
 'house': 25}
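
The label dictionary above assigns integer ids by decreasing genre frequency. A minimal sketch of the same encoding on invented data, using only the standard library, is shown here; the names `occurrences`, `label_map` and `artist_genres` are placeholders for this example, and in the notebook the resulting list is wrapped in a tensor afterwards.

```python
from collections import Counter

# Invented genre counts and artist assignments, for illustration only.
occurrences = {'rock': 5, 'jazz': 9, 'pop': 2}
artist_genres = {'0': 'rock', '1': 'jazz', '2': 'rock', '3': 'pop'}

# The most frequent genre gets id 0, the next one id 1, and so on.
ranked = Counter(occurrences).most_common()
label_map = {genre: idx for idx, (genre, _) in enumerate(ranked)}

# One integer label per artist, in the order of the artist dictionary.
labels = [label_map[genre] for genre in artist_genres.values()]

print(label_map)  # {'jazz': 0, 'rock': 1, 'pop': 2}
print(labels)     # [1, 0, 1, 2]
```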

In [101]:
lista = []
for element in data_genres_copy:
    lista.append(label_diz[data_genres_copy[element]])

In [104]:
label = torch.tensor(lista)