# Artist Similarity with Graph Neural Networks: 2nd Notebook

This notebook shows the performance of the networks trained as described in the first notebook.  
Going beyond what the authors did, we also inspect the quality of the recommended artists for a given query artist, with the aid of a K-NN search.  
We want to compare all the architectures and see how the GAT layer outperforms the GraphSAGE layer.

* Another important aspect of this notebook is the ability to create non-existent artists simply by specifying fake relationships with existing artists in the graph. The procedure is simple: we create a feature vector for the fake artist and embed it together with the other samples. It requires specifying one or more existing artists related to the fake one; this makes it possible to mix musical genres and see which artists are recommended.
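
As a rough illustration of the K-NN lookup mentioned above, the sketch below finds the rows of a small embedding matrix that are closest to a query row by Euclidean distance. It is a brute-force NumPy stand-in, not the kd-tree search used later in the notebook; the function `nearest_neighbors`, the toy embeddings and the artist names are all hypothetical.

```python
import numpy as np

def nearest_neighbors(embeddings, query_idx, k):
    """Return (index, distance) pairs for the k rows closest to the query row."""
    # Euclidean distance from the query row to every row
    dists = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    # Sort by distance and drop the query itself
    order = [i for i in np.argsort(dists, kind="stable") if i != query_idx]
    return [(int(i), round(float(dists[i]), 4)) for i in order[:k]]

# Toy 2-D embeddings for four made-up artists
names = ["artist_a", "artist_b", "artist_c", "artist_d"]
emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])

for idx, dist in nearest_neighbors(emb, query_idx=0, k=2):
    print(names[idx], dist)
```

Brute force is fine for a handful of points; for the thousands of artists in the dataset an indexed search such as scikit-learn's `NearestNeighbors` is the more practical choice.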

In [None]:
!apt install cm-super
! sudo apt-get install texlive-latex-recommended 
! sudo apt-get install dvipng texlive-latex-extra texlive-fonts-recommended  
! wget http://mirrors.ctan.org/macros/latex/contrib/type1cm.zip 
! unzip type1cm.zip -d /tmp/type1cm
! cd /tmp/type1cm/type1cm/ && sudo latex type1cm.ins
! sudo mkdir /usr/share/texmf/tex/latex/type1cm 
! sudo cp /tmp/type1cm/type1cm/type1cm.sty /usr/share/texmf/tex/latex/type1cm 
! sudo texhash 

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
[sudo] password for peppe: 

In [1]:
import os
import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import json
import torch.nn as nn
import torch.nn.functional as F
from torchmetrics.functional import pairwise_euclidean_distance
from torch_geometric.nn import GATConv, SAGEConv
from torch.optim import lr_scheduler
import random
from random import choice,randrange
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
import math
import time
from torch_geometric import seed_everything

random_seed=280085

seed_everything(random_seed)



1.13.0


In [2]:
from architectures import *
from utils import *

1.13.0


In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

The [Torch Geometric framework](https://pytorch-geometric.readthedocs.io/en/latest/) made it easy to handle the graph's nodes and attributes, and then to train the GNNs.

In [4]:
X = torch.load('instance').T.to(device)      # Instance matrix
A = torch.load('adjacencyCOO').to(device)    # Adjacency matrix in COO format, the format supported by Torch Geometric
A1 = torch.load('adjacency').to(device)      # Adjacency matrix in the standard matrix format
num_samples = X.shape[0]
print(num_samples)

11261


In [5]:
filt = True


art_of_interest = torch.load('intrst_artists.pt')
labels = torch.load('labels.pt')
if filt:
  X = X[art_of_interest].detach()
  A1 = A1[art_of_interest, :][:, art_of_interest].detach()
  labels = labels[art_of_interest].detach()
  A = torch.nonzero(A1).T.type(torch.LongTensor).detach()
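
The last line of the cell above rebuilds the `2 x num_edges` edge list (COO format) from the filtered adjacency matrix. A minimal sketch of the same idea on a toy graph follows, using NumPy's `np.nonzero` as a stand-in for `torch.nonzero`; the matrix, the `keep` array and the helper `dense_to_coo` are illustrative only.

```python
import numpy as np

# Adjacency matrix of a toy graph with edges 0-1 and 1-2 (both directions stored)
dense = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])

# Keeping a subset of nodes means slicing both rows and columns
keep = np.array([0, 1])
dense_sub = dense[keep, :][:, keep]

def dense_to_coo(adj):
    """Return a (2, num_edges) array of [source; target] node indices."""
    rows, cols = np.nonzero(adj)
    return np.stack([rows, cols])

edge_index = dense_to_coo(dense)
edge_index_sub = dense_to_coo(dense_sub)
print(edge_index)
print(edge_index_sub)
```

After filtering, the indices in the edge list are positions within `keep`, not the original node ids, which is why the notebook maps back through `art_of_interest` when it looks up artist names.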

In [6]:
''' These variables hold the artists' names and their positions in the dataset; this makes it easy to look up a name and to draw conclusions at inference time. '''
num2artist = load_data('data/artist_genres.json')

num2artist = {key: num2artist[key][1] for key in num2artist}
artist2num = {num2artist[key]:key for key in num2artist}

##### The data in the dataset correspond to a set of artists, whose names we store in two dictionaries. These two data structures are essential to keep track of the names and, if needed, to run more specific experiments.

## Import the data with Torch Geometric:
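
To make the two lookups concrete, here is a small sketch with made-up entries. The exact layout of `artist_genres.json` is not shown in this notebook; the sketch only assumes, as the indexing `[1]` in the cell above suggests, that the second element of each entry is the artist name. The genre lists and the names `raw`, `id_to_name` and `name_to_id` are hypothetical.

```python
# Hypothetical entries: each id maps to a list whose second element is the name
raw = {
    "0": [["classical"], "Artist A"],
    "1": [["jazz"], "Artist B"],
}

# id -> name, then the inverse mapping name -> id
id_to_name = {key: value[1] for key, value in raw.items()}
name_to_id = {name: key for key, name in id_to_name.items()}

print(id_to_name["1"])
print(name_to_id["Artist A"])
```

One caveat of inverting a dictionary this way: if two ids share the same name, only the last one survives in the name-to-id mapping.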


In [7]:
''' To conduct the experiments it was essential to split the dataset, both the nodes and the edges.
    The split follows the information in the paper, taking into account that a lot of data were lost during preprocessing.'''
device = 'cuda' if torch.cuda.is_available() else 'cpu'

from torch_geometric.data import Data
from torch_geometric.utils import structured_negative_sampling
data = Data(x=X, edge_index = A).to(device)


In [8]:
# Number of layers #
# G1 = GraphSage(1)
# G2 = GraphSage(2)
# G3 = GraphSage(3)
n_heads = 1
n_layers = 2
GATm = GATSY(n_heads, n_layers)
load_model('./models/GATSY.pt', GATm, device)


{'loss_train': [2.269769463274214,
  1.6764137148857117,
  1.5608794490496318,
  1.5088596807585821,
  1.4438371923234727,
  1.4114753074116178,
  1.3958813746770222,
  1.334642145368788,
  1.3022982080777485,
  1.2736189166704814,
  1.229991422759162,
  1.2069514393806458,
  1.1658276716868083,
  1.13701421684689,
  1.1066575778855219,
  1.1017080379856958,
  1.0755914747714996,
  1.0590944389502208,
  1.0537470777829487,
  1.052053560813268],
 'loss_test': [0.03271468294163545,
  0.013341386259223023,
  0.021987991717954476,
  0.02511707941691081,
  0.02458483725786209,
  0.016383607949440677,
  0.012226634426042438,
  0.014135833674420914,
  0.008476998734598359,
  0.01528483722358942,
  0.027775512387355168,
  0.009570427549382051,
  0.014422536206742128,
  0.032951719438036285,
  0.015219115108872453,
  0.014495457677791515,
  0.036713103453318276,
  0.011406351579353213,
  0.02107905223965645,
  0.028615133836865425],
 'accuracy': [0.5377052658824211,
  0.5500791817884849,
  0.55

## What happens if we create an unreal artist?
* Since we are now able to embed artists, it is easy to augment our graph data structure with new information, and we expect plausible results from it as well.
* We insert a new artist by specifying its connections in the graph, and then compute its features as a linear combination of its neighbors' features, namely their average.

* In the next cell we can either look up the neighbors of an existing artist or create a new artist and its neighborhood.
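
The augmentation described in the bullets above can be sketched on a toy graph as follows. This is a simplified NumPy illustration, not the notebook's implementation: `add_node`, `X_toy` and `E_toy` are hypothetical names, and the new node's features are the plain mean of its neighbors' feature vectors.

```python
import numpy as np

def add_node(features, edge_index, neighbor_ids):
    """Append a node whose features are the mean of its neighbors' features."""
    new_id = features.shape[0]
    new_feat = features[neighbor_ids].mean(axis=0, keepdims=True)
    features = np.concatenate([features, new_feat], axis=0)
    # Connect the new node to each neighbor in both directions
    src = np.array([new_id] * len(neighbor_ids) + list(neighbor_ids))
    dst = np.array(list(neighbor_ids) + [new_id] * len(neighbor_ids))
    edge_index = np.concatenate([edge_index, np.stack([src, dst])], axis=1)
    return features, edge_index, new_id

# Three toy artists with 2-D features and a single edge 0-1
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
E_toy = np.array([[0, 1], [1, 0]])

X_aug, E_aug, fake_id = add_node(X_toy, E_toy, neighbor_ids=[0, 2])
print(fake_id, X_aug[fake_id])
print(E_aug)
```

Choosing neighbors from different genres places the new node's features between them, which is what lets the fake artist mix genres.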

In [9]:
def get_nearest_artists(embedding, artist_name, K, artist_to_id, id_to_artist):
  '''To get the nearest artists we use the K-NN algorithm:
      - embedding:   embeddings of all the artists,
      - artist_name: query artist whose neighbors we are looking for,
      - K:           number of neighbors,
      - artist_to_id, id_to_artist: dictionaries that keep track of the artists and their ids. '''
  Knew = K+50  # query extra candidates, since some of them are discarded below
  T=embedding.detach().to(torch.device("cpu")).numpy()
  neigh=NearestNeighbors(n_neighbors=Knew,algorithm='kd_tree').fit(T)  # fit the K-NN index on all the embeddings
  if not filt:
    dist,ind = neigh.kneighbors(T[int(artist_to_id[artist_name])].reshape((1,-1)))
  else:
    idx = torch.nonzero(art_of_interest.where(art_of_interest == int(artist_to_id[artist_name]), torch.tensor([0]))).item()
    dist,ind = neigh.kneighbors(T[idx].reshape((1,-1)))


  neighbors_list = list(ind[0])[1:]
  dist_list = list(dist[0])[1:]
  neighbors_ = []
  c = 1
  while len(neighbors_)<K:
    if id_to_artist[str(neighbors_list[c])]!=None and id_to_artist[str(neighbors_list[c])] not in friend_artist_list:
      if not filt:
        neighbors_.append((id_to_artist[str(neighbors_list[c])],round(dist_list[c],4)))
        
      else:
        neighbors_.append((id_to_artist[str(art_of_interest[neighbors_list[c]].item())],round(dist_list[c],4)))


      c+=1
    else:
      c+=1

  #neighbors_list = [id_to_artist[str(artist)] for artist in neighbors_list if str(artist) in id_to_artist]
  
  return neighbors_

def get_embeddings(model, data):
  ''' This function simply computes the embeddings given a model and the data that we are using '''
  
  embedding = model(data.x, data.edge_index.to(device))

  return embedding

def add_new_artist(artist_query, friend_artist_list): 
  ''' This function augments the dataset if artist_query is not already present in it. 
      - artist_query:       artist for which we conduct the search,
      - friend_artist_list: list of existing artists related to artist_query.             '''


  X_new = X.clone()
  A_new = A.clone()
  artist2num_new = artist2num.copy()
  num2artist_new = num2artist.copy()
  if artist_query not in artist2num and len(friend_artist_list) != 0:
    print("{} does not exist in  the dataset, or in real life. \n But we still can create it!".format(artist_query))
    artist2num_new[artist_query] = str(X_new.shape[0])
    num2artist_new[str(X_new.shape[0])] = artist_query
    query_num = int(artist2num_new[artist_query])
    # One entry per input feature, instead of a hard-coded size
    feat_sum = torch.zeros(X_new.shape[1], dtype = X_new.dtype, device = X_new.device)
    n_valid = 0
    for artist in friend_artist_list:
      if artist not in artist2num_new:
        print("{} is not in the dataset, so it is not valid for the neighbors list".format(artist))
        continue  # skip unknown artists instead of raising a KeyError below
      artist_num = int(artist2num_new[artist])
      # Connect the new artist and its neighbor in both directions
      new_edges = torch.tensor([[query_num, artist_num], [artist_num, query_num]], device = A_new.device)
      A_new = torch.cat((A_new, new_edges), dim = 1)
      feat_sum += X_new[artist_num]
      n_valid += 1
    feat_sum /= max(n_valid, 1)  # average over the valid neighbors only
    X_new = torch.cat((X_new, feat_sum.unsqueeze(0)), dim = 0)

    data = Data(x=X_new, edge_index = A_new)
    print("\n{} has been created considering its neighbors:\n {}\n".format(artist_query, friend_artist_list))
  else:
    print("{} is an existing artist".format(artist_query))
    data = Data(x=X, edge_index = A)


  return data, artist2num_new, num2artist_new


data_n = data
artist2num_new = artist2num
num2artist_new = num2artist




Define `friend_artist_list` before running the next cell. To create a fictitious artist, you must specify the existing artists it should be related to.

In [10]:
friend_artist_list = ['Ludwig van Beethoven']
embedding = get_embeddings(model = GATm.to(device), data = data_n)
artist_name = 'Ludwig van Beethoven'

if artist_name in list(artist2num_new.keys()):
    print(get_nearest_artists(embedding, artist_name, K = 10, artist_to_id = artist2num_new, id_to_artist = num2artist_new))
else:
    data_n, artist2num_new, num2artist_new = add_new_artist(artist_name, friend_artist_list = friend_artist_list)
    print('Run again the cell!!!')


[('Antal Doráti', 22.8701), ('Münchner Symphoniker', 22.8782), ('Bruce Broughton', 22.9002), ('Hilary Hahn', 22.9572), ('Piero Piccioni', 23.0188), ('Jascha Heifetz', 23.0403), ('Eubie Blake', 23.1264), ('Nigel Kennedy', 23.2416), ('Sir Neville Marriner', 23.4365), ('Pierre Boulez', 23.4504)]


In [11]:
import seaborn as sns
import pandas as pd
from pylab import rcParams
from matplotlib import gridspec
import matplotlib
import matplotlib.font_manager


import matplotlib.pyplot as plt
def plot_cluster(embedding1, embedding2, labels, n_clusters):
  DataF1 = pd.DataFrame()
  DataF1['x'] = embedding1[:,0]
  DataF1['y'] = embedding1[:,1]
  DataF1['labels'] = labels


  DataF2 = pd.DataFrame()
  DataF2['x'] = embedding2[:,0]
  DataF2['y'] = embedding2[:,1]
  DataF2['labels'] = labels

#   DataF3 = pd.DataFrame()
#   DataF3['x'] = embedding3[:,0]
#   DataF3['y'] = embedding3[:,1]
#   DataF3['labels'] = clust_obj1.predict(embedding_emb)

#   DataF4 = pd.DataFrame()
#   DataF4['x'] = embedding4[:,0]
#   DataF4['y'] = embedding4[:,1]
#   DataF4['labels'] = clust_obj2.predict(random_embedding_emb)
  
  fig1, axes1 = plt.subplots(1, 2, figsize=(20, 7))
  # fig1.suptitle('Representantion of the two types of input features, after the dimensionality reduction')
  sns.scatterplot(ax = axes1[0],
      x="x", y="y",
      data=DataF1,
      hue="labels",
      palette=sns.color_palette("hls", n_clusters),
      legend="full",
      alpha=0.3
  )
  axes1[0].set_title('Low level features')
  sns.scatterplot(ax = axes1[1],
      x="x", y="y",
      data=DataF2,
      hue="labels",
      palette=sns.color_palette("hls", n_clusters),
      legend="full",
      alpha=0.3
  )
  axes1[1].set_title('Random features')
  plt.subplots_adjust(left=0.1,
                  bottom=0.1,
                  right=0.9,
                  top=0.9,
                  wspace=0.4,
                  hspace=0.4)
  plt.savefig('comparisons.pdf')
  plt.show()
#   fig2, axes2 = plt.subplots(1, 2, figsize=(20, 7))
#   # fig2.suptitle('Representantion of the embeddings, after the dimensionality reduction')

#   sns.scatterplot(ax = axes2[0],
#       x="x", y="y",
#       data=DataF3,
#       hue="labels",
#       palette=sns.color_palette("hls", K),
#       legend=False,
#       alpha=0.3
#   )
#   axes2[0].set_title('Low level features')
#   sns.scatterplot(ax = axes2[1],
#       x="x", y="y",
#       data=DataF4,
#       hue="labels",
#       palette=sns.color_palette("hls", K),
#       legend=False,
#       alpha=0.3
#   )
#   axes2[1].set_title('Random features')
#   plt.subplots_adjust(left=0.1,
#                     bottom=0.1,
#                     right=0.9,
#                     top=0.9,
#                     wspace=0.4,
#                     hspace=0.4)
#   plt.savefig('dimred2.pdf')
#   plt.show()

  
params = {
   'axes.labelsize': 26,
   'font.family': 'Times New Roman',
   'font.size': 35,
   'legend.fontsize': 12,
   'xtick.labelsize': 23,
   'ytick.labelsize': 23,
   'text.usetex': True,
   'figure.figsize': [4*1.5, 3.5*1.5]
   }

rcParams.update(params)
matplotlib.rc('pdf', fonttype=42)
matplotlib.font_manager._rebuild()

instance_emb = torch.load('embeds/low_level.pt')
embedding_emb = torch.load('embeds/embedding.pt')


# plot_cluster(new_emb, clustering, K, 'K-means clusters')
plot_cluster(instance_emb, embedding_emb, labels = labels, n_clusters = 25)
# plot_cluster(new_emb, gmm, K, 'Gaussian-Mixture clusters')


RuntimeError: latex was not able to process the following string:
b'lp'

Here is the full report generated by latex:
This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019/Debian) (preloaded format=latex)
 restricted \write18 enabled.
entering extended mode
(/home/peppe/.cache/matplotlib/tex.cache/c4b3685a49380aa0ffb2879185e40c68.tex
LaTeX2e <2020-02-02> patch level 2
L3 programming layer <2020-02-14>
(/usr/share/texlive/texmf-dist/tex/latex/base/article.cls
Document Class: article 2019/12/20 v1.4l Standard LaTeX document class
(/usr/share/texlive/texmf-dist/tex/latex/base/size10.clo))
(/usr/share/texlive/texmf-dist/tex/latex/type1cm/type1cm.sty)

! LaTeX Error: File `type1ec.sty' not found.

Type X to quit or <RETURN> to proceed,
or enter new name. (Default extension: sty)

Enter file name: 
! Emergency stop.
<read *> 
         
l.6 \usepackage
               {type1ec}^^M
No pages of output.
Transcript written on c4b3685a49380aa0ffb2879185e40c68.log.




<Figure size 1440x504 with 2 Axes>
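
The error above reports that LaTeX could not find `type1ec.sty`, a file shipped with the `cm-super` package that the first cell tried to install. One way to avoid the failure is to enable `text.usetex` only when LaTeX and that file are actually available. The sketch below is an assumption-laden illustration, not part of the original notebook: the helpers `tex_file_available` and `safe_plot_params` are hypothetical, and the check relies on the `kpsewhich` tool that comes with TeX Live.

```python
import shutil
import subprocess

def tex_file_available(filename):
    """Return True if kpsewhich can locate the given TeX file."""
    if shutil.which("kpsewhich") is None:
        return False  # no TeX distribution on the PATH
    result = subprocess.run(["kpsewhich", filename],
                            capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() != ""

def safe_plot_params(params):
    """Copy params, disabling usetex when LaTeX or type1ec.sty is missing."""
    safe = dict(params)
    has_latex = shutil.which("latex") is not None
    safe["text.usetex"] = bool(
        params.get("text.usetex", False)
        and has_latex
        and tex_file_available("type1ec.sty")
    )
    return safe

params = {"font.size": 35, "text.usetex": True}
print(safe_plot_params(params)["text.usetex"])
```

The returned dictionary can be passed to `rcParams.update` in place of `params`, so the plot falls back to Matplotlib's built-in text rendering when the LaTeX setup is incomplete.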

In [12]:
!pip install matplotlib==3.2.2

Collecting matplotlib==3.2.2
  Downloading matplotlib-3.2.2.tar.gz (40.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.3/40.3 MB 757.7 kB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: matplotlib
  Building wheel for matplotlib (setup.py) ... done
  Created wheel for matplotlib: filename=matplotlib-3.2.2-cp39-cp39-linux_x86_64.whl size=8648089 sha256=dec69903cb7a50c22b5dcc4d3ca09786bb16639be1d73ea649e0e370e58a75f1
  Stored in directory: /home/peppe/.cache/pip/wheels/eb/ab/a7/4319da29f630cb41caf52c8bdfac311c21e774532fd6e101e4
Successfully built matplotlib
Installing collected packages: matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.0.3
    Uninstalling matplotlib-3.0.3:
      Successfully uninstalled matplotlib-3.0.3
Successfully installed matplotlib-3.2.2
