# Artist Similarity with Graph Neural Networks: 2nd Notebook

This notebook shows the performance of the networks trained as described in the first notebook.  
In addition to the authors' evaluation, we inspect the quality of the recommended artists, using a query artist and a K-NN computation.  
We want to compare all the architectures and see how the GAT layer outperforms the GraphSAGE layer in the results.
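
The K-NN lookup mentioned above can be illustrated with a small dependency-free sketch. The toy embeddings and the helper name `k_nearest` are hypothetical and only show the idea: rank all other artists by Euclidean distance from the query and keep the closest `k`.

```python
import math

def k_nearest(embeddings, query, k):
    """Return the k rows closest to row `query` as (index, distance) pairs."""
    dists = [
        (i, math.dist(embeddings[query], row))
        for i, row in enumerate(embeddings)
        if i != query  # skip the query itself, which would trivially be at distance 0
    ]
    dists.sort(key=lambda pair: pair[1])
    return dists[:k]

# Hypothetical 2-D embeddings for four artists
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
nearest = k_nearest(toy_embeddings, query=0, k=2)
print(nearest)
```

On real embeddings a tree-based index such as scikit-learn's `NearestNeighbors` is a more practical choice than this brute-force loop.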

* Another important aspect of this notebook is the possibility of creating non-existing artists just by specifying fake relationships in the graph with existing artists. The procedure is simple: we create a feature vector for the fake artist and then embed it together with the other samples. It also requires specifying one or more existing artists that are related to the fake one; in this way it is possible to mix musical genres and see which artists are recommended.

In [19]:
import os
import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import json
import torch.nn as nn
import torch.nn.functional as F
from torchmetrics.functional import pairwise_euclidean_distance
from torch_geometric.nn import GATConv, SAGEConv
from torch.optim import lr_scheduler
import random
from random import choice,randrange
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
import math
import time
from torch_geometric import seed_everything
import ipywidgets as widgets

random_seed=280085

seed_everything(random_seed)

1.12.1


In [2]:
from architectures import *
from utils import *

1.12.1


In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

With the help of the [Torch geometric framework](https://pytorch-geometric.readthedocs.io/en/latest/), it was really easy to handle the graph attributes and nodes, and then to train the GNNs.

In [4]:
X = torch.load('instance').T.to(device)      # Instance (feature) matrix, one row per artist
A = torch.load('adjacencyCOO').to(device)    # Adjacency matrix in COO format, the one supported by Torch Geometric
A1 = torch.load('adjacency').to(device)      # Adjacency matrix in the normal (non-COO) format
num_samples = X.shape[0]
print(num_samples)

11261
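
To make the difference between the two adjacency formats concrete, here is a minimal pure-Python sketch; the toy graph and the helper `dense_to_coo` are illustrative assumptions, not part of the notebook. In COO form the graph is stored as two parallel lists (sources and targets), which is the `[2, num_edges]` layout that Torch Geometric expects for `edge_index`.

```python
def dense_to_coo(adj):
    """Convert a dense 0/1 adjacency matrix into COO form: [sources, targets]."""
    sources, targets = [], []
    for i, row in enumerate(adj):
        for j, value in enumerate(row):
            if value:
                sources.append(i)
                targets.append(j)
    return [sources, targets]

# Toy undirected graph with edges 0-1 and 1-2
dense = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
coo = dense_to_coo(dense)
print(coo)
```

Each undirected edge appears twice in the COO lists, once per direction.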


In [17]:
''' These variables contain the artists' names and their positions in the dataset. This makes it easy to look up their names and to draw better conclusions at inference time. '''
num2artist = load_data('dizofartist.json')
artist2num = {num2artist[key]:key for key in num2artist}


##### The data in the dataset correspond to a set of artists, whose names we have stored in two dictionaries. These two data structures are fundamental to keep track of their names and, if needed, to run more specific experiments.
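
As a small illustration of how the two dictionaries work together, the sketch below uses a made-up JSON string in place of the real file (its exact content is an assumption). Since JSON object keys are always strings, the ids are stored as strings and must be converted with `int(...)` before indexing a tensor and with `str(...)` before looking up a name.

```python
import json

# Hypothetical content of a file like 'dizofartist.json': ids are JSON keys, hence strings
raw = '{"0": "Pink Floyd", "1": "Dire Straits", "2": "Jethro Tull"}'
num2artist_toy = json.loads(raw)
artist2num_toy = {name: idx for idx, name in num2artist_toy.items()}

row = int(artist2num_toy["Dire Straits"])   # string id -> integer row index
name = num2artist_toy[str(row)]             # integer row index -> artist name
print(row, name)
```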

## Import the data with Torch geometric:


In [31]:
''' To conduct the experiments it was fundamental to split the dataset, both the nodes and the edges.
    The splitting was performed according to the information in the paper, taking into account that a lot of data were lost in the preprocessing step.'''
device = 'cuda' if torch.cuda.is_available() else 'cpu'

from torch_geometric.data import Data
from torch_geometric.utils import structured_negative_sampling
data = Data(x=X, edge_index = A).to(device)


In [56]:
# Number of layers #
# G1 = GraphSage(1)
# G2 = GraphSage(2)
# G3 = GraphSage(3)
no_layer = FCL()

Gat1 = GAT1()
Gat2 = GAT2(n_heads = 1)
Gat2r = GAT2(n_heads = 1)

GATSY = Gat2
load_model('./models/GATSY.pt', GATSY, device)


{'loss_train': [0.027967893415027194,
  0.015920915433930025,
  0.014355172403156757,
  0.014508430587334765,
  0.01510198957597216,
  0.014256522214661041,
  0.013632400789194636,
  0.01235945119212071,
  0.012553865348713266,
  0.012506719368199507,
  0.011514446128987603,
  0.012266874908366136,
  0.011964276743431887,
  0.013071277230564091,
  0.01269358080915279,
  0.01217353121481008,
  0.012001497722748253,
  0.012872238462376926,
  0.013159838401608996,
  0.012182174767884944,
  0.012623874884512689,
  0.012266878918227222,
  0.012580310253219472,
  0.012960255663428042,
  0.01238662206257383,
  0.012310455630843839,
  0.012861440941277478,
  0.013576887961890962,
  0.01247345775158869,
  0.012350179203268554,
  0.012874454932494296,
  0.013341398340546422,
  0.013012649491429329,
  0.012858477731545767,
  0.013353794140534269,
  0.013689418582038747,
  0.012328895274549723,
  0.012894894720779525,
  0.012971815704885457,
  0.012821038201865222,
  0.01316362867752711,
  0.01217

## What happens if we create an unreal artist?
* Since we are now able to embed artists, it is easy to augment our graph data structure with new information, and we would expect to obtain plausible results as well.
* Thus we insert a new artist by specifying its connections in the graph, and then compute its features as a linear combination of its neighbors' features, namely their average.

* In the next cell we can either look for the neighbors of an existing artist or create a new artist and its neighborhood.
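
The augmentation described in the list above can be sketched on plain Python lists. The helper `add_node` and the toy data are hypothetical and only illustrate the two steps: average the neighbors' features, then append the new edges in both directions.

```python
def add_node(features, edges, neighbor_ids):
    """Append a node whose features are the mean of its neighbors' features.

    features: list of feature vectors, one per node
    edges:    COO edge list as [sources, targets]
    Returns new copies; the inputs are left untouched.
    """
    new_id = len(features)
    dim = len(features[0])
    mean = [
        sum(features[n][d] for n in neighbor_ids) / len(neighbor_ids)
        for d in range(dim)
    ]
    new_features = [list(row) for row in features] + [mean]
    sources, targets = list(edges[0]), list(edges[1])
    for n in neighbor_ids:
        # store both directions so the graph stays undirected
        sources += [new_id, n]
        targets += [n, new_id]
    return new_features, [sources, targets], new_id

toy_features = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
toy_edges = [[0, 1], [1, 0]]
new_features, new_edges, new_id = add_node(toy_features, toy_edges, [0, 1])
print(new_id, new_features[new_id], new_edges)
```

Because the new node sits at the average of its neighbors in feature space, choosing neighbors from different genres places it between them.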

In [57]:
def get_nearest_artists(embedding, artist_name, K, artist_to_id, id_to_artist):
  '''Return the K nearest artists to a query artist, using the K-NN algorithm:
      - embedding:   embedding matrix of the artists (one row per artist),
      - artist_name: query artist whose neighbors we are looking for,
      - K:           number of neighbors to return,
      - artist_to_id and id_to_artist: dictionaries that keep track of the artists and their ids.
     Artists in the global friend_artist_list are skipped. '''
  Knew = K+50  # ask for extra neighbors, since some candidates are filtered out below
  T=embedding.detach().to(torch.device("cpu")).numpy()
  neigh=NearestNeighbors(n_neighbors=Knew,algorithm='kd_tree').fit(T)  # fit the K-NN index on all the embeddings
  dist,ind = neigh.kneighbors(T[int(artist_to_id[artist_name])].reshape((1,-1)))

  # Drop the first result, which is the query artist itself (distance 0)
  neighbors_list = list(ind[0])[1:]
  dist_list = list(dist[0])[1:]
  neighbors_ = []
  c = 0  # start from the closest remaining neighbor (starting at 1 would skip it)
  while len(neighbors_)<K and c<len(neighbors_list):  # stop when the candidates run out
    name = id_to_artist.get(str(neighbors_list[c]))
    if name is not None and name not in friend_artist_list:
      neighbors_.append((name,round(dist_list[c],4)))
    c+=1

  return neighbors_

def get_embeddings(model, data):
  ''' This function simply computes the embeddings, given a model and the data that we are using '''
  
  embedding = model(data.x, data.edge_index.to(device))

  return embedding

def add_new_artist(artist_query, friend_artist_list): 
  ''' This function augments the dataset if artist_query is not already present in it.
      - artist_query:       artist for which we conduct the search,
      - friend_artist_list: list of existing artists related to the new one.
      Returns the (possibly augmented) data and the two name/id dictionaries. '''

  X_new = X.clone()
  A_new = A.clone()
  artist2num_new = artist2num.copy()
  num2artist_new = num2artist.copy()
  if artist_query not in artist2num and len(friend_artist_list) != 0:
    print("{} does not exist in the dataset, or in real life. \n But we can still create it!".format(artist_query))
    new_id = X_new.shape[0]  # the new artist is appended as the last node
    artist2num_new[artist_query] = str(new_id)
    num2artist_new[str(new_id)] = artist_query
    feat_sum = torch.zeros(X_new.shape[1], device = device)  # same size as the existing feature vectors
    n_valid = 0
    for artist in friend_artist_list:
      if artist not in artist2num:
        print("{} is not in the dataset, so it is not valid for the neighbors list".format(artist))
        continue  # skip unknown artists instead of failing on the lookup below
      friend_id = int(artist2num[artist])
      # Add the new edge in both directions
      A_new = torch.cat((A_new, torch.tensor([[new_id, friend_id],[friend_id, new_id]], device = device)), dim = 1)
      feat_sum += X_new[friend_id]
      n_valid += 1
    feat_sum /= max(n_valid, 1)  # average over the valid neighbors only
    X_new = torch.cat((X_new, feat_sum.unsqueeze(0)), dim = 0)

    data = Data(x=X_new, edge_index = A_new)
    print("\n{} has been created considering its neighbors:\n {}\n".format(artist_query, friend_artist_list))
  else:
    print("{} is an existing artist".format(artist_query))
    data = Data(x=X, edge_index = A)


  return data, artist2num_new, num2artist_new


data_n = data
artist2num_new = artist2num
num2artist_new = num2artist




Please define `friend_artist_list` before running the search, since `get_nearest_artists` reads it. To create a fictitious artist, it is of course required to specify the existing artists to relate it to.

In [None]:
friend_artist_list = ['Pink Floyd', 'Dire Straits', 'Jethro Tull']
embedding = get_embeddings(model = Gat2.to(device), data = data_n)
artist_name = 'Roger Waters'

if artist_name in artist2num_new:
    print(get_nearest_artists(embedding, artist_name, K = 10, artist_to_id = artist2num_new, id_to_artist = num2artist_new))
else:
    # Unknown artist: create it, then run this cell again to embed it and list its neighbors
    data_n, artist2num_new, num2artist_new = add_new_artist(artist_name, friend_artist_list = friend_artist_list)

