# Graph Neural Networks

Graph Neural Networks (GNNs) apply machine learning to graph-structured data: nodes (also called vertices) connected by edges.

## Deciding the Task

The goal of a recommender system is to predict which items a user will like and to recommend items on that basis. In this section we predict whether or not a user will like a given artist. To do this, we create an 80-20 training-testing split of the data. For most users, the `user_artists` dataset contains their top 50 artists, so for each user we hold out 10 artists at random as testing data and keep the other 40 for training. We then train a GNN on the 40 training artists and evaluate it by how well it predicts the user's interest in the 10 held-out artists. If the model performs well on this task, we could use it to predict whether or not a user will like an artist outside their 50 listed artists.

In [1]:
import pandas as pd
import os
import sklearn as sk

from GNNfuncs import *

import findspark
findspark.init()  # locate the Spark installation before importing pyspark

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from graphframes import GraphFrame

## Read in Data

In [2]:
artists = pd.read_csv(os.path.join('..','data','artists.dat'), delimiter='\t')
tags = pd.read_csv(os.path.join('..','data','tags.dat'), delimiter='\t',encoding='ISO-8859-1')
user_artists = pd.read_csv(os.path.join('..','data','user_artists.dat'), delimiter='\t')
user_friends = pd.read_csv(os.path.join('..','data','user_friends.dat'), delimiter='\t')
user_taggedartists_timestamps = pd.read_csv(os.path.join('..','data','user_taggedartists-timestamps.dat'), delimiter='\t')
user_taggedartists = pd.read_csv(os.path.join('..','data','user_taggedartists.dat'), delimiter='\t')

## Format Data

Here we format the data so that it works better in a graph. Currently the user, artist, and tag identifiers overlap, since each is just an integer. If we added these as nodes of a graph we would not be able to distinguish between them, so we prepend the node type to each number and store the result as a string.

In [3]:
if user_artists['userID'][0] != 'user2':
    user_artists['userID'] = 'user' + user_artists['userID'].astype(str) # Place 'user' before each userID 
    user_artists['artistID'] = 'artist' + user_artists['artistID'].astype(str) # Place 'artist' before each artistID
    user_friends['userID'] = 'user' + user_friends['userID'].astype(str) # Place 'user' before each userID
    user_friends['friendID'] = 'user' + user_friends['friendID'].astype(str) # Place 'user' before each friendID
    user_taggedartists['artistID'] = 'artist' + user_taggedartists['artistID'].astype(str) # Place 'artist' before each artistID
    user_taggedartists['tagID'] = 'tag' + user_taggedartists['tagID'].astype(str) # Place 'tag' before each tagID
    user_taggedartists['userID'] = 'user' + user_taggedartists['userID'].astype(str) # Place 'user' before each userID
    artists['id'] = 'artist' + artists['id'].astype(str) # Place 'artist' before each artistID
    tags['tagID'] = 'tag' + tags['tagID'].astype(str) # Place 'tag' before each tagID
    print('Designations added')
else:
    print('Designations already present')

Designations added
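
To see why the prefixes matter, here is a minimal sketch on toy data. The values and the `toy_*` names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): user 2 and artist 2 share the same integer ID.
toy_users = pd.Series([2, 3, 4], name="userID")
toy_artists = pd.Series([2, 51, 52], name="artistID")

# Raw integer IDs collide, so they could not share one vertex table.
raw_overlap = set(toy_users) & set(toy_artists)

# Prefixing each ID with its node type makes the identifiers disjoint.
user_ids = "user" + toy_users.astype(str)
artist_ids = "artist" + toy_artists.astype(str)
prefixed_overlap = set(user_ids) & set(artist_ids)

print(raw_overlap)       # the shared integer ID
print(prefixed_overlap)  # empty once the prefixes are added
```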


## Testing-Training Split

We need to temporarily set aside users with only one artist, because the stratified split below requires at least two rows per user.

In [4]:
users = user_artists['userID'].unique()
singleartistusers = [user for user in users if len(get_artists(user,user_artists)) == 1]
singleartistusersdf = user_artists[user_artists['userID'].isin(singleartistusers)]
user_artists_temp = user_artists[~user_artists['userID'].isin(singleartistusers)]
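
The cell above relies on `get_artists`, a helper from `GNNfuncs` that is not shown here. As a rough sketch of the same idea using only pandas, the toy example below separates single-artist users with a per-user row count. The toy rows are made up, and it assumes one row per user-artist pair, so that rows per user equals artists per user.

```python
import pandas as pd

# Toy stand-in for user_artists (hypothetical rows).
toy = pd.DataFrame({
    "userID": ["user1", "user1", "user2", "user3", "user3", "user3"],
    "artistID": ["artist1", "artist2", "artist1", "artist2", "artist3", "artist4"],
})

# For every row, the number of rows recorded for that row's user.
rows_per_user = toy.groupby("userID")["artistID"].transform("size")

single = toy[rows_per_user == 1]  # users with exactly one artist
multi = toy[rows_per_user > 1]    # everyone else, safe to stratify on userID

print(single)
```

This avoids a Python-level loop over users, which may matter on larger tables.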

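Take an 80-20 split, stratified by user so that each user contributes the same proportion of their artists to training and testing. The single-artist users are then added back to the training set.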

In [5]:
from sklearn.model_selection import train_test_split

user_artists_train, user_artists_test = train_test_split(user_artists_temp, test_size = 0.2, stratify = user_artists_temp['userID'], random_state = 47)

user_artists_train = pd.concat([user_artists_train,singleartistusersdf])
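
As an illustration of what stratifying on `userID` does, the sketch below runs the same `train_test_split` call on toy data (three made-up users with ten artists each) and counts the held-out rows per user.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 3 users with 10 artists each.
toy = pd.DataFrame({
    "userID": [f"user{u}" for u in range(3) for _ in range(10)],
    "artistID": [f"artist{a}" for _ in range(3) for a in range(10)],
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["userID"], random_state=47
)

# Rows per user on each side of the split; stratifying on userID
# spreads the held-out rows evenly across users.
test_counts = toy_test["userID"].value_counts()
train_counts = toy_train["userID"].value_counts()
print(test_counts)
print(train_counts)
```

Without `stratify`, the 20% would be drawn from the table as a whole, so some users could end up with many held-out artists and others with none.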

We need to remove the same user-artist pairs from `user_taggedartists`, so that the training data holds no tags a user gave to an artist we want to test them on, as this would indicate interest.

In [6]:
user_taggedartists_test = user_taggedartists.merge(user_artists_test[['userID','artistID']], on = ['userID','artistID'], how = 'inner')
user_taggedartists_train = user_taggedartists.merge(user_artists_test[['userID','artistID']], on = ['userID','artistID'], how = 'left', indicator = True)
user_taggedartists_train = user_taggedartists_train[user_taggedartists_train['_merge'] == 'left_only'].drop(columns = ['_merge'])
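
The `indicator = True` pattern above is an anti-join: keep the rows of the left table that have no match in the right table. A small sketch on toy data (the rows and the `tagged` / `held_out` names are made up) shows how the two merges divide the tag rows.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) for user_taggedartists and user_artists_test.
tagged = pd.DataFrame({
    "userID": ["user1", "user1", "user2"],
    "artistID": ["artist1", "artist2", "artist1"],
    "tagID": ["tag7", "tag8", "tag7"],
})
held_out = pd.DataFrame({"userID": ["user1"], "artistID": ["artist2"]})

# A left merge with indicator=True marks each row as 'both' or 'left_only'.
marked = tagged.merge(held_out, on=["userID", "artistID"], how="left", indicator=True)

# Anti-join: keep only the rows with no match in held_out.
tagged_train = marked[marked["_merge"] == "left_only"].drop(columns=["_merge"])

# The inner merge keeps the complementary rows.
tagged_test = tagged.merge(held_out, on=["userID", "artistID"], how="inner")

print(tagged_train)
print(tagged_test)
```

The two results partition the tag rows only if the held-out table has no duplicate user-artist pairs, which is assumed here.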

In [7]:
filepath = os.path.join('..','SheridanH','data')
for df in [user_artists_train,user_artists_test,user_taggedartists_train,user_taggedartists_test]:
    df.to_csv(os.path.join(filepath,get_df_name(df, globals())))

## Define Vertices and Edges

In [8]:
# Define vertices
user_vertices = pd.DataFrame(user_artists_train['userID'].unique(), columns = ['id']) # All users as nodes

artist_vertices = pd.DataFrame(artists['id'].unique(), columns = ["id"]) # all artists as nodes

tag_vertices = pd.DataFrame(tags['tagID'].unique(), columns = ["id"]) # all tags as nodes

# Define edges
user_artist_edges = user_artists_train.drop('weight', axis = 1).rename(columns = {'userID' : 'src', 'artistID' : 'dst'})
user_artist_edges['type'] = 'listens' # user -> artist edges labelled 'listens'

user_tag_edges = user_taggedartists_train.rename(columns = {'userID' : 'src', 'tagID' : 'dst'})
user_tag_edges = user_tag_edges.drop(columns = ['day','month','year','artistID'])
user_tag_edges['type'] = 'tag_used' # user -> tag edges labelled 'tag_used'


artist_tag_edges = user_taggedartists_train.rename(columns = {'artistID' : 'src', 'tagID' : 'dst'})
artist_tag_edges = artist_tag_edges.drop(columns = ['day','month','year','userID'])
artist_tag_edges['type'] = 'tagged_as' # artist -> tag edges labelled 'tagged_as'

user_user_edges = user_friends.rename(columns = {'userID' : 'src', 'friendID' : 'dst'})
user_user_edges['type'] = 'friend' # user <-> user edges labelled 'friend'

## Bipartite Graph

We start by making a graph using just the user-artist interactions.

In [9]:
vertices = pd.concat([user_vertices,artist_vertices])
edges = pd.concat([user_artist_edges])

print(vertices.shape,edges.shape)

(19524, 1) (74268, 3)
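
Before building the graph, one might want to check that every edge endpoint appears in the vertex table and that the edges really are bipartite. The following is a sketch of such a check on toy data. The rows and the `toy_*`, `known`, and `dangling` names are hypothetical, and this check is not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) for the vertices and edges tables.
toy_vertices = pd.DataFrame({"id": ["user1", "user2", "artist1", "artist2"]})
toy_edges = pd.DataFrame({
    "src": ["user1", "user2", "user2"],
    "dst": ["artist1", "artist1", "artist3"],  # artist3 is deliberately missing from the vertices
    "type": "listens",
})

known = set(toy_vertices["id"])

# Edges whose source or destination has no matching vertex.
dangling = toy_edges[~toy_edges["src"].isin(known) | ~toy_edges["dst"].isin(known)]

# In a bipartite user-artist graph, every edge runs from a user to an artist.
is_bipartite = bool(
    toy_edges["src"].str.startswith("user").all()
    and toy_edges["dst"].str.startswith("artist").all()
)

print(dangling)
print(is_bipartite)
```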


In [10]:
python_path = r'C:\Users\Sheri\AppData\Local\Programs\Python\Python311\python.exe'  
os.environ['PYSPARK_PYTHON'] = python_path
os.environ['PYSPARK_DRIVER_PYTHON'] = python_path

spark = SparkSession.builder \
    .appName("GNN") \
    .getOrCreate()

In [11]:
graph = GraphFrame(spark.createDataFrame(vertices),spark.createDataFrame(edges))

