<p><img alt="TigerGraph logo" height="45px" src="https://blobscdn.gitbook.com/v0/b/gitbook-28427.appspot.com/o/spaces%2F-LHvjxIN4__6bA0T-QmU%2Favatar.png?generation=1532158270801864&amp;alt=media" align="left" hspace="10px" vspace="0px"></p>

# Graph Convolutional Neural Networks for Movie Recommendation
------
## Introduction
This notebook walks through a basic example of using a graph convolutional neural network (GCN) for recommendation. The data is collected from a TigerGraph database using the Python package [pyTigerGraph](https://github.com/parkererickson/pyTigerGraph). The collected data is then passed through a GCN to predict a person's viewing preferences. This example makes a couple of oversimplifications, pointed out along the way, mainly in the assumptions about a person's preferences.


Colab Notebook: [Original Version](https://colab.research.google.com/drive/11tcL4KXXwY__TmUUTjOf6InFQMC-VsG6)

## Install Queries on TigerGraph Server
You need to create and install two queries on the TigerGraph server: one named userRatings and another named movieLinks.

```
CREATE QUERY userRatings(VERTEX<USER> user) FOR GRAPH Recommender { 
  /* movieID | userID | userRating | term | termRating */
  SumAccum<float> @rating;
	
	src = {user};
  
	S1 = SELECT tgt FROM src:s -(rate:e)-> MOVIE:tgt
       ACCUM tgt.@rating += e.rating;

  PRINT S1[S1.movie_id as movieID, S1.name as movieTitle, S1.@rating as userRating];
}
```

```
CREATE QUERY movieLinks() FOR GRAPH Recommender SYNTAX v2{ 
	TYPEDEF TUPLE <STRING src, STRING dest> TUPLE_RECORD;
	ListAccum<TUPLE_RECORD> @@tupleRecords;
	movies = {MOVIE.*};  
	result = SELECT tgt FROM movies:s-(:e1)-TERM:mid-(:e2)-MOVIE:tgt WHERE s != tgt 
	         ACCUM @@tupleRecords += TUPLE_RECORD (s.name, tgt.name);
	PRINT @@tupleRecords;
}
```
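
To make the shape of the movieLinks result concrete, here is a minimal pure-Python sketch of the same idea: emit an ordered (src, dest) record for every two distinct movies that share a term. The `movie_terms` dictionary, its titles, and the `movie_links` function are made up for illustration and are not part of the Recommender graph. The sketch also assumes the query adds one record per matched path, so two movies sharing two terms would appear twice.

```python
from itertools import permutations

# Hypothetical toy data: movie title -> set of terms (genres).
movie_terms = {
    "Movie A": {"horror"},
    "Movie B": {"horror", "comedy"},
    "Movie C": {"comedy"},
}

def movie_links(movie_terms):
    """Return (src, dest) records for distinct movies that share a term."""
    records = []
    for src, dest in permutations(movie_terms, 2):  # ordered pairs, src != dest
        # Assumption: one record per shared term, like one record per matched path.
        for _term in sorted(movie_terms[src] & movie_terms[dest]):
            records.append({"src": src, "dest": dest})
    return records

links = movie_links(movie_terms)
print(links)
```

Each record carries the same `src` and `dest` keys that the notebook later reads from `@@tupleRecords`.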

## Installing Packages
The core packages that need to be installed are PyTorch, dgl, and pyTigerGraph. PyTorch and dgl are used for creating and training the GCN, while pyTigerGraph is used for connecting to the TigerGraph database. We also import networkx for converting the list of edges from TigerGraph into a graph dgl can work with.

In [0]:
!pip install pyTigerGraph
!pip install torch torchvision
!pip install dgl
!pip install networkx



## Importing Packages
We now import the packages we just installed.

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import pyTigerGraph as tg
import dgl
import networkx as nx



## Configuration

Here we define two training variables: the number of training epochs (a 2-layer GCN usually needs 30 or fewer) and the learning rate (0.01 seems to work well).


In [0]:
numEpochs = 25
learningRate = 0.01

## Creating the Graph Convolutional Neural Network
The block below defines the functions and classes for the GCN. The main ones to look at are GCNLayer, the individual building block that the GCN is made of, and the GCN class, which defines the structure of our neural network.
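
Before the DGL version, the computation inside one layer can be sketched in plain Python: every node sums the feature vectors of its in-neighbors, then a linear map is applied to the sum. This is only an illustration of the unnormalized scheme described in this tutorial; the function names, the toy graph, and the weights are all assumptions made up for the example.

```python
def aggregate_sum(edges, features):
    """Sum the features of each node's in-neighbors (no c_ij normalization)."""
    dim = len(next(iter(features.values())))
    summed = {node: [0.0] * dim for node in features}
    for src, dst in edges:
        for k, value in enumerate(features[src]):
            summed[dst][k] += value
    return summed

def linear(vector, weight, bias):
    """Apply y = W x + b to one feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]

# Toy path graph 0 - 1 - 2, with each undirected edge stored in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
features = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}

aggregated = aggregate_sum(edges, features)
# Hypothetical 3 -> 2 weights; a real layer learns these during training.
weight = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
bias = [0.0, 1.0]
layer_output = {node: linear(vec, weight, bias) for node, vec in aggregated.items()}
print(aggregated[1], layer_output[1])
```

Note that a node's own feature does not appear in its sum unless the graph contains a self-loop, which matches the message and reduce functions defined in the cell that follows.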

In [0]:
# Define the message and reduce function
# NOTE: We ignore the GCN's normalization constant c_ij for this tutorial.
def gcn_message(edges):
    # The argument is a batch of edges.
    # This computes a (batch of) message called 'msg' using the source node's feature 'h'.
    return {'msg' : edges.src['h']}

def gcn_reduce(nodes):
    # The argument is a batch of nodes.
    # This computes the new 'h' features by summing received 'msg' in each node's mailbox.
    return {'h' : torch.sum(nodes.mailbox['msg'], dim=1)}

# Define the GCNLayer module
class GCNLayer(nn.Module):
    def __init__(self, in_feats, out_feats):
        super(GCNLayer, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)

    def forward(self, g, inputs):
        # g is the graph and inputs is the input node features
        # first set the node features
        g.ndata['h'] = inputs
        # trigger message passing on all edges and aggregation at all nodes;
        # update_all replaces the separate g.send / g.recv calls, which newer DGL releases no longer support
        g.update_all(gcn_message, gcn_reduce)
        # get the result node features
        h = g.ndata.pop('h')
        # perform linear transformation
        return self.linear(h)

# Define a 2-layer GCN model
class GCN(nn.Module):
    def __init__(self, in_feats, hidden_size, num_classes):
        super(GCN, self).__init__()
        self.gcn1 = GCNLayer(in_feats, hidden_size)
        self.gcn2 = GCNLayer(hidden_size, num_classes)

    def forward(self, g, inputs):
        h = self.gcn1(g, inputs)
        h = torch.relu(h)
        h = self.gcn2(g, h)
        return h


## Creating Database Connection and Creating Edge List

This section opens a connection to the TigerGraph database and creates a list of tuples, each a directed edge in the form (from, to). Two dictionaries map each movie name to a unique numerical id and back, since the GCN needs numerical ids to process the graph.


#### **Assumption Alert:** We oversimplify the graph here. The query returns pairs of movies that share the same term (genre). In the real world, most people like a variety of genres, so their tastes are more nuanced than a graph whose edges only say that two movies share a genre. Better link-creation factors might be actors, directors, etc., but we don't have those in this dataset. Where TigerGraph comes in is the ease of data extraction, as no JOIN operations are needed to create these links between movies.
* Note: It is possible to create a GCN that has multiple types of vertices (known as a Relational Graph Convolutional Network), but it is more complex. A good way to get started is to simplify until you only have relations between the same type of thing.
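
The name-to-id bookkeeping described above can be sketched compactly without a global counter. This is a hedged alternative, not the notebook's own code: `build_edge_list` and the toy records are hypothetical, and only the `src` and `dest` keys come from the query result format.

```python
def build_edge_list(records):
    """Map names to consecutive ids and return (edges, name_to_id, id_to_name)."""
    name_to_id = {}
    id_to_name = {}

    def get_id(name):
        # Assign the next free id the first time a name is seen.
        if name not in name_to_id:
            new_id = len(name_to_id)
            name_to_id[name] = new_id
            id_to_name[new_id] = name
        return name_to_id[name]

    edges = [(get_id(record["src"]), get_id(record["dest"])) for record in records]
    return edges, name_to_id, id_to_name

# Hypothetical toy records in the same shape as the query's tuple records.
toy_records = [
    {"src": "Movie A", "dest": "Movie B"},
    {"src": "Movie C", "dest": "Movie B"},
]
toy_edges, toy_name_to_id, toy_id_to_name = build_edge_list(toy_records)
print(toy_edges)
```

Using the dictionary's length as the next id keeps the two dictionaries in sync without relying on module-level state.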


In [0]:
graph = tg.TigerGraphConnection(
    ipAddress="https://graphml.i.tgcloud.io", 
    graphname="Recommender", 
    apiToken="bekr9ls24mlh4kbkd7g28stq8vpj67vi") # Really not the best idea to have your API key out in the open, but for the sake of the demo, here it is

movieToNum = {} # translation dictionary for movie name to number (for dgl)
numToMovie = {} # translation dictionary for number to movie name
i = 0
def createEdgeList(result): # returns tuple of number version of edge
    global i
    if result["src"] in movieToNum:
        fromKey = movieToNum[result["src"]]
    else:
        movieToNum[result["src"]] = i
        numToMovie[i] = result["src"]
        fromKey = i
        i+=1
    if result["dest"] in movieToNum:
        toKey = movieToNum[result["dest"]]
    else:
        movieToNum[result["dest"]] = i
        numToMovie[i] = result["dest"]
        toKey = i
        i+=1
    return (fromKey, toKey)
    
edges = [createEdgeList(thing) for thing in graph.runInstalledQuery("movieLinks", {}, sizeLimit=128000000)["results"][0]["@@tupleRecords"]] # creates list of edges
print(len(edges))
print(edges[:5])

1046378
[(0, 1), (2, 1), (3, 1), (4, 1), (5, 1)]



## Initializing Graph

This section converts the list of edges into a graph that DGL can process in the GCN. It also converts our liked and disliked movies to their corresponding numerical ids that we will use later on.
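
Because nx.Graph is undirected, duplicate pairs and reversed pairs collapse into a single edge. The sketch below shows that effect in plain Python and then emits each edge in both directions, which is how an undirected edge is usually represented for message passing. That DGL's networkx conversion behaves this way is an assumption here, and `to_bidirectional` is a hypothetical helper.

```python
def to_bidirectional(edge_list):
    """Collapse duplicates and direction, then emit each edge both ways."""
    undirected = {tuple(sorted(edge)) for edge in edge_list}
    directed = []
    for u, v in sorted(undirected):
        directed.append((u, v))
        directed.append((v, u))
    return directed

# Toy input with a reversed duplicate and an exact duplicate (no self-loops).
toy_edges = [(0, 1), (1, 0), (2, 1), (2, 1)]
print(to_bidirectional(toy_edges))
```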


In [0]:
likedMovie = movieToNum["Sound of Music, The (1965)"]
dislikedMovie = movieToNum["Mary Shelley's Frankenstein (1994)"]

g = nx.Graph()
g.add_edges_from(edges)


G = dgl.DGLGraph(g)

## Adding Features to Graph
We one-hot encode the features of the vertices in the graph. Feature assignment can be done in many different ways; this is just the fastest and easiest, especially given the lack of attribute information in the dataset.

If you had a graph of documents for example, you could run doc2vec on those documents to create a feature vector and create the feature matrix by concatenating those together.

Another possibility is a graph of songs, artists, albums, etc., where you could use tempo, maximum volume, minimum volume, length, and other numerical descriptions of each song to create the feature matrix.
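
As a small illustration of one-hot encoding, the sketch below builds the same kind of identity matrix in plain Python, where row i is the feature vector of vertex i. The helper name is hypothetical. Note that the matrix has one column per vertex, so its size grows with the square of the vertex count.

```python
def one_hot_features(num_nodes):
    """Return an identity matrix as a list of rows: row i is node i's feature."""
    return [[1.0 if col == row else 0.0 for col in range(num_nodes)]
            for row in range(num_nodes)]

toy_features = one_hot_features(4)
print(toy_features[2])
```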

In [0]:
G.ndata["feat"] = torch.eye(G.number_of_nodes())

print(G.nodes[2].data['feat'])

tensor([[0., 0., 1.,  ..., 0., 0., 0.]])



## Creating Neural Network and Labelling Relevant Vertices

Here, we create the GCN. A two-layer GCN appears to work better than deeper networks, which is consistent with the fact that [this](https://arxiv.org/abs/1609.02907) paper only used a two-layer one. We also label the wanted and unwanted vertices and set up the optimizer.


In [0]:
net = GCN(G.number_of_nodes(), 15, 2) #Two layer GCN
inputs = G.ndata["feat"]
labeled_nodes = torch.tensor([likedMovie, dislikedMovie])  # only the liked movie and disliked movie are labelled
labels = torch.tensor([0, 1])  # their labels are different
optimizer = torch.optim.Adam(net.parameters(), lr=learningRate)

## Training Loop
Below is the training loop that actually trains the GCN. Unlike many traditional deep learning architectures, GCNs don't always need as much training or as large a data set, because they exploit the *structure* of the data as well as its features.
* Note: due to the randomized initial values of the weights in the neural network, sometimes models don't work very well, or their loss gets stuck at a relatively large number. If that happens, just stop and restart the training process and hope for better luck!

In [0]:
all_logits = []
for epoch in range(numEpochs):
    logits = net(G, inputs)
    # we save the logits for visualization later
    all_logits.append(logits.detach())
    logp = F.log_softmax(logits, 1)
    # we only compute loss for labeled nodes
    loss = F.nll_loss(logp[labeled_nodes], labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('Epoch %d | Loss: %6.3e' % (epoch, loss.item()))

Epoch 0 | Loss: 1.579e+00
Epoch 1 | Loss: 2.732e+03
Epoch 2 | Loss: 5.575e+02
Epoch 3 | Loss: 4.361e-01
Epoch 4 | Loss: 6.102e-01
Epoch 5 | Loss: 6.769e-01
Epoch 6 | Loss: 6.878e-01
Epoch 7 | Loss: 6.941e-01
Epoch 8 | Loss: 6.944e-01
Epoch 9 | Loss: 6.945e-01
Epoch 10 | Loss: 6.946e-01
Epoch 11 | Loss: 6.947e-01
Epoch 12 | Loss: 6.947e-01
Epoch 13 | Loss: 6.948e-01
Epoch 14 | Loss: 6.948e-01
Epoch 15 | Loss: 6.948e-01
Epoch 16 | Loss: 6.948e-01
Epoch 17 | Loss: 6.948e-01
Epoch 18 | Loss: 6.948e-01
Epoch 19 | Loss: 6.948e-01
Epoch 20 | Loss: 6.947e-01
Epoch 21 | Loss: 6.947e-01
Epoch 22 | Loss: 6.947e-01
Epoch 23 | Loss: 6.946e-01
Epoch 24 | Loss: 6.946e-01
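
The plateau near 6.95e-01 in the log above is close to ln 2 ≈ 0.693, which is the loss a two-class model gets when it assigns probability 0.5 to each class. In other words, a loss stuck there suggests the network is not separating the liked movie from the disliked one. The sketch below reproduces that arithmetic in plain Python; `log_softmax` and `nll_loss` here are simplified stand-ins written for illustration, not the PyTorch functions, and the logits are made-up values.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def nll_loss(rows_of_logits, labels):
    """Mean negative log-likelihood of the correct class."""
    losses = [-log_softmax(row)[label]
              for row, label in zip(rows_of_logits, labels)]
    return sum(losses) / len(losses)

# Both labeled nodes receive equal logits, so each class gets probability 0.5.
undecided = nll_loss([[-0.1, -0.1], [-0.1, -0.1]], [0, 1])
print(round(undecided, 4), round(math.log(2), 4))
```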



## Output Predictions

This section takes the logits from the final training epoch and prints the 10 movies with the highest score for the liked class (class 0).
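
The ranking step can also be written with enumerate and heapq.nlargest, which avoids sorting the whole list. This is a sketch on made-up logits, with a hypothetical helper name; it ranks by the first column, the score for the class labelled 0.

```python
from heapq import nlargest

def top_k_by_class(logits, k, class_index=0):
    """Return (node_id, score) pairs with the k largest scores for one class."""
    scored = [(node_id, row[class_index]) for node_id, row in enumerate(logits)]
    return nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical logits for four nodes: [score for class 0, score for class 1].
toy_logits = [[0.2, 1.0], [2.5, -1.0], [1.1, 0.3], [-0.4, 0.9]]
print(top_k_by_class(toy_logits, 2))
```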


In [0]:
predictions = list(all_logits[numEpochs-1])

predictionsWithIndex = []
a = 0
for movie in predictions:
    predictionsWithIndex.append([a, movie[0]])
    a+=1

predictionsWithIndex.sort(key=lambda x: x[1], reverse=True)

topResults = predictionsWithIndex[:10]


for movie in topResults:
    print("movie Id: "+str(movie[0]))
    print("movie Name: "+str(numToMovie[movie[0]]))
    print("movie Score: "+str(movie[1]))
    print("")

movie Id: 0
movie Name: Relic, The (1997)
movie Score: tensor(-0.1071)

movie Id: 1
movie Name: Nosferatu a Venezia (1986)
movie Score: tensor(-0.1071)

movie Id: 2
movie Name: Alien: Resurrection (1997)
movie Score: tensor(-0.1071)

movie Id: 3
movie Name: Amityville Horror, The (1979)
movie Score: tensor(-0.1071)

movie Id: 4
movie Name: Mary Shelley's Frankenstein (1994)
movie Score: tensor(-0.1071)

movie Id: 5
movie Name: Alien 3 (1992)
movie Score: tensor(-0.1071)

movie Id: 6
movie Name: Body Snatcher, The (1945)
movie Score: tensor(-0.1071)

movie Id: 7
movie Name: Cat People (1982)
movie Score: tensor(-0.1071)

movie Id: 8
movie Name: Amityville Curse, The (1990)
movie Score: tensor(-0.1071)

movie Id: 9
movie Name: Blood Beach (1981)
movie Score: tensor(-0.1071)



## Test Predictions with User Data

In [0]:
from heapq import nlargest, nsmallest

def userRating(item): # sort key: the rating the user gave this movie
    return item["attributes"]["userRating"]

ratings = graph.runInstalledQuery("userRatings", {"user":"13"})["results"][0]["S1"]
top3Movies = [thing["attributes"]["movieTitle"] for thing in nlargest(3, ratings, key=userRating)] # getting the 3 highest rated movies by the user
bottom3Movies = [thing["attributes"]["movieTitle"] for thing in nsmallest(3, ratings, key=userRating)] # getting the 3 lowest rated movies by the user

# The rating records are dicts, which are unhashable, so filter by title instead of using set difference
labeledTitles = set(top3Movies + bottom3Movies)
unclassifiedMovies = [thing for thing in ratings if thing["attributes"]["movieTitle"] not in labeledTitles]

negativeRating = [thing for thing in unclassifiedMovies if userRating(thing) < 0]
positiveRating = [thing for thing in unclassifiedMovies if userRating(thing) >= 0]
print(len(unclassifiedMovies))
print(len(negativeRating))
print(len(positiveRating))
print(top3Movies)
print(bottom3Movies)

In [0]:
net = GCN(G.number_of_nodes(), 20, 2) #Two layer GCN
inputs = G.ndata["feat"]
labeled_nodes = torch.tensor([movieToNum[top3Movies[0]], movieToNum[top3Movies[1]], movieToNum[top3Movies[2]], 
                              movieToNum[bottom3Movies[0]], movieToNum[bottom3Movies[1]], movieToNum[bottom3Movies[2]]])  # only the liked movies and the disliked movies are labelled
labels = torch.tensor([0, 0, 0, 1, 1, 1])  # their labels are different
optimizer = torch.optim.Adam(net.parameters(), lr=learningRate)

In [0]:
all_logits = []
for epoch in range(numEpochs):
    logits = net(G, inputs)
    # we save the logits for visualization later
    all_logits.append(logits.detach())
    logp = F.log_softmax(logits, 1)
    # we only compute loss for labeled nodes
    loss = F.nll_loss(logp[labeled_nodes], labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('Epoch %d | Loss: %6.3e' % (epoch, loss.item()))

Epoch 0 | Loss: 7.528e-01
Epoch 1 | Loss: 4.858e+02
Epoch 2 | Loss: 2.551e+02
Epoch 3 | Loss: 4.904e+01
Epoch 4 | Loss: 0.000e+00
Epoch 5 | Loss: 0.000e+00
Epoch 6 | Loss: 3.852e+01
Epoch 7 | Loss: 0.000e+00
Epoch 8 | Loss: 0.000e+00
Epoch 9 | Loss: 0.000e+00
Epoch 10 | Loss: 0.000e+00
Epoch 11 | Loss: 0.000e+00
Epoch 12 | Loss: 0.000e+00
Epoch 13 | Loss: 0.000e+00
Epoch 14 | Loss: 0.000e+00
Epoch 15 | Loss: 0.000e+00
Epoch 16 | Loss: 0.000e+00
Epoch 17 | Loss: 0.000e+00
Epoch 18 | Loss: 0.000e+00
Epoch 19 | Loss: 0.000e+00
Epoch 20 | Loss: 0.000e+00
Epoch 21 | Loss: 0.000e+00
Epoch 22 | Loss: 0.000e+00
Epoch 23 | Loss: 0.000e+00
Epoch 24 | Loss: 0.000e+00


In [0]:
predictions = list(all_logits[numEpochs-1])

predictionsWithIndex = []
a = 0
for movie in predictions:
    predictionsWithIndex.append([a, movie[0]])
    a+=1

predictionsWithIndex.sort(key=lambda x: x[1], reverse=True)

topResults = predictionsWithIndex[:10]


for movie in topResults:
    print("movie Id: "+str(movie[0]))
    print("movie Name: "+str(numToMovie[movie[0]]))
    print("movie Score: "+str(movie[1]))
    print("")

actualTop10 = [thing["attributes"]["movieTitle"] for thing in nlargest(10, ratings, key=lambda item: item["attributes"]["userRating"])]
print(actualTop10)

movie Id: 211
movie Name: Pollyanna (1960)
movie Score: tensor(720.8066)

movie Id: 270
movie Name: Babe (1995)
movie Score: tensor(720.8066)

movie Id: 158
movie Name: This Is Spinal Tap (1984)
movie Score: tensor(677.1639)

movie Id: 530
movie Name: Stand by Me (1986)
movie Score: tensor(655.3315)

movie Id: 971
movie Name: Ma vie en rose (My Life in Pink) (1997)
movie Score: tensor(653.3358)

movie Id: 983
movie Name: Sirens (1994)
movie Score: tensor(653.3358)

movie Id: 996
movie Name: Reluctant Debutante, The (1958)
movie Score: tensor(653.3358)

movie Id: 1000
movie Name: Last Summer in the Hamptons (1995)
movie Score: tensor(653.3358)

movie Id: 1002
movie Name: Beat the Devil (1954)
movie Score: tensor(653.3358)

movie Id: 1005
movie Name: Stefano Quantestorie (1993)
movie Score: tensor(653.3358)

['Steel (1997)', 'Phantom, The (1996)', 'Baby-Sitters Club, The (1995)', 'Shadow, The (1994)', 'Timecop (1994)', 'Mulholland Falls (1996)', 'Life Less Ordinary, A (1997)', 'Wyatt Ear
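
The cell above compares the predicted list and actualTop10 by eye. One way to turn that comparison into a number is precision at k: the fraction of the first k predicted titles that also appear in the user's highly rated list. The helper and the toy titles below are hypothetical and are not results from this notebook.

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the first k predicted titles that appear in relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    hits = sum(1 for title in predicted[:k] if title in relevant_set)
    return hits / k

# Made-up titles, only to show the calculation.
predicted_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
highly_rated = ["Movie B", "Movie D", "Movie E"]
print(precision_at_k(predicted_titles, highly_rated, 4))
```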

## Credits:
<p><img alt="Picture of Parker Erickson" height="150px" src="https://avatars1.githubusercontent.com/u/9616171?s=460&v=4" align="right" hspace="20px" vspace="20px"></p>

Demo/tutorial written by Parker Erickson, a student at the University of Minnesota pursuing a B.S. in Computer Science. His interests include graph databases, machine learning, travelling, playing the saxophone, and watching Minnesota Twins baseball. Feel free to reach out! Find him on:

* LinkedIn: [https://www.linkedin.com/in/parker-erickson/](https://www.linkedin.com/in/parker-erickson/)
* GitHub: [https://github.com/parkererickson](https://github.com/parkererickson)
* Medium: [https://medium.com/@parker.erickson](https://medium.com/@parker.erickson)
* Email: parker.erickson30@gmail.com
----
GCN Resources:
* DGL Documentation: [https://docs.dgl.ai/](https://docs.dgl.ai/)
* GCN paper by Kipf and Welling: [https://arxiv.org/abs/1609.02907](https://arxiv.org/abs/1609.02907)
* R-GCN paper: [https://arxiv.org/abs/1703.06103](https://arxiv.org/abs/1703.06103)
---- 
Notebook adapted from: [https://docs.dgl.ai/en/latest/tutorials/basics/1_first.html](https://docs.dgl.ai/en/latest/tutorials/basics/1_first.html)