# Problem setting

In this tutorial, we demonstrate how graph neural networks (GNNs) can be used for recommendation. We focus on an item-based recommendation model, which recommends items that are similar to the ones a user has already purchased. We demonstrate the model on the MovieLens dataset.

# Get started

DGL can be used with different deep learning frameworks. Currently, DGL supports PyTorch and MXNet. Here, we show how DGL works with PyTorch.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

When we load DGL, we need to set the DGL backend to one of the deep learning frameworks. Because this tutorial develops models in PyTorch, we set the DGL backend to PyTorch.

In [None]:
import dgl
from dgl import DGLGraph

# Load PyTorch as backend
dgl.load_backend('pytorch')

Load the rest of necessary libraries.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from scipy import sparse as spsp

## Load data from pickle

We have prepared three datasets for evaluation: MovieLens, Book-Crossing, and Yelp. Pass a dataset name to `load_data` to load the corresponding dataset: `movielens` for MovieLens, `bx` for Book-Crossing, and `yelp` for Yelp.

When a dataset is loaded, `load_data` returns 6 values:
* `user_item_spm` is a Scipy sparse matrix that stores the user-item interaction in the training set.
* `features` is a NumPy array that stores item features.
* `users_valid` and `items_valid` are the user-item pairs of the validation dataset.
* `users_test` and `items_test` are the user-item pairs of the testing dataset.
* `neg_valid` contains the negative items for each user in the validation set.
* `neg_test` contains the negative items for each user in the testing set.

In [None]:
from data_loader import load_data

name = 'bx'

user_item_spm, features, (users_valid, items_valid), (users_test, items_test), neg_valid, neg_test = load_data(name)
num_users = user_item_spm.shape[0]
num_items = user_item_spm.shape[1]
print('#users:', num_users)
print('#items:', num_items)
print('#interactions:', user_item_spm.nnz)

In [None]:
train = user_item_spm.tocsr()
num_interacts = user_item_spm.nnz
item_deg = user_item_spm.transpose().dot(np.ones(num_users))
coo_spm = user_item_spm.tocoo()
mask = np.ones((num_users, num_items)) - spsp.coo_matrix((np.ones((num_interacts)),
                                                          (coo_spm.row, coo_spm.col)),
                                                         shape=(num_users, num_items))
prob_mat = np.multiply(np.tile(item_deg, num_users).reshape(num_users, num_items), mask)
prob_mat = prob_mat / prob_mat.sum(1)
prob_mat = spsp.csr_matrix(prob_mat)
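
The cell above builds, for every user, a sampling distribution over the items the user has not interacted with, weighted by item degree (popularity). The following is a minimal sketch of the same idea on a hypothetical 3-user, 4-item toy matrix; `toy`, `toy_deg`, `toy_prob`, and `negs` are illustrative names that do not appear elsewhere in this notebook.

```python
import numpy as np

# Hypothetical toy interaction matrix: rows are users, columns are items.
toy = np.array([[1, 0, 0, 0],
                [0, 1, 1, 0],
                [0, 1, 0, 1]], dtype=np.float64)

# Item degree = number of users who interacted with each item.
toy_deg = toy.sum(axis=0)

# Zero out the items each user has already seen, then normalize every row
# so that it becomes a probability distribution over unseen items.
toy_prob = toy_deg[None, :] * (1.0 - toy)
toy_prob = toy_prob / toy_prob.sum(axis=1, keepdims=True)

# Sample 5 negative items for user 0; item 0 can never be drawn.
rng = np.random.default_rng(0)
negs = rng.choice(toy.shape[1], size=5, p=toy_prob[0])
```

Note that the dense mask in the cell above needs `num_users * num_items` entries in memory, which may be prohibitive for large datasets.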

We can try adding more features to items. For example, add the popularity of items as an item feature and embeddings from SVD. It's not mandatory, but it's a good thing to try out.

In [None]:
popularity = user_item_spm.transpose().dot(np.ones(shape=(num_users)))
# We need to rescale the values
popularity = torch.tensor(popularity / np.max(popularity), dtype=torch.float32).unsqueeze(1)

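# Import the submodule explicitly; some SciPy versions do not expose it through `spsp.linalg`.
import scipy.sparse.linalg
u, s, vt = spsp.linalg.svds(user_item_spm)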
v = torch.tensor(vt.transpose(), dtype=torch.float32)
v = v * torch.tensor(np.sqrt(s).transpose(), dtype=torch.float32)

in_feats = features.shape[1]
print('#feats:', in_feats)
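
The `popularity` and `v` tensors computed above are not attached to the graph anywhere in this notebook. One possible way to use them, sketched here with NumPy on hypothetical toy arrays (all names ending in `_toy` are illustrative), is to concatenate them column-wise with the original item features.

```python
import numpy as np

num_items_toy = 4

# Hypothetical stand-ins for the arrays computed in the notebook.
features_toy = np.eye(num_items_toy, dtype=np.float32)       # (items, 4) original features
popularity_toy = np.array([[0.2], [1.0], [0.5], [0.1]],
                          dtype=np.float32)                  # (items, 1) rescaled popularity
svd_toy = np.ones((num_items_toy, 6), dtype=np.float32)      # (items, 6); svds uses k=6 by default

# Stack all feature blocks side by side, keeping one row per item.
combined = np.hstack([features_toy, popularity_toy, svd_toy])
```

The combined array could then replace `features` when the node data is assigned to the graph.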

Find the items watched/read/used by users in the testing set and their popularity. This is mainly to help understand how well the model will perform on items of different popularity.

In [None]:
item_deg = user_item_spm.transpose().dot(np.ones((num_users)))
test_deg = np.zeros((num_users))
for i in range(num_users):
    item = int(items_test[i])
    test_deg[i] = item_deg[item]
test_deg_dict = {}
for i in range(1, 10):
    test_deg_dict[i] = np.nonzero(test_deg == i)[0]
for i in range(1, 10):
    test_deg_dict[i*10] = np.nonzero(np.logical_and(i*10 <= test_deg, test_deg < (i+1)*10))[0]
test_deg_dict[100] = np.nonzero(test_deg >= 100)[0]
tmp = 0
for key, deg in test_deg_dict.items():
    print(key, len(deg))
    tmp += len(deg)
print(num_users, tmp)

# The recommendation model

At a high level, the model first learns item embeddings from the user-item interaction dataset and then uses the item embeddings to recommend items similar to the ones a user has purchased. To learn item embeddings, we first construct an item similarity graph and then train a GNN on that graph.

There are many ways of constructing the item similarity graph. Here we use the [SLIM model](https://dl.acm.org/citation.cfm?id=2118303) to learn item similarity and use the learned result to construct the item graph. The resulting graph will have an edge between two items if they are similar and the edge has a weight that represents the similarity score.

After the item similarity graph is constructed, we run a GNN model on it and use the vertex connectivity as the training signal to train the GNN model. The GNN training procedure is very similar to the link prediction task in [the previous section](https://github.com/zheng-da/DGL_devday_tutorial/blob/master/BasicTasks_pytorch.ipynb).

## Construct the item graph with SLIM
SLIM is an item-based recommendation model. When training SLIM on a user-item dataset, it learns an item similarity graph. This similarity graph is the item graph we construct for the GNN model.

Please follow the instructions in the [SLIM GitHub repo](https://github.com/KarypisLab/SLIM) to install SLIM.

When using SLIM to generate an item similarity graph, there are two hyperparameters we can tune. `l1r` is the coefficient for the L1 regularization and `l2r` is the coefficient for the L2 regularization. Increasing `l1r` generates a sparser similarity graph, and increasing `l2r` leads to a denser similarity graph.
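
For reference, the objective from the SLIM paper linked above can be written as follows, where $A$ is the user-item matrix and $W$ is the item-item similarity matrix being learned. We assume here that `l1r` corresponds to $\lambda$ and `l2r` to $\beta$; check the SLIM repository for the exact parameterization.

$$\min_{W} \; \frac{1}{2}\lVert A - AW \rVert_F^2 + \frac{\beta}{2}\lVert W \rVert_F^2 + \lambda \lVert W \rVert_1 \quad \text{subject to } W \ge 0,\ \mathrm{diag}(W) = 0$$

The L1 term drives entries of $W$ to exactly zero, which is why a larger L1 coefficient produces a sparser graph.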

In [None]:
from graph_construct import create_SLIM_graph
item_spm = create_SLIM_graph(user_item_spm, l1r=2, l2r=1, test=True,
                             test_set=(users_test, items_test, neg_test))
use_edge_similarity = True

In [None]:
deg = item_spm.dot(np.ones((num_items)))
print(np.sum(deg == 0))
print(len(deg))
print(item_spm.sum(0))

## Construct the co-occurrence graph
Alternatively, we can simply construct a co-occurrence graph. That is, if two items are used by the same user, we draw an edge between these two items.

This method of graph construction also has two hyperparameters to tune. `downsample_factor` controls how much we downsample user-item pairs based on the frequency of items. A larger `downsample_factor` leads to more downsampling. `topk` controls how many items an item connects to. If it is `None`, an item connects to all items that co-occur with it; otherwise, an item connects only to the items it co-occurs with most frequently.

In [None]:
from graph_construct import create_cooccur_graph
item_spm = create_cooccur_graph(user_item_spm, downsample_factor=1e-5, topk=50)
use_edge_similarity = False
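
The co-occurrence idea described above can be sketched with plain NumPy. This is an illustrative approximation only: it ignores the frequency-based downsampling and is not the implementation in `graph_construct`. The matrix `A` and the helper `keep_topk` are hypothetical.

```python
import numpy as np

# Hypothetical toy interaction matrix: rows are users, columns are items.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])

# Entry (i, j) counts the users who interacted with both item i and item j.
cooccur = A.T @ A
np.fill_diagonal(cooccur, 0)   # drop self-loops

def keep_topk(sim, k):
    """Keep, for each row, at most the k largest positive entries."""
    out = np.zeros_like(sim)
    for i, row in enumerate(sim):
        top = np.argsort(-row, kind="stable")[:k]
        top = top[row[top] > 0]    # never add edges with zero co-occurrence
        out[i, top] = row[top]
    return out

graph_top1 = keep_topk(cooccur, 1)
```

Note that keeping the top-k entries per row generally makes the resulting graph asymmetric, as `graph_top1` shows for items 1 and 2.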

## Construct the cosine-similarity graph
We can also use cosine similarity to build a graph. We compute the cosine similarity of the neighborhoods of every pair of items. This is quite similar to the co-occurrence graph, except that we use cosine similarity instead of the co-occurrence count to measure the similarity of two items.

In this case, there is one hyperparameter, `topk`. If it is specified, an item connects to the top K most similar items in terms of the cosine similarity of their neighborhoods.

In [None]:
from graph_construct import create_cosine_graph
item_spm = create_cosine_graph(user_item_spm, topk=50)
use_edge_similarity = False
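
As a rough illustration of the cosine-similarity construction (a sketch on a hypothetical dense toy matrix, not the code in `graph_construct`), each item is represented by its column of the user-item matrix, and the similarity of two items is the cosine of the angle between their columns.

```python
import numpy as np

# Hypothetical toy interaction matrix: rows are users, columns are items.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]], dtype=np.float64)

# Normalize every item column to unit length; the Gram matrix of the
# normalized columns is then the pairwise cosine similarity.
norms = np.linalg.norm(A, axis=0)
norms[norms == 0] = 1.0          # avoid dividing by zero for items with no users
A_unit = A / norms
cosine = A_unit.T @ A_unit
np.fill_diagonal(cosine, 0.0)    # drop self-similarity
```

Items 0 and 2 share no users, so their similarity is zero and no edge would be created between them.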

Once we have constructed the graph, we load it into a DGL graph.

In [None]:
g = dgl.DGLGraph(item_spm, readonly=True)
g.edata['similarity'] = torch.tensor(item_spm.data, dtype=torch.float32)
g.ndata['feats'] = torch.tensor(features)
#g.ndata['id'] = torch.arange(num_items, dtype=torch.int64)
print('#nodes:', g.number_of_nodes())
print('#edges:', g.number_of_edges())

## GNN models

We run a GNN on the item graph to compute item embeddings. In this tutorial, we use a customized [GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) model to compute node embeddings. The original GraphSAGE performs the following computation on every node $v$ in the graph:

$$h_{N(v)}^{(l)} \gets \mathrm{AGGREGATE}_l(\{h_u^{(l-1)}, \forall u \in N(v)\})$$
$$h_v^{(l)} \gets \sigma(W^{(l)} \cdot \mathrm{CONCAT}(h_v^{(l-1)}, h_{N(v)}^{(l)})),$$

where $N(v)$ is the neighborhood of node $v$ and $l$ is the layer index.

The original GraphSAGE model treats each neighbor equally. However, the SLIM model learns the item similarity from the user-item interactions, and the GNN model should take that similarity into account. We therefore customize the GraphSAGE model as follows: instead of aggregating all neighbors equally, we aggregate neighbor embeddings rescaled by the similarity on the edges. The aggregation step becomes:

$$h_{N(v)}^{(l)} \gets \sum_{u \in N(v)} h_u^{(l-1)} \cdot s_{uv},$$

where $s_{uv}$ is the similarity score between two vertices $u$ and $v$.
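
The similarity-weighted aggregation can be illustrated on a tiny graph. This is a minimal NumPy sketch with a hypothetical edge list and embeddings; it is not the `SAGEConv` implementation used later.

```python
import numpy as np

# Hypothetical toy graph with 3 nodes; each edge is
# (source u, destination v, similarity s_uv).
edges = [(0, 2, 0.5), (1, 2, 2.0), (2, 0, 1.0)]

# Previous-layer embeddings h^{(l-1)}, one row per node.
h_prev = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])

# h_neigh[v] accumulates s_uv * h_prev[u] over all incoming edges (u, v).
h_neigh = np.zeros_like(h_prev)
for u, v, s_uv in edges:
    h_neigh[v] += s_uv * h_prev[u]
```

Node 1 has no incoming edges in this toy graph, so its aggregated neighborhood embedding stays at zero.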

The GNN model has multiple layers. In each layer, a vertex accesses its direct neighbors. When we stack $k$ layers in a model, a node $v$ accesses neighbors within $k$ hops. The output of the GNN model is a set of node embeddings, each of which represents a node together with the information in its $k$-hop neighborhood.

<img src="https://github.com/zheng-da/DGL_devday_tutorial/raw/master/GNN.png" alt="drawing" width="600"/>

We implement the computation in each layer of the customized GraphSage model in `SAGEConv` and implement the multi-layer model in `GraphSAGEModel`.

In [None]:
if use_edge_similarity:
    from sageconv import SAGEConv
else:
    from dgl.nn.pytorch.conv import SAGEConv

class GraphSAGEModel(nn.Module):
    def __init__(self,
                 in_feats,
                 n_hidden,
                 out_dim,
                 n_layers,
                 activation,
                 dropout,
                 aggregator_type):
        super(GraphSAGEModel, self).__init__()
        self.layers = nn.ModuleList()
        if n_layers == 1:
            self.layers.append(SAGEConv(in_feats, n_hidden, aggregator_type,
                                        feat_drop=dropout, activation=None))
        elif n_layers > 1:
            # input layer
            self.layers.append(SAGEConv(in_feats, n_hidden, aggregator_type,
                                        feat_drop=dropout, activation=activation))
            # hidden layer
            for i in range(n_layers - 2):
                self.layers.append(SAGEConv(n_hidden, n_hidden, aggregator_type,
                                            feat_drop=dropout, activation=activation))
            # output layer
            self.layers.append(SAGEConv(n_hidden, out_dim, aggregator_type,
                                        feat_drop=dropout, activation=None))

    def forward(self, g, features):
        h = features
        for layer in self.layers:
            if use_edge_similarity:
                h = layer(g, h, g.edata['similarity'])
            else:
                h = layer(g, h)
            #h = tmp + prev_h
            #prev_h = h
        return h

## Train Item Embeddings

We train the item embeddings with the edges in the item graph as the training signal. This step is very similar to the link prediction task in the [basic applications](https://github.com/zheng-da/DGL_devday_tutorial/blob/master/BasicTasks_pytorch.ipynb).

The MovieLens dataset has sparse features (both genre and title are stored as multi-hot encodings), and these sparse features have many dimensions. To run the GNN on the item features, we first create an encoding layer that projects the sparse features to a lower dimension.

In [None]:
def mix_embeddings(h, ndata, emb, proj):
    '''Combine node-specific trainable embedding ``h`` with categorical inputs
    (projected by ``emb``) and numeric inputs (projected by ``proj``).
    '''
    e = []
    for key, value in ndata.items():
        if value.dtype == torch.int64:
            e.append(emb[key](value))
        elif value.dtype == torch.float32:
            e.append(proj[key](value))
    if len(e) == 0:
        return h
    else:
        return h + torch.stack(e, 0).sum(0)
    
class EncodeLayer(nn.Module):
    def __init__(self, ndata, num_hidden, device):
        super(EncodeLayer, self).__init__()
        self.proj = nn.ModuleDict()
        self.emb = nn.ModuleDict()
        for key in ndata.keys():
            vals = ndata[key]
            if vals.dtype == torch.float32:
                self.proj[key] = nn.Linear(ndata[key].shape[1], num_hidden)
                #self.proj[key] = nn.Sequential(
                #                    nn.Linear(ndata[key].shape[1], num_hidden),
                #                    nn.LeakyReLU(),
                #                    )
            elif vals.dtype == torch.int64:
                self.emb[key] = nn.Embedding(
                            vals.max().item() + 1,
                            num_hidden,
                            padding_idx=0)
                
    def forward(self, ndata):
        return mix_embeddings(0, ndata, self.emb, self.proj)

In [None]:
class FISMrating(nn.Module):
    r"""
    PinSAGE + FISM for item-based recommender systems
    The formulation of FISM goes as
    .. math::
       r_{ui} = b_u + b_i + \left(n_u^+\right)^{-\alpha}
       \sum_{j \in R_u^+} p_j q_i^\top
    In FISM, both :math:`p_j` and :math:`q_i` are trainable parameters.  Here
    we replace them as outputs from two PinSAGE models ``P`` and
    ``Q``.
    """
    def __init__(self, P, Q, num_users, num_movies, alpha=0):
        super().__init__()

        self.P = P
        self.Q = Q
        self.b_u = nn.Parameter(torch.zeros(num_users))
        self.b_i = nn.Parameter(torch.zeros(num_movies))
        self.alpha = alpha

    
    def forward(self, I, U, I_neg, I_U, N_U, test):
        '''
        I: 1D LongTensor
        U: 1D LongTensor
        I_neg: 2D LongTensor (batch_size, n_negs)
        '''
        batch_size = I.shape[0]
        device = I.device
        I_U = I_U.to(device)
        # number of interacted items
        N_U = N_U.to(device)
        U_idx = torch.arange(U.shape[0], device=device).repeat_interleave(N_U)

        q = self.Q(I)
        p = self.P(I_U)
        # If this is training, we need to subtract the embedding of the self node from the context embedding
        if not test:
            p_self = self.P(I)
        p_sum = torch.zeros_like(q)
        p_sum = p_sum.scatter_add(0, U_idx[:, None].expand_as(p), p)    # batch_size, n_dims
        if test:
            p_ctx = p_sum
            pq = (p_ctx * q).sum(1) / (N_U.float() ** self.alpha)
        else:
            p_ctx = p_sum - p_self
            pq = (p_ctx * q).sum(1) / ((N_U.float() - 1).clamp(min=1) ** self.alpha)
        r = self.b_u[U] + self.b_i[I] + pq

        if I_neg is not None:
            n_negs = I_neg.shape[1]
            I_neg_flat = I_neg.view(-1)
            q_neg = self.Q(I_neg_flat)
            q_neg = q_neg.view(batch_size, n_negs, -1)  # batch_size, n_negs, n_dims
            if test:
                pq_neg = (p_ctx.unsqueeze(1) * q_neg).sum(2) / (N_U.float().unsqueeze(1) ** self.alpha)
            else:
                pq_neg = (p_ctx.unsqueeze(1) * q_neg).sum(2) / ((N_U.float() - 1).clamp(min=1).unsqueeze(1) ** self.alpha)
            r_neg = self.b_u[U].unsqueeze(1) + self.b_i[I_neg] + pq_neg
            return r, r_neg
        else:
            return r
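
To make the FISM formula in the docstring above concrete, here is a small NumPy sketch that scores a single user-item pair. The function `fism_score` and the toy values are hypothetical, and the sketch leaves out the training-time step that removes the target item from the user's context.

```python
import numpy as np

def fism_score(b_u, b_i, P_ctx, q_i, alpha):
    """FISM rating for one user-item pair.

    P_ctx: (n, d) array of embeddings p_j for the n items the user interacted with.
    q_i:   (d,) embedding of the candidate item.
    """
    n = P_ctx.shape[0]
    return b_u + b_i + n ** (-alpha) * (P_ctx.sum(axis=0) @ q_i)

# Four context items with 2-dimensional embeddings.
P_ctx = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [1.0, 1.0],
                  [0.0, 1.0]])
q_i = np.array([1.0, 0.5])
score = fism_score(b_u=0.1, b_i=0.2, P_ctx=P_ctx, q_i=q_i, alpha=0.5)
```

With `alpha=0` the context embeddings are simply summed, while `alpha=1` averages them over the number of interacted items.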

We use the FISM model to train.

In [None]:
beta = 0
gamma = 0

class FISM(nn.Module):
    def __init__(self, user_item_spm, gconv_p, gconv_q, g, num_hidden, device):
        super(FISM, self).__init__()
        num_users = user_item_spm.shape[0]
        num_movies = user_item_spm.shape[1]
        self.encode_p = EncodeLayer(g.ndata, num_hidden, device)
        self.encode_q = EncodeLayer(g.ndata, num_hidden, device)
        self.gconv_p = gconv_p
        self.gconv_q = gconv_q
        P = lambda I: self.gconv_p(g, self.encode_p(g.ndata))[I]
        Q = lambda I: self.gconv_q(g, self.encode_q(g.ndata))[I]
        self.fism_rating = FISMrating(P, Q, num_users, num_movies, 1)

    def est_rating(self, I, U, I_neg, I_U, N_U):
        r, r_neg = self.fism_rating(I, U, I_neg, I_U, N_U, True)
        neg_sample_size = int(len(r_neg) / len(r))
        return torch.unsqueeze(r, 1), r_neg.reshape((-1, neg_sample_size))

    def loss(self, r_ui, neg_r_ui):
        diff = 1 - (r_ui - neg_r_ui)
        return torch.sum(torch.mul(diff, diff)/2)# \
        #    + beta/2 * torch.sum(torch.mul(P, P) + torch.mul(Q, Q)) \
        #    + gamma/2 * (torch.sum(torch.mul(self.fism_rating.b_u, self.fism_rating.b_u)) \
        #                 + torch.sum(torch.mul(self.fism_rating.b_i, self.fism_rating.b_i)))

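    def forward(self, I, U, I_neg, I_U, N_U):
        r, r_neg = self.fism_rating(I, U, I_neg, I_U, N_U, False)
        # r has shape (batch_size,) and r_neg has shape (batch_size, n_negs).
        # Add a trailing dimension to r so that each positive score is compared
        # only with the negatives sampled for the same user-item pair.
        return self.loss(r.unsqueeze(1), r_neg)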
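
The loss penalizes any deviation of the gap between a positive score and a negative score from 1. Below is a minimal NumPy sketch, assuming each positive is compared only with the negatives sampled for the same user-item pair; `pairwise_squared_loss` and the toy scores are hypothetical.

```python
import numpy as np

def pairwise_squared_loss(r_pos, r_neg):
    """Sum of (1 - (r_pos - r_neg))^2 / 2 over all positive/negative pairs.

    r_pos: (batch,) scores of observed items.
    r_neg: (batch, n_negs) scores of the negatives sampled for the same rows.
    """
    # Broadcast each positive score across its own row of negatives.
    diff = 1.0 - (r_pos[:, None] - r_neg)
    return 0.5 * np.sum(diff ** 2)

r_pos = np.array([2.0, 0.5])
r_neg = np.array([[1.0, 0.0],
                  [0.5, 1.5]])
loss_value = pairwise_squared_loss(r_pos, r_neg)
```

Only the first pair, whose gap is exactly 1, contributes nothing to the loss; a gap larger than 1 is penalized as well because the difference is squared.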

In [None]:
class EdgeSampler:
    def __init__(self, user_item_spm, batch_size, neg_sample_size):
        edge_ids = np.random.permutation(user_item_spm.nnz)
        self.batches = np.split(edge_ids, np.arange(batch_size, len(edge_ids), batch_size))
        self.idx = 0
        user_item_spm = user_item_spm.tocoo()
        self.users = user_item_spm.row
        self.movies = user_item_spm.col
        self.user_item_spm = user_item_spm.tocsr()
        self.num_movies = user_item_spm.shape[1]
        self.num_users = user_item_spm.shape[0]
        self.neg_sample_size = neg_sample_size
        
    def __next__(self):
        if self.idx == len(self.batches):
            raise StopIteration
        batch = self.batches[self.idx]
        self.idx += 1
        I = self.movies[batch]
        U = self.users[batch]
        neighbors = self.user_item_spm[U]
        I_neg = np.zeros((len(batch), self.neg_sample_size))
        for i, u in enumerate(U):
            neg_set = prob_mat[u]
            I_neg[i] = np.random.choice(neg_set.indices, self.neg_sample_size, p=neg_set.data)
        I = torch.LongTensor(I).to(device)
        U = torch.LongTensor(U).to(device)
        I_neg = torch.LongTensor(I_neg).to(device)
        I_U = torch.LongTensor(neighbors.indices).to(device)
        N_U = torch.LongTensor(neighbors.indptr[1:] - neighbors.indptr[:-1]).to(device)
        return I, U, I_neg, I_U, N_U
    
    def __iter__(self):
        return self

We evaluate the performance of the trained item embeddings in the item-based recommendation task. We use the last item that a user purchased to represent the user and compute the similarity between the last item and a list of items (an item the user will purchase and a set of randomly sampled items). We calculate the ranking of the item that will be purchased among the list of items.

In [None]:
def RecEval(model, user_item_spm, k, users_eval, items_eval, neg_eval):
    model.eval()
    with torch.no_grad():
        neg_items_eval = neg_eval[users_eval]
        neighbors = user_item_spm.tocsr()[users_eval]
        I_U = torch.LongTensor(neighbors.indices)
        N_U = torch.LongTensor(neighbors.indptr[1:] - neighbors.indptr[:-1])
        r, neg_r = model.est_rating(torch.LongTensor(items_eval).to(device),
                                    torch.LongTensor(users_eval).to(device),
                                    torch.LongTensor(neg_items_eval).to(device),
                                    I_U.to(device),
                                    N_U.to(device))
        neg_sample_size = int(len(neg_r) / len(r))
        neg_r = neg_r.reshape((-1, neg_sample_size))
        hits = (torch.sum(neg_r >= r, 1) <= k).cpu().numpy()
        return np.mean(hits)
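
The hit-rate computation in `RecEval` can be illustrated on toy scores. This sketch mirrors the `<= k` comparison used above, so a user counts as a hit when at most `k` negatives score at least as high as the held-out item; `hits_at_k` and the toy arrays are hypothetical.

```python
import numpy as np

def hits_at_k(r_pos, r_neg, k):
    """Fraction of users whose positive item is matched or outscored by at most k negatives.

    r_pos: (batch, 1) scores of the held-out items.
    r_neg: (batch, n_negs) scores of the negative items.
    """
    n_better = np.sum(r_neg >= r_pos, axis=1)
    return np.mean(n_better <= k)

r_pos = np.array([[0.9], [0.2]])
r_neg = np.array([[0.1, 0.5, 0.95],
                  [0.3, 0.4, 0.1]])
hits1 = hits_at_k(r_pos, r_neg, k=1)
```

A stricter definition of HITS@k would use `< k`, which requires the positive item to rank within the top `k` positions.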

In [None]:
def RecEvalAll(model, user_item_spm, k, users_eval, items_eval):
    model.eval()
    all_users = []
    all_items = []
    user_item_spm = user_item_spm.tocoo()
    all_users.append(user_item_spm.row)
    all_items.append(user_item_spm.col)
    all_users.append(users_valid)
    all_items.append(items_valid)
    all_users.append(users_test)
    all_items.append(items_test)
    all_users = np.concatenate(all_users).astype(np.int64)
    all_items = np.concatenate(all_items).astype(np.int64)
    all_user_item_spm = spsp.coo_matrix((np.ones((len(all_users))), (all_users, all_items)))
    all_user_item_spm = all_user_item_spm.tocsr()
    
    batch_size = 1024
    batches = np.split(users_eval, np.arange(batch_size, len(users_eval), batch_size))
    with torch.no_grad():
        hits_list = []
        for users in batches:
            items = items_eval[users]
            neg_items_eval = np.tile(np.arange(num_items), len(users)).reshape(len(users), num_items)
            neighbors = all_user_item_spm[users]
            I_U = torch.LongTensor(neighbors.indices)
            N_U = torch.LongTensor(neighbors.indptr[1:] - neighbors.indptr[:-1])
            r, neg_r = model.est_rating(torch.LongTensor(items).to(device),
                                        torch.LongTensor(users).to(device),
                                        torch.LongTensor(neg_items_eval).to(device),
                                        I_U.to(device),
                                        N_U.to(device))
            neg_sample_size = num_items
            # Here neg_r includes the scores on the positive edges. let's make the scores
            # on the positive edges very small. This is equivalent to exclude positive edges
            # from negative edges.
            neg_r = neg_r.reshape((-1, neg_sample_size)).cpu().numpy() - neighbors * 10
            hits = (np.sum(neg_r >= r.cpu().numpy(), 1) <= k)
            hits_list.append(hits)
        return np.mean(np.concatenate(hits_list))

Now we put everything in the training loop.

In [None]:
import time

if torch.cuda.is_available():
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')


#Model hyperparameters
n_hidden = 16
n_layers = 0
dropout = 0.5
aggregator_type = 'sum' if use_edge_similarity else 'gcn'

# create GraphSAGE model
gconv_p = GraphSAGEModel(n_hidden,
                         n_hidden,
                         n_hidden,
                         n_layers,
                         F.relu,
                         dropout,
                         aggregator_type)

gconv_q = GraphSAGEModel(n_hidden,
                         n_hidden,
                         n_hidden,
                         n_layers,
                         F.relu,
                         dropout,
                         aggregator_type)

model = FISM(user_item_spm, gconv_p, gconv_q, g, n_hidden, device).to(device)
g.to(device)

# Training hyperparameters
weight_decay = 1e-3
n_epochs = 100
lr = 1e-3
neg_sample_size = 20

# use optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

batch_size = 4096
print('#edges:', user_item_spm.nnz)
print('#batch/epoch:', user_item_spm.nnz/batch_size)

# training loop
dur = []
prev_acc = 0
for epoch in range(n_epochs):
    model.train()
    losses = []
    start = time.time()
    negs = []
    for I, U, I_neg, I_U, N_U in EdgeSampler(user_item_spm, batch_size, neg_sample_size):
        loss = model(I, U, I_neg, I_U, N_U)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.detach().item())
    train_time = time.time() - start
    
    start = time.time()
    hits10_sub = RecEval(model, user_item_spm, 10, users_valid, items_valid, neg_valid)
    hits_all = RecEvalAll(model, user_item_spm, 20, users_valid, items_valid)
    eval_time = time.time() - start
    print("Epoch {:05d} | train {:.4f} | eval {:.4f} | Loss {:.4f} | HITS@10 sub:{:.4f} | HITS@20 all:{:.4f}".format(
        epoch, train_time, eval_time, np.mean(losses), hits10_sub, hits_all))
    #if prev_acc > hits_all:
    #    break
    prev_acc = hits_all

print()
# Evaluate the trained model on the test set.
hits_all = RecEvalAll(model, user_item_spm, 20, users_test, items_test)
print("Test HITS@20 all:{:.4f}".format(np.mean(hits_all)))