# HinSAGE Tutorial

Movie recommendation as edge regression with [HinSAGE](https://stellargraph.readthedocs.io/en/stable/api.html#module-stellargraph.layer.hinsage).

We first build an undirected bipartite graph whose heterogeneous nodes are users and movies, with the rating of each (user, movie) pair as the edge weight. We then learn node embeddings and recommend movies to users by regressing the weight of a given edge, i.e., the rating of a (user, movie) pair.
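To make the graph structure concrete, here is a minimal sketch of a weighted bipartite graph in plain Python. The ratings and the names `toy_ratings`, `edge_weight` and `neighbours` are made up for illustration; the tutorial itself builds the graph with `ingest_graph`.

```python
from collections import defaultdict

# Hypothetical toy ratings as (user, movie, rating); not taken from MovieLens.
toy_ratings = [
    ("u1", "m1", 5),
    ("u1", "m2", 3),
    ("u2", "m1", 4),
    ("u3", "m2", 2),
]

# Undirected bipartite graph: every edge joins one user node and one movie
# node, and the rating is stored as the edge weight.
edge_weight = {}
neighbours = defaultdict(set)
for user, movie, rating in toy_ratings:
    edge_weight[(user, movie)] = rating
    neighbours[user].add(movie)   # user -> movie
    neighbours[movie].add(user)   # movie -> user (undirected)

print(sorted(neighbours["m1"]))    # users who rated m1
print(edge_weight[("u1", "m2")])   # regression target for the pair (u1, m2)
```

Edge regression then means predicting `edge_weight[(user, movie)]` for a pair whose rating has not been observed.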

In [None]:
%matplotlib inline

import os
import json
import random
import numpy as np
import pandas as pd
import multiprocessing
import matplotlib.pyplot as plt

In [None]:
# discard warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

from tensorflow.python.util import deprecation
deprecation._PRINT_DEPRECATION_WARNINGS = False

!pip install -U 'gast==0.2.2'

In [None]:
from sklearn import preprocessing, feature_extraction
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow import keras

import stellargraph as sg
from stellargraph.mapper import HinSAGELinkGenerator
from stellargraph.layer import HinSAGE, link_regression

from utils import ingest_features, ingest_graph, add_features_to_nodes

In [None]:
# set random seed
SEED = 101
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)

## Load the MovieLens 100k Dataset

Load the [MovieLens 100k dataset](http://files.grouplens.org/datasets/movielens/ml-100k.zip) and build a bipartite graph with users and movies as nodes, and the rating of each (user, movie) pair as the edge weight.

In [None]:
data_dir = os.path.join('data', 'ml-100k')

In [None]:
# read the dataset configuration; the context manager closes the file afterwards
with open('ml-100k-config.json', 'r') as f:
    config = json.load(f)
Gnx, id_map, inv_id_map = ingest_graph(data_dir, config)

Load user features

In [None]:
user_features = ingest_features(data_dir, config, node_type='users')

Preprocess user features: standardise `age`, and one-hot encode `gender` and `job`.

In [None]:
feature_names = ['age', 'gender', 'job']
# DictVectorizer one-hot encodes string values and sorts columns by name, so 'age' is column 0
feature_encoding = feature_extraction.DictVectorizer(sparse=False, dtype=np.float64)
user_features_transformed = feature_encoding.fit_transform(user_features[feature_names].to_dict('records'))
user_features_transformed[:, 0] = preprocessing.scale(user_features_transformed[:, 0])  # rescale ages
user_features = pd.DataFrame(user_features_transformed, index=user_features.index, dtype=np.float64)
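The cell above relies on the age column ending up at index 0. The sketch below imitates that encoding in plain Python on made-up records, to show why: numeric fields keep their name, string fields become one `name=value` column each, and columns are ordered alphabetically. It is an illustration of the idea, not scikit-learn's implementation, and all names in it are hypothetical.

```python
import math

# Hypothetical user records; field names mirror the tutorial, values are made up.
toy_records = [
    {"age": 20, "gender": "F", "job": "student"},
    {"age": 40, "gender": "M", "job": "engineer"},
    {"age": 30, "gender": "F", "job": "engineer"},
]

def column_names(rows):
    """One column per numeric field and one per (string field, value) pair."""
    names = set()
    for row in rows:
        for key, value in row.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)  # alphabetical order puts "age" first here

toy_columns = column_names(toy_records)

def encode(row):
    vec = [0.0] * len(toy_columns)
    for key, value in row.items():
        if isinstance(value, str):
            vec[toy_columns.index(f"{key}={value}")] = 1.0   # one-hot
        else:
            vec[toy_columns.index(key)] = float(value)       # numeric as-is
    return vec

toy_matrix = [encode(row) for row in toy_records]

# Standardise column 0 (age) to zero mean and unit variance (population std).
ages = [row[0] for row in toy_matrix]
mean = sum(ages) / len(ages)
std = math.sqrt(sum((a - mean) ** 2 for a in ages) / len(ages))
for row in toy_matrix:
    row[0] = (row[0] - mean) / std

print(toy_columns)
```

If the feature set changes so that another name sorts before `age`, the hard-coded column index 0 would need to change as well.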

Load movie features

In [None]:
movie_features = ingest_features(data_dir, config, node_type='movies')

Add user and movie features to the graph

In [None]:
Gnx = add_features_to_nodes(Gnx, inv_id_map, user_features, movie_features)

Split edges into train and test sets

In [None]:
edges_train, edges_test = train_test_split(list(Gnx.edges(data=True)), train_size=0.7, test_size=0.3)

edgelist_train = [(e[0],e[1]) for e in edges_train]
edgelist_test = [(e[0],e[1]) for e in edges_test]

labels_train = [e[2]['score'] for e in edges_train]
labels_test = [e[2]['score'] for e in edges_test]
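As a rough picture of what the split does, the sketch below shuffles a made-up edge list and cuts it 70/30 in plain Python. It stands in for `train_test_split` only conceptually; the edge values and all `toy_` names are hypothetical.

```python
import random

# Hypothetical edges in the (source, target, attributes) form returned by
# Gnx.edges(data=True); the values are made up.
toy_edges = [("u%d" % i, "m%d" % (i % 4), {"score": 1 + i % 5}) for i in range(10)]

rng = random.Random(101)           # local RNG so the split is repeatable
toy_shuffled = toy_edges[:]        # copy, leave the original order intact
rng.shuffle(toy_shuffled)

cut = len(toy_shuffled) * 7 // 10  # 70% train, 30% test
toy_train, toy_test = toy_shuffled[:cut], toy_shuffled[cut:]

# Node pairs and labels are derived from the same list, so position i in the
# pair list always matches position i in the label list.
toy_pairs_train = [(u, m) for u, m, _ in toy_train]
toy_labels_train = [attrs["score"] for _, _, attrs in toy_train]
```

Deriving the pair list and the label list from the same split, as the tutorial does, is what keeps each rating attached to its (user, movie) pair.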

## Learn node embeddings using supervised HinSAGE

Create an undirected StellarGraph object with node features.

In [None]:
G = sg.StellarGraph(Gnx, node_features='feature')
batch_size = 32
num_samples = [10, 5]  # sizes of the 1- and 2-hop neighbour samples

Create the generators to feed data from the graph to the Keras model.

In [None]:
link_gen = HinSAGELinkGenerator(G, batch_size, num_samples)
train_gen = link_gen.flow(edgelist_train, labels_train, shuffle=True)
test_gen = link_gen.flow(edgelist_test, labels_test, shuffle=False)
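To illustrate what `num_samples = [10, 5]` implies, the sketch below samples a two-hop neighbourhood around one user in a made-up bipartite graph. It assumes sampling with replacement, which is one common choice; the library's own sampler may differ in detail, and the function `sample_neighbourhood` is hypothetical.

```python
import random

# Hypothetical bipartite adjacency lists; ids are made up.
user_movies = {"u1": ["m1", "m2"], "u2": ["m1"], "u3": ["m2", "m3"]}
movie_users = {"m1": ["u1", "u2"], "m2": ["u1", "u3"], "m3": ["u3"]}

def sample_neighbourhood(head, adjacency_by_hop, num_samples, rng):
    """Return one list of sampled nodes per hop, starting with [head]."""
    layers = [[head]]
    for adjacency, n in zip(adjacency_by_hop, num_samples):
        next_layer = []
        for node in layers[-1]:
            # Sample with replacement so every node yields exactly n
            # neighbours, even when it has fewer than n distinct ones.
            next_layer.extend(rng.choices(adjacency[node], k=n))
        layers.append(next_layer)
    return layers

rng = random.Random(101)
# Starting from a user: hop 1 reaches movies, hop 2 reaches users again.
layers = sample_neighbourhood("u1", [user_movies, movie_users], [10, 5], rng)
print([len(layer) for layer in layers])  # [1, 10, 50]
```

Under these assumptions each end of a link contributes 1 + 10 + 50 sampled nodes, so larger values in `num_samples` quickly increase the cost of every batch.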

### Exercise 1

Create a [HinSAGE](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.layer.hinsage.HinSAGE) model with:
- 2 hidden layers
- a size of 32 for both layers
- the link generator created above
- no dropout

In [None]:
hinsage = # YOUR_CODE

In [None]:
x_inp, x_out = hinsage.build()

### Exercise 2

Create a regression layer using [link_regression](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.layer.link_inference.link_regression) which uses the concatenated embeddings of nodes (i.e., user/movie) as the edge embedding.

In [None]:
pred_layer = # YOUR_CODE

In [None]:
prediction = pred_layer(x_out)
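The idea behind a concatenation-based edge embedding can be sketched without any library. The example below is a conceptual illustration only, assuming a single linear unit on top of the concatenated vector; the embeddings, weights and bias are made up, and it does not show the `link_regression` call that the exercise asks for.

```python
# Hypothetical 3-dimensional node embeddings; in this tutorial the HinSAGE
# layers have size 32.
user_embedding = [0.5, -1.0, 2.0]
movie_embedding = [1.5, 0.0, -0.5]

# Edge embedding by concatenation: its length is the sum of both lengths.
edge_embedding = user_embedding + movie_embedding

# A single linear unit maps the edge embedding to one predicted rating.
# Weights and bias are made up here; in a model they would be learned.
weights = [0.2, 0.1, 0.3, 0.4, -0.6, 0.5]
bias = 3.0
predicted_rating = sum(w * x for w, x in zip(weights, edge_embedding)) + bias

print(len(edge_embedding), round(predicted_rating, 2))
```

Unlike an inner product of the two embeddings, concatenation lets the regression layer weight user and movie dimensions independently.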

Create a Keras model that combines node embedding learning and link regression layers, then train it.

In [None]:
model = keras.Model(inputs=x_inp, outputs=prediction)

def root_mean_squared_error(y_true, y_pred): 
    return K.sqrt(K.mean((y_true - y_pred) ** 2))

model.compile(
    optimizer=keras.optimizers.Adam(lr=1e-3),
    loss=keras.losses.mean_squared_error,
    metrics=[root_mean_squared_error],
)

_ = model.fit_generator(
    train_gen,
    epochs=3,
    verbose=1,
    shuffle=True,
    workers=multiprocessing.cpu_count() // 2,
)
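One caveat when reading the training log: assuming Keras reports a custom metric as an average of per-batch values, the logged `root_mean_squared_error` is a mean of batch RMSEs, which is generally not the RMSE over all samples. The sketch below shows the difference on made-up numbers; all `toy_` names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over paired lists of numbers."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical ratings and predictions, processed in two batches of two.
toy_true = [5.0, 3.0, 4.0, 1.0]
toy_pred = [4.0, 3.0, 4.0, 4.0]

batch_rmse = [rmse(toy_true[i:i + 2], toy_pred[i:i + 2]) for i in (0, 2)]
mean_of_batch_rmse = sum(batch_rmse) / len(batch_rmse)
overall_rmse = rmse(toy_true, toy_pred)

print(round(mean_of_batch_rmse, 4), round(overall_rmse, 4))
```

Because the square root is concave, the mean of batch RMSEs never exceeds the overall RMSE, so the logged metric can look slightly better than the figure computed over the whole test set at the end of this tutorial.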

## Make recommendations

**Baseline I**

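Predict the rating of a movie as the mean of its ratings observed in the training set. For movies with no ratings in the training set, fall back to the mean of all training ratings.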

In [None]:
ratings_dict = dict()
ratings_total = 0

for e in edges_train:
    mid = e[2]['mId']
    score = e[2]['score']
    ratings_total += score
    try:
        ratings_dict[mid].append(score)
    except KeyError:
        ratings_dict[mid] = [score]
        
rating_mean = ratings_total / len(edges_train)

In [None]:
y_pred_baseline = []

for e in edges_test:
    mid = e[2]['mId']
    if mid in ratings_dict:
        pred = np.mean(ratings_dict[mid]) 
    else:
        pred = rating_mean
    y_pred_baseline.append(pred)
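The same baseline can be checked on a tiny made-up example. This sketch uses a `defaultdict` in place of the `try`/`except` above and is only an illustration; the data and all names in it are hypothetical.

```python
from collections import defaultdict

# Hypothetical (movie id, rating) pairs; values are made up.
toy_train = [("m1", 4), ("m1", 2), ("m2", 5)]
toy_test_movies = ["m1", "m2", "m3"]   # m3 never appears in training

scores_by_movie = defaultdict(list)
for movie_id, score in toy_train:
    scores_by_movie[movie_id].append(score)

global_mean = sum(score for _, score in toy_train) / len(toy_train)

def predict(movie_id):
    # .get avoids creating an empty entry for unseen movies
    scores = scores_by_movie.get(movie_id)
    return sum(scores) / len(scores) if scores else global_mean

toy_predictions = [predict(m) for m in toy_test_movies]
print(toy_predictions)
```

The unseen movie `m3` receives the global mean, which mirrors the `else` branch in the cell above.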

In [None]:
y_true = labels_test
y_pred = model.predict_generator(test_gen, verbose=1)

In [None]:
print('RMSE of Baseline: %.3f' % np.sqrt(mean_squared_error(y_true, y_pred_baseline)))
print('RMSE of  HinSAGE: %.3f' % np.sqrt(mean_squared_error(y_true, y_pred)))