# 🧪 Semi-supervised node classification with `kglab` and PyTorch Geometric

We introduce the application of neural networks on knowledge graphs using `kglab`. 

Graph neural networks (GNNs) have gained popularity in a number of practical applications, including knowledge graphs, social networks and recommender systems. In the context of knowledge graphs, GNNs are used for tasks such as link prediction, node classification and knowledge graph embedding. Use cases for these tasks include `Automatic Knowledge Base Construction` (AKBC) and `Data Curation` of data from different sources with varying quality and trust.

In this tutorial, we will learn to:

- use `kglab` to represent a knowledge graph as a PyTorch tensor, a structure suited to working with PyTorch neural nets

- use the widely known `pytorch_geometric` (PyG) GNN library together with `kglab`.

- train a GNN with `pytorch_geometric` and `PyTorch Lightning` for semi-supervised node classification of the recipes knowledge graph.

- build and iterate on training data using `rubrix` with a Human-in-the-loop (HITL) approach.

## Our use case in a nutshell

Our goal in this notebook will be to build a semi-supervised node classifier of recipes and ingredients from scratch using kglab, PyG and Rubrix. 

Our classifier will be able to classify the nodes in our 15K-node knowledge graph according to a set of pre-defined flavour-related categories: `sweet`, `salty`, `piquant`, `sour`, etc. To account for mixed flavours (e.g., sweet chili sauce), our model will be both multi-class (there are several target labels) and multi-label (a node can be labelled with zero or several categories).

## Install `kglab` and `PyTorch Geometric`

In [16]:
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cpu.html -qqq
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cpu.html -qqq
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.0+cpu.html -qqq
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.8.0+cpu.html -qqq
!pip install torch-geometric -qqq
!pip install torch -qqq

!pip install kglab -qqq

!pip install pytorch_lightning -qqq

## Loading the recipes knowledge graph

We'll be working with the "recipes" knowledge graph, which is used throughout the `kglab` tutorial (see the [Syllabus](https://derwen.ai/docs/kgl/tutorial/)).

This version of the recipes KG contains around 15K recipes linked to their respective ingredients, as well as some other properties such as cooking time, labels and descriptions.

Let's load the knowledge graph into a `kg` object by reading from an RDF file (in Turtle):

In [3]:
import kglab

NAMESPACES = {
    "wtm":  "http://purl.org/heals/food/",
    "ind":  "http://purl.org/heals/ingredient/",
    "recipe":  "https://www.food.com/recipe/",
    }

kg = kglab.KnowledgeGraph(namespaces = NAMESPACES)

_ = kg.load_rdf("data/recipe_lg.ttl")

Let's take a look at our graph structure using the `Measure` class:

In [4]:
measure = kglab.Measure()
measure.measure_graph(kg)

f"Nodes: {measure.get_node_count()} ; Edges: {measure.get_edge_count()}"

'Nodes: 15983 ; Edges: 160980'

In [5]:
measure.p_gen.get_tally() # tallies the counts of predicates

Unnamed: 0,count
http://purl.org/heals/food/hasIngredient,113537
http://www.w3.org/1999/02/22-rdf-syntax-ns#type,15981
http://www.w3.org/2004/02/skos/core#definition,15481
http://purl.org/heals/food/hasCookTime,15407
http://www.w3.org/2004/02/skos/core#prefLabel,574
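
A quick sanity check on these numbers: since every triple has exactly one predicate, the predicate tallies should add up to the edge count reported by `Measure`. The snippet below copies the counts from the table by hand (it does not call `kglab`), so treat it as an arithmetic check rather than library usage.

```python
# Predicate counts copied by hand from the tally above.
predicate_counts = {
    "wtm:hasIngredient": 113537,
    "rdf:type": 15981,
    "skos:definition": 15481,
    "wtm:hasCookTime": 15407,
    "skos:prefLabel": 574,
}

total_edges = sum(predicate_counts.values())
share = predicate_counts["wtm:hasIngredient"] / total_edges

print(total_edges)     # 160980, the edge count reported earlier
print(f"{share:.1%}")  # share of hasIngredient triples among all edges
```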


In [6]:
measure.s_gen.get_tally() # tallies the counts of subjects

Unnamed: 0,count
https://www.food.com/recipe/67888,25
https://www.food.com/recipe/501028,25
https://www.food.com/recipe/38276,24
https://www.food.com/recipe/277843,24
https://www.food.com/recipe/262816,23
...,...
http://purl.org/heals/ingredient/garlic_powder,2
http://purl.org/heals/ingredient/spinach,2
http://purl.org/heals/ingredient/toasted_sesame_oil,2
http://purl.org/heals/ingredient/salmon,2


In [7]:
measure.o_gen.get_tally() # tallies the counts of objects

Unnamed: 0,count
http://purl.org/heals/food/Recipe,15407
http://purl.org/heals/ingredient/Salt,9034
http://purl.org/heals/ingredient/AllPurposeFlour,6456
http://purl.org/heals/ingredient/ChickenEgg,6041
http://purl.org/heals/ingredient/WhiteSugar,5979
...,...
http://purl.org/heals/ingredient/wood_bethony,1
http://purl.org/heals/ingredient/smoked_chicken,1
http://purl.org/heals/ingredient/dried_sweet_basil_leaves,1
http://purl.org/heals/ingredient/red_chile,1


In [8]:
measure.l_gen.get_tally() # tallies the counts of literals

Unnamed: 0,count
PT30M,1129
PT20M,1074
PT25M,956
PT10M,938
PT15M,906
...,...
tre s catalina dressing,1
neenish tarts,1
japanese take out ginger salad dressing,1
tatizas chamorro snack,1


From the above exploration, we can extract some conclusions to guide the next steps:

- We have a limited number of relationships, with `hasIngredient` being the most frequent.

- We have rather unique literals for labels and descriptions, but a certain amount of repetition for `hasCookTime`.

- As expected, the most frequently referenced objects are ingredients such as `Salt`, `ChickenEgg` and so on.


Now, let's move into preparing our knowledge graph for PyTorch.

## Representing our knowledge graph as a `PyTorch` Tensor

Let's now represent our `kg` as a `PyTorch` tensor using the `kglab.SubgraphTensor` class.

In [9]:
sg = kglab.SubgraphTensor(
    kg
) 

In [10]:
def tensorify(g, sg, excludes):
    """Encode the RDF graph `g` as [src, dst, edge_type] rows plus a node vector.

    Predicates whose prefixed name is in `excludes` are skipped.
    """
    def exclude(rel):
        return sg.n3fy(rel) in excludes
    
    relations = sorted(set(g.predicates()))
    subjects = set(g.subjects())
    objects = set(g.objects())
    nodes = list(subjects.union(objects))
    
    relations_dict = {rel: i for i, rel in enumerate(relations) if not exclude(rel)}
    
    # node ids start after the relation ids, so that relations and nodes
    # can share a single lookup vector
    offset = len(relations_dict)
    
    nodes_dict = {node: i + offset for i, node in enumerate(nodes)}
    
    edge_list = []
    
    for s, p, o in g.triples((None, None, None)):
        if p in relations_dict:  # skip excluded predicates
            src, dst, rel = nodes_dict[s], nodes_dict[o], relations_dict[p]
            edge_list.append([src, dst, 2 * rel])      # forward edge
            edge_list.append([dst, src, 2 * rel + 1])  # inverse edge
    
    # turn keys into prefixed strings: relations first, then nodes
    node_vector = [sg.n3fy(rel) for rel in relations_dict] + [sg.n3fy(node) for node in nodes_dict]
    return edge_list, node_vector

In [11]:
edge_list, node_vector = tensorify(kg.rdf_graph(), sg, excludes=['skos:description', 'skos:prefLabel'])

In [13]:
len(edge_list)

320812
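
To make the encoding concrete, here is a toy version of the same idea in plain Python, without `rdflib` or `kglab`. The triples and names are made up for illustration. Each kept triple yields two directed edges: the forward edge gets relation id `2 * rel`, the inverse edge gets `2 * rel + 1`, and node ids start after the relation ids. The count above is consistent with this scheme: 320812 = 2 * (160980 - 574), i.e. two edges for every triple except the 574 `skos:prefLabel` ones.

```python
# Toy triples (subject, predicate, object); all names are illustrative.
triples = [
    ("recipe:1", "hasIngredient", "ind:Salt"),
    ("recipe:1", "hasIngredient", "ind:Tomato"),
    ("recipe:1", "prefLabel", "tomato paste"),
    ("recipe:2", "hasIngredient", "ind:Salt"),
]
excludes = {"prefLabel"}

relations = sorted({p for _, p, _ in triples} - excludes)
rel_ids = {rel: i for i, rel in enumerate(relations)}

# Node ids start after the relation ids, mirroring the offset in tensorify.
offset = len(rel_ids)
nodes = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
node_ids = {node: i + offset for i, node in enumerate(nodes)}

edge_list = []
for s, p, o in triples:
    if p in rel_ids:
        src, dst, rel = node_ids[s], node_ids[o], rel_ids[p]
        edge_list.append([src, dst, 2 * rel])      # forward edge
        edge_list.append([dst, src, 2 * rel + 1])  # inverse edge

print(len(edge_list))  # 6: three kept triples, two edges each
print(edge_list[:2])   # forward and inverse edge of the first triple
```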

Let's create a `kglab.Subgraph` to be used for encoding/decoding numerical ids and URIs, which will be useful for preparing our training data, as well as for making sense of the predictions of our neural net.

In [14]:
sg = kglab.Subgraph(kg=kg, preload=node_vector)

In [17]:
import torch
from torch_geometric.data import Data

tensor = torch.tensor(edge_list, dtype=torch.long).t().contiguous()  # pylint: disable=E1101,E1102
edge_index, edge_type = tensor[:2], tensor[2]
data = Data(edge_index=edge_index)
data.edge_type = edge_type

In [26]:
(data.edge_index.shape, data.edge_type.shape, data.edge_type.max())

(torch.Size([2, 320812]), torch.Size([320812]), tensor(7))

## Building a training set with Rubrix

Now that we have a tensor representation of our KG which we can feed into our neural network, let's focus on the training data.

As we will be doing semi-supervised classification, we need to build a training set (i.e., some recipes and ingredients with ground-truth labels). 


For this, we can use [Rubrix](https://github.com/recognai/rubrix), an open-source tool for exploring, labeling and iterating on data for AI. Rubrix allows data scientists and subject matter experts to rapidly iterate on training and evaluation data by enabling iterative, asynchronous and potentially distributed workflows.

In Rubrix, a very simple workflow during model development looks like this:

1. Log unlabelled data records with `rb.log()` into a Rubrix dataset. At this step you could use weak supervision methods (e.g., Snorkel) to pre-populate your labels and then only refine them, or use a pretrained model to guide your annotation process. In our case, we will just log recipe and ingredient "records" along with some metadata (RDF types, labels, etc.).

2. Rapidly explore and label records in your dataset using the webapp, which follows a search-driven approach. This is especially useful with large, potentially noisy datasets and for quickly leveraging domain knowledge (e.g., recipes containing WhiteSugar are likely sweet). For this tutorial, we spent around 30 minutes labelling around 600 records.

3. Retrieve your annotations at any time using `rb.load()` or `rb.snapshot()`, which return a convenient `pd.DataFrame` that is handy to process and use for model development. In our case, we will load a snapshot, do a `train_test_split` with scikit-learn, and then use this for training our GNN.

4. After training a model, you can go back to step 1, this time using your model and its predictions, to spot improvements, quickly label other portions of the data, and so on. In our case, as we've started with a very limited training set (~600 examples), we will use our node classifier and `rb.log()` its predictions over the rest of our data (unlabelled recipes and ingredients).

## Setup Rubrix

If you have not installed and launched Rubrix, check the [installation guide](https://github.com/recognai/rubrix#get-started). 

In [30]:
import rubrix as rb

### 0. Preparing our raw dataset of recipes and ingredients

In [28]:
import pandas as pd
sparql = """
    SELECT distinct *
    WHERE {
        ?uri a wtm:Recipe .
        ?uri a ?type .
        ?uri skos:definition ?definition .
        ?uri wtm:hasIngredient ?ingredient
    } 
"""
df = kg.query_as_df(sparql=sparql)

# We group the ingredients into one column containing lists:
recipes_df = df.groupby(['uri', 'definition', 'type'])['ingredient'].apply(list).reset_index(name='ingredients') ; recipes_df

sparql_ingredients = """
    SELECT distinct *
    WHERE {
        ?uri a wtm:Ingredient .
        ?uri a ?type .
        OPTIONAL { ?uri skos:prefLabel ?definition } 
    }
"""

df = kg.query_as_df(sparql=sparql_ingredients)
df['ingredients'] = None

ing_recipes_df = pd.concat([recipes_df, df])

ing_recipes_df.fillna('', inplace=True) ; ing_recipes_df

Unnamed: 0,uri,definition,type,ingredients
0,recipe:10000,tomato paste,wtm:Recipe,"[ind:Salt, ind:Tomato]"
1,recipe:100026,baking powder meatballs,wtm:Recipe,"[ind:CowMilk, ind:Salt, ind:bread, ind:Onion, ..."
2,recipe:100034,working woman s cheese souffle,wtm:Recipe,"[ind:monterey_jack_cheese, ind:CowMilk, ind:Al..."
3,recipe:100048,carrot spice cookies,wtm:Recipe,"[ind:margarine, ind:AllPurposeFlour, ind:Salt,..."
4,recipe:100051,2 minute mayonnaise,wtm:Recipe,"[ind:AppleCiderVinegar, ind:Salt, ind:sweetene..."
...,...,...,...,...
569,ind:raisins,raisins,wtm:Ingredient,
570,ind:bay_leaves,bay leaves,wtm:Ingredient,
571,ind:hass_avocado,hass avocado,wtm:Ingredient,
572,ind:Tomato,tomato,wtm:Ingredient,
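
The `groupby(...).apply(list)` step collapses one row per (recipe, ingredient) pair into one row per recipe. A plain-Python sketch of that reshaping, on a few rows modelled on the table above, might look like this:

```python
from collections import defaultdict

# One row per (uri, definition, ingredient), as a SPARQL result would give.
# The rows are a small illustrative sample.
rows = [
    ("recipe:10000", "tomato paste", "ind:Salt"),
    ("recipe:10000", "tomato paste", "ind:Tomato"),
    ("recipe:100051", "2 minute mayonnaise", "ind:Salt"),
]

# Collect the ingredients of each recipe into a single list.
grouped = defaultdict(list)
for uri, definition, ingredient in rows:
    grouped[(uri, definition)].append(ingredient)

recipes = [
    {"uri": uri, "definition": definition, "ingredients": ingredients}
    for (uri, definition), ingredients in grouped.items()
]
print(recipes[0])
```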


### 1. Logging into Rubrix

In [32]:
LABELS = ['Bitter', 'Meaty', 'Piquant', 'Salty', 'Sour', 'Sweet']

In [None]:
import rubrix as rb

records = []
for i, r in ing_recipes_df.iterrows():
    item = rb.TextClassificationRecord(
            inputs={
                "id":r.uri, 
                "definition": r.definition,
                "ingredients": r.ingredients, 
                "type": r.type
            }, # log node fields
            prediction=[(label, 0.0) for label in LABELS], # log "dummy" predictions for aiding annotation
            metadata={'ingredients': r.ingredients, "type": r.type}, # metadata filters for quick exploration and annotation
            prediction_agent="kglab_tutorial", # who's performing/logging the prediction
            multi_label=True
        )
    records.append(item)
rb.log(records=records, name="kg_node_classification")

### 2. Annotation session with Rubrix (optional)

In this step you can go to your Rubrix dataset and annotate some examples of each class.

If you have no time to do this, just skip this part, as we have prepared a dataset for you with around 600 examples.

### 3. Loading our labelled records and creating a train/test split (optional)

In this step we load the labelled records from a Rubrix snapshot and split them into train and test sets.

If you skipped the annotation session, you can skip this part too, as we have prepared a dataset for you.

In [37]:
rb.snapshots(dataset="kg_node_classification")

[DatasetSnapshot(id='1619529362.161462', task='TextClassification', creation_date=datetime.datetime(2021, 4, 27, 13, 16, 2, 226195)),
 DatasetSnapshot(id='1619530026.060097', task='TextClassification', creation_date=datetime.datetime(2021, 4, 27, 13, 27, 6, 119516))]

In [38]:
df = rb.load(name="kg_node_classification", snapshot='1619530026.060097') ; df.head()

Unnamed: 0,index,id,ingredients,type,labels,definition
0,0,95315,"['ind:VanillaExtract', 'ind:ChickenEgg', 'ind:...",wtm:Recipe,[Sour],grandmother paul s sour cream pound cake
1,1,253152,"['ind:sour_cream', 'ind:BakingPowder', 'ind:Ba...",wtm:Recipe,[Sour],sour cream biscuit
2,2,410482,"['ind:ChickenEgg', 'ind:sour_cream', 'ind:coco...",wtm:Recipe,[Sour],the world s fastest chocolate sour cream cake ...
3,3,96249,"['ind:Water', 'ind:Cornstarch', 'ind:tomato_sa...",wtm:Recipe,[Sour],basic sweet and sour sauce
4,4,9128,"['ind:ChickenEgg', 'ind:Salt', 'ind:Paprika', ...",wtm:Recipe,[Sour],sour cream and spinach omelette


In [70]:
from sklearn.model_selection import train_test_split

# split the labelled records and save both parts for the next step
train_df, test_df = train_test_split(df)
train_df.to_csv('data/train_recipes_new.csv')
test_df.to_csv('data/test_recipes_new.csv')

### 4. Creating PyTorch train and test sets

Here we take our train and test datasets and transform them into `torch.Tensor` objects with the help of our kglab `Subgraph` for turning `uris` into `torch.long` indices.

In [46]:
import ast
import pandas as pd

train_df = pd.read_csv('data/train_recipes.csv') # use your own labelled datasets if you've created a snapshot
test_df = pd.read_csv('data/test_recipes.csv')

# labels are stored as strings in the CSV; parse them back into lists
train_df.labels = train_df.labels.apply(ast.literal_eval)
test_df.labels = test_df.labels.apply(ast.literal_eval)
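
When a DataFrame is written to CSV, list-valued columns such as `labels` are stored as their string representation, which is why they need parsing after `read_csv`. The standard library's `ast.literal_eval` parses such strings as Python literals without executing arbitrary code, which makes it a safer choice than `eval` for data read from disk. The sample strings below are illustrative.

```python
import ast

# Label lists as they would appear in a CSV cell.
raw_labels = ["['Sweet', 'Sour']", "['Salty']", "[]"]
parsed = [ast.literal_eval(s) for s in raw_labels]
print(parsed)  # [['Sweet', 'Sour'], ['Salty'], []]

# Anything that is not a plain literal is rejected instead of executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError) as err:
    rejected = True
    print("rejected:", type(err).__name__)
```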

Let's create lookups from label to integer id and vice versa:

In [47]:
label2id = {label:i for i,label in enumerate(LABELS)} ; 
id2label = {i:l for l,i in label2id.items()} ; (id2label, label2id)

({0: 'Bitter', 1: 'Meaty', 2: 'Piquant', 3: 'Salty', 4: 'Sour', 5: 'Sweet'},
 {'Bitter': 0, 'Meaty': 1, 'Piquant': 2, 'Salty': 3, 'Sour': 4, 'Sweet': 5})

The following function turns our DataFrame into numerical arrays for node indices and labels

In [71]:
import numpy as np

def create_indices_labels(df):
    # turn a list of label ids into a multi-hot vector (one slot per label)
    def one_hot(label_ids):
        a = np.zeros(len(LABELS))
        a.put(label_ids, np.ones(len(label_ids)))
        return a
    
    indices, labels = [], []
    for uri, node_labels in zip(df.uri.tolist(), df.labels.tolist()):
        indices.append(sg.transform(uri))
        labels.append(one_hot([label2id[label] for label in node_labels]))
    return indices, labels
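
For intuition, here is the same encoding on a single example in plain Python (no NumPy). Strictly speaking the result is a multi-hot vector, since a node may carry several labels at once. The `LABELS` list is repeated so the snippet runs on its own, and `multi_hot` is a name chosen for this sketch.

```python
LABELS = ['Bitter', 'Meaty', 'Piquant', 'Salty', 'Sour', 'Sweet']
label2id = {label: i for i, label in enumerate(LABELS)}

def multi_hot(labels):
    """Return a 0/1 vector with a 1 at the position of each given label."""
    vector = [0.0] * len(LABELS)
    for label in labels:
        vector[label2id[label]] = 1.0
    return vector

print(multi_hot(['Sweet', 'Sour']))  # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
print(multi_hot([]))                 # all zeros: a node with no flavour label
```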

Finally, let's turn our dataset into PyTorch tensors

In [72]:
train_indices, train_labels = create_indices_labels(train_df)
test_indices, test_labels = create_indices_labels(test_df)

train_idx = torch.tensor(train_indices, dtype=torch.long)
train_y = torch.tensor(train_labels, dtype=torch.float)

test_idx = torch.tensor(test_indices, dtype=torch.long)
test_y = torch.tensor(test_labels, dtype=torch.float) ; train_idx[:10], train_y

(tensor([12498, 17116, 24375,  9082, 12005, 17665, 20652, 19351, 23531,  3124]),
 tensor([[0., 0., 0., 0., 1., 1.],
         [0., 0., 0., 0., 0., 1.],
         [1., 0., 0., 0., 0., 1.],
         ...,
         [0., 1., 0., 1., 0., 0.],
         [0., 0., 0., 0., 1., 0.],
         [1., 0., 1., 0., 0., 1.]]))

Let's see if we can recover the correct URIs for our numerical ids using our `kglab.Subgraph`

In [79]:
(train_df.loc[0], sg.inverse_transform(12498))

(Unnamed: 0                                                    64
 index                                                         64
 id                                                        214252
 ingredients    ['ind:AllPurposeFlour', 'ind:whipping_cream', ...
 type                                                  wtm:Recipe
 labels                                             [Sweet, Sour]
 definition     noni s sour cream chocolate cake with brown su...
 uri                                                recipe:214252
 Name: 0, dtype: object,
 'recipe:214252')
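
Conceptually, `transform` and `inverse_transform` behave like a pair of lookups over the preloaded `node_vector`. The class below is a hypothetical stand-in written for illustration; it is not how `kglab.Subgraph` is implemented.

```python
class ToyIdMap:
    """Hypothetical stand-in for an id <-> URI lookup (not kglab's code)."""

    def __init__(self, preload):
        # Position in the preloaded list is the numerical id.
        self._id_to_uri = list(preload)
        self._uri_to_id = {uri: i for i, uri in enumerate(self._id_to_uri)}

    def transform(self, uri):
        return self._uri_to_id[uri]

    def inverse_transform(self, idx):
        return self._id_to_uri[idx]

id_map = ToyIdMap(["wtm:hasIngredient", "recipe:214252", "ind:Salt"])
idx = id_map.transform("recipe:214252")
print(idx, id_map.inverse_transform(idx))  # 1 recipe:214252
```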

## Creating a Subgraph of recipe and ingredient nodes
Here we create a node list to be used as a seed for building our `PyG` subgraph (using k-hops as we will see in the next section). The reason is that we do not want to encode all nodes in the graph (such as literals, durations, etc.). Our goal would be to encode only `recipes` and `ingredients`, as all nodes passed through the GNN will be classified. 

In [80]:
node_idx = torch.LongTensor([
    sg.transform(i) for i in ing_recipes_df.uri.values
])

In [82]:
node_idx.max()

tensor(32262)

## Defining a RGCN for node classification

### Creating a `PyG` subgraph

Here we build a subgraph with `k=1` hops from target to source starting with all `recipe` and `ingredient` nodes:

In [84]:

from torch_geometric.utils import k_hop_subgraph
# here we take all nodes connected within 1 hop
k = 1
node_idx, edge_index, mapping, edge_mask = k_hop_subgraph(
    node_idx, 
    k, 
    data.edge_index, 
    relabel_nodes=False
)

We have increased the size of our node set:

In [85]:
node_idx.shape

torch.Size([31712])
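
To see what the expansion does, here is a plain-Python sketch of a 1-hop neighbourhood on a toy edge list. It follows our understanding of `k_hop_subgraph` with its default flow: starting from the seed nodes, add every source node of an edge pointing at a seed, then keep the edges whose endpoints are both in the resulting set. The function name, node ids and edges are made up for illustration.

```python
def one_hop(seed_nodes, edges):
    """Return (nodes, edge_mask) for a 1-hop neighbourhood of seed_nodes.

    edges is a list of (source, target) pairs.
    """
    seeds = set(seed_nodes)
    # Sources of edges that point at a seed node join the subgraph.
    neighbours = {src for src, dst in edges if dst in seeds}
    nodes = seeds | neighbours
    # Keep only edges with both endpoints inside the subgraph.
    edge_mask = [src in nodes and dst in nodes for src, dst in edges]
    return sorted(nodes), edge_mask

edges = [(0, 1), (1, 0), (2, 1), (3, 2), (4, 5)]
nodes, edge_mask = one_hop([1], edges)
print(nodes)      # [0, 1, 2]
print(edge_mask)  # [True, True, True, False, False]
```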

Here we compute some measures needed for defining the size of our layers

In [86]:
data.num_nodes = data.edge_index.max().item() + 1

data.num_relations = data.edge_type.max().item() + 1

data.edge_type = data.edge_type[edge_mask]

data.num_classes = len(LABELS)

data.num_nodes, data.num_relations, data.num_classes

(32263, 8, 6)

### Defining a basic Relational Graph Convolutional Network

In [88]:
from torch_geometric.nn import FastRGCNConv, RGCNConv
import torch.nn.functional as F

In [89]:
RGCNConv?

Init signature:
RGCNConv(
    in_channels: Union[int, Tuple[int, int]],
    out_channels: int,
    num_relations: int,
    num_bases: Union[int, NoneType] = None,
    num_blocks: Union[int, NoneType] = None,
    aggr: str = 'mean',
    root_weight: bool = True,

In [113]:
class RGCN(torch.nn.Module):
    def __init__(self, num_nodes, num_relations, num_classes, out_channels=16, num_bases=30, layer_type=FastRGCNConv):
        
        super().__init__()
        
        # no node features are used, so the first layer takes one input
        # channel per node
        self.conv1 = layer_type(
            num_nodes, 
            out_channels, 
            num_relations, 
            num_bases=num_bases
        )
        self.conv2 = layer_type(
            out_channels, 
            num_classes, 
            num_relations, 
            num_bases=num_bases
        )

    def forward(self, edge_index, edge_type):
        x = F.relu(self.conv1(None, edge_index, edge_type))
        x = self.conv2(x, edge_index, edge_type)
        # independent sigmoid per label for multi-label classification
        return torch.sigmoid(x)
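
For reference, the propagation rule behind `RGCNConv`, as stated in the PyTorch Geometric documentation, is shown below. Here $\mathcal{N}_r(i)$ is the set of neighbours of node $i$ under relation $r$; the $1/|\mathcal{N}_r(i)|$ factor corresponds to the default `aggr='mean'`.

$$
\mathbf{x}'_i = \mathbf{\Theta}_{\mathrm{root}} \, \mathbf{x}_i + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_r(i)} \frac{1}{|\mathcal{N}_r(i)|} \, \mathbf{\Theta}_r \, \mathbf{x}_j
$$

When `num_bases` is set, the relation weights are not learned independently but as combinations of $B$ shared basis matrices, following the basis decomposition of the original R-GCN formulation:

$$
\mathbf{\Theta}_r = \sum_{b=1}^{B} a_{rb} \, \mathbf{V}_b
$$

Passing `None` as the node features, as `forward` does, makes the first layer treat each node as its own one-hot input, which is why `in_channels` is set to the number of nodes.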

### Create our model and optimizer

In [126]:
model = RGCN(
    num_nodes=data.num_nodes,
    num_relations=data.num_relations,
    num_classes=data.num_classes,
    #out_channels=32
)

In [127]:
device = torch.device('cpu') # ('cuda')
data = data.to(device)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.001)
model

RGCN(
  (conv1): FastRGCNConv(32263, 16, num_relations=8)
  (conv2): FastRGCNConv(16, 6, num_relations=8)
)

## Training our RGCN

In [134]:
loss_module = torch.nn.BCELoss()

def train():
    model.train()
    optimizer.zero_grad()
    out = model(edge_index, edge_type)
    loss = loss_module(out[train_idx], train_y)
    loss.backward()
    optimizer.step()
    return loss.item()

def accuracy(predictions, y):
    return predictions.eq(y).to(torch.float).mean()

@torch.no_grad()
def test():
    model.eval()
    pred = model(edge_index, edge_type)
    pred = torch.round(pred)  # threshold the sigmoid outputs at 0.5
    train_acc = accuracy(pred[train_idx], train_y)
    test_acc = accuracy(pred[test_idx], test_y)
    return train_acc.item(), test_acc.item()
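
`BCELoss` averages the binary cross-entropy over every (node, label) entry, treating each label as an independent yes/no decision, which is what makes it a natural fit for the multi-label setup. A plain-Python version of the formula on made-up numbers:

```python
import math

def bce(probabilities, targets):
    """Mean binary cross-entropy over all (row, label) entries."""
    total, count = 0.0, 0
    for p_row, y_row in zip(probabilities, targets):
        for p, y in zip(p_row, y_row):
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count

# Two nodes, three labels each; values are illustrative.
probs = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.5]]
targets = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
loss = bce(probs, targets)
print(round(loss, 4))
```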

In [136]:
for epoch in range(1, 5):
    loss = train()
    train_acc, test_acc = test()
    print(f'Epoch: {epoch:02d}, Loss: {loss:.4f}, Train: {train_acc:.4f} '
          f'Test: {test_acc:.4f}')

Epoch: 01, Loss: 0.0250, Train: 0.9996 Test: 0.8345
Epoch: 02, Loss: 0.0241, Train: 1.0000 Test: 0.8322
Epoch: 03, Loss: 0.0233, Train: 1.0000 Test: 0.8322
Epoch: 04, Loss: 0.0225, Train: 1.0000 Test: 0.8333
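
Note that the accuracy above is computed element-wise over the rounded prediction matrix, so a node with five of six labels right contributes 5/6 rather than 0. The sketch below reproduces that computation in plain Python on made-up scores, alongside the stricter exact-match rate for comparison. The helper names are chosen for this example.

```python
def threshold(scores, cutoff=0.5):
    """Turn per-label scores into 0/1 decisions."""
    return [[1.0 if s > cutoff else 0.0 for s in row] for row in scores]

def elementwise_accuracy(pred, target):
    """Fraction of individual (node, label) entries that match."""
    pairs = [(p, t) for p_row, t_row in zip(pred, target)
             for p, t in zip(p_row, t_row)]
    return sum(p == t for p, t in pairs) / len(pairs)

def exact_match(pred, target):
    """Fraction of nodes whose whole label vector matches."""
    return sum(p == t for p, t in zip(pred, target)) / len(pred)

scores = [[0.9, 0.2, 0.7], [0.4, 0.8, 0.1]]
target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pred = threshold(scores)
print(elementwise_accuracy(pred, target))  # 5 of 6 entries match
print(exact_match(pred, target))           # 1 of 2 nodes match
```

With sparse labels, element-wise accuracy can look high even for a model that predicts mostly zeros, so a metric such as F1 is a useful complement.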


## Using our model and analyzing its predictions with Rubrix
Let's see the shape of our model predictions

In [142]:
pred = model(edge_index, edge_type) ; pred

tensor([[0.5514, 0.3967, 0.4534, 0.3968, 0.4095, 0.4736],
        [0.5508, 0.3979, 0.4539, 0.3966, 0.4093, 0.4736],
        [0.5506, 0.3966, 0.4536, 0.3957, 0.4093, 0.4740],
        ...,
        [0.0651, 0.5984, 0.1798, 0.9521, 0.0616, 0.0125],
        [0.5555, 0.3987, 0.4512, 0.3935, 0.4066, 0.4767],
        [0.0374, 0.1662, 0.0898, 0.6494, 0.0627, 0.1541]],
       grad_fn=<SigmoidBackward>)

Let's find the predictions for the nodes in our training and test sets:

In [143]:
def find(tensor, values):
    return torch.nonzero(tensor[..., None] == values)

In [147]:
train_test_idx = find(node_idx,torch.cat((test_idx, train_idx))) ; len(train_test_idx)

589
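
The broadcasted comparison in `find` returns one `[i, j]` row per match. The first column holds positions in the searched tensor and the second column holds positions in `values`. A plain-Python equivalent on toy data shows the shape of the result; `find_pairs` is a name chosen for this sketch.

```python
def find_pairs(sequence, values):
    """Return [i, j] for every sequence[i] == values[j], in row-major order."""
    return [[i, j]
            for i, item in enumerate(sequence)
            for j, value in enumerate(values)
            if item == value]

node_ids = [10, 11, 12, 13, 14]
wanted = [13, 10]
pairs = find_pairs(node_ids, wanted)
print(pairs)                       # [[0, 1], [3, 0]]
positions = [i for i, _ in pairs]  # first column: positions in node_ids
print(positions)                   # [0, 3]
```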

Let's get the ids, uris and labels of the nodes which were not in our train/test datasets

In [151]:
index = torch.ones(node_idx.shape[0], dtype=bool)
indices = find(node_idx,torch.cat((test_idx, train_idx)))
index[indices] = False
idx = node_idx[index]

In [154]:
len(idx), len(node_idx), len(node_idx) - len(idx)

(30543, 31712, 1169)

We use our `Subgraph` to get back our URIs and build `uri, predicted_labels` pairs:

In [155]:
uris = [sg.inverse_transform(i) for i in idx]
predicted_labels = [l for l in pred[idx]]

In [156]:
predictions = list(zip(uris,predicted_labels)) ; predictions[0:10]

[('recipe:264354',
  tensor([0.0349, 0.3532, 0.2850, 0.8633, 0.0710, 0.0735],
         grad_fn=<UnbindBackward>)),
 ('recipe:227679',
  tensor([0.0536, 0.2053, 0.0462, 0.7612, 0.0393, 0.0595],
         grad_fn=<UnbindBackward>)),
 ('delmonico potatoes',
  tensor([0.5547, 0.3986, 0.4506, 0.3935, 0.4070, 0.4768],
         grad_fn=<UnbindBackward>)),
 ('chocolate pixies',
  tensor([0.5544, 0.3992, 0.4532, 0.3952, 0.4061, 0.4757],
         grad_fn=<UnbindBackward>)),
 ('pennsylvania dutch breakfast cake',
  tensor([0.5567, 0.3994, 0.4526, 0.3937, 0.4059, 0.4766],
         grad_fn=<UnbindBackward>)),
 ('lemon streusel bars',
  tensor([0.5552, 0.4005, 0.4510, 0.3927, 0.4045, 0.4756],
         grad_fn=<UnbindBackward>)),
 ('recipe:321021',
  tensor([0.0402, 0.5512, 0.3531, 0.9369, 0.1224, 0.0124],
         grad_fn=<UnbindBackward>)),
 ('easy microwave pie',
  tensor([0.5562, 0.3982, 0.4514, 0.3926, 0.4073, 0.4782],
         grad_fn=<UnbindBackward>)),
 ('classic waffles',
  tensor([0.5563, 0.

In [157]:
import rubrix as rb

records = []
for uri, predicted_labels in predictions:
    # select rows with a boolean mask: the index has duplicates after pd.concat
    r = ing_recipes_df[ing_recipes_df.uri == uri]
    if len(r) > 0:
        item = rb.TextClassificationRecord(
                inputs={
                    "id": r.uri.values[0],
                    "definition": r.definition.values[0],
                    "ingredients": str(r.ingredients.values[0]),
                    "type": r.type.values[0]
                },
                prediction=[(id2label[i], score.item()) for i, score in enumerate(predicted_labels)],
                metadata={'ingredients': r.ingredients.values[0], "type": r.type.values[0]},
                prediction_agent="node_classifier_v1",
                multi_label=True
        )
        records.append(item)

In [158]:
rb.log(records, name="kg_node_classification_unseen_nodes")

BulkResponse(dataset='kg_node_classification_unseen_nodes', processed=15125, failed=0)
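
Once predictions are logged, a common next step is to turn the per-label scores into label names, for example to pre-select candidates for annotation. The helper below is a hypothetical sketch using a 0.5 cutoff; the scores are copied from the first prediction shown above.

```python
LABELS = ['Bitter', 'Meaty', 'Piquant', 'Salty', 'Sour', 'Sweet']

def predicted_label_names(scores, cutoff=0.5):
    """Return the names of all labels whose score exceeds the cutoff."""
    return [label for label, score in zip(LABELS, scores) if score > cutoff]

scores = [0.0349, 0.3532, 0.2850, 0.8633, 0.0710, 0.0735]
print(predicted_label_names(scores))       # ['Salty']
print(predicted_label_names(scores, 0.3))  # ['Meaty', 'Salty']
```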

## APPENDIX: Training with PyTorch Lightning

In [137]:
from torch_geometric.data import Data, DataLoader

data.train_idx = train_idx
data.train_y = train_y
data.test_idx = test_idx
data.test_y = test_y

dataloader = DataLoader([data], batch_size=1); dataloader

<torch_geometric.data.dataloader.DataLoader at 0x14b7d1910>

In [None]:
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint

class RGCNNodeClassification(pl.LightningModule):
    
    def __init__(self, **model_kwargs):
        super().__init__()
        
        self.model = RGCN(**model_kwargs)
        self.loss_module = torch.nn.BCELoss()
    
    def forward(self, data, mode="train"):
        edge_index, edge_type = data.edge_index, data.edge_type
        if mode == "train":
            idx, y = data.train_idx, data.train_y
        else:
            # "val" and "test" both use the held-out split
            idx, y = data.test_idx, data.test_y
        
        x = self.model(edge_index, edge_type)
        loss = self.loss_module(x[idx], y)
        metric = pl.metrics.F1(num_classes=6, multilabel=True)
        f1 = metric(x[idx], y)
        return loss, f1
        
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.01, weight_decay=0.001)
        return optimizer
        
    def training_step(self, batch, batch_idx):
        loss, f1 = self.forward(batch, mode="train")
        self.log('train_loss', loss)
        self.log('train_f1', f1, prog_bar=True)
        return loss 
        
    def validation_step(self, batch, batch_idx):
        _, f1 = self.forward(batch, mode="val")
        self.log('val_f1', f1, prog_bar=True)
          
    def test_step(self, batch, batch_idx):
        _, f1 = self.forward(batch, mode="test")
        self.log('test_f1', f1, prog_bar=True)

In [None]:
pl.seed_everything()

In [None]:
model_pl = RGCNNodeClassification(
    num_nodes=data.num_nodes,
    num_relations=data.num_relations,
    num_classes=data.num_classes,
)

In [None]:
trainer = pl.Trainer(
    default_root_dir='pl_runs',
    checkpoint_callback=ModelCheckpoint(save_weights_only=True, mode="max", monitor="val_f1"),
    max_epochs=200
)

In [None]:
trainer.fit(model_pl, dataloader, dataloader)