# Load extracted KG recipes and fit a GNN via PyTorch Geometric

This notebook shows how to load the triples extracted from https://www.allrecipes.com and fit a heterogeneous GNN to them with PyTorch Geometric.

In [1]:
import os, sys

import torch as tn
import torch.optim 
from torch.nn import ModuleDict 
import torch_geometric.transforms as T
from torch_geometric.nn import to_hetero, Linear, GATConv, HeteroConv
from torch_geometric.data import HeteroData

sys.path = ['/Users/walder2/kg_uq/'] + sys.path
path_to_data = '/Users/walder2/kg_uq/recipe_data'

from kg_extraction import kg_to_hetero



# Define the Hetero GNN

This is just one example. The PyTorch Geometric docs at https://pytorch-geometric.readthedocs.io/en/latest/cheatsheet/gnn_cheatsheet.html list other heterogeneous GNN layers that handle features (and possibly lazy initialization).

The GNN maps a `HeteroData()` object to $\mathbb{R}^d$: after message passing, the nodes of each type are mean-pooled, a per-type linear layer projects the result to $\mathbb{R}^d$, and the projections are summed. If you change this readout, make sure the loss function used later on still matches the output.

Note that the embeddings for the features are `torch.float32`, so be mindful of how you initialize the variables of the GNN.

In [2]:
log_sig = torch.nn.LogSigmoid()

class HeteroGNN(tn.nn.Module):
    def __init__(self, hidden_channels, emb_dim, num_layers, edge_types, node_types):
        super().__init__()
        
        self.emb_dim = emb_dim
        self.convs = tn.nn.ModuleList()
        
        for _ in range(num_layers):
            conv = HeteroConv({key : GATConv((-1, -1), hidden_channels, add_self_loops=False) for key in edge_types}, aggr='sum')
            self.convs.append(conv)
            
        self.lin_dict = ModuleDict({x: Linear(hidden_channels, emb_dim, bias=False) for x in node_types})
        
        
        
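    def forward(self, x_dict, edge_index_dict):
        # Accumulator for the graph-level embedding (summed over node types).
        rv = tn.zeros(self.emb_dim)

        # Message passing, with a ReLU after each HeteroConv layer.
        for conv in self.convs:
            x_dict = conv(x_dict, edge_index_dict)
            x_dict = {key: x.relu() for key, x in x_dict.items()}

        # Readout: mean-pool the nodes of each type, project to emb_dim, and sum.
        # You may want to change this aggregation layer.
        for key, x in x_dict.items():
            rv += self.lin_dict[key](torch.mean(x, dim=0))
        return rv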
        

# Load the data 

The returned dictionary contains:

    - 'train': a list of `HeteroData()` objects (one for each subgraph)
    - 'kg': the KG as a `DataFrame`
    - 'embedding_maps': a tuple holding the sentence enumeration for each node type and the corresponding node type embeddings
    
You need to pass the path to `./recipe_data` as `data_dir`. You can specify an embedding model for the sentences by passing a string to `sentence_transformer_model`; the string names a type of embeddings for `SentenceTransformer` objects (see https://www.sbert.net/). Passing `undirected=True` supplies reverse edges for fitting.
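As a rough illustration of what supplying reverse edges means, the sketch below adds a reversed relation for every edge type in a plain dictionary of edge lists. The `rev_` prefix mirrors the naming used by `torch_geometric.transforms.ToUndirected`. Whether `kg_to_hetero` does it the same way is an assumption, and the function name `add_reverse_edges` and the toy node and relation names are hypothetical.

```python
def add_reverse_edges(edge_dict):
    """Return a copy of edge_dict with a reversed edge type for every relation.

    edge_dict maps (src_type, relation, dst_type) to a list of
    (src_index, dst_index) pairs.
    """
    out = {key: list(pairs) for key, pairs in edge_dict.items()}
    for (src, rel, dst), pairs in edge_dict.items():
        rev_key = (dst, "rev_" + rel, src)
        # Swap source and destination indices for the reversed relation.
        out.setdefault(rev_key, []).extend((d, s) for s, d in pairs)
    return out


# Hypothetical toy graph: two recipes, two ingredients.
edges = {("recipe", "contains", "ingredient"): [(0, 0), (0, 1), (1, 1)]}
undirected = add_reverse_edges(edges)
print(sorted(undirected))
```

With reverse edges in place, every node type that sends messages also receives them, which matters for layers such as `HeteroConv` that only update destination node types.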
    

In [4]:
data_dict = kg_to_hetero(data_dir=path_to_data, sentence_transformer_model='all-MiniLM-L6-v2', undirected=True)
train = data_dict['train']  # list of HeteroData() objects, one per subgraph

# Initialize the GNN

`emb_dim` is the dimension mapped to by the encoder. You need to specify the edge types and node types to initialize the GNN as defined. 


The model is fit by performing an N-ary classification. Each graph is embedded to get $f_{\theta}(G_i) = \boldsymbol{d}_i$. The optimization maximizes

$$ \sum_{i=1}^{N} \log P(G_i \vert D ) = \sum_{i=1}^{N} \log \frac{\sigma(\boldsymbol{d}_i^{T}\boldsymbol{d}_i)}{\sum_{j=1}^{N}  \sigma(\boldsymbol{d}_i^{T}\boldsymbol{d}_j) }, $$

where $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid function. This is not a great way to do this, but it is worth trying as a first pass.
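To make the objective concrete, here is a minimal sketch in plain Python (no PyTorch, so each term stays visible) of the quantity the training loop minimizes: the negative sum over graphs of $\log\big(\sigma(\boldsymbol{d}_i^{T}\boldsymbol{d}_i) / \sum_j \sigma(\boldsymbol{d}_i^{T}\boldsymbol{d}_j)\big)$. The names `log_sigmoid` and `nary_loss` and the toy embeddings are illustrative assumptions, not part of the notebook.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nary_loss(embeddings):
    """Negative sum over i of log[ sigma(d_i . d_i) / sum_j sigma(d_i . d_j) ]."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # log sigma(d_i . d_j) for every j: row i of the log-sigmoid Gram matrix.
        row = [
            log_sigmoid(sum(a * b for a, b in zip(embeddings[i], embeddings[j])))
            for j in range(n)
        ]
        # Stable log-sum-exp over the row gives the log of the denominator.
        m = max(row)
        log_denom = m + math.log(sum(math.exp(v - m) for v in row))
        total += row[i] - log_denom
    return -total


# Identical embeddings make every graph equally likely, so the loss is N * log(N).
uniform = nary_loss([[0.0, 0.0]] * 4)  # equals 4 * log(4)

# Orthogonal embeddings separate the graphs and give a lower loss.
separated = nary_loss([[3.0, 0.0], [0.0, 3.0]])
collapsed = nary_loss([[3.0, 0.0], [3.0, 0.0]])
print(uniform, separated, collapsed)
```

The `row[i] - log_denom` term corresponds to `tn.diag(d_ij) - tn.logsumexp(d_ij, dim=1)` in the training cell, where `d_ij` already holds the log-sigmoid scores.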

In [9]:
emb_dim = 10 

# De-duplicate across subgraphs and sort so the type order is stable across runs.
edge_types = tuple(sorted(set(y for x in train for y in x.edge_types)))
node_types = tuple(sorted(set(y for x in train for y in x.node_types)))

encoder = HeteroGNN(hidden_channels=16, num_layers=2, emb_dim=emb_dim, edge_types=edge_types, node_types=node_types)

In [10]:
epochs = 100
ntrain = len(train)
optimizer = torch.optim.Adam(encoder.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.90)

for epoch in range(epochs):
    optimizer.zero_grad()
    
    rv = tn.zeros((ntrain, emb_dim))
    
    for i in range(ntrain):
        rv[i] = encoder(train[i].x_dict, train[i].edge_index_dict)
        
    d_ij = log_sig(tn.matmul(rv, rv.T))
    loss = -tn.sum(tn.diag(d_ij) - tn.logsumexp(d_ij, dim=1))
    
    loss.backward()
    optimizer.step()
    scheduler.step()

    if (epoch+1) % 10 == 0:
        print('Epoch %s, loss %s' % (repr(epoch+1), repr(loss.detach())))
        

Epoch 10, loss tensor(11.3430)
Epoch 20, loss tensor(11.3430)
Epoch 30, loss tensor(11.3430)
Epoch 40, loss tensor(11.3430)
Epoch 50, loss tensor(11.3430)
Epoch 60, loss tensor(11.3430)
Epoch 70, loss tensor(11.3430)
Epoch 80, loss tensor(11.3430)
Epoch 90, loss tensor(11.3430)
Epoch 100, loss tensor(11.3430)


# Todo

Extract more KGs, and maybe consider a different schema. We'd like to add some noise by removing/adding edges. It would also be nice to make sure that similar recipes with different extraction types are embedded close together. We should provide some search depth curves on this and start thinking about a better fit and UQ.
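One possible starting point for the edge noise is sketched below in plain Python: each existing edge is dropped with some probability, then a fixed number of random new edges is added. This is only an assumption about how the perturbation could look. The function name `perturb_edges`, its parameters, and the edge-list representation are all hypothetical, and the result would still need to be converted back into an `edge_index` tensor.

```python
import random


def perturb_edges(edges, num_src, num_dst, drop_prob=0.1, num_add=0, seed=None):
    """Randomly drop and add edges in a list of (src_index, dst_index) pairs."""
    rng = random.Random(seed)

    # Keep each existing edge with probability 1 - drop_prob.
    kept = [e for e in edges if rng.random() >= drop_prob]

    # Add up to num_add new edges that are not already present.
    existing = set(kept)
    target = min(num_add, num_src * num_dst - len(existing))
    added = 0
    while added < target:
        cand = (rng.randrange(num_src), rng.randrange(num_dst))
        if cand not in existing:
            existing.add(cand)
            kept.append(cand)
            added += 1
    return kept


# Hypothetical toy edge list between 2 source nodes and 3 destination nodes.
edges = [(0, 0), (0, 1), (1, 2)]
noisy = perturb_edges(edges, num_src=2, num_dst=3, drop_prob=0.3, num_add=1, seed=0)
print(noisy)
```

Fixing `seed` makes a given perturbation reproducible, which should help when comparing embeddings of the clean and noisy versions of the same recipe.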