# Merged Tutorial: R-GCN Link Prediction

In this notebook, we’ll demonstrate how to use a **Relational Graph Convolutional Network (R-GCN)** for **link prediction**.
We will cover:
1. Installation and Imports
2. Loading and Preparing the Dataset
3. Building an R-GCN Model
4. Training and Optimization
5. Evaluation Metrics and Scoring (Precision, Recall, Cohen’s Kappa)
6. Visualization and Interpretability
7. Interactive Elements


## 1. Installation and Imports

The cells below install (if necessary) and import the required libraries.

In [None]:
# If you need to install the dependencies used in this notebook:
# !pip install torch torch-geometric scikit-learn matplotlib seaborn ipywidgets

In [None]:
import torch
from torch import nn
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import RGCNConv

# for evaluation:
from sklearn.metrics import precision_score, recall_score, cohen_kappa_score

print("PyTorch version:", torch.__version__)
print("PyTorch Geometric imported.")

## 2. Load and Prepare the Dataset

Here, we create a small synthetic relational graph with six nodes and two relation types; in practice, replace it with your real dataset. We then split the edges into training, validation, and test sets for link prediction.

In [None]:
# Synthetic example with two relations (0 and 1)
edge_index = torch.tensor([
    [0, 1, 1, 2, 2, 3, 4, 5],  # Source nodes
    [1, 0, 2, 1, 3, 2, 1, 3]   # Target nodes
], dtype=torch.long)

# Relation types (one per edge)
edge_type = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1], dtype=torch.long)

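# Node features (dummy one-hot-style features for demonstration).
# The edge list references 6 nodes (0-5), so x needs 6 rows; the 8 columns
# match the embedding size (in_channels) used when the model is built later.
x = torch.eye(6, 8)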

data = Data(
    x=x,
    edge_index=edge_index
)
data.edge_type = edge_type

print(data)

# Split edges into train/val/test
num_edges = data.edge_index.size(1)
train_size = int(num_edges * 0.6)
val_size = int(num_edges * 0.2)

torch.manual_seed(0)  # fix the seed so the random split is reproducible
indices = torch.randperm(num_edges)
train_idx = indices[:train_size]
val_idx = indices[train_size:train_size + val_size]
test_idx = indices[train_size + val_size:]

train_edge_index = data.edge_index[:, train_idx]
train_edge_type = data.edge_type[train_idx]
val_edge_index = data.edge_index[:, val_idx]
val_edge_type = data.edge_type[val_idx]
test_edge_index = data.edge_index[:, test_idx]
test_edge_type = data.edge_type[test_idx]

train_edge_index, train_edge_type, val_edge_index, val_edge_type, test_edge_index, test_edge_type

## 3. Building the R-GCN Model

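Below is a minimal R-GCN for link prediction:
- It starts from learned node embeddings, optionally summed with provided node features.
- It stacks multiple `RGCNConv` layers.
- It scores each candidate edge by concatenating the two endpoint embeddings and passing them through a linear layer; the relation type is not used in scoring.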
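
In equation form, the model can be sketched as follows, assuming `RGCNConv` uses its default mean aggregation per relation and omitting bias terms in the layer update. The symbols $W_0$, $W_r$, $\mathbf{w}$, and $b$ are notation introduced here, not names from the code:

$$
\begin{aligned}
h_i^{(l+1)} &= \mathrm{ReLU}\Big( W_0^{(l)} h_i^{(l)} + \sum_{r \in \mathcal{R}} \frac{1}{|\mathcal{N}_r(i)|} \sum_{j \in \mathcal{N}_r(i)} W_r^{(l)} h_j^{(l)} \Big) \\
p(u, v) &= \sigma\big( \mathbf{w}^\top [\, h_u \,\|\, h_v \,] + b \big)
\end{aligned}
$$

Here $\mathcal{N}_r(i)$ is the set of neighbors of node $i$ under relation $r$, $\mathcal{R}$ is the set of relation types, and $[\cdot \,\|\, \cdot]$ denotes concatenation of the final-layer embeddings. The edge score $p(u, v)$ does not depend on the relation type.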

In [None]:
class RGCNLinkPredictor(nn.Module):
    def __init__(self, num_nodes, in_channels, out_channels, num_relations, num_layers=2):
        super().__init__()
        # Node embeddings (if nodes lack inherent features)
        self.node_embeddings = nn.Embedding(num_nodes, in_channels)

        # RGCN layers
        self.convs = nn.ModuleList()
        self.convs.append(
            RGCNConv(in_channels, out_channels, num_relations=num_relations)
        )
        for _ in range(num_layers - 1):
            self.convs.append(
                RGCNConv(out_channels, out_channels, num_relations=num_relations)
            )

        # Linear scoring layer
        self.scoring = nn.Linear(out_channels * 2, 1)

        self.reset_parameters()

    def reset_parameters(self):
        nn.init.xavier_uniform_(self.node_embeddings.weight)
        for conv in self.convs:
            conv.reset_parameters()
        nn.init.xavier_uniform_(self.scoring.weight)
        if self.scoring.bias is not None:
            nn.init.zeros_(self.scoring.bias)

    def forward(self, x, edge_index, edge_type):
        # If x is None, we use the learned embeddings entirely.
        if x is None:
            out = self.node_embeddings.weight
        else:
            # Example: Summation of learned embeddings and provided features
            emb = self.node_embeddings.weight
            out = emb + x

        # Pass through R-GCN layers
        for conv in self.convs:
            out = conv(out, edge_index, edge_type)
            out = F.relu(out)

        return out

    def predict(self, node_embeddings, edge_index):
        # node_embeddings: [num_nodes, out_channels]
        source = node_embeddings[edge_index[0]]
        target = node_embeddings[edge_index[1]]
        # Concat node embeddings and feed to scoring
        score = self.scoring(torch.cat([source, target], dim=-1))
        # Squeeze the trailing dimension so scores have shape [num_edges]
        return torch.sigmoid(score).squeeze(-1)


## 4. Training and Optimization

In link prediction, we often optimize by contrasting positive edges (from the graph) with negative edges (sampled randomly or via more advanced strategies). Here, we:

1. Generate negative samples.
2. Compute link likelihood for both positive and negative edges.
3. Use a binary cross-entropy style loss (via manual log-likelihoods) to distinguish between them.
4. Use an optimizer (e.g., Adam) to train the model.
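
Steps 2 and 3 correspond to the objective below, where $E^+$ is the set of positive training edges, $E^-$ is the set of sampled negative edges, and $p(u, v)$ is the sigmoid output of `predict`. This notation is introduced here for illustration, and the small constant added inside the logarithms for numerical stability is omitted:

$$
\mathcal{L} = -\frac{1}{|E^+|} \sum_{(u, v) \in E^+} \log p(u, v) \;-\; \frac{1}{|E^-|} \sum_{(u, v) \in E^-} \log\big(1 - p(u, v)\big)
$$

The first term pushes scores of observed edges toward 1, and the second pushes scores of sampled non-edges toward 0.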

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

num_nodes = data.num_nodes
num_relations = int(torch.max(data.edge_type)) + 1
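# Embedding size; it must equal the feature dimension of data.x,
# because forward() adds the learned embeddings to x element-wise
in_channels = data.num_node_features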
out_channels = 8

model = RGCNLinkPredictor(
    num_nodes=num_nodes,
    in_channels=in_channels,
    out_channels=out_channels,
    num_relations=num_relations,
    num_layers=2
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
data = data.to(device)

# Move splits to device
train_edge_index = train_edge_index.to(device)
train_edge_type = train_edge_type.to(device)
val_edge_index = val_edge_index.to(device)
val_edge_type = val_edge_type.to(device)
test_edge_index = test_edge_index.to(device)
test_edge_type = test_edge_type.to(device)

def negative_sampling(num_neg_samples, num_nodes):
    # Random negative edges
    i = torch.randint(0, num_nodes, (num_neg_samples,), device=device)
    j = torch.randint(0, num_nodes, (num_neg_samples,), device=device)
    return torch.stack([i, j], dim=0)

def train_one_epoch():
    model.train()
    optimizer.zero_grad()

    # Forward pass
    node_embeddings = model(data.x, train_edge_index, train_edge_type)

    # Positive scores
    pos_score = model.predict(node_embeddings, train_edge_index)

    # Negative scores
    neg_edge_index = negative_sampling(train_edge_index.size(1), data.num_nodes)
    neg_score = model.predict(node_embeddings, neg_edge_index)

    # Compute loss
    loss_pos = -torch.log(pos_score + 1e-15).mean()
    loss_neg = -torch.log(1 - neg_score + 1e-15).mean()
    loss = loss_pos + loss_neg

    loss.backward()
    optimizer.step()

    return loss.item()

# Train for a few epochs
for epoch in range(1, 21):
    loss = train_one_epoch()
    if epoch % 5 == 0:
        print(f"Epoch {epoch} | Loss: {loss:.4f}")
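
The `negative_sampling` helper above draws node pairs uniformly at random, so it can return self-loops or pairs that are actually edges in the graph. A minimal sketch of one way to avoid this, using rejection sampling, is shown below. The function name `filtered_negative_sampling` and the `max_tries` parameter are hypothetical and introduced only for this example; the approach assumes the graph is sparse enough that most random pairs are valid negatives.

```python
import torch

def filtered_negative_sampling(pos_edge_index, num_nodes, num_neg_samples, max_tries=100):
    """Sample (source, target) pairs that are neither self-loops nor known edges.

    Hypothetical helper for illustration. May return fewer than
    num_neg_samples pairs if max_tries is exhausted on a dense graph.
    """
    # Known edges as a set of (src, dst) tuples for fast membership tests
    known = set(zip(pos_edge_index[0].tolist(), pos_edge_index[1].tolist()))
    negatives = []
    for _ in range(max_tries):
        if len(negatives) >= num_neg_samples:
            break
        # Draw a batch of candidate pairs uniformly at random
        cand = torch.randint(0, num_nodes, (2, num_neg_samples))
        for src, dst in zip(cand[0].tolist(), cand[1].tolist()):
            if src != dst and (src, dst) not in known:
                negatives.append((src, dst))
                if len(negatives) == num_neg_samples:
                    break
    device = pos_edge_index.device
    if not negatives:
        return torch.empty((2, 0), dtype=torch.long, device=device)
    # List of k pairs -> tensor of shape [k, 2] -> transpose to [2, k]
    return torch.tensor(negatives, dtype=torch.long, device=device).t()

# Usage on a toy edge list (made up for this example)
toy_edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
neg = filtered_negative_sampling(toy_edge_index, num_nodes=6, num_neg_samples=4)
print(neg)
```

PyTorch Geometric also ships `torch_geometric.utils.negative_sampling`, which may be preferable on larger graphs.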

## 5. Evaluation Metrics and Scoring

**Precision**, **Recall**, and **Cohen’s Kappa** can provide valuable insights:
- **Precision**: Useful when false positives are costly (i.e., predicting edges that do not actually exist).
- **Recall**: Critical for ensuring that as many *true* links are recovered as possible.
- **Cohen’s Kappa**: Offers insight beyond raw accuracy by considering chance agreement, which helps with imbalanced datasets.
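
To make the definitions concrete, here is a small worked example that computes all three metrics from confusion-matrix counts. The helper `binary_metrics` and the counts are made up for illustration; kappa is computed as $(p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, and Cohen's kappa from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Observed agreement: fraction of pairs labeled correctly
    p_o = (tp + tn) / total
    # Chance agreement: predictions and labels coincide on "edge" or on "no edge"
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((tn + fn) / total) * ((tn + fp) / total)
    p_e = p_yes + p_no
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return precision, recall, kappa

# Illustrative counts: 100 candidate pairs, half of them true edges
precision, recall, kappa = binary_metrics(tp=40, fp=10, fn=10, tn=40)
print(f"precision={precision:.2f} recall={recall:.2f} kappa={kappa:.2f}")
# prints: precision=0.80 recall=0.80 kappa=0.60
```

With balanced classes, chance agreement is 0.5, so an accuracy of 0.80 corresponds to a kappa of only 0.60.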

Below is an example of how to compute these metrics. We do negative sampling again for the test or validation edges, get scores, and then threshold them to produce predictions. In practice, you might tune this threshold or rely on other decision rules.

In [None]:
def evaluate(edge_index, edge_type, threshold=0.5):
    model.eval()
    with torch.no_grad():
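        # Encode nodes using only the training edges, so the held-out edges
        # being scored are not visible during message passing
        # (edge_type is unused here because the scorer ignores relation types)
        node_embeddings = model(data.x, train_edge_index, train_edge_type)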

        # Positive scores
        pos_score = model.predict(node_embeddings, edge_index)

        # Negative scores (equal number to positive edges)
        neg_edge_index = negative_sampling(edge_index.size(1), data.num_nodes)
        neg_score = model.predict(node_embeddings, neg_edge_index)

    # Construct label vectors
    y_true = torch.cat([
        torch.ones(pos_score.size(0), device=device),
        torch.zeros(neg_score.size(0), device=device)
    ], dim=0)

    y_scores = torch.cat([pos_score, neg_score], dim=0)
    y_pred = (y_scores >= threshold).float()

    # Convert tensors to numpy for sklearn compatibility
    y_true_np = y_true.cpu().numpy()
    y_pred_np = y_pred.cpu().numpy()

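    # zero_division=0 avoids warnings when no edge is predicted positive
    precision = precision_score(y_true_np, y_pred_np, zero_division=0)
    recall = recall_score(y_true_np, y_pred_np, zero_division=0)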
    kappa = cohen_kappa_score(y_true_np, y_pred_np)

    return {
        'precision': precision,
        'recall': recall,
        'kappa': kappa
    }


In [None]:
metrics_val = evaluate(val_edge_index, val_edge_type, threshold=0.5)
print("Validation Metrics:", metrics_val)

metrics_test = evaluate(test_edge_index, test_edge_type, threshold=0.5)
print("Test Metrics:", metrics_test)
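
The threshold of 0.5 used above is only a default. As a rough sketch of how it could be tuned, the code below sweeps candidate thresholds and keeps the one with the highest F1 score. The helper `best_threshold`, the choice of F1 as the criterion, and the sample scores are all assumptions made for illustration; in practice the scores and labels would come from the validation split, not the test split.

```python
def best_threshold(scores, labels, candidates=None):
    """Return (threshold, f1) that maximizes F1 over the candidate thresholds."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict ">" keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Made-up validation scores and labels, for illustration only
val_scores = [0.92, 0.81, 0.67, 0.58, 0.43, 0.35, 0.22, 0.10]
val_labels = [1, 1, 1, 0, 1, 0, 0, 0]
threshold, f1 = best_threshold(val_scores, val_labels)
print(f"best threshold={threshold:.2f}, F1={f1:.3f}")
# prints: best threshold=0.40, F1=0.889
```

On a graph as small as the synthetic one here, the validation split contains very few edges, so any tuned threshold will be noisy.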

## 6. Visualization and Interpretability

Visualizing or interpreting an R-GCN can be done in various ways:
- **Node Embedding Visualization**: Project the learned embeddings (e.g., via PCA or t-SNE) into 2D.
- **Edge Score Heatmaps**: If the graph is small, you can illustrate predicted link probabilities.


In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import seaborn as sns
import numpy as np

In [None]:
model.eval()
with torch.no_grad():
    # Get final embeddings using the entire graph
    # Convert to a NumPy array, which is what scikit-learn expects
    embeddings = model(data.x, data.edge_index, data.edge_type).cpu().numpy()

# Reduce to 2D with PCA
pca = PCA(n_components=2)
embed_2d = pca.fit_transform(embeddings)

plt.figure()
plt.scatter(embed_2d[:, 0], embed_2d[:, 1])
for i in range(embed_2d.shape[0]):
    plt.annotate(str(i), (embed_2d[i, 0], embed_2d[i, 1]))
plt.title("2D PCA Projection of Node Embeddings")
plt.show()

In [None]:
model.eval()
with torch.no_grad():
    embeddings = model(data.x, data.edge_index, data.edge_type)
    scores = model.predict(embeddings, data.edge_index).cpu().numpy()

# Convert edge_index to numpy for indexing
edge_index_np = data.edge_index.cpu().numpy()
num_nodes = data.num_nodes

# Create empty adjacency matrix
adj_matrix = np.zeros((num_nodes, num_nodes))

# Fill with scores
for idx in range(edge_index_np.shape[1]):
    src = edge_index_np[0, idx]
    dst = edge_index_np[1, idx]
    adj_matrix[src, dst] = scores[idx]

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(adj_matrix, cmap='viridis', square=True, cbar_kws={'label': 'Edge Score'})
plt.title("Predicted Edge Score Heatmap")
plt.xlabel("Target Node")
plt.ylabel("Source Node")
plt.show()

## 7. Interactive Elements

For more advanced experimentation, widgets can be used to:
- Tune hyperparameters (e.g., learning rate, hidden size) via sliders.
- Dynamically update plots of training metrics in real time.

Below is a small example showing how you could integrate a slider widget for adjusting the threshold during evaluation. *Make sure you have `ipywidgets` installed (e.g., `pip install ipywidgets`).*


In [None]:
from ipywidgets import interact, FloatSlider

@interact(threshold=FloatSlider(min=0.0, max=1.0, step=0.05, value=0.5))
def interactive_evaluation(threshold):
    """Allows interactive threshold selection for link prediction."""
    metrics = evaluate(test_edge_index, test_edge_type, threshold=threshold)
    print(f"Threshold: {threshold:.2f}")
    print(f"Precision: {metrics['precision']:.4f}")
    print(f"Recall:    {metrics['recall']:.4f}")
    print(f"Kappa:     {metrics['kappa']:.4f}")

Experiment to see how **precision**, **recall**, and **kappa** vary as the decision threshold for link existence changes.

## Conclusion

This merged notebook demonstrated:
- Loading/preparing a relational graph.
- Building an R-GCN for link prediction in PyTorch Geometric.
- Training and optimization with negative sampling.
- Evaluating using multiple metrics (Precision, Recall, Cohen’s Kappa).
- Simple visualization of node embeddings and predicted edge scores.
- Interactive threshold tuning with ipywidgets.
