# Dataset creation for link prediction

In this notebook, we create the dataset required for our link prediction task. This dataset is used only for our GNN applications.
For the baselines, we load only the 5821 links of our graph and build a networkx or networkit graph from them, on which all baseline predictions are made.
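
As a rough illustration of the baseline setup, the following sketch builds a networkx graph from a list of links and scores two unconnected node pairs. The four toy links and the choice of the Jaccard coefficient as the heuristic are assumptions made for this example only; the actual baselines use the 5821 links and may rely on other predictors.

```python
import json

import networkx as nx

# Hypothetical stand-in for the link list; the real data holds 5821 links.
links_json = "[[0, 1], [0, 2], [1, 2], [2, 3]]"
links = json.loads(links_json)

# Build an undirected graph from the node pairs.
graph = nx.Graph()
graph.add_edges_from(links)

# Score candidate pairs that are not yet connected.
candidates = [(0, 3), (1, 3)]
scores = {(u, v): p for u, v, p in nx.jaccard_coefficient(graph, candidates)}
print(scores)
```

A higher score suggests that the pair shares a larger part of its neighbourhood, which such heuristics treat as evidence for a missing link.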

In [1]:
import numpy as np
import pandas as pd

import pickle

import json
import os.path as osp

import torch
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.data import Dataset, download_url
from torch_geometric.transforms import NormalizeFeatures, RandomLinkSplit

## Define helper function

In [2]:
def reverse(tuples):
    """
    Reverse a 2-tuple.
    """
    new_tup = tuples[::-1]
    
    return new_tup

## Link prediction

Link prediction is the task of predicting whether a link exists between a pair of nodes. We focus on the methods summarized in <span style="font-variant:small-caps;">Wu et al. (2022)</span>, whose treatment of link prediction builds heavily on <span style="font-variant:small-caps;">Zhang and Chen (2018)</span> and <span style="font-variant:small-caps;">Kipf and Welling (2016)</span>.

In our dataset, we use all 5821 links as well as all 68 available node features.

## PyTorch Geometric functionality for dataset creation

We use the built-in dataset creation provided in PyTorch Geometric. For more information, see the [PyTorch Geometric guide on creating datasets](https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html).

PyTorch Geometric provides two abstract classes for datasets: [`torch_geometric.data.Dataset`](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Dataset) and [`torch_geometric.data.InMemoryDataset`](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.InMemoryDataset). Note that `torch_geometric.data.InMemoryDataset` inherits from `torch_geometric.data.Dataset` and should be used if the whole dataset fits into CPU memory.

Each dataset is passed a root folder, indicating where the datasets should be stored. This root folder is split up into two folders: the `raw_dir`, where the dataset is downloaded to, and the `processed_dir`, where the processed dataset is saved.
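
For the `root='data/'` used later in this notebook, the expected folder layout can be sketched as follows. The subfolder names `raw` and `processed` follow PyTorch Geometric's defaults; the snippet only builds the paths and does not touch the file system.

```python
import os.path as osp

root = "data"
raw_dir = osp.join(root, "raw")              # where the raw files are expected
processed_dir = osp.join(root, "processed")  # where process() writes its output

# File names as returned by raw_file_names in this notebook.
raw_file_names = ["node_features.pkl", "edge_features.pkl", "all_links.txt"]
raw_paths = [osp.join(raw_dir, name) for name in raw_file_names]

for path in raw_paths:
    print(path)
print(osp.join(processed_dir, "data_0.pt"))
```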

Additionally, each dataset can be passed a `transform`, a `pre_transform` and a `pre_filter` function, all being `None` by default. The `transform` function dynamically transforms the data object before accessing, therefore being particularly useful for data augmentation. The `pre_transform` function applies the transformation before saving the data objects to disk, therefore being particularly useful for heavy precomputation which needs to be done only once. The `pre_filter` function can manually filter out data objects before saving, therefore being particularly useful for use cases which involve the restriction of data objects being of a specific class.
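
The difference between the three hooks lies in when they run. The following toy model in plain Python mimics that timing; it is not PyTorch Geometric's implementation, and the class `ToyDataset` as well as the integer "data objects" are hypothetical.

```python
class ToyDataset:
    """Toy stand-in that mimics when the three hooks are applied."""

    def __init__(self, raw_items, transform=None, pre_transform=None, pre_filter=None):
        self.transform = transform
        # "Processing": runs once; the result is what would be saved to disk.
        self.stored = []
        for item in raw_items:
            if pre_filter is not None and not pre_filter(item):
                continue  # dropped before saving
            if pre_transform is not None:
                item = pre_transform(item)  # one-off precomputation
            self.stored.append(item)

    def __getitem__(self, idx):
        item = self.stored[idx]
        # Applied on every access, so it may be random (data augmentation).
        return item if self.transform is None else self.transform(item)


dataset = ToyDataset(
    raw_items=[1, 2, 3, 4],
    pre_filter=lambda x: x % 2 == 0,  # keep even items only
    pre_transform=lambda x: x * 10,   # applied once, before "saving"
    transform=lambda x: x + 1,        # applied at access time
)
print(dataset.stored)  # [20, 40]
print(dataset[0])      # 21
```

The stored items never contain the effect of `transform`, which is why a random transform can yield a different result on each access.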

To create a `torch_geometric.data.InMemoryDataset`, we need to implement four fundamental methods:
* `torch_geometric.data.Dataset.raw_file_names()`: A list of files in the `raw_dir` which needs to be found in order to skip the download.
* `torch_geometric.data.Dataset.processed_file_names()`: A list of files in the `processed_dir` which needs to be found in order to skip the processing.
* `torch_geometric.data.Dataset.download()`: Downloads raw data into `raw_dir`.
* `torch_geometric.data.Dataset.process()`: Processes raw data and saves it into `processed_dir`.

To create a `torch_geometric.data.Dataset`, we also need to implement the above four methods. Additionally, we need to implement the following two methods: 
* `torch_geometric.data.Dataset.len()`: Returns the number of examples in the dataset.
* `torch_geometric.data.Dataset.get()`: Implements the logic to load a single graph.

## Our dataset

In order not to have to worry about CPU memory issues, we decide to use the `torch_geometric.data.Dataset` class. 

First, the class is initialized with the `__init__()` method. We do not pass a `transform`, `pre_transform` or `pre_filter` function, since data augmentation, heavy computation and the restriction of data objects to a specific class are not relevant in our setting.

Afterwards, the four fundamental methods `raw_file_names()`, `processed_file_names()`, `download()` and `process()` are implemented, of which the `process()` method is the most extensive. It loads the three data sources of interest: the data frame containing the node features, the data frame containing the edge features and the list containing all the links. These are then processed, yielding the node features, the edge features and the edge index of all links. Node labels are not required for link prediction and are therefore not created. For the `process()` method to perform its work, three helper methods have been implemented: `_get_node_features()`, `_get_edge_features()` and `_get_adjacency_info()`. They transform the node features, edge features and links into the torch tensor format that PyTorch Geometric can handle. For more information on data handling of graphs in PyTorch Geometric, see the [PyTorch Geometric introduction](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#data-handling-of-graphs).

Lastly, the two methods `len()` and `get()` are implemented.

We call our dataset `LinkPredictionDataset`.

In [3]:
class LinkPredictionDataset(Dataset):
    def __init__(self, root, transform=None, pre_transform=None, pre_filter=None):
        super(LinkPredictionDataset, self).__init__(root, transform, pre_transform, pre_filter)

    @property
    def raw_file_names(self):
        return ['node_features.pkl', 'edge_features.pkl', 'all_links.txt']

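    @property
    def processed_file_names(self):
        # Placeholder name: no file with this name is ever written,
        # so process() is run again whenever the dataset is instantiated.
        return 'not_implemented.pt'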

    def download(self):
        pass

    def process(self):
        # The dataset consists of a single graph, stored as data_0.pt.
        idx = 0
        with open(self.raw_paths[0], 'rb') as fh:
            node_features = pickle.load(fh)
        with open(self.raw_paths[1], 'rb') as fh:
            edge_features = pickle.load(fh)
        with open(self.raw_paths[2], 'r') as fh:
            all_links = json.load(fh)

        # Convert the raw data into tensors.
        node_feats = self._get_node_features(node_features)
        edge_feats = self._get_edge_features(edge_features)
        edge_index = self._get_adjacency_info(all_links)

        # Node labels (y) are not required for link prediction.
        data = Data(x=node_feats,
                    edge_index=edge_index,
                    edge_attr=edge_feats)

        torch.save(data, osp.join(self.processed_dir, f'data_{idx}.pt'))
    
    def _get_node_features(self, node_features):
        return torch.tensor(node_features.values, dtype=torch.float)
        
    def _get_edge_features(self, edge_features):
        edge_features = edge_features.filter(items=['paper_link', 'journal_link', 'hospital_link'])
        edge_features = pd.concat([edge_features, edge_features], ignore_index=True)
        return torch.tensor(edge_features.values, dtype=torch.float)
    
    def _get_adjacency_info(self, all_links):
        # Append the reverse of every link so that each undirected edge
        # is stored in both directions.
        all_links = all_links + [reverse(link) for link in all_links]
        return torch.tensor(all_links, dtype=torch.long).t().contiguous()

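    def len(self):
        # Note: processed_file_names is a string here, so this returns its
        # number of characters (18) rather than the number of graphs (1).
        return len(self.processed_file_names)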

    def get(self, idx):
        data = torch.load(osp.join(self.processed_dir, f'data_{idx}.pt'))
        return data
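
To make the effect of `_get_adjacency_info()` concrete, the following plain-Python sketch doubles a toy link list and transposes it into the `[2, num_edges]` layout. The three links are made up for illustration, and `zip` stands in for the `.t()` call on the tensor.

```python
links = [[0, 1], [0, 2], [1, 2]]

# Append the reversed copy of every link so that each undirected edge
# appears in both directions.
doubled = links + [link[::-1] for link in links]

# Transpose from a list of pairs to two rows: sources and targets.
sources, targets = (list(row) for row in zip(*doubled))

print(sources)  # [0, 0, 1, 1, 2, 2]
print(targets)  # [1, 2, 2, 0, 0, 1]
```

This ordering is also why `_get_edge_features()` concatenates the edge feature frame with itself: row `i` and row `i + num_links` describe the same undirected edge.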

For our GNN applications, we need to split the edges into positive and negative training, validation and test edges.

We use the [`RandomLinkSplit`](https://pytorch-geometric.readthedocs.io/en/latest/modules/transforms.html#torch_geometric.transforms.RandomLinkSplit) transform to do this, which is passed as `transform` when loading our `LinkPredictionDataset`. The arguments `num_val` and `num_test` specify the fraction of all edges used as validation and test edges. Setting `is_undirected` to True states that our graph is undirected, and setting `add_negative_train_samples` to True adds negative training samples for link prediction. With `split_labels` set to True, as in our case, positive and negative labels are saved in distinct attributes.

We assign a fraction of 0.1 of the edges to the validation set and a fraction of 0.3 to the test set. The majority of edges therefore remain in the training set.

We now look at an example split.

In [4]:
torch_geometric.seed_everything(12345) 

dataset = LinkPredictionDataset(
    root='data/',
    transform=RandomLinkSplit(num_val=0.1, num_test=0.3, is_undirected=True,
                              split_labels=True, add_negative_train_samples=True),
)

data = dataset[0]

Processing...
Done!


In [5]:
dataset

LinkPredictionDataset(18)

In [6]:
data

(Data(x=[229, 68], edge_index=[2, 6986], edge_attr=[6986, 3], pos_edge_label=[3493], pos_edge_label_index=[2, 3493], neg_edge_label=[3493], neg_edge_label_index=[2, 3493]),
 Data(x=[229, 68], edge_index=[2, 6986], edge_attr=[6986, 3], pos_edge_label=[582], pos_edge_label_index=[2, 582], neg_edge_label=[582], neg_edge_label_index=[2, 582]),
 Data(x=[229, 68], edge_index=[2, 8150], edge_attr=[8150, 3], pos_edge_label=[1746], pos_edge_label_index=[2, 1746], neg_edge_label=[1746], neg_edge_label_index=[2, 1746]))

`data` is a tuple of three `Data` objects, holding the training, validation and test split in this order. In each of them, the node feature matrix `x` has shape `[229, 68]`, which indicates that we have 229 nodes, each having 68 features.

The positive and negative training edges are contained in `pos_edge_label_index` and `neg_edge_label_index` of `data[0]`, each with shape `[2, 3493]`.
Similarly, the positive and negative validation edges are contained in the same attributes of `data[1]`, each with shape `[2, 582]`.
Finally, the same attributes of `data[2]`, each with shape `[2, 1746]`, give the positive and negative test edges.
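
The shapes in the output above can be reproduced by hand from the 5821 links and the fractions `num_val=0.1` and `num_test=0.3` passed to the split. The sketch below assumes that both fractions are truncated to integers and that each message-passing edge in `edge_index` is stored in both directions, which is consistent with the printed shapes.

```python
num_links = 5821
num_val = int(0.1 * num_links)
num_test = int(0.3 * num_links)
num_train = num_links - num_val - num_test

# edge_index holds the message-passing edges in both directions.
train_edge_index = 2 * num_train             # also used for the validation split
test_edge_index = 2 * (num_train + num_val)  # training plus validation edges

print(num_train, num_val, num_test)       # 3493 582 1746
print(train_edge_index, test_edge_index)  # 6986 8150
```

Under this reading, the validation split passes messages only over training edges, while the test split may additionally use the validation edges.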

Note that `pos_edge_label` is a tensor containing only ones, indicating that the edges in `pos_edge_label_index` exist, while `neg_edge_label` is a tensor containing only zeros, indicating that the edges in `neg_edge_label_index` do not exist.

We now see what each part of the dataset looks like for the training set.

In [7]:
data[0].x

tensor([[0.2000, 0.0976, 0.0702,  ..., 0.0000, 1.0000, 0.0000],
        [0.4000, 0.2439, 0.2073,  ..., 0.0000, 1.0000, 0.0000],
        [1.0000, 1.0000, 1.0000,  ..., 1.0000, 0.0000, 0.0000],
        ...,
        [0.2000, 0.0244, 0.0616,  ..., 0.0000, 1.0000, 0.0000],
        [0.2000, 0.0976, 0.1322,  ..., 0.0000, 1.0000, 0.0000],
        [0.2000, 0.1220, 0.1064,  ..., 0.0000, 1.0000, 0.0000]])

In [8]:
data[0].edge_index

tensor([[ 55, 189,  42,  ..., 122,  97, 109],
        [111, 210, 137,  ...,  61,  59, 100]])

In [9]:
data[0].edge_attr

tensor([[1., 0., 0.],
        [0., 0., 1.],
        [0., 1., 0.],
        ...,
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.]])

In [10]:
data[0].pos_edge_label

tensor([1., 1., 1.,  ..., 1., 1., 1.])

In [11]:
data[0].pos_edge_label_index

tensor([[ 55, 189,  42,  ...,  61,  59, 100],
        [111, 210, 137,  ..., 122,  97, 109]])

In [12]:
data[0].neg_edge_label

tensor([0., 0., 0.,  ..., 0., 0., 0.])

In [13]:
data[0].neg_edge_label_index

tensor([[180, 160,  17,  ..., 103, 150,  46],
        [203, 176, 225,  ...,  95,  76,  51]])

As mentioned above, we use the `LinkPredictionDataset` only for our GNN applications of link prediction. We use different splits into training set, validation set and test set based on the respective seed we use. Of course, the sizes of the training set, validation set and test set always remain unchanged. 
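
The effect of the seed can be illustrated with a toy split written with the standard library. This is not `RandomLinkSplit` itself; the helper `toy_split` and the twenty placeholder links are hypothetical.

```python
import random


def toy_split(links, seed, val_ratio=0.1, test_ratio=0.3):
    """Shuffle the links with a fixed seed and cut them into three parts."""
    shuffled = list(links)
    random.Random(seed).shuffle(shuffled)
    num_val = int(val_ratio * len(shuffled))
    num_test = int(test_ratio * len(shuffled))
    val = shuffled[:num_val]
    test = shuffled[num_val:num_val + num_test]
    train = shuffled[num_val + num_test:]
    return train, val, test


links = [(i, i + 1) for i in range(20)]
splits = {seed: toy_split(links, seed) for seed in (1, 2)}

for seed, (train, val, test) in splits.items():
    # The sizes depend only on the ratios; the membership depends on the seed.
    print(seed, len(train), len(val), len(test), val)
```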

Our GNN applications are:
* graph autoencoder, implemented in `gae_model_gcn_encoder.ipynb`,
* variational graph autoencoder, implemented in `vgae_model_vgcn_encoder.ipynb`,
* SEAL, implemented in `seal_model.ipynb`.

The results are analyzed and visualized in `visualization_results.ipynb`, together with the baseline results.