# Dataset creation for semi-supervised node classification

In this notebook, we create the dataset required for our semi-supervised node classification task. We use this dataset for all semi-supervised node classification models, i.e., for baselines and GNNs alike.

In [1]:
import numpy as np
import pandas as pd

import pickle

import json
import os.path as osp

import torch
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.data import Dataset, download_url
from torch_geometric.transforms import NormalizeFeatures, RandomNodeSplit

#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)

## Define helper function

In [2]:
def reverse(tuples):
    """
    Reverse a 2-tuple.
    """
    new_tup = tuples[::-1]
    
    return new_tup

## Semi-supervised node classification

Semi-supervised node classification is about classifying nodes in a graph. The term *semi-supervised* (<span style="font-variant:small-caps;">Yang et al. (2016)</span>) reflects the atypical nature of node classification: when training node classification models, we can usually access the full graph, including all the unlabeled nodes. We are only missing the labels of the test nodes. However, we can still use information about the test nodes, e.g., knowledge of their neighborhood in the graph, to improve the model during training. This is a significant deviation from the standard supervised setting, where unlabeled datapoints are completely unobserved during training (<span style="font-variant:small-caps;">Hamilton (2020)</span>).

In our task, we follow the set-up of the paper *Semi-supervised classification with graph convolutional networks* (<span style="font-variant:small-caps;">Kipf and Welling (2017)</span>), where labels are only available for a small subset of nodes. 

We choose the variable `segment` for node classification. It has the four classes `S1`, `S2`, `S3` and `S4`, each of which occurs often enough that we do not treat class imbalance as a problem.

In [3]:
# Read in data frame
targets = pd.read_pickle("../1_data_processing/processed_data/targets.pkl")
# Print absolute frequencies
print(f"Absolute frequencies for 'segment':\n{targets.segment.value_counts()}")

Absolute frequencies for 'segment':
S3    93
S4    66
S2    54
S1    16
Name: segment, dtype: int64
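
To put these counts in proportion, the following minimal sketch converts them into relative frequencies. The counts are copied from the output above; the variable names are illustrative and not part of the original notebook.

```python
# Absolute frequencies of `segment`, copied from the output above.
counts = {"S3": 93, "S4": 66, "S2": 54, "S1": 16}

total = sum(counts.values())  # total number of nodes in the graph
shares = {segment: count / total for segment, count in counts.items()}

for segment, share in shares.items():
    print(f"{segment}: {share:.1%}")
```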


## PyTorch Geometric functionality for dataset creation

We use the built-in dataset creation provided in PyTorch Geometric. For more information on dataset creation in PyTorch Geometric, click [here](https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html).

PyTorch Geometric provides two abstract classes for datasets: [`torch_geometric.data.Dataset`](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Dataset) and [`torch_geometric.data.InMemoryDataset`](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.InMemoryDataset). Note that `torch_geometric.data.InMemoryDataset` inherits from `torch_geometric.data.Dataset` and should be used if the whole dataset fits into CPU memory.

Each dataset is passed a root folder, indicating where the datasets should be stored. This root folder is split up into two folders: the `raw_dir`, where the dataset is downloaded to, and the `processed_dir`, where the processed dataset is saved.
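
As a rough illustration of this folder layout, the sketch below builds the two paths by hand for the root `data/` and the raw file names used later in this notebook. It only assumes that the sub-folders are named `raw` and `processed`, as in PyTorch Geometric's default layout, and does not touch the file system.

```python
import os.path as osp

root = "data/"  # root folder, as used when the dataset is loaded later on

# Both sub-folders are derived from the root folder.
raw_dir = osp.join(root, "raw")
processed_dir = osp.join(root, "processed")

# Raw files that are expected to be present in `raw_dir`.
raw_file_names = ["node_features.pkl", "edge_features.pkl", "all_links.txt"]
raw_paths = [osp.join(raw_dir, name) for name in raw_file_names]

print(raw_paths)
print(processed_dir)
```

Since the `download()` method of our dataset does nothing, the raw files have to be placed in the `raw` folder manually before the dataset is instantiated.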

Additionally, each dataset can be passed a `transform`, a `pre_transform` and a `pre_filter` function, all being `None` by default. The `transform` function dynamically transforms the data object before accessing, therefore being particularly useful for data augmentation. The `pre_transform` function applies the transformation before saving the data objects to disk, therefore being particularly useful for heavy precomputation which needs to be done only once. The `pre_filter` function can manually filter out data objects before saving, therefore being particularly useful for use cases which involve the restriction of data objects being of a specific class.
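
The difference between the three hooks is mainly one of timing. The following toy sketch mimics that timing in plain Python; it is not PyTorch Geometric code, and the functions `build_processed` and `access` as well as the dictionary items are hypothetical stand-ins. It assumes that filtering happens before the pre-transform, as in the examples of the PyTorch Geometric documentation.

```python
def build_processed(raw_items, pre_filter=None, pre_transform=None):
    """Mimic the save-time stage: filter first, then pre-transform, once."""
    processed = []
    for item in raw_items:
        if pre_filter is not None and not pre_filter(item):
            continue  # item is dropped before saving
        if pre_transform is not None:
            item = pre_transform(item)
        processed.append(item)
    return processed


def access(processed, idx, transform=None):
    """Mimic the access-time stage: `transform` runs on every access."""
    item = processed[idx]
    return item if transform is None else transform(item)


# Hypothetical toy items standing in for graph objects.
raw_items = [{"label": 0, "value": 1.0}, {"label": 1, "value": 2.0}]

stored = build_processed(
    raw_items,
    pre_filter=lambda item: item["label"] == 1,  # keep one class only
    pre_transform=lambda item: {**item, "value": item["value"] * 10},
)
result = access(stored, 0, transform=lambda item: {**item, "seen": True})
print(result)
```

The stored item carries the pre-transformed value but not the `seen` flag, because the `transform` is applied to a copy at access time only.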

To create a `torch_geometric.data.InMemoryDataset`, we need to implement four fundamental methods:
* `torch_geometric.data.Dataset.raw_file_names()`: A list of files in the `raw_dir` that need to be found in order to skip the download.
* `torch_geometric.data.Dataset.processed_file_names()`: A list of files in the `processed_dir` that need to be found in order to skip the processing.
* `torch_geometric.data.Dataset.download()`: Downloads raw data into `raw_dir`.
* `torch_geometric.data.Dataset.process()`: Processes raw data and saves it into `processed_dir`.

To create a `torch_geometric.data.Dataset`, we also need to implement the above four methods. Additionally, we need to implement the following two methods: 
* `torch_geometric.data.Dataset.len()`: Returns the number of examples in the dataset.
* `torch_geometric.data.Dataset.get()`: Implements the logic to load a single graph.

## Our dataset

In order not to have to worry about CPU memory issues, we decide to use the `torch_geometric.data.Dataset` class. 

First, the class is initialized with the `__init__()` method. We do not pass a `transform`, `pre_transform` or `pre_filter` function, since data augmentation, heavy computation and the restriction of data objects to a specific class are not relevant in our setting.

Afterwards, the four fundamental methods `raw_file_names()`, `processed_file_names()`, `download()` and `process()` are implemented, of which the `process()` method is the most extensive. It loads the three data sources of interest: the data frame containing the node features, the data frame containing the edge features and the list containing all the links. Then, these are processed, yielding the node features, edge features, all the links and the node labels. For the `process()` method to perform its work, four other methods have been implemented: `_get_node_features()`, `_get_edge_features()`, `_get_adjacency_info()` and `_get_labels()`. They are mainly used to transform the node features, edge features, all links and node labels from the data frames to a torch tensor format that PyTorch Geometric can handle. For more information on data handling of graphs in PyTorch Geometric, click [here](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#data-handling-of-graphs).
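
To illustrate the label conversion, the sketch below reproduces the idea behind `_get_labels()` in plain Python. It assumes that the columns `S1` to `S4` are one-hot indicators of the segment, which is what the `argmax(-1)` call in the class suggests; the rows themselves are made up for illustration.

```python
# Hypothetical one-hot rows for the columns S1, S2, S3, S4.
one_hot_rows = [
    [0, 0, 1, 0],  # node 0 belongs to S3
    [1, 0, 0, 0],  # node 1 belongs to S1
    [0, 0, 0, 1],  # node 2 belongs to S4
]

# Index of the largest entry per row, i.e. an argmax over the last axis.
labels = [row.index(max(row)) for row in one_hot_rows]
print(labels)  # [2, 0, 3]
```

Under this assumption, the integer labels `0`, `1`, `2`, `3` correspond to the segments `S1`, `S2`, `S3`, `S4`.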

Lastly, the two methods `len()` and `get()` are implemented.

We call our dataset `NodeClassificationDataset`.

In [4]:
class NodeClassificationDataset(Dataset):
    def __init__(self, root, transform=None, pre_transform=None, pre_filter=None):
        super(NodeClassificationDataset, self).__init__(root, transform, pre_transform, pre_filter)

    @property
    def raw_file_names(self):
        return ['node_features.pkl', 'edge_features.pkl', 'all_links.txt']

    @property
    def processed_file_names(self):
        return 'not_implemented.pt'

    def download(self):
        pass

    def process(self):
        idx = 0
        
        with open(self.raw_paths[0], 'rb') as fh:
            node_features = pickle.load(fh)
        with open(self.raw_paths[1], 'rb') as fh:
            edge_features = pickle.load(fh)
        with open(self.raw_paths[2], 'r') as f:
            all_links = json.loads(f.read())
        
        # Get node features
        node_feats = self._get_node_features(node_features)
        # Get edge features
        edge_feats = self._get_edge_features(edge_features)
        # Get adjacency info
        edge_index = self._get_adjacency_info(all_links)
        # Get labels info
        label = self._get_labels(node_features)
            
        data = Data(x = node_feats, 
                    edge_index = edge_index, 
                    #edge_attr = edge_feats,
                    y = label
                    )

        torch.save(data, osp.join(self.processed_dir, f'data_{idx}.pt'))
        idx += 1
        
        self.num_classes = len(label.unique()) 
        
    def _get_node_features(self, node_features):
        node_features = node_features.drop(columns=['S1', 'S2', 'S3', 'S4'])
        return torch.tensor(node_features.values, dtype=torch.float32)
        
    def _get_edge_features(self, edge_features):
        edge_features = edge_features.filter(items=['paper_link', 'journal_link', 'hospital_link'])
        edge_features = pd.concat([edge_features, edge_features], ignore_index=True)
        return torch.tensor(edge_features.values, dtype=torch.float)
    
    def _get_adjacency_info(self, all_links):
        # double links:
        for i in range(len(all_links)):
            all_links += [reverse(all_links[i])]
        return torch.tensor(all_links, dtype=torch.int64).t().contiguous()
    
    def _get_labels(self, node_features):
        label = node_features.filter(items=['S1', 'S2', 'S3', 'S4'])
        return torch.tensor(label.values, dtype=torch.int64).argmax(-1)

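    def len(self):
        # Note: `processed_file_names` is a string here, so this returns its
        # number of characters (18), not the number of stored graphs.
        return len(self.processed_file_names)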

    def get(self, idx):
        data = torch.load(osp.join(self.processed_dir, f'data_{idx}.pt'))
        return data

For our applications, be it baselines or GNNs, we need to split the nodes into training set, validation set and test set.

We use the [`RandomNodeSplit`](https://pytorch-geometric.readthedocs.io/en/latest/modules/transforms.html#torch_geometric.transforms.RandomNodeSplit) transform to do this, which we pass as `transform` when loading our `NodeClassificationDataset`. We set its `split` argument to `"random"`, which specifies that training, validation and test sets are generated randomly according to `num_train_per_class`, `num_val` and `num_test` (as in the *Semi-supervised classification with graph convolutional networks* paper that we follow). Here, `num_train_per_class` is the number of training samples per class, `num_val` is the number of validation samples and `num_test` is the number of test samples.
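
The logic of such a split can be sketched in plain Python as follows. This is a simplified stand-in and not the library's implementation: the function `random_node_split`, its `seed` argument and the toy label list are assumptions made for illustration, so the resulting masks differ from those produced by `RandomNodeSplit`.

```python
import random


def random_node_split(labels, num_train_per_class, num_val, num_test, seed=0):
    """Return boolean train/val/test masks for a list of integer labels."""
    rng = random.Random(seed)
    n = len(labels)
    train_mask = [False] * n
    val_mask = [False] * n
    test_mask = [False] * n

    # Draw the same number of training nodes from every class.
    for c in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(members)
        for i in members[:num_train_per_class]:
            train_mask[i] = True

    # Shuffle the remaining nodes and cut them into validation and test sets.
    remaining = [i for i in range(n) if not train_mask[i]]
    rng.shuffle(remaining)
    for i in remaining[:num_val]:
        val_mask[i] = True
    for i in remaining[num_val:num_val + num_test]:
        test_mask[i] = True

    return train_mask, val_mask, test_mask


# Toy labels with the class counts of this notebook (16, 54, 93, 66).
labels = [0] * 16 + [1] * 54 + [2] * 93 + [3] * 66
train_mask, val_mask, test_mask = random_node_split(labels, 10, 60, 129)
print(sum(train_mask), sum(val_mask), sum(test_mask))  # 40 60 129
```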

As we follow the set-up of the *Semi-supervised classification with graph convolutional networks* paper, we would like to have only a small number of labeled nodes. In the datasets used in that paper, this corresponds to label rates of at most 5.2%. However, these datasets also have far more nodes than ours, the smallest having 2,708 nodes (at a label rate of 5.2%), whereas our graph has a total of 229 nodes. To have enough nodes in the training set for our models to generalize to unlabeled nodes, we choose 40 training nodes, 10 from each class, thereby avoiding class imbalance in the training set. For the remaining 189 nodes, we aim for a rough 1:2 split between validation and test set. We therefore opt for 60 nodes in the validation set and the remaining 129 nodes in the test set. This ratio is guided by "Cora", the standard benchmark dataset for semi-supervised node classification, which can also be found in PyTorch Geometric as part of the [`Planetoid`](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.Planetoid) dataset. It has a training set of size 140, a validation set of size 500 and a test set of size 1000.
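
The arithmetic behind these numbers can be checked with a few lines of Python. All figures are taken from the paragraph above; only the variable names are introduced here.

```python
num_nodes = 229
num_classes = 4
num_train_per_class = 10

num_train = num_classes * num_train_per_class  # 40 training nodes
num_remaining = num_nodes - num_train          # 189 nodes left
num_val = 60
num_test = num_remaining - num_val             # 129 test nodes

label_rate = num_train / num_nodes
cora_label_rate = 140 / 2708  # Cora: 140 training nodes out of 2,708

print(f"label rate here: {label_rate:.1%}")           # 17.5%
print(f"label rate Cora: {cora_label_rate:.1%}")      # 5.2%
print(f"val:test ratio: 1:{num_test / num_val:.2f}")  # 1:2.15
```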

We now look at an exemplary split.

In [5]:
torch_geometric.seed_everything(12345) 

# random as in paper by Kipf and Welling
dataset = NodeClassificationDataset(root='data/', transform=RandomNodeSplit(split="random", num_train_per_class = 10, num_val = 60, num_test = 129))

data = dataset[0]

Processing...
Done!


In [6]:
dataset

NodeClassificationDataset(18)

In [7]:
dataset[0]

Data(x=[229, 64], edge_index=[2, 11642], y=[229], train_mask=[229], val_mask=[229], test_mask=[229])

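The data object has a node feature matrix `x` of shape `[229, 64]`, which indicates that we have 229 nodes, each having 64 features. Note that the `18` displayed for `dataset` above is the length of the string `'not_implemented.pt'` returned by `processed_file_names` (see `len()`), not the number of graphs; `process()` stores a single graph, which we access as `dataset[0]`.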

The graph connectivity in `edge_index` is given in COO format with shape `[2, 11642]`, so that we have a total of 11642 directed links. Note that if there is an undirected link between node 1 and node 2, PyTorch Geometric requires two directed links between the two nodes: a link from node 1 to node 2 and a link from node 2 to node 1. This means that we actually only have half as many undirected links, i.e., a total of 5821 links.
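
The doubling of links and the COO layout can be illustrated with a small plain-Python sketch. The three links are made up, and, unlike `_get_adjacency_info()`, the sketch builds a new list instead of extending the input list in place and stops short of the conversion to a tensor.

```python
def reverse(link):
    """Reverse a 2-element link, e.g. [1, 53] -> [53, 1]."""
    return link[::-1]


# Hypothetical undirected links between node indices.
undirected_links = [[1, 53], [128, 53], [1, 8]]

# Add the reversed copy of every link without mutating the input list.
directed_links = undirected_links + [reverse(link) for link in undirected_links]

# Transpose to COO layout: first row holds sources, second row holds targets.
sources, targets = (list(row) for row in zip(*directed_links))
edge_index = [sources, targets]

print(edge_index)
# [[1, 128, 1, 53, 53, 8], [53, 53, 8, 1, 128, 1]]
print(len(directed_links) // 2)  # 3 undirected links
```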

The node labels are contained in `y` with shape `[229]`, showing that each of our 229 nodes has exactly 1 label.

The three masks `train_mask`, `val_mask`, `test_mask` denote which nodes to use in the training, validation and test set. Their shape is `[229]`. For example, a `True` in the 4th position of `train_mask` would indicate that the 4th node is used for training, and a `False` in the 227th position of `test_mask` would indicate that the 227th node is not used for testing.

We now see what each part of the dataset looks like.

In [8]:
data.x

tensor([[0.2000, 0.0976, 0.0702,  ..., 0.0000, 1.0000, 0.0000],
        [0.4000, 0.2439, 0.2073,  ..., 0.0000, 1.0000, 0.0000],
        [1.0000, 1.0000, 1.0000,  ..., 1.0000, 0.0000, 0.0000],
        ...,
        [0.2000, 0.0244, 0.0616,  ..., 1.0000, 0.0000, 0.0000],
        [0.2000, 0.0976, 0.1322,  ..., 1.0000, 0.0000, 0.0000],
        [0.2000, 0.1220, 0.1064,  ..., 0.0000, 0.0000, 0.0000]])

In [9]:
data.edge_index

tensor([[  1, 128,   1,  ..., 172, 167, 224],
        [ 53,  53,   8,  ...,  81,  95, 223]])

In [10]:
data.y

tensor([2, 2, 1, 1, 1, 1, 1, 3, 2, 3, 3, 2, 1, 0, 1, 3, 0, 0, 3, 0, 2, 3, 3, 1,
        1, 1, 1, 2, 1, 3, 3, 2, 2, 2, 3, 3, 3, 2, 2, 2, 3, 2, 2, 1, 1, 2, 2, 2,
        1, 1, 3, 1, 3, 0, 2, 2, 0, 1, 1, 0, 0, 1, 1, 2, 2, 2, 3, 1, 3, 3, 2, 1,
        3, 2, 1, 2, 2, 1, 2, 1, 0, 3, 3, 2, 2, 1, 3, 3, 2, 2, 2, 2, 2, 2, 2, 1,
        2, 3, 2, 1, 0, 2, 3, 3, 2, 2, 2, 1, 2, 1, 2, 1, 2, 0, 2, 2, 1, 0, 1, 2,
        2, 2, 1, 2, 2, 2, 3, 0, 2, 0, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2,
        1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 1, 1, 2, 0, 3, 2, 1, 3, 3, 0, 1,
        1, 2, 2, 2, 1, 1, 3, 2, 2, 2, 2, 2, 2])

In [11]:
data.train_mask

tensor([False, False, False, False, False, False, False,  True,  True, False,
        False, False, False, False,  True, False, False,  True, False,  True,
         True, False, False, False,  True, False, False,  True, False,  True,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False,  True, False, False,  True,
        False, False, False,  True, False, False,  True, False, False,  True,
         True, False,  True, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False, False,
        False,  True, False, False, False,  True, False, False, False, False,
        False, False, False,  True, False, False, False, False, False,  True,
         True, False,  True, False, False, False,  True, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False,  True, 

In [12]:
data.val_mask

tensor([False, False, False, False,  True, False, False, False, False, False,
         True,  True, False, False, False,  True, False, False,  True, False,
        False, False, False, False, False,  True,  True, False, False, False,
        False,  True, False,  True, False,  True,  True, False, False, False,
        False, False, False, False, False, False, False, False,  True, False,
         True, False, False, False,  True, False, False,  True, False, False,
        False, False, False, False,  True,  True,  True, False, False,  True,
         True, False, False, False, False, False, False,  True, False, False,
        False, False, False, False, False, False,  True, False, False, False,
        False, False, False, False, False, False, False, False,  True, False,
        False, False, False, False, False,  True, False, False,  True, False,
        False,  True,  True, False, False, False, False, False, False, False,
         True,  True, False, False, False, False, False, False, 

In [13]:
data.test_mask

tensor([ True,  True,  True,  True, False,  True,  True, False, False,  True,
        False, False,  True,  True, False, False,  True, False, False, False,
        False,  True,  True,  True, False, False, False, False,  True, False,
         True, False,  True, False,  True, False, False,  True,  True,  True,
         True,  True,  True,  True,  True,  True, False,  True, False, False,
        False,  True,  True, False, False,  True, False, False,  True, False,
        False,  True, False,  True, False, False, False,  True,  True, False,
        False, False,  True,  True,  True,  True,  True, False,  True,  True,
         True, False,  True,  True,  True, False, False,  True,  True,  True,
         True,  True,  True, False,  True,  True,  True,  True, False, False,
        False,  True, False,  True,  True, False, False,  True, False,  True,
         True, False, False,  True,  True,  True,  True,  True,  True,  True,
        False, False,  True,  True,  True,  True,  True, False, 

Of course, we can also find out which labels are in the training, validation and test set.

In [14]:
data.y[data.train_mask]

tensor([3, 2, 1, 0, 0, 2, 1, 2, 3, 2, 1, 0, 0, 0, 0, 1, 1, 3, 1, 2, 1, 0, 3, 2,
        0, 0, 2, 2, 1, 2, 3, 3, 3, 3, 3, 1, 2, 1, 0, 3])

In [15]:
data.y[data.val_mask]

tensor([1, 3, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 3, 2, 1, 2, 2, 3, 3, 2, 1, 3, 2, 2,
        2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 2, 3, 3, 1, 2, 2, 2, 2])

In [16]:
data.y[data.test_mask]

tensor([2, 2, 1, 1, 1, 1, 3, 1, 0, 0, 3, 3, 1, 1, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1,
        1, 2, 2, 1, 3, 2, 1, 1, 2, 1, 3, 3, 2, 1, 2, 2, 2, 1, 0, 3, 2, 2, 3, 2,
        2, 2, 2, 2, 2, 1, 2, 3, 2, 3, 2, 1, 1, 2, 0, 2, 2, 1, 0, 1, 2, 1, 2, 2,
        2, 3, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2,
        2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 2, 0,
        3, 1, 1, 2, 2, 2, 1, 2, 2])

In [17]:
print(f"Length of training set: {len(data.y[data.train_mask])}")
print(f"Length of validation set: {len(data.y[data.val_mask])}")
print(f"Length of test set: {len(data.y[data.test_mask])}")

Length of training set: 40
Length of validation set: 60
Length of test set: 129


We use the `NodeClassificationDataset` for all of our following applications of node classification. However, the split into training set, validation set and test set differs depending on the respective seed we use. The sizes of the training set, validation set and test set always remain unchanged.

Our applications are:

* logistic regression, implemented in `logistic_regression_baseline.ipynb`,
* random forest classifier, implemented in `random_forest_classifier_baseline.ipynb`,
* support vector classifier, implemented in `support_vector_classifier_baseline.ipynb`,
* multi-layer perceptron, implemented in `multi_layer_perceptron_baseline.ipynb`,
* graph convolutional network, implemented in `graph_convolutional_network.ipynb`,
* GNN with Chebyshev graph spectral convolutional operator, implemented in `gnn_with_chebyshev_convolution.ipynb`,
* graph attention network, implemented in `graph_attention_network.ipynb`.

The results are analyzed and visualized in `visualization_results.ipynb`.