# OnDiskDataset for Homogeneous Graph

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb)

This tutorial shows how to create an `OnDiskDataset` for a homogeneous graph that can be used in the **GraphBolt** framework.

By the end of this tutorial, you will be able to

- organize graph structure data.
- organize feature data.
- organize training/validation/test sets for specific tasks.

To create an `OnDiskDataset` object, you need to organize all the data, including graph structure, feature data, and tasks, into a directory. The directory should contain a `metadata.yaml` file that describes the metadata of the dataset.

Now let's generate the data step by step and then organize it to instantiate an `OnDiskDataset`.

## Install DGL package

In [1]:
# Install required packages.
import os
import torch
import numpy as np
os.environ['TORCH'] = torch.__version__
os.environ['DGLBACKEND'] = "pytorch"

# Install the CPU version.
device = torch.device("cpu")
!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html

try:
    import dgl
    import dgl.graphbolt as gb
    installed = True
except ImportError as error:
    installed = False
    print(error)
print("DGL installed!" if installed else "DGL not found!")

Looking in links: https://data.dgl.ai/wheels-test/repo.html
DGL installed!


## Data preparation
In order to demonstrate how to organize various data, let's create a base directory first.

In [2]:
base_dir = './ondisk_dataset_homograph'
os.makedirs(base_dir, exist_ok=True)
print(f"Created base directory: {base_dir}")

Created base directory: ./ondisk_dataset_homograph


### Generate graph structure data
For a homogeneous graph, we just need to save the edges (namely, seeds) into a **NumPy** or **CSV** file.

Note:
- When saving to **NumPy**, the array must have shape `(2, N)`. This format is recommended because constructing the graph from it is much faster than from a **CSV** file.
- When saving to a **CSV** file, do not save the index or header.
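
The cell below saves the edges as CSV. As a minimal sketch of the NumPy alternative described in the note, the `(N, 2)` edge array can be transposed to `(2, N)` before saving. The file name `edges.npy`, the temporary directory, and the reading of row 0 as sources and row 1 as destinations are assumptions made for illustration; the matching `metadata.yaml` entry would presumably use `format: numpy` instead of `format: csv`.

```python
import os
import tempfile

import numpy as np

num_nodes = 1000
num_edges = 10 * num_nodes

# Edges generated as (N, 2): one (source, destination) pair per row.
rng = np.random.default_rng(0)
edges = rng.integers(0, num_nodes, size=(num_edges, 2))

# Transpose to (2, N): assumed layout is row 0 = sources, row 1 = destinations.
edges_2xn = np.ascontiguousarray(edges.T)

# A temporary directory stands in for the dataset base directory here.
tmp_dir = tempfile.mkdtemp()
edges_npy_path = os.path.join(tmp_dir, "edges.npy")
np.save(edges_npy_path, edges_2xn)

# Reload to confirm the on-disk shape.
reloaded = np.load(edges_npy_path)
print(f"Saved edge array shape: {reloaded.shape}")
```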


In [3]:
import numpy as np
import pandas as pd
num_nodes = 1000
num_edges = 10 * num_nodes
edges_path = os.path.join(base_dir, "edges.csv")
edges = np.random.randint(0, num_nodes, size=(num_edges, 2))

print(f"Part of edges: {edges[:5, :]}")

df = pd.DataFrame(edges)
df.to_csv(edges_path, index=False, header=False)

print(f"Edges are saved into {edges_path}")

Part of edges: [[645 874]
 [556 843]
 [ 91 996]
 [228 974]
 [116 667]]
Edges are saved into ./ondisk_dataset_homograph/edges.csv


### Generate feature data for graph
For feature data, numpy arrays and torch tensors are supported for now.

In [4]:
# Generate node feature in numpy array.
node_feat_0_path = os.path.join(base_dir, "node-feat-0.npy")
node_feat_0 = np.random.rand(num_nodes, 5)
print(f"Part of node feature [feat_0]: {node_feat_0[:3, :]}")
np.save(node_feat_0_path, node_feat_0)
print(f"Node feature [feat_0] is saved to {node_feat_0_path}\n")

# Generate another node feature in torch tensor
node_feat_1_path = os.path.join(base_dir, "node-feat-1.pt")
node_feat_1 = torch.rand(num_nodes, 5)
print(f"Part of node feature [feat_1]: {node_feat_1[:3, :]}")
torch.save(node_feat_1, node_feat_1_path)
print(f"Node feature [feat_1] is saved to {node_feat_1_path}\n")

# Generate edge feature in numpy array.
edge_feat_0_path = os.path.join(base_dir, "edge-feat-0.npy")
edge_feat_0 = np.random.rand(num_edges, 5)
print(f"Part of edge feature [feat_0]: {edge_feat_0[:3, :]}")
np.save(edge_feat_0_path, edge_feat_0)
print(f"Edge feature [feat_0] is saved to {edge_feat_0_path}\n")

# Generate another edge feature in torch tensor
edge_feat_1_path = os.path.join(base_dir, "edge-feat-1.pt")
edge_feat_1 = torch.rand(num_edges, 5)
print(f"Part of edge feature [feat_1]: {edge_feat_1[:3, :]}")
torch.save(edge_feat_1, edge_feat_1_path)
print(f"Edge feature [feat_1] is saved to {edge_feat_1_path}\n")


Part of node feature [feat_0]: [[0.18413199 0.62237674 0.35621369 0.95699183 0.90371447]
 [0.4342842  0.71157801 0.88077361 0.43984195 0.41271835]
 [0.44611919 0.65949997 0.19022741 0.85122299 0.64400144]]
Node feature [feat_0] is saved to ./ondisk_dataset_homograph/node-feat-0.npy

Part of node feature [feat_1]: tensor([[0.9071, 0.6124, 0.9781, 0.2970, 0.6244],
        [0.3461, 0.6524, 0.2934, 0.6486, 0.4537],
        [0.3431, 0.3436, 0.7397, 0.9123, 0.9990]])
Node feature [feat_1] is saved to ./ondisk_dataset_homograph/node-feat-1.pt

Part of edge feature [feat_0]: [[0.89959283 0.70484689 0.19200917 0.26794296 0.78545331]
 [0.05236168 0.74512754 0.27068731 0.32353999 0.73508868]
 [0.79725622 0.00224259 0.6553986  0.41038235 0.01512925]]
Edge feature [feat_0] is saved to ./ondisk_dataset_homograph/edge-feat-0.npy

Part of edge feature [feat_1]: tensor([[0.6208, 0.0047, 0.9682, 0.5679, 0.1674],
        [0.4977, 0.3262, 0.8332, 0.3400, 0.0043],
        [0.3815, 0.5794, 0.3996, 0.9817, 0

### Generate tasks
`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training, validation, and test sets. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and a **Link Prediction** task.

#### Node Classification Task
For a node classification task, we need **node IDs** and the corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets.

In [5]:
num_trains = int(num_nodes * 0.6)
num_vals = int(num_nodes * 0.2)
num_tests = num_nodes - num_trains - num_vals

ids = np.arange(num_nodes)
np.random.shuffle(ids)

nc_train_ids_path = os.path.join(base_dir, "nc-train-ids.npy")
nc_train_ids = ids[:num_trains]
print(f"Part of train ids for node classification: {nc_train_ids[:3]}")
np.save(nc_train_ids_path, nc_train_ids)
print(f"NC train ids are saved to {nc_train_ids_path}\n")

nc_train_labels_path = os.path.join(base_dir, "nc-train-labels.pt")
nc_train_labels = torch.randint(0, 10, (num_trains,))
print(f"Part of train labels for node classification: {nc_train_labels[:3]}")
torch.save(nc_train_labels, nc_train_labels_path)
print(f"NC train labels are saved to {nc_train_labels_path}\n")

nc_val_ids_path = os.path.join(base_dir, "nc-val-ids.npy")
nc_val_ids = ids[num_trains:num_trains+num_vals]
print(f"Part of val ids for node classification: {nc_val_ids[:3]}")
np.save(nc_val_ids_path, nc_val_ids)
print(f"NC val ids are saved to {nc_val_ids_path}\n")

nc_val_labels_path = os.path.join(base_dir, "nc-val-labels.pt")
nc_val_labels = torch.randint(0, 10, (num_vals,))
print(f"Part of val labels for node classification: {nc_val_labels[:3]}")
torch.save(nc_val_labels, nc_val_labels_path)
print(f"NC val labels are saved to {nc_val_labels_path}\n")

nc_test_ids_path = os.path.join(base_dir, "nc-test-ids.npy")
nc_test_ids = ids[-num_tests:]
print(f"Part of test ids for node classification: {nc_test_ids[:3]}")
np.save(nc_test_ids_path, nc_test_ids)
print(f"NC test ids are saved to {nc_test_ids_path}\n")

nc_test_labels_path = os.path.join(base_dir, "nc-test-labels.pt")
nc_test_labels = torch.randint(0, 10, (num_tests,))
print(f"Part of test labels for node classification: {nc_test_labels[:3]}")
torch.save(nc_test_labels, nc_test_labels_path)
print(f"NC test labels are saved to {nc_test_labels_path}\n")

Part of train ids for node classification: [ 57 855 928]
NC train ids are saved to ./ondisk_dataset_homograph/nc-train-ids.npy

Part of train labels for node classification: tensor([8, 0, 9])
NC train labels are saved to ./ondisk_dataset_homograph/nc-train-labels.pt

Part of val ids for node classification: [511 435 153]
NC val ids are saved to ./ondisk_dataset_homograph/nc-val-ids.npy

Part of val labels for node classification: tensor([7, 1, 3])
NC val labels are saved to ./ondisk_dataset_homograph/nc-val-labels.pt

Part of test ids for node classification: [181 421 489]
NC test ids are saved to ./ondisk_dataset_homograph/nc-test-ids.npy

Part of test labels for node classification: tensor([0, 3, 0])
NC test labels are saved to ./ondisk_dataset_homograph/nc-test-labels.pt
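
The split above slices one shuffled ID array, so the three sets should be disjoint and together cover every node. A minimal sketch of how that property could be checked is shown here; `check_partition` is a hypothetical helper introduced for illustration, not part of GraphBolt, and the IDs are regenerated with a fixed seed so the block runs on its own.

```python
import numpy as np

num_nodes = 1000
num_trains = int(num_nodes * 0.6)
num_vals = int(num_nodes * 0.2)
num_tests = num_nodes - num_trains - num_vals

# Shuffled node IDs, sliced the same way as in the tutorial cell.
rng = np.random.default_rng(0)
ids = rng.permutation(num_nodes)
train_ids = ids[:num_trains]
val_ids = ids[num_trains:num_trains + num_vals]
test_ids = ids[-num_tests:]


def check_partition(splits, total):
    """Return True if the splits are pairwise disjoint and cover range(total)."""
    merged = np.concatenate(splits)
    # Same size as the ID range and same sorted content implies no
    # duplicates across splits and no missing IDs.
    return bool(
        merged.size == total
        and np.array_equal(np.sort(merged), np.arange(total))
    )


is_partition = check_partition([train_ids, val_ids, test_ids], num_nodes)
print(f"Splits form a partition of the node IDs: {is_partition}")
```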



#### Link Prediction Task
For a link prediction task, we need **seeds**, or seeds together with the corresponding **labels** and **indexes**, for each training/validation/test set. The labels represent the positive/negative property of each seed, and the indexes represent the group each seed belongs to. Like feature data, numpy arrays and torch tensors are supported for these sets.
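
To make the layout concrete, here is a small worked sketch with two positive edges and two negative destinations per positive. The values are toy data chosen for illustration; the construction mirrors the validation and test sets built in the next cell, where positives come first, negatives follow, and each negative carries the index of the positive edge it was sampled for.

```python
import numpy as np

# Two positive edges (src, dst) and two negatives per positive; toy values only.
pos_seeds = np.array([[0, 1], [2, 3]])
num_negs = 2
neg_dsts = np.array([[5, 6], [7, 8]])  # one row of negative destinations per positive

num_pos = pos_seeds.shape[0]

# Each negative edge reuses the source node of its positive edge.
neg_srcs = np.repeat(pos_seeds[:, 0], num_negs)
neg_seeds = np.stack([neg_srcs, neg_dsts.reshape(-1)], axis=1)

# Positives first, then negatives.
seeds = np.concatenate([pos_seeds, neg_seeds])
# 1 marks a positive seed, 0 marks a negative seed.
labels = np.concatenate([np.ones(num_pos), np.zeros(num_pos * num_negs)])
# Group index: each negative shares the index of its positive edge.
indexes = np.concatenate(
    [np.arange(num_pos), np.repeat(np.arange(num_pos), num_negs)]
)

print(seeds.tolist())    # [[0, 1], [2, 3], [0, 5], [0, 6], [2, 7], [2, 8]]
print(labels.tolist())   # [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(indexes.tolist())  # [0, 1, 0, 0, 1, 1]
```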

In [6]:
num_trains = int(num_edges * 0.6)
num_vals = int(num_edges * 0.2)
num_tests = num_edges - num_trains - num_vals

lp_train_seeds_path = os.path.join(base_dir, "lp-train-seeds.npy")
lp_train_seeds = edges[:num_trains, :]
print(f"Part of train seeds for link prediction: {lp_train_seeds[:3]}")
np.save(lp_train_seeds_path, lp_train_seeds)
print(f"LP train seeds are saved to {lp_train_seeds_path}\n")

lp_val_seeds_path = os.path.join(base_dir, "lp-val-seeds.npy")
lp_val_seeds = edges[num_trains:num_trains+num_vals, :]
lp_val_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)
lp_val_neg_srcs = np.repeat(lp_val_seeds[:,0], 10)
lp_val_neg_seeds = np.concatenate((lp_val_neg_srcs, lp_val_neg_dsts)).reshape(2,-1).T
lp_val_seeds = np.concatenate((lp_val_seeds, lp_val_neg_seeds))
print(f"Part of val seeds for link prediction: {lp_val_seeds[:3]}")
np.save(lp_val_seeds_path, lp_val_seeds)
print(f"LP val seeds are saved to {lp_val_seeds_path}\n")

lp_val_labels_path = os.path.join(base_dir, "lp-val-labels.npy")
lp_val_labels = np.empty(num_vals * (10 + 1))
lp_val_labels[:num_vals] = 1
lp_val_labels[num_vals:] = 0
print(f"Part of val labels for link prediction: {lp_val_labels[:3]}")
np.save(lp_val_labels_path, lp_val_labels)
print(f"LP val labels are saved to {lp_val_labels_path}\n")

lp_val_indexes_path = os.path.join(base_dir, "lp-val-indexes.npy")
lp_val_indexes = np.arange(0, num_vals)
lp_val_neg_indexes = np.repeat(lp_val_indexes, 10)
lp_val_indexes = np.concatenate([lp_val_indexes, lp_val_neg_indexes])
print(f"Part of val indexes for link prediction: {lp_val_indexes[:3]}")
np.save(lp_val_indexes_path, lp_val_indexes)
print(f"LP val indexes are saved to {lp_val_indexes_path}\n")

lp_test_seeds_path = os.path.join(base_dir, "lp-test-seeds.npy")
lp_test_seeds = edges[-num_tests:, :]
lp_test_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)
lp_test_neg_srcs = np.repeat(lp_test_seeds[:,0], 10)
lp_test_neg_seeds = np.concatenate((lp_test_neg_srcs, lp_test_neg_dsts)).reshape(2,-1).T
lp_test_seeds = np.concatenate((lp_test_seeds, lp_test_neg_seeds))
print(f"Part of test seeds for link prediction: {lp_test_seeds[:3]}")
np.save(lp_test_seeds_path, lp_test_seeds)
print(f"LP test seeds are saved to {lp_test_seeds_path}\n")

lp_test_labels_path = os.path.join(base_dir, "lp-test-labels.npy")
lp_test_labels = np.empty(num_tests * (10 + 1))
lp_test_labels[:num_tests] = 1
lp_test_labels[num_tests:] = 0
print(f"Part of test labels for link prediction: {lp_test_labels[:3]}")
np.save(lp_test_labels_path, lp_test_labels)
print(f"LP test labels are saved to {lp_test_labels_path}\n")

lp_test_indexes_path = os.path.join(base_dir, "lp-test-indexes.npy")
lp_test_indexes = np.arange(0, num_tests)
lp_test_neg_indexes = np.repeat(lp_test_indexes, 10)
lp_test_indexes = np.concatenate([lp_test_indexes, lp_test_neg_indexes])
print(f"Part of test indexes for link prediction: {lp_test_indexes[:3]}")
np.save(lp_test_indexes_path, lp_test_indexes)
print(f"LP test indexes are saved to {lp_test_indexes_path}\n")

Part of train seeds for link prediction: [[645 874]
 [556 843]
 [ 91 996]]
LP train seeds are saved to ./ondisk_dataset_homograph/lp-train-seeds.npy

Part of val seeds for link prediction: [[570  78]
 [536 573]
 [923 351]]
LP val seeds are saved to ./ondisk_dataset_homograph/lp-val-seeds.npy

Part of val labels for link prediction: [1. 1. 1.]
LP val labels are saved to ./ondisk_dataset_homograph/lp-val-labels.npy

Part of val indexes for link prediction: [0 1 2]
LP val indexes are saved to ./ondisk_dataset_homograph/lp-val-indexes.npy

Part of test seeds for link prediction: [[ 69 972]
 [859 990]
 [512 580]]
LP test seeds are saved to ./ondisk_dataset_homograph/lp-test-seeds.npy

Part of test labels for link prediction: [1. 1. 1.]
LP test labels are saved to ./ondisk_dataset_homograph/lp-test-labels.npy

Part of test indexes for link prediction: [0 1 2]
LP test indexes are saved to ./ondisk_dataset_homograph/lp-test-indexes.npy



## Organize Data into YAML File
Now we need to create a `metadata.yaml` file that contains the paths and data types of the graph structure, feature data, and training/validation/test sets.

Notes:
- All paths should be relative to `metadata.yaml`.
- The following field is optional and is not specified in the example below.
  - `in_memory`: indicates whether to load data into memory or `mmap` it. Default is `True`.

Please refer to [YAML specification](https://github.com/dmlc/dgl/blob/master/docs/source/stochastic_training/ondisk-dataset-specification.rst) for more details.

In [7]:
yaml_content = f"""
    dataset_name: homogeneous_graph_nc_lp
    graph:
      nodes:
        - num: {num_nodes}
      edges:
        - format: csv
          path: {os.path.basename(edges_path)}
    feature_data:
      - domain: node
        name: feat_0
        format: numpy
        path: {os.path.basename(node_feat_0_path)}
      - domain: node
        name: feat_1
        format: torch
        path: {os.path.basename(node_feat_1_path)}
      - domain: edge
        name: feat_0
        format: numpy
        path: {os.path.basename(edge_feat_0_path)}
      - domain: edge
        name: feat_1
        format: torch
        path: {os.path.basename(edge_feat_1_path)}
    tasks:
      - name: node_classification
        num_classes: 10
        train_set:
          - data:
              - name: seeds
                format: numpy
                path: {os.path.basename(nc_train_ids_path)}
              - name: labels
                format: torch
                path: {os.path.basename(nc_train_labels_path)}
        validation_set:
          - data:
              - name: seeds
                format: numpy
                path: {os.path.basename(nc_val_ids_path)}
              - name: labels
                format: torch
                path: {os.path.basename(nc_val_labels_path)}
        test_set:
          - data:
              - name: seeds
                format: numpy
                path: {os.path.basename(nc_test_ids_path)}
              - name: labels
                format: torch
                path: {os.path.basename(nc_test_labels_path)}
      - name: link_prediction
        num_classes: 10
        train_set:
          - data:
              - name: seeds
                format: numpy
                path: {os.path.basename(lp_train_seeds_path)}
        validation_set:
          - data:
              - name: seeds
                format: numpy
                path: {os.path.basename(lp_val_seeds_path)}
              - name: labels
                format: numpy
                path: {os.path.basename(lp_val_labels_path)}
              - name: indexes
                format: numpy
                path: {os.path.basename(lp_val_indexes_path)}
        test_set:
          - data:
              - name: seeds
                format: numpy
                path: {os.path.basename(lp_test_seeds_path)}
              - name: labels
                format: numpy
                path: {os.path.basename(lp_test_labels_path)}
              - name: indexes
                format: numpy
                path: {os.path.basename(lp_test_indexes_path)}
"""
metadata_path = os.path.join(base_dir, "metadata.yaml")
with open(metadata_path, "w") as f:
  f.write(yaml_content)
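
Because every path is resolved relative to `metadata.yaml`, a quick check that each referenced file exists can catch mistakes before loading. The sketch below is one possible way to do that with the standard library only; `find_missing_paths` is a hypothetical helper, not a GraphBolt API, and it does a plain line scan for `path:` entries rather than parsing YAML. It is demonstrated on a toy directory so that it runs on its own.

```python
import os
import tempfile


def find_missing_paths(metadata_path):
    """Return the `path:` values in a metadata file that do not exist on disk.

    Paths are resolved relative to the directory containing the metadata file.
    This is a plain line scan, not a YAML parser, so it only sees lines that
    begin with `path:` after stripping indentation.
    """
    root = os.path.dirname(os.path.abspath(metadata_path))
    missing = []
    with open(metadata_path) as f:
        for line in f:
            stripped = line.strip()
            if stripped.startswith("path:"):
                rel_path = stripped[len("path:"):].strip()
                if not os.path.exists(os.path.join(root, rel_path)):
                    missing.append(rel_path)
    return missing


# Toy dataset directory: one file exists, one is deliberately absent.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "edges.csv"), "w").close()
demo_metadata = os.path.join(demo_dir, "metadata.yaml")
with open(demo_metadata, "w") as f:
    f.write("graph:\n  edges:\n    - format: csv\n      path: edges.csv\n")
    f.write("feature_data:\n  - domain: node\n    path: node-feat-0.npy\n")

missing = find_missing_paths(demo_metadata)
print(f"Missing files: {missing}")
```

In this tutorial, the same helper could be called as `find_missing_paths(metadata_path)`; an empty list would indicate that every referenced file was found.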

## Instantiate `OnDiskDataset`
Now we're ready to load the dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where the `metadata.yaml` file lies.

During the first instantiation, GraphBolt preprocesses the raw data, for example by constructing a `FusedCSCSamplingGraph` from the edges. All data, including the graph, feature data, and training/validation/test sets, is put into the `preprocessed` directory after preprocessing. Any subsequent dataset loading skips the preprocessing stage.

After preprocessing, `load()` must be called explicitly in order to load the graph, feature data, and tasks.

In [8]:
dataset = gb.OnDiskDataset(base_dir).load()
graph = dataset.graph
print(f"Loaded graph: {graph}\n")

feature = dataset.feature
print(f"Loaded feature store: {feature}\n")

tasks = dataset.tasks
nc_task = tasks[0]
print(f"Loaded node classification task: {nc_task}\n")
lp_task = tasks[1]
print(f"Loaded link prediction task: {lp_task}\n")

The on-disk dataset is re-preprocessing, so the existing preprocessed dataset has been removed.
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
Loaded graph: FusedCSCSamplingGraph(csc_indptr=tensor([    0,    12,    22,  ...,  9977,  9989, 10000], dtype=torch.int32),
                      indices=tensor([622,  91, 498,  ..., 203, 162, 681], dtype=torch.int32),
                      total_num_nodes=1000, num_edges=10000,)

Loaded feature store: TorchBasedFeatureStore(
    {(<OnDiskFeatureDataDomain.NODE: 'node'>, None, 'feat_0'): TorchBasedFeature(
        feature=tensor([[0.1841, 0.6224, 0.3562, 0.9570, 0.9037],
                        [0.4343, 0.7116, 0.8808, 0.4398, 0.4127],
                        [0.4461, 0.6595, 0.1902, 0.8512, 0.6440],
                        ...,
                        [0.6623, 0.0730, 0.5594, 0.2819, 0.7494],
                        [0.7507, 0.8814, 0.8831, 0.4750, 0.5390],
                        [0.8160, 0.1719, 0.2640, 0.8

