# OnDiskDataset for Homogeneous Graph

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb)

This tutorial shows how to create `OnDiskDataset` for homogeneous graph that could be used in **GraphBolt** framework.

By the end of this tutorial, you will be able to

- organize graph structure data.
- organize feature data.
- organize training/validation/test set for specific tasks.

To create an ``OnDiskDataset`` object, you need to organize all the data including graph structure, feature data and tasks into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.

Now let's generate various data step by step and organize them together to instantiate `OnDiskDataset` finally.

## Install DGL package

In [1]:
# Install required packages.
import os
import torch
import numpy as np
os.environ['TORCH'] = torch.__version__
os.environ['DGLBACKEND'] = "pytorch"

# Install the CPU version.
device = torch.device("cpu")
!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html

try:
    import dgl
    import dgl.graphbolt as gb
    installed = True
except ImportError as error:
    installed = False
    print(error)
print("DGL installed!" if installed else "DGL not found!")

Looking in links: https://data.dgl.ai/wheels-test/repo.html








DGL installed!


## Data preparation
In order to demonstrate how to organize various data, let's create a base directory first.

In [2]:
base_dir = './ondisk_dataset_homograph'
os.makedirs(base_dir, exist_ok=True)
print(f"Created base directory: {base_dir}")

Created base directory: ./ondisk_dataset_homograph


### Generate graph structure data
For homogeneous graph, we just need to save edges(namely node pairs) into **CSV** file.

Note:
when saving to file, do not save index and header.


In [3]:
import numpy as np
import pandas as pd
num_nodes = 1000
num_edges = 10 * num_nodes
edges_path = os.path.join(base_dir, "edges.csv")
edges = np.random.randint(0, num_nodes, size=(num_edges, 2))

print(f"Part of edges: {edges[:5, :]}")

df = pd.DataFrame(edges)
df.to_csv(edges_path, index=False, header=False)

print(f"Edges are saved into {edges_path}")

Part of edges: [[916  53]
 [ 43 379]
 [685 876]
 [225 743]
 [919 249]]
Edges are saved into ./ondisk_dataset_homograph/edges.csv


### Generate feature data for graph
For feature data, numpy arrays and torch tensors are supported for now.

In [4]:
# Generate node feature in numpy array.
node_feat_0_path = os.path.join(base_dir, "node-feat-0.npy")
node_feat_0 = np.random.rand(num_nodes, 5)
print(f"Part of node feature [feat_0]: {node_feat_0[:3, :]}")
np.save(node_feat_0_path, node_feat_0)
print(f"Node feature [feat_0] is saved to {node_feat_0_path}\n")

# Generate another node feature in torch tensor
node_feat_1_path = os.path.join(base_dir, "node-feat-1.pt")
node_feat_1 = torch.rand(num_nodes, 5)
print(f"Part of node feature [feat_1]: {node_feat_1[:3, :]}")
torch.save(node_feat_1, node_feat_1_path)
print(f"Node feature [feat_1] is saved to {node_feat_1_path}\n")

# Generate edge feature in numpy array.
edge_feat_0_path = os.path.join(base_dir, "edge-feat-0.npy")
edge_feat_0 = np.random.rand(num_edges, 5)
print(f"Part of edge feature [feat_0]: {edge_feat_0[:3, :]}")
np.save(edge_feat_0_path, edge_feat_0)
print(f"Edge feature [feat_0] is saved to {edge_feat_0_path}\n")

# Generate another edge feature in torch tensor
edge_feat_1_path = os.path.join(base_dir, "edge-feat-1.pt")
edge_feat_1 = torch.rand(num_edges, 5)
print(f"Part of edge feature [feat_1]: {edge_feat_1[:3, :]}")
torch.save(edge_feat_1, edge_feat_1_path)
print(f"Edge feature [feat_1] is saved to {edge_feat_1_path}\n")


Part of node feature [feat_0]: [[0.07506874 0.97412832 0.14989629 0.68315846 0.72461738]
 [0.97415705 0.0221444  0.88130406 0.93071116 0.85560486]
 [0.09422079 0.76017339 0.90975754 0.0252611  0.2286788 ]]
Node feature [feat_0] is saved to ./ondisk_dataset_homograph/node-feat-0.npy

Part of node feature [feat_1]: tensor([[0.4850, 0.7368, 0.8878, 0.4458, 0.6004],
        [0.8962, 0.4395, 0.7174, 0.4299, 0.5414],
        [0.1950, 0.0906, 0.9257, 0.3184, 0.8078]])
Node feature [feat_1] is saved to ./ondisk_dataset_homograph/node-feat-1.pt

Part of edge feature [feat_0]: [[0.16029718 0.73791084 0.49150916 0.07443747 0.56202794]
 [0.59497222 0.38449722 0.44229384 0.3990971  0.55965944]
 [0.91664484 0.6356691  0.21319406 0.85963877 0.56662854]]
Edge feature [feat_0] is saved to ./ondisk_dataset_homograph/edge-feat-0.npy

Part of edge feature [feat_1]: tensor([[0.5766, 0.6273, 0.4371, 0.6038, 0.5810],
        [0.6287, 0.9752, 0.7301, 0.5222, 0.3633],
        [0.4078, 0.3468, 0.2190, 0.7907, 0

### Generate tasks
`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task.

#### Node Classification Task
For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets.

In [5]:
num_trains = int(num_nodes * 0.6)
num_vals = int(num_nodes * 0.2)
num_tests = num_nodes - num_trains - num_vals

ids = np.arange(num_nodes)
np.random.shuffle(ids)

nc_train_ids_path = os.path.join(base_dir, "nc-train-ids.npy")
nc_train_ids = ids[:num_trains]
print(f"Part of train ids for node classification: {nc_train_ids[:3]}")
np.save(nc_train_ids_path, nc_train_ids)
print(f"NC train ids are saved to {nc_train_ids_path}\n")

nc_train_labels_path = os.path.join(base_dir, "nc-train-labels.pt")
nc_train_labels = torch.randint(0, 10, (num_trains,))
print(f"Part of train labels for node classification: {nc_train_labels[:3]}")
torch.save(nc_train_labels, nc_train_labels_path)
print(f"NC train labels are saved to {nc_train_labels_path}\n")

nc_val_ids_path = os.path.join(base_dir, "nc-val-ids.npy")
nc_val_ids = ids[num_trains:num_trains+num_vals]
print(f"Part of val ids for node classification: {nc_val_ids[:3]}")
np.save(nc_val_ids_path, nc_val_ids)
print(f"NC val ids are saved to {nc_val_ids_path}\n")

nc_val_labels_path = os.path.join(base_dir, "nc-val-labels.pt")
nc_val_labels = torch.randint(0, 10, (num_vals,))
print(f"Part of val labels for node classification: {nc_val_labels[:3]}")
torch.save(nc_val_labels, nc_val_labels_path)
print(f"NC val labels are saved to {nc_val_labels_path}\n")

nc_test_ids_path = os.path.join(base_dir, "nc-test-ids.npy")
nc_test_ids = ids[-num_tests:]
print(f"Part of test ids for node classification: {nc_test_ids[:3]}")
np.save(nc_test_ids_path, nc_test_ids)
print(f"NC test ids are saved to {nc_test_ids_path}\n")

nc_test_labels_path = os.path.join(base_dir, "nc-test-labels.pt")
nc_test_labels = torch.randint(0, 10, (num_tests,))
print(f"Part of test labels for node classification: {nc_test_labels[:3]}")
torch.save(nc_test_labels, nc_test_labels_path)
print(f"NC test labels are saved to {nc_test_labels_path}\n")

Part of train ids for node classification: [130 434  38]
NC train ids are saved to ./ondisk_dataset_homograph/nc-train-ids.npy

Part of train labels for node classification: tensor([9, 7, 9])
NC train labels are saved to ./ondisk_dataset_homograph/nc-train-labels.pt

Part of val ids for node classification: [643 843 846]
NC val ids are saved to ./ondisk_dataset_homograph/nc-val-ids.npy

Part of val labels for node classification: tensor([1, 2, 0])
NC val labels are saved to ./ondisk_dataset_homograph/nc-val-labels.pt

Part of test ids for node classification: [636 660 645]
NC test ids are saved to ./ondisk_dataset_homograph/nc-test-ids.npy

Part of test labels for node classification: tensor([3, 9, 6])
NC test labels are saved to ./ondisk_dataset_homograph/nc-test-labels.pt



#### Link Prediction Task
For link prediction task, we need **node pairs** or **negative src/dsts** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets.

In [6]:
num_trains = int(num_edges * 0.6)
num_vals = int(num_edges * 0.2)
num_tests = num_edges - num_trains - num_vals

lp_train_node_pairs_path = os.path.join(base_dir, "lp-train-node-pairs.npy")
lp_train_node_pairs = edges[:num_trains, :]
print(f"Part of train node pairs for link prediction: {lp_train_node_pairs[:3]}")
np.save(lp_train_node_pairs_path, lp_train_node_pairs)
print(f"LP train node pairs are saved to {lp_train_node_pairs_path}\n")

lp_val_node_pairs_path = os.path.join(base_dir, "lp-val-node-pairs.npy")
lp_val_node_pairs = edges[num_trains:num_trains+num_vals, :]
print(f"Part of val node pairs for link prediction: {lp_val_node_pairs[:3]}")
np.save(lp_val_node_pairs_path, lp_val_node_pairs)
print(f"LP val node pairs are saved to {lp_val_node_pairs_path}\n")

lp_val_neg_dsts_path = os.path.join(base_dir, "lp-val-neg-dsts.pt")
lp_val_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))
print(f"Part of val negative dsts for link prediction: {lp_val_neg_dsts[:3]}")
torch.save(lp_val_neg_dsts, lp_val_neg_dsts_path)
print(f"LP val negative dsts are saved to {lp_val_neg_dsts_path}\n")

lp_test_node_pairs_path = os.path.join(base_dir, "lp-test-node-pairs.npy")
lp_test_node_pairs = edges[-num_tests, :]
print(f"Part of test node pairs for link prediction: {lp_test_node_pairs[:3]}")
np.save(lp_test_node_pairs_path, lp_test_node_pairs)
print(f"LP test node pairs are saved to {lp_test_node_pairs_path}\n")

lp_test_neg_dsts_path = os.path.join(base_dir, "lp-test-neg-dsts.pt")
lp_test_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))
print(f"Part of test negative dsts for link prediction: {lp_test_neg_dsts[:3]}")
torch.save(lp_test_neg_dsts, lp_test_neg_dsts_path)
print(f"LP test negative dsts are saved to {lp_test_neg_dsts_path}\n")

Part of train node pairs for link prediction: [[916  53]
 [ 43 379]
 [685 876]]
LP train node pairs are saved to ./ondisk_dataset_homograph/lp-train-node-pairs.npy

Part of val node pairs for link prediction: [[ 72 918]
 [687  98]
 [862 193]]
LP val node pairs are saved to ./ondisk_dataset_homograph/lp-val-node-pairs.npy

Part of val negative dsts for link prediction: tensor([[585, 352, 225, 623, 161, 134, 844,  63, 875,  79],
        [370, 399, 479, 586, 684,   8, 593, 513,  76, 953],
        [108, 971, 282, 230, 907, 891, 554,  54, 208,  21]])
LP val negative dsts are saved to ./ondisk_dataset_homograph/lp-val-neg-dsts.pt

Part of test node pairs for link prediction: [965 219]
LP test node pairs are saved to ./ondisk_dataset_homograph/lp-test-node-pairs.npy

Part of test negative dsts for link prediction: tensor([[417, 762, 962, 146, 162, 845, 403, 265, 572, 157],
        [885, 962, 214, 346, 631, 308, 647,   6,  38, 580],
        [834, 170, 209,  35, 775, 853, 486, 910, 236, 122]])


## Organize Data into YAML File
Now we need to create a `metadata.yaml` file which contains the paths, dadta types of graph structure, feature data, training/validation/test sets.

Notes:
- all path should be relative to `metadata.yaml`.
- Below fields are optional and not specified in below example.
  - `in_memory`: indicates whether to load dada into memory or `mmap`. Default is `True`.

Please refer to [YAML specification](https://github.com/dmlc/dgl/blob/master/docs/source/stochastic_training/ondisk-dataset-specification.rst) for more details.

In [7]:
yaml_content = f"""
    dataset_name: homogeneous_graph_nc_lp
    graph:
      nodes:
        - num: {num_nodes}
      edges:
        - format: csv
          path: {os.path.basename(edges_path)}
    feature_data:
      - domain: node
        name: feat_0
        format: numpy
        path: {os.path.basename(node_feat_0_path)}
      - domain: node
        name: feat_1
        format: torch
        path: {os.path.basename(node_feat_1_path)}
      - domain: edge
        name: feat_0
        format: numpy
        path: {os.path.basename(edge_feat_0_path)}
      - domain: edge
        name: feat_1
        format: torch
        path: {os.path.basename(edge_feat_1_path)}
    tasks:
      - name: node_classification
        num_classes: 10
        train_set:
          - data:
              - name: seed_nodes
                format: numpy
                path: {os.path.basename(nc_train_ids_path)}
              - name: labels
                format: torch
                path: {os.path.basename(nc_train_labels_path)}
        validation_set:
          - data:
              - name: seed_nodes
                format: numpy
                path: {os.path.basename(nc_val_ids_path)}
              - name: labels
                format: torch
                path: {os.path.basename(nc_val_labels_path)}
        test_set:
          - data:
              - name: seed_nodes
                format: numpy
                path: {os.path.basename(nc_test_ids_path)}
              - name: labels
                format: torch
                path: {os.path.basename(nc_test_labels_path)}
      - name: link_prediction
        num_classes: 10
        train_set:
          - data:
              - name: node_pairs
                format: numpy
                path: {os.path.basename(lp_train_node_pairs_path)}
        validation_set:
          - data:
              - name: node_pairs
                format: numpy
                path: {os.path.basename(lp_val_node_pairs_path)}
              - name: negative_dsts
                format: torch
                path: {os.path.basename(lp_val_neg_dsts_path)}
        test_set:
          - data:
              - name: node_pairs
                format: numpy
                path: {os.path.basename(lp_test_node_pairs_path)}
              - name: negative_dsts
                format: torch
                path: {os.path.basename(lp_test_neg_dsts_path)}
"""
metadata_path = os.path.join(base_dir, "metadata.yaml")
with open(metadata_path, "w") as f:
  f.write(yaml_content)

## Instantiate `OnDiskDataset`
Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.

During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.

After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks.

In [8]:
dataset = gb.OnDiskDataset(base_dir).load()
graph = dataset.graph
print(f"Loaded graph: {graph}\n")

feature = dataset.feature
print(f"Loaded feature store: {feature}\n")

tasks = dataset.tasks
nc_task = tasks[0]
print(f"Loaded node classification task: {nc_task}\n")
lp_task = tasks[1]
print(f"Loaded link prediction task: {lp_task}\n")

The dataset is already preprocessed.
Loaded graph: FusedCSCSamplingGraph(csc_indptr=tensor([    0,    12,    21,  ...,  9980,  9990, 10000], dtype=torch.int32),
                      indices=tensor([868, 783, 112,  ..., 519, 834,  14], dtype=torch.int32),
                      num_nodes=1000, num_edges=10000)

Loaded feature store: TorchBasedFeatureStore{(<OnDiskFeatureDataDomain.NODE: 'node'>, None, 'feat_0'): TorchBasedFeature(feature=tensor([[0.2942, 0.6908, 0.3524, 0.0982, 0.7380],
                                                        [0.5107, 0.4093, 0.5527, 0.3593, 0.9603],
                                                        [0.8525, 0.9824, 0.2857, 0.7608, 0.6530],
                                                        ...,
                                                        [0.5683, 0.3722, 0.6206, 0.9546, 0.4787],
                                                        [0.6621, 0.8055, 0.7307, 0.9558, 0.6780],
                                                        

  return torch.load(graph_topology.path)
