# PyG Dataset

## torch_geometric.datasets
* Contains a variety of common graph datasets.
* Each PyG dataset stores a list of *torch_geometric.data.Data* objects, each of which represents a single graph.

In [None]:
# example: load TUDataset dataset
from torch_geometric.datasets import TUDataset

root = './enzymes'
name = 'ENZYMES'
# The ENZYMES dataset (downloaded to `root` on first use)
pyg_dataset = TUDataset(root, name)

In [None]:
# number of graphs in the dataset
len(pyg_dataset)

# number of classes for that dataset
pyg_dataset.num_classes

# number of features for that dataset
pyg_dataset.num_features

## torch_geometric.data
* Provides the data handling of graphs as PyTorch tensors.
* *torch_geometric.data.Data* is a data object describing a homogeneous graph. It can hold node-level, link-level and graph-level attributes.

Attributes:
* data.x: Node feature matrix with shape [num_nodes, num_node_features]

* data.edge_index: Graph connectivity in COO (coordinate) format with shape [2, num_edges] and type torch.long

* data.edge_attr: Edge feature matrix with shape [num_edges, num_edge_features]

* data.y: Target to train against (may have arbitrary shape), e.g., node-level targets of shape [num_nodes, *] or graph-level targets of shape [1, *]

* data.pos: Node position matrix with shape [num_nodes, num_dimensions]
* data.is_undirected(): checks whether the graph is undirected
* data.train_mask (optional): denotes which nodes to train against
* data.val_mask (optional): denotes which nodes to use for validation
* data.test_mask (optional): denotes which nodes to test against
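
To make the COO layout of `edge_index` concrete, here is a minimal plain-Python sketch that needs no PyG install. The helper names `to_coo` and `is_undirected` are illustrative assumptions and are not PyG's implementation; PyG stores the same two rows as a `torch.long` tensor of shape [2, num_edges].

```python
def to_coo(edges):
    """Turn a list of (source, target) pairs into two parallel rows."""
    sources = [s for s, _ in edges]
    targets = [t for _, t in edges]
    return [sources, targets]  # shape [2, num_edges]


def is_undirected(edge_index):
    """A graph is undirected if every edge (s, t) also appears as (t, s)."""
    pairs = set(zip(edge_index[0], edge_index[1]))
    return all((t, s) in pairs for s, t in pairs)


# A 3-node path graph 0 - 1 - 2, with each undirected edge stored in both directions
edge_index = to_coo([(0, 1), (1, 0), (1, 2), (2, 1)])
print(edge_index)                 # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(is_undirected(edge_index))  # True
```

Storing both directions of every edge is what makes a graph undirected in this representation, which is why the number of columns in `edge_index` is twice the number of undirected edges.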

In [None]:
# select a single graph (a torch_geometric.data.Data object) by index
idx = 0
data = pyg_dataset[idx]

## Open Graph Benchmark (OGB)

* The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. 
* Its datasets are automatically downloaded, processed, and split using the OGB Data Loader. 
* The model performance can then be evaluated by using the OGB Evaluator in a unified manner.
* OGB also provides dataset classes that are compatible with PyG's dataset and data classes (e.g. `PygNodePropPredDataset`).

In [None]:
# 'ogbn-arxiv' dataset
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator


dataset_name = 'ogbn-arxiv'
dataset = PygNodePropPredDataset(name=dataset_name,
                                  transform=T.ToSparseTensor())
data = dataset[0]

# Make the adjacency matrix symmetric (it is stored as a sparse tensor)
data.adj_t = data.adj_t.to_symmetric()


# a dictionary of train/valid/test splits, given as node indices (not boolean masks)
split_idx = dataset.get_idx_split()



In [None]:
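from ogb.graphproppred import Evaluator  # graph-level evaluator, needed for 'ogbg-*' datasets
from ogb.graphproppred.mol_encoder import AtomEncoder  # embeds raw atom features

evaluator = Evaluator(name='ogbg-molhiv')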

# Dataloader

In [None]:
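from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.loader import DataLoader  # PyG >= 2.0 (formerly torch_geometric.data.DataLoader)

# Assumption: a graph-level dataset, so that the split indices refer to whole graphs.
# (The node-level 'ogbn-arxiv' split indexes nodes and cannot be used to slice a dataset.)
graph_dataset = PygGraphPropPredDataset(name='ogbg-molhiv')
graph_split_idx = graph_dataset.get_idx_split()

# example of train/val/test split, based on the fixed split indices
train_loader = DataLoader(graph_dataset[graph_split_idx["train"]], batch_size=32, shuffle=True, num_workers=0)
valid_loader = DataLoader(graph_dataset[graph_split_idx["valid"]], batch_size=32, shuffle=False, num_workers=0)
test_loader = DataLoader(graph_dataset[graph_split_idx["test"]], batch_size=32, shuffle=False, num_workers=0)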

# Graph Mini-Batching

In order to parallelize the processing of a mini-batch of graphs, PyG combines the graphs into a single disconnected graph data object (*torch_geometric.data.Batch*). 

*torch_geometric.data.Batch* inherits from *torch_geometric.data.Data* and contains an additional attribute called `batch`. 

The `batch` attribute is a vector mapping each node to the index of its corresponding graph within the mini-batch:

    batch = [0, ..., 0, 1, ..., n - 2, n - 1, ..., n - 1]

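This attribute records which graph each node belongs to and can be used, for example, to average the node embeddings of each graph separately to obtain graph-level embeddings.

Commonly used attributes of a batch:
* batch.y: targets of all graphs in the mini-batch
* batch.x: node features of all graphs, concatenated along the node dimension
* batch.batch: batch vector which assigns each node to a specific example (graph)
* batch.edge_index: edges of all graphs, with node indices shifted so that the graphs stay disconnected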
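
The batching scheme described above can be sketched in plain Python: node features are concatenated, each graph's edge indices are shifted by the number of nodes that precede it, and the `batch` vector records which graph every node came from. The function `collate_graphs` and the dictionary layout are illustrative assumptions, not PyG's actual `Batch` implementation.

```python
def collate_graphs(graphs):
    """Merge several graphs into one disconnected graph.

    Each graph is a dict with 'x' (a list of node feature lists) and
    'edge_index' ([sources, targets] in COO format).
    """
    x, sources, targets, batch = [], [], [], []
    offset = 0  # number of nodes already placed in the merged graph
    for graph_id, g in enumerate(graphs):
        num_nodes = len(g["x"])
        x.extend(g["x"])
        # Shift node ids so they stay unique in the merged graph
        sources.extend(s + offset for s in g["edge_index"][0])
        targets.extend(t + offset for t in g["edge_index"][1])
        batch.extend([graph_id] * num_nodes)
        offset += num_nodes
    return {"x": x, "edge_index": [sources, targets], "batch": batch}


g0 = {"x": [[1.0], [2.0]], "edge_index": [[0, 1], [1, 0]]}
g1 = {"x": [[3.0], [4.0], [5.0]], "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]]}
merged = collate_graphs([g0, g1])
print(merged["batch"])       # [0, 0, 1, 1, 1]
print(merged["edge_index"])  # [[0, 1, 2, 3, 3, 4], [1, 0, 3, 2, 4, 3]]
```

Because no edge connects nodes of different graphs, message passing on the merged graph gives the same result as processing each graph separately.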

## Global Pooling Layers
PyG [global pooling layers](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers) (partial list):
* global_add_pool: returns batch-wise graph-level outputs by adding node features across the node dimension, so that for a single graph $G_i$ with $N_i$ nodes and node features $\mathbf{x}_n$ its output is computed by $\mathbf{r}_i = \sum_{n=1}^{N_i} \mathbf{x}_n$
* global_mean_pool: returns batch-wise graph-level outputs by averaging node features across the node dimension, so that for a single graph $G_i$ its output is computed by $\mathbf{r}_i = \frac{1}{N_i} \sum_{n=1}^{N_i} \mathbf{x}_n$
* global_max_pool: returns batch-wise graph-level outputs by taking the channel-wise maximum across the node dimension, so that for a single graph $G_i$ its output is computed by $\mathbf{r}_i = \max_{n=1}^{N_i} \mathbf{x}_n$
* GlobalAttention: global soft attention layer from the [“Gated Graph Sequence Neural Networks”](https://arxiv.org/pdf/1511.05493.pdf) paper
* Set2Set: global pooling operator based on iterative content-based attention from the [“Order Matters: Sequence to sequence for sets”](https://arxiv.org/pdf/1511.06391.pdf) paper
* GraphMultisetTransformer: global Graph Multiset Transformer pooling operator from the [“Accurate Learning of Graph Representations with Graph Multiset Pooling”](https://arxiv.org/pdf/2102.11533.pdf) paper
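
As a rough illustration of how the add, mean and max pooling operators use the `batch` vector, the plain-Python sketch below groups node feature rows by graph and reduces each channel. The helper `global_pool` is a hypothetical name, not a PyG function, and it assumes every graph in the batch has at least one node.

```python
def global_pool(x, batch, reduce):
    """Group node feature rows by graph id, then reduce each channel."""
    num_graphs = max(batch) + 1
    groups = [[] for _ in range(num_graphs)]
    for row, graph_id in zip(x, batch):
        groups[graph_id].append(row)
    # zip(*rows) iterates over the channels (columns) of one graph's nodes
    return [[reduce(channel) for channel in zip(*rows)] for rows in groups]


def mean(values):
    return sum(values) / len(values)


x = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]  # 3 nodes, 2 channels
batch = [0, 0, 1]                         # nodes 0 and 1 -> graph 0, node 2 -> graph 1

print(global_pool(x, batch, sum))   # [[4.0, 6.0], [5.0, 0.0]]
print(global_pool(x, batch, mean))  # [[2.0, 3.0], [5.0, 0.0]]
print(global_pool(x, batch, max))   # [[3.0, 4.0], [5.0, 0.0]]
```

Each call returns one row per graph, so the output shape is [num_graphs, num_channels] regardless of how many nodes each graph contains.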

# Cora Dataset
* The Cora dataset is a citation network benchmark.
* In this dataset, nodes correspond to documents and edges correspond to citations between them, treated as undirected.
* Each node (document) in the graph is assigned a class label and a feature vector based on the document's binarized bag-of-words representation.
* The Cora graph has 2708 nodes, 5429 edges, 7 prediction classes, and 1433 features per node. 
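
A binarized bag-of-words vector has one entry per vocabulary term, set to 1 if the term occurs in the document and 0 otherwise. The sketch below uses a hypothetical four-word vocabulary and a hypothetical helper `binary_bag_of_words` purely for illustration; Cora's actual vocabulary has 1433 terms, and its preprocessing is not reproduced here.

```python
def binary_bag_of_words(document, vocabulary):
    """Return a 0/1 vector marking which vocabulary terms occur in the document."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Hypothetical vocabulary, for illustration only
vocabulary = ["graph", "neural", "network", "kernel"]
features = binary_bag_of_words("Graph neural network models for graph data", vocabulary)
print(features)  # [1, 1, 1, 0]
```

Note that "graph" appears twice in the example document but still maps to 1, since only presence is recorded, not the count.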