The code mainly requires the following packages to run:
- Python >= 3.9
- PyTorch (tested on 1.10.2)
- DGL (tested on 0.8.0post1)
- NumPy (tested on 1.20.3)
- SciPy (tested on 1.7.3)
- scikit-learn (tested on 1.0.2)
- pandas (tested on 1.3.5)
- ogb (tested on 1.3.2)
- DVC (tested on 2.9.5)
- Signac (tested on 1.7.0)
A complete list of packages installed in the conda environment for experiments can be found in `/envs/conda/scaling_GNNs-GPU.yml`.
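To set up a matching environment, the exported specification can be recreated with conda. This is a minimal sketch, assuming the command is run from the repository root and that the YAML file defines an environment named `scaling_GNNs-GPU` (the actual name is taken from the `name:` field inside the file):

```bash
# Recreate the conda environment from the exported specification.
conda env create -f envs/conda/scaling_GNNs-GPU.yml

# Activate it (replace "scaling_GNNs-GPU" with the name defined in the YAML file).
conda activate scaling_GNNs-GPU
```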
The code uses DVC and Signac to record and manage experiments under different hyperparameters. Default training and model hyperparameters are defined in `params.yaml`.
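For orientation, a hypothetical excerpt of `params.yaml` is sketched below; the actual keys and default values in the repository may differ, but the dotted paths used with `-S` overrides later in this document (e.g. `train.gpu`, `models.GCN.n_epochs`, `distributed.mode`) follow this kind of nesting:

```yaml
# Hypothetical excerpt of params.yaml (illustrative only; see the file in the
# repository for the authoritative keys and default values).
model: GCN            # which GNN model to train
train:
  gpu: 0              # a single GPU ID, or a list of IDs for multi-GPU training
models:
  GCN:
    n_epochs: 40      # per-model hyperparameters, overridable via -S
distributed:
  mode: null          # null for single-machine training, AvgCluster for model aggregation
  world_size: 2
  rank: 0
  gpu: [0, 1]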
To train a GNN method on a single GPU under the default hyperparameters, use the following command:

```bash
dvc exp run --temp gnn_links -S model=${model}
```
For example, to train GCN under the default hyperparameters, use the following command:

```bash
dvc exp run --temp gnn_links -S model=GCN
```
To override the default parameters in `params.yaml` on the fly (e.g., to use GPU #2 for the experiment and train for 40 epochs), use the following command:

```bash
dvc exp run --temp gnn_links -S "train.gpu=2" -S "models.GCN.n_epochs=40"
```
Please refer to `params.yaml` for a complete list of parameters, and to the DVC documentation for additional usage of the DVC experiment interface.
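For instance, DVC's built-in experiment subcommands can be used to inspect and compare finished runs. A minimal sketch (experiment names are assigned by DVC at run time):

```bash
# List experiments together with their parameters and metrics.
dvc exp show

# Promote a chosen experiment's results into the workspace
# (replace <exp-name> with a name reported by `dvc exp show`).
dvc exp apply <exp-name>
```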
To train a GNN method on multiple GPUs of a single machine, specify a list of GPU IDs to use as follows:

```bash
dvc exp run --temp gnn_links -S model=${model} -S "train.gpu=${GPU_IDs}"
```
For example, to train GCN on GPUs with IDs 0 through 7, use the following command:

```bash
dvc exp run --temp gnn_links -S model=GCN -S "train.gpu=[0,1,2,3,4,5,6,7]"
```
The usage of other parameters is the same as for single-GPU training. See `bin/` for example scripts to replicate the experiments on each dataset.
For model aggregation training, set the `distributed.mode` entry in `params.yaml` to `AvgCluster`. When `distributed.mode` is set to `null`, the code will instead train on a single GPU or on multiple GPUs of a single machine, as described above.
Before running the code in distributed mode, each participant should have its own copy of the code in a separate folder that is not shared with the other clients. The address and port used to connect to the server (the rank-0 participant) should be set in the shared environment file `src/dist_environ.json` as follows:

```json
{
    "MASTER_PORT": 12345,
    "MASTER_ADDR": "172.31.xx.xxx",
    "RANK": 0,
    "TCPSTORE_PORT": 1234,
    "KMP_DUPLICATE_LIB_OK": true
}
```
The rank of each participant can be set in the local environment file `src/__environ__.json`:

```json
{
    "RANK": 0
}
```
This local environment file is not synced through Git, but you can track `src/dist_environ.json` with Git because it is supposed to be identical on all machines. The rank number should be within `range(world_size)` as set on the server.
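For example, on a trainer machine that should join as rank 1, the local environment file could be written as follows (a sketch; adjust the rank to each machine's role):

```bash
# Assign rank 1 to this machine; the file stays local and is not tracked by Git.
echo '{"RANK": 1}' > src/__environ__.json
```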
Please refer to the PyTorch distributed documentation for additional information on setting these parameters.
Make sure you have finished preparing the distributed environment before starting distributed training.
- Each participant (server or trainer) should run in its own folder.
- To start the server (rank 0), follow the "Train GNN models on Multiple GPUs of a single machine" section, but with `distributed.mode` set to `AvgCluster`.
- The trainers can be started with the same command as the server. However, any hyperparameters specified differently on a client machine will be ignored, since the trainers always follow the hyperparameters used on the server; please make sure to specify the hyperparameters on the server.
For example, suppose we have a server with rank 0 and a trainer with rank 1. The server should run the following command:

```bash
dvc exp run --temp gnn_links -S model=GCN \
    -S "distributed.mode=AvgCluster" \
    -S "distributed.gpu=[0,1]" \
    -S "distributed.world_size=2" \
    -S "distributed.rank=0"
```

And the trainer should run the following command:

```bash
dvc exp run --temp gnn_links -S model=GCN \
    -S "distributed.mode=AvgCluster" \
    -S "distributed.world_size=2" \
    -S "distributed.rank=1"
```
If `distributed.rank` is not specified on the command line, it will be loaded from the environment variables (e.g., if it is set in `src/__environ__.json`). See `bin/` for example scripts to replicate the experiments on each dataset.
- `params.yaml`: define default training and model hyperparameters.
- `src/`: folder for source code.
  - `src/train_links.py`: main entry for link prediction tasks.
  - `src/dist_environ.json`: global environment file shared by all machines for distributed training.
  - `src/__environ__.json`: local environment file for specifying the rank of the machine in distributed training. Not synced through Git.
  - `src/models`: implementation of GNN models.
  - `src/pipelines`: pipelines for initialization, training and evaluation.
  - `src/distributed`: modules for MultiGPU and time-based model aggregation training.
  - `src/dataloading`: modules for data loading and preprocessing.
  - `src/samplers`: modules for customized mini-batch and graph samplers.
- `bin/`: bash scripts for replicating experiments.
- `workspace/`: workspace folder for experiment jobs created by Signac.
- `dataset/`: dataset used in experiments.
- `dvc.yaml`: DVC configuration file.
Contact Jiong Zhu (jiongzhu@, or jiongzhu@umich.edu) or Aishwarya Reganti (areganti@amazon.com) if you have any questions about the code.