
amazon-science/random-tma

Distributed Scalable GNN Training through Time-based Model Aggregation and Randomized Partitions

Requirements

The code mainly requires the following packages to run:

  • Python >= 3.9
  • PyTorch (tested on 1.10.2)
  • DGL (tested on 0.8.0post1)
  • NumPy (tested on 1.20.3)
  • SciPy (tested on 1.7.3)
  • scikit-learn (tested on 1.0.2)
  • pandas (tested on 1.3.5)
  • ogb (tested on 1.3.2)
  • DVC (tested on 2.9.5)
  • Signac (tested on 1.7.0)

A complete list of packages installed in the conda environment for experiments can be found in /envs/conda/scaling_GNNs-GPU.yml.
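If you use conda, an environment containing these packages can typically be created directly from that file (the activation name below is an assumption; check the name field inside the .yml):

# Create the conda environment from the provided specification
# (path is relative to the repository root).
conda env create -f envs/conda/scaling_GNNs-GPU.yml
# Activate it; the actual environment name is defined in the .yml file.
conda activate scaling_GNNs-GPU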

Usage

The code uses DVC and Signac to record and manage experiments under different hyperparameters. Default training and model hyperparameters are defined in params.yaml.

Train GNN models on a single GPU

To train a GNN model on a single GPU under default hyperparameters, use the following command:

dvc exp run --temp gnn_links -S model=${model}

For example, to train GCN under default hyperparameters, use the following command:

dvc exp run --temp gnn_links -S model=GCN

To override default parameters in params.yaml on the fly (e.g., use GPU #2 and train for 40 epochs), use the following command:

dvc exp run --temp gnn_links -S "train.gpu=2" -S "models.GCN.n_epochs=40"

Please refer to params.yaml for a complete list of parameters, and to the DVC documentation for additional usage of the experiment interface.
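After experiments finish, the standard DVC experiment commands can be used to inspect and compare runs; for example (generic DVC usage, not specific to this repository; the experiment names are placeholders):

# Show experiments together with their parameters and metrics
dvc exp show

# Compare the parameters and metrics of two experiments (placeholder names)
dvc exp diff exp-abc12 exp-def34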

Train GNN models on multiple GPUs of a single machine

To train a GNN method on multiple GPUs of a single machine, specify a list of GPU IDs to use as follows:

dvc exp run --temp gnn_links -S model=${model} -S "train.gpu=${GPU_IDs}"

For example, to train GCN on GPUs with IDs 0 through 7, use the following command:

dvc exp run --temp gnn_links -S model=GCN -S "train.gpu=[0,1,2,3,4,5,6,7]"

Usage of other parameters is the same as in single-GPU training. See bin/ for example scripts that replicate the experiments on each dataset.
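As a rough sketch of what such a replication script can look like (the model list here is an assumption and does not necessarily match the scripts in bin/):

#!/bin/bash
# Hypothetical replication script in the style of bin/: train several GNN models
# on 8 GPUs of one machine under otherwise default hyperparameters.
for model in GCN GraphSAGE GAT; do
    dvc exp run --temp gnn_links \
        -S "model=${model}" \
        -S "train.gpu=[0,1,2,3,4,5,6,7]"
done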

Running Distributed Time-based Model Aggregation with Randomized Partitions

To enable model aggregation training, set the distributed.mode entry in params.yaml to AvgCluster. When distributed.mode is set to null, the code instead trains on single or multiple GPUs of a single machine, as described above.

Preparing distributed environment

Before running the code in distributed mode, each participant should have its own copy of the code in a separate folder that is not shared with other clients. The address and port used to connect to the server (the rank-0 participant) should be set in the shared environment file src/dist_environ.json as follows:

{
    "MASTER_PORT": 12345,
    "MASTER_ADDR": "172.31.xx.xxx",
    "RANK": 0,
    "TCPSTORE_PORT": 1234,
    "KMP_DUPLICATE_LIB_OK": true
}

The rank of each participant can be set in the local environment file src/__environ__.json:

{
    "RANK": 0
}

This local environment file is not synced through Git, but src/dist_environ.json can be tracked with Git because it is supposed to be identical on all machines. The rank number should be within range(world_size) as set on the server.
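For example, on the participant that should join with rank 1, the local environment file can be written from the shell as follows (a minimal sketch; adjust the rank on each machine):

# Write this machine's rank into the untracked local environment file.
echo '{"RANK": 1}' > src/__environ__.json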

Please refer to the PyTorch documentation for additional information on setting these parameters.

Start distributed training

Make sure you have finished preparing the distributed environment before starting distributed training.

  • Each participant (server or trainer) should run in its own folder.
  • To start the server (rank 0), follow the "Train GNN models on multiple GPUs of a single machine" section, but with distributed.mode set to AvgCluster.
  • The trainers can be started with the same command as the server. Any hyperparameters specified differently on a trainer machine are ignored, since trainers always follow the hyperparameters used by the server; make sure the desired hyperparameters are specified on the server.

For example, suppose we have a server with rank 0 and a trainer with rank 1. The server should run the following command:

dvc exp run --temp gnn_links -S model=GCN \
-S "distributed.mode=AvgCluster" \
-S "distributed.gpu=[0,1]" \
-S "distributed.world_size=2" \
-S "distributed.rank=0"

And the trainer should run the following command:

dvc exp run --temp gnn_links -S model=GCN \
-S "distributed.mode=AvgCluster" \
-S "distributed.world_size=2" \
-S "distributed.rank=1"

If distributed.rank is not specified on the command line, it will be loaded from the environment (e.g., if it is set in src/__environ__.json). See bin/ for example scripts that replicate the experiments on each dataset.
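One way to script this across machines is to keep each rank in the local environment file and run the same launcher everywhere; the following is a hypothetical sketch (not taken from bin/) assuming the two-participant setup above:

#!/bin/bash
# Hypothetical launcher: read this machine's rank from the local environment file
# and run the shared training command, adding the GPU list only on the server.
RANK=$(python -c "import json; print(json.load(open('src/__environ__.json'))['RANK'])")

EXTRA_ARGS=()
if [ "${RANK}" -eq 0 ]; then
    EXTRA_ARGS=(-S "distributed.gpu=[0,1]")
fi

# distributed.rank is intentionally omitted; it is picked up from
# src/__environ__.json as described above.
dvc exp run --temp gnn_links -S model=GCN \
    -S "distributed.mode=AvgCluster" \
    -S "distributed.world_size=2" \
    "${EXTRA_ARGS[@]}"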

File Structure

  • params.yaml: defines default training and model hyperparameters.
  • src/: folder for source code.
    • src/train_links.py: main entry point for link prediction tasks.
    • src/dist_environ.json: global environment file shared by all machines for distributed training.
    • src/__environ__.json: local environment file for specifying rank of the machine in distributed training. Not synced through Git.
    • src/models: implementation of GNN models.
    • src/pipelines: pipelines for initialization, training and evaluation.
    • src/distributed: modules for multi-GPU and time-based model aggregation training.
    • src/dataloading: modules for data loading and preprocessing.
    • src/samplers: modules for customized mini-batch and graph samplers.
  • bin/: bash scripts for replicating experiments.
  • workspace/: workspace folder for experiment jobs created by Signac.
  • dataset/: datasets used in the experiments.
  • dvc.yaml: DVC pipeline definition for the experiments.

Contact

Contact Jiong Zhu (jiongzhu@, or jiongzhu@umich.edu) or Aishwarya Reganti (areganti@amazon.com) if you have any questions about the code.
