This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Training on multiple nodes #13

Closed
d-li14 opened this issue Apr 3, 2021 · 6 comments

Comments

@d-li14

d-li14 commented Apr 3, 2021

Thanks for your great work. If there are two machines (each with 8 V100 GPUs) connected over Ethernet, without SLURM management, how should I run the code with your stated 16-V100 config?

@jingli9111
Contributor

jingli9111 commented Apr 3, 2021

I am not sure how to run it across Ethernet-connected nodes.
I think you can directly run the training on a single machine (with 8 GPUs) with a smaller batch size, e.g.
python main.py /path/to/imagenet/ --epochs 1000 --batch-size 1024 --learning-rate 0.25 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
or
python main.py /path/to/imagenet/ --epochs 1000 --batch-size 512 --learning-rate 0.3 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
It should give similar results.

@d-li14
Author

d-li14 commented Apr 3, 2021

@jingli9111 Thanks very much for your reply. Anyway, my concern is that it might be too slow to train on a single machine.

@shoaibahmed

shoaibahmed commented Apr 5, 2021

@d-li14 The provided main.py script internally uses multiprocessing. To use two nodes without SLURM, the best approach is to get rid of the multiprocessing.spawn call in main and merge the main function with main_worker. I have attached the kind of main function I use at the bottom for your reference.

Once you have that code structure, use multiproc.py from NVIDIA to execute the code (https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/multiproc.py). An alternative is to use the built-in PyTorch launcher, i.e. python -m torch.distributed.launch, but I always prefer multiproc.py because it redirects the output of each process to a separate file, avoiding the amalgamation of logs from different processes (otherwise each process prints the exact same output, meaning every print statement appears world_size times). Furthermore, multiproc.py handles interrupts across the different processes more gracefully.

It takes an --nnodes parameter, which should be set to 2, as well as --nproc_per_node, which should be set to 8. Since the two nodes need a way to communicate, you have to designate one node as the master and note its IP address. Once you have it, pass that IP address with --master_addr ... on both nodes. So the final command will look something like: python ./multiproc.py --nnodes 2 --nproc_per_node 8 --master_addr ... main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
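
Concretely, the two launch commands would look roughly like this (a sketch assuming multiproc.py also exposes --node_rank and --master_port like the standard PyTorch launcher; check its argparse, and replace <master-node-ip> with the master's IP address):

# on the master node
python ./multiproc.py --nnodes 2 --node_rank 0 --nproc_per_node 8 --master_addr <master-node-ip> --master_port 29500 main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
# on the second node, only --node_rank changes
python ./multiproc.py --nnodes 2 --node_rank 1 --nproc_per_node 8 --master_addr <master-node-ip> --master_port 29500 main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024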

import os
import sys

import torch

# `parser` (with --checkpoint-dir parsed as a pathlib.Path) and the rest of the
# training code are assumed to be the same as in the original main.py.

def main():
    args = parser.parse_args()
    args.ngpus_per_node = torch.cuda.device_count()

    # Initialize the distributed environment (defaults cover the single-process case)
    args.gpu = 0
    args.world_size = 1
    args.local_rank = 0
    args.distributed = int(os.getenv('WORLD_SIZE', 1)) > 1
    args.rank = int(os.getenv('RANK', 0))

    if "SLURM_NNODES" in os.environ:
        # Launched via SLURM: derive the local rank from the global rank
        args.local_rank = args.rank % torch.cuda.device_count()
        print(f"SLURM tasks/nodes: {os.getenv('SLURM_NTASKS', 1)}/{os.getenv('SLURM_NNODES', 1)}")
    elif "WORLD_SIZE" in os.environ:
        # Launched via multiproc.py / torch.distributed.launch: the launcher provides LOCAL_RANK
        args.local_rank = int(os.getenv('LOCAL_RANK', 0))

    args.gpu = args.local_rank
    torch.cuda.set_device(args.gpu)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    args.world_size = torch.distributed.get_world_size()
    assert int(os.getenv('WORLD_SIZE', 1)) == args.world_size
    print(f"Initializing the environment with {args.world_size} processes | Current process rank: {args.local_rank}")

    if args.rank == 0:
        # Only the global rank-0 process creates the checkpoint dir and writes the stats file
        args.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        stats_file = open(args.checkpoint_dir / 'stats.txt', 'a', buffering=1)
        print(' '.join(sys.argv))
        print(' '.join(sys.argv), file=stats_file)

    gpu = args.gpu
    torch.backends.cudnn.benchmark = True
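
With this structure, each process expects the launcher (multiproc.py or torch.distributed.launch) to provide the usual environment variables that init_process_group(..., init_method="env://") reads. For the 2-node x 8-GPU setup, each spawned process should roughly see something like the following (illustrative values; some launchers pass --local_rank as a command-line argument instead of setting LOCAL_RANK):

MASTER_ADDR=<master-node-ip>   # IP of the master node
MASTER_PORT=29500              # any free port on the master node
WORLD_SIZE=16                  # 2 nodes x 8 GPUs each
RANK=0..15                     # global rank of this process
LOCAL_RANK=0..7                # GPU index within the node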

@d-li14
Author

d-li14 commented Apr 5, 2021

@shoaibahmed Thanks a lot for your reply and detailed instructions! I will try it.

@kartikgupta-at-anu

Does anyone know how much GPU memory it consumes with a batch size of 1024? I am running out of memory even with a 1024 batch size on 8 V100s.

@jzbontar
Contributor

jzbontar commented May 3, 2021

Does anyone know how much GPU memory it consumes with a batch size of 1024? I am running out of memory even with a 1024 batch size on 8 V100s.

Me too. The largest batch size I was able to fit on 8 V100s (with 16 GB of memory each) was 512, which used a little over 11 GB of memory per GPU.
