This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Training on multiple nodes #13

Closed
d-li14 opened this issue Apr 3, 2021 · 6 comments

Comments

@d-li14

d-li14 commented Apr 3, 2021

Thanks for your great work. If there are two machines (each with 8 V100 GPUs) connected over Ethernet, without SLURM management, how should I run the code with your stated 16-V100 config?

@jingli9111
Contributor

jingli9111 commented Apr 3, 2021

I am not sure how to run it across Ethernet-connected nodes.
I think you can directly run the training on a single machine (with 8 GPUs) with a smaller batch size, e.g.
python main.py /path/to/imagenet/ --epochs 1000 --batch-size 1024 --learning-rate 0.25 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
or
python main.py /path/to/imagenet/ --epochs 1000 --batch-size 512 --learning-rate 0.3 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
It should give similar results.

@d-li14
Author

d-li14 commented Apr 3, 2021

@jingli9111 Thanks very much for your reply. Anyway, my concern is that it might be too slow to train on a single machine.

@shoaibahmed

shoaibahmed commented Apr 5, 2021

@d-li14 The provided main.py script internally uses multiprocessing. To use two nodes without SLURM, the best approach is to get rid of the multiprocessing.spawn call in main and merge the main function with main_worker. I have attached the kind of main function I use at the bottom for your reference.

Once you have that code structure, use multiproc.py from NVIDIA to execute the code (https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/multiproc.py). An alternative is to use the built-in PyTorch launcher, i.e. python -m torch.distributed.launch, but I always prefer multiproc.py because it redirects the output of each process to a separate file, avoiding the amalgamation of logs from different processes (otherwise each process prints the exact same output, meaning every print statement appears world_size times). Furthermore, multiproc.py handles interrupts across the different processes more gracefully.

It takes an --nnodes parameter, which should be set to 2, as well as --nproc_per_node, which should be set to 8. Since the two nodes need a way to communicate, you have to designate one node as the master and note its IP address. Once you have it, pass that IP address with --master_addr ... on both nodes. So the final command will look something like: python ./multiproc.py --nnodes 2 --nproc_per_node 8 --master_addr ... main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
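
Concretely, the two launch commands would look roughly like this (a sketch assuming multiproc.py also exposes --node_rank and --master_port like the standard PyTorch launcher; check its argparse, and replace <master-node-ip> with the master's IP address):

# on the master node
python ./multiproc.py --nnodes 2 --node_rank 0 --nproc_per_node 8 --master_addr <master-node-ip> --master_port 29500 main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024
# on the second node, only --node_rank changes
python ./multiproc.py --nnodes 2 --node_rank 1 --nproc_per_node 8 --master_addr <master-node-ip> --master_port 29500 main.py /path/to/imagenet/ --epochs 1000 --batch-size 2048 --learning-rate 0.2 --lambd 0.0051 --projector 8192-8192-8192 --scale-loss 0.024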

import os
import sys

import torch

# `parser` (with --checkpoint-dir parsed as a pathlib.Path) and the rest of the
# training code are assumed to be the same as in the original main.py.

def main():
    args = parser.parse_args()
    args.ngpus_per_node = torch.cuda.device_count()

    # Initialize the distributed environment (defaults cover the single-process case)
    args.gpu = 0
    args.world_size = 1
    args.local_rank = 0
    args.distributed = int(os.getenv('WORLD_SIZE', 1)) > 1
    args.rank = int(os.getenv('RANK', 0))

    if "SLURM_NNODES" in os.environ:
        # Launched via SLURM: derive the local rank from the global rank
        args.local_rank = args.rank % torch.cuda.device_count()
        print(f"SLURM tasks/nodes: {os.getenv('SLURM_NTASKS', 1)}/{os.getenv('SLURM_NNODES', 1)}")
    elif "WORLD_SIZE" in os.environ:
        # Launched via multiproc.py / torch.distributed.launch: the launcher provides LOCAL_RANK
        args.local_rank = int(os.getenv('LOCAL_RANK', 0))

    args.gpu = args.local_rank
    torch.cuda.set_device(args.gpu)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    args.world_size = torch.distributed.get_world_size()
    assert int(os.getenv('WORLD_SIZE', 1)) == args.world_size
    print(f"Initializing the environment with {args.world_size} processes | Current process rank: {args.local_rank}")

    if args.rank == 0:
        # Only the global rank-0 process creates the checkpoint dir and writes the stats file
        args.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        stats_file = open(args.checkpoint_dir / 'stats.txt', 'a', buffering=1)
        print(' '.join(sys.argv))
        print(' '.join(sys.argv), file=stats_file)

    gpu = args.gpu
    torch.backends.cudnn.benchmark = True
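
With this structure, each process expects the launcher (multiproc.py or torch.distributed.launch) to provide the usual environment variables that init_process_group(..., init_method="env://") reads. For the 2-node x 8-GPU setup, each spawned process should roughly see something like the following (illustrative values; some launchers pass --local_rank as a command-line argument instead of setting LOCAL_RANK):

MASTER_ADDR=<master-node-ip>   # IP of the master node
MASTER_PORT=29500              # any free port on the master node
WORLD_SIZE=16                  # 2 nodes x 8 GPUs each
RANK=0..15                     # global rank of this process
LOCAL_RANK=0..7                # GPU index within the node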

@d-li14
Author

d-li14 commented Apr 5, 2021

@shoaibahmed Thanks a lot for your reply and detailed instructions! I will try it.

@kartikgupta-at-anu

Does anyone know how much GPU memory it consumes with a batch size of 1024? I am running out of memory even with a 1024 batch size on 8 V100s.

@jzbontar
Contributor

jzbontar commented May 3, 2021

Does anyone know how much GPU memory it consumes with a batch size of 1024? I am running out of memory even with a 1024 batch size on 8 V100s.

Me too. The largest batch size I was able to fit on 8 V100s (with 16 GB of memory each) was 512, which used a little over 11 GB of memory per GPU.
