trainer.validate will run full validation set on every GPU. #8

ypw-cortex · 2018-11-20T07:11:58Z

Example for efficient multi-GPU training of ResNet-50 (4 GPUs, label smoothing, fast regime by fast.ai):

python -m torch.distributed.launch --nproc_per_node=4 main.py --model resnet --model-config "{'depth': 50, 'regime': 'fast'}" --eval-batch-size 512 --save resnet50_fast --label-smoothing 0.1

I changed it to run on 8 GPUs with ResNet-34 and a batch size of 256:

python -m torch.distributed.launch --nproc_per_node=8 main.py --model resnet --model-config "{'depth': 34, 'regime': 'fast'}" --batch-size 256 --eval-batch-size 512 --label-smoothing 0.1

The log shows:

TRAINING - Epoch: [15][10/625]	Time 0.810 (1.640)
EVALUATING - Epoch: [15][10/98]	Time 1.353 (3.035)

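The expected number of steps per GPU is:

1281167 / 256 = 5004.6, 5004.6 / 8 = 625.6 (training)
50000 / 512 = 97.7, 97.7 / 8 = 12.2 (validation)

So each GPU should run 12 or 13 validation steps, not 98. The 98 in the log equals 50000 / 512 rounded up, which is the step count for the full validation set, so every GPU appears to evaluate the whole set instead of its own 1/8 share.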
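The arithmetic above can be checked with a short sketch. The helper `steps_per_process` is hypothetical (it is not part of this repository), and it assumes the dataset is split evenly across processes, with the incomplete last training batch dropped and the incomplete last validation batch kept. Under those assumptions it reproduces both the 625 training steps and the 98 validation steps seen in the log:

```python
import math

TRAIN_IMAGES = 1281167  # ImageNet training set size, as used in the issue
VAL_IMAGES = 50000      # ImageNet validation set size
WORLD_SIZE = 8          # --nproc_per_node=8


def steps_per_process(num_samples, batch_size, world_size=1, drop_last=False):
    """Hypothetical helper: batches each process iterates per epoch,
    assuming the samples are split evenly across `world_size` processes."""
    per_process = math.ceil(num_samples / world_size)
    if drop_last:
        # Assumption: the incomplete final batch is discarded.
        return per_process // batch_size
    return math.ceil(per_process / batch_size)


# Training is sharded: matches the [10/625] in the log.
train_steps = steps_per_process(TRAIN_IMAGES, 256, WORLD_SIZE, drop_last=True)

# Validation if it were sharded across the 8 GPUs.
val_steps_sharded = steps_per_process(VAL_IMAGES, 512, WORLD_SIZE)

# Validation with every GPU reading the full set: matches the [10/98] in the log.
val_steps_full = steps_per_process(VAL_IMAGES, 512, world_size=1)

print(train_steps, val_steps_sharded, val_steps_full)
```

One common way to shard an evaluation set in PyTorch is `torch.utils.data.distributed.DistributedSampler`. Whether that fits this trainer is an assumption, since the sampler pads the last shard and the per-GPU metrics would then need to be reduced across processes.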
