
[OSS-SDP] Accuracy bug - results differ from DDP and OSS #132

Closed
blefaudeux opened this issue Oct 9, 2020 · 3 comments · Fixed by #157
Assignees: blefaudeux
Labels: bug (Something isn't working)

Comments

blefaudeux (Contributor) commented Oct 9, 2020

🐛 Bug

See #130 for a repro: with 4+ GPUs there is a measurable accuracy discrepancy on the same problem versus normal DDP and OSS+DDP. It does not show with 2 GPUs.

To Reproduce

Steps to reproduce the behavior:

  1. Run `python3 fairscale/benchmark/oss.py` on a machine with 4 or more GPUs
  2. Observe that the first two runs (DDP and OSS+DDP) match, but that the third one (OSS+ShardedDDP) differs measurably; example with CircleCI (see the sketch after this list)
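
A minimal sketch of the kind of check this boils down to, assuming each benchmark run logs one loss value per line to a text file; the file names and the `read_losses` helper are hypothetical, not part of the benchmark script:

```python
# Hypothetical check: diff the per-step losses logged by the three benchmark runs.
# File names and format (one loss per line) are assumptions, not the real output.
import torch

def read_losses(path: str) -> torch.Tensor:
    with open(path) as f:
        return torch.tensor([float(line) for line in f if line.strip()])

ddp = read_losses("ddp_losses.txt")
oss_ddp = read_losses("oss_ddp_losses.txt")
oss_sdp = read_losses("oss_sdp_losses.txt")

# DDP and OSS+DDP match; on 4+ GPUs the OSS+ShardedDDP run does not, which is the bug.
assert torch.equal(ddp, oss_ddp), "DDP and OSS+DDP diverge"
assert torch.equal(ddp, oss_sdp), "OSS+ShardedDDP diverges from DDP"
```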

Expected behavior

The logs should exactly match for all three methods

Environment

```
Torch version: 1.6.0+cu101
Collecting environment information...
PyTorch version: 1.6.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla M60
GPU 1: Tesla M60
GPU 2: Tesla M60
GPU 3: Tesla M60

Nvidia driver version: 418.87.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==1.6.0+cu101
[pip3] torchtext==0.6.0
[pip3] torchvision==0.7.0
[conda] Could not collect
```

Additional context

In this toy example all the ranks get the same seed, but the data served to each rank differs (as it should).
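
For reference, a minimal sketch of that setup (assumed, not taken from the benchmark): every rank seeds torch identically so the model init matches, while a `DistributedSampler` serves each rank a different shard of the data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_loader(rank: int, world_size: int, seed: int = 42) -> DataLoader:
    # Same seed on every rank: identical model init and identical synthetic data...
    torch.manual_seed(seed)
    dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 4))
    # ...but each rank only iterates over its own shard of that data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=32, sampler=sampler)
```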

blefaudeux self-assigned this Oct 9, 2020
blefaudeux (Contributor, Author) commented
cc @msbaines @min-xu-ai, FYI, I'll look into it but just so you know

msbaines added the bug label Oct 20, 2020
blefaudeux linked a pull request (#157) Oct 21, 2020 that will close this issue
blefaudeux (Contributor, Author) commented
Looks like the bug is actually in computing the update: we're possibly not using the params we should be using. It was not visible with DDP because the all_reduce meant the data was there anyway.
See P146437109: the first gradient matches completely, but the first update differs slightly.
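
A minimal sketch of that kind of check, with plain SGD standing in for the sharded setup (the real comparison would wrap the second replica with OSS/ShardedDDP): if the first gradients are bitwise identical but the parameters differ after the first step, the problem is in the update computation rather than in the gradient reduction.

```python
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model_ref = torch.nn.Linear(8, 4)
model_test = copy.deepcopy(model_ref)

opt_ref = torch.optim.SGD(model_ref.parameters(), lr=0.1)
opt_test = torch.optim.SGD(model_test.parameters(), lr=0.1)  # the sharded optimizer would go here

x, y = torch.randn(16, 8), torch.randn(16, 4)
for model, opt in ((model_ref, opt_ref), (model_test, opt_test)):
    opt.zero_grad()
    F.mse_loss(model(x), y).backward()
    opt.step()

for p_ref, p_test in zip(model_ref.parameters(), model_test.parameters()):
    # Matching gradients but diverging parameters point at the update step.
    assert torch.equal(p_ref.grad, p_test.grad), "first gradient differs"
    assert torch.equal(p_ref, p_test), "first update differs"
```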

blefaudeux (Contributor, Author) commented
Changing torch versions also introduces discrepancies in the reduced gradients, so that part seems beyond the reach of ShardedDDP. If anything, the DDP <> ShardedDDP discrepancy is reduced with PyTorch 1.7.
