
[OSS-SDP] Accuracy bug - results differ from DDP and OSS #132

Closed
blefaudeux opened this issue Oct 9, 2020 · 3 comments · Fixed by #157
Assignees: blefaudeux
Labels: bug (Something isn't working)

Comments

blefaudeux (Contributor) commented Oct 9, 2020

🐛 Bug

See #130 for a repro: with 4+ GPUs there is a measurable accuracy discrepancy on the same problem versus normal DDP and OSS+DDP. It does not show with 2 GPUs.

To Reproduce

Steps to reproduce the behavior:

  1. Run `python3 fairscale/benchmark/oss.py` on a machine with 4 or more GPUs
  2. Observe that the first two runs (DDP and OSS+DDP) match, but that the third one (OSS+ShardedDDP) differs measurably; example with CircleCI (see the sketch after this list)
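
A minimal sketch of the kind of check this boils down to, assuming each benchmark run logs one loss value per line to a text file; the file names and the `read_losses` helper are hypothetical, not part of the benchmark script:

```python
# Hypothetical check: diff the per-step losses logged by the three benchmark runs.
# File names and format (one loss per line) are assumptions, not the real output.
import torch

def read_losses(path: str) -> torch.Tensor:
    with open(path) as f:
        return torch.tensor([float(line) for line in f if line.strip()])

ddp = read_losses("ddp_losses.txt")
oss_ddp = read_losses("oss_ddp_losses.txt")
oss_sdp = read_losses("oss_sdp_losses.txt")

# DDP and OSS+DDP match; on 4+ GPUs the OSS+ShardedDDP run does not, which is the bug.
assert torch.equal(ddp, oss_ddp), "DDP and OSS+DDP diverge"
assert torch.equal(ddp, oss_sdp), "OSS+ShardedDDP diverges from DDP"
```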

Expected behavior

The logs should exactly match for all three methods

Environment

```
Torch version: 1.6.0+cu101
Collecting environment information...
PyTorch version: 1.6.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla M60
GPU 1: Tesla M60
GPU 2: Tesla M60
GPU 3: Tesla M60

Nvidia driver version: 418.87.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==1.6.0+cu101
[pip3] torchtext==0.6.0
[pip3] torchvision==0.7.0
[conda] Could not collect
```

Additional context

In this toy example all the ranks get the same seed, but the data served to each rank differs (as it should).
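
For reference, a minimal sketch of that setup (assumed, not taken from the benchmark): every rank seeds torch identically so the model init matches, while a `DistributedSampler` serves each rank a different shard of the data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_loader(rank: int, world_size: int, seed: int = 42) -> DataLoader:
    # Same seed on every rank: identical model init and identical synthetic data...
    torch.manual_seed(seed)
    dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 4))
    # ...but each rank only iterates over its own shard of that data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=32, sampler=sampler)
```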

blefaudeux self-assigned this Oct 9, 2020
blefaudeux (Contributor, Author) commented
cc @msbaines @min-xu-ai, FYI, I'll look into it but just so you know

msbaines added the bug label Oct 20, 2020
blefaudeux linked a pull request (#157) Oct 21, 2020 that will close this issue
blefaudeux (Contributor, Author) commented
Looks like the bug is actually in computing the update: we're possibly not using the params we should be using. It was not visible with DDP because the all_reduce meant the data was there anyway.
See P146437109: the first gradient matches completely, but the first update differs slightly.
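
A minimal sketch of that kind of check, with plain SGD standing in for the sharded setup (the real comparison would wrap the second replica with OSS/ShardedDDP): if the first gradients are bitwise identical but the parameters differ after the first step, the problem is in the update computation rather than in the gradient reduction.

```python
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model_ref = torch.nn.Linear(8, 4)
model_test = copy.deepcopy(model_ref)

opt_ref = torch.optim.SGD(model_ref.parameters(), lr=0.1)
opt_test = torch.optim.SGD(model_test.parameters(), lr=0.1)  # the sharded optimizer would go here

x, y = torch.randn(16, 8), torch.randn(16, 4)
for model, opt in ((model_ref, opt_ref), (model_test, opt_test)):
    opt.zero_grad()
    F.mse_loss(model(x), y).backward()
    opt.step()

for p_ref, p_test in zip(model_ref.parameters(), model_test.parameters()):
    # Matching gradients but diverging parameters point at the update step.
    assert torch.equal(p_ref.grad, p_test.grad), "first gradient differs"
    assert torch.equal(p_ref, p_test), "first update differs"
```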

blefaudeux (Contributor, Author) commented
Changing torch versions also introduces discrepancies in the reduced gradients, so that part seems beyond the reach of ShardedDDP. If anything, the DDP <> ShardedDDP discrepancy is reduced with PyTorch 1.7.
