This repository has been archived by the owner on Mar 14, 2024. It is now read-only.

Fixing bug introduced by PyTorch 1.10 #246

Closed
wants to merge 2 commits

Conversation

@tmarkovich (Contributor)

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Motivation and Context / Related issue

This fixes issue #245.

How Has This Been Tested (if it applies)

I've confirmed that I've been able to train on PyTorch 1.10 using this fix.

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

@facebook-github-bot added the "CLA Signed" label on Jan 19, 2022

@adamlerer (Contributor) left a comment


Thanks! Would it be possible to make this small change? I want to confirm that the error is really because of the expand rather than because of new_zeros.

@@ -147,15 +147,15 @@ def forward(
         if weight is not None:
             loss_per_sample = F.cross_entropy(
                 scores,
-                pos_scores.new_zeros((), dtype=torch.long).expand(num_pos),
+                torch.zeros((num_pos, ), dtype=torch.long, device=scores.device),

Suggested change:
-                torch.zeros((num_pos, ), dtype=torch.long, device=scores.device),
+                pos_scores.new_zeros((num_pos, ), dtype=torch.long),

                 reduction="none",
             )
             match_shape(weight, num_pos)
             loss_per_sample = loss_per_sample * weight
         else:
             loss_per_sample = F.cross_entropy(
                 scores,
-                pos_scores.new_zeros((), dtype=torch.long).expand(num_pos),
+                torch.zeros((num_pos, ), dtype=torch.long, device=scores.device),

Suggested change:
-                torch.zeros((num_pos, ), dtype=torch.long, device=scores.device),
+                pos_scores.new_zeros((num_pos, ), dtype=torch.long),

@lw (Contributor) commented on Jan 20, 2022

This change is undoing a perf optimization: the code before was carefully written to allocate only one element in memory and then adapt the tensor's indexing so that it looks like that element is replicated many times. The new code, however, allocates multiple elements in memory. I don't know if the impact is significant, but it would certainly be better to avoid it.

If there has been a regression in PyTorch between 1.9 and 1.10, it should be reported upstream and ideally fixed there, instead of being worked around in our code.
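(To make the optimization lw describes concrete, here is a minimal sketch, not part of the PR, with illustrative names and sizes. It shows that the expand trick backs the whole target with a single stored element, while the torch.zeros call materializes num_pos elements.)

    import torch

    num_pos = 1_000_000
    pos_scores = torch.randn(num_pos)

    # Old code: allocate a single zero, then view it as a length-num_pos tensor.
    expanded = pos_scores.new_zeros((), dtype=torch.long).expand(num_pos)

    # New code in this PR: allocate and zero-fill num_pos elements.
    materialized = torch.zeros((num_pos,), dtype=torch.long)

    print(expanded.stride())              # (0,)  every index reads the same element
    print(expanded.storage().size())      # 1
    print(materialized.storage().size())  # 1000000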

@tmarkovich (Contributor, Author)

@lw I agree in principle, but it's not clear on what timeframe the upstream bug will be fixed. I've also observed almost no performance degradation in practice with the above change: with 16 A100 GPUs I'm able to train at 46M edges/second, either with PyTorch 1.9 and new_zeros or with PyTorch 1.10 and the above change.

@adamlerer I did find that the issue is down to the expand and not the difference between zeros and new_zeros.
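(For reference, a minimal way to check that distinction, not taken from the PR and using made-up scores and shapes, is to pass each target variant to F.cross_entropy directly; on an affected PyTorch 1.10.0 install the expanded target reportedly fails, per issue #245, while the materialized variants work.)

    import torch
    import torch.nn.functional as F

    num_pos, num_classes = 8, 5
    scores = torch.randn(num_pos, num_classes)
    pos_scores = scores[:, 0]

    targets = {
        "new_zeros(()) + expand": pos_scores.new_zeros((), dtype=torch.long).expand(num_pos),
        "new_zeros((num_pos,))": pos_scores.new_zeros((num_pos,), dtype=torch.long),
        "torch.zeros((num_pos,))": torch.zeros((num_pos,), dtype=torch.long, device=scores.device),
    }

    for name, target in targets.items():
        try:
            loss = F.cross_entropy(scores, target, reduction="none")
            print(f"{name}: ok, loss shape {tuple(loss.shape)}")
        except RuntimeError as err:
            print(f"{name}: failed: {err}")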

@lw with the above in mind, what is the perf difference between new_zeros((), ...).expand(npos) and new_zeros((npos, ))?

[Screenshot attached: Screen Shot 2022-01-20 at 9:34:47 AM]

Anyway, I'll report this bug upstream to PyTorch as well.

@lw (Contributor) commented on Jan 20, 2022

> what is the perf difference between new_zeros((), ...).expand(npos) and new_zeros((npos, ))?

It's more of a memory gain (and, secondarily, the time gain to allocate/fill in that memory). But admittedly it's probably barely significant, since if I remember correctly we're talking about allocating 4 extra bytes for each input edge of a batch (rather than just 4 bytes once for the entire batch).

If you've confirmed that the expand is indeed the root cause of the bug and that there's no observable difference after this fix, then I'm fine with it. Thanks for reporting the bug upstream anyways!

@tmarkovich (Contributor, Author)

I'll go ahead and change it to new_zeros((npos, )), and then file an issue (linking the upstream one) to revert this fix once the upstream bug is fixed.

In the meantime, here's the upstream issue:
pytorch/pytorch#71550

@adamlerer (Contributor)

@lw you're right that there's a (minor) performance implication to this, but I think it's overshadowed by the benefit of PyTorch 1.10 compatibility (even if this is fixed in a future version, we ideally want compatibility across a range of PyTorch versions).

@facebook-github-bot (Contributor)

@adamlerer has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ngimel commented on Jan 20, 2022

It is indeed fixed in the point release (which everyone should be using instead of 1.10): https://github.com/pytorch/pytorch/pull/69617/files#diff-13bb6989036251a34e2a9f7bd28349761c20950940cd22300a72ffc860296ed0R285. However, it's fixed in a way that undoes your perf optimization inside the cross-entropy implementation, by calling contiguous() on your expanded tensor.
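(For context, calling contiguous() on an expanded tensor materializes the replicated elements, which is roughly what the linked upstream fix does inside the cross-entropy implementation; the sketch below is my own illustration, not the upstream code.)

    import torch

    # A stride-0 view: one stored element presented as a million zeros.
    target = torch.zeros((), dtype=torch.long).expand(1_000_000)
    print(target.storage().size())               # 1

    # Making it contiguous copies out every element, so the memory saving
    # from the expand trick is lost at this point.
    print(target.contiguous().storage().size())  # 1000000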

@tmarkovich (Contributor, Author)

I just wanted to check in: it looks like this was imported into the Meta internal repo. Did it get mainlined externally? If so, should we close this PR? And if not, should I make any changes?

@adamlerer (Contributor)

> It is indeed fixed in the point release (which everyone should be using instead of 1.10): https://github.com/pytorch/pytorch/pull/69617/files#diff-13bb6989036251a34e2a9f7bd28349761c20950940cd22300a72ffc860296ed0R285. However, it's fixed in a way that undoes your perf optimization inside the cross-entropy implementation, by calling contiguous() on your expanded tensor.

What is "point release" @ngimel ?

@adamlerer (Contributor)

Sorry about the delay, @tmarkovich; this PR should merge today.

@ngimel commented on Feb 7, 2022

What is "point release" @ngimel ?

1.10.1 or 1.10.2 (it's fixed in both).
