fix: add setup_context for torch.func compatibility#7916

Open
roycho96 wants to merge 21 commits into deepspeedai:master from roycho96:fix/support-func-torch
Conversation

@roycho96

LinearFunctionForZeroStage3 uses the legacy forward(ctx, ...) pattern, which is incompatible with torch.func transforms (torch.func.grad, torch.func.grad_and_value, vmap, etc.):

RuntimeError: In order to use an autograd.Function with functorch transforms
(vmap, grad, jvp, jacrev, ...), it must override the setup_context staticmethod.

This affects any library that uses torch.func internally on a ZeRO-3 model.
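For context, here is a minimal sketch of the modern shape torch.func requires: a forward that takes no ctx, plus a setup_context staticmethod. The class name is hypothetical; this is not DeepSpeed's actual code, just the pattern the fix adopts.

```python
import torch

class LinearSketch(torch.autograd.Function):
    """Hypothetical stand-in illustrating the forward/setup_context split."""

    # Modern pattern: forward takes no ctx and only computes the output.
    @staticmethod
    def forward(input, weight):
        return input @ weight.t()

    # torch.func transforms require this hook; it receives the forward
    # inputs and output and stashes whatever backward will need on ctx.
    @staticmethod
    def setup_context(ctx, inputs, output):
        input, weight = inputs
        ctx.save_for_backward(input, weight)

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        return grad_output @ weight, grad_output.t() @ input

# With setup_context present, torch.func.grad no longer raises the
# "must override the setup_context staticmethod" RuntimeError.
x = torch.ones(4, 3)
w = torch.ones(2, 3)
grad_x = torch.func.grad(lambda t: LinearSketch.apply(t, w).sum())(x)
```

The legacy forward(ctx, ...) form still works for plain autograd, which is why the bug only surfaces once something calls torch.func on the model.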

Fix

Fixes #7913

Note

As pointed out by @zhangj1an in #7913, PostBackwardFunctionModule and PreBackwardFunctionForModule in parameter_offload.py have the same issue. Those will be addressed in a follow-up commit within this PR.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eed37042bc


Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
… unpack error

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
@roycho96 roycho96 force-pushed the fix/support-func-torch branch from 252aea1 to 39b1755 on March 21, 2026 at 10:28
@roycho96
Author

> Thanks for the work! I implemented the same fix, so it looks good to me. To reduce reviewers' effort, I left 2 minor comments; these should leave linear.py with only 17 insertions (+) and 7 deletions (-).

Hi @zhangj1an, I've sent you a collaborator invite to my fork. Feel free to push your fix directly to the branch. Thanks for the suggestion!

… setup_context

Co-authored-by: zhangj1an <jianmusings@gmail.com>
Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
@zhangj1an zhangj1an force-pushed the fix/support-func-torch branch from 444122c to 6df37af on March 22, 2026 at 08:45
…afe linear

Avoid asymmetric custom_bwd without custom_fwd on the setup_context forward path;
mirror forward AMP in backward via torch.amp.autocast.

Signed-off-by: Zhang <jianmusings@gmail.com>
PyTorch versions that expose autograd.Function.setup_context need the
modern forward + setup_context shape for torch.func / functorch.
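The commit message above can be sketched as a version gate (illustrative only; the class name is hypothetical and the real change lives in linear.py): older PyTorch builds lack autograd.Function.setup_context, so each build gets the forward shape it understands.

```python
import torch

# Gate on setup_context availability: only PyTorch versions that expose it
# accept (and, for torch.func, require) the modern forward shape.
if hasattr(torch.autograd.Function, "setup_context"):

    class GatedLinear(torch.autograd.Function):
        @staticmethod
        def forward(input, weight):
            return input @ weight.t()

        @staticmethod
        def setup_context(ctx, inputs, output):
            ctx.save_for_backward(*inputs)

        @staticmethod
        def backward(ctx, grad_output):
            input, weight = ctx.saved_tensors
            return grad_output @ weight, grad_output.t() @ input

else:

    class GatedLinear(torch.autograd.Function):
        # Legacy pattern: ctx is the first forward argument.
        @staticmethod
        def forward(ctx, input, weight):
            ctx.save_for_backward(input, weight)
            return input @ weight.t()

        @staticmethod
        def backward(ctx, grad_output):
            input, weight = ctx.saved_tensors
            return grad_output @ weight, grad_output.t() @ input
```

Either branch behaves identically under ordinary autograd; only the first is usable under torch.func transforms.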

Signed-off-by: Zhang <jianmusings@gmail.com>
@zhangj1an zhangj1an force-pushed the fix/support-func-torch branch from 0a66444 to 5e83d05 on March 22, 2026 at 09:34
zhangj1an and others added 9 commits on March 24, 2026 at 22:21
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
@roycho96 roycho96 marked this pull request as ready for review March 25, 2026 14:35
@roycho96 roycho96 requested a review from loadams as a code owner March 25, 2026 14:35

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60d20da79f


@Flink-ddd
Contributor

Hi @tohtana , would you mind reviewing this PR when you're free? It addresses a useful compatibility fix for torch.func. Much appreciated!

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
@roycho96
Author

I additionally fixed the autocast backward: it is now always wrapped in autocast(enabled=ctx._fwd_used_autocast) to match @custom_bwd semantics and to prevent an outer autocast context from leaking into backward.
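A minimal sketch of that idea (hypothetical class, not the actual linear.py diff): setup_context records whether autocast was active during forward, and backward re-enters that same state, mirroring what @custom_fwd/@custom_bwd do for legacy-style Functions.

```python
import torch

class AutocastMirrorLinear(torch.autograd.Function):
    @staticmethod
    def forward(input, weight):
        return input @ weight.t()

    @staticmethod
    def setup_context(ctx, inputs, output):
        input, weight = inputs
        ctx.save_for_backward(input, weight)
        # What @custom_fwd records: was autocast active during forward?
        ctx._fwd_used_autocast = torch.is_autocast_enabled()

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        # What @custom_bwd does: run backward under the *forward's* autocast
        # state, so an outer autocast context cannot leak in (or be lost).
        with torch.amp.autocast("cuda", enabled=ctx._fwd_used_autocast):
            return grad_output @ weight, grad_output.t() @ input
```

Without the wrapper, a backward pass triggered inside an autocast region would run the matmuls in reduced precision even when the forward ran in full precision, which is exactly the asymmetry the commit message above describes.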

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
@roycho96 roycho96 force-pushed the fix/support-func-torch branch from 1acca1f to 04c456f on March 29, 2026 at 03:40
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Collaborator

@tohtana tohtana left a comment


Thank you @roycho96, @zhangj1an!

This PR overall looks good to me. The only issue is that I merged my PR #7920, which caused a conflict, since I hadn't checked the changes in this PR. Sorry about that.
I opened a new PR on your fork to resolve the conflict. Please check and merge it.

Signed-off-by: Zhang Jian <jianmusings@gmail.com>
Signed-off-by: Zhang Jian <jianmusings@gmail.com>
Signed-off-by: Zhang Jian <jianmusings@gmail.com>


Development

Successfully merging this pull request may close these issues.

[BUG] LinearFunctionForZeroStage3 crashes with torch.func transforms (missing setup_context)

5 participants