Skip to content

[pull] main from pytorch:main#543

Merged
pull[bot] merged 19 commits intoais-developer:mainfrom
pytorch:main
Oct 1, 2025
Merged

[pull] main from pytorch:main#543
pull[bot] merged 19 commits intoais-developer:mainfrom
pytorch:main

Conversation

@pull
Copy link
Copy Markdown

@pull pull bot commented Oct 1, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

qihqi and others added 19 commits September 30, 2025 19:21
Follow up to #162620.  Add half support, as well.  This fixes some failures in inductor benchmarks such as from this log https://github.com/pytorch/pytorch/actions/runs/18051942373/job/51376749459.

`NotImplementedError: "aminmax_kernel" not implemented for 'Half'`
Pull Request resolved: #164175
Approved by: https://github.com/malfet, https://github.com/jerryzh168
Those were very useful in the past, because:
- CI builder jobs did not generates wheels, but rather run `python setup.py develop` and shared docker layers, which is no longer the case, all CI jobs produce wheels
- CD jobs were targeting pre-CXX11 ABI, but this is no longer the case after manylinux2_28 migration

Existing, but acceptable gaps:
 - Windows libtorch debug builds sometimes might fail, but IMO it's ok not to be able to produce those for a few days, as number of libtorch users are somewhat small
 - All CD jobs are based on AlmaLinux, while CI are based on Ubuntu, but this could be adjusted if needed, besides AlmaLinux-9 and Ubuntu-22.04 are pretty close in terms of glibc and gcc versions
 - CD jobs build for all GPU architectures, while CI only for the one being tested, but there are now periodic H100 and B200 jobs, and not a lot of development happens for Voltas or Pascals

Besides there are better tools to alert about the nightly failures

Pull Request resolved: #164260
Approved by: https://github.com/seemethere, https://github.com/atalman
Add input checks like meta functions for standard ops in `ATen/native/LinearAlgebra.cpp` for the `out_dtype` variants. Fixes silent incorrectness in #163816

Pull Request resolved: #164095
Approved by: https://github.com/ngimel
…obatching (#162839)

**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. The first test verifies Replicate works with gradient accumulation properly. The second verifies that replicate works correctly with a One-Forward-One-Backward (1F1B) pipeline parallelism schedule

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_gradient_accumulation
2. pytest test/distributed/_composable/test_replicate_training.py -k test_1f1b_microbatching

Pull Request resolved: #162839
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836
**Summary: tests replicate works when users use custom forward methods**

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_register_fsdp_forward_method

Pull Request resolved: #162851
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839
**Summary:** Proof that new replicate API is composable with TP

**Test Case**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_replicate_tp

Pull Request resolved: #162853
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839, #162851
…sion (#162855)

**Summary:** Ensures that replicate functionality works the same as fully shard's when mixed precision is used

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k TestReplicateMixedPrecisionTraining

Pull Request resolved: #162855
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839, #162851, #162853
for fsdp2 + EP, titan has fully_shard(AC(layer)) and fully_shard(layer.moe.experts): pytorch/torchtitan#1624

for implicit prefetching, backward order is
* _pre_backward unshard (norm, output)
* _backward_prefetch unshard layers.6
* post_backward reshard (norm, output)
* _pre_backward unshard layers.6 (no-op, unsharded already)
* _backward_prefetch unshard layers.6.moe.experts
* recompute_fn pre_forward unshard layers.6.moe.experts (no-op, unsharded already)
* ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR make it a no-op
* _pre_backward unshard layers.6.moe.experts (no-op, unsharded already)
* _backward_prefetch unshard layers.5
* post_backward reshard layers.6.moe.experts
* post_backward reshard layers.6

unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_comm.py -k test_set_modules_to_backward_prefetch_inside_ac`

before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - step:  1  loss: 12.0162  grad_norm:  1.7315  memory: 45.64GiB(48.05%)  tps: 1,028  tflops: 10.87  mfu: 1.10%
[rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:43:35,233 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 11:43:35,987 - root - INFO - step: 50  loss:  6.9302  grad_norm:  0.9985  memory: 59.66GiB(62.80%)  tps: 11,712  tflops: 123.89  mfu: 12.53%
```

after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - step:  1  loss: 12.0134  grad_norm:  1.6916  memory: 38.42GiB(40.45%)  tps: 805  tflops: 8.51  mfu: 0.86%
[rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:39:28,541 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 11:39:29,279 - root - INFO - step: 50  loss:  6.9346  grad_norm:  1.1875  memory: 52.58GiB(55.36%)  tps: 12,583  tflops: 133.10  mfu: 13.46%
```

for explicit prefetching, layers.6 backward prefetch layers.5 and layers.5.moe.experts. layers.6.moe.experts does not have explicit prefetch. backward order is like this
* _pre_backward unshard (norm, output)
* _prefetch_unshard layers.6
* post_backward reshard (norm, output)
* _pre_backward unshard layers.6 (no-op, unsharded already)
* _prefetch_unshard layers.5
* _prefetch_unshard layers.5.moe.experts
* recompute_fn pre_forward unshard layers.6.moe.experts
* ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR makes it a no-op
* _pre_backward unshard layers.6.moe.expert (no-op, unsharded already)
* post_backward reshard layers.6.moe.expert
* post_backward reshard layers.6

before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - step:  1  loss: 12.0180  grad_norm:  1.6948  memory: 45.77GiB(48.18%)  tps: 849  tflops: 8.98  mfu: 0.91%
[rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:53:57,768 - root - INFO - [GC] Performing periodical GC collection 0.07 seconds
[rank0]:[titan] 2025-09-30 11:53:58,515 - root - INFO - step: 50  loss:  6.9358  grad_norm:  1.0528  memory: 59.80GiB(62.95%)  tps: 11,827  tflops: 125.10  mfu: 12.65%```
```

after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - step:  1  loss: 12.0143  grad_norm:  1.7030  memory: 38.55GiB(40.58%)  tps: 988  tflops: 10.45  mfu: 1.06%
[rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 12:09:10,482 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 12:09:11,168 - root - INFO - step: 50  loss:  6.9356  grad_norm:  0.9911  memory: 52.81GiB(55.59%)  tps: 12,637  tflops: 133.68  mfu: 13.52%
```

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: #164009
Approved by: https://github.com/soulitzer
…s in mixed precision (#162861)

**Summary:** Ensures that replicate can handle the same type casting behavior and edge cases that fully shard can when mixed precision is used

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_float16_on_one_submodule
2. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_submodules_with_external_inputs
3. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_bf16
4. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_fp16
5. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_clamp_reduce_dtype
6. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_dataclass_input

Pull Request resolved: #162861
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839, #162851, #162853, #162855
This is mostly a cosmetic change which replace the deprecating `data_ptr` API with mutable or const one.
Pull Request resolved: #164276
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
…162186)

Previous work #158352 delivered CUDAGraph memory footprint reduction with no replay-time impact, but capture time regressed (up to 20× slower) due to repeated full-graph traversals. See previous benchmark results [here](#158352 (comment))

This PR removes capture/reply overhead while preserving the memory savings:

1. **Terminals as free markers**
   We stop inserting empty nodes and instead record the current stream terminals as free markers. This avoids mutating the user’s graph and keeps semantics unchanged.

2. **Incremental, cached reachability**
   We add a **per-graph reuse context** that caches reverse-traversal state:

   * `graph_reuse_context[graph].visited[stream]` tracks nodes already seen from that stream’s terminal frontier.
   * On each allocation during capture, we resume traversal from the latest terminals and only visit unseen nodes.
   * A block is freed when all its recorded markers are in the visited set of its allocation stream—i.e., all markers are proven predecessors of future work.

See [the performance results here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRPvdd9Xa8W87ixbiA0da_qvOhrUAjUpFz0G-_j-MsDnoeRyhEa4_ut_W3rqcg1VVZVFJ-gucwov-3b/pubhtml?gid=1468302443&single=true), we sweep synthetic multi-stream CUDA Graphs built by `capture_benchmark.py` (same as before, we generate random interleaving of alloc/free/join with given probabilities, see [gist here](https://gist.github.com/eee4017/e2092d215b1d4bd46534148939af39e3)), and we compare median capture/replay times and memory. On an NVIDIA H100 PCIe across 24 configs, the optimization preserves reserved memory reduction at ~24–98%, leaves allocated memory unchanged, and brings capture time back to baseline (range 0.96–1.04× vs. baseline) with replay time unchanged (range 0.97–1.11×).

Pull Request resolved: #162186
Approved by: https://github.com/eqy, https://github.com/ngimel
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: #164190
Approved by: https://github.com/pytorchbot
Pull Request resolved: #164244
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
@pull pull bot locked and limited conversation to collaborators Oct 1, 2025
@pull pull bot added the ⤵️ pull label Oct 1, 2025
@pull pull bot merged commit ad7e3c9 into ais-developer:main Oct 1, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.