[pull] main from pytorch:main by pull[bot] · Pull Request #543 · ais-developer/pytorch

pull · 2025-10-01T01:05:11Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)


…158652) Pull Request resolved: #158652 Approved by: https://github.com/JackCaoG, https://github.com/albanD

Follow up to #162620. Add half support, as well. This fixes some failures in inductor benchmarks such as from this log https://github.com/pytorch/pytorch/actions/runs/18051942373/job/51376749459. `NotImplementedError: "aminmax_kernel" not implemented for 'Half'` Pull Request resolved: #164175 Approved by: https://github.com/malfet, https://github.com/jerryzh168

These issues are detected by ruff [FURB171](https://docs.astral.sh/ruff/rules/single-item-membership-test/#single-item-membership-test-furb171). Pull Request resolved: #164224 Approved by: https://github.com/rec, https://github.com/Skylion007
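As a rough illustration of what FURB171 flags, a membership test against a single-item container can be written as a plain equality check. The function names below are hypothetical and are not taken from the PR.

```python
def is_half_verbose(dtype_name: str) -> bool:
    # Flagged by FURB171: membership test against a one-element tuple.
    return dtype_name in ("float16",)


def is_half(dtype_name: str) -> bool:
    # Preferred form: a direct equality comparison.
    return dtype_name == "float16"


# For ordinary values such as strings, both forms give the same answer.
for name in ("float16", "float32", ""):
    assert is_half_verbose(name) == is_half(name)
```

The equality form also avoids building a throwaway tuple on every call.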

Those were very useful in the past, because:
- CI builder jobs did not generate wheels, but rather ran `python setup.py develop` and shared Docker layers. This is no longer the case: all CI jobs now produce wheels.
- CD jobs were targeting the pre-CXX11 ABI, but this is no longer the case after the manylinux2_28 migration.

Existing, but acceptable gaps:
- Windows libtorch debug builds might sometimes fail, but IMO it's OK not to be able to produce those for a few days, as the number of libtorch users is fairly small.
- All CD jobs are based on AlmaLinux, while CI jobs are based on Ubuntu. This could be adjusted if needed; besides, AlmaLinux-9 and Ubuntu-22.04 are pretty close in terms of glibc and gcc versions.
- CD jobs build for all GPU architectures, while CI builds only for the one being tested, but there are now periodic H100 and B200 jobs, and not a lot of development happens for Voltas or Pascals.

Besides, there are better tools to alert about nightly failures.

Pull Request resolved: #164260 Approved by: https://github.com/seemethere, https://github.com/atalman

Add input checks like meta functions for standard ops in `ATen/native/LinearAlgebra.cpp` for the `out_dtype` variants. Fixes silent incorrectness in #163816 Pull Request resolved: #164095 Approved by: https://github.com/ngimel

Pull Request resolved: #164225 Approved by: https://github.com/Skylion007

…obatching (#162839)

**Summary:** To ensure that replicate acts as intended (a specialized version of HSDP), it must pass the same training tests that fully_shard does. The first test verifies that replicate works properly with gradient accumulation. The second verifies that replicate works correctly with a One-Forward-One-Backward (1F1B) pipeline parallelism schedule.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_gradient_accumulation
2. pytest test/distributed/_composable/test_replicate_training.py -k test_1f1b_microbatching

Pull Request resolved: #162839 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836
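For readers unfamiliar with the 1F1B ordering mentioned in the summary, here is a minimal sketch of the textbook per-stage schedule in plain Python. The function `one_f_one_b` and its parameters are hypothetical names chosen for illustration; this is not PyTorch's pipeline schedule implementation, and it ignores communication entirely.

```python
def one_f_one_b(stage: int, num_stages: int, num_microbatches: int):
    """Return the ordered ("F"/"B", microbatch) steps for one pipeline stage."""
    # Earlier stages run a few forwards first so later stages have work.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steps = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    # Steady state: alternate one forward with one backward.
    while fwd < num_microbatches:
        steps.append(("F", fwd))
        fwd += 1
        steps.append(("B", bwd))
        bwd += 1
    # Cooldown: drain the remaining backwards.
    while bwd < num_microbatches:
        steps.append(("B", bwd))
        bwd += 1
    return steps


for stage in range(2):
    print(stage, one_f_one_b(stage, num_stages=2, num_microbatches=2))
```

With two stages and two microbatches, the last stage alternates forward and backward immediately, while the first stage runs one extra forward before its first backward.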

**Summary:** Tests that replicate works when users use custom forward methods.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_register_fsdp_forward_method

Pull Request resolved: #162851 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836, #162839

**Summary:** Proof that new replicate API is composable with TP **Test Case** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_replicate_tp Pull Request resolved: #162853 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836, #162839, #162851

…sion (#162855) **Summary:** Ensures that replicate functionality works the same as fully shard's when mixed precision is used **Test Cases** 1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k TestReplicateMixedPrecisionTraining Pull Request resolved: #162855 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836, #162839, #162851, #162853

Fixes #ISSUE_NUMBER Pull Request resolved: #164219 Approved by: https://github.com/Lucaskabela

for fsdp2 + EP, titan has fully_shard(AC(layer)) and fully_shard(layer.moe.experts): pytorch/torchtitan#1624

for implicit prefetching, backward order is
* _pre_backward unshard (norm, output)
* _backward_prefetch unshard layers.6
* post_backward reshard (norm, output)
* _pre_backward unshard layers.6 (no-op, unsharded already)
* _backward_prefetch unshard layers.6.moe.experts
* recompute_fn pre_forward unshard layers.6.moe.experts (no-op, unsharded already)
* ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR makes it a no-op
* _pre_backward unshard layers.6.moe.experts (no-op, unsharded already)
* _backward_prefetch unshard layers.5
* post_backward reshard layers.6.moe.experts
* post_backward reshard layers.6

unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_comm.py -k test_set_modules_to_backward_prefetch_inside_ac`

before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - step: 1 loss: 12.0162 grad_norm: 1.7315 memory: 45.64GiB(48.05%) tps: 1,028 tflops: 10.87 mfu: 1.10%
[rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:43:35,233 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 11:43:35,987 - root - INFO - step: 50 loss: 6.9302 grad_norm: 0.9985 memory: 59.66GiB(62.80%) tps: 11,712 tflops: 123.89 mfu: 12.53%
```

after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - step: 1 loss: 12.0134 grad_norm: 1.6916 memory: 38.42GiB(40.45%) tps: 805 tflops: 8.51 mfu: 0.86%
[rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:39:28,541 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 11:39:29,279 - root - INFO - step: 50 loss: 6.9346 grad_norm: 1.1875 memory: 52.58GiB(55.36%) tps: 12,583 tflops: 133.10 mfu: 13.46%
```

for explicit prefetching, layers.6 backward prefetches layers.5 and layers.5.moe.experts. layers.6.moe.experts does not have an explicit prefetch. backward order is like this
* _pre_backward unshard (norm, output)
* _prefetch_unshard layers.6
* post_backward reshard (norm, output)
* _pre_backward unshard layers.6 (no-op, unsharded already)
* _prefetch_unshard layers.5
* _prefetch_unshard layers.5.moe.experts
* recompute_fn pre_forward unshard layers.6.moe.experts
* ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR makes it a no-op
* _pre_backward unshard layers.6.moe.experts (no-op, unsharded already)
* post_backward reshard layers.6.moe.experts
* post_backward reshard layers.6

before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - step: 1 loss: 12.0180 grad_norm: 1.6948 memory: 45.77GiB(48.18%) tps: 849 tflops: 8.98 mfu: 0.91%
[rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:53:57,768 - root - INFO - [GC] Performing periodical GC collection 0.07 seconds
[rank0]:[titan] 2025-09-30 11:53:58,515 - root - INFO - step: 50 loss: 6.9358 grad_norm: 1.0528 memory: 59.80GiB(62.95%) tps: 11,827 tflops: 125.10 mfu: 12.65%
```

after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - step: 1 loss: 12.0143 grad_norm: 1.7030 memory: 38.55GiB(40.58%) tps: 988 tflops: 10.45 mfu: 1.06%
[rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 12:09:10,482 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 12:09:11,168 - root - INFO - step: 50 loss: 6.9356 grad_norm: 0.9911 memory: 52.81GiB(55.59%) tps: 12,637 tflops: 133.68 mfu: 13.52%
```

Pull Request resolved: #164009 Approved by: https://github.com/soulitzer
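To put the before/after logs side by side, the short script below computes the relative change in step-50 memory and throughput. The numbers are copied from the logs above; the helper name `pct_change` and the dictionary layout are illustrative choices only.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Step-50 figures from the logs: (memory in GiB, tokens per second).
runs = {
    "implicit prefetch": {"before": (59.66, 11712), "after": (52.58, 12583)},
    "explicit prefetch": {"before": (59.80, 11827), "after": (52.81, 12637)},
}

summary = {}
for name, run in runs.items():
    mem_before, tps_before = run["before"]
    mem_after, tps_after = run["after"]
    summary[name] = (
        pct_change(mem_before, mem_after),
        pct_change(tps_before, tps_after),
    )
    print(f"{name}: memory {summary[name][0]:+.1f}%, tps {summary[name][1]:+.1f}%")
```

In both configurations the fix lowers step-50 memory by roughly 12% and raises tokens per second by roughly 7%, based on these single runs.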

…s in mixed precision (#162861)

**Summary:** Ensures that replicate can handle the same type casting behavior and edge cases that fully shard can when mixed precision is used

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_float16_on_one_submodule
2. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_submodules_with_external_inputs
3. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_bf16
4. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_fp16
5. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_clamp_reduce_dtype
6. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_dataclass_input

Pull Request resolved: #162861 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836, #162839, #162851, #162853, #162855

This is mostly a cosmetic change that replaces the `data_ptr` API, which is being deprecated, with the mutable or const variant. Pull Request resolved: #164276 Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501

…162186)

Previous work #158352 delivered a CUDAGraph memory footprint reduction with no replay-time impact, but capture time regressed (up to 20× slower) due to repeated full-graph traversals. See previous benchmark results [here](#158352 (comment)).

This PR removes the capture/replay overhead while preserving the memory savings:

1. **Terminals as free markers** We stop inserting empty nodes and instead record the current stream terminals as free markers. This avoids mutating the user’s graph and keeps semantics unchanged.
2. **Incremental, cached reachability** We add a **per-graph reuse context** that caches reverse-traversal state:
   * `graph_reuse_context[graph].visited[stream]` tracks nodes already seen from that stream’s terminal frontier.
   * On each allocation during capture, we resume traversal from the latest terminals and only visit unseen nodes.
   * A block is freed when all its recorded markers are in the visited set of its allocation stream, i.e., all markers are proven predecessors of future work.

See [the performance results here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRPvdd9Xa8W87ixbiA0da_qvOhrUAjUpFz0G-_j-MsDnoeRyhEa4_ut_W3rqcg1VVZVFJ-gucwov-3b/pubhtml?gid=1468302443&single=true). We sweep synthetic multi-stream CUDA Graphs built by `capture_benchmark.py` (as before, we generate random interleavings of alloc/free/join with given probabilities; see [gist here](https://gist.github.com/eee4017/e2092d215b1d4bd46534148939af39e3)) and compare median capture/replay times and memory. On an NVIDIA H100 PCIe across 24 configs, the optimization preserves reserved memory reduction at ~24–98%, leaves allocated memory unchanged, and brings capture time back to baseline (range 0.96–1.04× vs. baseline) with replay time unchanged (range 0.97–1.11×).

Pull Request resolved: #162186 Approved by: https://github.com/eqy, https://github.com/ngimel
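The incremental reachability idea described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the graph is a dictionary mapping each node to its predecessors, and `ReuseContext`, `advance`, and `can_free` are hypothetical names, not the allocator's actual code.

```python
from collections import defaultdict


class ReuseContext:
    """Caches, per stream, the nodes already proven to precede its frontier."""

    def __init__(self, preds):
        self.preds = preds                 # node -> list of predecessor nodes
        self.visited = defaultdict(set)    # stream -> nodes seen so far
        self.steps = 0                     # nodes expanded, to show the saving

    def advance(self, stream, terminals):
        # Resume the reverse traversal from the newest terminals only,
        # skipping everything a previous call already visited.
        seen = self.visited[stream]
        stack = [t for t in terminals if t not in seen]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            self.steps += 1
            stack.extend(p for p in self.preds.get(node, ()) if p not in seen)

    def can_free(self, stream, markers):
        # A block is reusable once every free marker is a known predecessor.
        return all(m in self.visited[stream] for m in markers)


preds = {"a": [], "b": ["a"], "c": ["b"], "x": [], "d": ["c", "x"]}
ctx = ReuseContext(preds)
ctx.advance("s0", ["c"])          # visits c, b, a
first = ctx.steps
ctx.advance("s0", ["d"])          # visits only d and x
print(first, ctx.steps, ctx.can_free("s0", ["x"]))
```

The second call expands two new nodes instead of re-walking the whole graph, which is the property the PR relies on to bring capture time back to baseline.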

Pull Request resolved: #164253 Approved by: https://github.com/zpcore

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: #164190 Approved by: https://github.com/pytorchbot

Pull Request resolved: #163999 Approved by: https://github.com/mikaylagawarecki

Pull Request resolved: #164244 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007 Co-authored-by: Jeff Daily <jeff.daily@amd.com>

qihqi and others added 19 commits September 30, 2025 19:21

Update persons of interest for XLA. The previous one is out of date. (#…

60f0a35

…158652) Pull Request resolved: #158652 Approved by: https://github.com/JackCaoG, https://github.com/albanD

Missing lambda in torch._check (#164225)

9e63139

Pull Request resolved: #164225 Approved by: https://github.com/Skylion007
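The commit title refers to `torch._check`, whose message argument is expected to be a callable so that the string is only built when the check fails. The sketch below uses a hypothetical stand-in, `check`, written in plain Python to show the pattern; it is not the PyTorch implementation, and `describe` is an illustrative helper.

```python
from typing import Callable, Optional


def check(cond: bool, message: Optional[Callable[[], str]] = None) -> None:
    """Hypothetical stand-in for a check with a lazily built message."""
    if cond:
        return
    if message is not None and not callable(message):
        raise TypeError("message must be a callable, e.g. a lambda")
    raise RuntimeError(message() if message is not None else "check failed")


calls = []


def describe(shape):
    # Records each call so we can see when the message is actually built.
    calls.append(shape)
    return f"unexpected shape {shape}"


# The lambda defers formatting: nothing is built while the check passes.
check(True, lambda: describe((2, 3)))
print(calls)
```

Passing a bare f-string instead of a lambda would format the message on every call, and in this stand-in it is rejected with a `TypeError` when the check fails.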

[export] avoid checks during tracing of export verification (#164219)

adc11a7

Fixes #ISSUE_NUMBER Pull Request resolved: #164219 Approved by: https://github.com/Lucaskabela

[DTensor] Allow redistribute to Partial if src matches (#164253)

60a4961

Pull Request resolved: #164253 Approved by: https://github.com/zpcore

Migrate DeviceType to torch/headeronly (#163999)

7f3dc45

Pull Request resolved: #163999 Approved by: https://github.com/mikaylagawarecki

[ROCm][CD] librocroller.so missing from ROCm 7 wheel (#164244)

ad7e3c9

Pull Request resolved: #164244 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007 Co-authored-by: Jeff Daily <jeff.daily@amd.com>

pull bot locked and limited conversation to collaborators Oct 1, 2025

pull bot added the ⤵️ pull label Oct 1, 2025

pull bot merged commit ad7e3c9 into ais-developer:main Oct 1, 2025

pull[bot] merged 19 commits into ais-developer:main from pytorch:main


14 participants
