Don't add unoptimized steps to computational graph in coupled training#1013

Merged
jpdunc23 merged 3 commits into main from refactor/coupled-no-grad-steps
Mar 26, 2026
Conversation

@jpdunc23
Member

@jpdunc23 jpdunc23 commented Mar 26, 2026

Avoid adding unoptimized steps (i.e., those where LossContributionsConfig settings result in a loss weight of 0) to the computational graph by computing those steps under torch.no_grad(). In a production job, this change resulted in a ~13% decrease in GPU memory utilization.

Changes:

  • Adds step_is_optimized() helper method to CoupledStepperTrainLoss which can be passed to CoupledStepper.get_prediction_generator() via its new argument of the same name.

  • Adds tests

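A minimal sketch of the idea described above, assuming a simplified rollout loop (the function name `rollout` and its parameters are illustrative, not the repository's actual API). Steps for which the `step_is_optimized` callable returns False are computed under `torch.no_grad()`, so their activations are never stored for backpropagation:

```python
from contextlib import nullcontext
from typing import Callable

import torch


def rollout(
    model: torch.nn.Module,
    state: torch.Tensor,
    n_steps: int,
    component: str,
    step_is_optimized: Callable[[str, int], bool],
) -> list[torch.Tensor]:
    """Generate n_steps predictions; steps for which step_is_optimized
    returns False run under torch.no_grad() and stay out of the graph."""
    predictions = []
    for step in range(n_steps):
        # nullcontext() keeps gradient tracking on for optimized steps;
        # torch.no_grad() frees intermediate activations for the others.
        ctx = nullcontext() if step_is_optimized(component, step) else torch.no_grad()
        with ctx:
            state = model(state)
        predictions.append(state)
    return predictions
```

Note that a no_grad step produces a detached tensor, so gradients at later optimized steps flow only through the model parameters used from that point onward.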
@jpdunc23 jpdunc23 changed the title from "Don't add unoptimized steps to computational graph" to "Don't add unoptimized steps to computational graph in coupled training" on Mar 26, 2026
@jpdunc23
Member Author

jpdunc23 commented Mar 26, 2026

[Slide showing the reduction in GPU memory utilization.]

@jpdunc23 jpdunc23 marked this pull request as ready for review March 26, 2026 16:02
Contributor

@mcgibbon mcgibbon left a comment


Comment: Not something you need to change, but noting the separation of responsibilities is different between the coupled code and ace. In ace, the TrainStepper is responsible for deciding/knowing which steps should be optimized, keeping the loss object a simpler "gets the loss on a particular step" object. Here the loss defines the loss on a series of steps in a window, though the way it's called to compute the loss is still by passing particular steps.

I think this leads to more coupling between the train stepper and the loss, but also, I can see the feeling that because the window of losses is more complicated in the coupled case, it's nice to pull it out into a level other than the stepper.

initial_condition: CoupledPrognosticState,
forcing_data: CoupledBatchData,
optimizer: OptimizationABC,
step_is_optimized: Callable[[str, int], bool] | None = None,
Contributor


Issue: It took me a while to understand what this was doing; at first I misread the code below and thought that this argument overrides a default implementation that calls self.step_is_optimized, but then I noticed there's no self.

Suggestion: I think the behavior would be clear and the logic below simpler if you made the default lambda n, c: True or something similar.

Contributor


Suggested change
step_is_optimized: Callable[[str, int], bool] | None = None,
step_is_optimized: Callable[[str, int], bool] = lambda n, c: True,

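The reviewer's suggestion can be sketched as follows (a hedged illustration, not the repository's code; the function name `get_step_context` and the example component name are hypothetical). Defaulting the callable to one that returns True means every step is treated as optimized unless the caller says otherwise, eliminating the None check at the call site:

```python
from contextlib import nullcontext
from typing import Callable

import torch


def get_step_context(
    name: str,
    step: int,
    step_is_optimized: Callable[[str, int], bool] = lambda n, c: True,
):
    """Return the autograd context for one step. With the default, every
    step is optimized, so no `if step_is_optimized is None` branch is needed."""
    return nullcontext() if step_is_optimized(name, step) else torch.no_grad()
```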
Contributor

@mcgibbon mcgibbon left a comment


Approving pending the line suggestion or something similar.

@jpdunc23
Member Author


Agreed. Will refactor as in the implementation in #868 when I get back to that PR.

@jpdunc23 jpdunc23 merged commit c905556 into main Mar 26, 2026
7 checks passed
@jpdunc23 jpdunc23 deleted the refactor/coupled-no-grad-steps branch March 26, 2026 18:36
