Realm support for data movement by elliottslaughter · Pull Request #1637 · flexflow/flexflow-train

elliottslaughter · 2026-03-18T04:13:52Z

This PR adds support for issuing copies in Realm when operators are spread out over multiple devices. In principle this should enable distributed model parallelism. Other parallel operators (e.g., those required for data parallelism) are not implemented.

Overview of changes:

Adds CopyAttrs to TrainingOperationAttrs in task-spec to permit copies to be represented in the dynamic graph
DynamicValueAttrs now track their mapping explicitly
Adds an explicit copy insertion pass to the dynamic graph that fills mapping on DynamicValueAttrs and inserts copies where this would break edges in the dependence graph
Updates shard expansion to expand copies
Updates Realm infrastructure to issue copies when present in the dynamic graph
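
The copy insertion step listed above can be illustrated with a minimal sketch. This is not the code from this PR: `Edge`, `Copy`, `Placement`, and `insert_copies` are hypothetical, heavily simplified stand-ins for the dynamic graph and DynamicValueAttrs, and the sketch assumes a copy is needed exactly when the producer and consumer of a value are placed on different devices.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a value's mapping: just a device index.
using Placement = int;

// Hypothetical stand-in for an edge in the dynamic graph.
struct Edge {
  std::string producer;
  Placement producer_placement;
  std::string consumer;
  Placement consumer_placement;
};

// A copy that moves a produced value from one placement to another.
struct Copy {
  std::string value_from;
  Placement src;
  Placement dst;
};

// Returns one copy per edge whose endpoints live on different devices.
// Edges whose endpoints share a placement need no data movement.
std::vector<Copy> insert_copies(std::vector<Edge> const &edges) {
  std::vector<Copy> copies;
  for (Edge const &e : edges) {
    if (e.producer_placement != e.consumer_placement) {
      copies.push_back(
          Copy{e.producer, e.producer_placement, e.consumer_placement});
    }
  }
  return copies;
}
```

With edges a -> b on the same device and b -> c across devices, only the second edge yields a copy.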

lockshaw

@lockshaw reviewed 26 files and all commit messages, and made 13 comments.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on elliottslaughter).

lib/realm-execution/src/realm-execution/pcg_instance.cc line 163 at r1 (raw file):

}

static Realm::Event spawn_dynamic_node_invocation(

Minor: A small docstring would be nice, as the number of arguments makes it a bit hard to quickly skim for the meaning of this function

lib/realm-execution/src/realm-execution/realm_context.cc line 174 at r1 (raw file):

      /*field_id=*/0,
      /*size=*/
      static_cast<size_t>(int{size_of_datatype(src_piece_shape.data_type)}),

Minor: Slightly clearer/more idiomatic in the codebase

Suggestion:

      static_cast<size_t>(size_of_datatype(src_piece_shape.data_type).int_from_positive_int()),

lib/realm-execution/src/realm-execution/realm_context.cc line 218 at r1 (raw file):

    default:
      PANIC("TensorShape dims greater than REALM_MAX_DIM",
            fmt::to_string(src_piece_shape.dims.ff_ordered.num_dims()));

Minor: I don't think you need the explicit to_string call

Suggestion:

      PANIC("TensorShape dims greater than REALM_MAX_DIM: {}", src_piece_shape.dims.ff_ordered.num_dims());

lib/task-spec/include/task-spec/dynamic_graph/copy_insertion.h line 0 at r1 (raw file):
Add a high-level explanation of copy insertion to dynamic_graph/index.dox and (ideally) link to there from here

lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 115 at r1 (raw file):

      auto const &[filtered_source, filtered_use] =
          filter_mapping_to_avoid_degenerate_copies(source_value, use_value);
      DynamicNodeInvocation copy{

FYI Normally we'd do DynamicNodeInvocation copy = DynamicNodeInvocation{, so that would technically be slightly more idiomatic in the codebase, but it really doesn't matter much

Code quote:

      DynamicNodeInvocation copy{

lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

  ASSERT(no_part_of_graph_is_copy_inserted(g));

  std::unordered_map<DynamicValueAttrs, DynamicValueAttrs> sources;

A more specific variable name here would be really helpful, especially since the type declaration is not particularly illuminating

Code quote:

  std::unordered_map<DynamicValueAttrs, DynamicValueAttrs> sources;

lib/task-spec/src/task-spec/dynamic_graph/dynamic_task_type.cc line 6 at r1 (raw file):

namespace FlexFlow {

DynamicTaskType decide_copy_task_type(DynamicTensorRole role) {

Minor: Slightly clearer name. Unless I'm misunderstanding, this function isn't really doing any "deciding", it's really just flattening some nesting of task types

Suggestion:

DynamicTaskType dynamic_task_type_from_tensor_role(DynamicTensorRole role) {

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 49 at r1 (raw file):

        ParallelTensorSpaceCoordinate const &parallel_tensor_coord) {
  return filter_keys(mapping, [&](ParallelTensorSpaceCoordinate const &p) {
    return p == parallel_tensor_coord;

If you're fixing the key, what's the point of returning a bidict over an unordered_set from this function?

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 90 at r1 (raw file):

static std::unordered_set<DynamicNodeInvocation>
    perform_shard_expansion_for_copy(DynamicNodeInvocation const &i) {
  auto const &[input_slot, input] = get_only(i.inputs);

Minor: Generally I discourage assigning references unless necessary, as it opens up more room for lifetime bugs

Suggestion:

 auto [input_slot, input] = get_only(i.inputs);

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

  bidict<ParallelTensorSpaceCoordinate, MachineSpaceCoordinate> output_mapping =
      assert_unwrap(output.mapping);
  require_same(input_mapping.left_values(), output_mapping.left_values());

Minor: Slightly more idiomatic, as that way you don't have to arbitrarily choose which mapping (input or output) to use in the rest of the function

Suggestion:

  bidict<ParallelTensorSpaceCoordinate, MachineSpaceCoordinate> mapping =
    require_same(assert_unwrap(input.mapping), assert_unwrap(output.mapping));

lib/task-spec/test/src/task-spec/dynamic_graph/copy_insertion.cc line 388 at r1 (raw file):

    SUBCASE("copy one tensor, one point") {
      std::unordered_map<DynamicValueAttrs, DynamicValueAttrs> sources_copy1{
          {graph_input1, graph_input1_src_copy1},

It seems that some of the initialization is used for a single subcase? If so, it would make it more readable to move the creation of those subcase-specific values into the subcase itself.

Also, is there any way to shrink the amount of stuff needed in this test? Wading through all the construction is not fun, though admittedly maybe just coalescing the setup into the subcases will make the storyline of the setup sufficiently clear that we won't need to do this.

Code quote:

         {graph_input1, graph_input1_src_copy1},

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

          DynamicNodeAttrs{
              /*task_type=*/std::nullopt,
              /*device_coord=*/device_coord,

What is the meaning of the device placement of a copy? Is it the source or destination of the copy? It seems like either way that's going to run into issues in the backward pass, where the copy will have to operate the other direction, but I don't see any code for handling that currently?

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

              },
          },
      };

Minor: Might be clearer to make this a modification of input rather than a full reconstruction? It looks like very little is changed, but it's kinda hard to spot what changes in all the initialization

Code quote:

      return DynamicNodeInvocation{
          /*inputs=*/{
              {
                  mk_slot(TensorSlotName::INPUT),
                  mk_value(0,
                           TensorSlotName::OUTPUT,
                           src_binding,
                           tensor_shard_coord),
              },
          },
          /*node_attrs=*/
          DynamicNodeAttrs{
              /*task_type=*/std::nullopt,
              /*device_coord=*/device_coord,
              /*mapping=*/std::nullopt,
              /*op_attrs=*/TrainingOperationAttrs{CopyAttrs{}},
              /*layer_guid=*/dynamic_layer_guid_t{dynamic_copy_layer_guid_t{}},
              /*per_device_op_state=*/std::nullopt,
          },
          /*outputs=*/
          {
              {
                  mk_slot(TensorSlotName::OUTPUT),
                  mk_value(20,
                           TensorSlotName::OUTPUT,
                           dst_binding,
                           tensor_shard_coord),
              },
          },
      };

elliottslaughter

@elliottslaughter made 8 comments and resolved 4 discussions.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on lockshaw).

lib/task-spec/include/task-spec/dynamic_graph/copy_insertion.h line at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add a high-level explanation of copy insertion to dynamic_graph/index.dox and (ideally) link to there from here

I added the link next to the others, but I don't see any existing explanations of any other passes there (or anywhere). Am I missing something? I think the version pushed with this comment should match the standard that the others are currently held to (as far as I can tell).

lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

A more specific variable name here would be really helpful, especially since the type declaration is not particularly illuminating

Done.

lib/task-spec/src/task-spec/dynamic_graph/dynamic_task_type.cc line 6 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Minor: Slightly clearer name. Unless I'm misunderstanding, this function isn't really doing any "deciding", it's really just flattening some nesting of task types

Yes, but it's for a copy. Otherwise it's nonsensical to even be speaking of converting a tensor role into a task type.

I named it dynamic_task_type_from_tensor_role_for_copy in the revision but let me know if you have something better.

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 49 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

If you're fixing the key, what's the point of returning a bidict over an unordered_set from this function?

Because this has to plug back into the mapping field of DynamicValueAttrs, which we discussed previously makes sense to be a bidict due to the properties that operators have.

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Minor: Slightly more idiomatic, as that way you don't have to arbitrarily choose which mapping (input or output) to use in the rest of the function

Unless I'm misunderstanding something, this does not work because the input and output mappings are NOT the same. This is a copy, not an operator. If the mappings were the same a whole lot of code could be simplified.

lib/task-spec/test/src/task-spec/dynamic_graph/copy_insertion.cc line 388 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It seems that some of the initialization is used for a single subcase? If so, it would make it more readable to move the creation of those subcase-specific values into the subcase itself.

Also, is there any way to shrink the amount of stuff needed in this test? Wading through all the construction is not fun, though admittedly maybe just coalescing the setup into the subcases will make the storyline of the setup sufficiently clear that we won't need to do this.

See if you're satisfied with the updated version. I'm honestly not sure there's a way to meaningfully simplify any of this without reducing the number or complexity of the covered test scenarios, but if you have ideas let me know.

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is the meaning of the device placement of a copy? Is it the source or destination of the copy? It seems like either way that's going to run into issues in the backward pass, where the copy will have to operate the other direction, but I don't see any code for handling that currently?

Right now this is sort of meaningless because we have one controller issuing all copies for the entire graph, no matter where they are. However, the intention is for this to be the "owner" or "issuer" of the copy, which matters a lot more down the road once we write the control-replicated version of the Realm backend. At that point, you will have to pick one specific node to issue the copy, and the program should be correct no matter what you pick, but there may be performance implications to those choices.

TL;DR: there should never be a correctness concern due to this choice but in the future there may be performance decisions related to it.

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Minor: Might be clearer to make this a modification of input rather than a full reconstruction? It looks like very little is changed, but it's kinda hard to spot what changes in all the initialization

Honestly, this doesn't seem any worse than the pre-existing test case in this file. I don't disagree necessarily but it seems to be a feature of these tests generally?

lockshaw

@lockshaw reviewed 8 files and all commit messages, made 7 comments, and resolved 3 discussions.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on elliottslaughter).

lib/task-spec/include/task-spec/dynamic_graph/copy_insertion.h line at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

I added the link next to the others, but I don't see any existing explanations of any other passes there (or anywhere). Am I missing something? I think the version pushed with this comment should match the standard that the others are currently held to (as far as I can tell).

This would be the first one, I'm still working on backfilling the previous ones as this doc is new. Happy to merge without it for now, but can you add it in a follow-up PR?

lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Done.

I would normally expect "source" to represent something analogous to a Node or a DataflowOutput, but it seems that here it's a value. Maybe mapped_source_value would be a clearer name?

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 49 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Because this has to plug back into the mapping field of DynamicValueAttrs, which we discussed previously makes sense to be a bidict due to the properties that operators have.

Makes sense. BTW we'd normally call a function like this something like restrict_tensor_mapping_keys_to_coord, along the lines of utils/containers/restrict_keys.h, but this is not a big deal, especially since this function is only used here
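
The point of this thread (fix a single key, but keep returning a full mapping so the result can be stored back into the mapping field) can be sketched as follows. This is an illustration under stated assumptions, not the codebase's API: `std::unordered_map` stands in for bidict, and `restrict_keys_to` is a hypothetical name that does not reproduce the actual signature in utils/containers/restrict_keys.h.

```cpp
#include <cassert>
#include <unordered_map>

// Keeps only the entry whose key equals `key`, if there is one.
// The result has the same map type as the input, so it can be assigned
// back to a field that expects a whole mapping rather than a bare value.
template <typename K, typename V>
std::unordered_map<K, V> restrict_keys_to(std::unordered_map<K, V> const &m,
                                          K const &key) {
  std::unordered_map<K, V> result;
  auto it = m.find(key);
  if (it != m.end()) {
    result.insert(*it);
  }
  return result;
}
```

Restricting to a key that is absent yields an empty mapping rather than an error; whether that is the right behavior for the real function is a design choice this sketch does not settle.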

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Unless I'm misunderstanding something, this does not work because the input and output mappings are NOT the same. This is a copy, not an operator. If the mappings were the same a whole lot of code could be simplified.

Oh, missed the .left_values() access here. In that case, you'd require_same on the .left_values(), which would be fine but also isn't quite as clean as the former because you still need input_mapping hanging around
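
To make the distinction in this thread concrete: for a copy, the input and output mappings cover the same tensor shards (keys) but place them on different devices (values), so only the key sets can be required to match. The sketch below assumes simplified types: `std::map<int, int>` stands in for bidict<ParallelTensorSpaceCoordinate, MachineSpaceCoordinate>, `keys_of` plays the role of .left_values(), and this `require_same` is a hypothetical stand-in rather than the codebase's helper.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>

// Hypothetical helper: returns the common value, or throws if they differ.
template <typename T>
T require_same(T const &a, T const &b) {
  if (!(a == b)) {
    throw std::runtime_error("require_same: values differ");
  }
  return a;
}

// Collects the keys of a map (the analogue of left_values() on a bidict).
template <typename K, typename V>
std::set<K> keys_of(std::map<K, V> const &m) {
  std::set<K> keys;
  for (auto const &kv : m) {
    keys.insert(kv.first);
  }
  return keys;
}

// For a copy, source and destination must cover the same shards even
// though the devices they map to are allowed to differ.
std::set<int> shards_of_copy(std::map<int, int> const &src,
                             std::map<int, int> const &dst) {
  return require_same(keys_of(src), keys_of(dst));
}
```

Comparing the full mappings instead of the key sets would reject every real copy, since a copy exists precisely because the placements differ.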

lib/task-spec/test/src/task-spec/dynamic_graph/copy_insertion.cc line 388 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

See if you're satisfied with the updated version. I'm honestly not sure there's a way to meaningfully simplify any of this without reducing the number or complexity of the covered test scenarios, but if you have ideas let me know.

No worries, while not perfect this is already quite improved--thanks!

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Right now this is sort of meaningless because we have one controller issuing all copies for the entire graph, no matter where they are. However, the intention is for this to be the "owner" or "issuer" of the copy, which matters a lot more down the road once we write the control-replicated version of the Realm backend. At that point, you will have to pick one specific node to issue the copy, and the program should be correct no matter what you pick, but there may be performance implications to those choices.

TL;DR: there should never be a correctness concern due to this choice but in the future there may be performance decisions related to it.

Got it, would you mind adding a quick explanation of that (essentially a copy of the above comment would be fine) into a docstring on that field in DynamicNodeAttrs then? Just so we don't lose track of the reasoning in the future

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Honestly, this doesn't seem any worse than the pre-existing test case in this file. I don't disagree necessarily but it seems to be a feature of these tests generally?

At some point you do need to construct raw values to test stuff, and frequently we just reconstruct the result, because for more complicated transformations you'd end up reimplementing the whole function under test in the test itself. Here, though, only one field changes, so at least to me the tradeoff seems worth it for the improved clarity.
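
The testing style suggested here (derive the expected value from the input and change only the field that differs) might look like the sketch below. All names are hypothetical: `NodeAttrs` is a two-field stand-in for DynamicNodeAttrs, and `place_on_device` is a toy function under test, not anything from shard_expansion.cc.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, heavily simplified stand-in for node attributes.
struct NodeAttrs {
  std::optional<int> device_coord;
  std::string op;

  bool operator==(NodeAttrs const &other) const {
    return device_coord == other.device_coord && op == other.op;
  }
};

// Toy function under test: assigns a device to a node.
NodeAttrs place_on_device(NodeAttrs attrs, int device) {
  attrs.device_coord = device;
  return attrs;
}

// The expected value starts as a copy of the input and changes one field,
// so the single difference is visible at a glance.
bool test_placement_changes_only_device() {
  NodeAttrs input = NodeAttrs{std::nullopt, "copy"};

  NodeAttrs expected = input;
  expected.device_coord = 3;

  return place_on_device(input, 3) == expected;
}
```

The tradeoff mentioned above still applies: once several fields change, deriving the expected value this way starts to mirror the function under test, and spelling out the full value becomes the safer choice.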

elliottslaughter

@elliottslaughter made 5 comments and resolved 2 discussions.
Reviewable status: 22 of 27 files reviewed, 3 unresolved discussions (waiting on lockshaw).

lib/task-spec/include/task-spec/dynamic_graph/copy_insertion.h line at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This would be the first one, I'm still working on backfilling the previous ones as this doc is new. Happy to merge without it for now, but can you add it in a follow-up PR?

I added a basic description for now. I'm happy to iterate further after this PR once I see what your intention is for the rest of these.

lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I would normally expect "source" to represent something analogous to a Node or a DataflowOutput, but it seems that here it's a value. Maybe mapped_source_value would be a clearer name?

Sure, that's fine. Just to clarify, this is the source from the perspective of the copy, which by definition takes as its input the output of some other operator. Since this is the copy insertion pass I use copy-centric language rather than operator-centric.

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Oh, missed the .left_values() access here. In that case, you'd require_same on the .left_values(), which would be fine but also isn't quite as clean as the former because you still need input_mapping hanging around

I think I got it this time, but double check if I understood your meaning.

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Got it, would you mind adding a quick explanation of that (essentially a copy of the above comment would be fine) into a docstring on that field in DynamicNodeAttrs then? Just so we don't lose track of the reasoning in the future

I added a comment into the main source file for shard_expansion.cc, because it seems odd to document this decision inside a test.

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

At some point you do need to construct raw values to test stuff, and frequently we just reconstruct the result, because for more complicated transformations you'd end up reimplementing the whole function under test in the test itself. Here, though, only one field changes, so at least to me the tradeoff seems worth it for the improved clarity.

Done. Note that inputs and outputs are not identical so we don't get any compression there, but node_attrs is mostly similar so we do get some benefit in that one.

lockshaw

@lockshaw reviewed 6 files and all commit messages, made 4 comments, and resolved 3 discussions.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on elliottslaughter).

lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Sure, that's fine. Just to clarify, this is the source from the perspective of the copy, which by definition takes as its input the output of some other operator. Since this is the copy insertion pass I use copy-centric language rather than operator-centric.

I meant unmapped_value_to_mapped_source_value--sorry for the confusion, I pushed the fix directly so you don't have to take care of it.

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

I think I got it this time, but double check if I understood your meaning.

What you have here is totally fine--this was mainly a comment from when I missed the .left_values() access, so don't worry about it anymore

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

I added a comment into the main source file for shard_expansion.cc, because it seems odd to document this decision inside a test.

I meant in the actual dtg.toml file--I pushed the fix so you don't have to take care of it, but I also like the comment in copy_insertion.cc so let's keep both.

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Done. Note that inputs and outputs are not identical so we don't get any compression there, but node_attrs is mostly similar so we do get some benefit in that one.

Got it, I missed the change in the inputs and outputs when originally reviewing this. What you have now is great.

codecov · 2026-03-20T23:34:19Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (94fd1fc) to head (679b25f).
⚠️ Report is 1 commits behind head on master.

elliottslaughter added 16 commits March 17, 2026 21:11

Initial work on exposing copies to dynamic graph.

9d68a63

And now with actually fixed tests.

b8af3ab

More work on copy insertion.

4b8411e

Don't test mapping in machine_slicing.

76afe66

Basic test for no copies.

efc8d56

Test copy case.

f43019f

Check no copies pre-exist copy insertion.

890bf54

Filter to avoid degenerate copies.

9a1479c

Wire up copy insertion and fix shard expansion.

49fd444

Sketch interface for issuing operations.

878b822

Sketch interface for copies.

b4fcf43

Implement copies.

9d06077

Assign copies to a phase based on tensor roles.

4560087

Update shard expansion test to include copy case.

fe37fc9

It is safe to return NO_EVENT for nop tasks even in presence of dependencies.

fdb4ebe

Update to match Realm PR changes.

fbb9eba

lockshaw self-requested a review March 18, 2026 04:15

lockshaw requested changes Mar 19, 2026

lockshaw and others added 2 commits March 19, 2026 01:26

Merge branch 'master' into realm-data-movement-explicit

8778da4

Updates in response to feedback.

5304a6b

elliottslaughter commented Mar 19, 2026

lockshaw requested changes Mar 20, 2026

Respond to PR feedback.

ca4de4d

elliottslaughter commented Mar 20, 2026

lockshaw added 2 commits March 20, 2026 15:12

Fixes from PR review

d47668a

Format

679b25f

lockshaw approved these changes Mar 20, 2026

lockshaw enabled auto-merge (squash) March 20, 2026 22:19

lockshaw merged commit 5f49bc7 into flexflow:master Mar 20, 2026
3 of 4 checks passed

elliottslaughter deleted the realm-data-movement-explicit branch March 21, 2026 02:32

lockshaw merged 21 commits into flexflow:master from elliottslaughter:realm-data-movement-explicit