Realm support for data movement #1637

Merged
lockshaw merged 21 commits into flexflow:master from
elliottslaughter:realm-data-movement-explicit
Mar 20, 2026

@elliottslaughter
Contributor

@elliottslaughter elliottslaughter commented Mar 18, 2026

This PR adds support for issuing copies in Realm when operators are spread out over multiple devices. In principle this should enable distributed model parallelism. Other parallel operators (e.g., required for data parallelism) are not implemented.

Overview of changes:

  • Adds CopyAttrs to TrainingOperationAttrs in task-spec to permit copies to be represented in the dynamic graph
  • DynamicValueAttrs now track their mapping explicitly
  • Adds an explicit copy insertion pass to the dynamic graph that fills mapping on DynamicValueAttrs and inserts copies where this would break edges in the dependence graph
  • Update shard expansion to expand copies
  • Update Realm infrastructure to issue copies when present in the dynamic graph
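
As a rough illustration of the copy insertion idea in the list above, the sketch below walks the edges of a toy dependence graph and emits a copy wherever the producer and consumer sit on different devices. Everything here (`Edge`, `Copy`, `insert_copies`, integer device IDs) is a hypothetical simplification for illustration, not the types or logic used in the PR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the PR's dynamic graph types.
struct Edge {
  std::string producer;
  int producer_device;
  std::string consumer;
  int consumer_device;
};

struct Copy {
  std::string value_from; // name of the node whose output is copied
  int src_device;
  int dst_device;
};

// Returns one copy per edge whose endpoints live on different devices.
// Edges whose endpoints share a device are skipped, since a copy between
// identical locations would be redundant.
std::vector<Copy> insert_copies(std::vector<Edge> const &edges) {
  std::vector<Copy> copies;
  for (Edge const &e : edges) {
    // Assumption of this sketch: device IDs are non-negative integers.
    assert(e.producer_device >= 0 && e.consumer_device >= 0);
    if (e.producer_device != e.consumer_device) {
      copies.push_back(Copy{e.producer, e.producer_device, e.consumer_device});
    }
  }
  return copies;
}
```

For a chain `a -> b -> c` where only `c` lives on a second device, this toy version yields a single copy for the `b -> c` edge.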

@lockshaw lockshaw self-requested a review March 18, 2026 04:15
Collaborator

@lockshaw lockshaw left a comment

@lockshaw reviewed 26 files and all commit messages, and made 13 comments.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on elliottslaughter).


lib/realm-execution/src/realm-execution/pcg_instance.cc line 163 at r1 (raw file):

}

static Realm::Event spawn_dynamic_node_invocation(

Minor: A small docstring would be nice, as the number of arguments makes it a bit hard to quickly skim for the meaning of this function


lib/realm-execution/src/realm-execution/realm_context.cc line 174 at r1 (raw file):

      /*field_id=*/0,
      /*size=*/
      static_cast<size_t>(int{size_of_datatype(src_piece_shape.data_type)}),

Minor: Slightly clearer/more idiomatic in the codebase

Suggestion:

      static_cast<size_t>(size_of_datatype(src_piece_shape.data_type).int_from_positive_int()),

lib/realm-execution/src/realm-execution/realm_context.cc line 218 at r1 (raw file):

    default:
      PANIC("TensorShape dims greater than REALM_MAX_DIM",
            fmt::to_string(src_piece_shape.dims.ff_ordered.num_dims()));

Minor: I don't think you need the explicit to_string call

Suggestion:

      PANIC("TensorShape dims greater than REALM_MAX_DIM: {}", src_piece_shape.dims.ff_ordered.num_dims());

lib/task-spec/include/task-spec/dynamic_graph/copy_insertion.h line 0 at r1 (raw file):
Add a high-level explanation of copy insertion to dynamic_graph/index.dox and (ideally) link to there from here


lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 115 at r1 (raw file):

      auto const &[filtered_source, filtered_use] =
          filter_mapping_to_avoid_degenerate_copies(source_value, use_value);
      DynamicNodeInvocation copy{

FYI Normally we'd do DynamicNodeInvocation copy = DynamicNodeInvocation{, so that would technically be slightly more idiomatic in the codebase, but it really doesn't matter much

Code quote:

      DynamicNodeInvocation copy{

lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

  ASSERT(no_part_of_graph_is_copy_inserted(g));

  std::unordered_map<DynamicValueAttrs, DynamicValueAttrs> sources;

A more specific variable name here would be really helpful, especially since the type declaration is not particularly illuminating

Code quote:

  std::unordered_map<DynamicValueAttrs, DynamicValueAttrs> sources;

lib/task-spec/src/task-spec/dynamic_graph/dynamic_task_type.cc line 6 at r1 (raw file):

namespace FlexFlow {

DynamicTaskType decide_copy_task_type(DynamicTensorRole role) {

Minor: Slightly clearer name. Unless I'm misunderstanding, this function isn't really doing any "deciding", it's really just flattening some nesting of task types

Suggestion:

DynamicTaskType dynamic_task_type_from_tensor_role(DynamicTensorRole role) {

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 49 at r1 (raw file):

        ParallelTensorSpaceCoordinate const &parallel_tensor_coord) {
  return filter_keys(mapping, [&](ParallelTensorSpaceCoordinate const &p) {
    return p == parallel_tensor_coord;

If you're fixing the key, what's the point of returning a bidict over an unordered_set from this function?


lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 90 at r1 (raw file):

static std::unordered_set<DynamicNodeInvocation>
    perform_shard_expansion_for_copy(DynamicNodeInvocation const &i) {
  auto const &[input_slot, input] = get_only(i.inputs);

Minor: Generally I discourage assigning references unless necessary, as it opens up more room for lifetime bugs

Suggestion:

 auto [input_slot, input] = get_only(i.inputs);
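
A minimal sketch of the by-value binding suggested above, using only the standard library. The `get_only` here is a hypothetical stand-in written for this example (the real helper's signature is not shown in this thread), and `slot_name_after_clear` exists only to show that the bound names do not refer back into the container.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for get_only: returns the single element of a map
// by value, asserting that there is exactly one.
template <typename K, typename V>
std::pair<K, V> get_only(std::map<K, V> const &m) {
  assert(m.size() == 1);
  return *m.begin();
}

// Binds by value, so the result stays valid even after the map is cleared.
std::string slot_name_after_clear(std::map<std::string, int> &inputs) {
  auto [slot, value] = get_only(inputs); // copies; no reference into `inputs`
  inputs.clear(); // would invalidate any reference into the map
  return slot + "=" + std::to_string(value);
}
```

The cost is one extra copy of the element; in exchange, later changes to the container cannot leave `slot` or `value` dangling.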

lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

  bidict<ParallelTensorSpaceCoordinate, MachineSpaceCoordinate> output_mapping =
      assert_unwrap(output.mapping);
  require_same(input_mapping.left_values(), output_mapping.left_values());

Minor: Slightly more idiomatic, as that way you don't have to arbitrarily choose which mapping (input or output) to use in the rest of the function

Suggestion:

  bidict<ParallelTensorSpaceCoordinate, MachineSpaceCoordinate> mapping =
    require_same(assert_unwrap(input.mapping), assert_unwrap(output.mapping));

lib/task-spec/test/src/task-spec/dynamic_graph/copy_insertion.cc line 388 at r1 (raw file):

    SUBCASE("copy one tensor, one point") {
      std::unordered_map<DynamicValueAttrs, DynamicValueAttrs> sources_copy1{
          {graph_input1, graph_input1_src_copy1},

It seems that some of the initialization is used for a single subcase? If so, it would make it more readable to move the creation of those subcase-specific values into the subcase itself.

Also, is there any way to shrink the amount of stuff needed in this test? Wading through all the construction is not fun, though admittedly maybe just coalescing the setup into the subcases will make the storyline of the setup sufficiently clear that we won't need to do this.

Code quote:

         {graph_input1, graph_input1_src_copy1},

lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

          DynamicNodeAttrs{
              /*task_type=*/std::nullopt,
              /*device_coord=*/device_coord,

What is the meaning of the device placement of a copy? Is it the source or destination of the copy? It seems like either way that's going to run into issues in the backward pass, where the copy will have to operate the other direction, but I don't see any code for handling that currently?


lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

              },
          },
      };

Minor: Might be clearer to make this a modification of input rather than a full reconstruction? It looks like very little is changed, but it's kinda hard to spot what changes in all the initialization

Code quote:

      return DynamicNodeInvocation{
          /*inputs=*/{
              {
                  mk_slot(TensorSlotName::INPUT),
                  mk_value(0,
                           TensorSlotName::OUTPUT,
                           src_binding,
                           tensor_shard_coord),
              },
          },
          /*node_attrs=*/
          DynamicNodeAttrs{
              /*task_type=*/std::nullopt,
              /*device_coord=*/device_coord,
              /*mapping=*/std::nullopt,
              /*op_attrs=*/TrainingOperationAttrs{CopyAttrs{}},
              /*layer_guid=*/dynamic_layer_guid_t{dynamic_copy_layer_guid_t{}},
              /*per_device_op_state=*/std::nullopt,
          },
          /*outputs=*/
          {
              {
                  mk_slot(TensorSlotName::OUTPUT),
                  mk_value(20,
                           TensorSlotName::OUTPUT,
                           dst_binding,
                           tensor_shard_coord),
              },
          },
      };

Contributor Author

@elliottslaughter elliottslaughter left a comment

@elliottslaughter made 8 comments and resolved 4 discussions.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on lockshaw).


lib/task-spec/include/task-spec/dynamic_graph/copy_insertion.h line at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add a high-level explanation of copy insertion to dynamic_graph/index.dox and (ideally) link to there from here

I added the link next to the others, but I don't see any existing explanations of any other passes there (or anywhere). Am I missing something? I think the version pushed with this comment should match the standard that the others are currently held to (as far as I can tell).


lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

A more specific variable name here would be really helpful, especially since the type declaration is not particularly illuminating

Done.


lib/task-spec/src/task-spec/dynamic_graph/dynamic_task_type.cc line 6 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Minor: Slightly clearer name. Unless I'm misunderstanding, this function isn't really doing any "deciding", it's really just flattening some nesting of task types

Yes, but it's for a copy. Otherwise it's nonsensical to even be speaking of converting a tensor role into a task type.

I named it dynamic_task_type_from_tensor_role_for_copy in the revision but let me know if you have something better.


lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 49 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

If you're fixing the key, what's the point of returning a bidict over an unordered_set from this function?

Because this has to plug back into the mapping field of DynamicValueAttrs, which we previously agreed makes sense as a bidict because of the properties that operators have.
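
To make the point concrete, here is a small sketch of restricting a mapping to one key while keeping the container type, so the result can be stored back into a field that expects the full mapping type. It uses `std::map` in place of the codebase's bidict, and `restrict_mapping_to_key` is a hypothetical name chosen for this example.

```cpp
#include <cassert>
#include <map>

// Hypothetical sketch: keep only the entry whose key equals `coord`.
// The result is still a map (with at most one entry), so it has the same
// type as the mapping it was derived from.
template <typename K, typename V>
std::map<K, V> restrict_mapping_to_key(std::map<K, V> const &mapping,
                                       K const &coord) {
  std::map<K, V> result;
  auto it = mapping.find(coord);
  if (it != mapping.end()) {
    result.insert(*it);
  }
  assert(result.size() <= 1);
  return result;
}
```

Returning a set of values instead would lose the key, and the caller would have to rebuild the mapping before assigning it to the field.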


lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Minor: Slightly more idiomatic, as that way you don't have to arbitrarily choose which mapping (input or output) to use in the rest of the function

Unless I'm misunderstanding something, this does not work because the input and output mappings are NOT the same. This is a copy, not an operator. If the mappings were the same a whole lot of code could be simplified.


lib/task-spec/test/src/task-spec/dynamic_graph/copy_insertion.cc line 388 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It seems that some of the initialization is used for a single subcase? If so, it would make it more readable to move the creation of those subcase-specific values into the subcase itself.

Also, is there any way to shrink the amount of stuff needed in this test? Wading through all the construction is not fun, though admittedly maybe just coalescing the setup into the subcases will make the storyline of the setup sufficiently clear that we won't need to do this.

See if you're satisfied with the updated version. I'm honestly not sure there's a way to meaningfully simplify any of this without reducing the number or complexity of the covered test scenarios, but if you have ideas let me know.


lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is the meaning of the device placement of a copy? Is it the source or destination of the copy? It seems like either way that's going to run into issues in the backward pass, where the copy will have to operate the other direction, but I don't see any code for handling that currently?

Right now this is sort of meaningless because we have one controller issuing all copies for the entire graph, no matter where they are. However, the intention is for this to be the "owner" or "issuer" of the copy, which matters a lot more down the road once we write the control-replicated version of the Realm backend. At that point, you will have to pick one specific node to issue the copy, and the program should be correct no matter what you pick, but there may be performance implications to those choices.

TL;DR: there should never be a correctness concern due to this choice but in the future there may be performance decisions related to it.


lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Minor: Might be clearer to make this a modification of input rather than a full reconstruction? It looks like very little is changed, but it's kinda hard to spot what changes in all the initialization

Honestly, this doesn't seem any worse than the pre-existing test case in this file. I don't disagree necessarily but it seems to be a feature of these tests generally?

Collaborator

@lockshaw lockshaw left a comment

@lockshaw reviewed 8 files and all commit messages, made 7 comments, and resolved 3 discussions.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on elliottslaughter).


lib/task-spec/include/task-spec/dynamic_graph/copy_insertion.h line at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

I added the link next to the others, but I don't see any existing explanations of any other passes there (or anywhere). Am I missing something? I think the version pushed with this comment should match the standard that the others are currently held to (as far as I can tell).

This would be the first one, I'm still working on backfilling the previous ones as this doc is new. Happy to merge without it for now, but can you add it in a follow-up PR?


lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Done.

I would normally expect "source" to represent something analogous to a Node or a DataflowOutput, but it seems that here it's a value. Maybe mapped_source_value would be a clearer name?


lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 49 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Because this has to plug back into the mapping field of DynamicValueAttrs, which we previously agreed makes sense as a bidict because of the properties that operators have.

Makes sense. BTW, we'd normally name a function like this something like restrict_tensor_mapping_keys_to_coord, following utils/containers/restrict_keys.h, but this is not a big deal, especially since this function is only used here.


lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Unless I'm misunderstanding something, this does not work because the input and output mappings are NOT the same. This is a copy, not an operator. If the mappings were the same a whole lot of code could be simplified.

Oh, missed the .left_values() access here. In that case, you'd require_same on the .left_values(), which would be fine but also isn't quite as clean as the former because you still need input_mapping hanging around
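
A sketch of the distinction being discussed: for a copy, the input and output mappings send the same tensor coordinates to different devices, so only their key sets can be required to match. `require_same`, `keys_of`, and `shared_coords` below are hypothetical stand-ins written for this example, with `std::map` playing the role of the bidict and `keys_of` playing the role of `.left_values()`.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>

// Hypothetical stand-in for require_same: returns the common value, or
// throws if the two arguments differ.
template <typename T>
T require_same(T const &lhs, T const &rhs) {
  if (!(lhs == rhs)) {
    throw std::runtime_error("require_same: values differ");
  }
  return lhs;
}

// Collects the keys of a map (analogous to taking the left values).
template <typename K, typename V>
std::set<K> keys_of(std::map<K, V> const &m) {
  std::set<K> keys;
  for (auto const &kv : m) {
    keys.insert(kv.first);
  }
  assert(keys.size() == m.size());
  return keys;
}

// The mappings themselves may differ; only the coordinates they cover
// are required to be the same.
template <typename K, typename V>
std::set<K> shared_coords(std::map<K, V> const &input_mapping,
                          std::map<K, V> const &output_mapping) {
  return require_same(keys_of(input_mapping), keys_of(output_mapping));
}
```

Using the returned key set means neither mapping has to be chosen arbitrarily for iteration, although both mappings are still needed to look up the source and destination devices.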


lib/task-spec/test/src/task-spec/dynamic_graph/copy_insertion.cc line 388 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

See if you're satisfied with the updated version. I'm honestly not sure there's a way to meaningfully simplify any of this without reducing the number or complexity of the covered test scenarios, but if you have ideas let me know.

No worries, while not perfect this is already quite improved--thanks!


lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Right now this is sort of meaningless because we have one controller issuing all copies for the entire graph, no matter where they are. However, the intention is for this to be the "owner" or "issuer" of the copy, which matters a lot more down the road once we write the control-replicated version of the Realm backend. At that point, you will have to pick one specific node to issue the copy, and the program should be correct no matter what you pick, but there may be performance implications to those choices.

TL;DR: there should never be a correctness concern due to this choice but in the future there may be performance decisions related to it.

Got it, would you mind adding a quick explanation of that (essentially a copy of the above comment would be fine) into a docstring on that field in DynamicNodeAttrs then? Just so we don't lose track of the reasoning in the future


lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Honestly, this doesn't seem any worse than the pre-existing test case in this file. I don't disagree necessarily but it seems to be a feature of these tests generally?

At some point you do need to construct raw values to test stuff, and frequently we just reconstruct the result, since for more complicated transformations you'd end up reimplementing the whole function under test in the test itself. Here, though, only one field changes, so at least to me the tradeoff seems worth it for the improved clarity.
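
A small sketch of that tradeoff, with hypothetical types: the expected value is built as "the input with one field changed" instead of being spelled out in full. `NodeAttrs`, `place_on_device`, and `placement_changes_only_device` are invented for this example and are much simpler than the real DynamicNodeAttrs.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for a node-attributes struct.
struct NodeAttrs {
  std::optional<std::string> task_type;
  int device_coord;
  std::string op_name;
};

bool operator==(NodeAttrs const &a, NodeAttrs const &b) {
  return a.task_type == b.task_type && a.device_coord == b.device_coord &&
         a.op_name == b.op_name;
}

// Toy function under test: assigns a device to a node.
NodeAttrs place_on_device(NodeAttrs attrs, int device) {
  attrs.device_coord = device;
  return attrs;
}

// Test body: derive the expected value from the input so that the single
// differing field is the only thing the reader has to spot.
bool placement_changes_only_device() {
  NodeAttrs input{std::nullopt, 0, "copy"};
  assert(input.device_coord != 3);
  NodeAttrs expected = input;
  expected.device_coord = 3; // the only field that should change
  return place_on_device(input, 3) == expected;
}
```

This works well when one or two fields change; for larger transformations, deriving the expected value this way starts to duplicate the function under test.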

Contributor Author

@elliottslaughter elliottslaughter left a comment

@elliottslaughter made 5 comments and resolved 2 discussions.
Reviewable status: 22 of 27 files reviewed, 3 unresolved discussions (waiting on lockshaw).


lib/task-spec/include/task-spec/dynamic_graph/copy_insertion.h line at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This would be the first one, I'm still working on backfilling the previous ones as this doc is new. Happy to merge without it for now, but can you add it in a follow-up PR?

I added a basic description for now. I'm happy to iterate further after this PR once I see what your intention is for the rest of these.


lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I would normally expect "source" to represent something analogous to a Node or a DataflowOutput, but it seems that here it's a value. Maybe mapped_source_value would be a clearer name?

Sure, that's fine. Just to clarify, this is the source from the perspective of the copy, which by definition takes as its input the output of some other operator. Since this is the copy insertion pass I use copy-centric language rather than operator-centric.


lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Oh, missed the .left_values() access here. In that case, you'd require_same on the .left_values(), which would be fine but also isn't quite as clean as the former because you still need input_mapping hanging around

I think I got it this time, but double check if I understood your meaning.


lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Got it, would you mind adding a quick explanation of that (essentially a copy of the above comment would be fine) into a docstring on that field in DynamicNodeAttrs then? Just so we don't lose track of the reasoning in the future

I added a comment into the main source file for shard_expansion.cc, because it seems odd to document this decision inside a test.


lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

At some point you do need to construct raw values to test stuff, and frequently we just reconstruct the result, since for more complicated transformations you'd end up reimplementing the whole function under test in the test itself. Here, though, only one field changes, so at least to me the tradeoff seems worth it for the improved clarity.

Done. Note that inputs and outputs are not identical so we don't get any compression there, but node_attrs is mostly similar so we do get some benefit in that one.

Collaborator

@lockshaw lockshaw left a comment

@lockshaw reviewed 6 files and all commit messages, made 4 comments, and resolved 3 discussions.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on elliottslaughter).


lib/task-spec/src/task-spec/dynamic_graph/copy_insertion.cc line 154 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Sure, that's fine. Just to clarify, this is the source from the perspective of the copy, which by definition takes as its input the output of some other operator. Since this is the copy insertion pass I use copy-centric language rather than operator-centric.

I meant unmapped_value_to_mapped_source_value--sorry for the confusion, I pushed the fix directly so you don't have to take care of it.


lib/task-spec/src/task-spec/dynamic_graph/shard_expansion.cc line 96 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

I think I got it this time, but double check if I understood your meaning.

What you have here is totally fine--this was mainly a comment from when I missed the .left_values() access, so don't worry about it anymore


lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 335 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

I added a comment into the main source file for shard_expansion.cc, because it seems odd to document this decision inside a test.

I meant in the actual dtg.toml file--I pushed the fix so you don't have to take care of it, but I also like the comment in copy_insertion.cc so let's keep both.


lib/task-spec/test/src/task-spec/dynamic_graph/shard_expansion.cc line 351 at r1 (raw file):

Previously, elliottslaughter (Elliott Slaughter) wrote…

Done. Note that inputs and outputs are not identical so we don't get any compression there, but node_attrs is mostly similar so we do get some benefit in that one.

Got it, I missed the change in the inputs and outputs when originally reviewing this. What you have now is great.

@lockshaw lockshaw enabled auto-merge (squash) March 20, 2026 22:19
@lockshaw lockshaw merged commit 5f49bc7 into flexflow:master Mar 20, 2026
3 of 4 checks passed
@codecov
codecov bot commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (94fd1fc) to head (679b25f).
⚠️ Report is 1 commits behind head on master.


@elliottslaughter elliottslaughter deleted the realm-data-movement-explicit branch March 21, 2026 02:32

2 participants