
Conversation

ankikuma (Contributor) commented on Oct 21, 2025:

Address ES-13295 and ES-13170.

Implement shard split logic for Flush and Refresh. Also forward the most recent shardCountRequestSummary to the target shards when splitting the request at the source.

elasticsearchmachine added the v9.3.0 and serverless-linked labels on Oct 21, 2025.
ankikuma marked this pull request as ready for review on Nov 3, 2025.
elasticsearchmachine added the needs:triage label on Nov 3, 2025.
ankikuma added the :Distributed Indexing/Distributed and Team:Distributed Indexing labels on Nov 3, 2025.
elasticsearchmachine removed the needs:triage label on Nov 3, 2025.
elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

ankikuma added the >non-issue and needs:triage labels on Nov 3, 2025.
elasticsearchmachine removed the needs:triage label on Nov 3, 2025.

// We are here because there was a mismatch between the SplitShardCountSummary in the request
// and that on the primary shard node. We assume that the request is exactly 1 reshard split behind
// the current state.
ankikuma (author) commented:

This assumption will need to be revised in a follow-up PR. I created ES-13413 for this.
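
(For context, a minimal sketch of the kind of guard the follow-up could add. The shardCount() accessor and the rule that a split exactly doubles the shard count are both assumptions, not confirmed by this PR.)

// Hypothetical sketch, not the PR's code: fail fast if the request is more than
// one reshard split behind, instead of silently assuming exactly one split.
static void assertOneSplitBehind(SplitShardCountSummary requestSummary, SplitShardCountSummary currentSummary) {
    int requestCount = requestSummary.shardCount();  // summary stamped on the request (assumed accessor)
    int currentCount = currentSummary.shardCount();  // summary on the primary shard node
    if (currentCount != requestCount * 2) {
        throw new IllegalStateException(
            "request summary [" + requestCount + "] is not exactly one reshard split behind current [" + currentCount + "]"
        );
    }
}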

// If the action fails on either one of the shards, we return an exception.
// Case 1: Both source and target shards return a response: add up total, successful, failures.
// Case 2: Both source and target shards return an exception: return the exception.
// Case 3: One shard returns a response, the other returns an exception: return the exception.
ankikuma (author) commented:

We fail the entire request with an exception if the action fails on either of the two shards. Please comment if you think this behavior is incorrect.

assert failureArray.length == failed;
ReplicationResponse.ShardInfo shardInfo = ReplicationResponse.ShardInfo.of(total, successful, failureArray);
ReplicationResponse response = new ReplicationResponse();
response.setShardInfo(shardInfo);
ankikuma (author) commented:

ShardInfo adds up the successes and failures from the two responses.

Map<ShardId, ShardFlushRequest> requestsByShard = new HashMap<>();
requestsByShard.put(sourceShard, request);
// Create a request for the original source shard and for each target shard.
// New requests that are to be handled by target shards should contain the
A reviewer (Contributor) commented:

We should be careful with using cluster state on the source shard node. During handoff, the target shard node applies a non-acked cluster state update to transition to HANDOFF and then acks the handoff request from the source shard. At this point there is no guarantee that this update to HANDOFF was observed and that SplitShardCountSummary.forIndexing returns a correct result. In practice it shouldn't really matter though, because a particular request can only be split once?

ankikuma (author) replied:

Created ES-13488

// latest ShardCountSummary.
int targetShardId = indexMetadata.getReshardingMetadata().getSplit().targetShard(sourceShard.id());
ShardId targetShard = new ShardId(request.shardId().getIndex(), targetShardId);
requestsByShard.put(targetShard, new ShardFlushRequest(request.getRequest(), targetShard, shardCountSummary));
A reviewer (Contributor) commented:

Do we want to skip this logic if shardCountSummary is up to date (the caveat above applies)?

The reviewer followed up:

Disregard, it can't be up to date because it is checked in the caller.

public static final String NAME = FlushAction.NAME + "[s]";
public static final ActionType<ReplicationResponse> TYPE = new ActionType<>(NAME);

protected final ProjectResolver projectResolver;
A reviewer (Contributor) commented:

Can it be private?

ankikuma (author) replied:

Yes

final List<BulkItemRequest> requests = entry.getValue();

// Get effective shardCount for shardId and pass it on as parameter to new BulkShardRequest
var indexMetadata = project.getIndexSafe(shardId.getIndex());
A reviewer (Contributor) commented:

Why do we need this?

ankikuma (author) replied:

This is to replace the commented code below. The indexMetadata cannot really be null at this point and we must have a valid SplitShardCountSummary to pass to the BulkShardRequest.

A reviewer (Contributor) commented:

can we remove the commented code before pushing?

Another reviewer (Contributor) commented:

Does this change the behavior in any way? I can't really imagine it does but I want to check.

ankikuma (author) replied:

This is just to make sure we trip an assertion if we do see a null indexMetadata, which was not the case earlier. We never expected indexMetadata to be null in the first place, so the behavior has not changed.

I will remove the commented code.

Comment on lines 64 to 68
protected ShardFlushRequest newShardRequest(FlushRequest request, ShardId shardId, ProjectMetadata project) {
    // Get effective shardCount for shardId and pass it on as parameter to new ShardFlushRequest
    var indexMetadata = project.getIndexSafe(shardId.getIndex());
    SplitShardCountSummary reshardSplitShardCountSummary = SplitShardCountSummary.forIndexing(indexMetadata, shardId.getId());
    return new ShardFlushRequest(request, shardId, reshardSplitShardCountSummary);
A reviewer (Contributor) commented:

I might have been tempted to modify TransportBroadcastReplicationAction.shards to return a list of (ShardId, summary) pairs, to keep the binding between the shards we've chosen for routing and the summary that is calculated based on that routing table explicit. It doesn't address any kind of bug here; it just makes the API a little clearer and harder to use incorrectly.

ankikuma (author) replied:

Yes, that makes sense. Made the changes in the latest upload.

Comment on lines 121 to 176
// We are here because there was a mismatch between the SplitShardCountSummary in the request
// and that on the primary shard node. We assume that the request is exactly 1 reshard split behind
// the current state.
@Override
protected Map<ShardId, BasicReplicationRequest> splitRequestOnPrimary(BasicReplicationRequest request) {
    ProjectMetadata project = projectResolver.getProjectMetadata(clusterService.state());
    final ShardId sourceShard = request.shardId();
    IndexMetadata indexMetadata = project.getIndexSafe(request.shardId().getIndex());
    SplitShardCountSummary shardCountSummary = SplitShardCountSummary.forIndexing(indexMetadata, sourceShard.getId());
    Map<ShardId, BasicReplicationRequest> requestsByShard = new HashMap<>();
    requestsByShard.put(sourceShard, request);
    // Create a request for the original source shard and for each target shard.
    // New requests that are to be handled by target shards should contain the
    // latest ShardCountSummary.
    int targetShardId = indexMetadata.getReshardingMetadata().getSplit().targetShard(sourceShard.id());
    ShardId targetShard = new ShardId(request.shardId().getIndex(), targetShardId);
    requestsByShard.put(targetShard, new BasicReplicationRequest(targetShard, shardCountSummary));
    return requestsByShard;
}

@Override
protected Tuple<ReplicationResponse, Exception> combineSplitResponses(
    BasicReplicationRequest originalRequest,
    Map<ShardId, BasicReplicationRequest> splitRequests,
    Map<ShardId, Tuple<ReplicationResponse, Exception>> responses
) {
    int failed = 0;
    int successful = 0;
    int total = 0;
    List<ReplicationResponse.ShardInfo.Failure> failures = new ArrayList<>();

    // If the action fails on either one of the shards, we return an exception.
    // Case 1: Both source and target shards return a response: add up total, successful, failures.
    // Case 2: Both source and target shards return an exception: return the exception.
    // Case 3: One shard returns a response, the other returns an exception: return the exception.
    for (Map.Entry<ShardId, Tuple<ReplicationResponse, Exception>> entry : responses.entrySet()) {
        ShardId shardId = entry.getKey();
        Tuple<ReplicationResponse, Exception> value = entry.getValue();
        Exception exception = value.v2();
        if (exception != null) {
            return new Tuple<>(null, exception);
        } else {
            ReplicationResponse response = value.v1();
            failed += response.getShardInfo().getFailed();
            successful += response.getShardInfo().getSuccessful();
            total += response.getShardInfo().getTotal();
            Collections.addAll(failures, response.getShardInfo().getFailures());
        }
    }
    ReplicationResponse.ShardInfo.Failure[] failureArray = failures.toArray(new ReplicationResponse.ShardInfo.Failure[0]);
    assert failureArray.length == failed;
    ReplicationResponse.ShardInfo shardInfo = ReplicationResponse.ShardInfo.of(total, successful, failureArray);
    ReplicationResponse response = new ReplicationResponse();
    response.setShardInfo(shardInfo);
    return new Tuple<>(response, null);
}
A reviewer (Contributor) commented:

Can we share this logic with the flush action instead of duplicating it? That way, e.g., fixing the possible stale cluster state read happens in one place.

ankikuma (author) replied on Nov 12, 2025:

Done. Created ReplicationRequestSplitHelper for this.
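
(The helper's final shape isn't shown in this thread; based on the duplicated logic above, a hedged sketch of what the extraction might look like. The method placement and signature are assumptions; the actual ReplicationRequestSplitHelper in the PR may differ.)

// Hedged sketch of a shared split helper. Combines per-shard results: any
// exception fails the whole request; otherwise totals, successes, and
// failures are summed across the source and target shards.
final class ReplicationRequestSplitHelper {

    static Tuple<ReplicationResponse, Exception> combineSplitResponses(
        Map<ShardId, Tuple<ReplicationResponse, Exception>> responses
    ) {
        int failed = 0;
        int successful = 0;
        int total = 0;
        List<ReplicationResponse.ShardInfo.Failure> failures = new ArrayList<>();
        for (Tuple<ReplicationResponse, Exception> value : responses.values()) {
            if (value.v2() != null) {
                return new Tuple<>(null, value.v2());
            }
            ReplicationResponse.ShardInfo info = value.v1().getShardInfo();
            failed += info.getFailed();
            successful += info.getSuccessful();
            total += info.getTotal();
            Collections.addAll(failures, info.getFailures());
        }
        ReplicationResponse response = new ReplicationResponse();
        response.setShardInfo(
            ReplicationResponse.ShardInfo.of(total, successful, failures.toArray(new ReplicationResponse.ShardInfo.Failure[0]))
        );
        return new Tuple<>(response, null);
    }
}

Both the flush and refresh transport actions could then delegate to this one method, so a fix such as the stale cluster state read lands in a single place.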

bcully (Contributor) left a review:

Looks pretty good. How are we tracking the TODOs though?


try (var refs = new RefCountingRunnable(() -> finish(listener))) {
    for (final ShardId shardId : shards) {
        // for (final ShardId shardId : shards) {
A reviewer (Contributor) commented:

can remove this

Comment on lines 200 to 219
/*
protected List<ShardId> shards(Request request, ProjectState projectState) {
    assert Transports.assertNotTransportThread("may hit all the shards");
    List<ShardId> shardIds = new ArrayList<>();
    OperationRouting operationRouting = clusterService.operationRouting();
    String[] concreteIndices = indexNameExpressionResolver.concreteIndexNames(projectState.metadata(), request);
    ProjectMetadata project = projectState.metadata();
    String[] concreteIndices = indexNameExpressionResolver.concreteIndexNames(project, request);
    for (String index : concreteIndices) {
        Iterator<IndexShardRoutingTable> iterator = operationRouting.allWritableShards(projectState, index);
        var indexMetadata = project.index(index);
        while (iterator.hasNext()) {
            shardIds.add(iterator.next().shardId());
            ShardId shardId = iterator.next().shardId();
            SplitShardCountSummary reshardSplitShardCountSummary = SplitShardCountSummary.forIndexing(indexMetadata, shardId.getId());
            shardIds.add(shardId);
        }
    }
    return shardIds;
}
A reviewer (Contributor) commented:

probably better not to commit all this commented-out code

ankikuma (author) replied:

Yes sorry, I will remove it.

/**
 * @return all shard ids the request should run on
 */
protected List<Tuple<ShardId, SplitShardCountSummary>> shards(Request request, ProjectState projectState) {
A reviewer (Contributor) commented:

This is fine, but I personally prefer records to tuples, to make it easier to read the code where the result is consumed.

ankikuma (author) replied:

Sure, I can switch to using a record.
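
(A small sketch of what the record-based signature might look like; the record name is an assumption, the PR may have chosen a different one.)

// Hypothetical record name: binds the routed shard to the summary computed
// from the same routing table, as suggested above.
public record ShardIdAndSummary(ShardId shardId, SplitShardCountSummary summary) {}

protected List<ShardIdAndSummary> shards(Request request, ProjectState projectState) {
    List<ShardIdAndSummary> result = new ArrayList<>();
    // ... for each shard chosen by routing:
    //     result.add(new ShardIdAndSummary(shardId, SplitShardCountSummary.forIndexing(indexMetadata, shardId.getId())));
    return result;
}

Call sites then read shardId() and summary() by name instead of the tuple's v1()/v2().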

lkts (Contributor) left a review:

Overall LGTM, but I agree with the feedback from @bcully - let's remove commented code and make sure we track all TODOs.

public ShardFlushRequest(FlushRequest request, ShardId shardId, SplitShardCountSummary reshardSplitShardCountSummary) {
    super(shardId, reshardSplitShardCountSummary);
    this.request = request;
    this.waitForActiveShards = ActiveShardCount.NONE; // don't wait for any active shards before proceeding, by default
A reviewer (Contributor) commented:

nit: extract this into a constant and share it between the two constructors
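
(A minimal sketch of the suggested extraction; the constant name is an assumption.)

// Hypothetical constant name: a shared default so both constructors agree.
private static final ActiveShardCount DEFAULT_WAIT_FOR_ACTIVE_SHARDS = ActiveShardCount.NONE;

public ShardFlushRequest(FlushRequest request, ShardId shardId, SplitShardCountSummary reshardSplitShardCountSummary) {
    super(shardId, reshardSplitShardCountSummary);
    this.request = request;
    // don't wait for any active shards before proceeding, by default
    this.waitForActiveShards = DEFAULT_WAIT_FOR_ACTIVE_SHARDS;
}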

public static final String SOURCE_API = "api";

private final Executor refreshExecutor;
protected final ProjectResolver projectResolver;
A reviewer (Contributor) commented:

Suggested change:
-    protected final ProjectResolver projectResolver;
+    private final ProjectResolver projectResolver;

    }));
}

// We are here because there was a mismatch between the SplitShardCountSummary in the request
A reviewer (Contributor) commented:

Should this comment be on the method in the superclass?

ankikuma (author) replied:

Sure, that should make it clearer.

/**
 * Creates a new request with resolved shard id
 */
// TODO: Check if callers of this need to be modified to pass in shardCountSummary
A reviewer (Contributor) commented:

I think I would prefer for this constructor not to exist eventually, because it's a "trap". If a particular call site does not need to pass a summary, it can (with explanation).

ankikuma (author) replied:

I agree.

ankikuma (author) followed up:

Created a ticket ES-13508 for this.
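
(One way ES-13508 could remove the trap, as a hedged sketch: the factory name and the UNSET sentinel are assumptions, not the PR's API, and the real constructor in question may live on a different request class.)

// Hypothetical sketch: replace the summary-less convenience constructor with a
// named factory, so skipping the summary becomes an explicit, documented choice.
public static ShardFlushRequest withoutSplitSummary(FlushRequest request, ShardId shardId) {
    // Call sites using this factory should explain why no summary applies.
    return new ShardFlushRequest(request, shardId, SplitShardCountSummary.UNSET);
}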

ShardId shardId = indexShard.shardId();
if (request.refresh()) {
    logger.trace("send refresh action for shard {}", shardId);
    // TODO: Do we need to pass in shardCountSummary here?
A reviewer (Contributor) commented:

I guess we'd look at this while working on get? I think it's a stray TODO in this PR, though.

ankikuma (author) replied:

This is covered by ES-13508. Sorry, I should have mentioned it here. I think I have tickets for all the TODOs.

ankikuma merged commit f9ae4c3 into elastic:main on Nov 14, 2025.
34 checks passed