-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Reapply "Add an option to return early from an allocate call" #135589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reapply "Add an option to return early from an allocate call" #135589
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
if (routingNodes.getRelocatingShardCount() > 0) { | ||
// ES-12955: Check routingNodes.getRelocatingShardCount() > 0 in case the first relocation is a THROTTLE. | ||
// This should rarely happen since in most cases, we don't throttle unless there is an existing relocation. | ||
// But it can happen in production for frozen indices when the cache is still being prepared. It can also | ||
// happen in tests because we have decider like RandomAllocationDecider that can randomly return THROTTLE | ||
// when there is no existing relocation. | ||
shardBalanced = true; | ||
} | ||
if (completeEarlyOnShardAssignmentChange && shardBalanced) { | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix. See also 70b99de for its commit.
There were two issues:
- The main one is the we should only flag
shardBalanced
totrue
when there is shard movement on the cluster instead of just on the model node. - The second issue is that in production, we can actually have balance being throttled without any prior shard movement due to HasFrozenCacheAllocationDecider. I updated the comment to reflect it.
Test failure not related, but reproducible on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, tricky one to catch
* upstream/main: (22 commits) Fix InternalCategorizationAggregationTests.testReduceRandom (elastic#135533) [DOCS] GeoIP processor: add clarification about using a reverse proxy endpoint (elastic#135534) Move `ProjectRoutingInfo` and related classes (elastic#135586) Refactor IndexAbstractionResolver (elastic#135587) Simplify returnLocalAll handling in ES|QL (elastic#135353) Reapply "Add an option to return early from an allocate call" (elastic#135589) Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#134407 Mute org.elasticsearch.xpack.esql.heap_attack.HeapAttackIT testAggTooManyMvLongs elastic#135585 Mute org.elasticsearch.multiproject.test.XpackWithMultipleProjectsClientYamlTestSuiteIT test {yaml=esql/60_usage/Basic ESQL usage output (telemetry) snapshot version} elastic#135579 Mute org.elasticsearch.search.ccs.KnnVectorQueryBuilderCrossClusterSearchIT testKnnQueryWithCcsMinimizeRoundTripsFalse elastic#135573 Mute org.elasticsearch.xpack.esql.inference.textembedding.TextEmbeddingOperatorTests testSimpleCircuitBreaking elastic#135569 Add telemetry for `TS` command (elastic#135471) Mute org.elasticsearch.cluster.routing.allocation.decider.RestoreInProgressAllocationDeciderTests testCanAllocatePrimaryExistingInRestoreInProgress elastic#135566 allocation: clarify RestoreInProgressAllocationDecider failure message (elastic#132307) [ES|QL] Register AggregateMetricDoubleLiteral (elastic#135054) Validate Logstash pipeline ID when creating. (elastic#135378) Migrate transport versions 8841050 through 8841041 (elastic#135555) Mute org.elasticsearch.search.ccs.SparseVectorQueryBuilderCrossClusterSearchIT testSparseVectorQueryWithCcsMinimizeRoundTripsFalse elastic#135559 Mute org.elasticsearch.action.admin.cluster.stats.SearchUsageStatsTests testToXContent elastic#135558 Testing indices query cache memory stats (elastic#135298) ...
This PR reapply #134786 with a bug fix which previously caused many test failures.
Resolves: #135406
Resolves: #135407
Resolves: #135150
Resolves: #135151
Resolves: #135194
Resolves: #135248
Resolves: #135249
Resolves: #135408
Resolves: #135473
Resolves: #135474
The following is the original commit message
Instead of running simulation all the way to balance or no possible movement, this PR adds an option to finish early based on the number of relocating shards. Note this early return mechanism is enabled only when desired balance is in use, i.e. it affects simulation only.
Resolves: ES-12862