@nicktindall
Contributor

nicktindall commented Oct 22, 2025

This caused an incident in QA. We will continue to investigate why an index might be missing uptime/write load for all of its shards, but this change should protect against it if it happens again.

Fixes: ES-13286
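For illustration, here is a minimal sketch (not the Elasticsearch code) of why missing uptime breaks an uptime-weighted average. If every shard of an index reports zero uptime, both the weighted sum and the total uptime are zero, and 0.0 / 0 in Java evaluates to NaN, which then poisons any total it is added to. The class and method names below are hypothetical. The guarded variant treats such an index as having no forecast.

```java
import java.util.OptionalDouble;

// Hypothetical helper, not the Elasticsearch implementation.
public class ZeroUptimeGuard {

    // Naive version: divides by the total uptime unconditionally.
    static double naiveWeightedAverage(double[] writeLoads, long[] uptimesMillis) {
        double weightedSum = 0;
        long totalUptime = 0;
        for (int i = 0; i < writeLoads.length; i++) {
            weightedSum += writeLoads[i] * uptimesMillis[i];
            totalUptime += uptimesMillis[i];
        }
        // When every uptime is zero this is 0.0 / 0, which evaluates to NaN.
        return weightedSum / totalUptime;
    }

    // Guarded version: an index with no recorded uptime yields no forecast at all.
    static OptionalDouble guardedWeightedAverage(double[] writeLoads, long[] uptimesMillis) {
        double weightedSum = 0;
        long totalUptime = 0;
        for (int i = 0; i < writeLoads.length; i++) {
            weightedSum += writeLoads[i] * uptimesMillis[i];
            totalUptime += uptimesMillis[i];
        }
        if (totalUptime == 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(weightedSum / totalUptime);
    }

    public static void main(String[] args) {
        double[] loads = {1.5, 2.5};
        long[] noUptime = {0L, 0L};
        System.out.println(naiveWeightedAverage(loads, noUptime));   // NaN
        System.out.println(guardedWeightedAverage(loads, noUptime)); // OptionalDouble.empty
    }
}
```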

```java
    if (someIndicesHadUptime) {
        assertThat(forecastedWriteLoad.getAsDouble(), not(notANumber()));
    }
}
```
Contributor Author

I could probably have tested this by calling forecastIndexWriteLoad directly (it's exposed for testing). Happy to do that instead if we want to reduce the boilerplate.

nicktindall requested a review from ywangd October 22, 2025 03:50
nicktindall added the >bug and :Distributed Coordination/Allocation labels Oct 22, 2025
nicktindall changed the title Handle indices with zero uptime correctly in write-load calculation to Handle indices with zero/missing uptime correctly in write-load calculation Oct 22, 2025
@elasticsearchmachine
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

nicktindall marked this pull request as ready for review October 22, 2025 03:51
elasticsearchmachine added the Team:Distributed Coordination label Oct 22, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Member

ywangd left a comment

LGTM

I noticed that some transport version changes (#136336) and their revert (#136510) touched how uptime is serialized. But given the transport version, it does not seem possible for them to have affected serverless. Why it happened remains a puzzle.

```java
// that index. It should be safe to extrapolate our weighted average out to the
// maximum uptime observed, based on the assumption that write-load is roughly
// evenly distributed across shards of a datastream index.
assert Double.isNaN(weightedAverageShardWriteLoad) == false : "Invalid average shard write load";
```
Member

Nit: maybe Double.isFinite instead?
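The difference the nit points at, as a small standalone demo (the helper names are made up for illustration): an isNaN check lets an infinite value through, for example a non-zero value divided by zero, whereas Double.isFinite rejects NaN as well as both infinities.

```java
// Hypothetical helpers contrasting the two assertion styles.
public class FiniteCheckDemo {

    // The original style of check: only rejects NaN.
    static boolean passesNaNCheck(double value) {
        return Double.isNaN(value) == false;
    }

    // The suggested check: rejects NaN, +Infinity and -Infinity.
    static boolean passesFiniteCheck(double value) {
        return Double.isFinite(value);
    }

    public static void main(String[] args) {
        double nan = 0.0 / 0.0;
        double inf = 1.0 / 0.0;
        System.out.println(passesNaNCheck(nan) + " " + passesFiniteCheck(nan)); // false false
        System.out.println(passesNaNCheck(inf) + " " + passesFiniteCheck(inf)); // true false
    }
}
```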

Contributor Author

I changed these assertions to check the two values we add to the overall totals; please re-check. I think this is a better approach because it's agnostic of how those values are calculated.

See d59c122
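A rough sketch of the per-index calculation described in the code comment above, with the assertions placed on the two values added to the overall totals. This is an illustration under assumptions, not the code in d59c122: the ShardStats record, the forecast method and the input shape (a list of indices, each a list of shards) are all hypothetical. Each index contributes its uptime-weighted average shard write load, extrapolated to the maximum observed shard uptime for every shard, and an index with no uptime at all is skipped.

```java
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical sketch of an uptime-weighted write-load forecast across indices.
public class WriteLoadForecastSketch {

    // Hypothetical per-shard input: average write load and uptime in milliseconds.
    record ShardStats(double writeLoad, long uptimeMillis) {}

    static OptionalDouble forecast(List<List<ShardStats>> indices) {
        double totalWriteLoad = 0;
        long totalUptime = 0;
        for (List<ShardStats> shards : indices) {
            double weightedShardWriteLoad = 0;
            long shardUptimeSum = 0;
            long maxShardUptime = 0;
            for (ShardStats shard : shards) {
                weightedShardWriteLoad += shard.writeLoad() * shard.uptimeMillis();
                shardUptimeSum += shard.uptimeMillis();
                maxShardUptime = Math.max(maxShardUptime, shard.uptimeMillis());
            }
            if (shardUptimeSum == 0) {
                // No shard reported uptime: skip the index rather than compute 0 / 0.
                continue;
            }
            double weightedAverage = weightedShardWriteLoad / shardUptimeSum;
            // Extrapolate to the maximum observed uptime for every shard, assuming
            // write load is roughly evenly distributed across the index's shards.
            long indexUptime = maxShardUptime * shards.size();
            double indexWriteLoad = weightedAverage * indexUptime;
            // Assert on the two values added to the totals (checked only with -ea).
            assert Double.isFinite(indexWriteLoad) : "Invalid index write load";
            assert indexUptime > 0 : "Invalid index uptime";
            totalWriteLoad += indexWriteLoad;
            totalUptime += indexUptime;
        }
        return totalUptime == 0 ? OptionalDouble.empty() : OptionalDouble.of(totalWriteLoad / totalUptime);
    }

    public static void main(String[] args) {
        List<ShardStats> withUptime = List.of(new ShardStats(2.0, 100), new ShardStats(4.0, 100));
        List<ShardStats> noUptime = List.of(new ShardStats(5.0, 0), new ShardStats(1.0, 0));
        System.out.println(forecast(List.of(withUptime, noUptime))); // OptionalDouble[3.0]
        System.out.println(forecast(List.of(noUptime)));             // OptionalDouble.empty
    }
}
```

Skipping the index, rather than adding a NaN contribution, keeps a single index without uptime from turning the whole forecast into NaN.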

nicktindall merged commit 2e340de into elastic:main Oct 22, 2025
34 checks passed
nicktindall deleted the write_load_forecast_handles_zero_uptime branch October 22, 2025 06:34
@nicktindall
Contributor Author

💚 All backports created successfully

| Status | Branch | Result |
|--------|--------|--------|
|        | 9.2    |        |

Questions?

Please refer to the Backport tool documentation

nicktindall added a commit to nicktindall/elasticsearch that referenced this pull request Oct 22, 2025
nicktindall added a commit that referenced this pull request Oct 22, 2025

Labels

>bug
:Distributed Coordination/Allocation - All issues relating to the decision making around placing a shard (both master logic & on the nodes)
Team:Distributed Coordination - Meta label for Distributed Coordination team
v9.3.0

Projects

None yet

3 participants