Overlord incremental select worker strategy #14934

Closed

Conversation

jakubmatyszewski
Contributor

Description

I was trying to use the existing fillCapacity select worker strategy to implement MiddleManager autoscaling, but since the MM instances I use are created by Kubernetes StatefulSets, it wasn't consistent enough. The proposed strategy fills capacity based on the worker's hostname, assuming the host is a Kubernetes StatefulSet FQDN, e.g. druid-middlemanager-0.druid-middlemanager.druid-test.svc.cluster.local:8091.

A similar problem was discussed in #8801.
I'm not sure whether this feature is desirable, and this is my first Java submission, so I'll take every bit of critique. I can add tests if it turns out that this feature is useful and we want to add it to the codebase.

Key changed/added classes in this PR
  • FillIncrementallyWorkerSelectStrategy
  • FillIncrementallyWithAffinityWorkerSelectStrategy
  • FillIncrementallyWithCategorySpecWorkerSelectStrategy

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

// Assumes the host format is based on k8s statefulset FQDN - "druid-middlemanager-<numeral>.druid-middlemanager..."
String[] parts = host.split("\\.");
String numeralPart = parts[0].substring(parts[0].lastIndexOf("-") + 1);
return Integer.parseInt(numeralPart);

Check notice: Code scanning / CodeQL

Missing catch of NumberFormatException (Note): potential uncaught 'java.lang.NumberFormatException'.
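A minimal sketch of how the parse flagged above could be guarded; the surrounding method and the -1 sentinel for non-matching hostnames are illustrative assumptions, not part of the PR:

// Sketch only: same parse as in the PR, with the NumberFormatException caught.
// Returning -1 for hosts that do not follow the StatefulSet naming pattern is
// an assumption made for this example.
static int parseOrdinal(String host)
{
  // e.g. "druid-middlemanager-0.druid-middlemanager.druid-test.svc.cluster.local:8091" -> 0
  String podName = host.split("\\.")[0];
  String numeralPart = podName.substring(podName.lastIndexOf('-') + 1);
  try {
    return Integer.parseInt(numeralPart);
  }
  catch (NumberFormatException e) {
    return -1; // hostname does not match "<name>-<ordinal>.<service>..."
  }
}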
@abhishekagarwal87
Contributor

Thank you for your contribution. Can you elaborate a bit on the following?

but since the MM instances I use are created by Kubernetes StatefulSets, it wasn't consistent enough.

What is not consistent?

@jakubmatyszewski
Contributor Author


For a StatefulSet to scale down, we need to take into consideration that:

When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.

But the existing fillCapacity Overlord strategy doesn't care on which host ordinal index it accumulates tasks - it just checks which MiddleManager instance has the highest currCapacityUsed value. So when multiple MiddleManagers have the same currCapacityUsed, this strategy appears to pick among them at random.
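To illustrate the selection idea in isolation (a hedged sketch, not the PR's actual classes or Druid's WorkerSelectStrategy API; the class and method names here are hypothetical): among workers that still have free task slots, always prefer the one with the lowest StatefulSet ordinal, so tasks accumulate on low-ordinal pods and the high-ordinal pods drain first.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch only: pick the worker with the lowest StatefulSet ordinal among those
// that still have spare capacity, so pods can be scaled down from {N-1..0}.
class LowestOrdinalPicker
{
  // "druid-middlemanager-3.druid-middlemanager..." -> 3; non-matching hosts sort last
  static int ordinalOf(String host)
  {
    String podName = host.split("\\.")[0];
    try {
      return Integer.parseInt(podName.substring(podName.lastIndexOf('-') + 1));
    }
    catch (NumberFormatException e) {
      return Integer.MAX_VALUE;
    }
  }

  static Optional<String> pick(List<String> hostsWithFreeCapacity)
  {
    return hostsWithFreeCapacity.stream()
                                .min(Comparator.comparingInt(LowestOrdinalPicker::ordinalOf));
  }
}

For example, if druid-middlemanager-0, -1, and -2 all have the same free capacity, pick(...) always returns the -0 host, whereas fillCapacity's tie-breaking is effectively arbitrary.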


github-actions bot commented Mar 3, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions bot added the stale label on Mar 3, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this on Mar 31, 2024