Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic auto scale Kinesis-Stream ingest tasks #10985

Merged

Conversation

zhangyue19921010
Copy link
Contributor

@zhangyue19921010 zhangyue19921010 commented Mar 12, 2021

Followed PR for #10524.
The core logic is the same, difference lies in the implementation of computeLagStats between Kafka and Kinesis.

Description

This PR implements and documents the autoscaler based on ingest/kinesis/lag/time metrics for kinesis indexing service.
Also only LagBased autoScalerStrategy is supported for now.


Key changed/added classes in this PR
  • KinesisSupervisor.java
  • KinesisSupervisorIOConfig.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zhangyue19921010 zhangyue19921010 changed the title Kinesis Dynamic Scale Ingest Task Dynamic auto scale Kinesis-Stream ingest tasks Mar 12, 2021
@zhangyue19921010
Copy link
Contributor Author

Failed CI jobs are not this PR related. Maybe retry will be passed :)

@suneet-s
Copy link
Contributor

#10691 documents a few known flaky tests. Any ideas on how to make them less flaky will be much appreciated. I've re-triggered the failing tests - just to see if that fixes it

@zhangyue19921010
Copy link
Contributor Author

#10691 documents a few known flaky tests. Any ideas on how to make them less flaky will be much appreciated. I've re-triggered the failing tests - just to see if that fixes it

Thanks for your help. Sure I will keep an eye on these jobs and try to do some tunning work if I can.

@zhangyue19921010
Copy link
Contributor Author

This Change has been running in our Dev cluster for 2 weeks and works fine. So that I believe it is ready to be reviewed.

Hi @pjain1 Sorry to bother you. Are you available to help me review this code? Since you have review the original design and may be more familiar with Stream Tasks autoscaler.

Will appreciate it very much if you could lend me a hand :)

@zhangyue19921010
Copy link
Contributor Author

Also kinesis-index and kinesis-data-format are passed on my laptop.

kinesis-index :

屏幕快照 2021-04-07 下午5 44 44

kinesis-data-format :

屏幕快照 2021-04-07 下午5 59 00

@suneet-s
Copy link
Contributor

@zhangyue19921010 any learnings from running in your dev cluster over the last few months? I've just started reviewing the change now. But hearing feedback first hand of anything you've noticed is very helpful! Thanks for your patience on this review.

@zhangyue19921010
Copy link
Contributor Author

Hi @suneet-s Thanks a lot for your attention. As far as I know this patch works fine on our dev/stg cluster and will deploy to PRD cluster soon.

Copy link
Member

@pjain1 pjain1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code changes looks good to me, never used kinesis indexing service so not so sure about the time based lag thing and how it behaves. If someone is using kinesis service, it would be good if they can try it on their cluster.

Copy link
Contributor

@techdocsmith techdocsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some suggestions as far as language and style the scaleOutThreshold and scaleInThreshold logic is confusing to me.

docs/development/extensions-core/kafka-ingestion.md Outdated Show resolved Hide resolved
docs/development/extensions-core/kinesis-ingestion.md Outdated Show resolved Hide resolved
docs/development/extensions-core/kinesis-ingestion.md Outdated Show resolved Hide resolved
docs/development/extensions-core/kinesis-ingestion.md Outdated Show resolved Hide resolved
docs/development/extensions-core/kinesis-ingestion.md Outdated Show resolved Hide resolved
| `lagCollectionRangeMillis` | The total time window of lag collection, Use with `lagCollectionIntervalMillis`,it means that in the recent `lagCollectionRangeMillis`, collect lag metric points every `lagCollectionIntervalMillis`. | no (default == 600000) |
| `scaleOutThreshold` | The Threshold of scale out action | no (default == 6000000) |
| `triggerScaleOutFractionThreshold` | If `triggerScaleOutFractionThreshold` percent of lag points are higher than `scaleOutThreshold`, then do scale out action. | no (default == 0.3) |
| `scaleInThreshold` | The Threshold of scale in action | no (default == 1000000) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as for scale out

docs/development/extensions-core/kinesis-ingestion.md Outdated Show resolved Hide resolved
docs/development/extensions-core/kinesis-ingestion.md Outdated Show resolved Hide resolved
docs/development/extensions-core/kinesis-ingestion.md Outdated Show resolved Hide resolved
docs/development/extensions-core/kinesis-ingestion.md Outdated Show resolved Hide resolved
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code changes lgtm 👍

@techdocsmith do the doc changes need to be fixed or is it ok to do as a follow-up?

@techdocsmith
Copy link
Contributor

@clintropolis , assuming I didn't break anything w/ my suggested edits, docs changes LGTM.

@clintropolis clintropolis merged commit 6d14ea2 into apache:master Aug 30, 2021
@clintropolis clintropolis added this to the 0.22.0 milestone Sep 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants