Dynamic auto scale Kafka-Stream ingest tasks #10524
Conversation
This is a super cool idea. I have begun a review and included comments/questions/suggestions.
Initial takeaways:
- Docs: let's get all the new IOConfig stuff documented in .md files.
- Javadocs: it would be great to get Javadocs created for all the new methods, especially the more critical/complex ones.
- Logging: lots of logging that does not provide information about which supervisor is logging. Also, I think we can do some scaling back on what gets info level; some logs seemed better off at debug.
- Logging 2: in places where we catch errors and just continue on, should we log warn instead of error?
- Configuration: for the new config block in IOConfig, is there any way to have a POJO structure for this new block? If not, I think at least extracting the default values out to final variables in the classes they live in would help make things easier to follow.

I'll try to continue reviewing during the rest of the week, but wanted to submit what I have so far as I need to take a break to work on other things.
Review threads were opened on:
- ...ng-service/src/main/java/org/apache/druid/indexing/kinesis/supervisor/KinesisSupervisor.java
- ...ce/src/main/java/org/apache/druid/indexing/kinesis/supervisor/KinesisSupervisorIOConfig.java
- .../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java (multiple threads)
@capistrant Thanks for your review! I will make changes as soon as possible :)

Hi @pjain1, thanks a lot for your review and approval.

Hi @capistrant and @himanshug, sorry to bother you. All your suggested changes have been completed and tested, and CI passes. So could you please +1 this PR? Or if there is any further suggestion, please let me know and I will try my best to get it done! Thanks.

@zhangyue19921010 I was wondering if you have thought of the case where the desired task count becomes greater than the number of topic partitions. In that case the number of actual tasks will remain equal to the number of topic partitions and will not grow beyond that. If the overall lag is still higher than the threshold, increasing the desired task count will not help, and I think that as per the logic, every time the dynamic allocate task notice runs it will increase the desired task count by scaleOutStep. Do you see this as a problem?

Hi @pjain1, thanks for asking. Yep, if the desired task count becomes greater than the number of topic partitions, it will increase the task number every

I think we can add this to the documentation. I was just wondering if we can be more defensive in
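The defensive cap being discussed could look like the sketch below (class and method names are hypothetical, not from the PR): clamp the computed desired task count to both `taskCountMax` and the topic's partition count, so repeated notices cannot keep growing a count that can never be realized.

```java
// Hypothetical sketch: cap the desired task count so it never exceeds the
// configured maximum or the number of topic partitions, since tasks beyond
// the partition count would sit idle.
class TaskCountClamp {
    static int clamp(int desiredTaskCount, int partitionCount, int taskCountMax) {
        // The effective ceiling is the smaller of the configured max and
        // the partition count.
        return Math.min(desiredTaskCount, Math.min(taskCountMax, partitionCount));
    }

    public static void main(String[] args) {
        // 10 desired tasks against a 6-partition topic and a max of 8 -> 6
        System.out.println(clamp(10, 6, 8));
    }
}
```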
Thanks, we are getting much closer now.

As a user, I find it hard to understand the autoscaling behavior based on what is documented, and I had to read the code. But hopefully this can be improved/refined as the feature gets more adoption.

Also, for the first release this shows up in, I think we should call it an experimental feature, mostly so that we can change the naming of various fields in the documented autoscaler configuration slightly based on user feedback if needed.
| `scaleInStep` | How many tasks to reduce at a time | no (default == 1) |
| `scaleOutStep` | How many tasks to add at a time | no (default == 2) |
| `minTriggerScaleActionFrequencyMillis` | Minimum time interval between two scale actions | no (default == 600000) |
| `autoScalerStrategy` | The algorithm of `autoScaler`. ONLY `lagBased` is supported for now. | no (default == `lagBased`) |
Can we make the distinction that the following properties are common to any autoscaler and the rest are specific to the `lagBased` autoscaler? Maybe have two tables:
- `autoScalerStrategy`
- `enableTaskAutoScaler`
- `taskCountMin`
- `taskCountMax`
- `minTriggerScaleActionFrequencyMillis`
Sure. Done
| Property | Description | Required |
| ------------- | ------------- | ------------- |
| `enableTaskAutoScaler` | Whether to enable this feature. Setting it false, or omitting it, disables `autoScaler` even though `autoScalerConfig` is not null | no (default == false) |
| `lagCollectionIntervalMillis` | Define the frequency of lag points collection. | no (default == 30000) |
Suggested change:
- | `lagCollectionIntervalMillis` | Define the frequency of lag points collection. | no (default == 30000) |
+ | `lagCollectionIntervalMillis` | Period of lag points collection. | no (default == 30000) |
Thanks && changed.
```java
  lagMetricsQueue.offer(0L);
} else {
  long totalLags = lagStats.getTotalLag();
  lagMetricsQueue.offer(totalLags > 0 ? totalLags : 0L);
```
Why shouldn't we expect lagStats.getTotalLag() to return a value >= 0?
Because we can occasionally get negative lags in practice. Something like https://stackoverflow.com/questions/60847952/how-to-get-rid-of-negative-consumer-lag-in-kafka
Negative lag values are spurious and poison our lag metrics, so we just filter them here.
```java
 * @param lags the lag metrics of the stream (Kafka/Kinesis)
 * @return Integer. Target number of tasksCount; -1 means skip the scale action.
 */
private Integer computeDesiredTaskCount(List<Long> lags)
```
Suggested change:
- private Integer computeDesiredTaskCount(List<Long> lags)
+ private int computeDesiredTaskCount(List<Long> lags)
Thanks && changed.
```java
);

int currentActiveTaskCount = supervisor.getActiveTaskGroupsCount();
if (currentActiveTaskCount < 0) {
```
Is it legitimate for supervisor.getActiveTaskGroupsCount() to return a negative value? If not, then supervisor.getActiveTaskGroupsCount() should always return a value >= 0 and this check shouldn't be needed.
Thanks && removed.
```java
Collection<TaskGroup> activeTaskGroups = activelyReadingTaskGroups.values();
currentActiveTaskCount = activeTaskGroups.size();

if (desiredActiveTaskCount == -1 || desiredActiveTaskCount == currentActiveTaskCount) {
```
Suggested change:
- if (desiredActiveTaskCount == -1 || desiredActiveTaskCount == currentActiveTaskCount) {
+ if (desiredActiveTaskCount < 0 || desiredActiveTaskCount == currentActiveTaskCount) {
Thanks && changed.
```java
allocationExec.scheduleAtFixedRate(
    supervisor.buildDynamicAllocationTask(scaleAction),
    lagBasedAutoScalerConfig.getScaleActionStartDelayMillis() + lagBasedAutoScalerConfig
        .getLagCollectionRangeMillis(),
```
Not sure why lagCollectionRangeMillis was added to scaleActionStartDelayMillis.
When scaleActionStartDelayMillis elapses, lagComputationExec starts working to collect metrics, and allocationExec needs to wait another lagCollectionRangeMillis, which means waiting for lagComputationExec to collect enough lag metrics.
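The scheduling relationship described here can be sketched as follows (config values are illustrative, not the PR's defaults): the allocation timer's initial delay is scaleActionStartDelayMillis plus lagCollectionRangeMillis, so the first scaling decision runs only once a full window of lag points exists.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SchedulerSketch {
    // Illustrative values, not the PR's defaults.
    static final long SCALE_ACTION_START_DELAY_MILLIS = 300_000L;
    static final long LAG_COLLECTION_RANGE_MILLIS = 600_000L;
    static final long SCALE_ACTION_PERIOD_MILLIS = 60_000L;

    static long allocationInitialDelay() {
        // The allocation timer waits an extra lagCollectionRangeMillis so the
        // lag-collection timer has accumulated a full window of lag points
        // before the first scaling decision.
        return SCALE_ACTION_START_DELAY_MILLIS + LAG_COLLECTION_RANGE_MILLIS;
    }

    public static void main(String[] args) {
        ScheduledExecutorService allocationExec = Executors.newSingleThreadScheduledExecutor();
        allocationExec.scheduleAtFixedRate(
            () -> { /* check the lag window and possibly scale */ },
            allocationInitialDelay(),
            SCALE_ACTION_PERIOD_MILLIS,
            TimeUnit.MILLISECONDS
        );
        // Shut down immediately in this sketch; pending delayed tasks are dropped.
        allocationExec.shutdown();
    }
}
```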
| `taskCountMin` | Minimum value of task count. When the autoscaler is enabled, the value of taskCount in `IOConfig` will be ignored, and `taskCountMin` will be the number of tasks that ingestion starts with, going up to `taskCountMax` | yes |
| `scaleInStep` | How many tasks to reduce at a time | no (default == 1) |
| `scaleOutStep` | How many tasks to add at a time | no (default == 2) |
| `minTriggerScaleActionFrequencyMillis` | Minimum time interval between two scale actions | no (default == 600000) |
Wouldn't the time interval between two scale actions always be greater than or equal to scaleActionPeriodMillis?
Actually, scaleActionPeriodMillis controls the frequency of detection, and minTriggerScaleActionFrequencyMillis sets a cool-down time between two scale actions. There is no hard association between the two parameters. For example, users can set scaleActionPeriodMillis == 10min and minTriggerScaleActionFrequencyMillis == 5min. It means Druid will check lags every 10 minutes; if a scale action is triggered, it cannot scale again within 5 minutes.
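A minimal sketch of that cool-down logic, assuming a simple timestamp comparison (class and method names are hypothetical, not from the PR):

```java
class CoolDownCheck {
    // Independent of how often the detection timer fires
    // (scaleActionPeriodMillis), a new scale action is allowed only after
    // the cool-down window since the last action has elapsed.
    static boolean canTriggerScaleAction(long nowMillis,
                                         long lastScaleActionMillis,
                                         long minTriggerScaleActionFrequencyMillis) {
        return nowMillis - lastScaleActionMillis >= minTriggerScaleActionFrequencyMillis;
    }

    public static void main(String[] args) {
        // With a 5-minute cool-down: a check 6 minutes after the last action
        // may scale; a check 3 minutes after it may not.
        System.out.println(canTriggerScaleAction(6 * 60_000L, 0L, 5 * 60_000L)); // true
        System.out.println(canTriggerScaleAction(3 * 60_000L, 0L, 5 * 60_000L)); // false
    }
}
```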
| `lagCollectionIntervalMillis` | Define the frequency of lag points collection. | no (default == 30000) |
| `lagCollectionRangeMillis` | The total time window of lag collection. Used with `lagCollectionIntervalMillis`, it means that in the most recent `lagCollectionRangeMillis`, lag metric points are collected every `lagCollectionIntervalMillis`. | no (default == 600000) |
| `scaleOutThreshold` | The threshold of the scale out action | no (default == 6000000) |
| `triggerScaleOutThresholdFrequency` | If `triggerScaleOutThresholdFrequency` fraction of lag points are higher than `scaleOutThreshold`, then do the scale out action. | no (default == 0.3) |
Not sure if it is a "frequency". Maybe triggerScaleOutFractionThreshold?
Thanks && changed.
| `scaleOutThreshold` | The threshold of the scale out action | no (default == 6000000) |
| `triggerScaleOutThresholdFrequency` | If `triggerScaleOutThresholdFrequency` fraction of lag points are higher than `scaleOutThreshold`, then do the scale out action. | no (default == 0.3) |
| `scaleInThreshold` | The threshold of the scale in action | no (default == 1000000) |
| `triggerScaleInThresholdFrequency` | If `triggerScaleInThresholdFrequency` fraction of lag points are lower than `scaleInThreshold`, then do the scale in action. | no (default == 0.9) |
Suggested change:
- | `triggerScaleInThresholdFrequency` | If `triggerScaleInThresholdFrequency` percent of lag points are lower than `scaleOutThreshold`, then do scale in action. | no (default == 0.9) |
+ | `triggerScaleInFractionThreshold` | If `triggerScaleInFractionThreshold` fraction of lag points are lower than `scaleInThreshold`, then do scale in action. | no (default == 0.9) |
Thanks && changed.
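The fraction-based trigger discussed in these table rows can be sketched roughly as follows (a simplified illustration, not the PR's actual computeDesiredTaskCount): count how many collected lag points sit above the scale-out threshold or below the scale-in threshold, and compare each fraction to its trigger value.

```java
import java.util.List;

class LagBasedDecision {
    // Returns +1 for scale out, -1 for scale in, 0 for no action.
    // Simplified illustration of the lagBased trigger described above.
    static int decide(List<Long> lagPoints,
                      long scaleOutThreshold, double scaleOutFraction,
                      long scaleInThreshold, double scaleInFraction) {
        long above = lagPoints.stream().filter(l -> l > scaleOutThreshold).count();
        long below = lagPoints.stream().filter(l -> l < scaleInThreshold).count();
        double n = lagPoints.size();
        if (above / n >= scaleOutFraction) {
            return 1;   // enough of the window is badly lagged: scale out
        }
        if (below / n >= scaleInFraction) {
            return -1;  // almost the whole window is nearly caught up: scale in
        }
        return 0;
    }

    public static void main(String[] args) {
        // 2 of 3 points above the scale-out threshold (0.67 >= 0.3): scale out.
        System.out.println(decide(List.of(7_000_000L, 8_000_000L, 100_000L),
                                  6_000_000L, 0.3, 1_000_000L, 0.9)); // prints 1
    }
}
```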
@bananaaggle saw your message earlier. It is harder to associate any timelines, especially with PRs that are more involved and consequently take more time to get through the review process. But hopefully this is getting closer to getting merged.

We should definitely make sure that the returned

Thanks guys, will get it done ASAP.

Hi @pjain1 and @himanshug, changes are done, including the new condition between

Restarted the Travis jobs; just looked at the partitionNumber task limitation part and that looks good. Thanks 👍

Thanks @pjain1, appreciate it.
```java
@@ -1901,6 +2058,11 @@ protected boolean supportsPartitionExpiration()
  return false;
}

public int getPartitionNumbers()
```
nit:
Suggested change:
- public int getPartitionNumbers()
+ public int getPartitionsCount()
Done. Thanks a lot for your review and approval!
Not able to give a full detailed review right now, so I will just comment. Left 3 comments. The only one that would be a blocker for merge is the licenses comment; I want to make sure we handle that correctly according to the document I linked in the comment.

Overall the code looks good to me, and I think the idea is sound and the implementation looks logical and extensible.
Review threads were opened on:
- ...ervice/src/main/java/org/apache/druid/indexing/kafka/supervisor/KafkaSupervisorIOConfig.java
- ...ng-service/src/main/java/org/apache/druid/indexing/kinesis/supervisor/KinesisSupervisor.java
Hi @capistrant, thanks for your review. All the changes are done. As for
Thanks for the updates. My bad on missing that this was already in licenses!
Hi @pjain1, @himanshug, and @capistrant, thanks a lot for your help!
…pache#273) This reverts commit bddacbb.
…10524)" (apache#273)" (apache#288) This reverts commit 935931d.
Description

In Druid, users need to set taskCount when submitting a Kafka ingestion supervisor. This has a few limitations. For example:
![traffic-pattern](https://user-images.githubusercontent.com/69956021/96677861-09cda080-13a3-11eb-9a88-d18ead0906db.png)
Here is our traffic pattern. I have to set taskCount to 8 to avoid Kafka lag during the traffic peak; at other times, 4 tasks are enough. This PR provides the ability to auto scale the number of Kafka ingest tasks based on lag metrics while supervisors are running. Enable this feature, and ingest tasks will automatically scale out during the traffic peak and scale in during the traffic off-peak.

Design
Here are the designs of this PR:
![Screenshot 2020-10-21 13:44:54](https://user-images.githubusercontent.com/69956021/96679250-de988080-13a5-11eb-9d72-4c9396ef1177.png)
![Screenshot 2020-10-21 13:45:11](https://user-images.githubusercontent.com/69956021/96679313-02f45d00-13a6-11eb-9f36-04630cddf4ff.png)
![Screenshot 2020-10-21 13:45:36](https://user-images.githubusercontent.com/69956021/96679355-143d6980-13a6-11eb-9a8f-9f413d819472.png)
The work flow of the supervisor controller, based on the Druid source code

As the picture shows, SupervisorManager controls all the supervisors in the Overlord service. Each Kafka supervisor serially consumes notices from a LinkedBlockingQueue. Notice is an interface; RunNotice, ShutdownNotice, and ResetNotice are implementations of this interface. I designed a new implementation named DynamicAllocationTasksNotice. I created a new timer (lagComputationExec) to collect Kafka lags at a fixed rate, and a new timer (allocationExec) to check for and perform scale actions at a fixed rate.
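The notice mechanism can be sketched as below (a simplified illustration; the real supervisor loop blocks on the queue and handles errors): because notices are drained serially from a single queue, a scaling notice never runs concurrently with other supervisor work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

class NoticeQueueSketch {
    // Minimal Notice interface mirroring the design: each notice type
    // encapsulates one unit of supervisor work.
    interface Notice { String handle(); }

    // Handle notices one at a time in FIFO order, so a
    // DynamicAllocationTasksNotice never races with a RunNotice.
    static List<String> drain(LinkedBlockingQueue<Notice> notices) {
        List<String> handled = new ArrayList<>();
        Notice n;
        while ((n = notices.poll()) != null) {
            handled.add(n.handle());
        }
        return handled;
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Notice> q = new LinkedBlockingQueue<>();
        q.add(() -> "run");                     // stands in for RunNotice
        q.add(() -> "dynamicAllocationTasks");  // stands in for DynamicAllocationTasksNotice
        System.out.println(drain(q));           // FIFO order preserved
    }
}
```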
For allocationExec details, see the work flow above. Furthermore, I expanded the ioConfig spec and added new parameters to control the scaling behavior:
- `enableTaskAutoScaler`: whether to enable this feature. Setting it false, or omitting it, disables the `autoScaler` even though `autoScalerConfig` is not null.
- `lagCollectionIntervalMillis`: the frequency of lag points collection.
- `lagCollectionRangeMillis`: the total time window of lag collection. Used with `lagCollectionIntervalMillis`, it means that in the most recent `lagCollectionRangeMillis`, lag metric points are collected every `lagCollectionIntervalMillis`.
- `scaleOutThreshold`: the threshold of the scale out action.
- `triggerScaleOutThresholdFrequency`: if `triggerScaleOutThresholdFrequency` fraction of lag points are higher than `scaleOutThreshold`, then do the scale out action.
- `scaleInThreshold`: the threshold of the scale in action.
- `triggerScaleInThresholdFrequency`: if `triggerScaleInThresholdFrequency` fraction of lag points are lower than `scaleInThreshold`, then do the scale in action.
- `scaleActionStartDelayMillis`: the delay before lag collection and scale checks begin.
- `scaleActionPeriodMillis`: the frequency of scale-action checks.
- `taskCountMax`: the maximum number of tasks; must satisfy `taskCountMax >= taskCountMin`.
- `taskCountMin`: the minimum number of tasks. When the autoscaler is enabled, the value of taskCount in `IOConfig` is ignored; `taskCountMin` is the number of tasks that ingestion starts with, going up to `taskCountMax`.
- `scaleInStep`: how many tasks to reduce at a time.
- `scaleOutStep`: how many tasks to add at a time.
- `minTriggerScaleActionFrequencyMillis`: the minimum time interval between two scale actions.
- `autoScalerStrategy`: the algorithm of `autoScaler`. ONLY `lagBased` is supported for now (default == `lagBased`).
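Putting the parameters together, a supervisor ioConfig enabling the autoscaler might look like the following sketch (values are illustrative only, and the fraction-threshold field names follow the renames agreed during review, so the exact names may differ in the merged version):

```json
"ioConfig": {
  "topic": "metrics",
  "autoScalerConfig": {
    "enableTaskAutoScaler": true,
    "taskCountMin": 4,
    "taskCountMax": 8,
    "lagCollectionIntervalMillis": 30000,
    "lagCollectionRangeMillis": 600000,
    "scaleOutThreshold": 6000000,
    "triggerScaleOutFractionThreshold": 0.3,
    "scaleInThreshold": 1000000,
    "triggerScaleInFractionThreshold": 0.9,
    "scaleActionStartDelayMillis": 300000,
    "scaleActionPeriodMillis": 60000,
    "scaleInStep": 1,
    "scaleOutStep": 2,
    "minTriggerScaleActionFrequencyMillis": 600000,
    "autoScalerStrategy": "lagBased"
  }
}
```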
Effect evaluation:
I have deployed this feature in our production environment.
![Ingest kafka lag](https://user-images.githubusercontent.com/69956021/96685238-6931ad80-13af-11eb-866d-893db812d862.jpeg)
Figure 1: Kafka ingestion lag
![Running task count per datasource](https://user-images.githubusercontent.com/69956021/96685280-76e73300-13af-11eb-860a-cd2d192d072b.png)
Figure 2: Task count
![Ingest speed per datasource](https://user-images.githubusercontent.com/69956021/96686916-c0d11880-13b1-11eb-87fa-05875ec9a612.png)
Figure 3: Ingest speed total
A Druid ingestion task is divided into two states, reading and writing. When Druid scales out at 10:38, it launches 3 new tasks in the reading state and changes the old task's state from reading to writing; the old task finishes writing in a few minutes. This is why Figure 2 shows a peak of 4 from 10:38 to 10:42 (3 new reading tasks and one writing task) and a peak of 5 from 11:06 to 11:08. In fact, what we really care about is the tasks in the reading state. In other words, the real peak task number is 3 the whole time, scaling out at 10:39 due to Kafka lag, and there is no gap between the traffic peak and the task peak.
Conclusion
Here are the benefits of Druid Auto Scale:
This PR has:
Key changed/added classes in this PR
SeekableStreamSupervisor.java
KafkaSupervisor.java
KafkaSupervisorIOConfig.java