Minion Batch ingestion scheduling bottleneck #11282
cc @snleee Do you see the log of
Ok, so these are the logs from a job scheduling run. As you can see, the task generation is very quick and the scheduling seems to be the bottleneck: 19 seconds passed between generation and the response. Let me know if you need any more info.
19 seconds is normal for Helix to create the task state after a task is submitted, because several of the steps are driven by ZooKeeper watcher callbacks.
Thanks, this will help a lot 🚀
Hello, I've been trying to debug why scheduling `SegmentGenerationAndPushTask` Minion jobs takes so long, and I've narrowed the problem down to this part of the code:

`pinot/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/minion/PinotHelixTaskResourceManager.java`, lines 297 to 309 at commit 78308da (sketched below).
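For readers without that source open, here is a minimal sketch of the shape of the cited span, reconstructed from the discussion in this thread rather than copied from the Pinot repo; `TaskStateWaiter` and the `Supplier` parameter are illustrative stand-ins:

```java
import java.util.function.Supplier;
import org.apache.helix.task.TaskState;

// Sketch (not verbatim Pinot source): after the job is enqueued, the
// controller polls until Helix reports a non-null task state. That state only
// appears once Helix's ZooKeeper watcher callbacks have fired, which is where
// the observed ~19 seconds go.
public final class TaskStateWaiter {

  /**
   * @param taskStateReader stand-in for the synchronized getTaskState() call,
   *     which re-reads the Helix task state on every iteration
   */
  public static TaskState waitForTaskState(Supplier<TaskState> taskStateReader,
      long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    TaskState state;
    while ((state = taskStateReader.get()) == null) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("Timed out waiting for task state");
      }
      Thread.sleep(100L); // short sleep between reads of the task state
    }
    return state;
  }
}
```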
I'm currently using the `POST /tasks/execute` API to schedule the job. The culprit seems to be the while loop waiting for the task to get a state. I'm not familiar with how Helix handles this in the background. Do you think it would be possible to avoid looping on the synchronized `getTaskState()` and instead implement a callback to get the result of a job scheduling (see the sketch below)? This is a big deal for us, since scheduling takes longer than the ingestion itself and keeps us from keeping up with new data and scaling.
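To illustrate the callback idea, here is a rough sketch, assuming the state write path could complete a future from the ZooKeeper watcher callback instead of making the caller poll; `TaskStateFuture` and `onTaskStateCreated` are hypothetical names, not an existing Helix or Pinot API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.apache.helix.task.TaskState;

// Hypothetical sketch only: nothing here exists in Pinot or Helix today.
// It illustrates completing a future from the watcher callback so the submit
// path can return immediately instead of sleeping in a synchronized loop.
public class TaskStateFuture {
  private final CompletableFuture<TaskState> future = new CompletableFuture<>();

  // Would be wired into the ZooKeeper watcher path that today updates the
  // state read by getTaskState(), so it fires as soon as the state exists.
  public void onTaskStateCreated(TaskState state) {
    future.complete(state);
  }

  // Callers that still want to block can wait on the future with a timeout
  // instead of a sleep/poll loop; async callers can chain on it directly.
  public TaskState await(long timeoutMs) throws Exception {
    return future.get(timeoutMs, TimeUnit.MILLISECONDS);
  }
}
```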
It might also be a misconfiguration problem but in this case I will need your help to find it.
Current configuration:
- GKE
- Pinot version 0.12.1
- GCS for deep storage
- 3 ZK: 8 CPU, 18GB RAM
- 6 Servers: 16 CPU, 32/64GB RAM, 1.45TB SSD
- 2 Controllers: 16 CPU, 32GB RAM
- 2 Brokers: 5 CPU, 16.25GB RAM
- 32 Minions: 2 CPU, 2GB RAM
- 1M segments, 4TB of data