Minion Batch ingestion scheduling bottleneck #11282

Closed
t0mpere opened this issue Aug 7, 2023 · 4 comments · Fixed by #11315

Comments

@t0mpere
Contributor

t0mpere commented Aug 7, 2023

Hello, I've been trying to debug why scheduling SegmentGenerationAndPushTask Minion jobs takes so long, and I've narrowed the problem down to this part of the code.

JobConfig.Builder jobBuilder =
    new JobConfig.Builder().addTaskConfigs(helixTaskConfigs).setInstanceGroupTag(minionInstanceTag)
        .setTimeoutPerTask(taskTimeoutMs).setNumConcurrentTasksPerInstance(numConcurrentTasksPerInstance)
        .setIgnoreDependentJobFailure(true).setMaxAttemptsPerTask(1).setFailureThreshold(Integer.MAX_VALUE)
        .setExpiry(_taskExpireTimeMs);
_taskDriver.enqueueJob(getHelixJobQueueName(taskType), parentTaskName, jobBuilder);
// Wait until task state is available
while (getTaskState(parentTaskName) == null) {
  Uninterruptibles.sleepUninterruptibly(100, TimeUnit.MILLISECONDS);
}
return parentTaskName;

I'm currently using the POST /tasks/execute API to schedule the job.
The culprit seems to be the while loop waiting for the task to get a state. I'm not familiar with how Helix handles this in the background. Do you think it would be possible to avoid looping on the synchronized getTaskState() and instead implement a callback to get the result of a job scheduling?
This is a big deal for us, since scheduling takes longer than the ingestion itself and prevents us from keeping up with new data and scaling.
It might also be a misconfiguration problem, but in that case I will need your help to find it.
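
To illustrate what I mean, here is a rough sketch (not a patch; the 30-second deadline is just an arbitrary number I picked for the example, and all other names come from the snippet above) of how the tail of that code could at least fail fast instead of blocking the scheduler indefinitely. A callback-based notification would of course be even better:

_taskDriver.enqueueJob(getHelixJobQueueName(taskType), parentTaskName, jobBuilder);
// Sketch: wait for Helix to materialize the task state, but give up after an
// arbitrary example deadline instead of looping indefinitely.
long deadlineMs = System.currentTimeMillis() + 30_000L;
while (getTaskState(parentTaskName) == null) {
  if (System.currentTimeMillis() >= deadlineMs) {
    throw new IllegalStateException("Timed out waiting for task state of: " + parentTaskName);
  }
  Uninterruptibles.sleepUninterruptibly(100, TimeUnit.MILLISECONDS);
}
return parentTaskName;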

Current configuration:
GKE
version 0.12.1
GCS for deep storage
3 ZK - 8 CPU and 18GB ram
6 Servers - 16CPU and 32 64GB ram 1.45TB SSD
2 Controllers - 16 CPU and 32GB ram
2 Brokers - 5 CPU 16.25GB ram
32 Minions - 2 CPU and 2GB of ram

1M Segments 4TB of data

@Jackie-Jiang
Contributor

cc @snleee

Do you see the log line Submitting parent task... before the scheduling call returns? Usually the task state should be available very soon, so we need to figure out whether it is creating the tasks (in SegmentGenerationAndPushTaskGenerator) or scheduling them that takes the time.

@t0mpere
Contributor Author

t0mpere commented Aug 8, 2023

OK, so these are the logs from one job scheduling run. As you can see, task generation is very quick and the scheduling seems to be the bottleneck: 19 seconds pass between generation and the response. Let me know if you need any more info.

2023-08-04 16:50:00.118 BST Trying to create tasks of type: SegmentGenerationAndPushTask, table: TABLE
2023-08-04 16:50:00.434 BST Submitting ad-hoc task for task type: SegmentGenerationAndPushTask with task configs: [...]
2023-08-04 16:50:00.452 BST Submitting parent task: Task_SegmentGenerationAndPushTask_TABLE_c826881f-a0c5-48d0-bb58-8a43b3037b60 of type: SegmentGenerationAndPushTask with 1 child task configs
2023-08-04 16:50:00.456 BST Add job configuration TaskQueue_SegmentGenerationAndPushTask_Task_SegmentGenerationAndPushTask_TABLE_c826881f-a0c5-48d0-bb58-8a43b3037b60
2023-08-04 16:50:19.144 BST Handled request from 10.00.00.00 POST http://prod.host:80/tasks/execute, content-type application/json status code 200 OK
2023-08-04 16:50:19.376 BST Trying to create tasks of type: SegmentGenerationAndPushTask, table: TABLE_OTHER

@Jackie-Jiang
Contributor

19 seconds is normal for Helix to create the task state after a task is submitted, because several steps are driven by ZK watcher callbacks.
We shouldn't need to wait for the task state to show up, though. It seems to be a workaround, introduced in #1894, for a Helix bug. Since we have already upgraded to a newer Helix version, let me see if we can remove the workaround.
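
For reference, removing the workaround would reduce the snippet from the issue description to roughly the following (just a sketch of the direction, not necessarily the actual change):

// Sketch: enqueue the job and return right away; rely on Helix to create the
// task state asynchronously instead of blocking the scheduler on it.
_taskDriver.enqueueJob(getHelixJobQueueName(taskType), parentTaskName, jobBuilder);
return parentTaskName;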

@t0mpere
Contributor Author

t0mpere commented Aug 10, 2023

Thanks, this will help a lot 🚀
