Split Runner to Runner and Poller #32
Comments
Not seriously considered for now (I'll take a look later), but that sounds like a better architecture at a glance 👍
I want to know the details of the Poller implementation direction. How does it decide which task to poll? As far as I understand, the process that spawns a task and the one that polls it are different. In other words, which process should poll which task was obvious when they were not distributed (a process should poll the task it spawned), but it becomes arbitrary once distributed. Will it poll the task whose created_at is earliest and which is not being polled by another poller? If so, will it get stuck when the polled task takes a long time? Or will it randomly select a task to poll every time? With my current understanding (one poller process can't poll multiple tasks), it has 2 problems:
Then, what do you think about receiving ECS task notification events via SQS? It would be cost-efficient for polling if all tasks take a long time, and it would solve the "which task to poll" problem. I'm sorry, but I haven't considered the Docker runner counterpart 🙃
Randomly selected from all running job executions. One polling step for each execution is just checking whether its ECS task has stopped, as in the loop below.

Right, but the actual finished time of the task can be obtained from the task result (`stopped_at` below). The poller process acts as follows:

```ruby
loop do
  # Shuffle so that no single long-running execution starves the others.
  Barbeque::JobExecution.running.shuffle.each do |job_execution|
    task_identifier = find_task_identifier_from_db(job_execution)
    if ecs_task_stopped?(task_identifier)
      task = get_task_result(task_identifier)
      job_execution.update!(status: task.success? ? :success : :failed, finished_at: task.stopped_at)
    end
  end
  sleep(interval)
end
```

Moreover,
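For reference, the helpers in the pseudocode above (`find_task_identifier_from_db`, `ecs_task_stopped?`, `get_task_result`) could be backed by the ECS DescribeTasks API. A minimal sketch, not Barbeque's actual implementation (`poll_ecs_task` and `TaskResult` are invented names; the cluster and task ARN are whatever the Runner stored):

```ruby
require 'aws-sdk-ecs' # AWS SDK for Ruby v3

# Small value object matching the pseudocode's task.success? / task.stopped_at.
TaskResult = Struct.new(:exit_code, :stopped_at) do
  def success?
    exit_code == 0
  end
end

ECS_CLIENT = Aws::ECS::Client.new

# One polling step; returns nil while the task is still running.
def poll_ecs_task(cluster, task_arn)
  task = ECS_CLIENT.describe_tasks(cluster: cluster, tasks: [task_arn]).tasks.first
  return nil unless task && task.last_status == 'STOPPED'
  TaskResult.new(task.containers.first.exit_code, task.stopped_at)
end
```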
Yes, that's exactly what I'd like to implement in the next big step. It should be able to poll much more efficiently.

```ruby
loop do
  message = sqs_client.receive_message
  task = extract_task_info(message)
  if task.stopped?
    job_execution = find_execution_from_task_info(task)
    if job_execution
      job_execution.update!(status: task.success? ? :success : :failed, finished_at: task.stopped_at)
    end
  end
end
```

I have to implement such a feature in Hako first, then support it in Barbeque (and Kuroko2!). SPOILER: I'm writing a complete patch for this issue on this branch: https://github.com/eagletmt/barbeque/tree/runner-and-poller
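For reference, a minimal sketch of that loop against the real aws-sdk-sqs API, assuming ECS task state change events are delivered to the queue via CloudWatch Events; the queue URL variable and the `task_arn` lookup column on job executions are hypothetical:

```ruby
require 'aws-sdk-sqs' # AWS SDK for Ruby v3
require 'json'
require 'time'

sqs = Aws::SQS::Client.new
queue_url = ENV.fetch('TASK_EVENT_QUEUE_URL') # hypothetical queue for ECS task events

loop do
  resp = sqs.receive_message(queue_url: queue_url, max_number_of_messages: 10, wait_time_seconds: 20)
  resp.messages.each do |message|
    # ECS task state change events carry the task status in `detail`.
    detail = JSON.parse(message.body)['detail']
    if detail && detail['lastStatus'] == 'STOPPED'
      job_execution = Barbeque::JobExecution.find_by(task_arn: detail['taskArn']) # hypothetical column
      if job_execution
        success = detail['containers'].all? { |c| c['exitCode'] == 0 }
        job_execution.update!(status: success ? :success : :failed,
                              finished_at: Time.parse(detail['stoppedAt']))
      end
    end
    # Delete the message either way so it isn't redelivered.
    sqs.delete_message(queue_url: queue_url, receipt_handle: message.receipt_handle)
  end
end
```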
I see, that totally makes sense for that part. Having the shuffle poller as the next step sounds like a reasonable decision.
If ECS cluster scale-in is properly implemented, infinite scale-out wouldn't be a problem on the Barbeque side. However, it could be problematic on the side of the executed applications, in situations like the following:

While ECS scale-out wouldn't be that fast and such cases would rarely be problematic, it would be better to think about a way to limit the number of running executions (per application?).

As a casual and easy-to-implement (but not scalable) approach, we could query the count of running executions every time. Another way would be to add a column to the applications table and manage the count in it by incrementing/decrementing with a single query (if we don't want to manage Redis). Sketches of both follow below.
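A minimal sketch of both ideas, under loud assumptions: `concurrency_limit` (and `running_count` for the counter variant) are hypothetical columns, the application model is assumed to be `Barbeque::App`, and a job execution is assumed to reach its application through its job definition:

```ruby
# Casual approach: count the running executions on every start. Simple, but
# racy under concurrent workers and an extra query per job.
def under_limit?(job_execution)
  app = job_execution.job_definition.app
  running = Barbeque::JobExecution.running
    .joins(:job_definition)
    .where(job_definitions: { app_id: app.id })
    .count
  running < app.concurrency_limit
end

# Counter-column approach: a single atomic UPDATE acquires a slot, so
# concurrent workers can't both slip past the limit and no Redis is needed.
def try_acquire_slot(app)
  Barbeque::App.where(id: app.id)
    .where('running_count < concurrency_limit')
    .update_all('running_count = running_count + 1') == 1
end
```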
I'm especially concerned about this situation. Barbeque jobs can spike when end-user access spikes, which is often unpredictable to us. Running too many jobs concurrently could

I will try the easiest way of keeping
+1 for "Setting per-application limit" for concurrency problems.
After having a short glance at this and #38, is the "latency issue for monitoring" already solved? It's related to this sentence:

Will this problem be solved by getting the actual finished time? I mean, can we ignore the latency between the moment when the job actually finished and the moment when the poller got the actual finished time? Aha, or the problem might be out of scope for this issue:

Yes. As I answered in #32 (comment), the current implementation of executing
Implemented in #38
Original issue description:

Currently, barbeque-worker runs the specified command as follows: `docker run cmd...` or `hako oneshot cmd...`.
Problem

It's simple enough, but there are some problems.

- When `hako` is used, even if barbeque-worker goes down unexpectedly while running `hako`, the enqueued job is still running on another host (an ECS container instance). barbeque-worker loses the running job completely and cannot recover it.
- When `hako` is used, barbeque-worker doesn't consume much server resource, because `hako` executes the job on another host (an ECS container instance). We could execute more jobs if the ECS cluster has enough capacity.

Solution
Split `Barbeque::Runner` into two parts: Runner and Poller.
- Runner runs `docker run --detach` and stores its container id, or runs `hako oneshot --no-wait` and stores its ECS cluster and ECS Task ARN (see the sketch after this list).
  - `--no-wait` was added recently: https://github.com/eagletmt/hako/blob/master/CHANGELOG.md#160-2017-06-23
- Poller checks the stored container or task, e.g. with `docker inspect` and `docker logs`.
- The executed command therefore becomes `docker run --detach cmd...` or `hako oneshot --no-wait cmd...`.
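A minimal sketch of the Docker side of this split, relying only on documented `docker` CLI behavior (`docker run --detach` prints the new container id on stdout; `docker inspect` exposes `State.Status` and `State.ExitCode`); the helper names are invented:

```ruby
require 'open3'

# Runner side: start the job detached and return the container id that
# `docker run --detach` prints, to be persisted for the Poller.
def start_detached(image, command)
  stdout, stderr, status = Open3.capture3('docker', 'run', '--detach', image, *command)
  raise "docker run failed: #{stderr}" unless status.success?
  stdout.chomp
end

# Poller side: one polling step against a stored container id.
# Returns nil while the container is still running.
def poll_container(container_id)
  out, = Open3.capture2('docker', 'inspect', '--format',
                        '{{.State.Status}} {{.State.ExitCode}}', container_id)
  state, exit_code = out.split
  return nil unless state == 'exited'
  log, = Open3.capture2('docker', 'logs', container_id) # collect the job's output
  { success: exit_code.to_i == 0, log: log }
end
```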
I call this pair of Runner and Poller an Executor. The Executor can be customized just like the current Runner.
Pros
Cons
cc: @cookpad/dev-infra @k0kubun