
Assign fixed handler name since we only support one handler for now #105

Merged: 3 commits into galaxyproject:master, Feb 5, 2020

Conversation

@nuwang (Member) commented Jan 31, 2020

No description provided.

@pcm32 (Member) commented Jan 31, 2020

Why is it that you support a single handler only? Are we talking about job handlers here?

@afgane (Contributor) commented Jan 31, 2020

@pcm32 This is because of a bug in Galaxy where jobs never get reassigned to a new job handler. Currently, this manifests itself when there is a running job and the job handler pod is restarted: the running job is lost to Galaxy and perpetually remains in the running state in the UI. I believe @natefoo is aware of this (although I can't find an issue on Galaxy for it).

This is also related to the workflow schedulers, which may actually be the same problem internally: galaxyproject/galaxy#8209

@pcm32 (Member) commented Jan 31, 2020

Yes, I remember the workflow issue. On our HPC setup, with multiple web and job handlers, I used a workaround that pings the database every few seconds and randomly assigns jobs to handlers. Is the same applicable to this problem? I know it's dirty, but I would rather have that and be able to run multiple handlers (in the meantime). We used this on a course with 30 attendees here and it worked quite well. We could have a sidecar doing that, constraining the random number to the range of the configured number of handlers.
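
For illustration, a minimal sketch of that sidecar idea, assuming Galaxy's job table (with its handler and state columns), a Postgres service named galaxy-db, and handler names job_handler_0 and job_handler_1; all of the names and credentials below are placeholders:

```yaml
# Hypothetical sidecar Deployment: every 10 seconds, hand any job that has no
# handler yet to a randomly chosen handler. Service, database, credential and
# handler names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: galaxy-handler-assigner
spec:
  replicas: 1
  selector:
    matchLabels: {app: galaxy-handler-assigner}
  template:
    metadata:
      labels: {app: galaxy-handler-assigner}
    spec:
      containers:
        - name: assigner
          image: postgres:11-alpine
          env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef: {name: galaxy-db-secret, key: password}  # assumed secret
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                # Assign unclaimed new jobs to job_handler_0 or job_handler_1 at random.
                psql -h galaxy-db -U galaxy -d galaxy -c "UPDATE job SET handler = 'job_handler_' || floor(random() * 2)::int WHERE handler IS NULL AND state = 'new';"
                sleep 10
              done
```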

@pcm32 (Member) commented Jan 31, 2020

Although I guess the DB assignment doesn't help for the case where the handler is restarted... unless on each restart the handler acquires a new name (which seems to be the case, since you are using the pod name). Then we would somehow keep track of existing and vanished handlers (which should be easy by asking k8s for the existing job handler pods), and reassign jobs that are running on handlers that no longer exist to the existing handlers.

@afgane (Contributor) commented Jan 31, 2020

According to @jmchilton, a lot of logic has moved out of the job handlers with 20.01 (and more will be removed in 20.05), to the point where a single job handler should be able to handle many more concurrent requests than before, so a single job handler should be sufficient for the vast majority of applications.

I agree this is a workaround but I'm also skeptical of developing an external lookup to keep up with the past and current pod names and do the reassignment. It may just be easier to fix it inside Galaxy for good...

@nuwang (Member, Author) commented Feb 1, 2020

@pcm32 Restricting to a single handler is the only way to get sane behaviour at present. An additional argument that John made was that until a single handler proves to be the bottleneck, we may as well restrict it to one handler till a proper solution is implemented. I buy this argument because any effort we put into a workaround is better invested in Galaxy itself. What are your thoughts on this? If you have contradictory data (i.e. one handler is not enough), that would be useful input.

Ultimately, I'd love to see Galaxy fixed to support this. For all practical purposes, one handler may be enough, but in principle, it just gives more mental comfort to know there's no conceptual bottleneck at all. Till then, dynamic handler scaling doesn't work regardless of this patch, so at least this patch makes the behaviour more in line with expectations (provided you never try to scale up). See this issue for some of the ongoing work: galaxyproject/galaxy#9180

@nuwang (Member, Author) commented Feb 1, 2020

An additional note: we could use a StatefulSet to scale up with consistent naming, provided we never scale down. Again, it doesn't seem worth the effort at present.
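
For reference, a minimal sketch of that idea (the image, startup script and service names are assumptions, not the chart's actual templates): a StatefulSet gives each handler a stable ordinal pod name that can be reused as the Galaxy server name, so a restarted pod comes back with the same identity.

```yaml
# Hypothetical StatefulSet: pods are named galaxy-job-0, galaxy-job-1, ... and
# keep those names across restarts, so --server-name stays stable per handler.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: galaxy-job
spec:
  serviceName: galaxy-job            # a matching headless Service is assumed
  replicas: 2
  selector:
    matchLabels: {app: galaxy-job}
  template:
    metadata:
      labels: {app: galaxy-job}
    spec:
      containers:
        - name: galaxy-job
          image: galaxy/galaxy:20.01   # placeholder image
          command: ["/bin/sh", "-c"]
          # HOSTNAME is the pod name, so the same ordinal always yields the
          # same server name; ./startup.sh stands in for the chart's real
          # startup command.
          args:
            - exec ./startup.sh --server-name "$HOSTNAME"
```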

@pcm32 (Member) commented Feb 1, 2020

So that takes me to the question: what is the expected capacity of a single handler (I guess the main variable factor is memory, as it is clear that they run on a single thread), or at what point (in terms of the number of jobs) do you start to see deterioration? In most of my setups I simply provision more than one job handler to avoid hitting any bottlenecks and to be resilient to one-off failures. If we are going to keep the handlers fixed, I would at least fix them to a number bigger than one, for redundancy, which is a must in any cloud setup. Then we could use explicit naming of the handlers and avoid the dynamic problem for the time being. Even if the capacity of a single handler is high, you probably still want to be more fault-tolerant by having more than one.

@nuwang (Member, Author) commented Feb 1, 2020

Fault tolerance is not too much of an issue because k8s will respawn the container if it dies. In fact, fault tolerance becomes better because when the new handler comes up, at least, it'll correctly pick up the jobs instead of orphaning them. So the remaining question seems to be one of scalability. I don't really know to what extent a single handler can scale (perhaps @jmchilton or @natefoo can answer that one), but until we have concrete data to show otherwise, it seems better to err in favour of correct behaviour.

@pcm32 (Member) commented Feb 1, 2020

OK, re fault tolerance. I do, however, have two use cases that I want to start using this for: training (normally around 30 concurrent users) and our RNA-Seq pipelines, which deal with nearly all deposited sequences on ENA. I don't trust a single handler to cope with those, and I don't want to be stressing about whether the course infrastructure will hold, so I would rather have the ability to spawn more than one handler, albeit a fixed number (not dynamic). I don't see why using fixed-name handlers, dropping the ability to scale up or down, doesn't provide correct behaviour; the incorrect behaviour comes from the dynamic setup. So instead of having a single handler by default, I would suggest that we default to 2 or 3 fixed-name handlers and no ability to scale (or we state that scaling is not supported).
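
As a rough sketch of the values side of that proposal, reusing the chart's jobHandlers.replicaCount setting (the comments are assumptions, not chart documentation):

```yaml
# Hypothetical values.yaml excerpt: a fixed number of handlers (more than one
# for redundancy), each of which gets a fixed name such as job_handler_0 and
# job_handler_1. Changing this value on a running deployment is treated as
# unsupported until Galaxy can reassign jobs between handlers.
jobHandlers:
  replicaCount: 2
```

How each handler ends up with its own fixed server name is sketched further down, after the deployment-level change.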

@nuwang (Member, Author) commented Feb 2, 2020

@pcm32 Is this what you were thinking? (just pushed a commit)

@pcm32 (Member) commented Feb 2, 2020

Thanks @nuwang... to be honest I was more after the longer-term discussion, rather than making you change the PR. In the setup you currently propose, if I understand correctly after my quick skim, all handlers would be part of the same pod, correct? Wouldn't we want instead to have independently spawned containers as replicas of a Deployment or some other API construct?

@nuwang (Member, Author) commented Feb 2, 2020

@pcm32 I guess it depends on what time period we are talking about. The long term goal would be for Galaxy to natively support dynamic handlers and to revert this code back to what it was. The interim plan is to implement a low-cost workaround till that happens.

I've just updated this to repeat the handlers at the deployment level.
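
For illustration, repeating at the deployment level can look roughly like the template below: one Deployment per handler index, each pinned to a fixed server name. This is a sketch only; the values keys, image reference and startup command are assumptions rather than the chart's actual template.

```yaml
# Hypothetical Helm template: one Deployment per handler, each with exactly one
# pod and a fixed --server-name, instead of one Deployment with N anonymous
# replicas.
{{- range $index := until (int .Values.jobHandlers.replicaCount) }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ $.Release.Name }}-galaxy-job-{{ $index }}
spec:
  replicas: 1                        # each handler Deployment runs a single pod
  selector:
    matchLabels:
      app: {{ $.Release.Name }}-galaxy-job-{{ $index }}
  template:
    metadata:
      labels:
        app: {{ $.Release.Name }}-galaxy-job-{{ $index }}
    spec:
      containers:
        - name: galaxy-job
          image: {{ $.Values.image.repository }}:{{ $.Values.image.tag }}  # assumed values keys
          command: ["/bin/sh", "-c"]
          # The server name is tied to the handler index, not to the pod name,
          # so a restarted pod keeps the same Galaxy identity; ./startup.sh
          # stands in for the chart's real startup command.
          args:
            - exec ./startup.sh --server-name job_handler_{{ $index }}
{{- end }}
```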

@afgane (Contributor) commented Feb 3, 2020

I am either misunderstanding how this is supposed to work or it's not working as expected. I tested this PR live and am only getting an increased number of container replicas within the same deployment. Further, all the job handlers have the same name.

For example, using Rancher, I increased the number of replicas from 1 to 2 and a new pod starts within the same deployment (e.g., wrapping-tarsier-galaxy-job-0). Looking at the Galaxy logs, both job handlers have the same name: job_handler_0. The upside is that Galaxy jobs continue to run, and even if the deployment is restarted while a long-running job is executing, the job is picked up and successfully processed when it completes.

Alternatively, using helm upgrade --reuse-values wrapping-tarsier cloudve/galaxy --set jobHandlers.replicaCount=2 results in a new deployment being spawned, e.g. wrapping-tarsier-galaxy-job-1, with 2 pods, both started with the same server command: --server-name job_handler_1.

@nuwang (Member, Author) commented Feb 3, 2020

Oops: 7c6ef0f
Thanks for testing this. I guess this would be a reasonable approach then? It's not super thrilling that it's a bigger change than we were originally hoping it would be, but no major pitfalls come to mind?

@afgane (Contributor) commented Feb 5, 2020

I'm going to merge this but hope a proper fix is coming in Galaxy soon. Meanwhile, without this workaround, jobs are lost on pod restarts.

@afgane afgane merged commit b97572e into galaxyproject:master Feb 5, 2020