Assign fixed handler name since we only support one handler for now #105
Conversation
Why is it that you support a single handler only? Are we talking about job handlers here?
@pcm32 This is because of a bug in Galaxy where job handlers for jobs never get reassigned. Currently, this manifests itself if there is a running job and the job handler pod is restarted. The running job is lost into the void for Galaxy and perpetually remains in the running state.

This is also related to the workflow schedulers, which may actually be the same problem internally: galaxyproject/galaxy#8209
Yes, I remember the workflow issue. On our HPC setup, with multiple web and job handlers, I used a workaround that pings the database every few seconds and randomly assigns jobs to handlers. Is the same applicable to this problem? I know it is dirty, but I would rather have that and be able to run multiple handlers (in the meantime). We used it on a course with 30 attendees here and it worked quite well. We could have a sidecar doing that, constraining the random choice to the range of the configured number of handlers.
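For illustration only, a minimal sketch of the kind of sidecar described above. Everything specific in it is an assumption made for the sketch rather than something defined by this chart: the `job` table's `state`/`handler` columns, fixed handler names of the form `handler0`..`handler{N-1}`, and a PostgreSQL database reachable via a `DATABASE_URL` connection string.

```python
"""Hypothetical sidecar loop: every few seconds, assign a handler to any job
that no handler has claimed yet, by updating Galaxy's database directly.

Assumed (not taken from this PR): Galaxy's `job` table has `state` and
`handler` columns, handlers are named handler0..handler{N-1}, and the
database is PostgreSQL reachable via a DATABASE_URL connection string.
"""
import os
import random
import time

import psycopg2

N_HANDLERS = int(os.environ.get("N_HANDLERS", "3"))
HANDLERS = [f"handler{i}" for i in range(N_HANDLERS)]


def assign_pending_jobs(conn):
    """Give every unclaimed 'new' job a randomly chosen fixed-name handler."""
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM job WHERE state = 'new' AND handler IS NULL")
        for (job_id,) in cur.fetchall():
            cur.execute(
                "UPDATE job SET handler = %s WHERE id = %s",
                (random.choice(HANDLERS), job_id),
            )
    conn.commit()


if __name__ == "__main__":
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    while True:
        assign_pending_jobs(conn)
        time.sleep(5)
```

Whether a plain `UPDATE` is enough for the named handler to actually pick the job up is exactly the behaviour being debated in this thread, so this is the shape of the workaround, not a tested fix.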
Although I guess the db assignment doesn't help for the case where the handler is restarted... unless, on each restart, the handler acquires a new name (which seems to be the case since you are using the pod name) and we somehow keep track of existing and vanished handlers (which should be easy by asking k8s for the existing job handler pods), and re-assign jobs that are running on handlers that no longer exist to the handlers that do exist.
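A similarly hedged sketch of that extension: list the job-handler pods that still exist and re-home jobs whose recorded handler is gone. The `galaxy` namespace, the `app=galaxy-job-handler` label selector, and the table/column names are all hypothetical, and, as discussed above, Galaxy may not actually resume a job just because its `handler` column was rewritten.

```python
"""Hypothetical orphan re-homing step for the same sidecar: jobs whose handler
pod no longer exists get reassigned to one of the handler pods that do exist.

Assumed names: namespace `galaxy`, label selector `app=galaxy-job-handler`,
and the same `job` table columns as in the previous sketch.
"""
import os
import random

import psycopg2
from kubernetes import client, config


def live_handler_names(namespace="galaxy", selector="app=galaxy-job-handler"):
    """Return the names of job-handler pods that are currently running."""
    config.load_incluster_config()  # assumes the sidecar runs inside the cluster
    pods = client.CoreV1Api().list_namespaced_pod(namespace, label_selector=selector)
    return {p.metadata.name for p in pods.items if p.status.phase == "Running"}


def rehome_orphaned_jobs(conn, live):
    """Point queued/running jobs at a surviving handler if theirs vanished."""
    if not live:
        return  # nothing to reassign to; leave jobs untouched
    with conn.cursor() as cur:
        cur.execute("SELECT id, handler FROM job WHERE state IN ('queued', 'running')")
        for job_id, handler in cur.fetchall():
            if handler not in live:
                cur.execute(
                    "UPDATE job SET handler = %s WHERE id = %s",
                    (random.choice(sorted(live)), job_id),
                )
    conn.commit()


if __name__ == "__main__":
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    rehome_orphaned_jobs(conn, live_handler_names())
```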
According to @jmchilton, a lot of logic has moved out of the job handlers with 20.01 (and more will be removed in 20.05), to the point where a single job handler can handle many more concurrent requests than before, so one handler should be sufficient for the vast majority of applications. I agree this is a workaround, but I'm also skeptical of developing an external lookup to keep track of past and current pod names and do the reassignment. It may just be easier to fix it inside Galaxy for good...
@pcm32 Restricting to a single handler is the only way to get sane behaviour at present. An additional argument John made was that until a single handler proves to be the bottleneck, we may as well restrict it to one handler until a proper solution is implemented. I buy this argument because any effort we put into a workaround is better invested in Galaxy itself. What are your thoughts on this? If you have contradictory data (i.e. one handler is not enough), that would be useful input.

Ultimately, I'd love to see Galaxy fixed to support this. For all practical purposes, one handler may be enough, but in principle it gives more mental comfort to know there's no conceptual bottleneck at all. Till then, dynamic handler scaling doesn't work regardless of this patch, so at least this patch makes the behaviour more in line with expectations (provided you never try to scale up). Check this issue for some ongoing work: galaxyproject/galaxy#9180
An additional note: we could use a StatefulSet to scale up with consistent naming, provided we never scale down. Again, it doesn't seem worth the effort at present.
So that takes me to the question: what is the expected capacity of a single handler (I guess the main variable factor is memory, as it is clear that they run on a single thread), or at what point (in terms of the number of jobs) do you start to see deterioration? In most of my cases I simply provision more than one job handler to avoid hitting any bottlenecks and to be resilient to any one-off failures. If we are going to keep handlers fixed, I would at least fix them at a number bigger than one, for redundancy, which is a must in any cloud setup. Then we could use explicit naming of the handlers and avoid the dynamic problem for the time being. Even if the capacity of a single handler is high, you probably still want to be more fault-tolerant by having more than one.
Fault tolerance is not too much of an issue because k8s will respawn the container if it dies. In fact, fault tolerance becomes better because when the new handler comes up, at least it'll correctly pick up the jobs instead of orphaning them. So the remaining question seems to be one of scalability. I don't really know to what extent a single handler can scale (perhaps @jmchilton or @natefoo can answer that one), but until we have concrete data to show otherwise, it seems better to err in favour of correct behaviour.
Ok, re fault tolerance. I do however have two use cases that I want to start using this with: training (normally around 30 concurrent users) and our RNA-Seq pipelines, which deal with nearly all deposited sequences on ENA. I don't trust this to work on a single handler, and I don't want to be stressed wondering whether the course infrastructure will hold or not, so I would rather have the ability to spawn more than one handler, albeit a fixed number (not dynamic).

I don't see why using fixed-name handlers, dropping the ability to scale up or down, wouldn't provide correct behaviour; the incorrect behaviour comes from the dynamic setup. So instead of a single handler by default, I would suggest we have 2 or 3 fixed-name handlers by default and no ability to scale (or we say scaling is not supported).
@pcm32 Is this what you were thinking? (Just pushed a commit.)
Thanks @nuwang... to be honest, I was more after the longer-term discussion, rather than making you change the PR. In the current setup that you propose, if I understand correctly after my quick skim, all handlers would be part of the same pod, correct? Wouldn't we instead want independently spawned containers as replicas of a deployment or other API construct?
Force-pushed from 8f162fa to 8ead7c5
@pcm32 I guess it depends on what time period we are talking about. The long-term goal would be for Galaxy to natively support dynamic handlers and to revert this code back to what it was. The interim plan is to implement a low-cost workaround until that happens. I've just updated this so the handlers are repeated at the deployment level.
I am either misunderstanding how this is supposed to work or it's not working as expected. I tested this PR live and am only getting an increased number of container replicas within the same deployment. Further, all the job handlers have the same name. For example, using Rancher, I increased the number of replicas from 1 to 2 and a new pod started within the same deployment (e.g., ...). Alternatively, using ...
Oops: 7c6ef0f
I'm going to merge this but hope a proper fix is coming in Galaxy soon. Meanwhile, without this workaround, jobs are lost on pod restarts. |