
Assign fixed handler name since we only support one handler for now #105

Merged: 3 commits into galaxyproject:master, Feb 5, 2020

Conversation

@nuwang (Member) commented Jan 31, 2020

No description provided.

@pcm32 (Member) commented Jan 31, 2020

Why is it that you support a single handler only? Are we talking about job handlers here?

@afgane (Contributor) commented Jan 31, 2020

@pcm32 This is because of a bug in Galaxy where jobs never get reassigned to a new job handler. Currently, this manifests itself when there is a running job and the job handler pod is restarted: the running job is lost to Galaxy and perpetually remains in the running state in the UI. I believe @natefoo is aware of this (although I can't find an issue on Galaxy for it).

This is also related to the workflow schedulers, which may actually be the same problem internally: galaxyproject/galaxy#8209

@pcm32 (Member) commented Jan 31, 2020

Yes, I remember the workflow issue. On our HPC setup, with multiple web and job handlers, I used a workaround that pings the database every few seconds and randomly assigns jobs to handlers. Is the same applicable to this problem? I know it's dirty, but I would rather have that and be able to run multiple handlers (in the meantime). We used this on a course with 30 attendees here and it worked quite well. We could have a sidecar doing that, constraining the random number to the range of the configured number of handlers.
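
For illustration, a minimal sketch of that sidecar idea, assuming Galaxy's job table (with its handler and state columns), a Postgres service named galaxy-db, and handler names job_handler_0 and job_handler_1; all of the names and credentials below are placeholders:

```yaml
# Hypothetical sidecar Deployment: every 10 seconds, hand any job that has no
# handler yet to a randomly chosen handler. Service, database, credential and
# handler names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: galaxy-handler-assigner
spec:
  replicas: 1
  selector:
    matchLabels: {app: galaxy-handler-assigner}
  template:
    metadata:
      labels: {app: galaxy-handler-assigner}
    spec:
      containers:
        - name: assigner
          image: postgres:11-alpine
          env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef: {name: galaxy-db-secret, key: password}  # assumed secret
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                # Assign unclaimed new jobs to job_handler_0 or job_handler_1 at random.
                psql -h galaxy-db -U galaxy -d galaxy -c "UPDATE job SET handler = 'job_handler_' || floor(random() * 2)::int WHERE handler IS NULL AND state = 'new';"
                sleep 10
              done
```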

@pcm32 (Member) commented Jan 31, 2020

Although I guess the DB assignment doesn't help for the case where the handler is restarted... unless on each restart the handler acquires a new name (which seems to be the case, since you are using the pod name). Then we would somehow keep track of existing and vanished handlers (which should be easy by asking k8s for the existing job handler pods), and reassign jobs that are running on handlers that no longer exist to the existing handlers.

@afgane (Contributor) commented Jan 31, 2020

According to @jmchilton, a lot of logic has moved out of the job handlers with 20.01 (and more will be removed in 20.05), to the point where a single job handler should be able to handle many more concurrent requests than before, so a single job handler should be sufficient for the vast majority of applications.

I agree this is a workaround but I'm also skeptical of developing an external lookup to keep up with the past and current pod names and do the reassignment. It may just be easier to fix it inside Galaxy for good...

@nuwang (Member, Author) commented Feb 1, 2020

@pcm32 Restricting to a single handler is the only way to get sane behaviour at present. An additional argument that John made was that until a single handler proves to be the bottleneck, we may as well restrict it to one handler till a proper solution is implemented. I buy this argument because any effort we put into a workaround is better invested in Galaxy itself. What are your thoughts on this? If you have contradictory data (i.e. one handler is not enough), that would be useful input.

Ultimately, I'd love to see Galaxy fixed to support this. For all practical purposes, one handler may be enough, but in principle, it just gives more mental comfort to know there's no conceptual bottleneck at all. Till then, dynamic handler scaling doesn't work regardless of this patch, so at least this patch makes the behaviour more in line with expectations (provided you never try to scale up). See this issue for some of the ongoing work: galaxyproject/galaxy#9180

@nuwang (Member, Author) commented Feb 1, 2020

An additional note: we could use a StatefulSet to scale up with consistent naming, provided we never scale down. Again, it doesn't seem worth the effort at present.
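
For reference, a minimal sketch of that idea (the image, startup script and service names are assumptions, not the chart's actual templates): a StatefulSet gives each handler a stable ordinal pod name that can be reused as the Galaxy server name, so a restarted pod comes back with the same identity.

```yaml
# Hypothetical StatefulSet: pods are named galaxy-job-0, galaxy-job-1, ... and
# keep those names across restarts, so --server-name stays stable per handler.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: galaxy-job
spec:
  serviceName: galaxy-job            # a matching headless Service is assumed
  replicas: 2
  selector:
    matchLabels: {app: galaxy-job}
  template:
    metadata:
      labels: {app: galaxy-job}
    spec:
      containers:
        - name: galaxy-job
          image: galaxy/galaxy:20.01   # placeholder image
          command: ["/bin/sh", "-c"]
          # HOSTNAME is the pod name, so the same ordinal always yields the
          # same server name; ./startup.sh stands in for the chart's real
          # startup command.
          args:
            - exec ./startup.sh --server-name "$HOSTNAME"
```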

@pcm32 (Member) commented Feb 1, 2020

So that takes me to the question: what is the expected capacity of a single handler (I guess the main variable factor is memory, as it is clear that they run on a single thread), or at what point (in terms of the number of jobs) do you start to see deterioration? In most of my setups I simply provision more than one job handler to avoid hitting any bottlenecks and to be resilient to one-off failures. If we are going to keep the handlers fixed, I would at least fix them to a number bigger than one, for redundancy, which is a must in any cloud setup. Then we could use explicit naming of the handlers and avoid the dynamic problem for the time being. Even if the capacity of a single handler is high, you probably still want to be more fault-tolerant by having more than one.

@nuwang (Member, Author) commented Feb 1, 2020

Fault tolerance is not too much of an issue because k8s will respawn the container if it dies. In fact, fault tolerance becomes better because when the new handler comes up, at least, it'll correctly pick up the jobs instead of orphaning them. So the remaining question seems to be one of scalability. I don't really know to what extent a single handler can scale (perhaps @jmchilton or @natefoo can answer that one), but until we have concrete data to show otherwise, it seems better to err in favour of correct behaviour.

@pcm32 (Member) commented Feb 1, 2020

OK, re fault tolerance. I do, however, have two use cases that I want to start using this for: training (normally around 30 concurrent users) and our RNA-Seq pipelines, which deal with nearly all deposited sequences on ENA. I don't trust a single handler to cope with those, and I don't want to be stressing about whether the course infrastructure will hold, so I would rather have the ability to spawn more than one handler, albeit a fixed number (not dynamic). I don't see why using fixed-name handlers, dropping the ability to scale up or down, doesn't provide correct behaviour; the incorrect behaviour comes from the dynamic setup. So instead of having a single handler by default, I would suggest that we default to 2 or 3 fixed-name handlers and no ability to scale (or we state that scaling is not supported).
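
As a rough sketch of the values side of that proposal, reusing the chart's jobHandlers.replicaCount setting (the comments are assumptions, not chart documentation):

```yaml
# Hypothetical values.yaml excerpt: a fixed number of handlers (more than one
# for redundancy), each of which gets a fixed name such as job_handler_0 and
# job_handler_1. Changing this value on a running deployment is treated as
# unsupported until Galaxy can reassign jobs between handlers.
jobHandlers:
  replicaCount: 2
```

How each handler ends up with its own fixed server name is sketched further down, after the deployment-level change.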

@nuwang (Member, Author) commented Feb 2, 2020

@pcm32 Is this what you were thinking? (just pushed a commit)

@pcm32 (Member) commented Feb 2, 2020

Thanks @nuwang... to be honest I was more after the longer-term discussion, rather than making you change the PR. In the setup you currently propose, if I understand correctly after my quick skim, all handlers would be part of the same pod, correct? Wouldn't we want instead to have independently spawned containers as replicas of a Deployment or some other API construct?

@nuwang (Member, Author) commented Feb 2, 2020

@pcm32 I guess it depends on what time period we are talking about. The long term goal would be for Galaxy to natively support dynamic handlers and to revert this code back to what it was. The interim plan is to implement a low-cost workaround till that happens.

I've just updated this to repeat the handlers at the deployment level.
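
For illustration, repeating at the deployment level can look roughly like the template below: one Deployment per handler index, each pinned to a fixed server name. This is a sketch only; the values keys, image reference and startup command are assumptions rather than the chart's actual template.

```yaml
# Hypothetical Helm template: one Deployment per handler, each with exactly one
# pod and a fixed --server-name, instead of one Deployment with N anonymous
# replicas.
{{- range $index := until (int .Values.jobHandlers.replicaCount) }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ $.Release.Name }}-galaxy-job-{{ $index }}
spec:
  replicas: 1                        # each handler Deployment runs a single pod
  selector:
    matchLabels:
      app: {{ $.Release.Name }}-galaxy-job-{{ $index }}
  template:
    metadata:
      labels:
        app: {{ $.Release.Name }}-galaxy-job-{{ $index }}
    spec:
      containers:
        - name: galaxy-job
          image: {{ $.Values.image.repository }}:{{ $.Values.image.tag }}  # assumed values keys
          command: ["/bin/sh", "-c"]
          # The server name is tied to the handler index, not to the pod name,
          # so a restarted pod keeps the same Galaxy identity; ./startup.sh
          # stands in for the chart's real startup command.
          args:
            - exec ./startup.sh --server-name job_handler_{{ $index }}
{{- end }}
```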

@afgane (Contributor) commented Feb 3, 2020

I am either misunderstanding how this is supposed to work or it's not working as expected. I tested this PR live and am only getting an increased number of container replicas within the same deployment. Further, all the job handlers have the same name.

For example, using Rancher, I increased the number of replicas from 1 to 2 and a new pod starts within the same deployment (e.g., wrapping-tarsier-galaxy-job-0). Looking at the Galaxy logs, both job handlers have the same name: job_handler_0. The upside is that Galaxy jobs continue to run, and even if the deployment is restarted while a long-running job is executing, the job is picked up and successfully processed when it completes.

Alternatively, using helm upgrade --reuse-values wrapping-tarsier cloudve/galaxy --set jobHandlers.replicaCount=2 results in a new deployment being spawned, e.g. wrapping-tarsier-galaxy-job-1, with 2 pods, both started with the same server command: --server-name job_handler_1.

@nuwang (Member, Author) commented Feb 3, 2020

Oops: 7c6ef0f
Thanks for testing this. I guess this would be a reasonable approach then? It's not super thrilling that it's a bigger change than we were originally hoping it would be, but no major pitfalls come to mind?

@afgane (Contributor) commented Feb 5, 2020

I'm going to merge this but hope a proper fix is coming in Galaxy soon. Meanwhile, without this workaround, jobs are lost on pod restarts.

@afgane afgane merged commit b97572e into galaxyproject:master Feb 5, 2020