Coexecution jobs don't work with non-default managers #313

Closed
natefoo opened this issue Feb 3, 2023 · 7 comments

@natefoo

natefoo commented Feb 3, 2023

A non-default manager is necessary in this case because it's how we route messages to the correct Pulsar via AMQP.

Upon upgrading usegalaxy.org to 23.0, Pulsar Kubernetes runner jobs are failing with the following error (where tacc_k8s is the manager defined in the runner plugin and is also the destination id):

Traceback (most recent call last):
  File "/usr/local/bin/pulsar-submit", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/pulsar/scripts/submit.py", line 20, in main
    run_server_for_job(args)
  File "/usr/local/lib/python3.7/site-packages/pulsar/scripts/submit_util.py", line 29, in run_server_for_job
    manager, app = manager_from_args(config_builder)
  File "/usr/local/lib/python3.7/site-packages/pulsar/scripts/submit_util.py", line 94, in manager_from_args
    manager = pulsar_app.managers[manager_name]
KeyError: 'tacc_k8s'

usegalaxy.org had been running a custom coexecution image, and unfortunately I no longer recall why or how that image was built. The version of Pulsar in it appears to be from somewhere around d9b7102, and I can't see any differences suggesting I manually hacked in a fix. In local testing I can reproduce this error all the way back to very old 0.14.x versions, so I have been unable to figure out how this was working until now. That image might still work if not for the fact that the client now adds --wait to the pulsar-submit args, and the version of pulsar-submit in that image doesn't support --wait.

It may be possible/correct to set a pulsar app config on the Galaxy side that explicitly defines the tacc_k8s manager. The app conf generated for pulsar-submit's --app_conf_base64 option for these jobs contains:

{
  "staging_directory": "/pulsar_staging",
  "message_queue_url": "amqp://user:pass@host:5671//main_pulsar?ssl=1",
  "manager": {
    "type": "coexecution",
    "monitor": "background"
  },
  "persistence_directory": "/pulsar_staging/persisted_data"
}
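
For illustration, here is a minimal sketch (not Pulsar's actual code, and the default manager name "_default_" is an assumption) of why the lookup in the traceback fails: with only the single "manager" stanza above, the app's manager map contains just the default entry, while the client asks for the manager named after the destination.

# Illustrative sketch only; assumes a single default manager registered as "_default_"
# when the app conf contains a "manager" (singular) stanza rather than "managers".
managers = {"_default_": "<default coexecution manager>"}  # built from the app conf above
manager_name = "tacc_k8s"  # the manager/destination id requested by the client
manager = managers[manager_name]  # raises KeyError: 'tacc_k8s', as in the traceback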

I just wish I understood how this worked before.

@natefoo

natefoo commented Feb 3, 2023

OK, confirmed: if I configure the following in the config file referenced by pulsar_app_config_path in the destination config in Galaxy, it fixes the issue:

---

message_queue_url: amqp://user:pass@host:5671//main_pulsar?ssl=1
managers:
  tacc_k8s:
    type: coexecution
    monitor: background

This is maybe not ideal, but at least there's a workaround that doesn't require the change in #314.
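
For reference, a minimal sketch of how that file might be referenced on the Galaxy side. Only pulsar_app_config_path comes from this thread; the runner id, file path, and surrounding execution/environments layout are illustrative, and exact keys depend on the Galaxy job configuration format in use.

# Hypothetical Galaxy job configuration fragment (YAML); illustrative only.
execution:
  environments:
    tacc_k8s:
      runner: pulsar_k8s
      pulsar_app_config_path: /srv/galaxy/config/pulsar_app_tacc_k8s.yml  # the file shown above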

natefoo added a commit to galaxyproject/usegalaxy-playbook that referenced this issue Feb 3, 2023
@jmchilton

I've tried to get away from having to define managers at all in there, and I'd like to stick with the default manager - I think #315 should fix that? You can just define amqp_key_prefix: pulsar_tacc_k8s_ as a destination parameter and it should all just work. I wanted to give you some time to check my work before I try to rebuild and test, but I guess I should fast-track testing and publishing this, huh? I assume ITs are broken on main?
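
As a sketch, that suggestion would look something like the following in the environment definition (only amqp_key_prefix: pulsar_tacc_k8s_ comes from the comment above; the surrounding structure is illustrative, as in the earlier sketch).

# Hypothetical environment that keeps the default manager and routes AMQP
# messages with a key prefix instead of a named manager.
execution:
  environments:
    tacc_k8s:
      runner: pulsar_k8s
      amqp_key_prefix: pulsar_tacc_k8s_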

@jmchilton

The type: coexecution and monitor: background settings shouldn't be needed anymore either - you should just use the PulsarCoexecutionJobRunner.
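
A runner plugin defined that way might look like the following sketch. Only the class name PulsarCoexecutionJobRunner comes from the comment above; the module path and the amqp_url parameter are assumptions.

# Hypothetical runner definition; module path and parameters are illustrative.
runners:
  pulsar_k8s:
    load: galaxy.jobs.runners.pulsar:PulsarCoexecutionJobRunner
    amqp_url: amqp://user:pass@host:5671//main_pulsar?ssl=1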

@jmchilton

Okay @natefoo - I think this is all ready to go on the dev side - the TL;DR:

@natefoo

natefoo commented Feb 10, 2023

This looks perfect, thanks, I'll try it out. Yes, the reason I had a named manager was for AMQP purposes; having a way to specify the AMQP exchange while still using the default manager seems to me like the best solution for the coexecution case, where named managers don't make sense.

@mvdbeek

mvdbeek commented Apr 10, 2023

In the absence of news I assume this is working now, but please re-open if that's not the case, @natefoo.

mvdbeek closed this as completed Apr 10, 2023