Coexecution jobs don't work with non-default managers #313

Closed
natefoo opened this issue Feb 3, 2023 · 7 comments

@natefoo

natefoo commented Feb 3, 2023

A non-default manager is necessary in this case because it's how we route messages to the correct Pulsar via AMQP.

Upon upgrading usegalaxy.org to 23.0, Pulsar Kubernetes runner jobs are failing with the following error (where tacc_k8s is the manager defined in the runner plugin and is also the destination id):

Traceback (most recent call last):
  File "/usr/local/bin/pulsar-submit", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/pulsar/scripts/submit.py", line 20, in main
    run_server_for_job(args)
  File "/usr/local/lib/python3.7/site-packages/pulsar/scripts/submit_util.py", line 29, in run_server_for_job
    manager, app = manager_from_args(config_builder)
  File "/usr/local/lib/python3.7/site-packages/pulsar/scripts/submit_util.py", line 94, in manager_from_args
    manager = pulsar_app.managers[manager_name]
KeyError: 'tacc_k8s'

usegalaxy.org had been running a custom coexecution image, and unfortunately I no longer recall why or how that image was built. The version of Pulsar in it appears to be from somewhere around d9b7102, and I can't see any differences suggesting I manually hacked in a fix. In local testing I can reproduce this error all the way back to very old 0.14.x versions, so I have been unable to figure out how this was working until now. That image might still work if not for the fact that the client now adds --wait to the pulsar-submit args, and the version of pulsar-submit in that image doesn't support --wait.

It may be possible/correct to set a pulsar app config on the Galaxy side that explicitly defines the tacc_k8s manager. The app conf generated for pulsar-submit's --app_conf_base64 option for these jobs contains:

{
  "staging_directory": "/pulsar_staging",
  "message_queue_url": "amqp://user:pass@host:5671//main_pulsar?ssl=1",
  "manager": {
    "type": "coexecution",
    "monitor": "background"
  },
  "persistence_directory": "/pulsar_staging/persisted_data"
}
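
For illustration, here is a minimal sketch (not Pulsar's actual code, and the default manager name "_default_" is an assumption) of why the lookup in the traceback fails: with only the single "manager" stanza above, the app's manager map contains just the default entry, while the client asks for the manager named after the destination.

# Illustrative sketch only; assumes a single default manager registered as "_default_"
# when the app conf contains a "manager" (singular) stanza rather than "managers".
managers = {"_default_": "<default coexecution manager>"}  # built from the app conf above
manager_name = "tacc_k8s"  # the manager/destination id requested by the client
manager = managers[manager_name]  # raises KeyError: 'tacc_k8s', as in the traceback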

I just wish I understood how this worked before.

@natefoo

natefoo commented Feb 3, 2023

OK, confirmed: if I configure the following in the config file referenced by pulsar_app_config_path in the destination config in Galaxy, it fixes the issue:

---

message_queue_url: amqp://user:pass@host:5671//main_pulsar?ssl=1
managers:
  tacc_k8s:
    type: coexecution
    monitor: background

This is maybe not ideal, but at least there's a workaround that doesn't require the change in #314.
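
For reference, a minimal sketch of how that file might be referenced on the Galaxy side. Only pulsar_app_config_path comes from this thread; the runner id, file path, and surrounding execution/environments layout are illustrative, and exact keys depend on the Galaxy job configuration format in use.

# Hypothetical Galaxy job configuration fragment (YAML); illustrative only.
execution:
  environments:
    tacc_k8s:
      runner: pulsar_k8s
      pulsar_app_config_path: /srv/galaxy/config/pulsar_app_tacc_k8s.yml  # the file shown above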

natefoo added a commit to galaxyproject/usegalaxy-playbook that referenced this issue Feb 3, 2023
@jmchilton

I've tried to get away from having to define managers at all in there, and I'd like to stick with the default manager - I think #315 should fix that? You can just define amqp_key_prefix: pulsar_tacc_k8s_ as a destination parameter and it should all just work. I wanted to give you some time to check my work before I try to rebuild and test, but I guess I should fast-track testing and publishing this, huh? I assume ITs are broken on main?
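
As a sketch, that suggestion would look something like the following in the environment definition (only amqp_key_prefix: pulsar_tacc_k8s_ comes from the comment above; the surrounding structure is illustrative, as in the earlier sketch).

# Hypothetical environment that keeps the default manager and routes AMQP
# messages with a key prefix instead of a named manager.
execution:
  environments:
    tacc_k8s:
      runner: pulsar_k8s
      amqp_key_prefix: pulsar_tacc_k8s_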

@jmchilton

The type: coexecution and monitor: background settings shouldn't be needed anymore either - you should just use the PulsarCoexecutionJobRunner.
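
A runner plugin defined that way might look like the following sketch. Only the class name PulsarCoexecutionJobRunner comes from the comment above; the module path and the amqp_url parameter are assumptions.

# Hypothetical runner definition; module path and parameters are illustrative.
runners:
  pulsar_k8s:
    load: galaxy.jobs.runners.pulsar:PulsarCoexecutionJobRunner
    amqp_url: amqp://user:pass@host:5671//main_pulsar?ssl=1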

@jmchilton

Okay @natefoo - I think this is all ready to go on the dev side - the TL;DR:

@natefoo

natefoo commented Feb 10, 2023

This looks perfect, thanks, I'll try it out. Yes, the reason I had a named manager was for AMQP purposes; having a way to specify the AMQP exchange while still using the default manager seems to me like the best solution for the coexecution case, where named managers don't make sense.

@mvdbeek

mvdbeek commented Apr 10, 2023

In the absence of news I assume this is working now, but please re-open if that's not the case, @natefoo.

mvdbeek closed this as completed Apr 10, 2023