Run Galaxy fully under uWSGI, including job handlers #4475

Merged
merged 66 commits into from Nov 28, 2017

4 participants
@natefoo
Member

natefoo commented Aug 22, 2017

Rationale

Previously, if running Galaxy under uWSGI, it was necessary to configure and start separate Paste or application-only (scripts/galaxy-main) Galaxy servers for job handling in order to avoid race conditions with job handler selection, and to separate web workers from job handlers for performance/scalability. This also meant needing some sort of process management (e.g. supervisor) to manage all of these individual processes.

uWSGI Mules are processes forked from the uWSGI master process after the application is loaded. Mules can also exec() arbitrary specified code, and they come with some very nice features:

  1. They can receive messages from uWSGI worker processes.
  2. They can be grouped together into "farms" such that messages sent to a farm are received only by mules in that farm.
  3. They are controlled by the uWSGI master process and can all be stopped and started from a single command line.

Usage

This PR introduces the ability to run Galaxy job handlers as mules. In the simplest form, you can:

$ GALAXY_UWSGI=1 sh run.sh

This will run with a command line like:

$ uwsgi  --virtualenv /home/nate/work/galaxy/.venv --ini-paste config/galaxy.ini --processes 1 --threads 4 --http localhost:8080 --pythonpath lib --master --static-map /static/style=/home/nate/work/galaxy/static/style/blue --static-map /static=/home/nate/work/galaxy/static --paste-logger --die-on-term --enable-threads --py-call-osafterfork

You can override these defaults (other than booleans like --master and --enable-threads) with a [uwsgi] section in galaxy.ini, or just configure everything in galaxy.ini and run uwsgi directly.

By default, with no job_conf.xml, jobs will run in the uWSGI web worker processes, as they did with Paste. This is to keep things simple at first. To run jobs in mules, you only need to start the mules and add them to the correct farm, which must be named job-handlers. Be aware that there are some caveats (below) if you have a job_conf.xml. Mules can be added in any of the following ways (command line, ini config file, yaml config file):

Command line:

$ GALAXY_UWSGI=1 sh run.sh --mule=lib/galaxy/main.py --mule=lib/galaxy/main.py --farm=job-handlers:1,2

ini config file:

[uwsgi]
mule = lib/galaxy/main.py
mule = lib/galaxy/main.py
farm = job-handlers:1,2

yaml config file:

uwsgi:
    mule: lib/galaxy/main.py
    mule: lib/galaxy/main.py
    farm: job-handlers:1,2

For more handlers, simply add additional mule options and add their IDs to the farm option.

Design

Where possible, I have tried to make this stack-agnostic and purpose-agnostic. There is a bit of ugliness around how mules are designated as job handlers (they have to be in a farm named job-handlers), but the goal is to make it easy for anyone, going forward, to send tasks to mules for asynchronous execution. You'll see a few references to "pools," which are a stack-agnostic abstraction of uWSGI farms.

For most other functions you might want to push out to mules, it should be as simple as:

  1. Add a new message class as in galaxy.web.stack.message
  2. Create a message handler and register it with app.application_stack.register_message_handler
  3. Send messages to mules with app.application_stack.send_message
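As a sketch of those three steps, here is a minimal in-process stand-in. The real interface lives in galaxy.web.stack; the Message and ApplicationStack classes below are simplified stubs invented for illustration, not Galaxy's actual implementation:

```python
class Message:
    """Hypothetical message carrying a target handler name and a payload."""
    def __init__(self, target, **params):
        self.target = target
        self.params = params


class ApplicationStack:
    """Stub that dispatches messages in-process instead of via uWSGI farms."""
    def __init__(self):
        self._handlers = {}

    def register_message_handler(self, func, name=None):
        # Step 2: a mule registers a callable under a target name.
        self._handlers[name or func.__name__] = func

    def send_message(self, msg):
        # Step 3: a web worker sends a message; here we dispatch directly.
        self._handlers[msg.target](msg)


# Steps 1-2: define and register a handler, as a mule would at startup.
stack = ApplicationStack()
handled = []

def job_handler(msg):
    handled.append(msg.params["job_id"])

stack.register_message_handler(job_handler)

# Step 3: a web worker sends the setup task for a job.
stack.send_message(Message("job_handler", task="setup", job_id=42))
print(handled)  # [42]
```

In the real stack, send_message crosses process boundaries via the uWSGI farm, but the register/send shape of the API is the same.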

For jobs, messages are only used for handler selection. We create Jobs in the web workers at tool execution time just as before, but they are committed to the database with a null handler field, whereas previously a handler always had to be set at creation time. Mule messages include only the target message handler function, the task to perform (setup), and the job's ID. A mule receives the message, writes its server_name to the handler field, and then picks the job up as handlers did before, without any further modification to the jobs code.

Server names

Under uWSGI, server names are constructed using the template {server_name}.{pool_name}.{instance_id}, where:

  • {server_name} is the original server_name, configurable in the app config (or, with Paste/webless, on the command line with --server-name); by default this is main
  • {pool_name} is the worker pool or farm name: for web workers (the processes forked based on the processes uWSGI option) this is web; for mules it is the farm name, e.g. job-handlers
  • {instance_id} is the 1-based index of the server in its pool: for web workers this is the uWSGI-assigned worker id (an integer starting at 1); for mules it is the mule's 1-indexed position in the farm argument
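The template amounts to a simple format string. As a sketch (the function and variable names here are mine, not Galaxy's internals):

```python
# Server name template as described above; {pool_name} is "web" for web
# workers and the farm name for mules.
SERVER_NAME_TEMPLATE = "{server_name}.{pool_name}.{instance_id}"

def stack_server_name(server_name, pool_name, instance_id):
    """Build the effective server_name for one worker or mule."""
    return SERVER_NAME_TEMPLATE.format(
        server_name=server_name, pool_name=pool_name, instance_id=instance_id)

print(stack_server_name("galaxy", "job-handlers", 1))  # galaxy.job-handlers.1
```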

So in the following galaxy.yml:

uwsgi:
    processes: 4
    mule: lib/galaxy/main.py
    mule: lib/galaxy/main.py
    mule: lib/galaxy/main.py
    farm: job-handlers:2,3
    farm: something-else:1
galaxy:
    server_name: galaxy

uWSGI starts 4 web workers, 2 job handlers, and another mule with server_names:

galaxy.web.1
galaxy.web.2
galaxy.web.3
galaxy.web.4
galaxy.job-handlers.1
galaxy.job-handlers.2
galaxy.something-else.1

This information is important when you want to statically or dynamically map handlers rather than use the default.

Caveats

In order to attempt to support existing job_conf.xml files that have a default <handlers> block, jobs are mapped to handlers in the following manner:

  • If you do not have a job_conf.xml, or have a job_conf.xml with no <handlers> block:
    • If started without a configured job-handlers farm or a non-uWSGI server: web workers are job handlers
    • If started with a job-handlers farm: mules are job handlers
  • If you have a <handlers> block and do not have a default= set in <handlers>:
    • Works the same as if you have no job_conf.xml, except explicitly specified static/dynamic handler mappings will result in the specified handler being assigned
  • If you have a <handlers> block and do have a default= set in <handlers>:
    • The default handler or explicit static/dynamic handler specified is assigned

As before, if a specified handler is assigned and the specified handler is a tag, a handler with that tag is chosen at random. If a handler is assigned due to an explicit static/dynamic mapping, mule messages are not used; the specified handler ID is simply set on the job record in the database.
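Those rules can be summarized as a plain decision function. This is my reading of the description above, not Galaxy's actual code; the function and parameter names are invented for illustration:

```python
def assign_handler(has_job_conf, has_handlers_block, has_default,
                   mapped_handler, have_mule_farm):
    """Return who handles a job: a mapped handler id, 'mules', or 'web-workers'."""
    if mapped_handler is not None:
        # Explicit static/dynamic mapping always wins; no mule message is
        # sent, the handler id is simply written to the job record.
        return mapped_handler
    if has_job_conf and has_handlers_block and has_default:
        # A default= in <handlers> means the default handler is assigned.
        return "default-handler"
    # No (default) handler configured: fall back to the stack, i.e. the
    # job-handlers farm if one exists, otherwise the web workers.
    return "mules" if have_mule_farm else "web-workers"

print(assign_handler(False, False, False, None, True))   # mules
print(assign_handler(True, True, False, None, False))    # web-workers
print(assign_handler(True, True, True, None, True))      # default-handler
```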

One way to mix automatic web/mule handling with mapped handling is to define multiple <handler>s but no default: jobs will then be sent to the web/mule handlers by default, and only tools explicitly mapped to handlers will be sent to the named handlers. It is also possible to map tools to mule handlers in job_conf.xml, using the server_names main.job-handlers.1, main.job-handlers.2, and so on.

This is complicated, and perhaps we should do things less magically, but as usual for Galaxy, I am trying to take the approach of least intervention by admins.

There is some functionality included for templating the server name for greater control - e.g. if you run Galaxy on multiple servers, the server_name (which is persisted in the database and used for job recovery) would need to include some identifier unique for each host. However, configuration for this is not exposed. In the short term, people in that situation (are there any other than me?) can always continue running handlers externally.

Zerg mode is untested and you would have the potential to encounter race conditions during restarts, especially with respect to job recovery.

Configurability

I went through multiple iterations on how to make things configurable. For example:

---
stack:
  workers:
    - name: default-job-handlers
      purpose: handle_jobs    # this essentially controls the role of the mule, what portions of the application it loads, etc.
      processes: 4
      server_name: "{server_name}.{pool_name}.{process_num}"
    - name: special-job-handlers
      purpose: handle_jobs
      processes: 2
      server_name: "{server_name}.{pool_name}.{process_num}"
    - name: spam-spam-spam
      # default purpose, just run the galaxy application
    - name: eggs
      type: standalone    # "webless" galaxy process started externally

This would translate into a command line like:

$ uwsgi ... --mule=lib/galaxy/main.py --mule=lib/galaxy/main.py \
    --mule=lib/galaxy/main.py --mule=lib/galaxy/main.py \
    --mule=lib/galaxy/main.py --mule=lib/galaxy/main.py \
    --mule=lib/galaxy/main.py \
    --farm=default-job-handlers:1,2,3,4 \
    --farm=special-job-handlers:5,6 \
    --farm=spam-spam-spam:7

Prior to #3179, I'd made a separate YAML config for the containers interface. These configs use defaults set as class attributes on the container classes, and those defaults are merged recursively down the class inheritance chain.

I wanted to do the same for the stack config, but with #3179, we can start merging YAML configs into the main Galaxy config. Ultimately (after some discussion on the Gitter channel) I've stripped the configurability out until we settle on whether or not and (if yes) how to support the hierarchical configs/defaults in a way compatible with the model @jmchilton created in that excellent PR.

Invocations

You can start under uWSGI using a variety of methods:

ini-paste:

$ uwsgi --ini-paste config/galaxy.ini ...

ini

$ uwsgi --ini config/galaxy.ini --module 'galaxy.webapps.galaxy.buildapp:uwsgi_app_factory()' ...

yaml

$ uwsgi --yaml config/galaxy.yml --module 'galaxy.webapps.galaxy.buildapp:uwsgi_app_factory()' ...

separate app config

Galaxy config file (ini or yaml) separate from the uWSGI config file (also ini or yaml):

$ uwsgi --<ini|yaml> config/galaxy.<ini|yml> --set galaxy_config_file=config/galaxy.<ini|yml> ...

no config file

(For example):

$ uwsgi --virtualenv /home/nate/work/galaxy/.venv --http localhost:8192 --die-on-term --enable-threads --py-call-osafterfork --master --processes 2 --threads 4 --pythonpath lib --static-map /static/style=/home/nate/work/galaxy/static/style/blue --static-map /static=/home/nate/work/galaxy/static --module 'galaxy.webapps.galaxy.buildapp:uwsgi_app_factory()' --set galaxy_config_file=config/galaxy.yml --mule=lib/galaxy/main.py --mule=lib/galaxy/main.py --farm=job-handlers:1,2

Logging

By default, everything logs to one stream, and you can't tell which messages come from which process. This isn't bad with one mule; with more, it's unmanageable. You can fix this with the following logging config, which uses the custom filters added in this PR to log the uWSGI worker and mule IDs in each log message:

[loggers]
keys = root, galaxy

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = INFO
handlers = console

[logger_galaxy]
level = DEBUG
handlers = console
qualname = galaxy
propagate = 0

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = DEBUG
formatter = generic

[formatter_generic]
format = %(name)s %(levelname)-5.5s %(asctime)s [p:%(process)s,w:%(worker_id)s,m:%(mule_id)s] [%(threadName)s] %(message)s

Or, even better, if you switch to a YAML Galaxy application config, you can use a logging.config.dictConfig dict with a special filename_template parameter on any logging.handlers.*FileHandler handlers that will be templated with galaxy.util.facts, e.g.:

---
galaxy:
    debug: yes
    use_interactive: yes
    logging:
        version: 1
        root:
            # root logger
            level: INFO
            handlers:
                - console
                - files
        loggers:
            galaxy:
                level: DEBUG
                handlers:
                    - console
                    - files
                qualname: galaxy
                propagate: 0
        handlers:
            console:
                class: logging.StreamHandler
                level: DEBUG
                formatter: generic
                stream: ext://sys.stderr
            files:
                class: logging.FileHandler
                level: DEBUG
                formatter: generic
                filename: galaxy_default.log
                filename_template: galaxy_{pool_name}_{server_id}.log
        formatters:
            generic:
                format: "%(name)s %(levelname)-5.5s %(asctime)s [p:%(process)s,w:%(worker_id)s,m:%(mule_id)s] [%(threadName)s] %(message)s"

This will result in 3 log files: galaxy_web_0.log (which actually contains the log messages for all web workers), galaxy_job-handlers_1.log, and galaxy_job-handlers_2.log.
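The filename_template mechanism boils down to resolving the template against a per-process facts dict before the logging config is applied. A minimal sketch, with a hand-written facts dict standing in for what Galaxy derives from galaxy.util.facts:

```python
def resolve_filename(handler_config, facts):
    """If a handler config carries a filename_template, render it against the
    per-process facts and use the result as the handler's filename."""
    template = handler_config.pop("filename_template", None)
    if template is not None:
        handler_config["filename"] = template.format(**facts)
    return handler_config

handler = {
    "class": "logging.FileHandler",
    "filename": "galaxy_default.log",
    "filename_template": "galaxy_{pool_name}_{server_id}.log",
}
# e.g. the first mule in the job-handlers farm (illustrative values)
facts = {"pool_name": "job-handlers", "server_id": 1}
print(resolve_filename(handler, facts)["filename"])  # galaxy_job-handlers_1.log
```

Because each process resolves the template with its own facts, every worker pool and mule ends up with its own log file from a single shared config.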

TODO

For this PR:

  • Support handler assignment for workflow invocations
  • Figure out separate file logging for the workers and each mule
  • Make sure get_uwsgi_args.py works with a variety of [uwsgi] (or lack of) settings in the config
  • Finish refactoring and cleaning
  • Fix linting, tests, etc., whitelist stricter code guidelines
  • Document logging configuration improvements
  • Squash WIP commits (actually, I think I want to preserve these)
  • Tests would be great ¯\_(ツ)_/¯ thanks @jmchilton! (ノ^_^)ノ
  • Default to running jobs in worker(s) and don't start mules if there is no galaxy config file and no job_conf.xml and no mules configured.
  • Fix thread join failure on shutdown
  • Implement correct handler assignment (either web workers or mules) when a default job_conf.xml is in place
  • Correct server_name templating for mules, ideally: {server_name}.{pool_name}.{pool_index} where pool_name is the farm name and pool_index is the mule's index in the farm, or at least {server_name}.{pool_name}.{server_id} where server_id is the mule_id
  • Correct the information in this PR description
  • Also, it shouldn't block merging, but I need to improve the way that a Galaxy determines whether stack messaging should be used for handler selection, and whether a mule is a handler.

Future TODO

For a post-PR issue:

  • Include a default config for proxying GIEs (work started in #2385)
  • If possible, mules should probably give up the message lock after some timeout. This may only be possible with signals since uwsgi.get_farm_msg() does not have a timeout param.
  • Default to uWSGI
  • Document recommended job config changes
  • Add configurability as described above
  • Add configurability for workflow scheduling handlers
  • Support Zerg mode
  • get_uwsgi_args.py won't play nice with mule and farm settings in the uWSGI config file ([uwsgi] section in galaxy.ini or wherever it is in your case)
  • Run handlers without exec(): unbit/uwsgi#1608
  • Run multiple handler pools for mapping jobs in different ways
  • To my knowledge, no other WSGI application stacks support fork/exec-ing asynchronous workers, let alone messaging them. However, I would like to incorporate non-uWSGI worker messaging into the stack code so we could at least send messages to webless Galaxy processes. My initial thought on doing this is to add a stack transport (see galaxy.web.stack.transport) that interfaces with the AMQP support already in galaxy.queues. Alternatively, maybe the stack messaging stuff should be decoupled from the stack entirely and merged directly into galaxy.queues.
'process_num': 1,
'id': 1,
'hostname': socket.gethostname().split('.', 1)[0],
'fqdn': socket.getfqdn(),


@jmchilton

jmchilton Aug 22, 2017

Member

Should this be configurable? Should we synchronize this with galaxy_infrastructure_url configuration in some way? I'm not sure the context so happy to accept no as an answer.


@natefoo

natefoo Aug 22, 2017

Member

In general, yes, this should be more configurable. server_name manipulation is not really exposed though at the moment due to the configurability stuff. And yeah, I don't like this hardcoded set of template vars.

Not sure what you mean with synchronizing with galaxy_infrastructure_url though.


@natefoo

natefoo Aug 22, 2017

Member

Ah, you meant the FQDN specifically. I think I'd rather leave fqdn as it is and add an infrastructure_hostname that would be parsed out of the infrastructure URL. FQDN is going to be more useful in cases like mine where you are running one Galaxy instance across multiple hosts. The infrastructure hostname is going to be the same on all of those.

@nsoranzo


Member

nsoranzo commented Aug 22, 2017

@natefoo What about the workflow schedulers which can presently be configured with config/workflow_schedulers_conf.xml ?

@natefoo


Member

natefoo commented Aug 22, 2017

@nsoranzo Good thinking, shouldn't be too hard to incorporate.

@natefoo


Member

natefoo commented Aug 22, 2017

@nsoranzo, ok handler selection for workflow scheduling is added in c5bd3d4. It's not perfect (it uses the same pool as job handlers) but that'll be addressed when I deal with the configurability items on the TODO list.

@nsoranzo


Member

nsoranzo commented Aug 22, 2017

Thanks @natefoo!

@natefoo


Member

natefoo commented Aug 24, 2017

Separating the uWSGI master/worker logs from the mules is proving to be fairly difficult.

I have been working on moving the server_name formatting dict to a more generalized version that could be reused by other components, with the idea that I could also use it to template filenames in the log handler configuration, since I am not aware of any supported way to dynamically select log files for different application instances running from the same logging.config.fileConfig file (i.e. a single galaxy.ini).

Having done that, I am not sure that it'll be possible to template the log filename using fileConfig. Mules are started with lib/galaxy/main.py so it may be possible to monkeypatch or wrap fileConfig there to provide an alternate args (say, a template_args option under [handler_foo]) to a handler config that templates the filename beforehand. This would be pretty hacky, though. And the same wouldn't be possible with uWSGI's paste logging support since it imports paste.script.util.logging_conf directly in C and sets it up before we have any control in Python.

Using logging.config.dictConfig should give us more options since we could easily manipulate the data structure prior to logging configuration (which would always be performed internally, instead of through PasteScript), but wouldn't be supported with the INI config. However, we could just note that split file logging is not supported unless you switch to the YAML Galaxy app config.

The alternative is uWSGI log routing, except that uWSGI does not, by default, build with PCRE support, and thus does not have log routing capability. We could probably fix this in our wheel, but dividing up the log messages is done with regex, which is less than ideal, and it'd be dependent on the configured log format.

Anyone know of any other solutions?

@jmchilton


Member

jmchilton commented Aug 28, 2017

There is a binary file in there - lib/galaxy/web/framework/.webapp.py.swp that I don't think belongs there.

Update: Actually a couple different such files - added in 84f5a5b.

WIP: support log configuration from yaml/json app config with
logging.config.dictConfig. Still working on yaml/json config loading in
the webless app.
@natefoo


Member

natefoo commented Aug 28, 2017

@jmchilton Thanks, fixed.

@natefoo


Member

natefoo commented Nov 17, 2017

I waffled on the job_handler_count option in the Galaxy config, which was only supposed to be temporary anyway, and removed it. I did this because it causes too much confusion with job_conf.xml, which is where all job config-related stuff is supposed to go.

Also, by default, we won't start any mules, and web workers will just handle jobs. In the future we might want to default to starting a mule, but for now that makes an already big change even bigger and more likely to encounter issues.

So as of now, if you want mules, you must explicitly specify them on the command line, or in the uWSGI config like so (yes, the clobbering key syntax is correct in all cases):

Command line:

$ GALAXY_UWSGI=1 sh run.sh --mule=lib/galaxy/main.py --mule=lib/galaxy/main.py --farm=job-handlers:1,2

ini config file:

[uwsgi]
mule = lib/galaxy/main.py
mule = lib/galaxy/main.py
farm = job-handlers:1,2

yaml config file:

uwsgi:
    mule: lib/galaxy/main.py
    mule: lib/galaxy/main.py
    farm: job-handlers:1,2

As for how handler mapping works (or should work):

  • If you do not have a job-handlers farm
    • If you have no job_conf.xml, the web workers will automatically be used as handlers
    • If you have a job_conf.xml and
      • It has no <handlers> section, we should use the web workers as handlers (TODO)
      • It has a <handlers> section but it only has 1 handler and that handler is main, we should use the web workers as handlers (TODO)
      • It has any other <handlers> config, jobs are assigned to handlers as per that config, web workers will not run jobs
  • If you do have a job-handlers farm
    • If you have no job_conf.xml, the mules will automatically be used as handlers (and web workers will not be used)
    • If you have a job_conf.xml and
      • It has no <handlers> section, we should use the mules (TODO)
      • It has a <handlers> section, jobs are assigned to handlers as per that config, mules will not run jobs unless their server names match a handler id in the job config.
@natefoo


Member

natefoo commented Nov 20, 2017

I've updated the top comment with the description for the actual implementation of job/handler mapping.

@natefoo natefoo added status/review and removed status/WIP labels Nov 21, 2017

@natefoo natefoo changed the title from [WIP] Run Galaxy fully under uWSGI, including job handlers to Run Galaxy fully under uWSGI, including job handlers Nov 21, 2017

@natefoo


Member

natefoo commented Nov 21, 2017

:shipit:

@natefoo


Member

natefoo commented Nov 27, 2017

This is ready for new reviews.

@@ -349,7 +342,6 @@ def __init__(self, **kwargs):
self.smtp_password = kwargs.get('smtp_password', None)
self.smtp_ssl = kwargs.get('smtp_ssl', None)
self.track_jobs_in_database = string_as_bool(kwargs.get('track_jobs_in_database', 'True'))
self.start_job_runners = listify(kwargs.get('start_job_runners', ''))


@jmchilton

jmchilton Nov 28, 2017

Member

The config linting framework has the ability to flag such options - we should remember to document these there.

@jmchilton jmchilton merged commit c6c55e1 into galaxyproject:dev Nov 28, 2017

7 checks passed

api test: Build finished. 317 tests run, 4 skipped, 0 failed.
continuous-integration/travis-ci/pr: The Travis CI build passed
framework test: Build finished. 163 tests run, 0 skipped, 0 failed.
integration test: Build finished. 58 tests run, 0 skipped, 0 failed.
lgtm analysis: JavaScript: No alert changes
selenium test: Build finished. 100 tests run, 1 skipped, 0 failed.
toolshed test: Build finished. 577 tests run, 0 skipped, 0 failed.
@jmchilton


Member

jmchilton commented Nov 28, 2017

Congrats @natefoo - nice work!

@martenson


Member

martenson commented Nov 28, 2017

Splendid! Did we also agree on pronunciation of uwsgi? :)

@natefoo


Member

natefoo commented Nov 28, 2017

Thanks for merging!

nsoranzo added a commit to nsoranzo/galaxy that referenced this pull request Dec 15, 2017

@jmchilton


Member

jmchilton commented on scripts/common_startup_functions.sh in 09feaf6 Jan 24, 2018

But run.sh specifies a different log-file argument - so it is duplicated - this line seems to completely break planemo because server_name is ignored and this is always galaxy.log. Any ideas?


Member

natefoo replied Jan 24, 2018

This is in the dev docs... which I apparently broke. Set $GALAXY_LOG in the environment and it should be honored. I think run.sh only ignores the common_startup.sh args if you're using paste/webless and have set $GALAXY_RUN_ALL?

That said, not entirely sure why I made this change.

@natefoo natefoo referenced this pull request Jan 25, 2018

Closed

Replace paste#http with uwsgi #2393

2 of 2 tasks complete