Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add reworked chronos job runner #11228

Merged
merged 6 commits into from
Feb 12, 2021
Merged

add reworked chronos job runner #11228

merged 6 commits into from
Feb 12, 2021

Conversation

mtangaro
Copy link
Member

@mtangaro mtangaro commented Jan 26, 2021

Dear all,
during the last Europe Biohackathon (2020) we reworked the Galaxy Mesos (Chronos) job runner, allowing it to properly work.
Meanwhile, I also did some other improvements and tests.
This is a joint effort by me, @gmauro, @pmandreoli, @maricaantonacci, @aknijn and, of course, @bgruening.

Brief changelog:

  1. Fix the working directory inside the container: the working directory in the container was not right, resulting in the impossibility to retrieve the output files.

  2. GALAXY_SLOTS and GALAXY_MEMORY_MB added to the json and available within the Container.
    GALAXY_SLOTS set to docker_cpu
    GALAXY_MEMORY_MB set to docker_memory
    It has been tested using this documentation page.

  3. A new command script, named chronos_{{job number}}.sh is created instead passing the command in the container preventing wrong characters parsing.
    This solve the following issue: Command badly rendered for chronos.py job runner on galaxy 20.05 #10677
    Not sure here if this is the right way to implement this, since I'm not using a template and it exploits the $SHELL variable to run it. This variable is correctly handled with biocontainers and, for example, ubuntu images. Any suggestion is more than welcome.

  4. Now there is the possibility to mount multiple path, e.g. galaxy database path and CVMFS reference data, in the Container, with proper permissions.

  5. I added the fail_job function to the runner to correctly show tools stderr in Galaxy. The stderr and stdout are correctly saved in the corresponding file under “outputs”. But, in case of job failure, during the “stop_job” step DB entry with the stderr is overwritten (with nothing).
    To fix this, basically, I call the “self._finish_or_resubmit_job” after the “job_state.job_wrapper.fail”, otherwise the last one overwrite the stderr entry in the DB. Please let me know if there is a better way to do this.

  6. Minor fix to job_conf.xml comments and dependencies list.

@gmauro and @maricaantonacci also setup a docker compose here to have it continuously tested. I hope we can do soon a PR for it.

@pmandreoli tested it with the following tools: bowtie2, trimmomatic, trigalore, sort, fastqc, using the galaxyproject CVMFS for the reference data.

To Do list:
this is a first working version, there is still room for improvements. For example, in case of many retries of the same job, the "stop_job" step is not correctly handled.

Marco

@gmauro
Copy link
Member

gmauro commented Jan 26, 2021

Super!

@gmauro
Copy link
Member

gmauro commented Jan 26, 2021

I created a PR for the docker-compose setup here

@jmchilton
Copy link
Member

Can you fix the linting issues here:

./lib/galaxy/jobs/runners/chronos.py:65:5: E731 do not assign a lambda expression, use a def
./lib/galaxy/jobs/runners/chronos.py:65:20: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:65:22: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:67:14: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:67:20: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:68:42: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:68:48: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:72:43: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:73:3: E114 indentation is not a multiple of four (comment)
./lib/galaxy/jobs/runners/chronos.py:74:3: E114 indentation is not a multiple of four (comment)
./lib/galaxy/jobs/runners/chronos.py:75:3: E114 indentation is not a multiple of four (comment)
./lib/galaxy/jobs/runners/chronos.py:76:3: E731 do not assign a lambda expression, use a def
./lib/galaxy/jobs/runners/chronos.py:76:3: E111 indentation is not a multiple of four
./lib/galaxy/jobs/runners/chronos.py:76:21: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:76:26: E201 whitespace after '{'
./lib/galaxy/jobs/runners/chronos.py:76:48: E202 whitespace before '}'
./lib/galaxy/jobs/runners/chronos.py:77:3: E111 indentation is not a multiple of four
./lib/galaxy/jobs/runners/chronos.py:77:7: E222 multiple spaces after operator
./lib/galaxy/jobs/runners/chronos.py:77:10: E201 whitespace after '['
./lib/galaxy/jobs/runners/chronos.py:77:77: E202 whitespace before ']'
./lib/galaxy/jobs/runners/chronos.py:78:3: E111 indentation is not a multiple of four
./lib/galaxy/jobs/runners/chronos.py:80:1: E302 expected 2 blank lines, found 1
./lib/galaxy/jobs/runners/chronos.py:212:51: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:212:54: E231 missing whitespace after ','
./lib/galaxy/jobs/runners/chronos.py:217:5: E303 too many blank lines (2)
./lib/galaxy/jobs/runners/chronos.py:232:19: E111 indentation is not a multiple of four
./lib/galaxy/jobs/runners/chronos.py:234:19: E111 indentation is not a multiple of four
./lib/galaxy/jobs/runners/chronos.py:280:5: E115 expected an indented block (comment)
./lib/galaxy/jobs/runners/chronos.py:281:5: E115 expected an indented block (comment)
./lib/galaxy/jobs/runners/chronos.py:286:13: E225 missing whitespace around operator
./lib/galaxy/jobs/runners/chronos.py:300:13: E265 block comment should start with '# '
./lib/galaxy/jobs/runners/chronos.py:319:105: E202 whitespace before '}'

@mtangaro
Copy link
Member Author

mtangaro commented Feb 9, 2021

Hi @jmchilton, I've fixed the linting issues.
I have two more issues, which actually I had not before.
The one from "Integration / Test (3.7, kubernetes" seems not related to the runner.
While I can't see any log from the "get_code_and_test" test.
Can you have a look?

@nsoranzo
Copy link
Member

nsoranzo commented Feb 9, 2021

@mtangaro The kubernetes issue should be fixed if you rebase or merge dev.

@mtangaro
Copy link
Member Author

mtangaro commented Feb 9, 2021

Thanks a lot @nsoranzo. Everything is ok now :)

<param id="max_retries">2</param>
directories. Other directories of the data used by tools can
be mounted as well, separated by commas.-->
<param id="max_retries">0</param>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance I can get you to update the corresponding docs in the sample YAML file (https://github.com/galaxyproject/galaxy/blob/dev/test/unit/jobs/job_conf.sample_advanced.yml). It copied a lot of the comments out of this sample XML but I haven't gotten around to replacing the XML version with the YAML version - but the YAML will be the go to at some point soon.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes without saying! Next time I will update it by default.

@jmchilton
Copy link
Member

Thanks so much!

@github-actions
Copy link

This PR was merged without a "kind/" label, please correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants