galaxyproject / tools-iuc Public
Add kraken2 data manager #2340
I think you could save a lot of code duplication if you reduced things to only two data managers. The minikraken DM may be different enough to deserve being kept separate, but the other three should really be merged into one.
I imagine this to be rather straightforward with an initial select box in the combined xml that asks for the mode, then presents adjusted configuration options inside a conditional.
On the python side, you could keep your actual worker functions, but combine them into one file, and merge the argument parsers into one.
Even between the minikraken and the merged toolwrapper you could reduce code duplication by putting things into shared macros (definitely requirements, version command, citations should live there).
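A minimal sketch of what merging the argument parsers could look like on the Python side, using `argparse` subcommands to select the mode. The mode names, option names, and worker functions here are hypothetical placeholders, not the ones in this PR:

```python
import argparse


# Hypothetical stand-ins for the existing worker functions.
def build_standard(args):
    return {"mode": "standard", "threads": args.threads}


def build_minikraken(args):
    return {"mode": "minikraken", "version": args.minikraken2_version}


def build_custom(args):
    return {"mode": "custom", "fasta": args.custom_fasta}


def make_parser():
    parser = argparse.ArgumentParser(description="Build a Kraken2 database")
    # Options shared by every mode live on the top-level parser.
    parser.add_argument("--threads", type=int, default=1)
    subparsers = parser.add_subparsers(dest="mode", required=True)

    standard = subparsers.add_parser("standard")
    standard.set_defaults(func=build_standard)

    mini = subparsers.add_parser("minikraken")
    mini.add_argument("--minikraken2-version", default="v2")
    mini.set_defaults(func=build_minikraken)

    custom = subparsers.add_parser("custom")
    custom.add_argument("--custom-fasta", required=True)
    custom.set_defaults(func=build_custom)
    return parser


def main(argv=None):
    args = make_parser().parse_args(argv)
    # Dispatch to whichever worker the chosen mode registered.
    return args.func(args)
```

The select box in the combined XML would then only need to decide which subcommand is written into the command line.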
data_managers/data_manager_build_kraken2_database/data_manager/kraken2_build_custom.py
    # build the index
    kraken2_build(
        data_manager_dict,
It would be clearer if the function simply returned a new dict instead of modifying an existing one as a side effect.
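Roughly what I have in mind, as a sketch only. The function signature and the `kraken2_databases` table name are assumptions for illustration, and the actual build step is left out:

```python
import json


def kraken2_build(database_value, database_name, database_path):
    """Build the index (omitted here) and return a new data manager dict."""
    # ... the actual kraken2-build call would go here ...
    entry = {
        "value": database_value,
        "name": database_name,
        "path": database_path,
    }
    # Nothing is passed in to be mutated; the caller gets a fresh dict.
    return {"data_tables": {"kraken2_databases": [entry]}}


def to_json(data_manager_dict):
    # The caller decides what to do with the result, e.g. serialize it.
    return json.dumps(data_manager_dict, sort_keys=True)
```

The caller then reads as `data_manager_dict = kraken2_build(...)`, which makes the data flow visible at the call site.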
Thanks, I've made this change.
    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument('params')
This makes for a really strange command line interface. Inside the tool wrapper you should have access to the target_directory as $out_file.extra_files_path, so you could just pass that name on to here like all other parameters. No need to read the json back in then.
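As a sketch of the alternative, the script could take the directory as an ordinary option. The option names below are hypothetical; on the wrapper side the value would be `$out_file.extra_files_path`:

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical option names; the wrapper would fill these in.
    parser.add_argument("--target-directory", required=True)
    parser.add_argument("--out-file", required=True)
    return parser.parse_args(argv)
```

With this, `args.target_directory` is available directly and the JSON file only needs to be written, not read.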
I'm a bit confused about this suggestion. The Galaxy documentation on data managers suggests reading the JSON file.
https://galaxyproject.org/admin/tools/data-managers/how-to/define/
It's also the way that other data managers in this repo seem to find the target_directory:
tools-iuc/data_managers/data_manager_build_kraken_database/data_manager/make_json.py
Line 14 in 2f544e3
    target_directory = params['output_data'][0]['extra_files_path']
...though I haven't checked all of them.
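For reference, the pattern I mean looks roughly like this. It is a sketch assuming the params file has the `output_data` layout shown in the line quoted above; the helper name is my own:

```python
import json
import os


def get_target_directory(params_path):
    """Read the Galaxy-provided JSON params and return the target directory."""
    with open(params_path) as handle:
        params = json.load(handle)
    target_directory = params["output_data"][0]["extra_files_path"]
    # Make sure the directory exists before anything is written to it.
    os.makedirs(target_directory, exist_ok=True)
    return target_directory
```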
I've decided to keep this consistent with other data managers in the repo.
    }

    params = json.loads(open(args.params).read())
    target_directory = params['output_data'][0]['extra_files_path']
See my comment above.
data_managers/data_manager_build_kraken2_database/data_manager/kraken2_build_custom.xml
        'https://ccb.jhu.edu/software/kraken2/dl/minikraken2_' + minikraken2_version + '_8GB.tgz'
    ]

    run(['wget'] + args, target_directory)
Why download this to a permanent place? Can't you just put it into the temporary job working directory? IIUC, you only want to keep the unpacked data, right?
Yes, I only want to keep the unpacked data. I may need some help with this. I don't have a clear idea of how to download to the temporary job working directory or how to move the unpacked data to the appropriate directory. I've made an attempt in this commit but I'm not sure that it's correct.
Thanks, I've used the job working directory as suggested.
data_managers/data_manager_build_kraken2_database/data_manager/kraken2_build_minikraken.py
        return data_manager_dict


    def main():
My comments on the build_custom main apply here, too.
Thanks for the suggestions @wm75. Will implement them ASAP. If others can do it before me, they're welcome to send a PR to my
Downloads the DB archive to the job working directory, then extracts its contents to the data manager target directory using Python stdlib functionality only.
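The approach described above can be sketched as follows. This is an illustration under assumptions, not the code from the commit: the function name and arguments are hypothetical, and it uses only `urllib.request` and `tarfile` from the standard library:

```python
import os
import tarfile
import urllib.request


def download_and_extract(url, work_dir, target_directory):
    """Download a .tgz archive into work_dir, then unpack it into target_directory."""
    # The archive itself stays in the (temporary) job working directory.
    archive_path = os.path.join(work_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, archive_path)

    # Only the unpacked contents end up in the permanent location.
    os.makedirs(target_directory, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Note: extractall() trusts member paths, so the source must be trusted.
        archive.extractall(target_directory)
    return sorted(os.listdir(target_directory))
```

Passing `os.getcwd()` as `work_dir` would use the job working directory when the script is run by Galaxy, so the downloaded archive is cleaned up along with the job.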
I've started merging these data managers into a single data manager. My plan is to add a new data manager tool xml file and python script called
As described above, I've added a new data manager
I've merged the four separate data managers into one. I've tested that they all at least start to build a kraken database. I've been testing in a docker container so it's a bit underpowered to actually finish building some of these databases. I think there are probably some opportunities to remove some redundancy in
I've found a more powerful system to test this on. Seems to be working!
Yes @bgruening, I'd still call it a work-in-progress for now. I am interested in adding some tests if possible but haven't looked at how they work. I've confirmed that the minikraken and at least one of the 'special' databases (greengenes) build correctly. I set up a 'standard' database build near the end of day on Friday but I'm not able to access my testing server until Monday to check if it completed.
I've taken a quick look at the tests on the
I'd say https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/data_manager_bowtie2_index_builder is probably the best template at this point for writing data manager tests. You can use all the tricks that you can use for tool output checks, for instance you can use assert_contents to make sure that the json output matches what you would expect. If you're having trouble just ping me, I can help with the test.
I noticed that there's a new feature in
When testing in planemo I'm getting an error:
...but otherwise it seems to be running correctly.
If you want to test this locally you need to target the actual tool file,
Thanks @mvdbeek, that worked for me. I'm seeing this both locally and in the TravisCI log:
I don't see any difference. The file in the
Data managers don't write out a newline. You can strip it with
One last thing I'd like to check before this is merged: Is there a convention for setting the version numbers on these tools? I'm not sure that I've done that consistently here. I'll take a look at some other tools in this repo.
@dfornika No convention so far specifically for DMs. But following the tool convention, same version as the underlying tool version, is probably a good idea.
I don't have any other plans to change this data manager now. If anyone has suggestions for further changes then please let me know here. Otherwise I'll request that this be merged please. Thanks @bgruening @mvdbeek @wm75 for your help & guidance.
Thanks @dfornika!