
Add kraken2 data manager #2340

Merged
merged 28 commits on May 30, 2019

Conversation

@dfornika (Contributor) commented Mar 26, 2019

FOR CONTRIBUTOR:

- I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
- License permits unrestricted use (educational + commercial)
- This PR adds a new tool or tool collection
- This PR updates an existing tool or tool collection
- This PR does something else (explain below)
@wm75 (Contributor) left a comment

I think you could save a lot of code duplication if you reduced things to only two data managers. The minikraken DM may be different enough to deserve being kept separate, but the other three should really be merged into one.

I imagine this to be rather straightforward with an initial select box in the combined xml that asks for the mode, then presents adjusted configuration options inside a conditional.
On the python side, you could keep your actual worker functions, but combine them into one file, and merge the argument parsers into one.

Even between the minikraken and the merged toolwrapper you could reduce code duplication by putting things into shared macros (definitely requirements, version command, citations should live there).
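One way the merged script's command line could look, sketched (the XML side would mirror this with a mode select box inside a conditional; subcommand and option names below are purely illustrative, not the PR's actual interface):

```python
import argparse

def parse_args(argv=None):
    # One parser with a 'mode' subcommand replaces the four separate
    # per-tool argument parsers (standard/custom/minikraken/special).
    parser = argparse.ArgumentParser(prog='kraken2_build_database')
    subparsers = parser.add_subparsers(dest='mode', required=True)

    subparsers.add_parser('standard')
    subparsers.add_parser('minikraken').add_argument('--version')
    subparsers.add_parser('special').add_argument('--database')
    subparsers.add_parser('custom').add_argument('--fasta')

    return parser.parse_args(argv)

args = parse_args(['special', '--database', 'greengenes'])
```

Each mode's worker function can then be dispatched on `args.mode` while sharing the download and table-writing code.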


```python
# build the index
kraken2_build(
    data_manager_dict,
```
@wm75 (Contributor) commented Mar 27, 2019

Would be clearer if the function simply returned a new dict, instead of modifying an existing one as a side effect.
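A minimal sketch of the suggested pattern (function and field names are illustrative, not the PR's actual code):

```python
def build_data_manager_entry(db_name, db_path):
    # Return a fresh dict rather than mutating one passed in by the caller.
    return {
        "data_tables": {
            "kraken2_databases": [
                {"value": db_name, "name": db_name, "path": db_path}
            ]
        }
    }

entry = build_data_manager_entry("minikraken2_v1_8GB", "minikraken2_v1_8GB")
```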

@dfornika (Author) commented May 24, 2019

Thanks, I've made this change.



```python
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('params')
```
@wm75 (Contributor) commented Mar 27, 2019

This makes for a really strange command line interface. Inside the tool wrapper you should have access to the target_directory as $out_file.extra_files_path, so you could just pass that name on to here like all other parameters. No need to read the json back in then.
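Sketched, the suggested interface might look like this (the option name is hypothetical); the wrapper would hand `$out_file.extra_files_path` over like any other parameter:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('params')
    # target_directory arrives straight from the wrapper
    # (e.g. --target-directory '$out_file.extra_files_path' in the XML),
    # so the script would not need to read it back out of the params JSON.
    parser.add_argument('--target-directory', dest='target_directory')
    return parser.parse_args(argv)

args = parse_args(['galaxy.json', '--target-directory', '/data/kraken2/db'])
```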

@dfornika (Author) commented May 10, 2019

I'm a bit confused by this suggestion. The Galaxy documentation on data managers suggests reading the JSON file.

https://galaxyproject.org/admin/tools/data-managers/how-to/define/

It's also the way that other data managers in this repo seem to find the target_directory:

```python
target_directory = params['output_data'][0]['extra_files_path']
```

...though I haven't checked all of them.

@dfornika (Author) commented May 24, 2019

I've decided to keep this consistent with other data managers in the repo.


```python
}

params = json.loads(open(args.params).read())
target_directory = params['output_data'][0]['extra_files_path']
```
@wm75 (Contributor) commented Mar 27, 2019

See my comment above.

```python
'https://ccb.jhu.edu/software/kraken2/dl/minikraken2_' + minikraken2_version + '_8GB.tgz'
]

run(['wget'] + args, target_directory)
```
@wm75 (Contributor) commented Mar 27, 2019

Why download this to a permanent place? Can't you just put it into the temporary job working directory? IIUC, you only want to keep the unpacked data, right?

@dfornika (Author) commented Mar 28, 2019

Yes, I only want to keep the unpacked data. I may need some help with this. I don't have a clear idea of how to download to the temporary job working directory or how to move the unpacked data to the appropriate directory. I've made an attempt in this commit but I'm not sure that it's correct.

@dfornika (Author) commented May 24, 2019

Thanks, I've used the job working directory as suggested.

```python
    return data_manager_dict


def main():
```
@wm75 (Contributor) commented Mar 27, 2019

My comments on the build_custom main apply here, too.

@dfornika (Author) commented Mar 27, 2019

Thanks for the suggestions @wm75. Will implement them ASAP. If others can do it before me, they're welcome to send a PR to my kraken2-data-manager branch.

@dfornika (Author) commented May 1, 2019

I've started merging these data managers into a single data manager. My plan is to add a new data manager tool XML file and Python script called kraken2_build_database, and to pull functionality from the existing kraken2_build_standard, kraken2_build_custom, kraken2_build_minikraken and kraken2_build_special into that new tool. I'll hold off on removing those four existing tools until I've done a bit more testing, but I do plan to remove them.

@dfornika (Author) commented May 2, 2019

As described above, I've added a new data manager, kraken2_build_database, that currently includes two modes: minikraken and standard. The standard mode is currently broken due to a change in NCBI download URLs; that has been fixed in kraken2-2.0.8_beta, which isn't yet available in Bioconda.

@dfornika (Author) commented May 4, 2019

I've merged the four separate data managers into one and tested that they all at least start to build a Kraken2 database. I've been testing in a Docker container, so it's a bit underpowered to actually finish building some of these databases.

I think there are probably opportunities to remove some redundancy in kraken2_build_database.py, and there are still some testing and design decisions to be made.

@dfornika (Author) commented May 11, 2019

I've found a more powerful system to test this on. Seems to be working!

@bgruening (Member) commented May 11, 2019

@dfornika is this still WIP? Do you want to include @mvdbeek's new data-manager tests?

Thanks!

@dfornika (Author) commented May 11, 2019

Yes @bgruening , I'd still call it a work-in-progress for now. I am interested in adding some tests if possible but haven't looked at how they work. I've confirmed that the minikraken and at least one of the 'special' databases (greengenes) build correctly. I set up a 'standard' database build near the end of day on Friday but I'm not able to access my testing server until Monday to check if it completed.

@dfornika (Author) commented May 12, 2019

I've taken a quick look at the tests on the data_manager_example_blastdb_ncbi_update_blastdb tool. I'm not sure I'll be able to make the tests work with this data manager unless I change what is saved to the Tool Data Table. I'm currently including a timestamp in the Tool Data Table, as I thought it might be useful provenance info. I'm open to suggestions on how that could be redesigned if necessary.

@mvdbeek (Member) commented May 12, 2019

I'd say https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/data_manager_bowtie2_index_builder is probably the best template at this point for writing data manager tests. You can use all the tricks available for tool output checks; for instance, you can use assert_contents to make sure the JSON output matches what you expect. If you're having trouble, just ping me; I can help with the test.

@dfornika (Author) commented May 23, 2019

I noticed that there's a new feature in v2.0.8_beta to skip downloading a large portion of the taxonomy data during database construction (the --skip-maps option). This makes at least one test much more manageable. I've added the test.

When testing in planemo I'm getting an error:

```
2019-05-23 12:07:58,027 ERROR [galaxy.jobs.runners.local] Job wrapper finish method failed
Traceback (most recent call last):
  File "/tmp/tmpv3vizz7u/galaxy-dev/lib/galaxy/jobs/runners/local.py", line 152, in queue_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, exit_code)
  File "/tmp/tmpv3vizz7u/galaxy-dev/lib/galaxy/jobs/runners/__init__.py", line 459, in _finish_or_resubmit_job
    job_state.job_wrapper.finish(stdout, stderr, exit_code, check_output_detected_state=check_output_detected_state)
  File "/tmp/tmpv3vizz7u/galaxy-dev/lib/galaxy/jobs/__init__.py", line 1464, in finish
    self.tool.exec_after_process(self.app, inp_data, out_data, param_dict, job=job)
  File "/tmp/tmpv3vizz7u/galaxy-dev/lib/galaxy/tools/__init__.py", line 2317, in exec_after_process
    assert data_manager is not None, "Invalid data manager (%s) requested. It may have been removed before the job completed." % (data_manager_id)
AssertionError: Invalid data manager (kraken2_build_database) requested. It may have been removed before the job completed.
FAIL
```

...but otherwise it seems to be running correctly.

@dfornika changed the title from "[WIP] Add kraken2 data manager" to "Add kraken2 data manager" on May 23, 2019
@mvdbeek (Member) commented May 23, 2019

If you want to test this locally you need to target the actual tool file; `planemo test data_manager/kraken2_build_database.xml` should work.

@dfornika (Author) commented May 23, 2019

Thanks @mvdbeek, that worked for me. I'm seeing this both locally and in the TravisCI log:

```
History item  different than expected, difference (using diff):
( /tmp/tmp1PTVg_kraken2_custom_data_manager.json v. /tmp/tmp_Q3mjikraken2_custom_data_manager.json )
--- local_file
+++ history_data
@@ -1 +1 @@
-{"data_tables": {"kraken2_databases": [{"path": "database", "name": "database", "value": "database"}]}}
+{"data_tables": {"kraken2_databases": [{"path": "database", "name": "database", "value": "database"}]}}
```

I don't see any difference. The file in the test-data dir is UTF-8 encoded. I'm not sure what else might be different?

@mvdbeek (Member) commented May 23, 2019

Data managers don't write out a trailing newline. You can strip it with `head -c -1 kraken2_custom_data_manager.json`.
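The mismatch is plausibly just that invisible byte: `json.dump` never appends a newline, so an expected-output file saved with a final newline differs even though the printed diff looks identical. A quick sketch:

```python
import json

entry = {"data_tables": {"kraken2_databases": [
    {"path": "database", "name": "database", "value": "database"}]}}

serialized = json.dumps(entry)
# json.dumps (and json.dump) never append a newline, so the data
# manager's output file ends without one; a hand-saved expected file
# in test-data typically gains a trailing newline from the editor.
```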

@dfornika (Author) commented May 23, 2019

One last thing I'd like to check before this is merged: Is there a convention for setting the version numbers on these tools? I'm not sure that I've done that consistently here. I'll take a look at some other tools in this repo.

@bgruening (Member) commented May 23, 2019

@dfornika no convention so far specifically for DMs. But following the tool convention (same version as the underlying tool) is probably a good idea.

👍 from my side.

@dfornika (Author) commented May 24, 2019

I don't have any further changes planned for this data manager. If anyone has suggestions, please let me know here; otherwise, I'd like to request that this be merged.

Thanks @bgruening @mvdbeek @wm75 for your help & guidance.

@bgruening merged commit 30d1a86 into galaxyproject:master on May 30, 2019

1 check passed
@bgruening (Member) commented May 30, 2019

Thanks @dfornika!

@dfornika deleted the kraken2-data-manager branch Aug 1, 2019