fast5_archive datatype #4569

Merged
merged 7 commits into galaxyproject:dev on Sep 12, 2017

@jvolkening
Contributor

jvolkening commented Sep 7, 2017

FAST5 files are HDF5-based sequence containers produced by Oxford Nanopore sequencers. Typically one file is produced per read, so hundreds of thousands to millions of files can be produced in a run. To make management easier, sets of related FAST5 files can be compressed and distributed as tarballs.

I'm wrapping some nanopore-related software and need to support FAST5 input/output. This PR adds a fast5_archive datatype (inspired by the searchgui_archive type from GalaxyP): simply a tarball containing FAST5 files in any directory structure, which can be sniffed but is marked not to be automatically decompressed. I considered using collections of individual FAST5 files, but this approach seemed easier to manage and easier on the backend database.
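As a concrete illustration of the packaging described above (the function name and layout are illustrative, not part of this PR), a run's per-read .fast5 files can be bundled into a single gzipped tarball with Python's tarfile module:

```python
import tarfile


def pack_fast5_dir(src_dir, out_path):
    """Bundle a directory tree of per-read .fast5 files into one
    gzipped tarball, preserving its directory structure."""
    with tarfile.open(out_path, 'w:gz') as tar:
        tar.add(src_dir, arcname='.')
```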

jvolkening added some commits Sep 5, 2017

@galaxybot galaxybot added the triage label Sep 7, 2017

@galaxybot galaxybot added this to the 17.09 milestone Sep 7, 2017

@jvolkening

Contributor

jvolkening commented Sep 7, 2017

Incidentally, it would be nice if the proper file extension (tar, tar.gz, or tar.bz2) were appended on download, rather than the fast5_archive extension, which is meaningless outside Galaxy. Is this possible? It can't be hard-coded, since any of the above are accepted. I could check the magic bits to determine the correct download extension dynamically if I knew where to do this in the code.
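For illustration, a sketch of the magic-byte check being asked about here (the helper name is hypothetical and not part of this PR; the gzip and bzip2 magic numbers themselves are standard):

```python
def guess_archive_extension(path):
    """Pick a download extension by inspecting the leading magic bytes
    of the archive: gzip starts with 1f 8b, bzip2 with b"BZh"."""
    with open(path, 'rb') as fh:
        head = fh.read(3)
    if head[:2] == b'\x1f\x8b':
        return 'tar.gz'
    if head == b'BZh':
        return 'tar.bz2'
    return 'tar'
```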

@mvdbeek

Looks nice, I like the approach.

if filename and tarfile.is_tarfile( filename ):
    temptar = tarfile.open( filename, 'r' )
    contents = temptar.getnames()
    r = re.compile(".+\.fast5$")

@mvdbeek

mvdbeek Sep 7, 2017

Member

I think return any(c.endswith('.fast5') for c in contents) would be preferable here, or if we're talking millions of files then

with tarfile.open(filename, 'r') as temptar:
    contents = temptar.getnames()
    for content in contents:
        if content.endswith('.fast5'):
            return True

might have better performance than filter on the whole list.

if dataset and tarfile.is_tarfile( dataset.file_name ):
    temptar = tarfile.open( dataset.file_name, 'r' )
    contents = temptar.getnames()
    r = re.compile(".+\.fast5$")

@mvdbeek

mvdbeek Sep 7, 2017

Member

Can we replace the regex with .endswith('.fast5')?
Also filter returns an iterator on python3, so I guess the most efficient way to get the count would be sum(1 for content in contents if content.endswith('.fast5'))

@mvdbeek

Member

mvdbeek commented Sep 7, 2017

it would be nice if the proper file extension (tar, tar.gz, or tar.bz2) were appended on download

So the tarball may be compressed in any of these formats? One option would be to have subclasses for each of the compression types that append the extension. I don't think we have a dynamic way of setting the extension at the moment.

@mvdbeek

Member

mvdbeek commented Sep 7, 2017

(Also, Travis complains about the whitespace around the parens.)

@mvdbeek

Member

mvdbeek commented Sep 7, 2017

@galaxybot test this

@jvolkening

Contributor

jvolkening commented Sep 7, 2017

@mvdbeek Thanks for the comments. I will implement your suggestions, fix the formatting, and re-test.

@nsoranzo nsoranzo modified the milestones: 18.01, 17.09 Sep 7, 2017

@jvolkening

Contributor

jvolkening commented Sep 7, 2017

@mvdbeek I implemented your optimizations. Actually, I took the sniffing one step further: rather than slurping all filenames, I just iterate over them and return at the first .fast5 file seen, which is very fast. There are still possible performance issues if anyone adds another datatype involving tarballs of many files which is sniffed first, since this sniffer will look through the whole tarball and never find any .fast5 files, but that is a future issue to be addressed.

@jvolkening

Contributor

jvolkening commented Sep 7, 2017

One option would be to have subclasses for each of the compression types that append the extension

I tried to implement this -- please review the changes. However, it does not seem to be working as expected. If I manually call the sniff() method of the various subclasses on different tarballs from within a shell, they return the expected values. However, if I fire up Galaxy and try to upload a fast5.tar.gz file, for instance, it is detected as a plain fast5.tar and given the wrong extension. I suspect that this has to do with the order the sniffers are called in, but I tried to modify datatypes_conf.xml as per the documentation (also in this PR).

@nsoranzo

Member

nsoranzo commented Sep 7, 2017

@jvolkening Does the archive potentially contain non-fast5 files? If not, you can just return False immediately.

@mvdbeek

Member

mvdbeek commented Sep 7, 2017

Could you include a small example of each datatype?

@mvdbeek

Member

mvdbeek commented Sep 7, 2017

However, if I fire up Galaxy and try to upload a fast5.tar.gz file, for instance, it is detected as a plain fast5.tar

Is that with the datatype set to auto? I'm afraid that if you don't set the compressed variant of this, we're just going to unpack it, but I'm not sure that's the problem.

@jvolkening

Contributor

jvolkening commented Sep 7, 2017

Does the archive potentially contain non-fast5 files? If not, you can just return False immediately.

It can contain any number and depth of directories, which I believe would be listed by Python's tarfile.

@jvolkening

Contributor

jvolkening commented Sep 7, 2017

Is that with the datatype set to auto? I'm afraid that if you don't set the compressed variant of this, we're just going to unpack it, but I'm not sure that's the problem.

Yes, this behavior is with auto-detect. The subclasses just check the first few bytes for gzip or bzip2 magic numbers and then pass the rest of the sniffing off to the parent class. If the file is being decompressed before the sniffer sees it, that would explain the problem.
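A minimal stand-alone sketch of that subclass arrangement (class names mirror the discussion, but the base class here is a stub for illustration, not Galaxy's actual datatype API):

```python
import tarfile


class Fast5Archive:
    """Stand-in for the base datatype; Galaxy's real base class differs."""

    def sniff(self, filename):
        # Let the first regular file in the tarball decide.
        try:
            with tarfile.open(filename, 'r') as temptar:
                for member in temptar:
                    if member.isfile():
                        return member.name.endswith('.fast5')
        except tarfile.ReadError:
            return False
        return False


class _CompressedFast5Archive(Fast5Archive):
    magic = b''  # overridden by subclasses

    def sniff(self, filename):
        # Check the leading magic bytes, then defer to the parent's
        # tar-based sniff.
        with open(filename, 'rb') as fh:
            if fh.read(len(self.magic)) != self.magic:
                return False
        return super().sniff(filename)


class Fast5ArchiveGz(_CompressedFast5Archive):
    magic = b'\x1f\x8b'  # gzip magic bytes


class Fast5ArchiveBz2(_CompressedFast5Archive):
    magic = b'BZh'  # bzip2 magic bytes
```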

@jvolkening

Contributor

jvolkening commented Sep 7, 2017

Could you include a small example of each datatype?

Done.

@jvolkening

Contributor

jvolkening commented Sep 7, 2017

If the subclassing is too much of a hassle, I will just go back to the previous behavior, which was working fine with my tools. The only real issue here is the naming of downloaded files, which is a fairly minor annoyance.

@jvolkening

Contributor

jvolkening commented Sep 7, 2017

Does the archive potentially contain non-fast5 files? If not, you can just return False immediately.

@nsoranzo Here is my solution stemming from your comment:

for c in temptar:
    if not c.isfile():
        continue
    if c.name.endswith('.fast5'):
        return True
    else:
        return False

I believe this should short-circuit quickly in all situations.
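For reference, the loop above as a self-contained function runnable outside Galaxy (the function name is illustrative):

```python
import tarfile


def sniff_fast5_archive(filename):
    """Short-circuit sniff: the first regular file in the tarball
    decides the result, so directories are skipped and at most one
    file entry is ever examined."""
    if not tarfile.is_tarfile(filename):
        return False
    with tarfile.open(filename, 'r') as temptar:
        for c in temptar:
            if not c.isfile():
                continue
            if c.name.endswith('.fast5'):
                return True
            else:
                return False
    return False
```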

@nsoranzo

Member

nsoranzo commented Sep 7, 2017

@jvolkening That looks good to me.

@jvolkening

Contributor

jvolkening commented Sep 8, 2017

if I fire up Galaxy and try to upload a fast5.tar.gz file, for instance, it is detected as a plain fast5.tar and given the wrong extension. I suspect that this has to do with the order the sniffers are called in, but I tried to modify datatypes_conf.xml as per the documentation (also in this PR).

I believe this is fixed -- it appears the sniff order depends on the order in which register_sniffable_binary_format() is called, rather than the order given in datatypes_conf.xml.

After further testing and cleaning up the lint error from Travis CI, I will push a (hopefully final) commit.

@jvolkening

Contributor

jvolkening commented Sep 8, 2017

we could add a doctest as in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/sequence.py#L603

@mvdbeek Is there a way to run these tests locally before I push another commit?

Edit: nevermind, I figured out nosetests --with-doctest

jvolkening added some commits Sep 8, 2017

@mvdbeek

Member

mvdbeek commented Sep 11, 2017

Edit: nevermind, I figured out nosetests --with-doctest

Right, you can also use tox -e p27-unit, which will set up a clean new environment.

@mvdbeek

Member

mvdbeek commented Sep 11, 2017

@galaxybot test this

@mvdbeek

Member

mvdbeek commented Sep 12, 2017

@nsoranzo is this OK for you as well?

@nsoranzo

Member

nsoranzo commented Sep 12, 2017

Looks good to me.

@mvdbeek mvdbeek merged commit 36a3a6b into galaxyproject:dev Sep 12, 2017

6 checks passed

api test: Build finished. 290 tests run, 4 skipped, 0 failed.
continuous-integration/travis-ci/pr: The Travis CI build passed.
framework test: Build finished. 161 tests run, 0 skipped, 0 failed.
integration test: Build finished. 45 tests run, 0 skipped, 0 failed.
lgtm analysis: JavaScript: No alert changes.
toolshed test: Build finished. 579 tests run, 0 skipped, 0 failed.
@mvdbeek

Member

mvdbeek commented Sep 12, 2017

Thank you very much @jvolkening, this is wonderful. Looking forward to nanopore tools in galaxy!
