Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for gzipped fastq formats #3145

Merged
merged 15 commits into from Jan 12, 2017

Conversation

Projects
None yet
9 participants
@mvdbeek
Copy link
Member

commented Nov 10, 2016

This is a collection of work from @ashvark, @pvanheus and perhaps others that I'm not aware of, and I think this has been implemented one way or the other by different galaxy admins (I think @hrhotz mentioned this, and various admins in france) and I think it's time we get one implementation into galaxy so that we can implement and rely on compressed fastq handling in tools.

The approach is that for each fastq variant we have the corresponding gzip compressed format, and for each compressed variant we have one converter (we may be able to compact this into one).

With the current implementation, users can upload a gzip-compressed fastq file, which will be properly sniffed as fastq.gz and will not be uncompressed. Tools that can deal with compressed fastq files can allow fastq.gz as input format, in which case the compressed fastq file can be used without decompression.
If the tool only allows uncompressed fastq, the file will go through the dataset converter.
By selecting on of the uncompressed fastq formats at upload, the files will still be unpacked.

I think this is not a big departure from the current user experience, and should help scale Galaxy in terms of storage use. I'm happy to make changes, if we can agree on one direction.

@pvanheus

This comment has been minimized.

Copy link
Contributor

commented Nov 10, 2016

This is, in my mind, a better approach than have a "compressed" metadata flag. @zipho has done work at SANBI on supporting native compressed fastq. This can probably be contributed towards this proposed design.

@bgruening

This comment has been minimized.

Copy link
Member

commented Nov 10, 2016

@mvdbeek a tool test that triggers the automatic conversion would be neat.

@zipho

This comment has been minimized.

Copy link
Contributor

commented Nov 11, 2016

@mvdbeek We have for the most part, locally at SANBI (UWC) implemented similar functionality by creating a (fastq) Gz datatype and for our purposes, made also some changes in the upload.py (tools/data_source) to handle uploading of compressed fastq files.

One issue though has been that some tools evaluate the extension of a file (.gz) to determine if that file is compressed or not and therefore storing datasets internally as .dat proved to present some challenges for certain tools.

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Nov 11, 2016

Hi @zipho

made also some changes in the upload.py (tools/data_source) to handle uploading of compressed fastq files.

That's an implementation detail I have solved by inheriting from Binary and registering fastq.gz as a compressed datatype, which is enough for fastq.gz to not be uncompressed.

One issue though has been that some tools evaluate the extension of a file (.gz) to determine if that file is compressed or not and therefore storing datasets internally as .dat proved to present some challenges for certain tools.

There is quite a number of tools that require a specific extension, but I think current iuc practice is to use symlinks in the tools command attribute to make these tools happy. Having a separate fastq.gz allows to ask whether a file is compressed and to adjust the command section accordingly.

In a future step I think galaxy could default to linking in files based on their datatypes extension, so that this becomes unnecessary, but that would be a bigger change, I think.
I have also come across tools that even need a specific filename (sigh), so I think some symlinking from the tool author side may be necessary even if we are going that direction.

@zipho

This comment has been minimized.

Copy link
Contributor

commented Nov 11, 2016

@mvdbeek I agree.

There is quite a number of tools that require a specific extension, but I think current iuc practice is to use symlinks in the tools command attribute to make these tools happy. Having a separate fastq.gz allows to ask whether a file is compressed and to adjust the command section accordingly.

We ended up following the same practise as iuc on many of our tools. The winner certainly would be for galaxy to default to linking files based on extension.

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Nov 11, 2016

@mvdbeek a tool test that triggers the automatic conversion would be neat.

Thanks for asking for this, I discovered that while fastqsanger.gz is automatically converted to fastqsanger, fastq.gz is also valid fastq and is therefor not converted. I guess I need to create a BaseFastq class from which Fastq and FastqGz can inherit.

Create a BaseFastq class and add test tools
Having a BaseFastq class allows Fastq and FastqGz classes to inherit from
BaseFastq. FastqGz is therfor not a Fastq datatype and goes through the
converter.

Also add 2 tools to demonstrate that compressed fastq will be converted to
uncompressed fastq if the tool specifies `format="fastq"`, while
`format="fastq.gz"` leaves files compressed.

The tool test can be run with:
```
planemo test --galaxy_root .  test/functional/tools/compressed_fastq_no_conversion.xml
planemo test --galaxy_root .  test/functional/tools/compressed_fastq_conversion.xml
```

Note that fastq conversion appears to required a galaxy user session, and
planemo testing fails. Interactive testing with `planemo serve` is not affected
and works fine.
@bgruening
Copy link
Member

left a comment

@mvdbeek I like this very much! Thanks!

@@ -0,0 +1,13 @@
<tool id="CONVERTER_fastqcssangergz_to_fastqcssanger" name="Convert fastqcssanger.gz files to fastqcssanger" version="1.0.0" hidden="true">
<command>gzip -dcf $input1 > $output1</command>

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 12, 2016

Member

Should we add a requirement for all these tools?

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Nov 12, 2016

Author Member

Hmm, even the busybox and ubuntu docker base images come with gzip, os x has a system gzip ... I doubt we could satisfy a dependency or run galaxy without gzip.
Alternatively I could write a python script to do the decompression, if that's a concern.

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 12, 2016

Member

Ok fair enough, I was more coming from the best-practice side, where I think all tools should be annotated even the python tools.

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Nov 12, 2016

Author Member

I understand, and normally I agree, but it's not so easy to get a requirement for gzip.

@@ -0,0 +1,13 @@
<tool id="CONVERTER_fastqgz_to_fastq" name="Convert fastq.gz files to fastq" version="1.0.0" hidden="true">
<command>gzip -dcf $input1 > $output1</command>

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 12, 2016

Member

quoting all paths

if compressed:
in_file = gzip.GzipFile(fname, 'r')
else:
in_file = open(fname, 'rt')

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 12, 2016

Member

Do we need to close this file handle?

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Nov 12, 2016

Author Member

quickly scanning through the datatypes, some functions do and some don't. I'll add this.

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Nov 12, 2016

As a side note, there are many more data conversion tools that use unquoted paths, those should probably all be converted to single quoted paths.

<tool id="CONVERTER_fastqcssangergz_to_fastqcssanger" name="Convert fastqcssanger.gz files to fastqcssanger" version="1.0.0" hidden="true">
<command>gzip -dcf '$input1' > '$output1'</command>
<inputs>
<page>

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 12, 2016

Member

One more thing, why does this needs to have the <page> tag? It is deprecated isn't it?
ping @blankenberg

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Nov 12, 2016

Author Member

I had just taken this from the previous commits without further checking.
Conversion works fine without that tag.

This comment has been minimized.

Copy link
@blankenberg

blankenberg Nov 22, 2016

Member

Yeah, don't need, and shouldn't use the page tag any more.

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Nov 12, 2016

@galaxybot test this

@mvdbeek mvdbeek changed the title RFC: Support for gzipped fastq formats Support for gzipped fastq formats Nov 14, 2016

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Nov 23, 2016

Alright, anyone else any thoughts on this? Are we ready to merge?

@abretaud

This comment has been minimized.

Copy link
Contributor

commented Nov 23, 2016

+1 for me
I wonder if doing the same for bz2 files would be useful for anyone else than me?

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Nov 23, 2016

I wonder if doing the same for bz2 files would be useful for anyone else than me?

I'm sure it would be, at least having a converter from bzip2 to fastq. I think there are less tools that natively support bz2 though.

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Nov 24, 2016

awesome, thanks a lot @abretaud! (I love how a few days became 1)
@bgruening we still have your 👍 ?

@abretaud

This comment has been minimized.

Copy link
Contributor

commented Nov 24, 2016

(I love how a few days became 1)

I needed to remove the post-it from my desk :)

abretaud added some commits Nov 24, 2016

Fix indentation and import (#2)
* fix indentation and import

* import fix

* refix indentation

* better looking imports

mvdbeek added a commit to mvdbeek/galaxy that referenced this pull request Nov 24, 2016

[WIP] Symlink files into working directories
In the context of galaxyproject#3145, it would be nice if tool authors could rely on
the datatype extension in their tool. With this commit galaxy will
symlink input files into the current working directory as
<input_variable>.<datatype_extension>, or if an extension is not
defined, simply as <input_variable>. This should prevent a lot of
cheeath code for determining e.g is a fastq file is compressed or not.

Unfortunately this doesn't work yet when iterating over repeats with
Cheetah :(.

mvdbeek added a commit to mvdbeek/galaxy that referenced this pull request Nov 25, 2016

Link datasets w/ extension into working directory
In the context of galaxyproject#3145, it would be nice if tool wrapper authors could
rely on the datatype extension for their tools. Many tools require a
specific filename extension. Currently, tool authors need to symlink the
input files with the correct extension in the tool wrapper.

With this commit galaxy will symlink input files into the current
working directory as input_<dataset_id>.<datatype_extension>, if an
extension is defined, the datatype is not composite and the tool does
not make use of parallelism.

I believe this makes writing wrappers easier for tool authors, and
should also prevent resorting to cheetah if a tool accepts different
input formats, where each format has a specific extension (e.g fastq,
fastq.gz. fastq.bz2 etc).

mvdbeek added a commit to bardin-lab/galaxy that referenced this pull request Dec 10, 2016

Link datasets w/ extension into working directory
In the context of galaxyproject#3145, it would be nice if tool wrapper authors could
rely on the datatype extension for their tools. Many tools require a
specific filename extension. Currently, tool authors need to symlink the
input files with the correct extension in the tool wrapper.

With this commit galaxy will symlink input files into the current
working directory as input_<dataset_id>.<datatype_extension>, if an
extension is defined, the datatype is not composite and the tool does
not make use of parallelism.

I believe this makes writing wrappers easier for tool authors, and
should also prevent resorting to cheetah if a tool accepts different
input formats, where each format has a specific extension (e.g fastq,
fastq.gz. fastq.bz2 etc).

mvdbeek added a commit to bardin-lab/galaxy that referenced this pull request Dec 10, 2016

Link datasets w/ extension into working directory
In the context of galaxyproject#3145, it would be nice if tool wrapper authors could
rely on the datatype extension for their tools. Many tools require a
specific filename extension. Currently, tool authors need to symlink the
input files with the correct extension in the tool wrapper.

With this commit galaxy will symlink input files into the current
working directory as input_<dataset_id>.<datatype_extension>, if an
extension is defined, the datatype is not composite and the tool does
not make use of parallelism.

I believe this makes writing wrappers easier for tool authors, and
should also prevent resorting to cheetah if a tool accepts different
input formats, where each format has a specific extension (e.g fastq,
fastq.gz. fastq.bz2 etc).
@bgruening

This comment has been minimized.

Copy link
Member

commented Jan 1, 2017

👍 from my site. It would be great if we can annotate the converters with requirement tags, but I think this can wait.

@jmchilton @dannon what do you think? Can we merge this?

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Jan 2, 2017

It would be great if we can annotate the converters with requirement tags, but I think this can wait.

Yep, I think we should do that, and actually test all the converter tools. But let's do that in another PR.

@bgruening

This comment has been minimized.

Copy link
Member

commented Jan 8, 2017

@jmchilton @dannon last call, before I merge this :)

@jmchilton

This comment has been minimized.

Copy link
Member

commented Jan 8, 2017

My silence isn't me ignoring this PR. I've looked at it several times. I really go back and forth on whether I'd like to see it merged or not. The devteam has known this is a real problem for years and years and has never done anything - it has never come up as a priority because of our particular file system on usegalaxy.org - so we should address this. I would just prefer to see it delt with at the framework level - but that would be large project. I think this cover the main use case for sequencing - I just worry about a flood of compressed types for things like proteomics where it is pretty common that tools deal with compressed files.

Please do merge @bgruening.

@jmchilton jmchilton merged commit a87feb1 into galaxyproject:dev Jan 12, 2017

4 checks passed

api test Build finished. 230 tests run, 0 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 126 tests run, 0 skipped, 0 failed.
Details
toolshed test Build finished. 580 tests run, 0 skipped, 0 failed.
Details
@jmchilton

This comment has been minimized.

Copy link
Member

commented Jan 12, 2017

Thanks @mvdbeek!

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Jan 12, 2017

Thanks @jmchilton, and everyone involved in the PR!

@peterjc

This comment has been minimized.

Copy link
Contributor

commented Jan 17, 2017

Like @jmchilton I would also have preferred a generic compression flag allowing combination with any file format, but this is a welcome practical feature which ought to please many Galaxy Admins :)

@mvdbeek

This comment has been minimized.

Copy link
Member Author

commented Jan 17, 2017

Like @jmchilton I would also have preferred a generic compression flag allowing combination with any file format, but this is a welcome practical feature which ought to please many Galaxy Admins :)

I think we're all on the same page in that regard ... I still hope we'll see this some day.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.