FASTQ datatype enhancements #2

bgruening · 2016-06-24T19:16:06Z

As discussed with @frederikcoppens, it would be nice to upload compressed FASTQ files and to handle them properly in tools.

One way to do this is to enhance the existing FASTQ datatype with a new metadata element that indicates whether the file is compressed. Tools should then be able to recognize this metadata element. GSNAP is one of the tools that would benefit from it.

You can learn about:

datatypes
metadata
tools & metadata

Skills:

python

bwlang · 2016-06-25T14:34:23Z

great idea - could be a big speed improvement for slow IO environments

timothom · 2016-06-25T14:41:11Z

So are you talking about uploading and storing/managing the newly uploaded data as compressed data? Or just uploading a compressed fastq file and then extracting it server-side once the upload is complete?

Or do you want the data to remain compressed in storage on the server?

apetkau · 2016-06-25T15:11:36Z

This is a good idea. @timothom I think Galaxy already decompresses fastq files that are uploaded, so I assume this would involve storing as compressed data? Would this involve disabling the decompression of uploaded fastq files?

It would be nice for us in our lab to be able to operate directly on compressed fastq files though.

dpryan79 · 2016-06-25T15:16:44Z

I'd love it if even linking in compressed files worked seamlessly. That wouldn't require mucking with the upload tool to not autodecompress stuff.

abretaud · 2016-06-25T15:17:55Z

This would make several French Galaxy admins very happy!
We have a kind of patch/hack that works on our instances, but I'm sure there would be a more elegant way to do it: https://www.e-biogenouest.org/wiki/ManArchiveGalaxy

mvdbeek · 2016-06-25T15:22:45Z

There has also been some work by @yhoogstrate in galaxyproject#2535 with a different approach.

pvanheus · 2016-06-28T18:50:33Z

@mvdbeek: @ashvark and I reviewed @yhoogstrate's work and it requires this squashfs thing to be installed all over the cluster, no?

pvanheus · 2016-06-28T18:54:18Z

@ashvark has started some work on a compressed Fastq type, see #38. @bgruening: how would this work with tools that do not support compressed fastq? And how would compressing existing datasets work - would set_meta() compress / decompress if that key changed?

Finally, see the issue @frederikcoppens mentioned on #38 - something to look out for.

frederikcoppens · 2016-06-28T18:58:10Z

@pvanheus With a new compressed fastq datatype, this would require updating the wrappers to also allow this datatype I assume? Then tools that do not support it require a conversion to use it as input.
Would adding a "convert" tool to uncompress (and compress) be an option?
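As a rough sketch of what such a converter pair could do (the function names are hypothetical and this is not Galaxy code), streaming the file in fixed-size chunks keeps memory use flat regardless of FASTQ size, although the full file still has to be read and written once:

```python
import gzip
import shutil

# Hypothetical chunk size: copy 1 MiB at a time so memory use stays constant.
CHUNK_SIZE = 1024 * 1024


def compress_fastq(src_path, dst_path):
    """Write a gzip-compressed copy of src_path to dst_path."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, CHUNK_SIZE)


def decompress_fastq(src_path, dst_path):
    """Write an uncompressed copy of the gzipped src_path to dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, CHUNK_SIZE)
```

The cost that remains is disk space and I/O time for the second copy, which is the real concern for large datasets.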

ashvark · 2016-06-28T22:17:11Z

Yes. I am planning to try adding converters, but I am afraid that would not be a good idea for larger fastq files.

bgruening · 2016-06-29T04:01:12Z

Why a new format? Just annotate the old format and convert tools that do not support compressed fastq to react to the metadata. This should be compatible and doable without much effort. I'm assuming here that most of the tools already have native support for gzipped fastq.

pvanheus · 2016-06-29T18:45:53Z

@bgruening because metadata is per-user not per-dataset. However, how about we make a new type: uncompressed fastq. So Fastq is compressed fastq. I'm just thinking of a way to convert existing datasets... @natefoo also pointed out to me that the correct way to handle tools that depend on .gz extension is that at job run time the dataset is linked in with the extension as per datatypes_conf.xml.
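To illustrate the linking idea only (a minimal sketch, not how Galaxy implements it; `link_with_extension` is a hypothetical helper, and the extension is assumed to have been looked up from datatypes_conf.xml already), the dataset file can be symlinked into the job directory under a name that carries the expected extension:

```python
import os


def link_with_extension(dataset_path, work_dir, extension):
    """Symlink dataset_path into work_dir under a name ending in .<extension>.

    `extension` is assumed to come from the datatype configuration,
    for example "fastq.gz". Returns the path of the new link.
    """
    base = os.path.splitext(os.path.basename(dataset_path))[0]
    link_path = os.path.join(work_dir, base + "." + extension)
    os.symlink(os.path.abspath(dataset_path), link_path)
    return link_path
```

A tool that insists on a `.gz` suffix would then be handed the link path instead of the underlying `.dat` file, and no data is copied.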

ashvark · 2016-06-29T21:47:41Z

@bgruening and @pvanheus: I have created a separate branch (https://github.com/ashvark/galaxy/tree/fastq_enhancements) in my repository that enhances the fastq datatype to handle gzipped fastq files as such. I have tested this only with simple test cases. Below is an explanation of the changes:

added metadata element 'is_gzipped' for the Fastq datatype in the file datatypes/Sequence.py
modified the get_headers() method in datatypes/sniff.py to handle gzipped files
added a condition in upload.py to avoid the decompression of gzipped fastq files during upload
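For readers who want the general idea behind the first two changes, here is a standalone sketch; it is not the code in the branch, and `is_gzipped()` and `read_first_lines()` are hypothetical helpers. It assumes gzip data can be recognized by its two-byte magic number and then picks the matching opener to read the first record:

```python
import gzip
from itertools import islice

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at path starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def read_first_lines(path, count=4):
    """Return the first `count` lines of a plain or gzipped text file."""
    opener = gzip.open if is_gzipped(path) else open
    with opener(path, "rt") as fh:
        return [line.rstrip("\n") for line in islice(fh, count)]
```

With `count=4` this returns exactly one FASTQ record, which is enough for a sniffer to check the `@` and `+` lines without decompressing the whole file.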

TO DO

test with various scenarios to make sure it does not disturb any other functionality

I would welcome your suggestions and improvements.

yhoogstrate · 2016-06-30T10:52:47Z

+ref: galaxyproject/tools-iuc#354

zipho · 2016-07-28T10:55:53Z

@yhoogstrate that pull request remains open, and it seems no further development has been done on it.

Another discussion is here: #38

Perhaps we should combine efforts around this.

@ashvark I briefly tested your changes locally and they worked OK.

The other issue is the file/dataset extension, which tools sometimes use to determine the format of the file. Is there any reason why Galaxy forces the .dat extension? I know it would be a big change, but can files be stored and tracked with their original extension in Galaxy?

martenson mentioned this issue Jun 25, 2016

Compression of existing FASTQ datasets #15

Open

mvdbeek pushed a commit that referenced this issue Jul 22, 2016

Merge pull request #2 from galaxyproject/dev

d123b40

zipho mentioned this issue Jul 28, 2016

allow gzipped fastq files galaxyproject/tools-iuc#354

Closed
