
RFC: File compression using squashfs in galaxy #2535

Closed

Conversation

@yhoogstrate (Member) commented Jun 24, 2016


Many data formats used in the field of bioinformatics are ASCII- or text-based, resulting in data files that exceed gigabytes. In particular, the FASTA and FASTQ formats produce files that are unnecessarily large, although using binary equivalents would complicate their usage. To implement a compression/decompression system for data files in Galaxy, I think the following features/requirements would be desirable:

  • Because some file formats, like BAM, are on the other hand already very compact, compression should be optional per file type (fasta/fastq: yes, bam/bcf: no), configurable in galaxy.ini, and disabled by default (a hypothetical sketch follows this list)
  • For usability and reproducibility, tools should not be aware of compression and should not need different arguments (e.g. --extract=zcat) to make use of such functionality
  • No compression AND decompression every time a tool runs (disk I/O overhead)
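
A minimal sketch of how such a per-datatype switch might be read from galaxy.ini; the option names and config attributes below are purely illustrative, not part of this PR:

```python
# Hypothetical galaxy.ini entries (illustrative names, not part of this PR):
#   enable_squashfs_compression = False
#   squashfs_compress_datatypes = fasta,fastq

def squashfs_enabled_for(config, extension):
    """Return True only if compression is enabled globally and the
    datatype's extension is on the configured whitelist."""
    if not getattr(config, "enable_squashfs_compression", False):
        return False
    allowed = getattr(config, "squashfs_compress_datatypes", "")
    return extension in [ext.strip() for ext in allowed.split(",")]
```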

My proposed solution is to wrap each to-be-compressed dataset in a virtual container using squashfs. This is similar to the approach used by snappy/snapd, the new Ubuntu packaging system. Squashfs creates a read-only filesystem while compressing the data; decompression happens through a virtual mount that does not copy the data back to disk but only keeps it in RAM. The current PR is just a working prototype and a work in progress, but it should give a direction in which I would like to continue. Before I spend too much time on it only to find it isn't an ideal solution after all, I would like to receive some feedback first :).

Implementation

The way the submitted code works (a condensed sketch in Python follows the list):

  • After a fastq file's metadata has been set, the following command is executed: mksquashfs dataset_001.dat dataset_001.dat.img -b 1048576 -comp xz -Xdict-size 100%
  • This creates a squashfs archive (dataset_001.dat -> dataset_001.dat.img).
  • The original file is temporarily backed up: mv dataset_001.dat dataset_001.dat.bak
  • A mount point (a directory) is created: mkdir dataset_001.dat.mnt
  • The archive is mounted there: squashfuse dataset_001.dat.img dataset_001.dat.mnt
  • The original file is then replaced by a symlink to the mounted file: ln -s dataset_001.dat.mnt/dataset_001.dat dataset_001.dat
  • If the symlink is valid, the backup is removed (rm dataset_001.dat.bak); otherwise the change is reverted.
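
For clarity, here is that sequence as a condensed, self-contained Python sketch. It assumes mksquashfs and squashfuse are on the PATH; the function name and error handling are illustrative, not the PR's actual code:

```python
import os
import subprocess

def compress_and_mount(dataset):  # e.g. "dataset_001.dat"
    img, bak, mnt = dataset + ".img", dataset + ".bak", dataset + ".mnt"

    # 1. Build a compressed, read-only squashfs image of the dataset.
    subprocess.check_call([
        "mksquashfs", dataset, img,
        "-b", "1048576", "-comp", "xz", "-Xdict-size", "100%",
    ])

    # 2. Back up the original file and mount the image via squashfuse.
    os.rename(dataset, bak)
    os.mkdir(mnt)
    subprocess.check_call(["squashfuse", img, mnt])

    # 3. Replace the original path with a symlink into the mount.
    os.symlink(os.path.join(mnt, os.path.basename(dataset)), dataset)

    # 4. os.path.exists() follows symlinks: if the link resolves, drop the
    #    backup; otherwise remove the dangling link and revert.
    if os.path.exists(dataset):
        os.remove(bak)
    else:
        os.remove(dataset)
        os.rename(bak, dataset)
```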

TODOs

  • Currently the compress function lives in the Fastq datatype object. It should be moved into a higher-level datatype object, preferably the highest, so that all datatypes can make use of it.
  • After a reboot of the machine, every squashfs archive is unmounted. All files should be (re-)mounted either at Galaxy startup or, preferably, when the file is requested (a hypothetical startup hook is sketched below).
  • Dependencies (only tested on Linux): the current implementation requires the tools mksquashfs and squashfuse.
  • In the destructor of Galaxy, or in the destructor of a tool, all mounted archives should be closed/unmounted.
  • Benchmark performance and compression ratio, and figure out optimal compression settings.
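
As a rough illustration of the second TODO item, a startup hook could walk the datasets directory and re-mount any archive whose mount point is empty; the directory layout and function name here are assumptions, not code from this PR:

```python
import os
import subprocess

def remount_squashfs_images(files_dir):
    """Re-mount every squashfs archive under files_dir (e.g. after a reboot)."""
    for name in os.listdir(files_dir):
        if not name.endswith(".img"):
            continue
        img = os.path.join(files_dir, name)
        mnt = img[: -len(".img")] + ".mnt"  # dataset_001.dat.img -> dataset_001.dat.mnt
        if not os.path.isdir(mnt):
            os.mkdir(mnt)
        if not os.listdir(mnt):  # an empty mount point means the image is not mounted
            subprocess.check_call(["squashfuse", img, mnt])
```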

@galaxybot galaxybot added this to the 16.07 milestone Jun 24, 2016
@nsoranzo (Member)

Interesting, I was discussing this problem today with @frederikcoppens! Thanks for the contribution!

@bgruening bgruening changed the title File compression using squashfs in galaxy RFC: File compression using squashfs in galaxy Jun 24, 2016
@frederikcoppens (Member) commented Jun 24, 2016

Interesting approach. The downside is that linking data won't work, causing duplication (depending on your setup), but this is at least better than having to decompress it.

@jmchilton (Member)

Keep in mind that you may not be able to write a file right next to that path. With something like the S3 object store, I don't think this approach would work as currently implemented. Still, it is an interesting idea and approach.

I'm going to mark this PR as a work in progress.

@jmchilton jmchilton removed this from the 16.07 milestone Jul 21, 2016
@yhoogstrate yhoogstrate closed this Jan 2, 2017