
RFC: File compression using squashfs in galaxy #2535

Closed

Conversation

@yhoogstrate (Member) commented Jun 24, 2016


Many data formats used in the field of bioinformatics are ASCII- or text-based, resulting in data files that exceed gigabytes. In particular, the FASTA and FASTQ formats produce files that are unnecessarily large, although using binary equivalents would complicate their usage. To implement a compression/decompression system for data files in Galaxy, I think the following features/requirements would be desirable:

  • Because some file formats, like BAM, are on the other hand already very compact, compression should be optional per file type (fasta/fastq: yes, bam/bcf: no), configurable in galaxy.ini, and disabled by default (a hypothetical sketch follows this list)
  • For usability and reproducibility, tools should not be aware of compression and should not need different arguments (e.g. --extract=zcat) to make use of such functionality
  • No compression AND decompression every time a tool runs (disk I/O overhead)
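
A minimal sketch of how such a per-datatype switch might be read from galaxy.ini; the option names and config attributes below are purely illustrative, not part of this PR:

```python
# Hypothetical galaxy.ini entries (illustrative names, not part of this PR):
#   enable_squashfs_compression = False
#   squashfs_compress_datatypes = fasta,fastq

def squashfs_enabled_for(config, extension):
    """Return True only if compression is enabled globally and the
    datatype's extension is on the configured whitelist."""
    if not getattr(config, "enable_squashfs_compression", False):
        return False
    allowed = getattr(config, "squashfs_compress_datatypes", "")
    return extension in [ext.strip() for ext in allowed.split(",")]
```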

My proposed solution is to wrap each to-be-compressed dataset in a virtual container using squashfs. This is similar to the approach used by snappy/snapd, the new Ubuntu packaging system. Squashfs creates a read-only filesystem while compressing the data; decompression happens through a virtual mount that does not copy the data back to disk but only keeps it in RAM. The current PR is just a working prototype and a work in progress, but it should give a direction in which I would like to continue. Before I spend too much time on it only to find it isn't an ideal solution after all, I would like to receive some feedback first :).

Implementation

The way the submitted code works (a condensed sketch in Python follows the list):

  • After a fastq file's metadata has been set, the following command is executed: mksquashfs dataset_001.dat dataset_001.dat.img -b 1048576 -comp xz -Xdict-size 100%
  • This creates a squashfs archive (dataset_001.dat -> dataset_001.dat.img).
  • The original file is temporarily backed up: mv dataset_001.dat dataset_001.dat.bak
  • A mount point (a directory) is created: mkdir dataset_001.dat.mnt
  • The archive is mounted there: squashfuse dataset_001.dat.img dataset_001.dat.mnt
  • The original file is then replaced by a symlink to the mounted file: ln -s dataset_001.dat.mnt/dataset_001.dat dataset_001.dat
  • If the symlink is valid, the backup is removed (rm dataset_001.dat.bak); otherwise the change is reverted.
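
For clarity, here is that sequence as a condensed, self-contained Python sketch. It assumes mksquashfs and squashfuse are on the PATH; the function name and error handling are illustrative, not the PR's actual code:

```python
import os
import subprocess

def compress_and_mount(dataset):  # e.g. "dataset_001.dat"
    img, bak, mnt = dataset + ".img", dataset + ".bak", dataset + ".mnt"

    # 1. Build a compressed, read-only squashfs image of the dataset.
    subprocess.check_call([
        "mksquashfs", dataset, img,
        "-b", "1048576", "-comp", "xz", "-Xdict-size", "100%",
    ])

    # 2. Back up the original file and mount the image via squashfuse.
    os.rename(dataset, bak)
    os.mkdir(mnt)
    subprocess.check_call(["squashfuse", img, mnt])

    # 3. Replace the original path with a symlink into the mount.
    os.symlink(os.path.join(mnt, os.path.basename(dataset)), dataset)

    # 4. os.path.exists() follows symlinks: if the link resolves, drop the
    #    backup; otherwise remove the dangling link and revert.
    if os.path.exists(dataset):
        os.remove(bak)
    else:
        os.remove(dataset)
        os.rename(bak, dataset)
```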

TODOs

  • Currently the compress function lives in the Fastq datatype object. It should be moved into a higher-level datatype object, preferably the highest, so that all datatypes can make use of it.
  • After a reboot of the machine, every squashfs archive is unmounted. All files should be (re-)mounted either at Galaxy startup or, preferably, when the file is requested (a hypothetical startup hook is sketched below).
  • Dependencies (only tested on Linux): the current implementation requires the tools mksquashfs and squashfuse.
  • In the destructor of Galaxy, or in the destructor of a tool, all mounted archives should be closed/unmounted.
  • Benchmark performance and compression ratio, and figure out optimal compression settings.
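
As a rough illustration of the second TODO item, a startup hook could walk the datasets directory and re-mount any archive whose mount point is empty; the directory layout and function name here are assumptions, not code from this PR:

```python
import os
import subprocess

def remount_squashfs_images(files_dir):
    """Re-mount every squashfs archive under files_dir (e.g. after a reboot)."""
    for name in os.listdir(files_dir):
        if not name.endswith(".img"):
            continue
        img = os.path.join(files_dir, name)
        mnt = img[: -len(".img")] + ".mnt"  # dataset_001.dat.img -> dataset_001.dat.mnt
        if not os.path.isdir(mnt):
            os.mkdir(mnt)
        if not os.listdir(mnt):  # an empty mount point means the image is not mounted
            subprocess.check_call(["squashfuse", img, mnt])
```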

@galaxybot galaxybot added this to the 16.07 milestone Jun 24, 2016
@nsoranzo (Member)

Interesting, I was discussing this problem today with @frederikcoppens! Thanks for the contribution!

@bgruening bgruening changed the title File compression using squashfs in galaxy RFC: File compression using squashfs in galaxy Jun 24, 2016
@frederikcoppens (Member) commented Jun 24, 2016

Interesting approach. The downside is that linking data won't work, causing duplication (depending on your setup), but this is at least better than having to decompress it.

@jmchilton (Member)

Keep in mind that you may not be able to write a file right next to that path. With something like the S3 object store, I don't think this approach would work as currently implemented. Still, it is an interesting idea and approach.

I'm going to mark this PR as a work in progress.

@jmchilton jmchilton removed this from the 16.07 milestone Jul 21, 2016
@yhoogstrate yhoogstrate closed this Jan 2, 2017