RFC: File compression using squashfs in galaxy #2535
Closed
File compression using squashfs in galaxy
Many data formats used in bioinformatics are ASCII-based or otherwise text-based, resulting in data files that exceed gigabytes. In particular, FASTA and FASTQ files are unnecessarily large, although using binary equivalents would complicate their usage. To implement a compression/decompression system for data files in Galaxy, I think the following features/requirements would be desirable:
My proposed solution is to make a virtual container for each to-be-compressed dataset, using squashfs. Using squashfs containers for compression is similar to the approach used by snappy/snapd, the new Ubuntu packaging system. Squashfs creates a read-only filesystem while compressing the data. Decompression is a virtual mount that does not copy all data to disk, but only keeps it in RAM. The current PR is just a working prototype and a work in progress, but it should indicate the direction in which I would like to continue. Before I spend too much time on something that may not be an ideal solution after all, I would like to receive some feedback first :).
Implementation
The way the submitted code works:
mksquashfs dataset_001.dat dataset_001.dat.img -b 1048576 -comp xz -Xdict-size 100% (dataset_001.dat -> dataset_001.dat.img).
mv dataset_001.dat dataset_001.dat.bak
mkdir dataset_001.dat.mnt
squashfuse dataset_001.dat.img dataset_001.dat.mnt
ln -s dataset_001.dat.mnt/dataset_001.dat dataset_001.dat
rm dataset_001.dat.bak, otherwise it will be reverted.
TODOs
mksquashfs and squashfuse
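The compress-and-mount steps listed under Implementation can be sketched as a single shell function. This is only an illustration under stated assumptions, not the code in the PR: `squash_dataset`, `run`, and the `DRY_RUN` switch are hypothetical names, the function assumes it is called from the directory that contains the dataset and is given a bare file name, and the revert branch is one possible reading of "otherwise it will be reverted".

```shell
#!/bin/sh
# Sketch of the per-dataset steps; all function and variable names are hypothetical.

# Run a command, or only print it when DRY_RUN=1 (illustrative switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumes $1 is a bare file name in the current directory, e.g. dataset_001.dat.
squash_dataset() {
    dataset=$1
    image="$dataset.img"
    mountpoint="$dataset.mnt"
    backup="$dataset.bak"

    # Compress the dataset into a read-only squashfs image (flags as in the PR text).
    run mksquashfs "$dataset" "$image" -b 1048576 -comp xz -Xdict-size 100% || return 1

    # Keep the original as a backup until the mounted copy is in place.
    run mv "$dataset" "$backup" || return 1

    # Mount the image and expose the file under its original name via a symlink.
    if run mkdir "$mountpoint" &&
       run squashfuse "$image" "$mountpoint" &&
       run ln -s "$mountpoint/$dataset" "$dataset"; then
        # Everything worked: the uncompressed original is no longer needed.
        run rm "$backup"
    else
        # Assumed revert: restore the original file. Unmounting and removing
        # the mount point are left out of this sketch.
        run mv "$backup" "$dataset"
        return 1
    fi
}
```

Calling `squash_dataset dataset_001.dat` with `DRY_RUN=1` set prints the six commands in order without touching any files, which is a cheap way to check the sequence before running it on real data.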