Fq_delta - Efficient storage of processed versions of fastq files.

Fq_delta is a Python module and set of command-line scripts for storing processed versions of fastq files generated by DNA and RNA sequencing technologies. Using Myers' diff algorithm, which allows per-character comparison of two strings, it generates a delta file from the original and the edited fastq file. Combined with the original file, a delta file can be used to fully reconstruct the processed file, yet it is only a fraction (0.1 – 2%) of the original size. Depending on the number of processing steps, using this module can significantly reduce the storage required for processing sequence data.
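
The idea behind a delta file can be illustrated with a minimal sketch. This is not the fq_delta implementation: it uses `difflib` from the Python standard library rather than Myers' algorithm from diff_match_patch, and the names `make_delta` and `rebuild_text` are hypothetical. Unchanged stretches are stored as index ranges into the original text, so only the edited characters need to be kept literally.

```python
import difflib


def make_delta(original, processed):
    """Return a list of operations that turn `original` into `processed`.

    Hypothetical helper for illustration only; fq_delta itself relies on
    diff_match_patch. Unchanged runs are stored as index ranges into
    `original`, and only new text is stored literally.
    """
    matcher = difflib.SequenceMatcher(None, original, processed, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged text: remember where it lives in the original.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New text: this is the only part stored verbatim.
            delta.append(("add", processed[j1:j2]))
        # "delete" needs no entry: the deleted span is simply never copied.
    return delta


def rebuild_text(original, delta):
    """Reconstruct the processed text from the original and a delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(original[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


# A single fastq record before and after a made-up trimming step.
original_record = "@read1\nACGTNNACGT\n+\nIIII##IIII\n"
processed_record = "@read1\nACGT\n+\nIIII\n"

record_delta = make_delta(original_record, processed_record)
restored_record = rebuild_text(original_record, record_delta)
```

Here `restored_record` equals `processed_record`, while `record_delta` consists mostly of small index pairs. This is why a delta can be much smaller than the processed file it describes.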

A technical note on fq_delta has been published in EMBnet.journal.

Installation

First, clone the git repository:

git clone https://github.com/averaart/fq_delta.git

Enter the fq_delta repository directory:

cd fq_delta

Use python setup.py to install the module and scripts:

sudo python setup.py install

Two packages are installed in the site-packages folder of your current Python installation: fq_delta and diff_match_patch. The former depends on the latter.

Two scripts are installed in your /usr/local/bin/ or equivalent folder: delta and rebuild. Both scripts can be called with the option -h to display their options.

Examples

To create a delta file from two fastq files:

delta original.fastq processed.fastq

This produces a delta file named processed.delta.zip.

Rebuilding the processed file to standard out works as follows:

rebuild original.fastq processed.delta.zip

To rebuild to a fastq file, add the new file name as a third argument:

rebuild original.fastq processed.delta.zip rebuilt_processed.fastq
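
Before deleting a processed file in favour of its delta, it may be worth confirming that the rebuilt file is byte-for-byte identical to it. The following is a minimal sketch of such a check, not part of fq_delta: the helper names `file_sha256` and `files_identical` are hypothetical, and the file names are the ones used in the examples above.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks.

    Reading in chunks keeps memory use low for large fastq files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """Return True if both files have the same SHA-256 digest."""
    return file_sha256(path_a) == file_sha256(path_b)
```

For example, `files_identical("processed.fastq", "rebuilt_processed.fastq")` should return `True` after a successful rebuild.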

Both delta and rebuild can read from standard input and write to standard output, which allows several processing steps to be chained:

cat -e sample.fastq | \
sed  's/M\-\^A//' | \
delta sample.fastq sample.step1 -si 2 -so | \
sed 's/1{//' | \
delta sample.fastq sample.step2 -si 2 -so | \
sed 's/1\^K\^B//' | \
delta sample.fastq sample.step3 -si 2 -so | \
sed 's/\$//' | \
delta sample.fastq sample.step4 -si 2 -so > sample.processed.fastq