Skip to content
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


Filter to Unique FASTQ


It's common to filter repeated reads from a fastq or fasta file. With recent high-throughput sequencing output reaching ... hi throughput, it's no longer feasible to hold all reads in an in-memory datastructure to determine uniqueness. is an example of using a bloom-filter as a pre-filter for a more memory intensive python set. It filters to unique sequences and prints information on the number of false positives in the bloom filter.

$ time python < /usr/local/src/methylcode/data/dme/en.wt.1.fastq > unique.fastq
36754782 records in file
checking 2525155 potential duplicates in a python set
4699 false-positive duplicates in the bloom filter

real    11m31.077s
user    11m1.937s
sys 0m23.093s
Something went wrong with that request. Please try again.