czid-dedup

czid-dedup reads single- or paired-end FASTA or FASTQ files and outputs versions of those files with duplicate reads removed. A duplicate read in this case is a read that is either identical to another read or shares a prefix of length -l with another read. Paired reads are only considered identical if both reads (or read prefixes, specified by -l) are duplicates to both reads in a previous pair.

In addition to the de-duplicated FASTA or FASTQ outputs, czid-dedup also outputs a cluster file which makes it possible to identify clusters of duplicate reads. The file lists the representative cluster read ID for each initial read ID, where the representative cluster read ID is the read ID that makes it into the output file. If a read is found to be a duplicate of a previous read, it will be filtered out of the FASTA/FASTQ output and paired with the read ID of the previous duplicate read in the cluster output file. Representative cluster read IDs are paired with themselves. The order of the input files is preserved. The representative read will always be the first read of its type.

FASTA/FASTQ parsing provided by rust-bio.

Installation

Binary

We release binaries for Linux, MacOS, and Windows. To install one, download the appropriate binary for your operating system from one of our releases.

From Source

Install rust/cargo if you haven't already
git clone https://github.com/chanzuckerberg/czid-dedup.git
cd czid-dedup
cargo build --release
Your executable will be at czid-dedup/target/release/czid-dedup (with .exe if you're on windows)

Usage

Run:

czid-dedup --help

for usage information:

USAGE:
    czid-dedup [OPTIONS] --deduped-outputs <deduped-outputs>... --inputs <inputs>...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -c, --cluster-output <cluster-output>         Output cluster file [default: clusters.csv]
    -o, --deduped-outputs <deduped-outputs>...    Output deduped FASTQ
    -i, --inputs <inputs>...                      Input FASTQ
    -l, --prefix-length <prefix-length>           Length of the prefix to consider

Example Usage

Deduplicate a single-end FASTA:

czid-dedup -i my-fasta.fasta -o my-deduped-fasta.fasta

Deduplicate a single-end FASTQ (same as FASTA)

czid-dedup -i my-fastq.fastq -o my-deduped-fastq.fastq

Deduplicate paired-end reads (note, inputs are paired to outputs by order not name):

czid-dedup \
	-i my-fasta-r1.fasta \
	-i my-fasta-r2.fasta \
	-o my-deduped-fasta-r1.fasta \
	-o my-deduped-fasta-r2.fasta

Deduplicate only considering a prefix of length 70:

czid-dedup -l 70 -i my-fasta.fasta -o my-deduped-fasta.fasta

Custom cluster file name:

czid-dedup -i my-fasta.fasta -o my-deduped-fasta.fasta -c custom-cluster.csv

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github/workflows		.github/workflows
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
Readme.md		Readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

czid-dedup

Installation

Binary

From Source

Usage

Example Usage

About

Releases 3

Packages

Contributors 2

Languages

License

chanzuckerberg/czid-dedup

Folders and files

Latest commit

History

Repository files navigation

czid-dedup

Installation

Binary

From Source

Usage

Example Usage

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 2

Languages

Packages