fastcat is a single executable with subcommands for FASTQ and BAM summary/statistics workflows.
It was originally written to santize and concatenate .fastq(.gz) files, hence the name.
mamba create -n fastcat -c conda-forge -c bioconda -c nanoporetech fastcatWhen building from source with:
make fastcatseveral libraries are expected to be present on the system.
Enquire within the Makefile for more details.
The command-line interface is split into multiple sub-commands with distinct tasks:
$ ./fastcat --help
Usage: ./fastcat <subcommand> [options]
Subcommands:
fastq FASTQ concatenation/statistics/filtering mode (fastcat)
bam BAM stats/filtering mode (bamstats)
lint Standalone FASTQ low-complexity filter (fastlint)
index BAM chunk index tools (bamindex)
Try './fastcat <subcommand> --help' for subcommand options.Concatenate and optionally filter FASTQ (or unaligned BAM via --input-fmt bam), optionally demultiplex by barcode, and write summary outputs.
$ ./fastcat fastq --help
Usage: fastq [OPTION...] dir-with-reads ...
fastcat -- concatenate and summarise .fastq(.gz) files.
Output options:
General options:
-B, --bam-out Output data as unaligned BAM.
-c, --reads-per-file=NUM Split reads into files with a set number of reads
(default: single file).
-d, --demultiplex Separate barcoded samples using fastq header
information; outputs go under
<output>/barcodeXXXX/.
-e, --force-error Exit with non-zero status if any files, or
records, contained errors.
--fofn Treat each input as a file-of-filenames source
(use '-' to read that list from stdin).
--input-fmt=fastq|bam
Input format for fastq mode (default: fastq). Use
'bam' for unaligned BAM input.
-o, --output=DIRECTORY Output root directory (default: fastcat-qc). Must
not already exist.
-p, --per-read Write per-read TSV (large). Disabled by default.
-s, --sample=SAMPLE NAME Sample name (if given, adds a 'sample_name'
column).
-t, --threads=THREADS Number of threads for output compression (only
with --bam-out).
-v, --verbose Verbose output.
-x, --recurse Search directories recursively for files matching
the selected input format.
Read filtering options:
-a, --min-length=MIN READ LENGTH
Minimum read length to output (excluded reads
remain listed in summaries).
-b, --max-length=MAX READ LENGTH
Maximum read length to output (excluded reads
remain listed in summaries).
--dust Enable DUST filtering of reads (default:
disabled).
--dust-t=DUST T Threshold for DUST filtering (default: 20).
--dust-w=DUST W Window size for DUST filtering (default: 64).
-g, --read-group=RG Only process reads from given read group.
--haplotype=VAL Only process reads from a given haplotype.
Equivalent to --tag-name HP --tag-value VAL.
--max-dust=MAX DUST Maximum proportion of low-complexity regions to
allow in reads (default: 0.95).
-q, --min-qscore=MIN READ QSCORE
Minimum read Qscore to output (excluded reads
remain listed in summaries).
--recalc-qual Force recomputing mean quality, else use 'qs' tag
if present.
--tag-name=TN Only process reads with a given tag (see
--tag-value).
--tag-value=VAL Only process reads with a given tag value.
Logging options:
--log-file=PATH Write logs to PATH instead of stderr.
--log-level=LEVEL Log level: error, warn, info, debug (default:
info).
--log-mono Force monochrome logging output (disable ANSI
colors).
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program versionThe program writes the input sequences to stdout by default in .fastq format, which can be recompressed with gzip
(or more usefully bgzip).
Ancilliary summary outputs are written to the <output> directory; these are described below.
If the --demultiplex option is provided the output to stdout is not produced and instead outputs for each observed barcode (sample), is produced in <output>/{unclassified,barcodeXXXX/}.
In this case the ancilliary outputs are produced per barcode.
Additionally as its a common thing to want to do, the program will create two files:
histograms/length.hist- read length histogram, andhistograms/quality.hist- read mean base-quality score histogram.
When data is demultiplexed one such file will be written to the demultiplexed
samples' directories. When demultiplexing is not enabled the files will be
placed in a directory according to the --histograms option. The format of the
histogram files is a tab-separated file of sparse, ordered intervals [lower, uppper):
lower upper count
The final bin may be unbounded, which is signified by a 0 entry for the upper
bin edge.
The per_file.txt is also a tab-separated file with columns:
filename n_seqs n_bases min_length max_length mean_quality f_file_ok f_stream_error f_qual_missing f_qual_truncated f_unknown_error r_record_seen r_record_ok r_too_long r_too_short r_low_quality r_dust_masked
test/data/bc0.fastq.gz 10 4599 191 965 9.86 1 0 0 0 0 10 10 0 0 0 0
...
where the mean_quality column is the mean of the per-read mean_quality values.
The f_* fields detail file-level parse errors.
r_record_seen is records parsed/considered, while r_record_ok is records that passed filters and were used downstream.
The basecallers.txt and runids.tsv files contain per-file counts of unique basecallers and MinKNOW run identifiers observed in the input files.
Their structures are, respectively:
filename basecaller count
test/data/bc0.fastq.gz 10
...
and
filename run_id count
test/data/bc0.fastq.gz 5a21d8a6996146deceeaea3784244c52741cae93 10
The (optional) per_read.txt is a tab-separated file with columns:
read_id filename runid read_length mean_quality channel read_number start_time
32e13a1c-4171-4706-b6ce-a32c0f65fa16 test/data/bc0.fastq.gz 5a21d8a6996146deceeaea3784244c52741cae93 326 12.31 282 9 2021-04-20T17:00:40Z
b87f011e-b802-4993-8f56-fd240b2e784f test/data/bc0.fastq.gz 5a21d8a6996146deceeaea3784244c52741cae93 407 9.13 213 19 2021-04-20T17:00:41Z
6f64aedb-bb8e-4777-b494-43e661841e06 test/data/bc0.fastq.gz 5a21d8a6996146deceeaea3784244c52741cae93 355 9.98 67 13 2021-04-20T17:00:41Z
c372fb2c-dd45-4feb-81b2-c167c3d1ce93 test/data/bc0.fastq.gz 5a21d8a6996146deceeaea3784244c52741cae93 317 8.14 337 18 2021-04-20T17:00:41Z
18d04e8d-2816-4986-8e1b-e5be676837fc test/data/bc0.fastq.gz 5a21d8a6996146deceeaea3784244c52741cae93 965 7.57 507 18 2021-04-20T17:00:41Z
...
The mean quality is defined as:
-10 * log10(mean(10^(Q/-10)))
where Q are the set of all per-base quality scores for the read.
In many cases (where relevant metadata is found in the read header) the mean quality will not be recomputed but taken from the basecaller's value, which may be
subtly different.
Recalculation of the mean quality can be forced with the --recalc-qual.
Compute BAM-derived summaries, filtering, optional reference coverage, and optional per-read tables.
The outputs are largely the same as for the fastcat fastq command.
$ ./fastcat bam --help
Usage: bam [OPTION...] <reads.bam>
bamstats -- summarise reads/alignments in an input BAM file.
Output options:
General options:
-b, --bed=BEDFILE BED file for regions to process.
-d, --demultiplex Separate outputs per barcode under
<output>/barcodeXXXX/.
-e, --force-error Exit with non-zero status if parsing/filtering
errors occur.
-o, --output=DIRECTORY Output root directory (default: fastcat-qc). Must
not already exist.
-p, --per-read Write per-read TSV (large). Disabled by default.
-r, --region=chr:start-end Genomic region to process.
-s, --sample=SAMPLE NAME Sample name (if given, adds a 'sample_name'
column).
-t, --threads=THREADS Number of threads (for BAM decompression and BED
output compression).
-v, --verbose Verbose output.
Read filtering options:
-a, --min-length=MIN READ LENGTH
Minimum read length to output (excluded reads
remain listed in summaries).
--dust Enable DUST filtering of reads (default:
disabled).
--dust-t=DUST T Threshold for DUST filtering (default: 20).
--dust-w=DUST W Window size for DUST filtering (default: 64).
-g, --read-group=RG Only process reads from given read group.
--haplotype=VAL Only process reads from a given haplotype.
Equivalent to --tag-name HP --tag-value VAL.
--max-dust=MAX DUST Maximum proportion of low-complexity regions to
allow in reads (default: 0.95).
--max-length=MAX READ LENGTH
Maximum read length to output (excluded reads
remain listed in summaries).
-q, --min-qscore=MIN READ QSCORE
Minimum read Qscore to output (excluded reads
remain listed in summaries).
--recalc-qual Force recomputing mean quality, else use 'qs' tag
in BAM if present.
--tag-name=TN Only process reads with a given tag (see
--tag-value).
--tag-value=VAL Only process reads with a given tag value.
Coverage calculation options:
--coverage Enable coverage calculations.
--coverage-beds=BEDFILE ...
BED file(s) for calculating coverage
(space-separated list).
--coverage-names=NAME ... Name(s) for coverage BED file(s)
(space-separated list).
--segments=LENGTH ... Segment length(s) for which to produce outputs
(space-separated integers).
--thresholds=VALUES ... Coverage thresholds to produce sparse
cumulative distribution (space-separated
integers).
Poly-A options:
--poly-a Enable poly-A tail length histogram.
--poly-a-cover=PCT_COVERAGE
Reference alignment coverage for acceptance of
read (default: 95).
--poly-a-qual=QUAL Read mean Q score for acceptance of read (default:
10).
--poly-a-rev Allow reverse alignments (useful for cDNA; default
is direct RNA).
Logging options:
--log-file=PATH Write logs to PATH instead of stderr.
--log-level=LEVEL Log level: error, warn, info, debug (default:
info).
--log-mono Force monochrome logging output (disable ANSI
colors).
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program versionThe outputs are arranged as for the fastcat fastq sub-command.
Additional outputs are provided expressing quantities concerned with the alignments of reads to reference sequences.
Four histogram outputs are provided (two in common with fastcat fastq):
length.hist- read length histogram,quality.hist- read mean base-quality score histogram,accuracy.hist- read alignment accuracy histogram, andcoverage.hist- read alignment coverage histogram (the proportion of each read spanned by the alignment and not clipped).
The file flagstats.tsv provides similar output to samtools flagstats, enumerating counts for different SAM flags for each reference sequence:
ref sample_name total primary secondary supplementary unmapped qcfail duplicate duplex duplex_forming
cneofor_1 0 0 0 0 0 0 0 0 0
cneofor_2 0 0 0 0 0 0 0 0 0
cneofor_3 1 0 0 1 0 0 0 0 0
ecoli1 373 347 5 21 0 0 0 0 0
ecoli2 0 0 0 0 0 0 0 0 0
The program can optionally output coverage information (provided the input BAM is coordinate sorted). The outputs are inspired by those from mosdepth but have been adapted to be more readily interpretable in line with user expectations. A set of files is produced for each user-provided BED file, as well as for the whole genome. The files are:
{name}.bed.gz- a bgzip-compressed BED file with position-level accuracy.{name}.fwd.bed.gz- as above but only for forward strand alignments.{name}.rev.bed.gz- as above but only for reverse strand alignments.{name}.summary.txt- a tab-separated summary of coverage for each region in the BED file. And a final total line for all regions in the BED.{name}.dist.txt- a tab-separated file listing the cumulative distribution function of coverage across all regions in the BED file (i.e. "what proportion of positions are covered at at least X depth?").
The summary file contains user-configurable entries enumerating a sparse coverage CDF: the default is to output the proportion of positions covered at 1x, 5x, 10x, 20x, 30x, 40x, and 50x. This is done independently for all regions in the BED, as well as the final total line.
Standalone SDUST-based low-complexity FASTQ filtering.
$ ./fastcat lint --help
Usage: lint [OPTION...] <reads.fastq>
fastlint -- apply sdust algorithm to input files.
General options:
-p, --max-proportion=PROPORTION
Maximum allowable proportion of masked bases in a
read to keep the read (default: 0.95).
-t, --threshold=THRESHOLD Threshold for repetition (default: 20).
-w, --window=WINDOW Window size (default: 64).
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
Chunk index tooling for BAM.
$ ./fastcat index --help
* bamindex help Print general help or help about a subcommand.
* bamindex build Build a BAM index.
* bamindex fetch Fetch records from a BAM using an index.
* bamindex dump Dump an index fetch to text.
$ ./fastcat index build --help
Usage: build [OPTION...] <reads.bam>
bamindex build -- create a BAM index corresponding to batches of records.
General options:
-c, --chunk-size=SIZE Number of records in a chunk.
-t, --threads=THREADS Number of threads for BAM processing.
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
$ ./fastcat index fetch --help
Usage: fetch [OPTION...] <reads.bam.bci>
bamindex fetch -- fetch records from a BAM according to an index.
General options:
-c, --chunk=SIZE Chunk index to retrieve.
-t, --threads=THREADS Number of threads for BAM processing.
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
$ ./fastcat index dump --help
Usage: dump [OPTION...] <reads.bam.bci>
bamindex dump -- dump a BAM chunk index to stdout as text.
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version