Skip to content

epi2me-labs/fastcat

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
src
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

fastcat

A set of simply utilities for creating summaries from standard bioinformatics formats.

Installation

All tools are distributed in a single package from our conda channel, they can be installed into an isolated conda environment with:

mamba create -n fastcat -c conda-forge -c bioconda -c epi2melabs fastcat

Compilation

Although not recommended, compilation from source is via make:

make fastcat bamstats

Several libraries are assumed to be present on the system for linking.

fastcat

This eponymous tool concatenates .fastq(.gz) files whilst creating a summary of the sequences. Can also demultiplex reads according to Guppy/MinKNOW .fastq record headers.

Usage: fastcat [OPTION...] reads1.fastq(.gz) reads2.fastq(.gz) ...
fastcat -- concatenate and summarise .fastq(.gz) files.

  -a, --min_length=MIN READ LENGTH
                             minimum read length to output (excluded reads
                             remain listed in summaries).
  -b, --max_length=MAX READ LENGTH
                             maximum read length to output (excluded reads
                             remain listed in summaries).
  -d, --demultiplex=OUT DIR  Separate barcoded samples using fastq header
                             information. Option value is top-level output
                             directory.
  -f, --file=FILE SUMMARY    Per-file summary output
  -q, --min_qscore=MIN READ QSCOROE
                             minimum read Qscore to output (excluded reads
                             remain listed in summaries).
  -r, --read=READ SUMMARY    Per-read summary output
  -s, --sample=SAMPLE NAME   Sample name (if given adds a 'sample_name'
                             column).
  -x, --recurse              Search directories recursively for '.fastq',
                             '.fq', '.fastq.gz', and '.fq.gz' files.
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Input files may be given on stdin by specifing the input as '-'.  When the -x
option is given inputs may be directories.

Report bugs to chris.wright@nanoporetech.com.

The program writes the input sequences to stdout in .fastq format to be recompressed with gzip (or more usefully bgzip).

The per-read.txt is a tab-separated file with columns:

read_id        filename                read_length  mean_quality
SRR12447496.1  SRR12447496_1.fastq.gz  531          14.03
SRR12447496.2  SRR12447496_1.fastq.gz  513          13.91
SRR12447496.3  SRR12447496_1.fastq.gz  473          14.70
...

The mean quality is defined as:

-10 * log10(mean(10^(Q/-10)))

where Q are the set of all per-base quality scores for the read.

The per-file.txt is also a tab-separated file with columns:

filename                n_seqs  n_bases  min_length  max_length  mean_quality
SRR12447496_1.fastq.gz  16048   8090160  434         697         13.10
SRR12447498_1.fastq.gz  16203   8049713  421         697         13.25
SRR12447499_1.fastq.gz  15484   7812439  424         612         13.16
...

where the mean_quality column is the mean of the per-read mean_quality values.

bamstats

The bamstats utility is a re-implementation of the stats_from_bam program from pomoxis. It creates read-level summary statistics of alignments found in a BAM file and reports these in a TSV file.

Usage: bamstats [OPTION...] <reads.bam>
bamstats -- summarise rears/alignments in one or more BAM files.

 General options:
  -r, --region=chr:start-end Genomic region to process.
  -t, --threads=THREADS      Number of threads for BAM processing.

 Read filtering options:

  -g, --read_group=RG        Only process reads from given read group.
      --haplotype=VAL        Only process reads from a given haplotype.
                             Equivalent to --tag_name HP --tag_value VAL.
      --tag_name=TN          Only process reads with a given tag (see
                             --tag_value).
      --tag_value=VAL        Only process reads with a given tag value.

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

The program creates a simple TSV file containing statistics for each primary
alignment stored within the input BAM files.

Report bugs to chris.wright@nanoporetech.com.

bamindex

The bamindex program is a rather curious program that will create a positional index of alignments in a BAM file. It is intended to be used within workflows to allow parallel processing of records in a BAM file, each worker processing a contiguous chunk of the file. This is most useful with unaligned BAM files.

The program was insired by bri by Jared Simpson at OICR; which is far cooler.

There are three subcommands:

bamindex index

$ ./bamindex build --help
Usage: build [OPTION...] <reads.bam>
bamindex build -- create a BAM index corresponding to batches of records.

 General options:
  -c, --chunk_size=SIZE      Number of records in a chunk.
  -t, --threads=THREADS      Number of threads for BAM processing.

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

The program creates a simple index of file offsets for each of every (n * M)th
alignment record. No care is taken to keep records corresponding to the same
query together, or any other such niceities. Its intended to be used simply
with unaligned, unsorted BAMs.

bamindex fetch

$ ./bamindex fetch --help
Usage: fetch [OPTION...] <reads.bam.bci>
bamindex fetch -- fetch records from a BAM according to an index.

 General options:
  -c, --chunk=SIZE           Chunk index to retrieve.
  -t, --threads=THREADS      Number of threads for BAM processing.

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

The program simply will fetch a batch of records from a BAM fileusing and index
and a chunk ID.

bamindex dump

$ ./bamindex dump --help 
Usage: dump [OPTION...] <reads.bam.bci>
bamindex dump -- dump a BAM chunk index to stdout as text.

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

The program simply writes the contents of an index to stdout for human
inspection. It has no other purpose.

About

Simple utility to concatenate .fastq(.gz) files whilst creating a summary of the sequences.

Resources

License

Stars

Watchers

Forks