gofasta is a command-line utility developed to handle SARS-CoV-2 alignments, but should be useful more generally for handling consensus genomes of any microbial pathogen. It has been used on datasets of millions of sequences, including by the United Kingdom's daily SARS-CoV-2 genome processing pipelines, Pangolin and Civet.
If you use gofasta in your work, please cite https://doi.org/10.1093/bioinformatics/btac424.
Binaries are available for Mac OSX and Linux under the latest release,
or you can install gofasta from Bioconda:
conda install bioconda::gofasta
If you have Go installed, you can run go install github.com/virus-evolution/gofasta@latest
to build a binary of the latest release locally (maybe in ~/go/bin/
). Branches, commits or tags other than latest
can be built in the same way.
You can also build the contents of this repository:
git clone https://github.com/virus-evolution/gofasta.git
cd gofasta
go build -o gofasta
For a full list of commands and options, you can run gofasta
with the -h
flag at the command line: gofasta -h
, gofasta sam -h
, gofasta sam variants -h
, etc.
e.g., for gofasta updown list -h
:
Click to see the help message
❯ gofasta updown list -h
Generate input CSV files for gofasta updown topranking
Example usage:
gofasta updown list -r reference.fasta -q alignment.fasta -o mutationlist.csv
Non-ATGC nucleotides are not recommended in the --reference, and --reference and --query must
be aligned to the same thing.
--outfile is a CSV-format file with the columns: query,SNPs,ambiguities,SNPcount,ambcount. There is one row
for each sequence in --query. SNPs is a "|"-delimited list of SNPs relative to --reference. ambiguities is
a "|"-delimited list of ranges (1-based, inclusive) of tracts of ambiguities (anything that isn't ATGC).
Usage:
gofasta updown list [flags]
Flags:
-q, --query string Alignment of sequences to parse, in fasta format (default "stdin")
-o, --outfile string Output to write (default "stdout")
-h, --help help for list
Global Flags:
-r, --reference string Reference sequence, in fasta format, which is treated as the root of the imaginary tree
minimap2 provides fast and reliable pairwise alignments between SARS-CoV-2 consensus genomes. gofasta provides utility functions to convert the sam format alignments it outputs into fasta format.
Pairwise alignments are less computationally demanding than multiple sequence alignment-proper and they offer an efficient approximation to it when sequences are closely related.
Making a single alignment with multiple sequences
This is the pipeline currently used by Pangolin:
minimap2 -a -x asm20 --sam-hit-only --secondary=no --score-N=0 MN908947.fa unaligned.consensus.fasta -o aligned.sam
gofasta sam toMultiAlign -s aligned.sam --start 266 --end 29674 --pad -o aligned.fasta
MN908947.fa
is the reference genome for SARS-CoV-2 (here is its Genbank accession), and unaligned.consensus.fasta
contains all your consensus genomes (in one file). In this case we've clipped the alignment to just the coding sequence (--start 266 --end 29674
) but replaced the trimmed regions with N
s to retain the reference length in the output (--pad
). Insertions relative to the reference are discarded from the output file so everything is the same (== reference) length.
minimap2 and gofasta can both make use of multiple threads. Additionally, minimap2 writes to stdout by default and gofasta reads from stdin if we don't specify a file for -s
, so we can avoid writing the intermediate sam to disk entirely. Both of these should speed things up:
minimap2 -t8 -a -x asm20 --score-N=0 MN908947.fa unaligned.consensus.fasta | gofasta sam toma -t2 > aligned.fasta
We give minimap2 more threads here because it's doing more work. toma
is just an alias for toMultiAlign
.
But I don't want to have to write all this code every time I want to align something. That's understandable. In which case you could define a shell function in your ~/.zshrc
or ~/.bashrc
file, something like:
function align() {
minimap2 -a -x asm20 --score-N=0 ~/path/to/MN908947.fa $1 | gofasta sam toma
}
and then you can run align unaligned.consensus.fasta > aligned.fasta
in your terminal.
Making pairwise alignments (including insertions)
gofasta will also convert the minimap2 alignment into pairwise alignments, writing one file per consensus genome, including insertions relative to the reference and writing the reference itself to the output files:
minimap2 -a -x asm20 --score-N=0 MN908947.fa unaligned.consensus.fasta |\
gofasta sam toPairAlign -r MN908947.fa -o outputdir
outputdir
is a directory which will be created if it does not already exist. Individual fasta files will be written to outputdir/*fasta
. Filenames are derived from the fasta descriptions (with /
replaced by _
).
Just as with toMultiAlign
, you can clip the alignments to coordinates of your choosing. For example, if you are only interested in the Spike region:
minimap2 -a -x asm20 --score-N=0 MN908947.fa unaligned.consensus.fasta |\
gofasta sam topa -r MN908947.fa --start 21563 --end 25384 -o outputdir
--start
and --end
are 1-based inclusive, and they are in reference coordinates. So you will get the region corresponding to the reference Spike regardless of insertions or deletions anywhere in your query genome.
There is no --pad
option for toPairAlign
because we don't expect all the sequences to be reference length.
gofasta provides two utilities to search for the closest genetic neighbours of a number of query sequences among a set of target sequences. In both cases the queries are loaded into memory and the targets are streamed from disk, so the target file can be arbitrarily large.
Traditional distance measures
Use gofasta closest
to find the closest neighbours by traditional genetic distance measures:
❯ gofasta closest --help | sed -n '/Flags/,$p'
Flags:
-t, --threads int Number of CPUs to use (Default: all available CPUs)
--query string Alignment of sequences to find neighbours for, in fasta format
--target string Alignment of sequences to search for neighbours in, in fasta format
-m, --measure string which distance measure to use (raw, snp or tn93) (default "raw")
-n, --number int (Optional) the closest n sequences to each query will be returned
-d, --max-dist string (Optional) return all sequences less than or equal to this distance away
-o, --outfile string The output file to write (default "stdout")
--table write a long-form table of the output
-h, --help help for closest
The available distance measures are raw
(the default) which is the number of nucleotide differences per site ; snp
, which is the total number of nucleotide differences, and tn93
, which is Tamura and Nei's (1993) evolutionary distance.
raw distance is calculated as: l_d / (l_d + l_s)
, where l_d
is the number of sites which are certainly different between two sequences and l_s
is the number of sites which are certainly the same.
snp distance = l_d
. For both the raw and snp distances, ambiguous IUPAC codes are treated as the set of bases that they represent.
tn93 distance is calculated according to equation (7) in the paper. Only ATGC
bases are considered when calculating this measure.
The routine is parallelised across queries, so there is no point setting -t
greater than the number of sequences in --query
.
Directional snp-distance
Use gofasta updown topranking
to find the closest neighbours by (directional) snp distance. This is the routine used by Civet3:
❯ gofasta updown topranking -h | sed -n '/Flags/,$p'
Flags:
-q, --query string File with sequences to find neighbours for. Either the CSV output of gofasta updown list, or an alignment in fasta format
-t, --target string File of sequences to look for neighbours in. Either the CSV output of gofasta updown list, or an alignment in fasta format
-o, --outfile string CSV-format file of closest neighbours to write (default "stdout")
--table write a long-form table of the output
--ignore string Optional plain text file of IDs to ignore in the target file when searching for neighbours
--dist-all int Maximum allowed SNP-distance between target and query sequence in any direction. Overrides the settings below
--dist-up int Maximum allowed SNP-distance from query for sequences in the parent bin
--dist-down int Maximum allowed SNP-distance from query for sequences in the child bin
--dist-side int Maximum allowed SNP-distance from query for sequences in the sibling bin
--size-total int Max number of neighbours to find (attempts to split equally between same/up/down/side). A hard limit
--size-up int Max number of closest parent sequences to find, if size-total not specified. A soft limit unless --no-fill
--size-down int Max number of closest child sequences to find, if size-total not specified. A soft limit unless --no-fill
--size-side int Max number of closest sibling sequences to find, if size-total not specified. A soft limit unless --no-fill
--size-same int Max number of identical sequences to find, if size-total not specified. A soft limit unless --no-fill
--threshold-pair float32 Up to this proportion of consequential sites is allowed to be ambiguous in either sequence for each pairwise comparison (default 0.1)
--threshold-target int Target can have at most this number of ambiguities to be considered (default 10000)
--dist-push int Push the --dist boundaries outwards so that bins have at least these many closest SNP-distances for which there are neighbours, where possible
--no-fill Don't make up for a shortfall in any of --size-up, -down, -side or -same by increasing the count for other bins
-h, --help help for topranking
Global Flags:
-r, --reference string Reference sequence, in fasta format, which is treated as the root of the imaginary tree
This routine has the potential to be faster than traditional genetic distance measures. It uses patterns of derived mutations to make inferences about the likely phylogenetic relationships between closely related sequences. Briefly, sequences are compressed so that they are represented by only their ATGC nucleotide differences from a common reference sequence, which is treated like the root of an imaginary tree. SNPs can either be unique to the query sequence, unique to the target sequence, or present in the intersection of the two. SNPs present in the intersection are interpreted as representing shared ancestry between query and target, whereas an excess of SNPs in either the query or target set can be interpreted to give directionality relative to a root sequence. See the Civet preprint for a detailed explanation of the scheme (Figure 2 and Figure S1).
The input --query
and --target
files can either be alignments in fasta format, or they can be csv-format files produced by gofasta updown list
(or one of each). Using the csv-format files should be faster to the extent that they are quicker to read from disk compared to alignments, which initially contain the information for every site.
An example of command-line use and more explanation is available by running gofasta updown topranking --help
.
Use gofasta snps
to extract nucleotide changes relative to a reference sequence from a multiple sequence alignment, and gofasta variants
and gofasta sam variants
to extract amino acid, indel and nucleotide changes relative to an annotated reference sequence from alignments in fasta and sam format, respectively.
Nucleotide changes
gofasta snps
will list all the nucleotide changes in a multiple sequence alignment (--query
) compared to a reference sequence which is provided in a separate file (--reference
). The two files have to be the same width (i.e. aligned to the same thing).
❯ gofasta snps -h | sed -n '/Flags/,$p'
Flags:
-r, --reference string Reference sequence, in fasta format
-q, --query string Alignment of sequences to find snps in, in fasta format (default "stdin")
-o, --outfile string Output to write (default "stdout")
--hard-gaps Don't treat alignment gaps as missing data
--aggregate Report the proportions of each change
--threshold float If --aggregate, only report snps with a freq greater than or equal to this value
-h, --help help for snps
The basic usage creates a csv file with a header and a line for each sequence in --query
. The first column is the sequence name, and the second column is a '|'-delimited list of nucleotide changes who format is: reference allele, 1-based position in alignment coordinates, query allele.
IUPAC ambiguity codes are treated as the set of bases that they represent, and only certainly-different changes are reported. For example an output of A101S
is possible, but A101W
is not. Alignment gaps (-
) are treated like N
s (aNy base) unless you use --hard-gaps
.
❯ gofasta snps -r MN908947.fa -q aligned.fasta -o snps.csv
❯
❯ head -n2 snps.csv
query,SNPs
query1,T670G|G4184A|C4321T|C9344T|A9424G|C9534T|C10198T|G10447A|C10449A|G12160A|C12880T|C14408T|C15714T|C17410T|C19955T|A20055G|T21570G|C21618T|G21987A|T22200G|G22578A|T22679C|C22686T|A22688G|A23403G|C23525T|T23599G|C23604A|C23854A|G23948T|T24469A|C25000T|C26060T|C26270T|G27382C|A27383T|T27384C|G27788T|C27807T|A28271T|C28311T|C28724T|G28881A|G28882A|G28883C|A29510C
If you invoke --aggregate
, the proportion of each change in the whole alignment is written.
❯ gofasta snps -r MN908947.fa -q aligned.fasta --aggregate
SNP,frequency
C44T,0.250000000
C241T,0.750000000
T670G,0.916666667
C1314T,0.083333333
C1613A,0.083333333
C1684T,0.083333333
C2790T,0.833333333
C3037T,0.833333333
...
Amino acid, indel and neutral nucleotide changes
If you provide an annotation, gofasta can also annotate amino acid changes relative to a reference sequence. The annotation can be provided in genbank flat file format or gff version 3 format. Examples of both for SARS-CoV-2 are available under resources/
in this repository.
The two relevant routines are gofasta variants
(for annotating mutations in alignments in fasta format) and gofasta sam variants
(for annotating mutations in alignments in sam format). They should give the same output for the same alignment and the same annotation. Multiple sequence alignments in fasta format don't need to be in reference coordinates for gofasta variants
, but if they aren't, a sequence in the same space as the annotation must be present in the alignment. If the alignment is being read from stdin, this sequence must be the first sequence in the alignment, but doesn't have to be if the file is being read from disk. The reference sequence in fasta format needs to be provided to gofasta sam variants
unless it is present in your annotation. As usual, run either command with the -h
flag for example command lines and detailed help.
For a genbank format annotation, the annotation will be parsed such that the genome is split into protein-coding regions based on CDS features, and intergenic regions (everything that isn't in CDS). Mutations are then annotated with ins
(insertion), del
(deletion), aa
(amino acid change) or nuc
(a nucleotide change that isn't in a codon that is represented by an amino acid change) - for the last one these can be in intergenic sequence, or they can be synonymous changes in CDS.
gff format annotation gives you more flexibility for naming amino acid changes. Currently, the annotation will be parsed such that the genome is split into protein-coding regions based on feature lines whose type
(in column 3) is either CDS
or mature_protein_region_of_CDS
, and intergenic regions (everything else). For the purposes of annotating amino acids, CDS
or mature_protein_region_of_CDS
feature lines that have a Name=something
tag,value pair in the attributes column (column 9) will be represented in the output. Thus you can define regions as protein-coding using a CDS
feature line (for example orf1a in SARS-CoV-2) but annotate amino acid changes in its constituent protein products using mature_protein_region_of_CDS
feature lines with Name=
attributes. See the example
Examples of the output formats:
ins:2028:3 - a 3-base insertion immediately after (1-based) position 2028 in reference coordinates
del:11288:9 - a 9-base deletion whose first missing nucleotide is at (1-based) position 11288 in reference coordinates
aa:S:D614G - the amino acid at (1-based) residue 614 in the S gene is a D in the reference and a G in this sequence
aa:nsp12:P323L - the amino acid at (1-based) residue 323 in the rdrp gene is a P in the reference and an L in this sequence
nuc:C3037T - the nucleotide at (1-based) position 3037 in reference coordinates is a C in the reference and a T in this sequence
As with gofasta snps
the default mode writes a csv with one line per query sequence, and each sequence's mutations in the second column. Use --aggregate
to get the overall frequencies of mutations in the alignment(s).
So, for example, you can find the frequencies of all the amino acid changes at residue 681 in the Spike gene, and the nucleotide changes underlying them, from the sample of SARS-CoV-2 sequences in aligned.fasta
like:
❯ gofasta variants --msa aligned.fasta --annotation MN908947.gb --aggregate --append-snps | grep "^aa:S:P681"
aa:S:P681H(nuc:C23604A),0.004000000
aa:S:P681R(nuc:C23604G),0.983000000
or find which sequences have P681H
:
❯ gofasta variants --msa aligned.fasta --annotation MN908947.gb | grep "S:P681H" | cut -d, -f1
COGUK/PHEC-XXXX107/PHEC
COGUK/PHEC-XXXX003/PHEC
COGUK/PHEC-XXXX996/PHEC
COGUK/PHEC-XXXX544/PHEC
Alternatives to minimap2 for pairwise viral genome alignment exist. Notably, Nextalign (Aksamentov et al. 2021) can use a genome annotation to apply a reading-frame-aware gap penalty, and will perform translation and amino acid alignment to call amino acid mutations.
gofasta’s functions for finding the closest genetic neighbours to a set of query sequences were designed for the specific use-case of querying a large dataset of target genomes by genetic distance to place a smaller set of focal sequences in context, without having to infer all the phylogenetic relationships between hundreds of thousands or millions of sequences. An alternative approach is currently provided by an up-to-date a maximum parsimony phylogenetic tree consisting of publicly available SARS-CoV-2 genomes (McBroome et al. 2021) and associated software which can be used for placing new samples and querying the dataset (Turakhia et al. 2021).
Other software exists for calculating pairwise genetic distances, including tn93 and pairsnp.
gofasta incorporates bíogo, which is distributed under licence. Its licence is reproduced under THIRD_PARTY_LICENCES/biogo
or run gofasta licences
to print it.
gofasta uses a slightly modified version of the bit-level coding scheme for nucleotides by Emmanuel Paradis, which is used in ape. For a description of it, see here.
The functions with SAM files as input were written to handle pairwise alignments between assembled SARS-CoV-2 genomes from minimap2.