skani advanced usage guide

Jim Shaw edited this page Jul 18, 2024 · 48 revisions

Adjusting memory/speed/accuracy tradeoffs

When comparing

  • Bacterial
  • Archaeal
  • Small eukaryotic genomes

when in doubt, use the default parameters.

Below we list certain cases (e.g. viruses, small contigs, MAGs) where you may want to test different parameters or fine-tune them.

skani bias, accuracy, and AF with the -c parameter

  1. Lowering -c sometimes gives a more accurate ANI estimate, but not always. In fact, there are cases where it gives a worse estimate.
  2. However, we found that lowering the -c parameter almost always makes the AF calculation more accurate, so consider this if AF is something you care about. For default c = 125, AF gets less accurate as ANI decreases; you should probably lower -c for ANI < 95% to get more accurate AFs.
  3. Lowering -c is recommended for comparing small genomes (e.g. viruses) or contigs.

Note

See aligned fraction vs -c graphs here and here. ANI vs -c graph can be found in Supp. Fig. 6 and 10 in the Supp. Materials.

Pre-set values for c

The default value of c is 125. See skani dist -h for information on the pre-set values of --slow, --medium, --fast corresponding to c = 30, 70, and 200.

  1. DEFAULT (-c 125) is somewhere between --medium and --fast.
  2. --slow (-c 30) gives good AFs but has worse accuracy unless the genomes are very fragmented (< 3kb N50) or very distant (low ANI).
  3. --medium (-c 70) seems to work okay for decently fragmented MAGs and more distant genomes; may slightly underestimate ANI on higher-quality genomes though.
  4. --fast (-c 200) is twice as fast and memory-efficient. Works okay for higher N50 and higher ANI (> 10kb N50, > 95% ANI) and if AF isn't as important.
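
The rules of thumb above can be written down as a small lookup. This is only a sketch: the function name, its arguments, and the fallback to the default are illustrative assumptions, not part of skani. The text gives no numeric threshold for --medium, so the sketch never selects it.

```python
# Pre-set values of c as listed above.
PRESET_C = {"--slow": 30, "--medium": 70, "default": 125, "--fast": 200}


def suggest_preset(n50_bp: int, expected_ani: float, af_matters: bool = True) -> str:
    """Hypothetical helper: pick a preset from the rules of thumb above."""
    # Very fragmented assemblies (< 3kb N50): --slow is the suggested choice.
    if n50_bp < 3_000:
        return "--slow"
    # High N50, high ANI, and AF not important: --fast works okay.
    if n50_bp > 10_000 and expected_ani > 95 and not af_matters:
        return "--fast"
    # When in doubt, stay with the default parameters.
    return "default"


choice = suggest_preset(n50_bp=50_000, expected_ani=98, af_matters=False)
print(choice, "-> c =", PRESET_C[choice])
```

Falling back to the default mirrors the advice at the top of this page; treat any other suggestion as a starting point to test on your own data.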

ANI calculations for smaller sequences

Tip

Since v0.2.2, skani has the --small-genomes option, which is an alias for -c 30 -m 200 --faster-small. You can try -c 30 -m 100 --faster-small for even more sensitive results.

skani is not necessarily designed for extremely small sequences, but it works when the sequences are long enough (> 2kb at least).

For small sequences, here are some suggestions:

  1. Make sure to use the --qi or --ri or -i options for skani triangle/dist/search if your contigs/reads are all in one file.
  2. skani dist or skani triangle will be much faster than skani search since the bottleneck will be loading genomes into memory.

For parameters:

  1. The default marker size -m is set to 1000, so we take one marker per every 1000 k-mers. This default is not good for very small sequences. 20 markers on average works well, so set -m < avg_genome_length / 20.
  2. Setting -c to smaller values makes AF a lot more accurate for small genomes.
  3. Try using --faster-small for lots of small contigs -- this speeds up viral/plasmid all-to-all comparisons drastically.

Important

The --faster-small option filters small, distant contigs for ANI calculation. It can dramatically speed up comparisons for > 100,000 small genomes. However, this will be much less sensitive if -m is not appropriately set. Set -m to be at most target_genome_length / 20, so that each genome of interest gets about 20 markers. E.g. if you care about viruses > 3kbp, set -m 150.
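
The "about 20 markers per genome" rule is simple arithmetic, sketched below. The helper names are hypothetical; only the flags -c, -m and --faster-small come from this page, and the choice of -c 30 follows the --small-genomes alias described above.

```python
def max_marker_spacing(genome_length_bp: int, target_markers: int = 20) -> int:
    """Largest -m that still gives roughly `target_markers` markers per genome."""
    if genome_length_bp <= 0 or target_markers <= 0:
        raise ValueError("genome length and marker count must be positive")
    # One marker per m k-mers, so m = length / markers (never below 1).
    return max(1, genome_length_bp // target_markers)


def small_genome_args(min_genome_length_bp: int) -> list[str]:
    """Hypothetical helper: skani arguments tuned to the smallest genome of interest."""
    m = max_marker_spacing(min_genome_length_bp)
    return ["-c", "30", "-m", str(m), "--faster-small"]


# Smallest viruses of interest are 3 kbp long: 3000 / 20 = 150.
print(max_marker_spacing(3_000))
print(" ".join(small_genome_args(3_000)))
```

Sizing -m to the smallest genome you care about, rather than the average, is the more conservative choice, since shorter genomes are the ones at risk of having too few markers.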

Screening: comparing only high ANI genomes with -s to speed up computations

The option -s sets an APPROXIMATE ANI cutoff. Computations proceed only if the putative ANI (obtained by k-mer max-containment index) is higher than -s. By default, -s is 80 (80% ANI).

A value of -s below 80% will not work well unless using the --slow option (or lower value of -c). With default parameters, lowering -s will likely do nothing except make your calculations slower.

You can use a higher value of -s if you want to compare only similar strains/species. You may need enough marker k-mers for this, i.e. -m must be set appropriately. See "ANI calculations for smaller sequences" above on how to set -m.

Important

The -s cutoff is only an approximate screening method. If the putative ANI is smaller than -s, the calculation does not proceed, EVEN if skani's ANI is > -s. So if you want 95% ANIs, set -s to 90 rather than 95 to avoid sensitivity loss.
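
A toy model of the screening step shows why a margin helps. The function names and the 5-point margin are assumptions taken from the 95 -> 90 example above. The putative ANI in the usage example is a made-up number, and the strict comparison at the boundary is a guess.

```python
def passes_screen(putative_ani: float, s: float = 80.0) -> bool:
    """Model of the screen: proceed only if the putative ANI is higher than -s."""
    return putative_ani > s


def screen_for_target(target_ani: float, margin: float = 5.0) -> float:
    """Hypothetical helper: choose -s some margin below the ANI you care about."""
    return target_ani - margin


# Made-up pair: the final ANI would be above 95, but the k-mer based
# putative ANI underestimates it at 93.
putative = 93.0
print(passes_screen(putative, s=95.0))                     # screened out
print(passes_screen(putative, s=screen_for_target(95.0)))  # kept with -s 90
```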

Inverted Indexing and filtering of genomes with -m

skani uses an inverted index for k-mer filtering. This loads all marker k-mers into a hash table. This is turned on automatically if:

  1. more than 100 query files are input OR
  2. the --qi option is enabled.

Otherwise, it is disabled and we do a linear filtering step instead.

If the inverted index is enabled, make sure your genomes (or contigs if using --qi) have enough marker k-mers available, otherwise the genomes/contigs may get filtered out if no markers are shared between the genomes.

By default, -m is 1000, so there is 1 marker per 1000 bases. 20 markers per genome is a reasonable value for comparing at the species level, so consider decreasing -m if you're using --qi with small contigs.

Building the table can take up to a minute (for large databases), and the table itself is ~10 GB for gtdb-r207 (65k genomes) and ~20 GB for gtdb-r214. If memory is an issue, consider increasing -m: the memory used by this table is inversely proportional to -m. Alternatively, consider the --no-marker-index option if you want to be careful, at the cost of much slower runs.
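
As a rough planning aid, the two rules in this section (when the index is switched on, and how its size scales with -m) can be sketched as follows. The function names are hypothetical, and the memory estimate simply scales the ~10 GB figure quoted for gtdb-r207 at the default -m 1000, so it is an approximation for that database only.

```python
def inverted_index_enabled(n_query_files: int, qi: bool = False) -> bool:
    """Index is used if more than 100 query files are given or --qi is set."""
    return qi or n_query_files > 100


def estimated_table_gb(m: int, ref_gb: float = 10.0, ref_m: int = 1000) -> float:
    """Scale a known table size (ref_gb at -m ref_m) inversely with -m."""
    if m <= 0:
        raise ValueError("-m must be positive")
    return ref_gb * ref_m / m


print(inverted_index_enabled(n_query_files=250))
print(estimated_table_gb(2000))  # doubling -m halves the estimate
print(estimated_table_gb(500))   # halving -m doubles it
```

The trade-off cuts both ways: a larger -m saves memory but leaves fewer markers per genome, which matters for the small-contig cases discussed above.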

Comparing lower ANI genomes

skani's ANI calculations are the most accurate for genomes with > 85% ANI, although with default parameters results down to 82% ANI will usually be shown (sometimes even lower). We only output results where the aligned fraction for the query or the reference is > 15% by default. This can be changed with the --min-af option, but low aligned fraction results are not accurate.

To get more accurate results for low ANI values, one should use a lower value for c and s, and then possibly adjust the --min-af option.

For example, the supplied genome refs/MN-03.fa in the GitHub repository is a Klebsiella pneumoniae genome. Running skani dist refs/MN-03.fa refs/e.coli-K12.fa returns nothing because the alignment is poor with default parameters. However, skani dist refs/MN-03.fa refs/e.coli-K12.fa -c 30 -s 75 returns an ANI estimate of ~79%.
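
If you script such comparisons, the command can be assembled programmatically. The helper below is a hypothetical sketch that only builds the argument list; it uses just the options named on this page (-c, -s, --min-af) and does not run skani. The value passed to --min-af is forwarded as given, so check skani dist -h for its expected units.

```python
import shlex
from typing import Optional


def low_ani_dist_command(query: str, reference: str, c: int = 30, s: int = 75,
                         min_af: Optional[float] = None) -> list[str]:
    """Build a `skani dist` command with lowered -c and -s for distant genomes."""
    cmd = ["skani", "dist", query, reference, "-c", str(c), "-s", str(s)]
    if min_af is not None:
        # Only add --min-af when the caller explicitly wants to change it.
        cmd += ["--min-af", str(min_af)]
    return cmd


cmd = low_ani_dist_command("refs/MN-03.fa", "refs/e.coli-K12.fa")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where skani is installed.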

For distant genomes, the aligned fraction output becomes more accurate as c gets smaller. However, decreasing c may not necessarily make high ANI calculations more accurate. Nevertheless, I would not recommend ANI comparisons for genomes with < 75% ANI using skani.

Robust and Median options

The --median and --robust options bias the ANI upwards a decent amount (95% -> 96% happens, for example). They may sometimes give more concordant clusterings, but I recommend only using these options if you are fine with a decent amount of bias.

ANI debiasing by trained regression

Since v0.1.0, skani outputs a more accurate ANI by debiasing an initial ANI estimate using a trained regression model. This model is trained on bacterial MAGs, but it seems to work quite well on even complete genomes, and eukaryotes as well.

The debiasing step is turned on when there are > 150,000 bases mapped between the genomes and c >= 70. In particular, the default parameters enable the debiasing step. We've found that this step is helpful most of the time, but turning off ANI debiasing with the --no-learned-ani option may be beneficial in edge cases.
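
The stated conditions can be checked with a small predicate. This is a sketch of the rule as described here, not skani's actual code, and the function name is hypothetical.

```python
def debiasing_applied(bases_mapped: int, c: int = 125,
                      no_learned_ani: bool = False) -> bool:
    """True if the learned ANI debiasing would apply under the stated rule."""
    if no_learned_ani:
        # --no-learned-ani switches the step off explicitly.
        return False
    return bases_mapped > 150_000 and c >= 70


print(debiasing_applied(2_000_000))        # default c = 125
print(debiasing_applied(2_000_000, c=30))  # e.g. --slow
print(debiasing_applied(40_000))           # too few mapped bases
```

One consequence worth noting: under this rule, any run with c below 70 (such as --slow or --small-genomes, both of which use c = 30) reports the ANI without debiasing.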