
skani cookbook

Jim Shaw edited this page Jul 18, 2024 · 9 revisions

skani cookbook - common use cases

This cookbook presents some examples of common use cases for skani and how to set parameters.

This is not a definitive guide but may be helpful for further investigation. See the basic or advanced guides for documentation.

General rule for parameter tuning

  1. For bacterial/archaeal/eukaryotic genomes: when in doubt, use the default parameters.
  2. For smaller genomes (plasmids, viruses, etc.): you may need some tuning. See the advanced guide and the examples below.

Searching bacterial/archaeal genomes against a large database

skani sketch -t 10 -l list_of_genome_names.txt -o database
skani search genomes_in_a_folder/* -d database > results.tsv

Important points

  • skani's defaults usually work fine for bacterial/archaeal genomes
  • the -l option takes a text file that lists one genome file per line
  • search uses less memory and is fast for querying a few genomes. Use dist for querying many genomes or contigs.
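As a minimal sketch of preparing the file passed to -l, the helper below writes one genome path per line. The function name make_genome_list and the list of file extensions are assumptions for illustration, not part of skani.

```shell
# Hypothetical helper: write one genome path per line, the layout -l reads.
# Usage: make_genome_list <genome_dir> <output_list>
make_genome_list() {
    genome_dir=$1
    out_list=$2
    # The extensions below are an assumption; adjust them to match your files.
    find "$genome_dir" -type f \
        \( -name '*.fa' -o -name '*.fna' -o -name '*.fasta' \
           -o -name '*.fa.gz' -o -name '*.fna.gz' \) \
        | sort > "$out_list"
}
```

For example, make_genome_list my_genomes list_of_genome_names.txt would produce a list that can then be given to skani sketch with -l.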

All-to-all bacterial/archaeal species-level (>95% ANI) comparison for dereplication

skani triangle -s 90 my_genome_folder/* -t (threads) -E > results.tsv 

# OR

skani triangle -s 93 my_genome_folder/* -t (threads) -E --medium > results.tsv 

Important points

  • triangle sets better defaults than dist for all-to-all comparison
  • -s 93 or -s 90 makes skani compute ANI only for pairs whose approximate ANI is above 93% or 90%, which speeds up computation. Because the screen is approximate, setting it below 95% ensures that genomes close to 95% ANI still get compared. If you set -s to 95, you may screen out genomes with over 95% ANI.
  • -E outputs results in a tsv format instead of a matrix format.
  • --medium may give slightly more accurate results for very fragmented genomes or lower ANI genomes (~90%) at the cost of speed, but the difference is usually small.
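Since -s screens below the species threshold, the tsv produced with -E will usually contain pairs under 95% ANI. The sketch below keeps only rows at or above a chosen ANI. It assumes the first line of the tsv is a header with a column named ANI; the function name filter_by_ani is hypothetical. Check the header of your own output before relying on it.

```shell
# Hypothetical helper: keep rows of a tsv whose ANI is at least a threshold.
# Assumes the first line is a header containing a column named "ANI".
# Usage: filter_by_ani <results.tsv> <min_ani>
filter_by_ani() {
    awk -F '\t' -v min="$2" '
        NR == 1 {
            # Locate the ANI column by name rather than by position.
            for (i = 1; i <= NF; i++) if ($i == "ANI") col = i
            if (!col) { print "no ANI column in header" > "/dev/stderr"; exit 1 }
            print
            next
        }
        ($col + 0) >= (min + 0)
    ' "$1"
}
```

For example, filter_by_ani results.tsv 95 > species_level_pairs.tsv keeps the header plus the pairs at 95% ANI or higher.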

All-to-all comparison with lots of small contigs (e.g. viruses, plasmids)

Tip

Since v0.2.2, skani has the --small-genomes option, which is equivalent to -c 30 -m 200 --faster-small.

skani triangle viruses.fna -i -m 200 --slow (OR --medium) -t (threads) -E --faster-small -s 90 > results.tsv

Important points

  • -i uses contigs within the fasta file for comparison
  • -m 200 sets marker k-mers to appear about once every 200 bases. Genome length / -m should ideally be > 20. For larger contigs, set this higher; for smaller contigs, set this lower.
  • small genomes may benefit from the --slow or --medium options. These set -c to a smaller value and give better AFs, and sometimes (but not always!) better ANIs.
  • --faster-small makes skani faster by using more aggressive ANI filtering for very small genomes. This increases speed for large data sets (> 10k sequences) but loses a bit of sensitivity.
  • -s 90 sets skani to screen comparisons for only approximately > 90% ANI. Feel free to set this higher or lower. Do not expect filtering to be accurate for small genomes and < 85% ANI.
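The length / -m guideline above is simple arithmetic, sketched below. Both function names are hypothetical, and the result is only the rule of thumb from this page, not something skani computes for you.

```shell
# Hypothetical helper: expected number of marker k-mers for a sequence,
# using the rule of thumb that markers appear about once every -m bases.
# Usage: expected_markers <sequence_length_bp> <m_value>
expected_markers() {
    echo $(( $1 / $2 ))
}

# The -m value at which length / -m equals 20; values below it
# satisfy the "> 20" guideline for a sequence of that length.
# Usage: max_m_for_length <sequence_length_bp>
max_m_for_length() {
    echo $(( $1 / 20 ))
}
```

For example, a 10,000 bp contig with -m 200 gives about 50 markers, while a 3,000 bp contig gives about 15, which suggests lowering -m for contigs that short.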

Searching contigs/long-reads against genomes

skani sketch database_genomes/* -o database
skani dist --qi -q my_contigs_or_reads.fasta -r database_genomes/* -t (THREADS) --faster-small -m 300 --medium (OR default) > results.tsv

Important points

  • contigs/reads < 500 bp are ignored. Short reads do not work.
  • when searching many small contigs or reads, dist is faster than search, but this depends on how large your database is and how many contigs you have.
  • --qi makes skani treat each sequence/contig in your query files (-q) as a separate query instead of treating each file as one genome
  • --faster-small makes screening more aggressive but loses sensitivity on very small reads/contigs.
  • -m 300 gives better screening for small contigs/reads. Contig length / -m should ideally be > 20.
  • consider --medium or even --slow if your reads/contigs are small. Default may also be ok.
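Because contigs/reads under 500 bp are ignored, it can help to check how many records in a query file fall below that length before searching. The sketch below counts them in an uncompressed FASTA file; the function name count_short_records is hypothetical.

```shell
# Hypothetical helper: count FASTA records shorter than a minimum length.
# Works on plain-text (not gzipped) FASTA, including multi-line records.
# Usage: count_short_records <fasta> [min_len]   (min_len defaults to 500)
count_short_records() {
    awk -v min="${2:-500}" '
        /^>/ { if (seen && len < min) short++; seen = 1; len = 0; next }
        { len += length($0) }
        END { if (seen && len < min) short++; print short + 0 }
    ' "$1"
}
```

For example, count_short_records my_contigs_or_reads.fasta prints the number of records that would fall under the 500 bp cutoff.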