skani cookbook

skani cookbook - common use cases

This cookbook presents some examples of common use cases for skani and how to set parameters.

This is not a definitive guide but may be helpful for further investigation. See the basic or advanced guides for documentation.

General rule for parameter tuning

For bacterial/archaeal/eukaryotic genomes: when in doubt, use the default parameters.
For smaller genomes (plasmids, viruses, etc.): you may need some tuning; see the advanced guide and the sections below.

Searching bacterial/archaeal genomes against a large database

skani sketch -t 10 -l list_of_genome_names.txt -o database
skani search genomes_in_a_folder/* -d database > results.tsv

Important points

skani's defaults usually work fine for bacterial/archaeal genomes
the -l option takes a text file that lists one genome file per line
search uses less memory and is fast for querying a few genomes. Use dist for querying many genomes or contigs.
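
The list file passed to -l can be generated rather than written by hand. The following is a minimal sketch, not part of skani itself: the function name make_genome_list, the directory name, and the file extensions are assumptions chosen for illustration.

```shell
# Sketch: write one genome path per line, the layout that -l expects.
# Assumption: genomes end in .fa, .fna, or .fasta; adjust to your naming.
make_genome_list() {
    genome_dir="$1"
    list_file="$2"
    find "$genome_dir" -type f \
        \( -name '*.fa' -o -name '*.fna' -o -name '*.fasta' \) \
        | sort > "$list_file"
}

# Example usage (paths are placeholders):
# make_genome_list my_genomes list_of_genome_names.txt
# skani sketch -t 10 -l list_of_genome_names.txt -o database
```

Sorting the paths keeps the list reproducible between runs, which makes it easier to compare results from different sketches.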

All-to-all bacterial/archaeal species-level (>95% ANI) comparison for dereplication

skani triangle -s 90 my_genome_folder/* -t (threads) -E > results.tsv 

# OR

skani triangle -s 93 my_genome_folder/* -t (threads) -E --medium > results.tsv

Important points

triangle sets better defaults than dist for all-to-all comparison
-s 93 or -s 90 means skani performs ANI computation only if the ANI is approximately > 93% or 90%, which speeds up computation. This ensures that genomes with close to 95% ANI still get compared. If you set -s to 95, you may screen out genomes with over 95% ANI.
-E outputs results in a tsv format instead of a matrix format.
--medium may give slightly more accurate results for very fragmented genomes or lower ANI genomes (~90%) at the cost of speed, but the difference is usually minor.
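
Because -s is set below 95, results.tsv will also contain pairs under the species-level cutoff, so a final filter is still needed before dereplication. The sketch below assumes the TSV has a single header line and that ANI (in percent) is in the third column; check the header of your own output first. The function name filter_by_ani is hypothetical.

```shell
# Sketch: keep the header plus rows whose ANI is at least min_ani.
# Assumption: tab-separated input, one header line, ANI in column 3.
filter_by_ani() {
    min_ani="$1"
    tsv="$2"
    awk -F '\t' -v min="$min_ani" 'NR == 1 || $3 + 0 >= min + 0' "$tsv"
}

# Example usage (file names are placeholders):
# filter_by_ani 95 results.tsv > species_level_pairs.tsv
```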

All-to-all comparison with lots of small contigs (e.g. viruses, plasmids)

Tip

Since v0.2.2, skani has the --small-genomes option, which is equivalent to -c 30 -m 200 --faster-small.

skani triangle viruses.fna -i -m 200 --slow (OR --medium) -t (threads) -E --faster-small -s 90 > results.tsv

Important points

-i uses contigs within the fasta file for comparison
-m 200 sets marker k-mers to appear about once every 200 bases. Genome length / -m should ideally be > 20. For larger contigs, set this higher; for smaller contigs, set it lower.
small genomes may benefit from the --slow or --medium options. These set -c to a smaller value, which gives better aligned fractions (AFs) and sometimes (but not always!) better ANIs.
--faster-small makes skani faster by using more aggressive ANI filtering for very small genomes. This increases speed for large data sets (> 10k sequences) but loses a bit of sensitivity.
-s 90 sets skani to screen comparisons for only approximately > 90% ANI. Feel free to set this higher or lower. Do not expect filtering to be accurate for small genomes and < 85% ANI.
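
The rule of thumb that genome length / -m should be > 20 gives an upper bound on -m for a given contig length. As a rough sketch, the hypothetical helper below computes the largest integer value that still satisfies the rule; it is a starting point for tuning, not a value skani prescribes.

```shell
# Sketch: largest integer m such that contig_len / m > 20.
# Equivalent to m < contig_len / 20, i.e. m <= floor((contig_len - 1) / 20).
max_marker_spacing() {
    contig_len="$1"
    echo $(( (contig_len - 1) / 20 ))
}

# Example: a 6000 bp contig
# max_marker_spacing 6000   # prints 299
```

For a mixed set of contigs, applying this to the shortest contig you care about gives a conservative bound.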

Searching contigs/long-reads against genomes

skani sketch database_genomes/* -o database
skani dist --qi -q my_contigs_or_reads.fasta -r database_genomes/* -t (THREADS) --faster-small -m 300 --medium (OR default) > results.tsv

Important points

contigs/reads < 500 bp are ignored. Short reads do not work.
when searching many small contigs or reads, dist is faster than search, but this depends on how large your database is and how many contigs you have.
--qi makes skani treat each sequence/contig in your query files (-q) as an individual query instead of treating each file as one genome
--faster-small makes screening more aggressive but loses sensitivity on very small reads/contigs.
-m 300 gives better screening for small contigs/reads. Contig length / -m should ideally be > 20.
consider --medium or even --slow if your reads/contigs are small. Default may also be ok.
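
Since contigs/reads < 500 bp are ignored, it can help to check how many query sequences fall under that length before running dist. The sketch below uses a hypothetical helper, count_short_seqs, and assumes an uncompressed FASTA file; it prints the total number of sequences followed by the number shorter than the cutoff.

```shell
# Sketch: print "<total> <n_short>" for an uncompressed FASTA file.
# Assumption: the 500 bp default mirrors the cutoff stated above.
count_short_seqs() {
    fasta="$1"
    min_len="${2:-500}"
    awk -v min="$min_len" '
        # On each header, close out the previous record and start a new one.
        /^>/ { if (seen) { total++; if (len < min) nshort++ }
               seen = 1; len = 0; next }
        # Sequence line: strip whitespace and add its length.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END  { if (seen) { total++; if (len < min) nshort++ }
               print total + 0, nshort + 0 }
    ' "$fasta"
}

# Example usage (file name is a placeholder):
# count_short_seqs my_contigs_or_reads.fasta
```

If most sequences are below the cutoff, few of them will produce results regardless of the -m or --medium settings.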
