Sketching Formats

Introduction

Dashing2 is the second version of the Dashing sequence sketching and comparison system.

There have been several major changes, but you can still can get a quick start to compare a group of sequence collections here. For instructions on parsing, see Parsing below. Installation instructions can be found below.

New Features

tl;dr --

Faster and more accurate distances.
Weight-informed sketching.
Near-linear K-nearest neighbor and thresholded neighbor graphs via LSH tables.
Minimizer sequence transduction.
Faster/easier installation than Dashing1
Compilation in 45s in parallel, 2 min serial

Input Formats -- See inputs below for details.

Fastq/Fasta -- enhanced alphabet support
Default -- ACGT
Protein encoding - 20 characters (--protein)
Reduced alphabets 1. Protein - 14, 8, and 6-character alphabets for long-distance homology.
Optional -- generating 128-bit k-mers (--long-kmers)
All k-mer parsing can be winnowed by setting --window-size to be > k
Seeds can be spaced by providing a --spacing option, which provides the number of ignored characters in between used characters for seeds.
Exact multiset comparisons (--countdict)
Weight-aware sketching -- multiset and probability distribution 1. We use BagMinHash for weighted sets (--bagminhash or --multiset), and ProbMinHash for discrete probability distributions (--prob)
Minimizer sequence transduction 1. By enabling --seq, a sequence of minimizer values are emitted as a string. 2. This can be used for simple minimizer generation, or these minimizer sequences's edit distances can be compared in downstream analysis.
Splicing data
LeafCutter splicing output files --leafcutter
Interval Sets
BigWig files (--bigwig) [Weighted Interval Set]
BED files (--bed)
Binary dumps
Binary files made from 64-bit identifiers
Optional: paired weights for each feature
Can be sketched as sets, weighted sets, and discrete probability distributions.

There's also a two-step process; we can use the LSH index to generate probable candidates, but switch to exact distance comparisons after refinement. This can be enabled with the --refine-exact flag when using filtered exact-set (--set) or exact-multiset (--countdict).

For sketching algorithms that are not OrderMinHash (whether the default SetSketch, BagMinHash or ProbMinHash), this means that candidates are generated with fast, compressed sketches, but final results are produced using the full hash values.

If --edit-distance has been enabled, --refine-exact and --compute-edit-distance cause candidates to be generated by LSH table querying, but whose final distances are computed by exact edit ditsance. This allows one to prune the edit distances with LSH but still get exact results out.

Outputs

All-pairs symmetric (default), resulting in a compressed distance matrix of size N-choose-2 = (N (N - 1)) / 2.
Rectangular comparisons (--qfile/-Q against --ffile/-F) by providing query/reference sets
Filtered output
1. Jaccard thresholded-results (--similarity-threshold) and top-k results (--topk <k>)
2. Dashing2 builds and queries an LSH index to avoid all-pairs comparisons when generating these neighbor-graphs.
3. This affords us near-linear time- and space- comparisons, which we can use for clustering, indexing, and summarization.
Asymmetric all-pairs (--asymmetric-all-pairs), which performs c(A, B) for all-pairs, resulting in an NxN matrix.

All of these can be emitted in binary format --binary-output to avoid parsing/print formatting costs.

The matrices are emitted in flat, row-major format in float32.
1. If human-readable, the all-pairs symmetric is emitted as PHYLIP.
2. All-pairs asymmetric is emitted as a flat tsv.
3. Panel (rectangular) results are emitted as a flat tsv.
Thresholded results are emitted in CSR-format; see dashing2 cmp --help for more details.
1. Human-readable thresholded results are emitted as a tsv, with potentially varying numbers of items per line.
2. Individual entries consist of the distance/similarity and the corresponding entity.

Inputs

Dashing2 sketches input datasets, and then compares resulting sketches.

The formats supported are:

• FastA and FastQ

Uncompressed or gzip-compressed is transparently processed
Note: this supports both nucleotide and protein sketching.
Canonicalization is off by default.

  For k <= 64, DNA/RNA uses exact encoding, and for higher k values, a rolling hash is used.

  If protein is enabled (--enable-protein), then canonicalization does not apply and is not performed.

There is currently no quality filtering; to filter by quality scores, mask bases below desired quality as Ns. • BigWig
Is not stratified by chromosome • BED files
By default, BED files are sketched as sets of reference base/contig pairs.
These rows can be normalized (--normalize-intervals) to treat each interval as having the same weight, or can be treated as-is to treat each contig/reference base pair as an item to sketch.
Note: BED7 files have strand information, but Dashing2 does not attempt to stratify by strand

Convert these files into artificial BED files with the strand information included to work around this for now.

• LeafCutter output files for splicing datasets

Sketching Formats

Sketching Algorithms

By default, we do set sketching (One-Permutation MinHash or SetSketch Minhash), but we can do multiset (BagMinHash) or discrete prob set (ProbMinHash) sketching. BagMinhash is probably most appropriate for genome assemblies (--multiset), and ProbMinhash (--prob) is most appropriate for splicing and expression datasets, due to normalization.

We also have untested support of edit distance LSH. It seems to work, but hasn't been carefully vetted.

Counting

Counting is necessary for both BagMinHash and ProbMinHash. Default behavior, which works with sets and discards quantities, does not require it.

For counting, we either use a (single-row) CountSketch to approximate weighted set comparisons or we compute exact counts using a hash table. Counting is exact by default.

To enable count-sketch approximated counting, use the flag --countsketch-size [size]. The larger this parameter, the closer to exact the weighted sketching will be.

Exact sketching

In addition to creating sketches for m-mer sets, Dashing2 can perform full m-mer sets and m-mer count dictionaries. This can be enabled with --set or --countdict, respectively. This will be slower, but exact.

BED sketching

Normalization

Default normalization treats each base pair in an interval as equally weighted. For SPACE_SET, this doesn't matter, but for weighted sketching, it does.

That is, an interval from 0:100-200 will have twice the weight as 0:100-150. If, instead, you want to treat each interval as weighing the same as another interval, enable --normalize-intervals.

LeafCutter sketching

Similar to BED, except that the normalize flag causes the splicing event weight to be the fraction of reads supporting the junction rather than the absolute number of reads.

IE, 3/5 would have weight 0.6 if --normalize-intervals is enabled and weight 3 otherwise.

Sketches can be: SetSketch (One-permutation minhash or SetSketch) BagMinHash ProbMinHash OrderMinHash

However, OrderMinHash is only available for sequences.

When handling multiplicities, you can use exact counting, which may be slower, or you can approximate the count vectors with a single-row count-sketch by setting opts.cssize_ > 0.

ProbMinHash is usually significantly (2-20+x) faster than BagMinHash, although multiset jaccard may be more appropriate for some problems. For instance, expression and chromatic accessibility might be better considered discrete probability distributions and therefore fit ProbMinHash, whereas genomic sequences better match the multiset concept and benefit from BagMinHash, which is a MinHash algorithm for weighted sets.

QuickStart

We expect most usage to involve the sketch subcommand; to sketch and perform distances, add --cmpout <file>, where is the destination file or '-' to represent stdout.

Use 1 -- Sketch + All-pairs Compare Sequence Collections

dashing2 sketch [options] --cmpout <outfile> genome1.fa genome2.fa <...>
# Alternate, if filenames are in F.txt
# dashing2 sketch [options] --cmpout <outfile> -F F.txt

Full usage is found via dashing2 --help and dashing2 <subcommand> --help, where is one of the dashing2 subcommands.

dashing2 sketch performs sketching/summarization of a set of input files, or sequence-by-sequence processing of one or more sequence files. It also optionally performs comparisons and emits results to --cmpout.

Adding --cache causes Dashing2 to cache sketches to disk adjacent to the input files; this location can be changed with --outprefix.

We support a variety of alphabets -- DNA, Protein, and reduced amino acid alphabets for long-range homology (--protein14, --protein8, --protein6).

Alternate file-types supported include BigWigs (--bigwig), BED (--bed), and LeafCutter outputs (--leafcutter).

Clustering: scipy.cluster.hierarchy and fastcluster.hierarchy yield fast, concise clusterings, if distances are emitted.

To perform query-set vs reference-set comparison, see -Q/--qfile usage -- this yields a full rectangular matrix.

This is particularly useful for asymmetric similarities, such as containment.

Use 2 -- Sketch + Top-k NN graphs

For many applications, the quadratic space and time complexity of pairwise comparisons is prohibitive; we use locality-sensitive hashing (LSH) table to only perform comparisons between candidates highly probable to be nearest neighbors.

To generate a K-NN graph between a collection of sequences, with k = 250:

dashing2 sketch <comparison options...> --cmpout <outfile> --topk 250 -F F.txt

The output file will be a table with the names and similarities for the nearest neighbors so generated; the corresponding binary output is Compressed Sparse Row-notation results.

Clustering: This can be used for spectral clustering community detection algorithms such as Louvain and Leiden.

Use 3 -- Sketch + Jaccard-thresholded similarity graphs

Alternative to --topk [k], once can select a Jaccard similarity threshold below which the algorithm can ignore.

dashing2 sketch <comparison options...> --cmpout <outfile> --similarity-threshold 0.7 -F F.txt

The output file will be a table with the names and similarities for the nearest neighbors so generated; the corresponding binary output is Compressed Sparse Row-notation results.

Clustering: This can be used for spectral clustering community detection algorithms such as Louvain and Leiden.

Use 4 -- Protein sequence similarity search

The feature --parse-by-seq allows us to sketch and compute similarities between collections of sequences in a single file; in particular, this is useful for sequence files of protein sequences.

The default similarity/distance is k-mer set comparisons in sketch space; however, edit distance is perhaps more useful for protein sequences. Also, sketch sizes for proteins should be substantially smaller than for a full genome sequence.

dashing2 sketch -S256 --cmpout prot.k5.table --parse-by-seq -k5 {--protein,--protein14,--protein8,--protein6} uniref50.fa

If --edit-distance is enabled, OrderMinHash (Guillaume Marcais, et al.) is used to build an LSH table, and comparisons are generated between OrderMinHash sketches; this is only allowed in parse-by-seq mode, as it is undefined on a collection of sequences as opposed to a single sequence.

dashing2 sketch --topk 25 --edit-distance --compute-edit-distance -S256 --cmpout prot.top25.k5.table --parse-by-seq -k5 {--protein,--protein14,--protein8,--protein6} uniref50.fa

If either --refine-exact and --compute-edit-distance is enabled, then the final distances will be computed via edit distance.

This is particularly useful if the result is Jaccard or top-k thresholded, allowing linear-time top-k nearest neighbor lists.

Use 5: weighted sketching of feature sets

There's also dashing2 wsketch, which can be used for hashing weighted sets for comparisons. This is for the case where there are a set of integral identifiers for sketching. See dashing2 wsketch --help for usage and examples.

Use 6: Grouping sequences by edit distance

To use OrderMinHash, parsing must by by sequence. Further, --edit-distance tells Dashing2 to use edit distance LSH. --compute-edit-distance instructs dashing2 to compute edit distance between candidate neighbors instead of only comparing the sketches (the LSH register values themselves).

Use this to generate K-Nearest Neighbor graphs for edit distance.

dashing2 sketch -p8 --cmpout knn.edit-distance.tbl -k7 --parse-by-seq --edit-distance --compute-edit-distance input.fasta

Use 7: Generate KNN graph using exact K-mer distances but using LSH pre-filtering.

You can do this for exact k-mer sets:

dashing2 sketch --cache-sketches -p8 -F input_sequence_set.txt -k31  -S1024 --set --topk 25 -o input_sequence_set.topk.tsv

Or for exact weighted k-mer sets:

dashing2 sketch --cache-sketches -p8 -F input_sequence_set.txt -k31  -S1024 --countdict --topk 25 -o input_sequence_set.topk.tsv

For these approaches, Dashing2 will use a bottom-k LSH index to generate candidates.

You can instead generate a list of neighbors filtered by similarity threshold rather than choosing its top-k nearest neighbors.

Simply replace --topk <int> with --similarity-threshold <float>.

Installation

The easiest way to get started is to download a statically-linked binary in the dashing2-binaries repo. There are different folders for osx and linux, so select the latest version from the folder corresponding to your operating system. Dashing2 has not been tested on Windows, and we do not support it.

Alternatively, you can build from source with git clone --recursive https://github.com/dnbaker/dashing2 && cd dashing2 && make -j4.

Dashing2 now requires C++20, and therefore needs a relatively recent compiler, but the binary will be smaller than the statically-linked options provided and the code may be more directly tailored to your architecture.

Versions + Configuration

More than 2^32 items - dashing2 uses 32-bit hash identifiers in LSH tables for speed and memory efficiency. To use more than 4.3 billion, use dashing2-64, which switched to 64-bit identifiers and hashes.

The default version of Dashing2 is dashing2, which uses 32-bit LSH keys and ID types in its NN tables; this is faster and more memory-efficient, but less specific;

Hardware cache size When comparing sketches, computations are grouped for better cache efficiency. A group size is selected to fit as many sketches as possible in cache; The default cache size estimate is 4MB. To change this, set the D2_CACHE_SIZE environment variable.

# set cache size to 64 MB
export D2_CACHE_SIZE=67108864

As an aside, these sketches are stored contiguously to reduce fragmentation compared to Dashing1.

Parsing code

For parsing binary output (--binary-output/--emit-binary), we provide Python code for parsing into NumPy/SciPy matrices. This can save on processing time, and it avoids differences due to formatting/parsing loss.

Name		Name	Last commit message	Last commit date
Latest commit History 311 Commits
bonsai @ 6741de9		bonsai @ 6741de9
fmt @ 2742611		fmt @ 2742611
libBigWig @ 09a1b92		libBigWig @ 09a1b92
python		python
src		src
test		test
.gitmodules		.gitmodules
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

New Features

Outputs

Inputs

Sketching Formats

Sketching Algorithms

Counting

Exact sketching

BED sketching

Normalization

LeafCutter sketching

QuickStart

Installation

Versions + Configuration

Parsing code

About

Releases

Packages

Contributors 2

Languages

License

dnbaker/dashing2

Folders and files

Latest commit

History

Repository files navigation

Introduction

New Features

Outputs

Inputs

Sketching Formats

Sketching Algorithms

Counting

Exact sketching

BED sketching

Normalization

LeafCutter sketching

QuickStart

Installation

Versions + Configuration

Parsing code

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages