
@haydenm haydenm commented Apr 17, 2018

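A user can enable locality-sensitive hashing (LSH) through either of two optional arguments to design.py, one for each family of hash functions (Hamming distance or MinHash). Collapsing near-duplicate candidate probes this way can significantly reduce runtime and memory requirements, particularly when the input consists of a large number of highly similar sequences.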

haydenm added 8 commits April 8, 2018 23:15
Like the duplicate filter, this near-duplicate filter is optional, but running
it before the set cover filter can significantly lower runtime and memory usage
by reducing the input size (the number of candidate probes) passed to the set
cover filter. It detects and collapses near-duplicate probes using
locality-sensitive hashing with a family of hash functions designed for Hamming
distance. An option to use this filter (in place of the duplicate filter) is
provided in the arguments to design.py. The argument's help message explains
two caveats: (1) the filter may lead to a slightly suboptimal solution (i.e.,
more probes than needed) because it may remove "better" candidate probes that
are near-duplicates of ones that are kept, and (2) depending on the provided
distance threshold, the filter may lead to less-than-desired coverage of the
target genomes because the universe to cover is constructed from the set of
candidate probes, which this filter collapses.
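
The collapsing step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the function names (`hamming`, `collapse_near_duplicates`), the parameters (`max_dist`, `k`, `num_tables`), and the choice of a position-sampling hash family are all assumptions made for the example. Each hash function reads a few randomly chosen positions of a probe, so probes with few mismatches tend to share a bucket; bucket-mates are then checked exactly before a probe is dropped.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_near_duplicates(probes, max_dist, k=4, num_tables=16, seed=0):
    """Keep one representative per group of probes within max_dist mismatches.

    Hypothetical sketch: assumes all probes have the same length (>= k).
    Each table hashes a probe to the characters at k sampled positions.
    """
    if not probes:
        return []
    rng = random.Random(seed)
    length = len(probes[0])
    tables = []
    for _ in range(num_tables):
        positions = sorted(rng.sample(range(length), k))
        tables.append((positions, defaultdict(list)))

    kept = []
    for probe in probes:
        # Candidates are kept probes that share a bucket in any table.
        keys = []
        candidates = set()
        for positions, buckets in tables:
            key = "".join(probe[i] for i in positions)
            keys.append(key)
            candidates.update(buckets[key])
        # LSH only proposes likely neighbors; verify each one exactly.
        if any(hamming(probe, c) <= max_dist for c in candidates):
            continue  # near-duplicate of a probe that was already kept
        kept.append(probe)
        for (_, buckets), key in zip(tables, keys):
            buckets[key].append(probe)
    return kept
```

In this sketch a near-duplicate is found only if at least one table samples no mismatched position, so adding tables raises sensitivity at the cost of memory. The first probe seen in each group is the one kept, which is how a "better" candidate can be discarded, as caveat (1) notes.
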

This adds another family of hash functions to use for LSH. In this context,
each sequence can be thought of as a set of k-mers; hash functions from this
family are likely to hash similar sequences to the same value based on their
shared k-mers.
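
As a rough illustration of hashing a sequence by its k-mer set, the sketch below builds a MinHash-style signature. It is written under assumptions and is not taken from this pull request: the helper names, the use of salted BLAKE2b digests as the hash functions, and the default `k` and `num_hashes` are all hypothetical choices. Each signature entry is the smallest hash value over the sequence's k-mers, so two sequences agree at an entry with probability equal to the Jaccard similarity of their k-mer sets.

```python
import hashlib


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _salted_hash(kmer, salt):
    """Deterministic 64-bit hash of a k-mer; salt selects the hash function."""
    data = f"{salt}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq, k=10, num_hashes=64):
    """One minimum hash value per hash function, taken over seq's k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_salted_hash(km, salt) for km in ks)
            for salt in range(num_hashes)]


def estimated_jaccard_similarity(sig_a, sig_b):
    """Fraction of signature entries on which two sequences agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Because the signature depends only on the set of k-mers, two sequences with identical k-mer sets get identical signatures regardless of length or k-mer order.
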

This filter is like the one using LSH based on Hamming distance, except that it
uses a family that constructs MinHash signatures. This commit adds an option to
design.py for using it. Its main advantage over the Hamming distance filter is
that it can be more sensitive in detecting near-duplicate probes (e.g., ones
that are shifted relative to each other). Its main downside is that the
distance threshold is based on Jaccard distance, and it is less intuitive how
those values translate to probe hybridization than it is for Hamming distance.
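
The point about shifted probes can be made concrete with a small worked example. The 21-nt target below is made up for illustration, and the helper names and `k=4` are assumptions rather than anything defined in this pull request. Two probes that overlap in all but one base look very different position by position, yet share almost all of their k-mers.

```python
def hamming_distance(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def kmer_set(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k):
    """1 - |A & B| / |A | B| over the k-mer sets of a and b."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return 1 - len(ka & kb) / len(ka | kb)


target = "ACGTTGCAAGCTTCGGATCCA"  # hypothetical target sequence
probe_a = target[0:20]
probe_b = target[1:21]            # same length, shifted by one base

print(hamming_distance(probe_a, probe_b))                 # 15 (of 20)
print(round(jaccard_distance(probe_a, probe_b, k=4), 3))  # 0.111
```

Here the probes mismatch at 15 of 20 positions, so a Hamming-based filter with a small threshold would not collapse them, while their 4-mer sets share 16 of 18 distinct k-mers. The Jaccard value of about 0.11 also shows the downside mentioned above: it says little by itself about how many bases would mismatch during hybridization.
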
@coveralls

Coverage increased (+0.3%) to 94.907% when pulling 5f56741 on lsh into a7f9277 on master.

@haydenm haydenm merged commit fe63b86 into master Apr 17, 2018
@haydenm haydenm deleted the lsh branch April 17, 2018 18:36