Better conceptual definition on minhash/signature/query #616

luizirber opened this Issue Jan 9, 2019 · 1 comment



luizirber commented Jan 9, 2019

I think we should separate query, signature and minhash more clearly across the codebase.
They are pretty entangled right now, but they are different things!

  • MinHash: the low-level minhash object, with one specific k and num/max_hash.
  • Signature: a collection of MinHash objects for one dataset. It can hold many k and num/max_hash combinations, but they are all derived from the same source.
  • query: what we pass to find/search/gather. gather assumes the query is a Signature; LCA assumes the query is a MinHash.

We can also lift some functionality from MinHash up into Signature: add_sequence can be a Signature method that calls add_kmer (or equivalent) on every MinHash defined for that signature. At the moment we do all this parsing during compute (for example), where for each sequence we need to iterate with the appropriate k size and so on.
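A rough sketch of what lifting add_sequence into Signature could look like. Everything here is hypothetical: ToyMinHash and ToySignature are stand-ins written for illustration, not the existing classes, and the hashing (truncated SHA-1, no canonical k-mers, no max_hash mode) is only a placeholder for whatever the real MinHash does.

```python
import hashlib


class ToyMinHash:
    """Hypothetical stand-in for the low-level sketch: one ksize, one num."""

    def __init__(self, ksize, num):
        self.ksize = ksize
        self.num = num
        self.hashes = set()

    def add_kmer(self, kmer):
        if len(kmer) != self.ksize:
            raise ValueError("k-mer length does not match ksize")
        # Placeholder hash: first 8 bytes of SHA-1 (an assumption, not the real hash).
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        self.hashes.add(int.from_bytes(digest[:8], "big"))
        # Bottom-k behaviour: keep only the `num` smallest hash values.
        if len(self.hashes) > self.num:
            self.hashes.remove(max(self.hashes))


class ToySignature:
    """Hypothetical container: several sketches derived from the same source."""

    def __init__(self, name, minhashes):
        self.name = name
        self.minhashes = list(minhashes)

    def add_sequence(self, sequence):
        # The k-mer iteration lives here, once, instead of in each caller.
        for mh in self.minhashes:
            k = mh.ksize
            for start in range(len(sequence) - k + 1):
                mh.add_kmer(sequence[start:start + k])


sig = ToySignature("example", [ToyMinHash(ksize=3, num=5), ToyMinHash(ksize=4, num=5)])
sig.add_sequence("ACGTACGTAC")
```

With this shape, a caller such as compute would only need to call add_sequence once per sequence, and the per-k bookkeeping stays inside the signature.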

Another note: Signature is a collection of MinHash at the moment, but it would be pretty interesting to allow it to keep HLL/BF/CMS/HistoSketch representations of the data too.
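One way this could work, as a minimal sketch under assumptions: if every representation exposes the same add_kmer method, a signature could feed them all without caring which kind each one is. The Sketch protocol, ToyBottomK, ToyBloomFilter and add_kmer_to_all below are hypothetical names made up for illustration, and the Bloom filter is a deliberately naive one.

```python
import hashlib
from typing import List, Protocol


class Sketch(Protocol):
    """Hypothetical shared interface for anything a signature could hold."""

    def add_kmer(self, kmer: str) -> None: ...


def _hash(kmer: str, seed: int) -> int:
    # Placeholder seeded hash built on SHA-1 (an assumption for the example).
    data = f"{seed}:{kmer}".encode("ascii")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


class ToyBottomK:
    """Keeps the `num` smallest hash values seen so far."""

    def __init__(self, num: int) -> None:
        self.num = num
        self.hashes: set = set()

    def add_kmer(self, kmer: str) -> None:
        self.hashes.add(_hash(kmer, 0))
        if len(self.hashes) > self.num:
            self.hashes.discard(max(self.hashes))


class ToyBloomFilter:
    """Naive Bloom filter: `n_hashes` seeded hashes over `size` bits."""

    def __init__(self, size: int, n_hashes: int) -> None:
        self.size = size
        self.n_hashes = n_hashes
        self.bits = [False] * size

    def _positions(self, kmer: str) -> List[int]:
        return [_hash(kmer, seed) % self.size for seed in range(self.n_hashes)]

    def add_kmer(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.bits[pos] = True

    def __contains__(self, kmer: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(kmer))


def add_kmer_to_all(sketches: List[Sketch], kmer: str) -> None:
    # The caller does not need to know which representation each sketch uses.
    for sketch in sketches:
        sketch.add_kmer(kmer)


sketches = [ToyBottomK(num=10), ToyBloomFilter(size=1024, n_hashes=3)]
add_kmer_to_all(sketches, "ACGT")
```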

Some examples

  • the signature file we save is a list of Signature, but many methods assume there is only one available (ignoring the others)
  • in feature/bf_query I'm attaching a Nodegraph with the Sig/MH content to the query to make search faster (but it is weird, because a Signature shouldn't have a Nodegraph attached to it!)

(things to consider on #556)


This comment has been minimized.


ctb commented Jan 9, 2019
