Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discussion for why modulo hash / scaled signatures are awesome #606

Closed
ctb opened this issue Jan 7, 2019 · 5 comments
Closed

discussion for why modulo hash / scaled signatures are awesome #606

ctb opened this issue Jan 7, 2019 · 5 comments
Labels
revisit_me An issue that needs attention and clarification

Comments

@ctb
Copy link
Contributor

ctb commented Jan 7, 2019

From private conversations with @luizirber @bluegenes @halexand recently -- scaled signatures are different from MinHash because:

  • intended for containment (e.g. vs mash screen, regular minhashes)
  • streaming compatible in the following way: at any point in a stream, scaled signatures are always proper subsets of the scaled signature of the whole data set, unlike MinHashes. specifically, no hash is ever removed from a scaled signature as more data is received. This means that for searches of a database using streamed-in data, all prior matches remain valid (although their significance may change as more data is received). This allows us to place algorithmic guarantees on containment searches.
  • scaled signatures can be "abundance trimmed" while regular minhashes cannot be.
  • in many cases scaled signatures are compatible with regular minhashes and can be converted (see [MRG] add 'sourmash signature' signature manipulation utilities. #587 downsample and discussion in API docs in [MRG] Update the API docs #596), and both could maybe be built simultaneously (see Improvement: allow both num and scaled to be set #538)
  • I strongly suspect that scaled signatures deal much better with sets of different sizes
  • scaled signatures can be downsampled to larger scaled numbers (similar to downsampling MinHash, so not a huge advantage but could be mentioned)

These properties need to be clearly laid out, discussed, evaluated empirically, and (ideally) described theoretically. cc @dkoslicki

@ctb ctb changed the title bullet points for why modulo hash / scaled signatures are awsome discussion for why modulo hash / scaled signatures are awesome Jan 7, 2019
@ctb
Copy link
Contributor Author

ctb commented Jan 7, 2019

(of course, this all needs to be balanced against the point that they can grow indefinitely :)

@ctb
Copy link
Contributor Author

ctb commented Feb 3, 2019

note Richard Durbin's modimizer, which uses similar concepts! https://github.com/richarddurbin/modimizer - the README is informative for this issue.

@ctb
Copy link
Contributor Author

ctb commented Oct 13, 2019

a more succinct way of putting the containment guarantees above are "Containment never decreases as you get more data" (which is nice for streaming esp.)

@ctb
Copy link
Contributor Author

ctb commented Oct 19, 2019

note also that you can subtract and add scaled signatures, and filter them on abundance, and other things, without fear.

@ctb
Copy link
Contributor Author

ctb commented Jan 18, 2022

preprinted and available! see link in #823.

@ctb ctb closed this as completed Jan 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
revisit_me An issue that needs attention and clarification
Projects
None yet
Development

No branches or pull requests

1 participant