-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
settle on core MinHash object API #885
Comments
@luizirber what's the right way to use |
I'm going to try to follow the pandas deprecation policy (also ref their versioning policy) |
We should write Python MinHash object API tests/docs as part of this, esp now that #889 is merged and life is looking a bit better. |
And probably use similar docs for the rust docs, they are pretty empty right now... |
would also be good to revisit issues like #618 to see that we're at least consistent. |
random side thought: I think we should redefine the MinHash constructor API like so:
(ref #338 of course) |
PR #882 raised a question in my mind about the various similarity and comparison methods on MinHash objects. We have
similarity
,compare
, andjaccard
, as well ascontained_by
andcontainment_ignore_maxhash
. Currentlycompare
andjaccard
are identical (and do Jaccard comparisons without hash abundances), whilesimilarity
calculates Jaccard similarity w/o abundances and calculates angular similarity with abundances.The source code for this is a bit of a mess, obviously :).
SEPARATELY, I just realized that the sourmash gather code in
sourmash/search.py
does abundance calculations in Yet Another Way, with what I'll call a projection from an abundance-weighted signature on to a non-abundance-weighted signature (see block in search.py). This code is entirely absent from the MinHash code. sigh.Anyway.
#882 makes the rust code a bit simpler and less repetitive, and also IMO clearer, by adding explicit
angular_similarity
andjaccard
functions (that do those calculations without downsampling), and then doing the downsampling and so on insimilarity
andcompare
. This raises the following questions --Should the Python API match the rust API? (probably yes - right now this means adding
angular_similarity
.)should we get rid of redundant functions (esp ones that no one uses outside of the internal sourmash team :) and document the official API? (probably yes.)
should we put "buyer beware" in around things like angular similarity and projection similarity that we haven't proven work, with either simulations or analytics? (...probably yes.)
related to #866 #624
The text was updated successfully, but these errors were encountered: