-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add jaccard similarity capability to Tuple sketches #349
Conversation
This provides an analog to the current Theta sketch Jaccard Similarity measure, but for Tuple sketches. The new similarity measure only compares the hash table entries, but not the summary values themselves.
We are in the middle of a major release to 2.0.0 right now, so I will have to look at this in that context after the release. |
Added missing finals. Added missing javadoc @param. Reformatted lines that were too long for better readability. Generic descriptors <...> needed a trailing space. Many unnecessary parentheses were removed automatically (likely historical).
I made a number of code style changes reflected above. Nearly all of these can be caught by your IDE if properly configured. I have my line-length set to 110. We require "final" to be specified everywhere it is relevant including method arguments. Although some consider these to be noise, we have actually caught inadvertent bugs with this habit. The javadocs for generic methods should include "at" param tags for the generic parameters. I would do a compare of what I just submitted vs your original PR to see what was changed. |
Thank you for taking the time to go through the branch and make the code style changes @leerho. I think that the use of "final" to prevent inadvertent bugs is a sensible policy, together with the other changes that you have made such as line length. To ensure that I prevent this in a future submission, I've configured my IDE to use the checkstyle plugin together with the I've also spent some time trying to configure the build POM file to enable a user to run
I was unsuccessful at getting this to work, but I'm not sure how useful it would be given that there is an IDE option. On our projects, we enable a style checker on building a PR, but I think that you have configured similar checks already. |
I'll reopen this for review now. |
I will have to look at the checkstyle issues later and consider it for a subsequent PR. |
This provides an analog to the current Theta sketch Jaccard Similarity
measure, but for Tuple sketches. The new similarity measure only
compares the hash table entries, but not the summary values themselves.
Note: This has been discussed on the dev mailing list, and is a draft for consideration. It is currently implemented for generic Tuple sketches, and also contains large amounts of duplication from what has been implemented already for Theta sketches.
Moreover, I am unsure as to whether this functionality is useful in a broader context, but the capability to compare the similarity/dissimilarity between two tuple sketches or between a tuple/theta sketch is absent.
Also, I have not run the linting/style formatters on this branch, and I would be grateful for help should this be an issue.