Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add jaccard similarity capability to Tuple sketches #349

Merged
merged 2 commits into from
Feb 23, 2021

Conversation

davecromberge
Copy link
Member

@davecromberge davecromberge commented Feb 10, 2021

This provides an analog to the current Theta sketch Jaccard Similarity
measure, but for Tuple sketches. The new similarity measure only
compares the hash table entries, but not the summary values themselves.

Note: This has been discussed on the dev mailing list, and is a draft for consideration. It is currently implemented for generic Tuple sketches, and also contains large amounts of duplication from what has been implemented already for Theta sketches.
Moreover, I am unsure as to whether this functionality is useful in a broader context, but the capability to compare the similarity/dissimilarity between two tuple sketches or between a tuple/theta sketch is absent.

Also, I have not run the linting/style formatters on this branch, and I would be grateful for help should this be an issue.

This provides an analog to the current Theta sketch Jaccard Similarity
measure, but for Tuple sketches.  The new similarity measure only
compares the hash table entries, but not the summary values themselves.
@davecromberge davecromberge marked this pull request as draft February 10, 2021 17:32
@leerho
Copy link
Contributor

leerho commented Feb 10, 2021

We are in the middle of a major release to 2.0.0 right now, so I will have to look at this in that context after the release.
Nonetheless, this is a good suggestion and thank you for the PR.

Added missing finals.
Added missing javadoc @param.
Reformatted lines that were too long for better readability.
Generic descriptors <...> needed a trailing space.
Many unnecessary parentheses were removed automatically (likely
historical).
@leerho
Copy link
Contributor

leerho commented Feb 22, 2021

I made a number of code style changes reflected above. Nearly all of these can be caught by your IDE if properly configured. I have my line-length set to 110. We require "final" to be specified everywhere it is relevant including method arguments. Although some consider these to be noise, we have actually caught inadvertent bugs with this habit. The javadocs for generic methods should include "at" param tags for the generic parameters. I would do a compare of what I just submitted vs your original PR to see what was changed.

@davecromberge
Copy link
Member Author

Thank you for taking the time to go through the branch and make the code style changes @leerho.

I think that the use of "final" to prevent inadvertent bugs is a sensible policy, together with the other changes that you have made such as line length. To ensure that I prevent this in a future submission, I've configured my IDE to use the checkstyle plugin together with the tools/SketchesCheckstyle.xml configuration file. This seems to enforce the style rules that you have referred to, such as the 110-character line limit.

I've also spent some time trying to configure the build POM file to enable a user to run mvn checkstyle:checkstyle for the same purpose. However, it required an additional plugin with upgrade similar to this:

<build>
   <plugins>
....
     <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-checkstyle-plugin</artifactId>
          <version>3.1.2</version>
          <dependencies>
            <dependency>
              <groupId>com.puppycrawl.tools</groupId>
              <artifactId>checkstyle</artifactId>
              <version>8.40</version>
            </dependency>
          </dependencies>
          <configuration>
            <configLocation>${project.basedir}/tools/SketchesCheckstyle.xml</configLocation>
          </configuration>
        </plugin>

I was unsuccessful at getting this to work, but I'm not sure how useful it would be given that there is an IDE option. On our projects, we enable a style checker on building a PR, but I think that you have configured similar checks already.

@davecromberge
Copy link
Member Author

I'll reopen this for review now.

@davecromberge davecromberge marked this pull request as ready for review February 22, 2021 21:57
@leerho
Copy link
Contributor

leerho commented Feb 23, 2021

I will have to look at the checkstyle issues later and consider it for a subsequent PR.
Otherwise this PR looks good.
Thank you for submitting!

@leerho leerho merged commit 60625fb into master Feb 23, 2021
@leerho leerho deleted the tuple-jaccard-similarity branch February 23, 2021 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants