Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-12933][SQL] Initial implementation of Count-Min sketch #10851

Closed
wants to merge 16 commits into from

Conversation

liancheng
Copy link
Contributor

This PR adds an initial implementation of count min sketch, contained in a new module spark-sketch under common/sketch. The implementation is based on the CountMinSketch class in stream-lib.

As required by the design doc, spark-sketch should have no external dependency.
Two classes, Murmur3_x86_32 and Platform are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation.

The following features will be added in future follow-up PRs:

  • Serialization support
  • DataFrame API integration

import org.apache.spark.unsafe.Platform;
import org.apache.spark.unsafe.hash.Murmur3_x86_32;

public class CountMinSketchImpl extends CountMinSketch {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add some comment acknowledging stream-lib

@rxin
Copy link
Contributor

rxin commented Jan 20, 2016

do we also need to update the test runner to add this module?

cc @JoshRosen

@liancheng liancheng changed the title [SPARK-12818] Initial implementation of count min sketch [SPARK-12933][SQL] Initial implementation of count min sketch Jan 20, 2016

package org.apache.spark.util.sketch

import scala.reflect.ClassTag
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClassTag is used here for creating arrays. I found using Seq can slow down test execution quite a bit.

@liancheng
Copy link
Contributor Author

@rxin Already added sketch module to sparktestsupport per Josh's suggestion.

@SparkQA
Copy link

SparkQA commented Jan 20, 2016

Test build #49811 has finished for PR 10851 at commit bb7435f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// page 149, right after Proposition 7.
hash += hash >> 32;
hash &= PRIME_MODULUS;
// Doing "%" after (int) conversion is ~2x faster than %'ing longs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kind of black magic...

@rxin
Copy link
Contributor

rxin commented Jan 21, 2016

Somehow there is no timing information for the test cases in this new module. Can you take a look at that? You might need to change the sbt build file.

@JoshRosen
Copy link
Contributor

Oh, I forgot: you also need to update project/SparkBuild.scala to add the new submodule.

@SparkQA
Copy link

SparkQA commented Jan 21, 2016

Test build #49820 has finished for PR 10851 at commit 486414d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 21, 2016

Test build #49824 has finished for PR 10851 at commit 92d3a52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import java.io.OutputStream;

/**
* An implementation of Count-Min sketch data structure for the following data types:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an interface?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just start with "A Count-Min sketch is a probabilistic data structure ..."

i.e. your second paragraph.

And then explain the type of data types supported.

@liancheng liancheng changed the title [SPARK-12933][SQL] Initial implementation of count min sketch [SPARK-12933][SQL] Initial implementation of Count-Min sketch Jan 21, 2016
* Note that only Count-Min sketches with the same {@code depth}, {@code width}, and random seed
* can be merged.
*/
public abstract CountMinSketch mergeInPlace(CountMinSketch other);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

declare that this could throw some exception?

@SparkQA
Copy link

SparkQA commented Jan 21, 2016

Test build #49826 has finished for PR 10851 at commit 2bf907a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 21, 2016

Test build #49827 has finished for PR 10851 at commit 7ea22a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 21, 2016

Test build #49848 has finished for PR 10851 at commit a6e7479.

  • This patch fails Java style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 21, 2016

Test build #49882 has finished for PR 10851 at commit e06ff13.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

) = Seq(
"core", "graphx", "mllib", "repl", "network-common", "network-shuffle", "launcher", "unsafe",
"test-tags", "sketch"
).map(ProjectRef(buildLocation, _)) ++ sqlProjects ++ streamingProjects
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to break it down since the original pattern match somehow introduced an implicit tuple containing more than 22 fields after adding the spark-sketch module.

@SparkQA
Copy link

SparkQA commented Jan 22, 2016

Test build #49923 has finished for PR 10851 at commit 4adb57a.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2016

Test build #49925 has finished for PR 10851 at commit 57a31e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


import sun.misc.Unsafe;

final class Platform {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add some comment here explaining this is just a duplicate and is put here to minimize dependency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments for the duplicated Platform class and Murmur3_x86_32 class.

@rxin
Copy link
Contributor

rxin commented Jan 23, 2016

@liancheng can you make sure the generated javadocs look ok?

@liancheng
Copy link
Contributor Author

I've checked the Javadoc, it looks good.

@rxin
Copy link
Contributor

rxin commented Jan 23, 2016

I looked at this quickly (i.e. didn't do a detail review), but changes lgtm.

@SparkQA
Copy link

SparkQA commented Jan 23, 2016

Test build #2445 has started for PR 10851 at commit 1608ec9.

@SparkQA
Copy link

SparkQA commented Jan 23, 2016

Test build #49929 has finished for PR 10851 at commit 65853ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Jan 23, 2016

Going to merge this. Thanks.

Would be great if @cloud-fan can take another look at the implementation.

@asfgit asfgit closed this in 1c690dd Jan 23, 2016
@liancheng liancheng deleted the count-min-sketch branch January 23, 2016 08:36
CountMinSketchImpl that = (CountMinSketchImpl) other;

if (this.depth != that.depth) {
throw new CMSMergeException("Cannot merge estimators of different depth");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the CMSMergeException is a protected static class, can user catch this exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, actually I've fixed this issue in # 10893.

@liancheng
Copy link
Contributor Author

@rxin The constructor visibility issue would be fixed in #10893.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants