[SPARK-12933][SQL] Initial implementation of Count-Min sketch #10851
import org.apache.spark.unsafe.Platform;
import org.apache.spark.unsafe.hash.Murmur3_x86_32;

public class CountMinSketchImpl extends CountMinSketch {
we should add some comment acknowledging stream-lib
do we also need to update the test runner to add this module? cc @JoshRosen
package org.apache.spark.util.sketch

import scala.reflect.ClassTag
ClassTag is used here for creating arrays. I found that using Seq can slow down test execution quite a bit.
@rxin Already added sketch module to
Test build #49811 has finished for PR 10851 at commit
// page 149, right after Proposition 7.
hash += hash >> 32;
hash &= PRIME_MODULUS;
// Doing "%" after (int) conversion is ~2x faster than %'ing longs.
kind of black magic...
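For readers following the thread, here is a minimal sketch of what the quoted lines appear to be doing, assuming `PRIME_MODULUS` is 2^31 - 1 (as in stream-lib) and that each row has its own random multiplier. The class name `MersenneHashSketch` and the `hashA` array are hypothetical names used for illustration, not the code under review.

```java
import java.util.Random;

/** Illustrative only: one cheap hash function per sketch row. */
final class MersenneHashSketch {
    // Assumption: 2^31 - 1, a Mersenne prime, which doubles as a mask for the low 31 bits.
    static final long PRIME_MODULUS = (1L << 31) - 1;

    private final long[] hashA; // hypothetical per-row multipliers
    private final int width;

    MersenneHashSketch(int depth, int width, int seed) {
        this.width = width;
        this.hashA = new long[depth];
        Random r = new Random(seed);
        for (int i = 0; i < depth; i++) {
            hashA[i] = r.nextInt(Integer.MAX_VALUE);
        }
    }

    /** Maps an item to a column index in [0, width) for the given row. */
    int hash(long item, int row) {
        long hash = hashA[row] * item;
        // Fold the high bits into the low bits, then mask down to 31 bits.
        // This is a cheap stand-in for reducing modulo 2^31 - 1, not an exact remainder.
        hash += hash >> 32;
        hash &= PRIME_MODULUS;
        // After masking, the value is non-negative and fits in an int,
        // so the int remainder is a valid column index.
        return ((int) hash) % width;
    }
}
```

The mask guarantees a non-negative value below 2^31, which is why the narrowing cast to `int` before `%` is safe here.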
Somehow there is no timing information for the test cases in this new module. Can you take a look at that? You might need to change the sbt build file.
Oh, I forgot: you also need to update
Test build #49820 has finished for PR 10851 at commit
Test build #49824 has finished for PR 10851 at commit
import java.io.OutputStream;

/**
 * An implementation of Count-Min sketch data structure for the following data types:
an interface?
I'd just start with "A Count-Min sketch is a probabilistic data structure ...", i.e. your second paragraph, and then explain the data types supported.
 * Note that only Count-Min sketches with the same {@code depth}, {@code width}, and random seed
 * can be merged.
 */
public abstract CountMinSketch mergeInPlace(CountMinSketch other);
declare that this could throw some exception?
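One way the suggestion could look is sketched below: a tiny count-min sketch whose `mergeInPlace` declares a checked exception and rejects sketches with a different depth, width, or seed. Everything here (`TinyCountMinSketch`, `IncompatibleMergeException`, the hashing scheme) is a hypothetical stand-in for illustration, not the API added by this PR.

```java
import java.util.Random;

/** Illustrative only: a minimal mergeable count-min sketch for long items. */
final class TinyCountMinSketch {
    /** Hypothetical public exception type, so callers in any package can catch it. */
    public static final class IncompatibleMergeException extends Exception {
        IncompatibleMergeException(String message) {
            super(message);
        }
    }

    private static final long PRIME_MODULUS = (1L << 31) - 1;

    private final int depth;
    private final int width;
    private final int seed;
    private final long[][] table;
    private final long[] hashA;

    TinyCountMinSketch(int depth, int width, int seed) {
        this.depth = depth;
        this.width = width;
        this.seed = seed;
        this.table = new long[depth][width];
        this.hashA = new long[depth];
        // The same seed yields the same multipliers, which is what makes
        // two sketches cell-by-cell compatible.
        Random r = new Random(seed);
        for (int i = 0; i < depth; i++) {
            hashA[i] = r.nextInt(Integer.MAX_VALUE);
        }
    }

    private int hash(long item, int row) {
        long hash = hashA[row] * item;
        hash += hash >> 32;
        hash &= PRIME_MODULUS;
        return ((int) hash) % width;
    }

    void add(long item, long count) {
        for (int i = 0; i < depth; i++) {
            table[i][hash(item, i)] += count;
        }
    }

    /** Returns the minimum counter over all rows; it never undercounts. */
    long estimateCount(long item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, table[i][hash(item, i)]);
        }
        return min;
    }

    /** Adds the other sketch's counters into this one; fails if the shapes or seeds differ. */
    TinyCountMinSketch mergeInPlace(TinyCountMinSketch other) throws IncompatibleMergeException {
        if (other == null) {
            throw new IncompatibleMergeException("Cannot merge null sketch");
        }
        if (depth != other.depth) {
            throw new IncompatibleMergeException("Cannot merge sketches of different depth");
        }
        if (width != other.width) {
            throw new IncompatibleMergeException("Cannot merge sketches of different width");
        }
        if (seed != other.seed) {
            throw new IncompatibleMergeException("Cannot merge sketches of different seed");
        }
        for (int i = 0; i < depth; i++) {
            for (int j = 0; j < width; j++) {
                table[i][j] += other.table[i][j];
            }
        }
        return this;
    }
}
```

Declaring the exception in the signature makes the compatibility requirement from the Javadoc visible at every call site.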
Test build #49826 has finished for PR 10851 at commit
Test build #49827 has finished for PR 10851 at commit
Test build #49848 has finished for PR 10851 at commit
Test build #49882 has finished for PR 10851 at commit
) = Seq(
  "core", "graphx", "mllib", "repl", "network-common", "network-shuffle", "launcher", "unsafe",
  "test-tags", "sketch"
).map(ProjectRef(buildLocation, _)) ++ sqlProjects ++ streamingProjects
Had to break it down since the original pattern match somehow introduced an implicit tuple containing more than 22 fields after adding the spark-sketch module.
Test build #49923 has finished for PR 10851 at commit
Test build #49925 has finished for PR 10851 at commit
import sun.misc.Unsafe;

final class Platform {
we should add some comment here explaining this is just a duplicate and is put here to minimize dependency.
Added comments for the duplicated Platform class and Murmur3_x86_32 class.
@liancheng can you make sure the generated javadocs look ok?
I've checked the Javadoc, it looks good.
I looked at this quickly (i.e. didn't do a detailed review), but changes lgtm.
Test build #2445 has started for PR 10851 at commit
Test build #49929 has finished for PR 10851 at commit
Going to merge this. Thanks. Would be great if @cloud-fan can take another look at the implementation.
CountMinSketchImpl that = (CountMinSketchImpl) other;

if (this.depth != that.depth) {
  throw new CMSMergeException("Cannot merge estimators of different depth");
CMSMergeException is a protected static class; can users catch this exception?
Thanks, actually I've fixed this issue in #10893.
This PR adds an initial implementation of count min sketch, contained in a new module spark-sketch under common/sketch. The implementation is based on the CountMinSketch class in stream-lib.

As required by the design doc, spark-sketch should have no external dependency. Two classes, Murmur3_x86_32 and Platform, are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation.

The following features will be added in future follow-up PRs: