New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12933][SQL] Initial implementation of Count-Min sketch #10851
Closed
Closed
Changes from all commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
f253068
Adds spark-sketch skeleton
liancheng 1ecfd26
Moves sketch to common/sketch
liancheng c9b0924
Copies Murmur3_x86_32 and Platform from unsafe
liancheng d920a3a
Initial version of CountMinSketch
liancheng 36ce077
Tests accuracy for all supported types
liancheng 814317f
Tests mergeInPlace
liancheng 3d3d81c
Adds sketch module to sparktestsupport
liancheng 9dd56e1
Moves Murmur3_x86_32 and Platform to package o.a.s.util.sketch.unsafe
liancheng 1208a0f
Ensures that mergeInPlace merges in place
liancheng 64b2162
Javadoc
liancheng 600e25a
Adds sketch to SparkBuild
liancheng 1f0a120
Moves Platform and Murmur3_x86_32 to o.a.s.u.sketch and makes them pa…
liancheng e06ff13
Removes unused CountMinSketchImpl constructor
liancheng 4adb57a
Fixes SparkBuild
liancheng 57a31e6
Excludes sketch from MIMA check
liancheng 65853ad
Addresses comments
liancheng File filter
Filter by extension
Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!-- | ||
~ Licensed to the Apache Software Foundation (ASF) under one or more | ||
~ contributor license agreements. See the NOTICE file distributed with | ||
~ this work for additional information regarding copyright ownership. | ||
~ The ASF licenses this file to You under the Apache License, Version 2.0 | ||
~ (the "License"); you may not use this file except in compliance with | ||
~ the License. You may obtain a copy of the License at | ||
~ | ||
~ http://www.apache.org/licenses/LICENSE-2.0 | ||
~ | ||
~ Unless required by applicable law or agreed to in writing, software | ||
~ distributed under the License is distributed on an "AS IS" BASIS, | ||
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
~ See the License for the specific language governing permissions and | ||
~ limitations under the License. | ||
--> | ||
|
||
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" | ||
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> | ||
<modelVersion>4.0.0</modelVersion> | ||
<parent> | ||
<groupId>org.apache.spark</groupId> | ||
<artifactId>spark-parent_2.10</artifactId> | ||
<version>2.0.0-SNAPSHOT</version> | ||
<relativePath>../../pom.xml</relativePath> | ||
</parent> | ||
|
||
<groupId>org.apache.spark</groupId> | ||
<artifactId>spark-sketch_2.10</artifactId> | ||
<packaging>jar</packaging> | ||
<name>Spark Project Sketch</name> | ||
<url>http://spark.apache.org/</url> | ||
<properties> | ||
<sbt.project.name>sketch</sbt.project.name> | ||
</properties> | ||
|
||
<build> | ||
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory> | ||
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory> | ||
</build> | ||
</project> |
132 changes: 132 additions & 0 deletions
132
common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,132 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.util.sketch; | ||
|
||
import java.io.InputStream; | ||
import java.io.OutputStream; | ||
|
||
/** | ||
* A Count-Min sketch is a probabilistic data structure used for summarizing streams of data in | ||
* sub-linear space. Currently, supported data types include: | ||
* <ul> | ||
* <li>{@link Byte}</li> | ||
* <li>{@link Short}</li> | ||
* <li>{@link Integer}</li> | ||
* <li>{@link Long}</li> | ||
* <li>{@link String}</li> | ||
* </ul> | ||
* Each {@link CountMinSketch} is initialized with a random seed, and a pair | ||
* of parameters: | ||
* <ol> | ||
* <li>relative error (or {@code eps}), and | ||
* <li>confidence (or {@code delta}) | ||
* </ol> | ||
* Suppose you want to estimate the number of times an element {@code x} has appeared in a data | ||
* stream so far. With probability {@code delta}, the estimate of this frequency is within the | ||
* range {@code true frequency <= estimate <= true frequency + eps * N}, where {@code N} is the | ||
* total count of items have appeared the the data stream so far. | ||
* | ||
* Under the cover, a {@link CountMinSketch} is essentially a two-dimensional {@code long} array | ||
* with depth {@code d} and width {@code w}, where | ||
* <ul> | ||
* <li>{@code d = ceil(2 / eps)}</li> | ||
* <li>{@code w = ceil(-log(1 - confidence) / log(2))}</li> | ||
* </ul> | ||
* | ||
* See http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf for technical details, | ||
* including proofs of the estimates and error bounds used in this implementation. | ||
* | ||
* This implementation is largely based on the {@code CountMinSketch} class from stream-lib. | ||
*/ | ||
abstract public class CountMinSketch { | ||
/** | ||
* Returns the relative error (or {@code eps}) of this {@link CountMinSketch}. | ||
*/ | ||
public abstract double relativeError(); | ||
|
||
/** | ||
* Returns the confidence (or {@code delta}) of this {@link CountMinSketch}. | ||
*/ | ||
public abstract double confidence(); | ||
|
||
/** | ||
* Depth of this {@link CountMinSketch}. | ||
*/ | ||
public abstract int depth(); | ||
|
||
/** | ||
* Width of this {@link CountMinSketch}. | ||
*/ | ||
public abstract int width(); | ||
|
||
/** | ||
* Total count of items added to this {@link CountMinSketch} so far. | ||
*/ | ||
public abstract long totalCount(); | ||
|
||
/** | ||
* Adds 1 to {@code item}. | ||
*/ | ||
public abstract void add(Object item); | ||
|
||
/** | ||
* Adds {@code count} to {@code item}. | ||
*/ | ||
public abstract void add(Object item, long count); | ||
|
||
/** | ||
* Returns the estimated frequency of {@code item}. | ||
*/ | ||
public abstract long estimateCount(Object item); | ||
|
||
/** | ||
* Merges another {@link CountMinSketch} with this one in place. | ||
* | ||
* Note that only Count-Min sketches with the same {@code depth}, {@code width}, and random seed | ||
* can be merged. | ||
*/ | ||
public abstract CountMinSketch mergeInPlace(CountMinSketch other); | ||
|
||
/** | ||
* Writes out this {@link CountMinSketch} to an output stream in binary format. | ||
*/ | ||
public abstract void writeTo(OutputStream out); | ||
|
||
/** | ||
* Reads in a {@link CountMinSketch} from an input stream. | ||
*/ | ||
public static CountMinSketch readFrom(InputStream in) { | ||
throw new UnsupportedOperationException("Not implemented yet"); | ||
} | ||
|
||
/** | ||
* Creates a {@link CountMinSketch} with given {@code depth}, {@code width}, and random | ||
* {@code seed}. | ||
*/ | ||
public static CountMinSketch create(int depth, int width, int seed) { | ||
return new CountMinSketchImpl(depth, width, seed); | ||
} | ||
|
||
/** | ||
* Creates a {@link CountMinSketch} with given relative error ({@code eps}), {@code confidence}, | ||
* and random {@code seed}. | ||
*/ | ||
public static CountMinSketch create(double eps, double confidence, int seed) { | ||
return new CountMinSketchImpl(eps, confidence, seed); | ||
} | ||
} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
declare that this could throw some exception?