[SPARK-12938][SQL] DataFrame API for Bloom filter #10937

Closed
wants to merge 3 commits into from

cloud-fan
Contributor

This PR integrates the Bloom filter from spark-sketch into the DataFrame API. This version uses RDD.aggregate to build the filter; a more performant UDAF-based version can follow in later PRs.

This PR also adds two specialized variants of put (putBinary and putLong) to BloomFilter, which make it easier to build a Bloom filter over a DataFrame.
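
As a rough illustration of the aggregate-based approach described above, the following self-contained Java sketch folds values into one filter per partition (the seqOp step) and then ORs the partial filters together (the combOp step). Everything in it is an assumption made for illustration: `ToyBloomFilter`, its hash mixing, and `buildFromPartitions` are hypothetical and are not the spark-sketch implementation or the code in this PR. Only the names putLong and putBinary come from the description.

```java
import java.util.BitSet;
import java.util.List;

// Toy Bloom filter for illustration only; this is NOT the spark-sketch implementation.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Specialized variant for long values: no boxing and no runtime type check.
    boolean putLong(long item) { return setBits(mix(item)); }

    // Specialized variant for raw bytes, e.g. a UTF-8 encoded string.
    boolean putBinary(byte[] item) { return setBits(hashBytes(item)); }

    boolean mightContainLong(long item) { return testBits(mix(item)); }

    boolean mightContainBinary(byte[] item) { return testBits(hashBytes(item)); }

    // combOp: OR the bit sets of two filters that share the same parameters.
    ToyBloomFilter mergeInPlace(ToyBloomFilter other) {
        if (other.numBits != numBits || other.numHashes != numHashes) {
            throw new IllegalArgumentException("incompatible filters");
        }
        bits.or(other.bits);
        return this;
    }

    // Mimics rdd.aggregate(zero)(seqOp, combOp) over in-memory "partitions".
    static ToyBloomFilter buildFromPartitions(List<long[]> partitions, int numBits, int numHashes) {
        ToyBloomFilter result = new ToyBloomFilter(numBits, numHashes);
        for (long[] partition : partitions) {
            // Each partition starts from its own empty (zero) filter.
            ToyBloomFilter partial = new ToyBloomFilter(numBits, numHashes);
            for (long value : partition) {
                partial.putLong(value); // seqOp
            }
            result.mergeInPlace(partial); // combOp
        }
        return result;
    }

    // Double hashing: the i-th bit index is (h1 + i * h2) mod numBits.
    private int index(long h1, int i) {
        long h2 = mix(h1 ^ 0x9E3779B97F4A7C15L);
        return (int) Math.floorMod(h1 + i * h2, (long) numBits);
    }

    // Sets all bits for a hash; returns true if any bit was newly set.
    private boolean setBits(long h1) {
        boolean changed = false;
        for (int i = 0; i < numHashes; i++) {
            int idx = index(h1, i);
            if (!bits.get(idx)) {
                bits.set(idx);
                changed = true;
            }
        }
        return changed;
    }

    private boolean testBits(long h1) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, i))) {
                return false;
            }
        }
        return true;
    }

    // Simple polynomial hash over the bytes, followed by a 64-bit mix.
    private static long hashBytes(byte[] data) {
        long h = 1125899906842597L;
        for (byte b : data) {
            h = 31 * h + b;
        }
        return mix(h);
    }

    // 64-bit finalizer-style mixing function (an arbitrary choice for this sketch).
    private static long mix(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }
}
```

Because merging is a bitwise OR, the combined filter reports every value that any partial filter saw, which is what makes a per-partition build followed by a merge safe. A lookup for a value that was never inserted may still return true, since Bloom filters allow false positives.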

@cloud-fan
Contributor Author

cc @rxin @liancheng

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50156 has finished for PR 10937 at commit a0dcaa8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class BloomFilterImpl extends BloomFilter implements Serializable

val seqOp: (BloomFilter, InternalRow) => BloomFilter = if (colType == StringType) {
  (filter, row) =>
    filter.putBinary(row.getUTF8String(0).getBytes)
    filter
Contributor

Could you also add a comment here to explain the branching?
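
For readers following this thread, here is one way the commented branching could look, sketched in Java rather than the Scala of the diff. It assumes the intent is to pick the put variant once per column type, outside the per-row function. The names `SeqOpBranching`, `ColumnType`, and `RecordingFilter` are hypothetical stand-ins, and the non-string branch is an assumption because the excerpt above shows only the StringType case.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

final class SeqOpBranching {
    // Hypothetical stand-in for the column's data type.
    enum ColumnType { STRING, INTEGRAL }

    // Hypothetical stand-in for the filter; it only records which variant was called.
    static final class RecordingFilter {
        final List<String> calls = new ArrayList<>();

        void putBinary(byte[] bytes) { calls.add("putBinary:" + bytes.length); }

        void putLong(long value) { calls.add("putLong:" + value); }
    }

    static BiFunction<RecordingFilter, Object, RecordingFilter> seqOpFor(ColumnType type) {
        // Branch on the column type once, here, rather than inside the per-row
        // function, so that each row only pays for the put call itself.
        switch (type) {
            case STRING:
                // Strings go through the byte-array variant via their UTF-8 bytes.
                return (filter, value) -> {
                    filter.putBinary(((String) value).getBytes(StandardCharsets.UTF_8));
                    return filter;
                };
            case INTEGRAL:
                // Assumed branch: integral values are widened to long.
                return (filter, value) -> {
                    filter.putLong(((Number) value).longValue());
                    return filter;
                };
            default:
                throw new IllegalArgumentException("unsupported column type: " + type);
        }
    }
}
```

Returning the filter from each function mirrors the shape of the seqOp in the diff, where the accumulator is handed back after every row.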

@@ -96,6 +96,16 @@ int getVersionNumber() {
  public abstract boolean put(Object item);

  /**
   * A specific version of {@link #put(Object)}, that can only be used to put byte array.
Contributor

specific -> specialized

version -> variant

@rxin
Contributor

rxin commented Jan 27, 2016

Since the two sketches (count-min sketch and Bloom filter) were implemented by two different people, it'd be great for one of you to go through both and make sure everything is consistent. We can do that in a follow-up pull request.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50208 has finished for PR 10937 at commit bd0671c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class Utils

@rxin
Contributor

rxin commented Jan 27, 2016

Thanks - going to merge this.

asfgit closed this in 680afab on Jan 27, 2016