[SPARK-7242][SQL][MLLIB] Frequent items for DataFrames#5799

Closed
brkyvz wants to merge 8 commits into apache:master from brkyvz:freq-items

brkyvz commented Apr 30, 2015

Finds frequent items, possibly with false positives, using the algorithm described in http://www.cs.umd.edu/~samir/498/karp.pdf.
The public API is:

df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame

The output is a local DataFrame whose columns are the input column names with -freqItems appended. This is a single-pass algorithm that may return false positives, but no false negatives.
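For readers unfamiliar with the counter-based idea behind the cited paper, here is a minimal sketch in plain Scala. It is not the code in this PR: the object name `FreqItemsSketch`, the method `candidates`, and the choice of ceil(1 / support) counters are assumptions made for illustration only.

```scala
import scala.collection.mutable

// Hypothetical illustration of a single-pass, counter-based candidate search.
// It is NOT the implementation from this pull request.
object FreqItemsSketch {
  /** Returns candidates for items whose relative frequency is at least `support`.
    * Every truly frequent item is returned, but some infrequent items may be too.
    */
  def candidates[T](items: Iterator[T], support: Double): Set[T] = {
    require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
    // Assumption: keep at most ceil(1 / support) counters at any time.
    val maxCounters = math.ceil(1.0 / support).toInt
    val counts = mutable.Map.empty[T, Long]
    items.foreach { item =>
      if (counts.contains(item)) {
        counts(item) += 1L
      } else if (counts.size < maxCounters) {
        counts(item) = 1L
      } else {
        // No free counter: decrement every counter and drop those that reach
        // zero. The incoming item itself is not stored.
        counts.keys.toList.foreach { key =>
          val c = counts(key) - 1L
          if (c == 0L) counts.remove(key) else counts(key) = c
        }
      }
    }
    counts.keySet.toSet
  }
}
```

With the input a, b, a, c, a, d, a, e and a support of 0.4, the sketch returns both "a" (a true frequent item, 4 of 8) and "e" (a false positive that happened to arrive last). A second pass over the data would be needed to filter out such false positives.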

cc @mengxr @rxin

Let's get the implementation in; I can add the Python API in a follow-up PR.

implemented frequent items
Contributor

does this work in java?

Contributor Author

I think it looks like df.stat$.MODULE$.freqItems(). I don't know how we can otherwise make it df.stat.freqItems in scala.

Contributor

take a look at how we implemented na.

Contributor Author

aha! I like it
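The pattern being pointed to can be sketched roughly as follows, assuming `na` is exposed as an ordinary method that returns a helper object. `MiniFrame`, `MiniStatFunctions`, and `distinctItems` are hypothetical stand-ins, not Spark classes or methods.

```scala
// Hypothetical stand-ins for illustration; these are not the Spark classes.
final class MiniFrame(val columns: Map[String, Seq[String]]) {
  /** A plain method returning a helper object. Scala callers write `frame.stat`,
    * Java callers write `frame.stat()`, and no `$.MODULE$` access is needed.
    */
  def stat: MiniStatFunctions = new MiniStatFunctions(this)
}

final class MiniStatFunctions(df: MiniFrame) {
  /** Placeholder operation: the distinct values of each requested column. */
  def distinctItems(cols: Array[String]): Map[String, Set[String]] =
    cols.map(c => c -> df.columns(c).toSet).toMap
}
```

Because `stat` is a method on the class rather than a nested object, the same call shape works from both languages: `frame.stat.distinctItems(Array("gender"))` in Scala and `frame.stat().distinctItems(new String[]{"gender"})` in Java.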

Contributor

let's put this in execution.stat?

It's annoying to add a top level package because we have rules to specifically exclude existing packages.

rxin commented Apr 30, 2015

I'm going to let @mengxr comment on the actual algorithm implementation.

Contributor

If multiple columns are provided, shall we search the combination of them instead of each individually? For example, if I call

freqItems(Array("gender", "title"), 0.01)

I would expect the frequent combinations rather than the frequent items of each column. The current implementation is more flexible, because users can create a struct from multiple columns, and it also makes it possible to find frequent items on multiple columns in parallel. But I'm a little worried about what users expect when they call freqItems(Array("gender", "title")) @rxin
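To make the difference between the two readings concrete, here is a small plain-Scala sketch. It uses exact counting rather than the approximate algorithm, purely to keep the example short; `CompositeVsPerColumn`, `Person`, `frequent`, and the sample rows are all hypothetical.

```scala
// Hypothetical illustration: per-column frequent values versus frequent
// combinations. Exact counting is used here only for simplicity.
object CompositeVsPerColumn {
  final case class Person(gender: String, title: String)

  /** Values whose relative frequency in `keys` is at least `support`. */
  def frequent[K](keys: Seq[K], support: Double): Set[K] = {
    val threshold = support * keys.size
    keys.groupBy(identity).collect { case (k, vs) if vs.size >= threshold => k }.toSet
  }

  val people: Seq[Person] = Seq(
    Person("F", "Dr"), Person("M", "Dr"), Person("F", "Ms"), Person("M", "Mr"))

  // Each column on its own: "F" and "M" each cover 2 of 4 rows, "Dr" covers 2 of 4.
  val frequentGenders: Set[String] = frequent(people.map(_.gender), 0.5)
  val frequentTitles: Set[String] = frequent(people.map(_.title), 0.5)

  // The combination treated as one composite key: every pair covers 1 of 4 rows.
  val frequentPairs: Set[(String, String)] =
    frequent(people.map(p => (p.gender, p.title)), 0.5)
}
```

Here the per-column results are non-empty while `frequentPairs` is empty, so the two interpretations can give quite different answers for the same call.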

Contributor

don't forget to add java.util.List ones
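One way such an overload could look is sketched below, assuming the Java-facing variant simply copies the list and delegates to the Scala one. `ColumnPicker` and `pick` are hypothetical names used only for this illustration, not part of the PR.

```scala
import java.util.{List => JList}

// Hypothetical helper class for illustration; not the actual Spark API.
final class ColumnPicker(available: Set[String]) {
  /** Scala-facing variant: keeps only the requested columns that exist. */
  def pick(cols: Seq[String]): Seq[String] = cols.filter(available.contains)

  /** Java-facing overload: copies the list into a Seq and delegates. */
  def pick(cols: JList[String]): Seq[String] = {
    val builder = Seq.newBuilder[String]
    val it = cols.iterator()
    while (it.hasNext) builder += it.next()
    pick(builder.result())
  }
}
```

Keeping the logic in one variant and having the other delegate means the two entry points cannot drift apart in behavior.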

Contributor

also make sure you add a test to the JavaDataFrameSuite


SparkQA commented Apr 30, 2015

Test build #31386 has finished for PR 5799 at commit 8279d4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.


rxin commented Apr 30, 2015

I think it's better to just have freqItems work on a per-column basis, and then I can add a struct expression to data frame so users can easily create composite columns to run freqItems on.


SparkQA commented Apr 30, 2015

Test build #31392 has finished for PR 5799 at commit 482e741.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch adds the following new dependencies:
    • jaxb-api-2.2.7.jar
    • jaxb-core-2.2.7.jar
    • jaxb-impl-2.2.7.jar
    • pmml-agent-1.1.15.jar
    • pmml-model-1.1.15.jar
    • pmml-schema-1.1.15.jar
  • This patch removes the following dependencies:
    • activation-1.1.jar
    • jaxb-api-2.2.2.jar
    • jaxb-impl-2.2.3-1.jar


SparkQA commented Apr 30, 2015

Test build #31404 has finished for PR 5799 at commit 3a5c177.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch adds the following new dependencies:
    • jaxb-api-2.2.7.jar
    • jaxb-core-2.2.7.jar
    • jaxb-impl-2.2.7.jar
    • pmml-agent-1.1.15.jar
    • pmml-model-1.1.15.jar
    • pmml-schema-1.1.15.jar
  • This patch removes the following dependencies:
    • activation-1.1.jar
    • jaxb-api-2.2.2.jar
    • jaxb-impl-2.2.3-1.jar

4 participants