[SPARK-7242][SQL][MLLIB] Frequent items for DataFrames by brkyvz · Pull Request #5799 · apache/spark

brkyvz · 2015-04-30T05:33:13Z

Finds frequent items, possibly with false positives, using the algorithm described in http://www.cs.umd.edu/~samir/498/karp.pdf.
The public API is exposed under df.stat:

df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame

The output is a local DataFrame whose column names are the input column names with -freqItems appended. This is a single-pass algorithm that may return false positives, but no false negatives.
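For readers who want to see why a single pass can give "false positives but no false negatives", here is a minimal sketch of a counter-based frequent-items algorithm in the style of the linked paper. It is not the code from this PR: the object name `FreqItemsSketch`, the method signature, and the choice of counter budget are assumptions made for illustration, and it works on a plain Scala iterator rather than a DataFrame.

```scala
import scala.collection.mutable

// Hypothetical helper, not the implementation in this PR.
object FreqItemsSketch {
  /** Returns a superset of the items that occur in more than `support * n` of the n inputs. */
  def freqItems[T](items: Iterator[T], support: Double): Set[T] = {
    require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
    // With k = ceil(1 / support), keeping k - 1 counters (at least one) is
    // enough to guarantee that no sufficiently frequent item is lost.
    val maxCounters = math.max(1, math.ceil(1.0 / support).toInt - 1)
    val counts = mutable.Map.empty[T, Long]
    items.foreach { item =>
      if (counts.contains(item)) {
        counts(item) += 1L
      } else if (counts.size < maxCounters) {
        counts(item) = 1L
      } else {
        // No free counter: decrement every tracked item and drop those that
        // reach zero. The incoming item is discarded in the same step.
        counts.keys.toList.foreach { key =>
          val c = counts(key) - 1L
          if (c == 0L) counts.remove(key) else counts(key) = c
        }
      }
    }
    // Survivors may include infrequent items (false positives); a second
    // pass with exact counts would be needed to filter them out.
    counts.keySet.toSet
  }
}
```

For example, `FreqItemsSketch.freqItems(Iterator("a", "b", "a", "c", "a"), 0.4)` always contains "a", because "a" accounts for more than 40% of the input. Whether other items survive depends on the input order.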

cc @mengxr @rxin

Let's get the implementation in; I can add the Python API in a follow-up PR.

implemented frequent items

rxin · 2015-04-30T05:36:24Z

does this work in java?

I think it looks like df.stat$.MODULE$.freqItems(). I don't know how we can otherwise make it df.stat.freqItems in scala.

take a look at how we implemented na.

aha! I like it
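The pattern being suggested in this thread can be sketched as follows: instead of a Scala `object` (which Java would reach through `$.MODULE$`), the frame exposes a plain method that returns a helper instance. This is only a toy illustration under that reading of the comments; `ToyFrame`, `ToyStatFunctions`, and `countNonNull` are hypothetical names and are not Spark classes or methods.

```scala
// Toy stand-ins for illustration; none of these names exist in Spark.
final class ToyFrame(val rows: Seq[Map[String, Any]]) {
  // A plain method returning a helper object. Scala callers write
  // `df.stat.countNonNull("x")`; Java callers write `df.stat().countNonNull("x")`.
  def stat: ToyStatFunctions = new ToyStatFunctions(this)
}

final class ToyStatFunctions(df: ToyFrame) {
  // Counts the rows in which `col` is present and not null.
  def countNonNull(col: String): Long =
    df.rows.count(row => row.get(col).exists(_ != null)).toLong
}
```

Because `stat` compiles to an ordinary instance method, both languages get the same call shape apart from Java's empty parentheses.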

rxin · 2015-04-30T05:37:45Z

let's put this in execution.stat?

It's annoying to add a top level package because we have rules to specifically exclude existing packages.

rxin · 2015-04-30T05:41:46Z

I'm going to let @mengxr comment on the actual algorithm implementation.

mengxr · 2015-04-30T05:56:34Z

If multiple columns are provided, shall we search the combination of them instead of each individually? For example, if I call

freqItems(Array("gender", "title"), 0.01)

I'm expecting the frequent combinations instead of each of them. The current implementation is more flexible because users can create a struct from multiple columns, and this allows finding frequent items on multiple columns in parallel. But I'm a little worried about what users expect when they call freqItems(Array("gender", "title")) @rxin
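To make the two interpretations concrete, here is a small sketch that contrasts per-column frequent values with frequent combinations. It uses exact counting on an in-memory sequence purely for illustration, not the approximate algorithm from this PR; the object name, the helper `frequent`, and the sample rows are all made up.

```scala
// Illustration only: exact counts on made-up data, not the PR's algorithm.
object ColumnsVsCombinations {
  /** Values that occur in more than `support * n` of the n inputs. */
  def frequent[T](values: Seq[T], support: Double): Set[T] = {
    val threshold = support * values.size
    values.groupBy(identity).iterator
      .collect { case (v, vs) if vs.size > threshold => v }
      .toSet
  }

  // Hypothetical (gender, title) rows.
  val rows: Seq[(String, String)] = Seq(
    ("F", "Dr"), ("F", "Dr"), ("M", "Mr"), ("M", "Mr"), ("M", "Dr"), ("F", "Ms"))

  // Each column searched on its own, as the current implementation does.
  val perColumnGender: Set[String] = frequent(rows.map(_._1), 0.3)
  val perColumnTitle: Set[String] = frequent(rows.map(_._2), 0.3)

  // The pair treated as a single composite item, as a struct column would be.
  val combinations: Set[(String, String)] = frequent(rows, 0.3)
}
```

On this data "M" and "Dr" are each frequent in their own column, yet the pair ("M", "Dr") is not a frequent combination, which is exactly the difference in expectations raised above.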

rxin · 2015-04-30T06:42:12Z

don't forget to add java.util.List ones

also make sure you add a test to the JavaDataFrameSuite

SparkQA · 2015-04-30T07:19:59Z

Test build #31386 has finished for PR 5799 at commit 8279d4d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

rxin · 2015-04-30T07:57:42Z

I think it's better to just have freqItems work on a per-column basis, and then I can add a struct expression to DataFrame so users can easily create composite columns to run freqItems on.

SparkQA · 2015-04-30T08:42:37Z

Test build #31392 has finished for PR 5799 at commit 482e741.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch adds the following new dependencies:
- jaxb-api-2.2.7.jar
- jaxb-core-2.2.7.jar
- jaxb-impl-2.2.7.jar
- pmml-agent-1.1.15.jar
- pmml-model-1.1.15.jar
- pmml-schema-1.1.15.jar
This patch removes the following dependencies:
- activation-1.1.jar
- jaxb-api-2.2.2.jar
- jaxb-impl-2.2.3-1.jar

SparkQA · 2015-04-30T10:27:48Z

Test build #31404 has finished for PR 5799 at commit 3a5c177.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch adds the following new dependencies:
- jaxb-api-2.2.7.jar
- jaxb-core-2.2.7.jar
- jaxb-impl-2.2.7.jar
- pmml-agent-1.1.15.jar
- pmml-model-1.1.15.jar
- pmml-schema-1.1.15.jar
This patch removes the following dependencies:
- activation-1.1.jar
- jaxb-api-2.2.2.jar
- jaxb-impl-2.2.3-1.jar

made base implementation

3d82168

implemented frequent items

rxin reviewed Apr 30, 2015

added default value for support

8279d4d

rxin reviewed Apr 30, 2015

mengxr reviewed Apr 30, 2015

brkyvz added 2 commits April 29, 2015 23:36

addressed comments v1.0

38e784d

removed old import

482e741

rxin reviewed Apr 30, 2015

addressed comments v2.0

3a5c177

brkyvz added 2 commits April 30, 2015 07:35

addressed comments v2.1

0915e23

removed toSeq

39b1bba


4 participants