[SPARK-7242][SQL][MLLIB] Frequent items for DataFrames #5799
brkyvz wants to merge 8 commits into apache:master
implemented frequent items
I think it looks like df.stat$.MODULE$.freqItems(). I don't know how else we can make it df.stat.freqItems in Scala.
take a look at how we implemented na.
let's put this in execution.stat?
It's annoying to add a top level package because we have rules to specifically exclude existing packages.
I'm going to let @mengxr comment on the actual algorithm implementation.
If multiple columns are provided, shall we search for frequent combinations of them instead of treating each column individually? For example, if I call freqItems(Array("gender", "title"), 0.01), I'm expecting the frequent combinations rather than the frequent items of each column. The current implementation is more flexible, because users can create a struct from multiple columns, and it allows finding frequent items on multiple columns in parallel. But I'm a little worried about what users expect when they call freqItems(Array("gender", "title")). @rxin
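To make the two possible semantics concrete, here is a small illustrative sketch in plain Python. It is not the Spark API: it uses exact counting rather than the approximate algorithm in this PR, and the sample rows, the `exact_frequent` helper, and the 0.3 support value are all made up for illustration. Tuples stand in for a struct built from several columns.

```python
from collections import Counter

# Hypothetical sample rows of (gender, title); not taken from the PR.
rows = [
    ("F", "engineer"),
    ("F", "engineer"),
    ("F", "manager"),
    ("M", "manager"),
    ("M", "analyst"),
    ("M", "designer"),
]


def exact_frequent(values, support):
    """Return the values occurring in more than `support` of all entries (exact count)."""
    counts = Counter(values)
    threshold = support * len(values)
    return {value for value, count in counts.items() if count > threshold}


support = 0.3

# Per-column semantics: each column is searched independently.
per_column = {
    "gender": exact_frequent([gender for gender, _ in rows], support),
    "title": exact_frequent([title for _, title in rows], support),
}

# Combination semantics: each (gender, title) pair is treated as one item,
# which is what searching a struct of the two columns would amount to.
combined = exact_frequent(rows, support)

print(per_column)
print(combined)
```

In this sample, "F" and "manager" are each frequent in their own column, yet the pair ("F", "manager") is not frequent as a combination, so the two readings of a multi-column call can give visibly different answers.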
don't forget to add java.util.List ones
also make sure you add a test to the JavaDataFrameSuite
Test build #31386 has finished for PR 5799 at commit
I think it's better to just have freqItems work on a per-column basis, and then I can add a struct expression to DataFrame so users can easily create composite columns to run freqItems on.
Test build #31392 has finished for PR 5799 at commit
Test build #31404 has finished for PR 5799 at commit
Finds frequent items, possibly with false positives, using the algorithm described in http://www.cs.umd.edu/~samir/498/karp.pdf.
public API under:
The output is a local DataFrame whose columns are the input column names with -freqItems appended. This is a single-pass algorithm that may return false positives, but no false negatives.
cc @mengxr @rxin
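As a rough illustration of why a single pass can produce false positives but no false negatives, here is a minimal plain-Python sketch of the counter-based scheme from the linked paper. It is not the code in this PR: the function name `freq_item_candidates`, the use of ceil(1 / support) - 1 counters, and the sample stream are assumptions made for the example.

```python
from math import ceil


def freq_item_candidates(items, support):
    """Single-pass candidate search for items occurring in more than `support` of the stream.

    Every item whose true frequency exceeds `support` is guaranteed to be
    returned; some infrequent items may be returned as well.
    """
    if not 0 < support <= 1:
        raise ValueError("support must be in (0, 1]")
    # With k = ceil(1 / support), keeping at most k - 1 counters is enough.
    max_counters = ceil(1 / support) - 1
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < max_counters:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those that hit zero.
            # The incoming item is discarded together with one count from each survivor.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)


# "a" makes up half of this 10-item stream; every other item appears once.
stream = list("abacadaeaf")
candidates = freq_item_candidates(stream, support=0.4)
print(candidates)
```

On this stream the result contains "a", which is truly frequent, and also "f", which appears only once but arrived last and still holds a counter. A second exact-counting pass over the candidates would filter such false positives, at the cost of no longer being single-pass.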
Let's get the implementation in; I can add the Python API in a follow-up PR.