[SPARK-19635][ML] DataFrame-based API for chi square test #17110

jkbradley · 2017-03-01T02:36:12Z

What changes were proposed in this pull request?

Wrapper taking and returning a DataFrame

How was this patch tested?

Copied unit tests from RDD-based API

SparkQA · 2017-03-01T03:30:22Z

Test build #73644 has finished for PR 17110 at commit a9a8225.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

imatiach-msft · 2017-03-01T22:17:19Z

mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquare.scala

+    import spark.implicits._
+
+    SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
+    SchemaUtils.checkNumericType(dataset.schema, labelCol)


Shouldn't the chi square test work for binary type as well? Or do we not want to support that?

Sounds reasonable, but let's do that in the future; this is already a lot more types than the RDD-based API supports.

imatiach-msft · 2017-03-01T23:05:35Z

mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquare.scala

+    SchemaUtils.checkNumericType(dataset.schema, labelCol)
+    val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)]
+      .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) }
+    val testResults = OldStatistics.chiSqTest(rdd)


It would be nice to optimize this in the future: since we have the schema, if the label and features have already been converted to categorical, we can get the unique values right away instead of having to regenerate the maps of distinct labels and features.

Definitely; feel free to make a JIRA for it.
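
For readers following the thread, the "maps for distinct labels and features" can be illustrated with a small sketch. A Pearson chi-squared independence test needs a contingency table of feature values against labels, and building that table requires an index for each distinct value. The plain-Scala code below is only an illustration for a single feature, without Spark; the names `ChiSquareSketch` and `statistic` are hypothetical, and this is not the code behind `OldStatistics.chiSqTest`.

```scala
object ChiSquareSketch {
  /** Pearson chi-squared statistic and degrees of freedom for (label, feature) pairs. */
  def statistic(pairs: Seq[(Double, Double)]): (Double, Int) = {
    require(pairs.nonEmpty, "need at least one observation")

    // Index maps for the distinct values. These are the maps that categorical
    // metadata in the schema could, in principle, supply up front.
    val labelIndex = pairs.map(_._1).distinct.sorted.zipWithIndex.toMap
    val featureIndex = pairs.map(_._2).distinct.sorted.zipWithIndex.toMap

    // Contingency table: rows are feature values, columns are labels.
    val observed = Array.ofDim[Double](featureIndex.size, labelIndex.size)
    pairs.foreach { case (label, feature) =>
      observed(featureIndex(feature))(labelIndex(label)) += 1.0
    }

    val rowSums = observed.map(_.sum)
    val colSums = observed.transpose.map(_.sum)
    val total = rowSums.sum

    // Sum of (observed - expected)^2 / expected over all cells.
    val stat = (for {
      i <- observed.indices
      j <- observed(i).indices
    } yield {
      val expected = rowSums(i) * colSums(j) / total
      val diff = observed(i)(j) - expected
      diff * diff / expected
    }).sum

    (stat, (featureIndex.size - 1) * (labelIndex.size - 1))
  }
}
```

If the distinct values were already known from column metadata, the two `distinct.sorted` passes in this sketch are the part that could be skipped.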

imatiach-msft · 2017-03-01T23:08:56Z

mllib/src/test/scala/org/apache/spark/ml/stat/ChiSquareSuite.scala

+    // Detect continuous features or labels
+    val random = new Random(11L)
+    val continuousLabel =
+      Seq.fill(100000)(LabeledPoint(random.nextDouble(), Vectors.dense(random.nextInt(2))))


can the special value that is above the max categorical limit of 10000 be refactored to a constant?

Good idea, done now.
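
As a rough illustration of the refactoring being asked for, the limit can live in one named constant that both the check and the tests refer to. The sketch below is plain Scala and purely hypothetical: `CategoryLimitSketch`, `maxCategories` and `looksContinuous` are names made up for this example, and the value 10000 is taken from the review comment above rather than from the patch itself.

```scala
import scala.collection.mutable

object CategoryLimitSketch {
  // Assumed limit, as quoted in the review comment above.
  val maxCategories: Int = 10000

  /** True if the column has more distinct values than the categorical limit allows. */
  def looksContinuous(values: Iterator[Double]): Boolean = {
    val seen = mutable.HashSet.empty[Double]
    // Stop as soon as the limit is exceeded instead of scanning every value.
    values.exists { v =>
      seen += v
      seen.size > maxCategories
    }
  }
}
```

A test could then size its data relative to the constant, for example with `CategoryLimitSketch.maxCategories + 1` distinct values, instead of hard-coding a number that happens to be large enough.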

jkbradley · 2017-03-03T19:28:08Z

Actually, synced with @thunterdb and will update design doc to put everything under a "Statistics" object. I'll wait until #17108 gets merged.

imatiach-msft · 2017-03-06T05:51:39Z

cool, I'll hold off on reviewing this for now then

jkbradley · 2017-03-08T23:12:36Z

I just reversed my opinion about a shared "Statistics" object. See #17108 (comment) for details.

I pushed an update per your review @imatiach-msft

SparkQA · 2017-03-09T00:05:36Z

Test build #74227 has finished for PR 17110 at commit 19fa02a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-03-13T23:31:08Z

Ping @imatiach-msft any more comments after the update?

imatiach-msft · 2017-03-13T23:32:50Z

LGTM! nice addition :)

imatiach-msft · 2017-03-13T23:38:00Z

I guess my only concern is that the testing is a bit sparse, but more tests can be added in the future (especially when the MLlib part is removed). It would be better to move more tests from MLlib to ML over time.

thunterdb · 2017-03-14T21:12:02Z

@jkbradley LGTM, thanks!

jkbradley · 2017-03-17T00:09:41Z

OK merging with master
Thanks @imatiach-msft and @thunterdb !

@imatiach-msft I agree about sparse testing. This has all of the MLlib tests, but we should add more in the future.

DF-based api for chi square test

a9a8225

imatiach-msft reviewed Mar 1, 2017


imatiach-msft approved these changes Mar 1, 2017


update max on num categories for chisqtest to be stored as static val

19fa02a

jkbradley mentioned this pull request Mar 8, 2017

[SPARK-19636][ML] Feature parity for correlation statistics in MLlib #17108

Closed

asfgit closed this in 4c32005 Mar 17, 2017
