[SPARK-19635][ML] DataFrame-based API for chi square test #17110

Closed
wants to merge 2 commits into from

jkbradley
Member

What changes were proposed in this pull request?

A wrapper around the RDD-based chi-square test that takes and returns a DataFrame.

How was this patch tested?

Copied unit tests from RDD-based API
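
For orientation, a call to the DataFrame-based test might look like the sketch below. It assumes the API as it appears in released Spark (2.2 and later), `org.apache.spark.ml.stat.ChiSquareTest.test`; the object name in this PR's commits may differ. The toy data, the local session settings, and the `ChiSquareUsageSketch` wrapper are illustrative assumptions, and Spark must be on the classpath.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object ChiSquareUsageSketch {

  /** Runs the test on toy data and returns (pValues, degreesOfFreedom, statistics). */
  def run(spark: SparkSession): (Vector, Seq[Int], Vector) = {
    import spark.implicits._

    // Toy (label, features) rows; every distinct value is treated as a category.
    val df = Seq(
      (0.0, Vectors.dense(0.5, 10.0)),
      (0.0, Vectors.dense(1.5, 20.0)),
      (1.0, Vectors.dense(1.5, 30.0)),
      (0.0, Vectors.dense(3.5, 30.0)),
      (0.0, Vectors.dense(3.5, 40.0)),
      (1.0, Vectors.dense(3.5, 40.0))
    ).toDF("label", "features")

    // The result is a single row with one entry per feature in each column.
    val row = ChiSquareTest.test(df, "features", "label").head()
    (row.getAs[Vector](0), row.getSeq[Int](1), row.getAs[Vector](2))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("chi-square-sketch")
      .getOrCreate()
    val (pValues, degreesOfFreedom, statistics) = run(spark)
    println(s"pValues = $pValues")
    println(s"degreesOfFreedom = ${degreesOfFreedom.mkString("[", ",", "]")}")
    println(s"statistics = $statistics")
    spark.stop()
  }
}
```

The label column only needs to be numeric, since the wrapper casts it to double before handing the data to the RDD-based implementation.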

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73644 has finished for PR 17110 at commit a9a8225.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import spark.implicits._

SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
SchemaUtils.checkNumericType(dataset.schema, labelCol)
Contributor

Shouldn't the chi-square test work for binary type as well? Or do we not want to support that?

Member Author

Sounds reasonable, but let's do that in the future; this is already a lot more types than the RDD-based API supports.

SchemaUtils.checkNumericType(dataset.schema, labelCol)
val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)]
  .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) }
val testResults = OldStatistics.chiSqTest(rdd)
Contributor

It would be nice to optimize this in the future: since we have the schema, if the label and features have already been converted to categorical, we can get the unique values right away instead of regenerating the maps of distinct labels and features.

Member Author

Definitely; feel free to make a JIRA for it.
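
The optimization discussed in this thread can be illustrated outside Spark. This is a minimal sketch using plain Scala collections, not the MLlib internals: the first function scans the data to build value-to-index maps before counting, while the second skips that scan because the values are assumed to be category indices already and the cardinalities are assumed known (for example, from column metadata). All names are hypothetical.

```scala
// Hypothetical illustration of building a feature-by-label contingency table.
object ContingencySketch {

  /** Counts co-occurrences of raw (label, feature) values, discovering categories by scanning. */
  def countsByScanning(pairs: Seq[(String, String)]): Array[Array[Long]] = {
    // Extra passes over the data: assign an index to every distinct value.
    val labelIndex = pairs.map(_._1).distinct.sorted.zipWithIndex.toMap
    val featureIndex = pairs.map(_._2).distinct.sorted.zipWithIndex.toMap
    val table = Array.ofDim[Long](featureIndex.size, labelIndex.size)
    pairs.foreach { case (label, feature) =>
      table(featureIndex(feature))(labelIndex(label)) += 1L
    }
    table
  }

  /** Same table when values are already indices and the number of categories is known up front. */
  def countsFromKnownCardinality(
      pairs: Seq[(Int, Int)],
      numFeatureValues: Int,
      numLabels: Int): Array[Array[Long]] = {
    // No distinct-value maps are needed: the indices address the table directly.
    val table = Array.ofDim[Long](numFeatureValues, numLabels)
    pairs.foreach { case (label, feature) =>
      table(feature)(label) += 1L
    }
    table
  }
}
```

Both functions produce the same counts for equivalent input; the second simply avoids the passes that compute the distinct values.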

// Detect continuous features or labels
val random = new Random(11L)
val continuousLabel =
  Seq.fill(100000)(LabeledPoint(random.nextDouble(), Vectors.dense(random.nextInt(2))))
Contributor

Can the special value that is above the max categorical limit of 10000 be refactored into a constant?

Member Author

Good idea, done now.
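
The idea behind this request can be sketched in plain Scala. This is a minimal illustration under stated assumptions: the constant name `MaxCategories` and the exception type are hypothetical stand-ins, and the RDD-based implementation and this PR's tests may name and report things differently. The limit of 10000 mentioned in the comment lives in one place, and a test for continuous data can derive its size from that constant instead of hard-coding a larger number.

```scala
import scala.collection.mutable

// Hypothetical sketch of a categorical-limit check with a named constant.
object CategoryLimitSketch {

  /** Assumed name for the limit of 10000 distinct values mentioned in the review. */
  val MaxCategories: Int = 10000

  /**
   * Returns the number of distinct values in a column, or throws once the
   * column has more than MaxCategories distinct values (i.e. looks continuous).
   */
  def requireCategorical(values: Iterator[Double], column: String): Int = {
    val seen = mutable.HashSet.empty[Double]
    values.foreach { v =>
      seen += v
      if (seen.size > MaxCategories) {
        // Stand-in exception type; chosen only for this sketch.
        throw new IllegalArgumentException(
          s"Column $column has more than $MaxCategories distinct values")
      }
    }
    seen.size
  }
}
```

A test that needs a "continuous" column can then generate `CategoryLimitSketch.MaxCategories + 1` distinct values, so it stays valid if the limit ever changes.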

@jkbradley
Member Author

Actually, synced with @thunterdb and will update design doc to put everything under a "Statistics" object. I'll wait until #17108 gets merged.

@imatiach-msft
Contributor

cool, I'll hold off on reviewing this for now then

@jkbradley
Member Author

I just reversed my opinion about a shared "Statistics" object. See #17108 (comment) for details.

I pushed an update per your review @imatiach-msft

@SparkQA

SparkQA commented Mar 9, 2017

Test build #74227 has finished for PR 17110 at commit 19fa02a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Ping @imatiach-msft any more comments after the update?

@imatiach-msft
Contributor

LGTM! nice addition :)

@imatiach-msft
Contributor

imatiach-msft commented Mar 13, 2017

My only concern is that the testing is a bit sparse, but more tests can be added in the future (especially when the MLlib part is removed). It seems better to move more tests from MLlib to ML over time.

@thunterdb
Contributor

@jkbradley LGTM, thanks!

@jkbradley
Member Author

OK merging with master
Thanks @imatiach-msft and @thunterdb !

@imatiach-msft I agree about sparse testing. This has all of the MLlib tests, but we should add more in the future.

@asfgit closed this in 4c32005 on Mar 17, 2017