-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19635][ML] DataFrame-based API for chi square test #17110
Conversation
Test build #73644 has finished for PR 17110 at commit
|
import spark.implicits._ | ||
|
||
SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT) | ||
SchemaUtils.checkNumericType(dataset.schema, labelCol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't chi square test work for binary type as well? or we don't want to support that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable, but let's do that in the future; this is already a lot more types than the RDD-based API supports.
SchemaUtils.checkNumericType(dataset.schema, labelCol) | ||
val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)] | ||
.rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) } | ||
val testResults = OldStatistics.chiSqTest(rdd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to optimize this in the future -- since we have schema, if the label and features have been converted to categorical, we can get the unique values right away instead of having to re-generate the maps for distinct labels and features
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely; feel free to make a JIRA for it.
// Detect continuous features or labels | ||
val random = new Random(11L) | ||
val continuousLabel = | ||
Seq.fill(100000)(LabeledPoint(random.nextDouble(), Vectors.dense(random.nextInt(2)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the special value that is above the max categorical limit of 10000 be refactored to a constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, done now.
Actually, synced with @thunterdb and will update design doc to put everything under a "Statistics" object. I'll wait until #17108 gets merged. |
cool, I'll hold off on reviewing this for now then |
I just reversed my opinion about a shared "Statistics" object. See #17108 (comment) for details. I pushed an update per your review @imatiach-msft |
Test build #74227 has finished for PR 17110 at commit
|
Ping @imatiach-msft any more comments after the update? |
LGTM! nice addition :) |
I guess my only concern would be the testing is a bit sparse, but more tests can be added in the future (especially when the MLlib part is removed). It seems it would be better to move more tests from MLlib -> ML over time. |
@jkbradley LGTM, thanks! |
OK merging with master @imatiach-msft I agree about sparse testing. This has all of the MLlib tests, but we should add more in the future. |
What changes were proposed in this pull request?
Wrapper taking and return a DataFrame
How was this patch tested?
Copied unit tests from RDD-based API