-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22881][ML][TEST] ML regression package testsuite add StructuredStreaming test #19979
Conversation
Test build #84911 has finished for PR 19979 at commit
|
Thanks! Let's track these tasks in new JIRAs. I made one for regression just now: https://issues.apache.org/jira/browse/SPARK-22881 |
Test build #85376 has finished for PR 19979 at commit
|
Test build #85374 has finished for PR 19979 at commit
|
testTransformerByGlobalCheckFunc.
@WeichenXu123 it looks like |
…g data failed. ## What changes were proposed in this pull request? Fix OneVsRestModel transform on streaming data failed. ## How was this patch tested? UT will be added soon, once apache#19979 merged. (Need a helper test method there) Author: WeichenXu <weichen.xu@databricks.com> Closes apache#20077 from WeichenXu123/fix_ovs_model_transform.
Actually, going further than what Bago said: All of the places which use globalCheckFunction assume that Dataset.collect() returns the Rows in a fixed order. We should really fix those unit tests to check values row-by-row. As a side effect, that would allow us to eliminate globalCheckFunction. |
I think this will be guaranteed. And many test cases rely on this. Will this be broken ? There're two cases which can use
If remove |
@MrBago Merge your code suggestion. Thanks! |
Test build #85469 has finished for PR 19979 at commit
|
I'm quite sure that:
Basically, we should try to avoid design patterns which assume fixed Row orders. It may be safe sometimes, but the assumption can lead to mistakes.
For test statistics, globalCheckFunction makes sense.
For comparing results with expected values, I much prefer for those values to be in a column in the original input dataset. That has 2 benefits:
|
Agreed. But we can make sure it generate fix order from the last shuffle position in the physical plan RDD lineage. Those model which works like
This is also used in some tests, such as "predictRaw and predictProbability" testcase in `DecisionTreeClassifierSuite"
Agreed. |
val expectedVariances = Array(0.667, 0.667, 0.667, 2.667, 2.667, 2.667) | ||
calculatedVariances.zip(expectedVariances).foreach { case (actual, expected) => | ||
assert(actual ~== expected absTol 1e-3) | ||
testTransformerByGlobalCheckFunc[(Vector, Double)](varianceDF, dt.fit(varianceDF), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The varianceDF
generated by TreeTests.setMetadata
, how to add "expected value" column into the DF ? It seems to need some flaky code. @jkbradley
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected values would have to be added to def varianceData
.
testTransformerByGlobalCheckFunc[(Double, Double, Double)](dataset, model, | ||
"prediction") { case rows: Seq[Row] => | ||
val predictions = rows.map(_.getDouble(0)) | ||
assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we would have to modify generateIsotonicInput to take the expected values.
Global statistics actually are not used in that test suite; all of the checks are done row-by-row. (I'm just saying that it's a rare use case. Maybe it is used somewhere, so I'm OK leaving the globalCheckFunction functionality for now.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM. I still hold by my comments. : ) But I'm fine if we do those cleanups later.
Merging with master
Thanks @WeichenXu123 and @MrBago !
@@ -25,16 +25,15 @@ import org.apache.spark.ml.feature.{Instance, OffsetInstance} | |||
import org.apache.spark.ml.feature.{LabeledPoint, RFormula} | |||
import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors} | |||
import org.apache.spark.ml.param.{ParamMap, ParamsSuite} | |||
import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} | |||
import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, MLTestingUtils} | |||
import org.apache.spark.ml.util.TestingUtils._ | |||
import org.apache.spark.mllib.random._ | |||
import org.apache.spark.mllib.util.MLlibTestSparkContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can remove unused imports
What changes were proposed in this pull request?
ML regression package testsuite add StructuredStreaming test
In order to make testsuite easier to modify, new helper function added in
MLTest
:How was this patch tested?
N/A