
[SPARK-22881][ML][TEST] ML regression package testsuite add StructuredStreaming test #19979

Closed
wants to merge 4 commits into apache:master from WeichenXu123:ml_stream_test

Conversation

WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Dec 14, 2017

What changes were proposed in this pull request?

ML regression package testsuite add StructuredStreaming test

In order to make the test suites easier to modify, a new helper function was added in MLTest:

def testTransformerByGlobalCheckFunc[A : Encoder](
      dataframe: DataFrame,
      transformer: Transformer,
      firstResultCol: String,
      otherResultCols: String*)
      (globalCheckFunction: Seq[Row] => Unit): Unit
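
For illustration, a hedged sketch of how a suite might call this helper (the dataset, model, and expected row count are hypothetical, not taken from the PR):

// Per the PR's goal, the helper lets the suite exercise `model` on `dataset`
// (including via Structured Streaming) and then run a single check over all of
// the collected "prediction" rows.
testTransformerByGlobalCheckFunc[(Vector, Double)](dataset, model, "prediction") { rows =>
  val predictions = rows.map(_.getDouble(0))
  assert(predictions.length === 10)  // hypothetical expected row count
}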

How was this patch tested?

N/A

@SparkQA

SparkQA commented Dec 14, 2017

Test build #84911 has finished for PR 19979 at commit 47dccdd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AFTSurvivalRegressionSuite extends MLTest with DefaultReadWriteTest
  • class DecisionTreeRegressorSuite extends MLTest with DefaultReadWriteTest
  • class GBTRegressorSuite extends MLTest with DefaultReadWriteTest
  • class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest
  • class IsotonicRegressionSuite extends MLTest with DefaultReadWriteTest
  • case class CheckAnswerRowsByFunc(

@WeichenXu123 WeichenXu123 changed the title [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression testsuite add StructuredStreaming test [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression package testsuite add StructuredStreaming test Dec 18, 2017
@WeichenXu123
Contributor Author

@MrBago @jkbradley

@jkbradley
Member

Thanks! Let's track these tasks in new JIRAs. I made one for regression just now: https://issues.apache.org/jira/browse/SPARK-22881

@WeichenXu123 WeichenXu123 changed the title [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression package testsuite add StructuredStreaming test [SPARK-22881][ML][TEST] ML regression package testsuite add StructuredStreaming test Dec 23, 2017
@SparkQA

SparkQA commented Dec 25, 2017

Test build #85376 has finished for PR 19979 at commit 7bc588a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 25, 2017

Test build #85374 has finished for PR 19979 at commit f7a54ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MrBago
Contributor

MrBago commented Dec 28, 2017

@WeichenXu123 it looks like testTransformer is a special case of testTransformerByGlobalCheckFunc. I think it's cleaner to structure the tests this way instead of passing around nulls. Because of the time crunch, I made a PR to show what I mean: WeichenXu123#3.
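
A rough sketch of what that restructuring could look like (illustrative only, not necessarily the code in WeichenXu123#3):

// A per-row check is a special case of a global check applied to every collected
// row, so testTransformer can simply delegate to testTransformerByGlobalCheckFunc.
def testTransformer[A : Encoder](
    dataframe: DataFrame,
    transformer: Transformer,
    firstResultCol: String,
    otherResultCols: String*)
    (checkFunction: Row => Unit): Unit = {
  testTransformerByGlobalCheckFunc[A](
    dataframe, transformer, firstResultCol, otherResultCols: _*) { rows =>
    rows.foreach(checkFunction)
  }
}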

HyukjinKwon pushed a commit to HyukjinKwon/spark that referenced this pull request Dec 28, 2017
…g data failed.

## What changes were proposed in this pull request?

Fix OneVsRestModel transform failing on streaming data.

## How was this patch tested?

UT will be added soon, once apache#19979 is merged. (A helper test method is needed there.)

Author: WeichenXu <weichen.xu@databricks.com>

Closes apache#20077 from WeichenXu123/fix_ovs_model_transform.
@jkbradley
Member

Actually, going further than what Bago said: All of the places which use globalCheckFunction assume that Dataset.collect() returns the Rows in a fixed order. We should really fix those unit tests to check values row-by-row. As a side effect, that would allow us to eliminate globalCheckFunction.

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 28, 2017

@jkbradley

assume that Dataset.collect() returns the Rows in a fixed order.

I think this is guaranteed, and many test cases rely on it. Could this be broken?

There are two cases that can use globalCheckFunction:

  • testing statistics (such as min/max) on the global transformer output
  • getting the global result array and comparing it against hardcoded values

If we remove globalCheckFunction, tests in the above cases will be harder to write, so I prefer to keep globalCheckFunction. What do you think?
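
A minimal sketch of the first case (dataset, model, and bounds are hypothetical):

// Compute a global statistic over the collected output instead of checking each row independently.
testTransformerByGlobalCheckFunc[(Vector, Double)](dataset, model, "prediction") { rows =>
  val predictions = rows.map(_.getDouble(0))
  assert(predictions.min >= 0.0 && predictions.max <= 1.0)
}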

@WeichenXu123
Contributor Author

@MrBago Merged your code suggestion. Thanks!

@SparkQA

SparkQA commented Dec 28, 2017

Test build #85469 has finished for PR 19979 at commit de345dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

assume that Dataset.collect() returns the Rows in a fixed order.

I'm quite sure that:

  • When the Dataset has been constructed without any shuffles or repartitions, then Rows are always in a fixed order.
  • When there has been a shuffle, it is likely the Rows will not follow a fixed order.
  • Spark APIs never guarantee a fixed order, except when sorting has been performed.

Basically, we should try to avoid design patterns which assume fixed Row orders. It may be safe sometimes, but the assumption can lead to mistakes.
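
A tiny illustration (assuming a hypothetical DataFrame df with an "id" column; not code from the PR):

// A shuffle can reorder rows, so only an explicit sort makes the collect() order reliable.
val shuffled = df.repartition(8)                      // introduces a shuffle
val maybeReordered = shuffled.collect()               // row order not guaranteed
val deterministic = shuffled.orderBy("id").collect()  // row order fixed by the sort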

There are two cases that can use globalCheckFunction:

  • testing statistics (such as min/max) on the global transformer output
  • getting the global result array and comparing it against hardcoded values

For test statistics, globalCheckFunction makes sense.

  • But none of the tests in this PR require this. Are there any unit tests in MLlib which do?

For comparing results with expected values, I much prefer those values to be in a column of the original input dataset. That has two benefits (see the sketch after this list):

  • It makes tests easier to read since inputs + expected values are side-by-side in the code.
  • We don't have to worry about Row order.
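
A hedged sketch of that pattern (column names, values, and the model are illustrative; it assumes spark.implicits._, Vectors, and TestingUtils are in scope):

// The expected prediction travels with its input row, so the check does not depend on row order.
val dataWithExpected = Seq(
  (Vectors.dense(1.0), 1.0),
  (Vectors.dense(2.0), 2.0)
).toDF("features", "expectedPrediction")
testTransformer[(Vector, Double)](dataWithExpected, model,
  "prediction", "expectedPrediction") {
  case Row(prediction: Double, expected: Double) =>
    assert(prediction ~== expected absTol 1e-6)
}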

@WeichenXu123
Contributor Author

@jkbradley

When there has been a shuffle, it is likely the Rows will not follow a fixed order.

Agreed. But we can make sure it generates a fixed order from the last shuffle position in the physical plan's RDD lineage. For models that work like a map transformation, I think we can guarantee the output row order is exactly the same as the input row order.

testing statistics (such as min/max) on the global transformer output

This is also used in some tests, such as the "predictRaw and predictProbability" test case in `DecisionTreeClassifierSuite`.

For comparing results with expected values, I much prefer for those values to be in a column in the original input dataset.

Agreed.

val expectedVariances = Array(0.667, 0.667, 0.667, 2.667, 2.667, 2.667)
calculatedVariances.zip(expectedVariances).foreach { case (actual, expected) =>
assert(actual ~== expected absTol 1e-3)
testTransformerByGlobalCheckFunc[(Vector, Double)](varianceDF, dt.fit(varianceDF),
Contributor Author

The varianceDF is generated by TreeTests.setMetadata; how do we add an "expected value" column to the DF? It seems to require some hacky code. @jkbradley

Member

The expected values would have to be added to def varianceData.

testTransformerByGlobalCheckFunc[(Double, Double, Double)](dataset, model,
"prediction") { case rows: Seq[Row] =>
val predictions = rows.map(_.getDouble(0))
assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18))
Contributor Author

Ditto issue.

Member

Here, we would have to modify generateIsotonicInput to take the expected values.
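
One possible shape for that change (hypothetical; the column names follow the existing helper, but this is not the code actually committed, and it assumes spark.implicits._ is in scope):

// Thread an expected prediction through the generated input so each row carries its own expected value.
def generateIsotonicInput(labels: Seq[Double], expectedPredictions: Seq[Double]): DataFrame = {
  labels.zip(expectedPredictions).zipWithIndex.map { case ((label, expected), i) =>
    (label, i.toDouble, 1.0, expected)
  }.toDF("label", "features", "weight", "expectedPrediction")
}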

@jkbradley
Member

jkbradley commented Dec 29, 2017

testing statistics (such as min/max) on the global transformer output

This is also used in some tests, such as the "predictRaw and predictProbability" test case in `DecisionTreeClassifierSuite`.

Global statistics actually are not used in that test suite; all of the checks are done row-by-row. (I'm just saying that it's a rare use case. Maybe it is used somewhere, so I'm OK leaving the globalCheckFunction functionality for now.)

Member

@jkbradley jkbradley left a comment

This LGTM. I still stand by my comments. :) But I'm fine if we do those cleanups later.

Merging with master
Thanks @WeichenXu123 and @MrBago !

@@ -25,16 +25,15 @@ import org.apache.spark.ml.feature.{Instance, OffsetInstance}
import org.apache.spark.ml.feature.{LabeledPoint, RFormula}
import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors}
import org.apache.spark.ml.param.{ParamMap, ParamsSuite}
-import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, MLTestingUtils}
import org.apache.spark.ml.util.TestingUtils._
import org.apache.spark.mllib.random._
import org.apache.spark.mllib.util.MLlibTestSparkContext
Member

nit: can remove unused imports

@asfgit asfgit closed this in 2ea17af Dec 30, 2017
@WeichenXu123 WeichenXu123 deleted the ml_stream_test branch April 24, 2019 21:18