
[SPARK-22881][ML][TEST] ML regression package testsuite add StructuredStreaming test #19979

Closed
wants to merge 4 commits into apache:master from WeichenXu123:ml_stream_test

Conversation

WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Dec 14, 2017

What changes were proposed in this pull request?

ML regression package testsuite add StructuredStreaming test

In order to make the test suites easier to modify, a new helper function was added in MLTest:

def testTransformerByGlobalCheckFunc[A : Encoder](
      dataframe: DataFrame,
      transformer: Transformer,
      firstResultCol: String,
      otherResultCols: String*)
      (globalCheckFunction: Seq[Row] => Unit): Unit
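
For illustration, a hedged sketch of how a suite might call this helper (the dataset, model, and expected row count are hypothetical, not taken from the PR):

// Per the PR's goal, the helper lets the suite exercise `model` on `dataset`
// (including via Structured Streaming) and then run a single check over all of
// the collected "prediction" rows.
testTransformerByGlobalCheckFunc[(Vector, Double)](dataset, model, "prediction") { rows =>
  val predictions = rows.map(_.getDouble(0))
  assert(predictions.length === 10)  // hypothetical expected row count
}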

How was this patch tested?

N/A

@SparkQA

SparkQA commented Dec 14, 2017

Test build #84911 has finished for PR 19979 at commit 47dccdd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AFTSurvivalRegressionSuite extends MLTest with DefaultReadWriteTest
  • class DecisionTreeRegressorSuite extends MLTest with DefaultReadWriteTest
  • class GBTRegressorSuite extends MLTest with DefaultReadWriteTest
  • class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest
  • class IsotonicRegressionSuite extends MLTest with DefaultReadWriteTest
  • case class CheckAnswerRowsByFunc(

@WeichenXu123 WeichenXu123 changed the title [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression testsuite add StructuredStreaming test [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression package testsuite add StructuredStreaming test Dec 18, 2017
@WeichenXu123
Contributor Author

@MrBago @jkbradley

@jkbradley
Member

Thanks! Let's track these tasks in new JIRAs. I made one for regression just now: https://issues.apache.org/jira/browse/SPARK-22881

@WeichenXu123 WeichenXu123 changed the title [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression package testsuite add StructuredStreaming test [SPARK-22881][ML][TEST] ML regression package testsuite add StructuredStreaming test Dec 23, 2017
@SparkQA

SparkQA commented Dec 25, 2017

Test build #85376 has finished for PR 19979 at commit 7bc588a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 25, 2017

Test build #85374 has finished for PR 19979 at commit f7a54ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MrBago
Contributor

MrBago commented Dec 28, 2017

@WeichenXu123 it looks like testTransformer is a special case of testTransformerByGlobalCheckFunc. I think it's cleaner to structure the tests this way instead of passing around nulls. Because of the time crunch, I made a PR to show what I mean: WeichenXu123#3.
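
A rough sketch of what that restructuring could look like (illustrative only, not necessarily the code in WeichenXu123#3):

// A per-row check is a special case of a global check applied to every collected
// row, so testTransformer can simply delegate to testTransformerByGlobalCheckFunc.
def testTransformer[A : Encoder](
    dataframe: DataFrame,
    transformer: Transformer,
    firstResultCol: String,
    otherResultCols: String*)
    (checkFunction: Row => Unit): Unit = {
  testTransformerByGlobalCheckFunc[A](
    dataframe, transformer, firstResultCol, otherResultCols: _*) { rows =>
    rows.foreach(checkFunction)
  }
}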

HyukjinKwon pushed a commit to HyukjinKwon/spark that referenced this pull request Dec 28, 2017
…g data failed.

## What changes were proposed in this pull request?

Fix OneVsRestModel transform failing on streaming data.

## How was this patch tested?

UT will be added soon, once apache#19979 is merged. (A helper test method is needed there.)

Author: WeichenXu <weichen.xu@databricks.com>

Closes apache#20077 from WeichenXu123/fix_ovs_model_transform.
@jkbradley
Member

Actually, going further than what Bago said: All of the places which use globalCheckFunction assume that Dataset.collect() returns the Rows in a fixed order. We should really fix those unit tests to check values row-by-row. As a side effect, that would allow us to eliminate globalCheckFunction.

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 28, 2017

@jkbradley

assume that Dataset.collect() returns the Rows in a fixed order.

I think this is guaranteed, and many test cases rely on it. Could this be broken?

There are two cases that can use globalCheckFunction:

  • testing statistics (such as min/max) on the global transformer output
  • getting the global result array and comparing it against hardcoded values

If we remove globalCheckFunction, tests in the above cases will be harder to write, so I prefer to keep globalCheckFunction. What do you think?
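
A minimal sketch of the first case (dataset, model, and bounds are hypothetical):

// Compute a global statistic over the collected output instead of checking each row independently.
testTransformerByGlobalCheckFunc[(Vector, Double)](dataset, model, "prediction") { rows =>
  val predictions = rows.map(_.getDouble(0))
  assert(predictions.min >= 0.0 && predictions.max <= 1.0)
}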

@WeichenXu123
Contributor Author

@MrBago Merged your code suggestion. Thanks!

@SparkQA

SparkQA commented Dec 28, 2017

Test build #85469 has finished for PR 19979 at commit de345dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

assume that Dataset.collect() returns the Rows in a fixed order.

I'm quite sure that:

  • When the Dataset has been constructed without any shuffles or repartitions, then Rows are always in a fixed order.
  • When there has been a shuffle, it is likely the Rows will not follow a fixed order.
  • Spark APIs never guarantee a fixed order, except when sorting has been performed.

Basically, we should try to avoid design patterns which assume fixed Row orders. It may be safe sometimes, but the assumption can lead to mistakes.
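
A tiny illustration (assuming a hypothetical DataFrame df with an "id" column; not code from the PR):

// A shuffle can reorder rows, so only an explicit sort makes the collect() order reliable.
val shuffled = df.repartition(8)                      // introduces a shuffle
val maybeReordered = shuffled.collect()               // row order not guaranteed
val deterministic = shuffled.orderBy("id").collect()  // row order fixed by the sort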

There are two cases that can use globalCheckFunction:

  • testing statistics (such as min/max) on the global transformer output
  • getting the global result array and comparing it against hardcoded values

For test statistics, globalCheckFunction makes sense.

  • But none of the tests in this PR require this. Are there any unit tests in MLlib which do?

For comparing results with expected values, I much prefer those values to be in a column of the original input dataset. That has two benefits (see the sketch after this list):

  • It makes tests easier to read since inputs + expected values are side-by-side in the code.
  • We don't have to worry about Row order.
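
A hedged sketch of that pattern (column names, values, and the model are illustrative; it assumes spark.implicits._, Vectors, and TestingUtils are in scope):

// The expected prediction travels with its input row, so the check does not depend on row order.
val dataWithExpected = Seq(
  (Vectors.dense(1.0), 1.0),
  (Vectors.dense(2.0), 2.0)
).toDF("features", "expectedPrediction")
testTransformer[(Vector, Double)](dataWithExpected, model,
  "prediction", "expectedPrediction") {
  case Row(prediction: Double, expected: Double) =>
    assert(prediction ~== expected absTol 1e-6)
}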

@WeichenXu123
Contributor Author

@jkbradley

When there has been a shuffle, it is likely the Rows will not follow a fixed order.

Agreed. But we can make sure it generates a fixed order from the last shuffle position in the physical plan's RDD lineage. For models that work like a map transformation, I think we can guarantee the output row order is exactly the same as the input row order.

testing statistics (such as min/max) on the global transformer output

This is also used in some tests, such as the "predictRaw and predictProbability" test case in `DecisionTreeClassifierSuite`.

For comparing results with expected values, I much prefer for those values to be in a column in the original input dataset.

Agreed.

val expectedVariances = Array(0.667, 0.667, 0.667, 2.667, 2.667, 2.667)
calculatedVariances.zip(expectedVariances).foreach { case (actual, expected) =>
assert(actual ~== expected absTol 1e-3)
testTransformerByGlobalCheckFunc[(Vector, Double)](varianceDF, dt.fit(varianceDF),
Contributor Author

The varianceDF is generated by TreeTests.setMetadata; how do we add an "expected value" column to the DF? It seems to require some hacky code. @jkbradley

Member

The expected values would have to be added to def varianceData.

testTransformerByGlobalCheckFunc[(Double, Double, Double)](dataset, model,
"prediction") { case rows: Seq[Row] =>
val predictions = rows.map(_.getDouble(0))
assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18))
Contributor Author

Ditto issue.

Member

Here, we would have to modify generateIsotonicInput to take the expected values.
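
One possible shape for that change (hypothetical; the column names follow the existing helper, but this is not the code actually committed, and it assumes spark.implicits._ is in scope):

// Thread an expected prediction through the generated input so each row carries its own expected value.
def generateIsotonicInput(labels: Seq[Double], expectedPredictions: Seq[Double]): DataFrame = {
  labels.zip(expectedPredictions).zipWithIndex.map { case ((label, expected), i) =>
    (label, i.toDouble, 1.0, expected)
  }.toDF("label", "features", "weight", "expectedPrediction")
}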

@jkbradley
Member

jkbradley commented Dec 29, 2017

testing statistics (such as min/max) on the global transformer output

This is also used in some tests, such as the "predictRaw and predictProbability" test case in `DecisionTreeClassifierSuite`.

Global statistics actually are not used in that test suite; all of the checks are done row-by-row. (I'm just saying that it's a rare use case. Maybe it is used somewhere, so I'm OK leaving the globalCheckFunction functionality for now.)

Member

@jkbradley jkbradley left a comment

This LGTM. I still stand by my comments. :) But I'm fine if we do those cleanups later.

Merging with master
Thanks @WeichenXu123 and @MrBago !

@@ -25,16 +25,15 @@ import org.apache.spark.ml.feature.{Instance, OffsetInstance}
import org.apache.spark.ml.feature.{LabeledPoint, RFormula}
import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors}
import org.apache.spark.ml.param.{ParamMap, ParamsSuite}
-import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, MLTestingUtils}
import org.apache.spark.ml.util.TestingUtils._
import org.apache.spark.mllib.random._
import org.apache.spark.mllib.util.MLlibTestSparkContext
Member

nit: can remove unused imports

@asfgit asfgit closed this in 2ea17af Dec 30, 2017
@WeichenXu123 WeichenXu123 deleted the ml_stream_test branch April 24, 2019 21:18