[SPARK-22644][ML][TEST] Make ML testsuite support StructuredStreaming test #19843

WeichenXu123 · 2017-11-29T07:34:31Z

What changes were proposed in this pull request?

We need to add some helper code to make testing ML transformers & models easier with streaming data. These tests might help us catch any remaining issues and we could encourage future PRs to use these tests to prevent new Models & Transformers from having issues.

I add an MLTest trait which extends the StreamTest trait and overrides createSparkSession, so an ML test suite only needs to extend MLTest to use both the ML and streaming test utility functions.

I modified only one test case in LinearRegressionSuite, for a first-pass review.

Link to #19746

How was this patch tested?

MLTestSuite added.

SparkQA · 2017-11-29T08:05:02Z

Test build #84287 has finished for PR 19843 at commit 08954fe.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-29T08:05:02Z

Test build #84286 has finished for PR 19843 at commit 072f4b9.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class LinearRegressionSuite extends MLTest with DefaultReadWriteTest
trait MLTest extends StreamTest with TempDirectory
trait TransformerStreamTest extends StreamTest
case class CheckAnswerRows(expectedAnswer: Seq[Row], lastOnly: Boolean, isSorted: Boolean,

WeichenXu123 · 2017-11-29T09:38:02Z

Jenkins retest this please.

WeichenXu123 · 2017-11-29T10:10:16Z

@MrBago @jkbradley

SparkQA · 2017-11-29T12:57:55Z

Test build #84290 has finished for PR 19843 at commit 08954fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MrBago

I really like this overall approach. Just a few small comments.

MrBago · 2017-11-30T01:24:11Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala

+              assert(checkFunction != null)
+              sparkAnswer.foreach { row =>
+                try {
+                  checkFunction


Should this be checkFunction(row)?
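For context on why the bare reference matters: in Scala, naming a function value without an argument list evaluates to the function itself and does not call it, so the per-row check would silently never run. A minimal standalone sketch of the difference follows; the `Row` alias and the `countInvocations` helper are hypothetical stand-ins for illustration, not Spark code.

```scala
object CheckFunctionCall {
  // Hypothetical stand-in for org.apache.spark.sql.Row, for illustration only.
  type Row = Seq[Any]

  // Counts how many rows the check function was actually applied to.
  def countInvocations(rows: Seq[Row], applyCheck: Boolean): Int = {
    var calls = 0
    val checkFunction: Row => Unit = _ => calls += 1
    rows.foreach { row =>
      if (applyCheck) {
        checkFunction(row) // invokes the check for this row
      } else {
        checkFunction      // only references the function value; nothing runs
      }
    }
    calls
  }
}
```

With two input rows, the applied form runs the check twice, while the bare reference runs it zero times and the test would pass vacuously.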

MrBago · 2017-11-30T01:26:36Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala

@@ -133,6 +133,9 @@ trait StreamTest extends QueryTest with SharedSQLContext with TimeLimits with Be
    }

    def apply(rows: Row*): CheckAnswerRows = CheckAnswerRows(rows, false, false)
+
+    def apply(checkFunction: Row => Unit): CheckAnswerRows =
+      CheckAnswerRows(null, false, false, checkFunction)


This construction feels very forced. I wonder if we should define a new case class for CheckAnswer(function). Maybe we could have the CheckAnswer classes share a common trait to avoid needing to duplicate the setup code for checking the answers.

I don't feel strongly about this, just a thought.
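One way the shared-trait idea could look is sketched below. This is only an illustration under simplifying assumptions: `Row` is a plain alias, `CheckAnswerLike`, `CheckRows`, and `CheckRowsByFunc` are hypothetical names, and `run` stands in for the real setup code that fetches rows from the streaming sink.

```scala
object CheckAnswerSketch {
  // Hypothetical stand-in for org.apache.spark.sql.Row, for illustration only.
  type Row = Seq[Any]

  // Shared setup lives in the trait; each subclass supplies only its own comparison.
  sealed trait CheckAnswerLike {
    def lastOnly: Boolean
    protected def verify(rows: Seq[Row]): Unit

    // Stand-in for the common "collect the rows, then check them" logic.
    final def run(batches: Seq[Seq[Row]]): Unit = {
      val rows = if (lastOnly) batches.lastOption.getOrElse(Seq.empty) else batches.flatten
      verify(rows)
    }
  }

  // Compares the collected rows against an expected answer.
  final case class CheckRows(expected: Seq[Row], lastOnly: Boolean) extends CheckAnswerLike {
    protected def verify(rows: Seq[Row]): Unit =
      assert(rows == expected, s"expected $expected but got $rows")
  }

  // Applies a caller-supplied check to every collected row.
  final case class CheckRowsByFunc(check: Row => Unit, lastOnly: Boolean)
      extends CheckAnswerLike {
    protected def verify(rows: Seq[Row]): Unit = rows.foreach(check)
  }
}
```

The trade-off is the one raised in the comment: the function-based variant no longer needs a `null` expected answer, at the cost of introducing a new type hierarchy into existing test code.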

MrBago · 2017-11-30T01:27:42Z

mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala

@@ -233,7 +232,8 @@ class LinearRegressionSuite
      assert(model2.intercept ~== interceptR relTol 1E-3)
      assert(model2.coefficients ~= coefficientsR relTol 1E-3)

-      model1.transform(datasetWithDenseFeature).select("features", "prediction").collect().foreach {
+      testTransformer[(Double, Vector)](datasetWithDenseFeature, Seq(model1),
+        "features", "prediction") {


I really like this, I'm impressed by how little we need to change to leverage existing tests and have them run in both streaming and in batch mode!!

WeichenXu123 · 2017-11-30T04:34:50Z

@MrBago Thanks!
I updated the code; a new action class, CheckAnswerRowsByFunc, is now added. I did not add a common trait, as both classes are simple and I don't want to break old code.

SparkQA · 2017-11-30T07:48:36Z

Test build #84328 has finished for PR 19843 at commit ea19849.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class CheckAnswerRows(expectedAnswer: Seq[Row], lastOnly: Boolean, isSorted: Boolean)
case class CheckAnswerRowsByFunc(checkFunction: Row => Unit, lastOnly: Boolean)

jkbradley · 2017-12-07T00:38:30Z

mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala

+    sc.setCheckpointDir(checkpointDir)
+  }
+
+  override def afterAll() {


In https://github.com/apache/spark/pull/9677/files#diff-2d609a0839a51280e0f10cc73ef42d21R35 we added a call to clear the active SQLContext after each MLlib test suite, and MLlibTestSparkContext still does that:

spark/mllib/src/test/scala/org/apache/spark/mllib/util/MLlibTestSparkContext.scala

Line 49 in 8ae004b

SparkSession.clearActiveSession()

At the time, that was necessary to avoid flakiness in Jenkins tests. Do you know if that's no longer necessary? CCing @nkronenfeld and @gatorsmile who seem to have worked on the SQL test traits recently

Maybe not needed if you are not hitting any flaky test

Well, we'll find out in a few weeks : )

Checking back in here, I haven't seen flakiness, but I have seen cascading failures, which I believe are a new phenomenon: A failure in one test suite A seems to cause subsequent test suites B, C, ... to fail. (But when A is fixed, then B, C, ... run correctly.) Do you not see this in SQL tests? Do you think this might be related to not cleaning up the active context?

Actually, it's worse than this. I see a bunch of failures when I run multiple test suites at once, even when doing sbt clean package beforehand and without any tests which fail by themselves. Will test on master and complain on the dev list if it's an issue. (No need to respond here)

jkbradley · 2017-12-07T00:43:06Z

mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala

+    }
+  }
+
+  def testTransformer[A : Encoder](dataframe: DataFrame, transformers: Seq[Transformer],


Do we need a Seq of Transformers? Should we not just use a PipelineModel as the Transformer when needed?

Sometimes there is only one transformer? I am not sure which is better.

jkbradley · 2017-12-07T00:45:42Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala

@@ -162,6 +168,12 @@ trait StreamTest extends QueryTest with SharedSQLContext with TimeLimits with Be
    private def operatorName = if (lastOnly) "CheckLastBatch" else "CheckAnswer"
  }

+  case class CheckAnswerRowsByFunc(checkFunction: Row => Unit, lastOnly: Boolean)
+    extends StreamAction with StreamMustBeRunning {


nit: style: indent +2 spaces

jkbradley · 2017-12-07T00:48:47Z

This looks awesome; I just had a couple of comments. Btw, this is fancy test code. It might be nice to add a little unit test to MLTest.scala to make sure that testTransformer does indeed fail when the per-row check fails.
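The kind of self-check suggested here can be sketched without Spark. In the sketch below, `checkEachRow` is a hypothetical stand-in for `testTransformer` that simply applies a check to every row; the point is only that a failing per-row check must propagate out of the helper rather than be swallowed.

```scala
import scala.util.Try

object PerRowCheckSketch {
  // Hypothetical helper standing in for testTransformer: applies `check` to every row.
  def checkEachRow[A](rows: Seq[A])(check: A => Unit): Unit = rows.foreach(check)

  // True only if the helper propagates a failure raised by the per-row check.
  def failsOnBadRow: Boolean =
    Try(checkEachRow(Seq(1, 2, 3)) { x => assert(x < 3, s"unexpected value $x") }).isFailure

  // True if the helper completes normally when every row passes the check.
  def passesOnGoodRows: Boolean =
    Try(checkEachRow(Seq(1, 2)) { x => assert(x < 3) }).isSuccess
}
```

In a ScalaTest suite the same idea would usually be written with `intercept` around the helper call instead of `Try`.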

WeichenXu123 · 2017-12-07T09:13:38Z

Added a unit test for MLTest and changed the helper to use PipelineModel.

SparkQA · 2017-12-07T12:37:47Z

Test build #84600 has finished for PR 19843 at commit 8318611.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class MLTestSuite extends MLTest

jkbradley · 2017-12-07T20:17:44Z

mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala

+  }
+
+  def testPipelineModelOnStreamData[A : Encoder](dataframe: DataFrame,
+      pipelineModel: PipelineModel, firstResultCol: String, otherResultCols: String*)


Sorry, I should have been clearer: I was suggesting taking a Transformer here, rather than a Seq of Transformers. A PipelineModel is a type of Transformer, so users of this trait could use a PipelineModel as the Transformer in order to string together multiple Transformers.

Also, style nit: For multi-line method headers, please put 1 argument per line.
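To illustrate the suggestion with a simplified model: if a chain of stages is itself a transformer, a test helper can accept a single transformer and still cover multi-stage cases. Everything below (`SimpleTransformer`, `Scale`, `Shift`, `Chain`, `testTransformerSketch`) is a hypothetical stand-in operating on plain sequences, not the Spark ML API; the method header also follows the one-argument-per-line style requested above.

```scala
object TransformerParamSketch {
  // Hypothetical, simplified stand-ins for Spark ML's Transformer and PipelineModel.
  trait SimpleTransformer {
    def transform(data: Seq[Double]): Seq[Double]
  }

  final case class Scale(factor: Double) extends SimpleTransformer {
    def transform(data: Seq[Double]): Seq[Double] = data.map(_ * factor)
  }

  final case class Shift(offset: Double) extends SimpleTransformer {
    def transform(data: Seq[Double]): Seq[Double] = data.map(_ + offset)
  }

  // A chain of stages is itself a transformer, so the helper needs no Seq parameter.
  final case class Chain(stages: Seq[SimpleTransformer]) extends SimpleTransformer {
    def transform(data: Seq[Double]): Seq[Double] =
      stages.foldLeft(data)((current, stage) => stage.transform(current))
  }

  // Applies the transformer, then runs the caller's check on every output value.
  def testTransformerSketch(
      data: Seq[Double],
      transformer: SimpleTransformer)(
      check: Double => Unit): Unit = {
    transformer.transform(data).foreach(check)
  }
}
```

A caller with one stage passes it directly, and a caller with several wraps them in `Chain` first, which mirrors using a PipelineModel as the Transformer.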

jkbradley · 2017-12-07T23:10:42Z

I'll make a call: Given that the SQL tests do not use clearActive, let's not bother with it. If we see flakiness, then we can try adding clearActive as a fix.

SparkQA · 2017-12-08T05:50:42Z

Test build #84637 has finished for PR 19843 at commit 9629b31.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-12-08T23:03:49Z

Also, can you please remove "WIP" from the PR title and update the Testing part of the PR description?

jkbradley · 2017-12-08T23:06:08Z

LGTM, but I'll wait for the PR title & description updates to merge this. Thanks!

SparkQA · 2017-12-09T05:29:25Z

Test build #84675 has finished for PR 19843 at commit 930c113.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-12-12T21:34:42Z

LGTM
Will merge after fresh tests

SparkQA · 2017-12-12T21:36:18Z

Test build #4008 has finished for PR 19843 at commit 930c113.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-12-12T21:53:10Z

retest this please

jkbradley · 2017-12-12T22:37:28Z

That failure was caused by a bad change elsewhere which has been reverted. Testing again...

SparkQA · 2017-12-13T01:20:42Z

Test build #84796 has finished for PR 19843 at commit 930c113.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-13T02:07:11Z

Test build #4009 has finished for PR 19843 at commit 930c113.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-12-13T05:26:58Z

Merging with master
Thanks @WeichenXu123 and @MrBago !

viirya · 2017-12-26T09:09:41Z

mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala

+import org.apache.spark.sql.test.TestSparkSession
+import org.apache.spark.util.Utils
+
+trait MLTest extends StreamTest with TempDirectory { self: Suite =>


MLStreamTest seems more proper.

But my intention here is to let every ML test suite inherit MLTest.

WeichenXu123 added 2 commits November 29, 2017 12:26

init pr (072f4b9)
remove useless code (08954fe)

MrBago reviewed Nov 30, 2017

add CheckAnswerRowsByFunc action (ea19849)

jkbradley reviewed Dec 7, 2017

address Joseph's comments (8318611)

jkbradley reviewed Dec 7, 2017

address comments (9629b31)

WeichenXu123 changed the title ~~[SPARK-22644][ML][TEST][WIP] Make ML testsuite support StructuredStreaming test~~ [SPARK-22644][ML][TEST] Make ML testsuite support StructuredStreaming test Dec 9, 2017

minor update (930c113)

asfgit closed this in 0e36ba6 Dec 13, 2017

WeichenXu123 deleted the ml_stream_test_helper branch December 13, 2017 06:37

viirya reviewed Dec 26, 2017
