
[Test][SPARK-16002][Follow-up] Fix flaky test in StreamingQueryListenerSuite #15497

Closed
wants to merge 3 commits

Conversation

@lw-lin (Contributor) commented Oct 15, 2016

## What changes were proposed in this pull request?

StreamingQueryListenerSuite #test(s"single listener, check trigger statuses") is flaky; the following is one possible flaky execution:

```
+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
```

This patch's fix is to disallow advance(...) more than once while some thread -- e.g. StreamExecution -- is still in wait(10). In other words, we do not want to advance() "too early"; in this sense, this is a follow-up to SPARK-16002.
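
To illustrate the idea, here is a minimal, hedged sketch of such a guard. The names _isWaiting, _readyForFirstPeek, isWaitingAndReadyForFirstPeek, advance and waitTillTime are the ones discussed in the review below; the class name and body are simplified and are not the exact patch.

```scala
// Simplified sketch of the guarded manual clock idea; not the actual Spark code.
class GuardedManualClock(private var time: Long) {
  private var _isWaiting = false          // a thread is blocked inside waitTillTime()
  private var _readyForFirstPeek = false  // that waiting thread has not been advance()d yet

  def getTimeMillis(): Long = synchronized { time }

  // The test thread may only advance the clock when this returns true.
  def isWaitingAndReadyForFirstPeek: Boolean = synchronized {
    _isWaiting && _readyForFirstPeek
  }

  def advance(timeToAdd: Long): Unit = synchronized {
    time += timeToAdd
    _readyForFirstPeek = false  // a second advance() must wait for a new waitTillTime()
    notifyAll()
  }

  def waitTillTime(targetTime: Long): Long = synchronized {
    _isWaiting = true
    _readyForFirstPeek = true
    try {
      while (time < targetTime) wait(10)
      time
    } finally {
      _isWaiting = false
      _readyForFirstPeek = false
    }
  }
}
```

With this guard, the testing thread polls isWaitingAndReadyForFirstPeek before each advance(...), so the second and third advance(...) in the table above would be blocked until StreamExecution enters waitTillTime again.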

## How was this patch tested?

Ran the flaky test 1000 times; all runs passed.

@lw-lin (Contributor, Author) commented Oct 15, 2016

@tdas @zsxwing could you take a look, thanks!

@SparkQA commented Oct 15, 2016

Test build #67002 has finished for PR 15497 at commit 5bc47b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin (Contributor, Author) commented Oct 15, 2016

Jenkins retest this please

@SparkQA commented Oct 15, 2016

Test build #67006 has finished for PR 15497 at commit 5bc47b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 15, 2016

Test build #3343 has finished for PR 15497 at commit 5bc47b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor) left a comment:

Thank you very much for identifying this potential source of flakiness. But it's not clear to me why another variable _readyForFirstPeek is needed over _isWaiting. Could you add more docs? Or at least explain why a single variable _isWaiting is not sufficient to get this logic right?

```diff
@@ -27,6 +27,7 @@ package org.apache.spark.util
 private[spark] class ManualClock(private var time: Long) extends Clock {

   private var _isWaiting = false
+  private var _readyForFirstPeek = false
```
Contributor:

Can you add more docs on what peek means and what this variable is supposed to do?

Contributor:

It's really not obvious what purpose _readyForFirstPeek serves over _isWaiting.

@lw-lin (Contributor, Author), Oct 17, 2016:

_isWaiting indicates that the main StreamExecution thread is waiting, while _readyForFirstPeek indicates whether this is the first time the test thread has learned that the main thread is waiting.

When the test thread learns for a second or third time, via isWaitingAndReadyForFirstPeek(), that the main thread is waiting, the test thread itself should block, to prevent advance()ing too early.

> what purpose does _readyForFirstPeek serve over _isWaiting

Please refer to advance(), where we mark _readyForFirstPeek as false to indicate that the test thread has already advance()d once; we do not touch _isWaiting.

@lw-lin (Contributor, Author), Oct 17, 2016:

@tdas if we're in the right direction, we should definitely pick good names for _readyForFirstPeek and isWaitingAndReadyForFirstPeek() :)

@tdas (Contributor), Oct 17, 2016:

I think semantically this ManualClock is becoming a very complex and confusing API to use. The caller has to reason carefully about when it is safe to call advance().

How about changing the API to something like this.

  • ManualClock: It has a method called isThreadWaitingAt(timeWhenWaitStarted: Long): Boolean. It returns true when another thread has started waiting while the time was timeWhenWaitStarted. This replaces isWaiting.
  • StreamTest: It keeps track of the expected time when the manual clock is being used. For every AdvanceManualClock, it first calls isThreadWaitingAt(expectedTime) until it returns true, then advances the manual clock as well as the expected time.

This will ensure that successive AdvanceManualClock actions wait for the StreamExecution thread to be unblocked from the previous wait, and start a new wait at the expected time.

How does this sound?

If this sounds good, then there should be unit tests to test this ManualClock behavior thoroughly.
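
Roughly, the proposal could look like the sketch below. Only the signature isThreadWaitingAt(timeWhenWaitStarted: Long): Boolean comes from the comment above; the class name, the waitStartTime field, and the polling loop in the trailing comment are illustrative assumptions.

```scala
// Hedged sketch of the proposed isThreadWaitingAt(...) API; not the final implementation.
class ManualClockWithWaitTracking(private var currentTime: Long) {
  // Clock time at which some thread entered waitTillTime(), if it is still waiting there.
  private var waitStartTime: Option[Long] = None

  def isThreadWaitingAt(timeWhenWaitStarted: Long): Boolean = synchronized {
    waitStartTime.contains(timeWhenWaitStarted)
  }

  def advance(timeToAdd: Long): Unit = synchronized {
    currentTime += timeToAdd
    notifyAll()
  }

  def waitTillTime(targetTime: Long): Long = synchronized {
    waitStartTime = Some(currentTime)  // remember when this wait started
    try {
      while (currentTime < targetTime) wait(10)
      currentTime
    } finally {
      waitStartTime = None
    }
  }
}

// In StreamTest (sketch), the testing thread tracks the expected clock time and only
// advances once the StreamExecution thread is blocked at exactly that time:
//   var expectedTime = 0L
//   ...
//   while (!clock.isThreadWaitingAt(expectedTime)) Thread.sleep(10)
//   clock.advance(delta)
//   expectedTime += delta
```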

@lw-lin (Contributor, Author), Oct 17, 2016:

This sounds good!

One thing to confirm: we are assuming only one main thread and one test thread, right? Because:

  • given two main threads m1, m2, and one testing thread t, should isThreadWaitingAt(time) return true or false if m1 has reached waitTillTime(time) but m2 has not yet?
  • given one main thread m and two testing threads t1, t2, should isThreadWaitingAt(time) return true for only one of t1 and t2, or for both?

Contributor:

I could reproduce this issue as well in a Jenkins run in my PR - https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3351/consoleFull

@tdas (Contributor), Oct 17, 2016:

I pulled your current fix into my PR to test for flakiness by running it in a loop in Jenkins. At least that will give us confidence that the theory of the bug is right. Nonetheless, please change the API and code as I suggested.

My PR: #15492. See Jenkins builds 3352, 3353, and 3354. Each of them is expected to run StreamingQueryListenerSuite about 50 times before Jenkins times out.

```scala
def isWaiting: Boolean = synchronized { _isWaiting }
def isWaitingAndReadyForFirstPeek: Boolean = synchronized { _isWaiting && _readyForFirstPeek }
```
Contributor:

I don't like this name. Why not keep it simple, like isWaiting?

Contributor:

Guess it boils down to what the purpose of _readyForFirstPeek is, which is not clear to me.

@SparkQA commented Oct 17, 2016

Test build #3347 has finished for PR 15497 at commit 5bc47b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -81,7 +81,7 @@ class StreamingQueryListenerSuite extends StreamTest with BeforeAndAfter {
   AssertOnLastQueryStatus { status: StreamingQueryStatus =>
     // Check the correctness of the trigger info of the last completed batch reported by
     // onQueryProgress
-    assert(status.triggerDetails.get("triggerId") == "0")
+    assert(Seq("-1", "0").contains(status.triggerDetails.get("triggerId")))
```
Contributor:

Look at my PR: I replaced .get with .containsKey, because all we need to test here is that the triggerId key is set. The exact trigger id in which the data was found, etc., is not important.

#15492

Could you change this PR to do the same?
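
For reference, a hedged sketch of the suggested assertion, assuming (as in the diff above) that status.triggerDetails is a java.util.Map:

```scala
// Only check that the triggerId key is set; the exact trigger in which the
// data was processed can legitimately vary between runs.
assert(status.triggerDetails.containsKey("triggerId"))
```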

@lw-lin (Contributor, Author):

Sure -- let me do the change, thanks!

@lw-lin (Contributor, Author) commented Oct 17, 2016

Jenkins retest this please

@tdas (Contributor) commented Oct 17, 2016

I thought about it, and I still don't like this design. It adds more complexity to a general class, ManualClock, for functionality needed only by StreamExecution. And that leads to these sorts of questions: should a general feature like isThreadWaiting work with multiple threads, etc.?

I think we need to do it differently. I think it's best to create a custom ManualClock for StreamExecution, one that adds the functionality StreamExecution needs.

Mind if I take over this PR and work this out (in the interest of time, the 2.0.2 cutoff is imminent)?
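
As a rough illustration of that direction, a StreamExecution-specific clock could subclass the generic ManualClock, as in the sketch below. The merged commit quoted further down calls this class StreamManualClock and places it in StreamTest; the method name isStreamWaitingAt and the body here are assumptions, not the merged code.

```scala
// Hedged sketch only. ManualClock is private[spark], so a subclass must live
// somewhere under the org.apache.spark package tree (the real one sits in StreamTest).
package org.apache.spark.sql.streaming

import org.apache.spark.util.ManualClock

class StreamManualClockSketch(time: Long = 0L) extends ManualClock(time) {
  // Clock time at which the StreamExecution thread started its current wait, if any.
  private var waitStartTime: Option[Long] = None

  override def waitTillTime(targetTime: Long): Long = synchronized {
    waitStartTime = Some(getTimeMillis())
    try super.waitTillTime(targetTime)
    finally waitStartTime = None
  }

  // True iff the stream thread is currently blocked in a wait that began at `time`.
  def isStreamWaitingAt(time: Long): Boolean = synchronized {
    waitStartTime == Some(time)
  }
}
```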

@SparkQA commented Oct 17, 2016

Test build #67074 has finished for PR 15497 at commit 7ae7782.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor) commented Oct 17, 2016

I opened a PR, #15519, after modifying your branch. Since you did the initial investigation, I will mark you as the author when I merge it.

@lw-lin (Contributor, Author) commented Oct 18, 2016

Please go ahead and take over -- let's make it into 2.0.2, thanks!

@lw-lin lw-lin closed this Oct 18, 2016
@lw-lin lw-lin deleted the metrics-flaky-test branch October 18, 2016 07:38
asfgit pushed a commit that referenced this pull request Oct 18, 2016
This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.

## What changes were proposed in this pull request?
There were two sources of flakiness in StreamingQueryListener test.

- When testing with a manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
```
+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
```

- The second source of flakiness is that data added to the memory stream may get processed in any trigger, not just the first trigger.

My fix is to make the manual clock wait for the stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for the stream execution thread to complete the wait that started at time 0, and to start a new wait at time 200 (i.e. the timestamp after the previous `advance(100)`).

In addition, since this feature is used solely by StreamExecution, I removed all the non-generic code from ManualClock and put it in StreamManualClock inside StreamTest.

## How was this patch tested?
Ran existing unit test MANY TIMES in Jenkins

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Liwei Lin <lwlin7@gmail.com>

Closes #15519 from tdas/metrics-flaky-test-fix.

(cherry picked from commit 7d878cf)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
asfgit pushed a commit that referenced this pull request Oct 18, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017