Conversation

@mpetruska
Contributor

@mpetruska mpetruska commented Nov 6, 2017

What changes were proposed in this pull request?

Ticket

  • one of the tests seems to produce unreliable results due to execution speed variability

Since the original test was trying to connect to the test server with a 40 ms timeout, and the test server replied after 50 ms, the error might be produced under the following conditions:

  • the test server may reply correctly after 50 ms
  • while the client only observes the timeout after 51 ms
  • this can happen when the executor has to schedule a large number of threads and, under high CPU load, delays the thread/actor responsible for watching the timeout
  • running an entire test suite usually puts high load on the CPU executing the tests
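The race described above can be sketched outside of Spark. The following is a minimal, hypothetical Scala sketch (the 40 ms/50 ms values mirror the description above; none of these names come from the actual test):

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// A "server" that replies after 50 ms, awaited with a 40 ms client timeout.
val reply: Future[String] = Future {
  Thread.sleep(50) // server responds after 50 ms
  "registered"
}

// Under low load the Await usually times out first, but if the awaiting
// thread is descheduled past 50 ms, the future is already complete when the
// timeout check runs and the value is returned instead. Either outcome is
// possible depending on scheduling.
val outcome: String =
  try Await.result(reply, 40.millis)
  catch { case _: TimeoutException => "timed out" }
```

This is exactly why a test that pits a fixed timeout against a fixed sleep is nondeterministic under load.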

How was this patch tested?

The test's checks remain the same, and the set-up emulates that of the previous version.

@vanzin
Contributor

vanzin commented Nov 15, 2017

ok to test


test("SPARK-20640: Shuffle registration timeout and maxAttempts conf are working") {
val tryAgainMsg = "test_spark_20640_try_again"
val `timingoutExecutor` = "timingoutExecutor"
Contributor

Why the backticks?

Contributor Author

No real reason, just a mistake made when renaming the value.
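For context on the exchange above, a hedged illustration of when Scala backquoted identifiers actually matter (the names other than `timingoutExecutor` are made up for this sketch):

```scala
// Backquoted identifiers are only needed when a name collides with a Scala
// keyword; for an ordinary name the backticks are legal but redundant.
val `timingoutExecutor` = "timingoutExecutor" // compiles, but the backticks add nothing
val plainExecutor       = "timingoutExecutor" // equivalent, idiomatic form
val `type`              = "backticks required" // `type` is a keyword, so they are needed here
```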

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83913 has finished for PR 19671 at commit 6da1d34.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 16, 2017

retest this please

@SparkQA

SparkQA commented Nov 16, 2017

Test build #83920 has finished for PR 19671 at commit 6da1d34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpetruska
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 16, 2017

Test build #83937 has finished for PR 19671 at commit 0532cc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

I think we should run the test for executor2 multiple times to avoid the test case being flaky. The test case is flaky mainly because we may lose the RegisterExecutor message, which is not related to the Thread.sleep() issue.
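The suggestion above amounts to bounding retries around a flaky check. A minimal sketch of that idea, assuming a hypothetical `withRetries` helper (not part of the PR) and simulated message losses:

```scala
// Retry a check a bounded number of times so that a single lost
// RegisterExecutor message cannot fail the whole suite on its own.
def withRetries[T](maxAttempts: Int)(body: => T): T =
  try body
  catch {
    case _: Exception if maxAttempts > 1 => withRetries(maxAttempts - 1)(body)
  }

// Simulate two "lost message" failures followed by a success.
var calls = 0
val result = withRetries(3) {
  calls += 1
  if (calls < 3) sys.error("simulated lost RegisterExecutor") else "registered"
}
```

The trade-off is that retries can mask a genuine regression, which is why the PR instead removes the timing dependency itself.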

@mpetruska
Contributor Author

@jiangxb1987 : Well, I ran the BlockManagerSuite with testOnly and did not encounter any failures: 10 passed out of 10. Then, as a benchmark, I checked out the version without my changes and ran the tests the same way; I got the same result, 10 passed out of 10.

I have these theories on the matter:

  • some other change fixed the issue
  • test failures only manifest when running multiple other tests in parallel
  • there's some setting/hardware on my machine that makes it very unlikely for the test to fail

So it turns out that, unfortunately, I cannot validate on my machine that removing the Thread.sleep calls solves the flakiness issue.

Can you please try running the tests on your machine with and without the proposed changes?
(4bacddb602e19fcd4e1ec75a7b10bed524e6989a vs SPARK-22297)

@jiangxb1987
Contributor

@vanzin How often does this test case fail on your local environment? Thanks!

@vanzin
Contributor

vanzin commented Jan 4, 2018

It's not a question of how often, it's that it fails. That's the definition of flaky.

It obviously doesn't fail all the time, which is why it's a minor bug.

Contributor

@vanzin vanzin left a comment

If you still remember, can you explain in the PR summary how the error reported in the bug can happen?

}

val transConf = SparkTransportConf.fromSparkConf(conf, "shuffle", numUsableCores = 0)
val transConf: TransportConf =
Contributor

Change is not necessary?

Contributor Author

Not really, although IntelliJ IDEA recommends adding the type annotation. Do you think I should delete it?

Contributor

Yes, since there's not really any change to this line. We omit the type of local variables in general.
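A hedged illustration of the style point above, with made-up names (the actual line under review involves `TransportConf`):

```scala
// Local vals rely on type inference; explicit annotations are reserved
// for public members, where the type is part of the API contract.
val inferred = 21           // inferred as Int; preferred for locals
val annotated: Int = 21     // legal, but adds noise for a local value
def publicSum(): Int = inferred + annotated // public members keep explicit types
```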

callback.onFailure(failure)

case exec: RegisterExecutor if exec.execId == tryAgainExecutor =>
callback.onSuccess(success)
Contributor

This should be an error, right? Because the test is setting max attempts at 1 for this executor.

Contributor Author

This is to demonstrate that the client attempts the connection only once, then fails.

Contributor

I mean that this branch shouldn't even be reached by the test, right? Since tryAgainExecutor uses conf.set(SHUFFLE_REGISTRATION_MAX_ATTEMPTS.key, "1") which is handled above.

Not a big deal though.

Contributor Author

Yes, that's the point: execution should not reach this branch, but since it runs on a different thread (in an actor receive), the best solution is to report back to the client "thread", which then causes a test failure.
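The pattern described above can be sketched generically. This is a minimal, hypothetical illustration (the queue, thread, and message names are invented for this sketch, not taken from the test):

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// An assertion failing on a server/actor thread cannot fail the test
// directly, so the "should not be reached" branch reports a sentinel
// message back to the client/test thread instead of asserting locally.
val reports = new LinkedBlockingQueue[String]()

val serverThread = new Thread(new Runnable {
  def run(): Unit =
    // Unexpected branch: push a message the test thread can observe.
    reports.put("unexpected_retry")
})
serverThread.start()
serverThread.join()

// The client/test thread now fails deterministically on the reported value.
val observed = reports.poll(1, TimeUnit.SECONDS)
```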

Contributor Author

Of course, I can remove/modify the case, if you'd like...

Contributor

It's fine. At worst it's redundant code.

@mpetruska
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86579 has finished for PR 19671 at commit 5319ae3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpetruska
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86587 has finished for PR 19671 at commit 5319ae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jan 24, 2018

Merging to master.

@asfgit asfgit closed this in 0e178e1 Jan 24, 2018