Conversation

@mpetruska
Contributor

@mpetruska mpetruska commented Nov 6, 2017

What changes were proposed in this pull request?

Ticket

  • one of the tests seems to produce unreliable results due to execution speed variability

Since the original test was trying to connect to the test server with a 40 ms timeout, and the test server replied after 50 ms, the error might be produced under the following conditions:

  • the test server may reply correctly after 50 ms
  • while the client only observes the timeout after 51 ms
  • this can happen when the executor has to schedule a large number of threads and, under high CPU load, delays the thread/actor responsible for watching the timeout
  • running an entire test suite usually puts high load on the CPU executing the tests
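The race described above can be sketched outside of Spark. The following is a minimal, hypothetical Scala sketch (the 40 ms/50 ms values mirror the description above; none of these names come from the actual test):

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// A "server" that replies after 50 ms, awaited with a 40 ms client timeout.
val reply: Future[String] = Future {
  Thread.sleep(50) // server responds after 50 ms
  "registered"
}

// Under low load the Await usually times out first, but if the awaiting
// thread is descheduled past 50 ms, the future is already complete when the
// timeout check runs and the value is returned instead. Either outcome is
// possible depending on scheduling.
val outcome: String =
  try Await.result(reply, 40.millis)
  catch { case _: TimeoutException => "timed out" }
```

This is exactly why a test that pits a fixed timeout against a fixed sleep is nondeterministic under load.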

How was this patch tested?

The test's checks remain the same, and the set-up emulates that of the previous version.

@vanzin
Contributor

vanzin commented Nov 15, 2017

ok to test


test("SPARK-20640: Shuffle registration timeout and maxAttempts conf are working") {
val tryAgainMsg = "test_spark_20640_try_again"
val `timingoutExecutor` = "timingoutExecutor"
Contributor

Why the backticks?

Contributor Author

No real reason, just a mistake made when renaming the value.
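For context on the exchange above, a hedged illustration of when Scala backquoted identifiers actually matter (the names other than `timingoutExecutor` are made up for this sketch):

```scala
// Backquoted identifiers are only needed when a name collides with a Scala
// keyword; for an ordinary name the backticks are legal but redundant.
val `timingoutExecutor` = "timingoutExecutor" // compiles, but the backticks add nothing
val plainExecutor       = "timingoutExecutor" // equivalent, idiomatic form
val `type`              = "backticks required" // `type` is a keyword, so they are needed here
```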

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83913 has finished for PR 19671 at commit 6da1d34.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 16, 2017

retest this please

@SparkQA

SparkQA commented Nov 16, 2017

Test build #83920 has finished for PR 19671 at commit 6da1d34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpetruska
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 16, 2017

Test build #83937 has finished for PR 19671 at commit 0532cc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

I think we should run the test for executor2 multiple times to avoid the test case being flaky. The test case is flaky mainly because we may lose the RegisterExecutor message, which is not related to the Thread.sleep() issue.
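The suggestion above amounts to bounding retries around a flaky check. A minimal sketch of that idea, assuming a hypothetical `withRetries` helper (not part of the PR) and simulated message losses:

```scala
// Retry a check a bounded number of times so that a single lost
// RegisterExecutor message cannot fail the whole suite on its own.
def withRetries[T](maxAttempts: Int)(body: => T): T =
  try body
  catch {
    case _: Exception if maxAttempts > 1 => withRetries(maxAttempts - 1)(body)
  }

// Simulate two "lost message" failures followed by a success.
var calls = 0
val result = withRetries(3) {
  calls += 1
  if (calls < 3) sys.error("simulated lost RegisterExecutor") else "registered"
}
```

The trade-off is that retries can mask a genuine regression, which is why the PR instead removes the timing dependency itself.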

@mpetruska
Contributor Author

@jiangxb1987 : Well, I ran the BlockManagerSuite with testOnly and did not encounter any failures: 10 passed out of 10. Then, as a benchmark, I checked out the version without my changes and ran the tests the same way; I got the same result, 10 passed out of 10.

I have these theories on the matter:

  • some other change fixed the issue
  • test failures only manifest when running multiple other tests in parallel
  • there's some setting/hardware on my machine that makes it very unlikely for the test to fail

So it turns out that, unfortunately, I cannot validate on my machine that removing the Thread.sleep calls solves the flakiness issue.

Can you please try running the tests on your machine with and without the proposed changes?
(4bacddb602e19fcd4e1ec75a7b10bed524e6989a vs SPARK-22297)

@jiangxb1987
Contributor

@vanzin How often does this test case fail on your local environment? Thanks!

@vanzin
Contributor

vanzin commented Jan 4, 2018

It's not a question of how often, it's that it fails. That's the definition of flaky.

It obviously doesn't fail all the time, which is why it's a minor bug.

Contributor

@vanzin vanzin left a comment

If you still remember, can you explain in the PR summary how the error reported in the bug can happen?

}

val transConf = SparkTransportConf.fromSparkConf(conf, "shuffle", numUsableCores = 0)
val transConf: TransportConf =
Contributor

Change is not necessary?

Contributor Author

Not really, although IntelliJ IDEA recommends adding the type annotation. Do you think I should delete it?

Contributor

Yes, since there's not really any change to this line. We omit the type of local variables in general.
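A hedged illustration of the style point above, with made-up names (the actual line under review involves `TransportConf`):

```scala
// Local vals rely on type inference; explicit annotations are reserved
// for public members, where the type is part of the API contract.
val inferred = 21           // inferred as Int; preferred for locals
val annotated: Int = 21     // legal, but adds noise for a local value
def publicSum(): Int = inferred + annotated // public members keep explicit types
```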

callback.onFailure(failure)

case exec: RegisterExecutor if exec.execId == tryAgainExecutor =>
callback.onSuccess(success)
Contributor

This should be an error, right? Because the test is setting max attempts at 1 for this executor.

Contributor Author

This is to demonstrate that the client attempts the connection only once, then fails.

Contributor

I mean that this branch shouldn't even be reached by the test, right? Since tryAgainExecutor uses conf.set(SHUFFLE_REGISTRATION_MAX_ATTEMPTS.key, "1") which is handled above.

Not a big deal though.

Contributor Author

Yes, that's the point: execution should not reach this branch, but since it runs on a different thread (in an actor receive), the best solution is to report back to the client "thread", which then causes a test failure.
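The pattern described above can be sketched generically. This is a minimal, hypothetical illustration (the queue, thread, and message names are invented for this sketch, not taken from the test):

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// An assertion failing on a server/actor thread cannot fail the test
// directly, so the "should not be reached" branch reports a sentinel
// message back to the client/test thread instead of asserting locally.
val reports = new LinkedBlockingQueue[String]()

val serverThread = new Thread(new Runnable {
  def run(): Unit =
    // Unexpected branch: push a message the test thread can observe.
    reports.put("unexpected_retry")
})
serverThread.start()
serverThread.join()

// The client/test thread now fails deterministically on the reported value.
val observed = reports.poll(1, TimeUnit.SECONDS)
```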

Contributor Author

Of course, I can remove/modify the case, if you'd like...

Contributor

It's fine. At worst it's redundant code.

@mpetruska
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86579 has finished for PR 19671 at commit 5319ae3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpetruska
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86587 has finished for PR 19671 at commit 5319ae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jan 24, 2018

Merging to master.

@asfgit asfgit closed this in 0e178e1 Jan 24, 2018