-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22297][CORE TESTS] Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf" #19671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ok to test |
|
|
||
| test("SPARK-20640: Shuffle registration timeout and maxAttempts conf are working") { | ||
| val tryAgainMsg = "test_spark_20640_try_again" | ||
| val `timingoutExecutor` = "timingoutExecutor" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the backticks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No real reason, just a mistake made when renamed the value.
|
Test build #83913 has finished for PR 19671 at commit
|
|
retest this please |
|
Test build #83920 has finished for PR 19671 at commit
|
|
retest this please |
|
Test build #83937 has finished for PR 19671 at commit
|
|
I think we should run the test for |
|
@jiangxb1987 : Well, I ran the I have these theories on the matter:
So, it turns out that unfortunately I cannot validate on my machine that removing the Can you please try running the tests on your machine with and without the proposed changes? |
|
@vanzin How often does this test case fail on your local environment? Thanks! |
|
It's not a question of how often, it's that it fails. That's the definition of flaky. It obviously doesn't fail all the time, which is why it's a minor bug. |
vanzin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you still remember, can you explain in the PR summary how the error reported in the bug can happen?
| } | ||
|
|
||
| val transConf = SparkTransportConf.fromSparkConf(conf, "shuffle", numUsableCores = 0) | ||
| val transConf: TransportConf = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change is not necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, although IntelliJ Idea recommends adding the type annotation. Do you think I should delete it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, since there's not really any change to this line. We omit the type of local variables in general.
| callback.onFailure(failure) | ||
|
|
||
| case exec: RegisterExecutor if exec.execId == tryAgainExecutor => | ||
| callback.onSuccess(success) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be an error, right? Because the test is setting max attempts at 1 for this executor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to demonstrate that the client only tries connection once, then fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that this branch shouldn't even be reached by the test, right? Since tryAgainExecutor uses conf.set(SHUFFLE_REGISTRATION_MAX_ATTEMPTS.key, "1") which is handled above.
Not a big deal though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the point, execution should not progress here, but since it's executed on a different thread (in an actor receive), the best solution is to report back to the client "thread", which would cause a test failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, I can remove/modify the case, if you'd like...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine. At worst it's redundant code.
|
Jenkins, retest this please. |
|
Test build #86579 has finished for PR 19671 at commit
|
|
Jenkins, retest this please. |
|
Test build #86587 has finished for PR 19671 at commit
|
|
Merging to master. |
What changes were proposed in this pull request?
Ticket
Since the original test was trying to connect to the test server with
40 mstimeout, and the test server replied after50 ms, the error might be produced under the following conditions:50 ms51 mssHow was this patch tested?
The test's check cases remain the same and the set-up emulates the previous version's.