
[SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to fix failure on slow machines #11763

Closed

Conversation


@srowen srowen commented Mar 16, 2016

What changes were proposed in this pull request?

I'm seeing several PR builder builds fail after https://github.com/apache/spark/pull/11725/files. Example:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console

```
testCommunication(org.apache.spark.launcher.LauncherServerSuite)  Time elapsed: 0.023 sec  <<< FAILURE!
java.lang.AssertionError: expected:<app-id> but was:<null>
    at org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93)
```

However, other builds pass this same test, including when it is run locally and on the Jenkins PR builder. The failure concerns a change to how the test waits on a condition, and that wait can time out; I therefore think this is due to fast/slow machine differences.

This is an attempt at a hot fix; it's a little hard to verify, since the test passes anyway both locally and on the PR builder. In any case, the change itself should be harmless.

Why didn't this happen before, if the new logic was supposed to be equivalent to the old? I think this is the sequence:

  • The first attempt to acquire the semaphore, with a 10ms timeout, silently times out
  • The change being waited for happens just after that, a bit too late
  • The assertion still passes, since the condition became true just in time
  • release() fires from the listener
  • The next tryAcquire, however, succeeds immediately by consuming that leftover permit (the first tryAcquire never acquired anything), even though the condition it guards is not yet true; this would explain why the second check always fails
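The stale-permit sequence above can be reproduced in isolation with java.util.concurrent.Semaphore. This is a minimal standalone sketch of the suspected race, not the actual LauncherServerSuite code:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Illustrative reproduction of the race: a timed-out tryAcquire leaves the
// later release()'s permit behind, so the *next* tryAcquire returns at once.
public class StalePermitRace {
    public static void main(String[] args) throws InterruptedException {
        Semaphore sem = new Semaphore(0);

        // 1. The test waits up to 10 ms, but the event hasn't fired yet:
        boolean first = sem.tryAcquire(10, TimeUnit.MILLISECONDS);
        System.out.println("first tryAcquire: " + first);   // false: timed out

        // 2. The event fires just after the timeout; the listener releases:
        sem.release();

        // 3. The next wait returns immediately by consuming the stale permit,
        //    even though the second event hasn't happened yet:
        boolean second = sem.tryAcquire(10, TimeUnit.MILLISECONDS);
        System.out.println("second tryAcquire: " + second); // true: stale permit
    }
}
```

Because the first tryAcquire's failure was ignored, the permit count ends up one ahead of the events actually observed, which matches the "second check always fails" pattern.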

Compared with the original notifyAll() approach, there's a small difference: calling wait() after a notifyAll() has already fired just blocks again; a missed notification does not make the next wait() return immediately, whereas a semaphore permit persists. So this was a tiny latent issue that the old monitor semantics happened to mask. Now the test asserts that the event actually happened (that the semaphore was acquired). The timeout remains, both to prevent the test from hanging forever and to detect a really slow response, and it is increased to one second to allow plenty of time.
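The waiting pattern the fix adopts can be sketched as follows; this is a hedged illustration of the idea (fail loudly on timeout instead of silently proceeding), with hypothetical names, not the actual patch:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch: assert that the awaited event actually arrived, using a generous
// 1-second ceiling so slow machines still pass but a hang is still detected.
public class AssertedWait {
    static void waitForEvent(Semaphore sem) throws InterruptedException {
        // One second is far longer than the event should take, so a timeout
        // here indicates a real problem rather than a slow machine.
        if (!sem.tryAcquire(1, TimeUnit.SECONDS)) {
            throw new AssertionError("event did not fire within 1 second");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Semaphore sem = new Semaphore(0);
        // Simulate a listener releasing the permit shortly after we start waiting.
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            sem.release();
        }).start();
        waitForEvent(sem);   // returns normally once the listener releases
        System.out.println("event observed");
    }
}
```

Asserting the acquire makes a missed event a visible failure at the point it happens, instead of leaving a stale permit to confuse the next wait.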

How was this patch tested?

Jenkins tests

SparkQA commented Mar 16, 2016

Test build #53319 has finished for PR 11763 at commit dac7205.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


srowen commented Mar 16, 2016

I'm going to merge so I can see if this resolves the failures on the QA test pull request builder.

@asfgit asfgit closed this in 9412547 Mar 16, 2016
@srowen srowen deleted the SPARK-13823.3 branch March 17, 2016 13:26
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016