
[SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to fix failure on slow machines #11763

Closed

Conversation


@srowen srowen commented Mar 16, 2016

What changes were proposed in this pull request?

I'm seeing several PR builder builds fail after https://github.com/apache/spark/pull/11725/files. Example:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console

```
testCommunication(org.apache.spark.launcher.LauncherServerSuite)  Time elapsed: 0.023 sec  <<< FAILURE!
java.lang.AssertionError: expected:<app-id> but was:<null>
    at org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93)
```

However, other builds pass this same test, including when it is run locally and on the Jenkins PR builder. The failure concerns a change to how the test waits on a condition, and that wait can time out; I therefore think this is due to fast/slow machine differences.

This is an attempt at a hot fix; it's a little hard to verify, since the test passes anyway both locally and on the PR builder. In any case, the change itself should be harmless.

Why didn't this happen before, if the new logic was supposed to be equivalent to the old? I think this is the sequence:

  • The first attempt to acquire the semaphore, with a 10ms timeout, silently times out
  • The change being waited for happens just after that, a bit too late
  • The assertion still passes, since the condition became true just in time
  • release() fires from the listener
  • The next tryAcquire, however, succeeds immediately by consuming that leftover permit (the first tryAcquire never acquired anything), even though the condition it guards is not yet true; this would explain why the second check always fails
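The stale-permit sequence above can be reproduced in isolation with java.util.concurrent.Semaphore. This is a minimal standalone sketch of the suspected race, not the actual LauncherServerSuite code:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Illustrative reproduction of the race: a timed-out tryAcquire leaves the
// later release()'s permit behind, so the *next* tryAcquire returns at once.
public class StalePermitRace {
    public static void main(String[] args) throws InterruptedException {
        Semaphore sem = new Semaphore(0);

        // 1. The test waits up to 10 ms, but the event hasn't fired yet:
        boolean first = sem.tryAcquire(10, TimeUnit.MILLISECONDS);
        System.out.println("first tryAcquire: " + first);   // false: timed out

        // 2. The event fires just after the timeout; the listener releases:
        sem.release();

        // 3. The next wait returns immediately by consuming the stale permit,
        //    even though the second event hasn't happened yet:
        boolean second = sem.tryAcquire(10, TimeUnit.MILLISECONDS);
        System.out.println("second tryAcquire: " + second); // true: stale permit
    }
}
```

Because the first tryAcquire's failure was ignored, the permit count ends up one ahead of the events actually observed, which matches the "second check always fails" pattern.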

Compared with the original notifyAll() approach, there's a small difference: calling wait() after a notifyAll() has already fired just blocks again; a missed notification does not make the next wait() return immediately, whereas a semaphore permit persists. So this was a tiny latent issue that the old monitor semantics happened to mask. Now the test asserts that the event actually happened (that the semaphore was acquired). The timeout remains, both to prevent the test from hanging forever and to detect a really slow response, and it is increased to one second to allow plenty of time.
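The waiting pattern the fix adopts can be sketched as follows; this is a hedged illustration of the idea (fail loudly on timeout instead of silently proceeding), with hypothetical names, not the actual patch:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch: assert that the awaited event actually arrived, using a generous
// 1-second ceiling so slow machines still pass but a hang is still detected.
public class AssertedWait {
    static void waitForEvent(Semaphore sem) throws InterruptedException {
        // One second is far longer than the event should take, so a timeout
        // here indicates a real problem rather than a slow machine.
        if (!sem.tryAcquire(1, TimeUnit.SECONDS)) {
            throw new AssertionError("event did not fire within 1 second");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Semaphore sem = new Semaphore(0);
        // Simulate a listener releasing the permit shortly after we start waiting.
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            sem.release();
        }).start();
        waitForEvent(sem);   // returns normally once the listener releases
        System.out.println("event observed");
    }
}
```

Asserting the acquire makes a missed event a visible failure at the point it happens, instead of leaving a stale permit to confuse the next wait.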

How was this patch tested?

Jenkins tests

SparkQA commented Mar 16, 2016

Test build #53319 has finished for PR 11763 at commit dac7205.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


srowen commented Mar 16, 2016

I'm going to merge so I can see if this resolves the failures on the QA test pull request builder.

@asfgit asfgit closed this in 9412547 Mar 16, 2016
@srowen srowen deleted the SPARK-13823.3 branch March 17, 2016 13:26
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016