[SPARK-23020][CORE] Fix races in launcher code, test.
The race in the code is because the handle might update its state to the wrong state if the connection handling thread is still processing incoming data; so the handle needs to wait for the connection to finish up before checking the final state.

The race in the test is because when waiting for a handle to reach a final state, the waitFor() method needs to wait until all handle state is updated (which also includes waiting for the connection thread above to finish). Otherwise, waitFor() may return too early, which would cause a bunch of different races (like the listener not being yet notified of the state change, or being in the middle of being notified, or the handle not being properly disposed and causing postChecks() to assert).

On top of that I found, by code inspection, a couple of potential races that could make a handle end up in the wrong state when being killed.

The original version of this fix introduced the flipped version of the first race described above; the connection closing might override the handle state before the handle might have a chance to do cleanup. The fix there is to only dispose of the handle from the connection when there is an error, and let the handle dispose itself in the normal case.

The fix also caused a bug in YarnClusterSuite to be surfaced; the code was checking for a file in the classpath that was not expected to be there in client mode. Because of the above issues, the error was not propagating correctly and the (buggy) test was incorrectly passing.

Tested by running the existing unit tests a lot (and not seeing the errors I was seeing before).

Author: Marcelo Vanzin <firstname.lastname@example.org>

Closes #20297 from vanzin/SPARK-23020.

(cherry picked from commit ec22897)
Signed-off-by: Wenchen Fan <email@example.com>
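
For readers who want to see the shape of the first race, here is a minimal, self-contained sketch of the "wait for the connection thread before deciding the final state" idea. It is not the code from this commit: `HandleRaceSketch`, `SketchHandle`, `setConnectionThread` and `runOnce` are hypothetical names invented for illustration, and the real launcher classes are structured differently.

```java
import java.util.concurrent.CountDownLatch;

public class HandleRaceSketch {

  // Hypothetical states, loosely modelled on the idea of "final" handle states.
  enum State {
    UNKNOWN(false), RUNNING(false), FINISHED(true), LOST(true);

    private final boolean isFinal;
    State(boolean isFinal) { this.isFinal = isFinal; }
    boolean isFinal() { return isFinal; }
  }

  // Hypothetical handle; not Spark's AbstractAppHandle.
  static class SketchHandle {
    private State state = State.UNKNOWN;
    private volatile Thread connectionThread;

    // Once a final state is reached, later updates are ignored.
    synchronized void setState(State s) {
      if (!state.isFinal()) {
        state = s;
      }
    }

    synchronized State getState() { return state; }

    void setConnectionThread(Thread t) { this.connectionThread = t; }

    // Wait for the connection thread to drain before deciding the final state.
    // Without the join(), LOST could be recorded while the connection thread
    // is still about to report FINISHED.
    void dispose() throws InterruptedException {
      Thread t = connectionThread;
      if (t != null && t != Thread.currentThread()) {
        t.join();
      }
      setState(State.LOST); // no-op if the connection already set a final state
    }
  }

  // Runs the scenario once and returns the state the handle ends up in.
  static State runOnce() {
    SketchHandle handle = new SketchHandle();
    CountDownLatch started = new CountDownLatch(1);

    // Simulated connection thread that is slow to process the last message.
    Thread conn = new Thread(() -> {
      started.countDown();
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      handle.setState(State.FINISHED);
    });

    handle.setConnectionThread(conn);
    conn.start();
    try {
      started.await();
      handle.dispose();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return handle.getState();
  }

  public static void main(String[] args) {
    System.out.println(runOnce()); // prints FINISHED
  }
}
```

In this sketch, removing the `join()` call in `dispose()` would let `LOST` win whenever the simulated connection thread is slow, which is the kind of wrong final state the commit message describes.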
Showing with 165 additions and 81 deletions.
- +32 −21 core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java
- +18 −4 launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java
- +10 −8 launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java
- +9 −8 launcher/src/main/java/org/apache/spark/launcher/InProcessAppHandle.java
- +9 −9 launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java
- +40 −9 launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
- +35 −7 launcher/src/test/java/org/apache/spark/launcher/BaseSuite.java
- +9 −14 launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
- +3 −1 resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala