-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33146][CORE] Check for non-fatal errors when loading new applications in SHS #30037
[SPARK-33146][CORE] Check for non-fatal errors when loading new applications in SHS #30037
Conversation
@@ -538,6 +538,9 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) | |||
reader.fileSizeForLastIndex > 0 | |||
} catch { | |||
case _: FileNotFoundException => false | |||
case NonFatal(e) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I roughly remember this was pointed out earlier, but we wanted more elegant code and during finding we forgot the actual issue. Happy to see it fixed finally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the new test fails in CI. Could you please check this?
[info] - SPARK-33146: don't let one bad rolling log folder prevent loading other applications *** FAILED *** (560 milliseconds)
[info] 1 did not equal 2 (FsHistoryProviderSuite.scala:1513)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info] at org.apache.spark.deploy.history.FsHistoryProviderSuite.$anonfun$new$114(FsHistoryProviderSuite.scala:1513)
[info] at org.apache.spark.deploy.history.FsHistoryProviderSuite.$anonfun$new$114$adapted(FsHistoryProviderSuite.scala:1479)
[info] at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:188)
[info] at org.apache.spark.deploy.history.FsHistoryProviderSuite.$anonfun$new$113(FsHistoryProviderSuite.scala:1479)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181)
[info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
[info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232)
[info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info] at scala.collection.immutable.List.foreach(List.scala:392)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231)
[info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562)
[info] at org.scalatest.Suite.run(Suite.scala:1112)
[info] at org.scalatest.Suite.run$(Suite.scala:1094)
[info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235)
[info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
[info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
[info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:304)
[info] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info] at java.lang.Thread.run(Thread.java:748)
Yep had a mistake in the new test sorry! |
ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 when Jenkins passes
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #129776 has finished for PR 30037 at commit
|
…cations in SHS ### What changes were proposed in this pull request? Adds an additional check for non-fatal errors when attempting to add a new entry to the history server application listing. ### Why are the changes needed? A bad rolling event log folder (missing appstatus file or no log files) would cause no applications to be loaded by the Spark history server. Figuring out why invalid event log folders are created in the first place will be addressed in separate issues, this just lets the history server skip the invalid folder and successfully load all the valid applications. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT Closes #30037 from Kimahriman/bug/rolling-log-crashing-history. Authored-by: Adam Binford <adam.binford@radiantsolutions.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> (cherry picked from commit 9ab0ec4) Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Thanks @Kimahriman for your great effort! Merged into |
Hi, @Kimahriman and @HeartSaVioR . |
I made a PR to fix the import error first. |
It turns out the UT is invalid in branch-3.0, too. I'll revert it from
|
Please make a PR for |
Thanks for catching the issue and sorry to miss the divergence between branches before. Looks like the root cause is missing commit in branch-3.0: SPARK-32557 was a kind of follow-up for SPARK-32529, but while SPARK-32529 was landed to both master and branch-3.0, SPARK-32557 was only landed to master. Instead of keeping unnecessary divergence between branches, I'd propose to port back SPARK-32557 to branch-3.0, and re-introduce this to branch-3.0. I expect no change will be needed after injecting SPARK-32557, but I'll do manual check. WDYT? |
Thanks for sorting that out @HeartSaVioR. I had manually backported that commit for my local build to originally trying to fix this issue. |
What changes were proposed in this pull request?
Adds an additional check for non-fatal errors when attempting to add a new entry to the history server application listing.
Why are the changes needed?
A bad rolling event log folder (missing appstatus file or no log files) would cause no applications to be loaded by the Spark history server. Figuring out why invalid event log folders are created in the first place will be addressed in separate issues, this just lets the history server skip the invalid folder and successfully load all the valid applications.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New UT