
[SPARK-33146][CORE] Check for non-fatal errors when loading new applications in SHS #30037

Closed
wants to merge 2 commits into master from Kimahriman:bug/rolling-log-crashing-history

Conversation

Kimahriman
Contributor

What changes were proposed in this pull request?

Adds an additional check for non-fatal errors when attempting to add a new entry to the history server application listing.
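For readers unfamiliar with the pattern, here is a minimal standalone sketch (illustrative only, not the FsHistoryProvider code) of how Scala's `scala.util.control.NonFatal` extractor lets a scan skip one unreadable entry while still allowing fatal errors such as `OutOfMemoryError` to propagate:

```scala
import scala.util.control.NonFatal

object NonFatalDemo extends App {
  // Try the read; on any non-fatal failure, skip the entry instead of aborting the scan.
  def entrySize(read: () => Long): Option[Long] =
    try Some(read()) catch {
      case NonFatal(e) =>
        println(s"Skipping unreadable entry: ${e.getMessage}")
        None
    }

  println(entrySize(() => 42L))                                             // Some(42)
  println(entrySize(() => throw new IllegalStateException("no appstatus"))) // None
  // Fatal errors (e.g. OutOfMemoryError) are NOT matched by NonFatal and
  // would still propagate out of entrySize.
}
```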

Why are the changes needed?

A bad rolling event log folder (missing appstatus file or no log files) would cause no applications to be loaded by the Spark history server. Figuring out why invalid event log folders are created in the first place will be addressed in separate issues; this change just lets the history server skip the invalid folder and successfully load all the valid applications.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New UT

@@ -538,6 +538,9 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
reader.fileSizeForLastIndex > 0
} catch {
case _: FileNotFoundException => false
case NonFatal(e) =>
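The inline diff is truncated here. Based on the description above, the new `NonFatal` case presumably logs the failure and returns `false` so the unreadable entry is skipped; a hedged sketch of that handler follows (the exact log message and fields are assumptions, not the verbatim commit):

```scala
// Sketch of the completed catch block around reader.fileSizeForLastIndex
// (assumed shape; the merged commit may differ in wording).
try {
  reader.fileSizeForLastIndex > 0
} catch {
  case _: FileNotFoundException => false
  case NonFatal(e) =>
    // Log and treat the entry as not ready, so one bad rolling event log
    // folder does not abort the whole application listing scan.
    logWarning(s"Error while reading new log ${reader.rootPath}", e)
    false
}
```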
Contributor

I roughly remember this was pointed out earlier, but we wanted more elegant code and, while looking for it, forgot about the actual issue. Happy to see it finally fixed.

@HeartSaVioR
Contributor

+1

@HeartSaVioR
Contributor

Looks like the new test fails in CI. Could you please check this?

[info] - SPARK-33146: don't let one bad rolling log folder prevent loading other applications *** FAILED *** (560 milliseconds)
[info]   1 did not equal 2 (FsHistoryProviderSuite.scala:1513)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.deploy.history.FsHistoryProviderSuite.$anonfun$new$114(FsHistoryProviderSuite.scala:1513)
[info]   at org.apache.spark.deploy.history.FsHistoryProviderSuite.$anonfun$new$114$adapted(FsHistoryProviderSuite.scala:1479)
[info]   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:188)
[info]   at org.apache.spark.deploy.history.FsHistoryProviderSuite.$anonfun$new$113(FsHistoryProviderSuite.scala:1479)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562)
[info]   at org.scalatest.Suite.run(Suite.scala:1112)
[info]   at org.scalatest.Suite.run$(Suite.scala:1094)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:304)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)

@Kimahriman
Contributor Author

Yep, had a mistake in the new test, sorry!

@HeartSaVioR
Contributor

ok to test

@HeartSaVioR
Contributor

+1 when Jenkins passes

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34382/

@SparkQA

SparkQA commented Oct 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34382/

@SparkQA

SparkQA commented Oct 15, 2020

Test build #129776 has finished for PR 30037 at commit 0efba0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HeartSaVioR pushed a commit that referenced this pull request Oct 15, 2020
[SPARK-33146][CORE] Check for non-fatal errors when loading new applications in SHS

### What changes were proposed in this pull request?

Adds an additional check for non-fatal errors when attempting to add a new entry to the history server application listing.

### Why are the changes needed?

A bad rolling event log folder (missing appstatus file or no log files) would cause no applications to be loaded by the Spark history server. Figuring out why invalid event log folders are created in the first place will be addressed in separate issues; this change just lets the history server skip the invalid folder and successfully load all the valid applications.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT

Closes #30037 from Kimahriman/bug/rolling-log-crashing-history.

Authored-by: Adam Binford <adam.binford@radiantsolutions.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(cherry picked from commit 9ab0ec4)
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
@HeartSaVioR
Contributor

Thanks @Kimahriman for your great effort! Merged into master and branch-3.0.

@dongjoon-hyun
Member

Hi, @Kimahriman and @HeartSaVioR.
This seems to break branch-3.0. Could you take a look at that?

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 15, 2020

It turns out the UT is invalid in branch-3.0, too. I'll revert it from branch-3.0.

[info] - SPARK-33146: don't let one bad rolling log folder prevent loading other applications *** FAILED *** (68 milliseconds)
[info]   1 did not equal 0 (FsHistoryProviderSuite.scala:1499)
[info]   org.scalatest.exceptions.TestFailedException:

@dongjoon-hyun
Member

Please make a PR for branch-3.0 with a new valid test case and pass the CI, @Kimahriman and @HeartSaVioR.

@HeartSaVioR
Contributor

HeartSaVioR commented Oct 15, 2020

Thanks for catching the issue, and sorry for missing the divergence between branches.

Looks like the root cause is a missing commit in branch-3.0: SPARK-32557 was a kind of follow-up to SPARK-32529, but while SPARK-32529 landed in both master and branch-3.0, SPARK-32557 only landed in master.

Instead of keeping unnecessary divergence between branches, I'd propose porting SPARK-32557 back to branch-3.0 and then re-introducing this change there. I expect no change will be needed once SPARK-32557 is in place, but I'll do a manual check. WDYT?

@Kimahriman
Contributor Author

Thanks for sorting that out, @HeartSaVioR. I had manually backported that commit for my local build when originally trying to fix this issue.
