Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32024][WEBUI] Update ApplicationStoreInfo.size during HistoryServerDiskManager initializing #28859

Conversation

zhli1142015
Copy link
Contributor

@zhli1142015 zhli1142015 commented Jun 18, 2020

What changes were proposed in this pull request?

Update ApplicationStoreInfo.size to real size during HistoryServerDiskManager initializing.

Why are the changes needed?

This PR is for fixing bug 32024. We found after history server restart, below error would randomly happen: "java.lang.IllegalStateException: Disk usage tracker went negative (now = -, delta = -)" from HistoryServerDiskManager.
Capture

Cause: Reading data from level db would trigger table file compaction, which may also trigger size of level db directory changes. This size change may not be recorded in LevelDB (ApplicationStoreInfo in listing). When service restarts, currentUsage is calculated from real directory size, but ApplicationStoreInfo are loaded from leveldb, then currentUsage may be less then sum of ApplicationStoreInfo.size. In makeRoom() function, ApplicationStoreInfo.size is used to update usage. Then currentUsage becomes negative after several round of release() and lease() (makeRoom()).
Reproduce: we can reproduce this issue in dev environment by reducing config value of "spark.history.retainedApplications" and "spark.history.store.maxDiskUsage" to some small values. Here are steps: 1. start history server, load some applications and access some pages (maybe "stages" page to trigger leveldb compaction). 2. restart HS, and refresh pages.
I also added an UT to simulate this case in HistoryServerDiskManagerSuite.
Benefit: this change would help improve history server reliability.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add unit test and manually tested it.

@zhli1142015
Copy link
Contributor Author

@jiangxb1987 , @dongjoon-hyun , @ueshin , Could you please help to review this?

@gatorsmile
Copy link
Member

cc @gengliangwang

@@ -83,6 +83,20 @@ private class HistoryServerDiskManager(
listing.delete(info.getClass(), info.path)
}

// Reading level db would trigger table file compaction, then it may cause size of level db
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could apply "partition" instead of filter in L78, so that we only call iterate with view once and retrieve two lists (orphans, existing) - L90 can use existing to avoid calling view again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HeartSaVioR , thank you for your comments, updated it.

@@ -158,4 +158,26 @@ class HistoryServerDiskManagerSuite extends SparkFunSuite with BeforeAndAfter {
assert(manager.approximateSize(50L, true) > 50L)
}

test ("SPARK-32024: update ApplicationStoreInfo.size during initializing") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no space between test and (

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, updated.

val leaseB = manager1.lease(2)
assert(manager1.free() === 1)
leaseB.rollback()
assert(manager1.free() === 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to simulate another case: actual directory size is bigger than appinfo - some cache entities can be written after replaying, which "might" affect the size.
(That said, the mismatch between appinfo and actual dir size also occurs during run, though it may not matter unless there's no reflection.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is a valid case, i added it in my UT.

@zhli1142015
Copy link
Contributor Author

Gently ping, @HeartSaVioR , could you please help review again?
@jiangxb1987 , @dongjoon-hyun , @ueshin , @gengliangwang , could you please help to review?

doReturn(3L).when(manager2).sizeOf(meq(new File(testDir, "apps")))
manager2.initialize()
assert(manager2.free() === 0)
val leaseC = manager2.lease(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to explicitly mention it leads eviction on cached entities, hence free() goes to 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated

doReturn(2L).when(manager1).sizeOf(meq(leaseB.tmpPath))
val dstB = leaseB.commit("app2", None)
assert(manager1.committed() === 2)
val manager2 = mockManager()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some empty lines in overall to bring readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated

doReturn(3L).when(manager).sizeOf(meq(leaseA.tmpPath))
val dstA = leaseA.commit("app1", None)
assert(manager.free() === 0)
assert(manager.committed() === 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we track and verify ApplicationStoreInfo in store as well? I see the test represents how HistoryServerDiskManager fails without the patch, but it would be pretty much intuitive if we also verify what we've changed directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, updated.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhli1142015
Copy link
Contributor Author

Gently ping, @jiangxb1987 , @dongjoon-hyun , @ueshin , @gengliangwang , could you please help to review?

@gengliangwang
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jul 2, 2020

Test build #124847 has finished for PR 28859 at commit b9057ec.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 2, 2020

Test build #124897 has finished for PR 28859 at commit b9057ec.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

retest this, please

1 similar comment
@zhli1142015
Copy link
Contributor Author

retest this, please

val orphans = listing.view(classOf[ApplicationStoreInfo]).asScala.filter { info =>
!new File(info.path).exists()
}.toSeq
val (existing, orphans) = listing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: existing => existences

// directory changed. When service restarts, "currentUsage" is calculated from real directory
// size. Update "ApplicationStoreInfo.size" to ensure "currentUsage" equals
// sum of "ApplicationStoreInfo.size".
val changedStoreInfo = existing.filter { info =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we can write this in one loop:

existing.foreach { info =>
  val fileSize = sizeOf(new File(info.path)
  if (fileSize != info.size) {
    listing.write(info.copy(size = fileSize))
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, updated.

@SparkQA
Copy link

SparkQA commented Jul 5, 2020

Test build #124947 has finished for PR 28859 at commit ed05872.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

retest this, please

@HeartSaVioR
Copy link
Contributor

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125267 has finished for PR 28859 at commit ed05872.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Jul 8, 2020

The PR builders look to be quite unstable, though all the test failures are not related to the change.

build 125054

org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)

build 125079

org.apache.spark.sql.hive.thriftserver.CliSuite.xml

build 125250

org.apache.spark.sql.kafka010.KafkaRelationSuiteV1.timestamp provided for starting, ending not provided
org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)

build 125267

org.apache.spark.sql.execution.DataSourceV2ScanExecRedactionSuite.SPARK-30362: test input metrics for DSV2
org.apache.spark.sql.hive.thriftserver.CliSuite.Pad Decimal numbers with trailing zeros to the scale of the column

There's no consistent failure on specific test in these builds, which represents the change doesn't break the test.

@HeartSaVioR
Copy link
Contributor

Let me trigger the test once more. If that also fails with flaky test we'd probably be better to just merge.

@HeartSaVioR
Copy link
Contributor

retest this, please

@SparkQA
Copy link

SparkQA commented Jul 8, 2020

Test build #125313 has finished for PR 28859 at commit ed05872.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

build 125313

org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)

@zhli1142015
Copy link
Contributor Author

Thank you for your helping @HeartSaVioR . cc @gengliangwang , is it ok to merge this PR, or I need to run test again?

HeartSaVioR pushed a commit that referenced this pull request Jul 8, 2020
…erverDiskManager initializing

### What changes were proposed in this pull request?

Update ApplicationStoreInfo.size to real size during HistoryServerDiskManager initializing.

### Why are the changes needed?

This PR is for fixing bug [32024](https://issues.apache.org/jira/browse/SPARK-32024). We found after history server restart, below error would randomly happen: "java.lang.IllegalStateException: Disk usage tracker went negative (now = -***, delta = -***)" from `HistoryServerDiskManager`.
![Capture](https://user-images.githubusercontent.com/10524738/85034468-fda4ae80-b136-11ea-9011-f0c3e6508002.JPG)

**Cause**: Reading data from level db would trigger table file compaction, which may also trigger size of level db directory changes.  This size change may not be recorded in LevelDB (`ApplicationStoreInfo` in `listing`). When service restarts, `currentUsage` is calculated from real directory size, but `ApplicationStoreInfo` are loaded from leveldb, then `currentUsage` may be less then sum of `ApplicationStoreInfo.size`. In `makeRoom()` function, `ApplicationStoreInfo.size` is used to update usage. Then `currentUsage` becomes negative after several round of `release()` and `lease()` (`makeRoom()`).
**Reproduce**: we can reproduce this issue in dev environment by reducing config value of "spark.history.retainedApplications" and "spark.history.store.maxDiskUsage" to some small values. Here are steps: 1. start history server, load some applications and access some pages (maybe "stages" page to trigger leveldb compaction). 2. restart HS, and refresh pages.
I also added an UT to simulate this case in `HistoryServerDiskManagerSuite`.
**Benefit**: this change would help improve history server reliability.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add unit test and manually tested it.

Closes #28859 from zhli1142015/update-ApplicationStoreInfo.size-during-disk-manager-initialize.

Authored-by: Zhen Li <zhli@microsoft.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(cherry picked from commit 8e7fc04)
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
HeartSaVioR pushed a commit that referenced this pull request Jul 8, 2020
…erverDiskManager initializing

### What changes were proposed in this pull request?

Update ApplicationStoreInfo.size to real size during HistoryServerDiskManager initializing.

### Why are the changes needed?

This PR is for fixing bug [32024](https://issues.apache.org/jira/browse/SPARK-32024). We found after history server restart, below error would randomly happen: "java.lang.IllegalStateException: Disk usage tracker went negative (now = -***, delta = -***)" from `HistoryServerDiskManager`.
![Capture](https://user-images.githubusercontent.com/10524738/85034468-fda4ae80-b136-11ea-9011-f0c3e6508002.JPG)

**Cause**: Reading data from level db would trigger table file compaction, which may also trigger size of level db directory changes.  This size change may not be recorded in LevelDB (`ApplicationStoreInfo` in `listing`). When service restarts, `currentUsage` is calculated from real directory size, but `ApplicationStoreInfo` are loaded from leveldb, then `currentUsage` may be less then sum of `ApplicationStoreInfo.size`. In `makeRoom()` function, `ApplicationStoreInfo.size` is used to update usage. Then `currentUsage` becomes negative after several round of `release()` and `lease()` (`makeRoom()`).
**Reproduce**: we can reproduce this issue in dev environment by reducing config value of "spark.history.retainedApplications" and "spark.history.store.maxDiskUsage" to some small values. Here are steps: 1. start history server, load some applications and access some pages (maybe "stages" page to trigger leveldb compaction). 2. restart HS, and refresh pages.
I also added an UT to simulate this case in `HistoryServerDiskManagerSuite`.
**Benefit**: this change would help improve history server reliability.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add unit test and manually tested it.

Closes #28859 from zhli1142015/update-ApplicationStoreInfo.size-during-disk-manager-initialize.

Authored-by: Zhen Li <zhli@microsoft.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(cherry picked from commit 8e7fc04)
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
@HeartSaVioR
Copy link
Contributor

Thanks @zhli1142015 for the contribution! Merged into master, branch-3.0, branch-2.4 respectively.

@HyukjinKwon
Copy link
Member

Hey @HeartSaVioR, let's don't merge PRs when tests are being failed.

@HeartSaVioR
Copy link
Contributor

Oh OK. Thanks for noticing. I hope we'd be a bit flexible and pragmatic, but I also sincerely respect the rule and policy. I'll follow it in next time.

@HyukjinKwon
Copy link
Member

Honestly, I do merge too time to time e.g.) when there are some reasons or when the Jenkins build itself is completely orthogonal such as fixing AppVeyor CI issue .. but should be best to stick to the standard in general :-).

@dongjoon-hyun
Copy link
Member

Hi, All. This breaks branch-2.4 UT. I didn't take a look at the other branches.

- SPARK-32024: update ApplicationStoreInfo.size during initializing *** FAILED ***
  0 did not equal 3 (HistoryServerDiskManagerSuite.scala:168)

@dongjoon-hyun
Copy link
Member

Could you recover this please via a follow-up or a revert, @HeartSaVioR ?

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Jul 8, 2020

Thanks for sharing the test failure, @dongjoon-hyun . I took a quick look, and realized 245aee9 was landed only on master branch (as it's an improvement) which got rid of size call in destination path, whereas branch-3.0 and branch-2.4 don't.

Below is the commit I fixed the test in branch-3.0 (confirmed test passed):

HeartSaVioR@6a36c9c

and I also confirmed the test pass if we cherry pick 245aee9 to branch-3.0:

HeartSaVioR@c8c9531

I'm now applying the fix to branch-2.4 - IDE requires rebuild on switching 3.x vs 2.x so it'd take time, but the mechanism of the fix should be applied to 2.4 as well.

Would we like to go with hotfix, or cherry pick 245aee9?

@HeartSaVioR
Copy link
Contributor

Probably better to land the hotfix into master as well if we go with hotfix, as it looks to be safer assumption rather than depending on 245aee9.

@zhli1142015
Copy link
Contributor Author

Thanks @HeartSaVioR and @dongjoon-hyun . i can reproduce it on branch 3.0 also, the cause is same as @HeartSaVioR mentioned, commit 245aee9 is not in branch 2.4 and 3.0. how about fix test case "SPARK-32024: update ApplicationStoreInfo.size during initializing" in default , 2.4 and 3.0, let it only covers this change?

@HeartSaVioR
Copy link
Contributor

I'll submit a hotfix patch against master (can cherry-pick to broken branches) once I confirmed my fix works for branch-2.4 as well (it should work with master as well, as it "adds" the when statement).

@dongjoon-hyun
Copy link
Member

Thank you, @HeartSaVioR !

@HeartSaVioR
Copy link
Contributor

OK confirmed the hotfix works in master and branch-2.4 as well. Will submit a hotfix soon.

HyukjinKwon pushed a commit that referenced this pull request Jul 9, 2020
…hen statements

### What changes were proposed in this pull request?

This patch fixes the test failure due to the missing when statements for destination path. Note that it didn't fail on master branch, because 245aee9 got rid of size call in destination path, but still good to not depend on 245aee9.

### Why are the changes needed?

The build against branch-3.0 / branch-2.4 starts to fail after merging SPARK-32024 (#28859) and this patch will fix it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran modified UT against master / branch-3.0 / branch-2.4.

Closes #29046 from HeartSaVioR/QUICKFIX-SPARK-32024.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Jul 9, 2020
…hen statements

### What changes were proposed in this pull request?

This patch fixes the test failure due to the missing when statements for destination path. Note that it didn't fail on master branch, because 245aee9 got rid of size call in destination path, but still good to not depend on 245aee9.

### Why are the changes needed?

The build against branch-3.0 / branch-2.4 starts to fail after merging SPARK-32024 (#28859) and this patch will fix it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran modified UT against master / branch-3.0 / branch-2.4.

Closes #29046 from HeartSaVioR/QUICKFIX-SPARK-32024.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 161cf2a)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Jul 9, 2020
…hen statements

### What changes were proposed in this pull request?

This patch fixes the test failure due to the missing when statements for destination path. Note that it didn't fail on master branch, because 245aee9 got rid of size call in destination path, but still good to not depend on 245aee9.

### Why are the changes needed?

The build against branch-3.0 / branch-2.4 starts to fail after merging SPARK-32024 (#28859) and this patch will fix it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran modified UT against master / branch-3.0 / branch-2.4.

Closes #29046 from HeartSaVioR/QUICKFIX-SPARK-32024.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 161cf2a)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants