New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32024][WEBUI] Update ApplicationStoreInfo.size during HistoryServerDiskManager initializing #28859
Conversation
@jiangxb1987 , @dongjoon-hyun , @ueshin , Could you please help to review this? |
@@ -83,6 +83,20 @@ private class HistoryServerDiskManager( | |||
listing.delete(info.getClass(), info.path) | |||
} | |||
|
|||
// Reading level db would trigger table file compaction, then it may cause size of level db |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could apply "partition" instead of filter in L78, so that we only call iterate with view once and retrieve two lists (orphans, existing) - L90 can use existing to avoid calling view again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HeartSaVioR , thank you for your comments, updated it.
@@ -158,4 +158,26 @@ class HistoryServerDiskManagerSuite extends SparkFunSuite with BeforeAndAfter { | |||
assert(manager.approximateSize(50L, true) > 50L) | |||
} | |||
|
|||
test ("SPARK-32024: update ApplicationStoreInfo.size during initializing") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no space between test
and (
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, updated.
val leaseB = manager1.lease(2) | ||
assert(manager1.free() === 1) | ||
leaseB.rollback() | ||
assert(manager1.free() === 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to simulate another case: actual directory size is bigger than appinfo - some cache entities can be written after replaying, which "might" affect the size.
(That said, the mismatch between appinfo and actual dir size also occurs during run, though it may not matter unless there's no reflection.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a valid case, i added it in my UT.
…nStoreInfo.size-during-disk-manager-initialize
Gently ping, @HeartSaVioR , could you please help review again? |
doReturn(3L).when(manager2).sizeOf(meq(new File(testDir, "apps"))) | ||
manager2.initialize() | ||
assert(manager2.free() === 0) | ||
val leaseC = manager2.lease(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to explicitly mention it leads eviction on cached entities, hence free() goes to 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated
doReturn(2L).when(manager1).sizeOf(meq(leaseB.tmpPath)) | ||
val dstB = leaseB.commit("app2", None) | ||
assert(manager1.committed() === 2) | ||
val manager2 = mockManager() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some empty lines in overall to bring readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated
doReturn(3L).when(manager).sizeOf(meq(leaseA.tmpPath)) | ||
val dstA = leaseA.commit("app1", None) | ||
assert(manager.free() === 0) | ||
assert(manager.committed() === 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we track and verify ApplicationStoreInfo in store as well? I see the test represents how HistoryServerDiskManager fails without the patch, but it would be pretty much intuitive if we also verify what we've changed directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Gently ping, @jiangxb1987 , @dongjoon-hyun , @ueshin , @gengliangwang , could you please help to review? |
ok to test |
Test build #124847 has finished for PR 28859 at commit
|
retest this, please |
Test build #124897 has finished for PR 28859 at commit
|
retest this, please |
1 similar comment
retest this, please |
val orphans = listing.view(classOf[ApplicationStoreInfo]).asScala.filter { info => | ||
!new File(info.path).exists() | ||
}.toSeq | ||
val (existing, orphans) = listing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: existing => existences
// directory changed. When service restarts, "currentUsage" is calculated from real directory | ||
// size. Update "ApplicationStoreInfo.size" to ensure "currentUsage" equals | ||
// sum of "ApplicationStoreInfo.size". | ||
val changedStoreInfo = existing.filter { info => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we can write this in one loop:
existing.foreach { info =>
val fileSize = sizeOf(new File(info.path)
if (fileSize != info.size) {
listing.write(info.copy(size = fileSize))
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, updated.
Test build #124947 has finished for PR 28859 at commit
|
retest this, please |
retest this, please |
Test build #125267 has finished for PR 28859 at commit
|
The PR builders look to be quite unstable, though all the test failures are not related to the change.
org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)
org.apache.spark.sql.hive.thriftserver.CliSuite.xml
org.apache.spark.sql.kafka010.KafkaRelationSuiteV1.timestamp provided for starting, ending not provided
org.apache.spark.sql.execution.DataSourceV2ScanExecRedactionSuite.SPARK-30362: test input metrics for DSV2 There's no consistent failure on specific test in these builds, which represents the change doesn't break the test. |
Let me trigger the test once more. If that also fails with flaky test we'd probably be better to just merge. |
retest this, please |
Test build #125313 has finished for PR 28859 at commit
|
org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector) |
Thank you for your helping @HeartSaVioR . cc @gengliangwang , is it ok to merge this PR, or I need to run test again? |
…erverDiskManager initializing ### What changes were proposed in this pull request? Update ApplicationStoreInfo.size to real size during HistoryServerDiskManager initializing. ### Why are the changes needed? This PR is for fixing bug [32024](https://issues.apache.org/jira/browse/SPARK-32024). We found after history server restart, below error would randomly happen: "java.lang.IllegalStateException: Disk usage tracker went negative (now = -***, delta = -***)" from `HistoryServerDiskManager`. ![Capture](https://user-images.githubusercontent.com/10524738/85034468-fda4ae80-b136-11ea-9011-f0c3e6508002.JPG) **Cause**: Reading data from level db would trigger table file compaction, which may also trigger size of level db directory changes. This size change may not be recorded in LevelDB (`ApplicationStoreInfo` in `listing`). When service restarts, `currentUsage` is calculated from real directory size, but `ApplicationStoreInfo` are loaded from leveldb, then `currentUsage` may be less then sum of `ApplicationStoreInfo.size`. In `makeRoom()` function, `ApplicationStoreInfo.size` is used to update usage. Then `currentUsage` becomes negative after several round of `release()` and `lease()` (`makeRoom()`). **Reproduce**: we can reproduce this issue in dev environment by reducing config value of "spark.history.retainedApplications" and "spark.history.store.maxDiskUsage" to some small values. Here are steps: 1. start history server, load some applications and access some pages (maybe "stages" page to trigger leveldb compaction). 2. restart HS, and refresh pages. I also added an UT to simulate this case in `HistoryServerDiskManagerSuite`. **Benefit**: this change would help improve history server reliability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add unit test and manually tested it. Closes #28859 from zhli1142015/update-ApplicationStoreInfo.size-during-disk-manager-initialize. Authored-by: Zhen Li <zhli@microsoft.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> (cherry picked from commit 8e7fc04) Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
…erverDiskManager initializing ### What changes were proposed in this pull request? Update ApplicationStoreInfo.size to real size during HistoryServerDiskManager initializing. ### Why are the changes needed? This PR is for fixing bug [32024](https://issues.apache.org/jira/browse/SPARK-32024). We found after history server restart, below error would randomly happen: "java.lang.IllegalStateException: Disk usage tracker went negative (now = -***, delta = -***)" from `HistoryServerDiskManager`. ![Capture](https://user-images.githubusercontent.com/10524738/85034468-fda4ae80-b136-11ea-9011-f0c3e6508002.JPG) **Cause**: Reading data from level db would trigger table file compaction, which may also trigger size of level db directory changes. This size change may not be recorded in LevelDB (`ApplicationStoreInfo` in `listing`). When service restarts, `currentUsage` is calculated from real directory size, but `ApplicationStoreInfo` are loaded from leveldb, then `currentUsage` may be less then sum of `ApplicationStoreInfo.size`. In `makeRoom()` function, `ApplicationStoreInfo.size` is used to update usage. Then `currentUsage` becomes negative after several round of `release()` and `lease()` (`makeRoom()`). **Reproduce**: we can reproduce this issue in dev environment by reducing config value of "spark.history.retainedApplications" and "spark.history.store.maxDiskUsage" to some small values. Here are steps: 1. start history server, load some applications and access some pages (maybe "stages" page to trigger leveldb compaction). 2. restart HS, and refresh pages. I also added an UT to simulate this case in `HistoryServerDiskManagerSuite`. **Benefit**: this change would help improve history server reliability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add unit test and manually tested it. Closes #28859 from zhli1142015/update-ApplicationStoreInfo.size-during-disk-manager-initialize. Authored-by: Zhen Li <zhli@microsoft.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> (cherry picked from commit 8e7fc04) Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Thanks @zhli1142015 for the contribution! Merged into master, branch-3.0, branch-2.4 respectively. |
Hey @HeartSaVioR, let's don't merge PRs when tests are being failed. |
Oh OK. Thanks for noticing. I hope we'd be a bit flexible and pragmatic, but I also sincerely respect the rule and policy. I'll follow it in next time. |
Honestly, I do merge too time to time e.g.) when there are some reasons or when the Jenkins build itself is completely orthogonal such as fixing AppVeyor CI issue .. but should be best to stick to the standard in general :-). |
Hi, All. This breaks
|
Could you recover this please via a follow-up or a revert, @HeartSaVioR ? |
It turns out this breaks |
Thanks for sharing the test failure, @dongjoon-hyun . I took a quick look, and realized 245aee9 was landed only on master branch (as it's an improvement) which got rid of size call in destination path, whereas branch-3.0 and branch-2.4 don't. Below is the commit I fixed the test in branch-3.0 (confirmed test passed): and I also confirmed the test pass if we cherry pick 245aee9 to branch-3.0: I'm now applying the fix to branch-2.4 - IDE requires rebuild on switching 3.x vs 2.x so it'd take time, but the mechanism of the fix should be applied to 2.4 as well. Would we like to go with hotfix, or cherry pick 245aee9? |
Probably better to land the hotfix into master as well if we go with hotfix, as it looks to be safer assumption rather than depending on 245aee9. |
Thanks @HeartSaVioR and @dongjoon-hyun . i can reproduce it on branch 3.0 also, the cause is same as @HeartSaVioR mentioned, commit 245aee9 is not in branch 2.4 and 3.0. how about fix test case "SPARK-32024: update ApplicationStoreInfo.size during initializing" in default , 2.4 and 3.0, let it only covers this change? |
I'll submit a hotfix patch against master (can cherry-pick to broken branches) once I confirmed my fix works for branch-2.4 as well (it should work with master as well, as it "adds" the when statement). |
Thank you, @HeartSaVioR ! |
OK confirmed the hotfix works in master and branch-2.4 as well. Will submit a hotfix soon. |
…hen statements ### What changes were proposed in this pull request? This patch fixes the test failure due to the missing when statements for destination path. Note that it didn't fail on master branch, because 245aee9 got rid of size call in destination path, but still good to not depend on 245aee9. ### Why are the changes needed? The build against branch-3.0 / branch-2.4 starts to fail after merging SPARK-32024 (#28859) and this patch will fix it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran modified UT against master / branch-3.0 / branch-2.4. Closes #29046 from HeartSaVioR/QUICKFIX-SPARK-32024. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…hen statements ### What changes were proposed in this pull request? This patch fixes the test failure due to the missing when statements for destination path. Note that it didn't fail on master branch, because 245aee9 got rid of size call in destination path, but still good to not depend on 245aee9. ### Why are the changes needed? The build against branch-3.0 / branch-2.4 starts to fail after merging SPARK-32024 (#28859) and this patch will fix it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran modified UT against master / branch-3.0 / branch-2.4. Closes #29046 from HeartSaVioR/QUICKFIX-SPARK-32024. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 161cf2a) Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…hen statements ### What changes were proposed in this pull request? This patch fixes the test failure due to the missing when statements for destination path. Note that it didn't fail on master branch, because 245aee9 got rid of size call in destination path, but still good to not depend on 245aee9. ### Why are the changes needed? The build against branch-3.0 / branch-2.4 starts to fail after merging SPARK-32024 (#28859) and this patch will fix it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran modified UT against master / branch-3.0 / branch-2.4. Closes #29046 from HeartSaVioR/QUICKFIX-SPARK-32024. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 161cf2a) Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Update ApplicationStoreInfo.size to real size during HistoryServerDiskManager initializing.
Why are the changes needed?
This PR is for fixing bug 32024. We found after history server restart, below error would randomly happen: "java.lang.IllegalStateException: Disk usage tracker went negative (now = -, delta = -)" from
HistoryServerDiskManager
.Cause: Reading data from level db would trigger table file compaction, which may also trigger size of level db directory changes. This size change may not be recorded in LevelDB (
ApplicationStoreInfo
inlisting
). When service restarts,currentUsage
is calculated from real directory size, butApplicationStoreInfo
are loaded from leveldb, thencurrentUsage
may be less then sum ofApplicationStoreInfo.size
. InmakeRoom()
function,ApplicationStoreInfo.size
is used to update usage. ThencurrentUsage
becomes negative after several round ofrelease()
andlease()
(makeRoom()
).Reproduce: we can reproduce this issue in dev environment by reducing config value of "spark.history.retainedApplications" and "spark.history.store.maxDiskUsage" to some small values. Here are steps: 1. start history server, load some applications and access some pages (maybe "stages" page to trigger leveldb compaction). 2. restart HS, and refresh pages.
I also added an UT to simulate this case in
HistoryServerDiskManagerSuite
.Benefit: this change would help improve history server reliability.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Add unit test and manually tested it.