[SPARK-29043][Core] Improve the concurrent performance of History Server #25797

turboFei · 2019-09-15T14:58:05Z

What changes were proposed in this pull request?

Even when we set spark.history.fs.numReplayThreads to a large number, such as 30, the history server still replays logs slowly.
We found that if there is a straggler in a batch of replay tasks, all the other threads wait for this straggler.

In this PR, we introduce processing, which records the logs that are currently being replayed, so that the replay tasks can execute asynchronously.
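
For illustration only, the idea described above can be sketched roughly as follows. This is a minimal sketch and not the code in this PR: the object name ReplaySketch, the pool size, and the helpers submitReplay, isProcessing and shutdownAndWait are hypothetical. It assumes the in-flight logs are tracked in a concurrent set, so each replay can be submitted on its own and no task has to wait for the slowest one in a batch.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

// Hypothetical stand-in for the scheduling idea in the description; not Spark's code.
object ReplaySketch {
  // Paths of the logs whose replay is currently in flight.
  private val processing = ConcurrentHashMap.newKeySet[String]()
  // Assumed pool size; the real server sizes its pool from configuration.
  private val pool = Executors.newFixedThreadPool(4)

  def isProcessing(path: String): Boolean = processing.contains(path)

  /** Submits a replay for `path` unless one is already running. Never waits for other tasks. */
  def submitReplay(path: String)(replay: String => Unit): Boolean = {
    // add() returns false when the path is already tracked, so a log is not replayed twice at once.
    if (!processing.add(path)) {
      false
    } else {
      try {
        pool.submit(new Runnable {
          override def run(): Unit = {
            try replay(path)
            finally processing.remove(path) // clear the marker even if the replay throws
          }
        })
        true
      } catch {
        case e: Exception =>
          processing.remove(path) // submission failed, so nothing is in flight
          throw e
      }
    }
  }

  /** Stops accepting work and waits for in-flight replays to finish. */
  def shutdownAndWait(): Unit = {
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}
```

With this shape, a later scan can skip any path for which isProcessing returns true instead of blocking until the whole previous batch has finished.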

Why are the changes needed?

This speeds up log replay in the history server.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

dongjoon-hyun · 2019-09-15T20:24:18Z

ok to test

cc @wangyum

turboFei · 2019-09-15T22:58:35Z

retest please

HyukjinKwon · 2019-09-15T23:54:52Z

ok to test

HyukjinKwon · 2019-09-15T23:54:57Z

cc @vanzin too

HeartSaVioR

The code change looks OK, but I might be missing the effect of "listing" and "reloading applications" running concurrently. It would cause some places to be accessed from multiple threads where they weren't before, so this may need checking.

wangyum · 2019-09-16T04:23:02Z

ok to test

SparkQA · 2019-09-16T22:38:05Z

Test build #110636 has finished for PR 25797 at commit ec387f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

turboFei · 2019-09-18T16:03:26Z

gentle ping @dongjoon-hyun @HyukjinKwon

kiszk · 2019-09-19T19:50:56Z

Looks good to me.
I'd like to wait for comments from others. cc @vanzin @wangyum

turboFei · 2019-09-20T02:33:19Z

I just changed a method from protected to private in the new commit; it should not affect the test result.

SparkQA · 2019-09-20T05:05:07Z

Test build #111038 has finished for PR 25797 at commit 1c36bfe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Ngone51

Looks reasonable.

turboFei · 2019-09-20T14:29:41Z

I just added some comments and renamed a method, which should not affect the test result.

SparkQA · 2019-09-20T16:58:44Z

Test build #111074 has finished for PR 25797 at commit 12f6300.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

turboFei · 2019-09-26T03:24:32Z

@cloud-fan Could you help take a look?

kiszk · 2019-09-26T05:26:05Z

ping @vanzin @wangyum

gengliangwang · 2019-09-26T06:01:38Z

Seems good to me.
@turboFei Have you tested the changes manually?

turboFei · 2019-09-26T07:13:54Z

> Seems good to me.
> @turboFei Have you tested the changes manually?

I have tested it for a week, and it works well.

SparkQA · 2019-11-06T16:54:20Z

Test build #113321 has finished for PR 25797 at commit 611ea30.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-06T17:49:41Z

Test build #113328 has finished for PR 25797 at commit 832dd7a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-06T18:10:42Z

Test build #113330 has finished for PR 25797 at commit 8c0b532.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-06T18:57:41Z

Test build #113332 has finished for PR 25797 at commit 66e9dd1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2019-11-07T00:23:01Z

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

+      endProcessing(reader.rootPath)
+      pendingReplayTasksCount.decrementAndGet()
+
+      val isExpired = scanTime + conf.get(MAX_LOG_AGE_S) * 1000 < clock.getTimeMillis()


This is not right. Expiration is based on the time the log was last updated, not the time it was last scanned.
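
To make the distinction in this comment concrete, here is a small illustrative sketch. It is not Spark's implementation: the LogRecord type and its field names are assumptions made for this example. It only shows that an expiry check keyed on the last update can give a different answer from one keyed on the last scan.

```scala
object ExpirySketch {
  // Hypothetical, simplified record; the field names are assumptions, not Spark's LogInfo.
  final case class LogRecord(path: String, lastScannedMs: Long, lastUpdatedMs: Long)

  /** Expired when the last update is older than maxAgeMs, measured against nowMs. */
  def isExpired(log: LogRecord, maxAgeMs: Long, nowMs: Long): Boolean =
    log.lastUpdatedMs + maxAgeMs < nowMs

  /** The check questioned in the review: it uses the scan time instead of the update time. */
  def isExpiredByScanTime(log: LogRecord, maxAgeMs: Long, nowMs: Long): Boolean =
    log.lastScannedMs + maxAgeMs < nowMs
}
```

A log that was last written long ago but rescanned recently is expired under the first check and kept alive under the second, which is the case the comment is pointing at.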

vanzin · 2019-11-07T00:26:24Z

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

+
+      val isExpired = scanTime + conf.get(MAX_LOG_AGE_S) * 1000 < clock.getTimeMillis()
+      if (isExpired) {
+        listing.delete(classOf[LogInfo], reader.rootPath.toString)


You may also need to remove the application attempt that refers to the log from the listing database.

Basically you have to do what cleanLogs does, both to define whether the log is expired, and what needs to be deleted.
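
As a rough sketch of the second point, the following toy model shows what "also remove the application attempt that refers to the log" could look like. Everything here is hypothetical: Listing is an in-memory stand-in for the listing database, and Attempt, addAttempt and deleteLog are names invented for this example; it does not reproduce what cleanLogs actually does.

```scala
import scala.collection.mutable

object CleanupSketch {
  // Hypothetical attempt record that points at the event log it was built from.
  final case class Attempt(attemptId: String, logPath: String)

  // In-memory stand-in for the listing database; an assumption made for this sketch.
  final class Listing {
    private val logs = mutable.Set.empty[String]
    private val apps = mutable.Map.empty[String, List[Attempt]]

    def addAttempt(appId: String, attemptId: String, logPath: String): Unit = {
      logs += logPath
      apps(appId) = apps.getOrElse(appId, Nil) :+ Attempt(attemptId, logPath)
    }

    def hasLog(logPath: String): Boolean = logs.contains(logPath)
    def hasApp(appId: String): Boolean = apps.contains(appId)
    def attempts(appId: String): List[Attempt] = apps.getOrElse(appId, Nil)

    /** Deletes the log entry and every attempt that refers to it. */
    def deleteLog(logPath: String): Unit = {
      logs -= logPath
      // Iterate over a snapshot so the map can be updated safely inside the loop.
      for ((appId, current) <- apps.toList) {
        val remaining = current.filterNot(_.logPath == logPath)
        if (remaining.isEmpty) apps -= appId // no attempts left: drop the application too
        else if (remaining.size != current.size) apps(appId) = remaining
      }
    }
  }
}
```

Deleting only the log entry would leave an attempt in the listing that points at a log which no longer exists, which is the inconsistency the comment warns about.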

SparkQA · 2019-11-07T12:13:29Z

Test build #113379 has finished for PR 25797 at commit d5ff72c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

turboFei · 2019-11-07T12:34:15Z

retest this please.

turboFei · 2019-11-12T10:30:05Z

rebased.

SparkQA · 2019-11-12T12:26:21Z

Test build #113625 has finished for PR 25797 at commit 579c1ad.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-12T13:52:06Z

Test build #113626 has finished for PR 25797 at commit 7d1301c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-12-13T05:07:51Z

@turboFei Hi, could you address the review comments? This is good to have and seems close to being merged (according to #26416 (review) ).

turboFei · 2019-12-13T06:46:54Z

> @turboFei Hi, could you address the review comments? This is good to have and seems close to be merged (according to #26416 (review) ).

Thanks for the reminder, @HeartSaVioR. I will address them as soon as possible.

turboFei · 2019-12-13T11:05:01Z

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

+
+    log.appId.foreach { appId =>
+      val app = listing.read(classOf[ApplicationInfoWrapper], appId)
+      if (app.oldestAttempt() <= maxTime) {


Here the logic is consistent with cleanLogs().
However, I think there is an overlap between app.oldestAttempt() <= maxTime and attempt.info.lastUpdated.getTime() >= maxTime, even though it does not matter.

SparkQA · 2019-12-13T13:26:43Z

Test build #115300 has finished for PR 25797 at commit c6ac35e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-14T19:10:00Z

Test build #115334 has finished for PR 25797 at commit e9ebb6f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin

Looks ok now. Merging to master.

vanzin · 2019-12-16T22:38:54Z

core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala

@@ -1321,6 +1321,35 @@ class FsHistoryProviderSuite extends SparkFunSuite with Matchers with Logging {
    assertSerDe(serializer, attemptInfoWithIndex)
  }

+  test("SPARK-29043: clean up specified event log") {
+    val clock = new ManualClock()
+    val conf = createTestConf().set(MAX_LOG_AGE_S.key, "0").set(CLEANER_ENABLED.key, "true")


No need to use .key here.

(I'll fix this during merge.)

turboFei changed the title ~~[SPARK-29043] Improve the concurrent performance of History Server~~ [WIP][SPARK-29043] Improve the concurrent performance of History Server Sep 15, 2019

turboFei changed the title ~~[WIP][SPARK-29043] Improve the concurrent performance of History Server~~ [WIP][SPARK-29043][Core] Improve the concurrent performance of History Server Sep 15, 2019

turboFei force-pushed the SPARK-29043 branch 3 times, most recently from ff7b9b9 to fa89be3 Compare September 15, 2019 15:58

turboFei changed the title ~~[WIP][SPARK-29043][Core] Improve the concurrent performance of History Server~~ [SPARK-29043][Core] Improve the concurrent performance of History Server Sep 15, 2019

HeartSaVioR reviewed Sep 16, 2019

dongjoon-hyun added the SPARK CORE label Sep 16, 2019

kiszk reviewed Sep 16, 2019

turboFei force-pushed the SPARK-29043 branch from d939301 to ec387f3 Compare September 16, 2019 17:49

turboFei requested a review from kiszk September 17, 2019 03:05

Ngone51 reviewed Sep 20, 2019

HeartSaVioR reviewed Sep 25, 2019

gengliangwang approved these changes Sep 26, 2019

vanzin reviewed Nov 7, 2019

turboFei force-pushed the SPARK-29043 branch from d5ff72c to 579c1ad Compare November 12, 2019 10:30

vanzin reviewed Nov 19, 2019

HeartSaVioR mentioned this pull request Dec 13, 2019

[SPARK-29779][CORE] Compact old event log files and cleanup #26416

Closed

turboFei added 6 commits December 13, 2019 14:47

[SPARK-29043] Improve the concurrent performance of History Server

b6cd802

check and delete expired log

7621626

remove used code

44e3c80

fix code

8540c39

add ut

34aeab8

fix style

c6ac35e

turboFei force-pushed the SPARK-29043 branch from 7d1301c to c6ac35e Compare December 13, 2019 11:01

turboFei commented Dec 13, 2019

vanzin reviewed Dec 14, 2019

fix ut

e9ebb6f

vanzin reviewed Dec 16, 2019

vanzin closed this in 5954311 Dec 16, 2019
