[SPARK-21571][Scheduler] Spark history server leaves incomplete or unrea… #18791

ericvandenbergfb-zz · 2017-08-01T01:48:18Z

…dable history files

around forever.

Fix logic

checkForLogs excluded 0-size files so they stuck around forever.
checkForLogs / mergeApplicationListing indefinitely ignored files
that were not parseable/couldn't extract an appID, so they stuck around
forever.

Only apply above logic if spark.history.fs.cleaner.aggressive=true.

Fixed race condition in a test (SPARK-3697: ignore files that cannot be
read.) where the number of mergeApplicationListings could be more than 1
since the FsHistoryProvider would spin up an executor that also calls
checkForLogs in parallel with the test unless spark.testing=true configured.

Added unit test to cover all cases with aggressive and non-aggressive
clean up logic.

What changes were proposed in this pull request?

The spark history server doesn't clean up certain history files outside the retention window leading to thousands of such files lingering around on our servers. The log checking and clean up logic skipped 0 byte files and expired inprogress or complete history files that weren't properly parseable (not able to extract an app id or otherwise parse...) Note these files most likely appeared to due aborted jobs or earlier spark/file system driver bugs. To mitigate this, FsHistoryProvider.checkForLogs now internally identifies these untracked files and will remove them if they expire outside the cleaner retention window.

This is currently controlled via configuration spark.history.fs.cleaner.aggressive=true to perform more aggressive cleaning.

How was this patch tested?

Implemented a unit test that exercises the above cases without and without the aggressive cleaning to ensure correct results in all cases. Note that FsHistoryProvider at one place uses the file system to get the current time and and at other times the local system time, this seems inconsistent/buggy but I did not attempt to fix in this commit. I was forced to change one of the method FsHistoryProvider.getNewLastScanTime() for the test to properly mock the clock.

Also ran a history server and touched some files to verify they were properly removed.

ericvandenberg@localhost /tmp/spark-events % ls -la
total 808K
drwxr-xr-x 8 ericvandenberg 272 Jul 31 18:22 .
drwxrwxrwt 127 root
-rw-r--r-- 1 ericvandenberg 0 Jan 1 2016 local-123.inprogress
-rwxr-x--- 1 ericvandenberg 342K Jan 1 2016 local-1501549952084
-rwxrwx--- 1 ericvandenberg 342K Jan 1 2016 local-1501549952084.inprogress
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:19 local-1501550073208
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:21 local-1501550473508.inprogress
-rw-r--r-- 1 ericvandenberg 0 Jan 1 2016 local-234

Observed in history server logs:

17/07/31 18:23:52 INFO FsHistoryProvider: Aggressively cleaned up 4 untracked history files.

ericvandenberg@localhost /tmp/spark-events % ls -la
total 120K
drwxr-xr-x 4 ericvandenberg 136 Jul 31 18:24 .
drwxrwxrwt 127 root 4.3K Jul 31 18:07 ..
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:19 local-1501550073208
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:22 local-1501550473508

…dable history files around forever. Fix logic 1. checkForLogs excluded 0-size files so they stuck around forever. 2. checkForLogs / mergeApplicationListing indefinitely ignored files that were not parseable/couldn't extract an appID, so they stuck around forever. Only apply above logic if spark.history.fs.cleaner.aggressive=true. Fixed race condition in a test (SPARK-3697: ignore files that cannot be read.) where the number of mergeApplicationListings could be more than 1 since the FsHistoryProvider would spin up an executor that also calls checkForLogs in parallel with the test. Added unit test to cover all cases with aggressive and non-aggressive clean up logic.

AmplabJenkins · 2017-08-01T01:52:08Z

Can one of the admins verify this patch?

ajbozarth · 2017-08-01T20:02:14Z

core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala

@@ -134,7 +134,8 @@ class FsHistoryProviderSuite extends SparkFunSuite with BeforeAndAfter with Matc
    // setReadable(...) does not work on Windows. Please refer JDK-6728842.
    assume(!Utils.isWindows)

-    class TestFsHistoryProvider extends FsHistoryProvider(createTestConf()) {
+    class TestFsHistoryProvider extends FsHistoryProvider(
+      createTestConf().set("spark.testing", "true")) {


What is your reasoning for adding this? It doesn't seem related to the other tests you added

The test suite refused to pass on my machine so decided to fix it.

Mentioned in commit comments...

Fixed race condition in a test (SPARK-3697: ignore files that cannot be
read.) where the number of mergeApplicationListings could be more than 1
since the FsHistoryProvider would spin up an executor that also calls
checkForLogs in parallel with the test unless spark.testing=true configured.

ajbozarth · 2017-08-01T20:05:43Z

Overall I like this as an option and the code looks good. Personally I use one log directory for all my logs (not just the SHS) so this wouldn't work for me, but I also run into dead files cluttering up my log dir and this would address that for larger servers.

jiangxb1987 · 2017-08-06T16:08:13Z

QQ: Do we want to default turn on the feature in some future version? If so, what pre-condition in your mind should be fulfilled? Currently we have had too much configurations that default to turn off, this might lead common users hard to tune their Spark cluster.

ajbozarth · 2017-08-06T17:24:14Z

This is something that should be defaulted to off, like I mentioned above many users like myself only use one log directory and when this is on it would delete any non-application logs in the log directory, including master and worker logs for example

jiangxb1987 · 2017-08-07T01:50:54Z

Yea, I'm just thinking whether it is possible we can have a perfect approach that we can be confident to turn it on by default.

ericvandenbergfb-zz · 2017-08-25T16:57:00Z

The default is off, so people can opt-in to more aggressive clean up.

Is this okay to be merged?

jiangxb1987 · 2017-11-06T08:12:06Z

@ericvandenbergfb Could you please rebase this to the latest master so we can continue review it? Also cc @vanzin @jerryshao

jerryshao · 2017-11-06T08:18:44Z

@ericvandenbergfb please also fix the PR title, thanks.

…up.untracked.history.files * 'master' of github.com:ericvandenbergfb/spark: (637 commits) [SPARK-22469][SQL] Accuracy problem in comparison with string and numeric [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in createDataFrame with Arrow [SPARK-22514][SQL] move ColumnVector.Array and ColumnarBatch.Row to individual files [SPARK-12375][ML] VectorIndexerModel support handle unseen categories via handleInvalid [SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala [SPARK-22511][BUILD] Update maven central repo address [SPARK-22519][YARN] Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir() [SPARK-20652][SQL] Store SQL UI data in the new app status store. [SPARK-20648][CORE] Port JobsTab and StageTab to the new UI backend. [SPARK-17074][SQL] Generate equi-height histogram in column statistics [SPARK-17310][SQL] Add an option to disable record-level filter in Parquet-side [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tuning in PySpark [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in release-build.sh [SPARK-22487][SQL][FOLLOWUP] still keep spark.sql.hive.version [MINOR][CORE] Using bufferedInputStream for dataDeserializeStream [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas [SPARK-21693][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build [SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR ...

The history provider code was changed so I reimplemented the fix to clean up empty or corrupt history files that otherwise would stay around forever.

ericvandenbergfb-zz · 2017-11-16T21:32:05Z

See continuation of pull request at #19770

ericvandenbergfb-zz changed the title ~~[SPARK-21571][WEB UI] Spark history server leaves incomplete or unrea…~~ [SPARK-21571][Scheduler] Spark history server leaves incomplete or unrea… Aug 1, 2017

ajbozarth reviewed Aug 1, 2017

View reviewed changes

Eric Vandenberg added 2 commits November 15, 2017 12:17

[SPARK-21571][WEB UI] Spark history server cleans up untracked files.

aee2fd3

The history provider code was changed so I reimplemented the fix to clean up empty or corrupt history files that otherwise would stay around forever.

ericvandenbergfb-zz mentioned this pull request Nov 16, 2017

[SPARK-21571][WEB UI] Spark history server leaves incomplete or unreadable logs around forever #19770

Closed

ericvandenbergfb-zz closed this Nov 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-21571][Scheduler] Spark history server leaves incomplete or unrea… #18791

[SPARK-21571][Scheduler] Spark history server leaves incomplete or unrea… #18791

ericvandenbergfb-zz commented Aug 1, 2017

AmplabJenkins commented Aug 1, 2017

ajbozarth Aug 1, 2017

ericvandenbergfb-zz Aug 1, 2017

ajbozarth commented Aug 1, 2017

jiangxb1987 commented Aug 6, 2017

ajbozarth commented Aug 6, 2017

jiangxb1987 commented Aug 7, 2017

ericvandenbergfb-zz commented Aug 25, 2017

jiangxb1987 commented Nov 6, 2017

jerryshao commented Nov 6, 2017

ericvandenbergfb-zz commented Nov 16, 2017

Navigation Menu

[SPARK-21571][Scheduler] Spark history server leaves incomplete or unrea… #18791

[SPARK-21571][Scheduler] Spark history server leaves incomplete or unrea… #18791

Conversation

ericvandenbergfb-zz commented Aug 1, 2017

What changes were proposed in this pull request?

How was this patch tested?

AmplabJenkins commented Aug 1, 2017

ajbozarth Aug 1, 2017

Choose a reason for hiding this comment

ericvandenbergfb-zz Aug 1, 2017

Choose a reason for hiding this comment

ajbozarth commented Aug 1, 2017

jiangxb1987 commented Aug 6, 2017

ajbozarth commented Aug 6, 2017

jiangxb1987 commented Aug 7, 2017

ericvandenbergfb-zz commented Aug 25, 2017

jiangxb1987 commented Nov 6, 2017

jerryshao commented Nov 6, 2017

ericvandenbergfb-zz commented Nov 16, 2017