
[SPARK-21571][Scheduler] Spark history server leaves incomplete or unreadable history files #18791

Conversation

ericvandenbergfb-zz

Spark history server leaves incomplete or unreadable history files around forever.

Fix logic:

  1. checkForLogs excluded 0-size files so they stuck around forever.
  2. checkForLogs / mergeApplicationListing indefinitely ignored files
     that were not parseable/couldn't extract an appID, so they stuck
     around forever.

Only apply the above logic if spark.history.fs.cleaner.aggressive=true.

Fixed a race condition in a test (SPARK-3697: ignore files that cannot be
read.) where the number of mergeApplicationListings could be more than 1,
since the FsHistoryProvider would spin up an executor that also calls
checkForLogs in parallel with the test unless spark.testing=true is configured.

Added a unit test to cover all cases with aggressive and non-aggressive
cleanup logic.

What changes were proposed in this pull request?

The Spark history server doesn't clean up certain history files outside the retention window, leading to thousands of such files lingering on our servers. The log checking and cleanup logic skipped 0-byte files and expired in-progress or complete history files that weren't properly parseable (i.e., an app id could not be extracted, or parsing otherwise failed). Note these files most likely appeared due to aborted jobs or earlier Spark/file system driver bugs. To mitigate this, FsHistoryProvider.checkForLogs now internally identifies these untracked files and removes them once they expire outside the cleaner retention window.
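The decision described above can be sketched as follows. This is a minimal Scala sketch, not the actual FsHistoryProvider code: `LogFile` and `shouldAggressivelyClean` are hypothetical names introduced for illustration; the real change lives inside FsHistoryProvider.checkForLogs.

```scala
// Hypothetical model of a scanned event-log file.
case class LogFile(path: String, sizeBytes: Long, appId: Option[String], mtimeMs: Long)

// A file is "untracked" if it is zero bytes or no app id could be parsed
// from it; it is deleted only when it is also outside the retention window
// and the aggressive-cleaner flag is enabled.
def shouldAggressivelyClean(log: LogFile, nowMs: Long, maxAgeMs: Long,
                            aggressive: Boolean): Boolean = {
  val untracked = log.sizeBytes == 0L || log.appId.isEmpty
  val expired = nowMs - log.mtimeMs > maxAgeMs
  aggressive && untracked && expired
}
```

With aggressive=false (the default), untracked files are left alone, matching the opt-in behavior of the PR.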

This is controlled via the configuration spark.history.fs.cleaner.aggressive=true, which opts in to the more aggressive cleaning.
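For example, a spark-defaults.conf opting in might look like the fragment below. The cleaner.enabled and cleaner.maxAge keys are existing history-server settings; the aggressive flag is the one proposed by this PR, and the 7d value is just an illustrative retention window.

```properties
spark.history.fs.cleaner.enabled     true
spark.history.fs.cleaner.maxAge      7d
spark.history.fs.cleaner.aggressive  true
```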

How was this patch tested?

Implemented a unit test that exercises the above cases both with and without aggressive cleaning to ensure correct results in all cases. Note that FsHistoryProvider in one place uses the file system to get the current time and in other places the local system time; this seems inconsistent/buggy, but I did not attempt to fix it in this commit. I was forced to change the method FsHistoryProvider.getNewLastScanTime() so the test could properly mock the clock.
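The clock-mocking technique described here amounts to overriding the scan-time hook in a test subclass. A minimal Scala sketch, where every name except getNewLastScanTime is made up for illustration:

```scala
// Hypothetical controllable clock for the test.
class FakeClock(var nowMs: Long) {
  def advance(ms: Long): Unit = { nowMs += ms }
}

// Stand-in for the provider: the scan-time hook is a protected method so
// tests can override it, as done with FsHistoryProvider.getNewLastScanTime().
class Provider {
  protected def getNewLastScanTime(): Long = System.currentTimeMillis()
  def checkForLogs(): Long = getNewLastScanTime()
}

// Test subclass pins "now" to the fake clock instead of wall-clock time.
class TestProvider(clock: FakeClock) extends Provider {
  override protected def getNewLastScanTime(): Long = clock.nowMs
}
```

This lets a test advance time past the retention window deterministically instead of sleeping.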

Also ran a history server and touched some files to verify they were properly removed.

ericvandenberg@localhost /tmp/spark-events % ls -la
total 808K
drwxr-xr-x 8 ericvandenberg 272 Jul 31 18:22 .
drwxrwxrwt 127 root
-rw-r--r-- 1 ericvandenberg 0 Jan 1 2016 local-123.inprogress
-rwxr-x--- 1 ericvandenberg 342K Jan 1 2016 local-1501549952084
-rwxrwx--- 1 ericvandenberg 342K Jan 1 2016 local-1501549952084.inprogress
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:19 local-1501550073208
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:21 local-1501550473508.inprogress
-rw-r--r-- 1 ericvandenberg 0 Jan 1 2016 local-234

Observed in history server logs:

17/07/31 18:23:52 INFO FsHistoryProvider: Aggressively cleaned up 4 untracked history files.

ericvandenberg@localhost /tmp/spark-events % ls -la
total 120K
drwxr-xr-x 4 ericvandenberg 136 Jul 31 18:24 .
drwxrwxrwt 127 root 4.3K Jul 31 18:07 ..
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:19 local-1501550073208
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:22 local-1501550473508

@AmplabJenkins

Can one of the admins verify this patch?

@ericvandenbergfb-zz ericvandenbergfb-zz changed the title [SPARK-21571][WEB UI] Spark history server leaves incomplete or unrea… [SPARK-21571][Scheduler] Spark history server leaves incomplete or unrea… Aug 1, 2017
@@ -134,7 +134,8 @@ class FsHistoryProviderSuite extends SparkFunSuite with BeforeAndAfter with Matc
     // setReadable(...) does not work on Windows. Please refer JDK-6728842.
     assume(!Utils.isWindows)

-    class TestFsHistoryProvider extends FsHistoryProvider(createTestConf()) {
+    class TestFsHistoryProvider extends FsHistoryProvider(
+      createTestConf().set("spark.testing", "true")) {
Member
What is your reasoning for adding this? It doesn't seem related to the other tests you added

Author

The test suite refused to pass on my machine, so I decided to fix it.

Mentioned in commit comments...

Fixed a race condition in a test (SPARK-3697: ignore files that cannot be
read.) where the number of mergeApplicationListings could be more than 1,
since the FsHistoryProvider would spin up an executor that also calls
checkForLogs in parallel with the test unless spark.testing=true is configured.

@ajbozarth
Member

Overall I like this as an option and the code looks good. Personally I use one log directory for all my logs (not just the SHS) so this wouldn't work for me, but I also run into dead files cluttering up my log dir and this would address that for larger servers.

@jiangxb1987
Contributor

QQ: Do we want to turn this feature on by default in some future version? If so, what preconditions do you think should be fulfilled? We currently have too many configurations that default to off, which can make it hard for common users to tune their Spark cluster.

@ajbozarth
Member

This is something that should default to off. As I mentioned above, many users like myself use a single log directory, and when this is on it would delete any non-application logs in that directory, including master and worker logs for example.

@jiangxb1987
Contributor

Yea, I'm just wondering whether there is an approach robust enough that we could confidently turn it on by default.

@ericvandenbergfb-zz
Author

The default is off, so people can opt-in to more aggressive clean up.

Is this okay to be merged?

@jiangxb1987
Contributor

@ericvandenbergfb Could you please rebase this to the latest master so we can continue review it? Also cc @vanzin @jerryshao

@jerryshao
Contributor

@ericvandenbergfb please also fix the PR title, thanks.

Eric Vandenberg added 2 commits November 15, 2017 12:17
…up.untracked.history.files

* 'master' of github.com:ericvandenbergfb/spark: (637 commits)
  [SPARK-22469][SQL] Accuracy problem in comparison with string and numeric
  [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder
  [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics
  [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in createDataFrame with Arrow
  [SPARK-22514][SQL] move ColumnVector.Array and ColumnarBatch.Row to individual files
  [SPARK-12375][ML] VectorIndexerModel support handle unseen categories via handleInvalid
  [SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala
  [SPARK-22511][BUILD] Update maven central repo address
  [SPARK-22519][YARN] Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()
  [SPARK-20652][SQL] Store SQL UI data in the new app status store.
  [SPARK-20648][CORE] Port JobsTab and StageTab to the new UI backend.
  [SPARK-17074][SQL] Generate equi-height histogram in column statistics
  [SPARK-17310][SQL] Add an option to disable record-level filter in Parquet-side
  [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tuning in PySpark
  [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in release-build.sh
  [SPARK-22487][SQL][FOLLOWUP] still keep spark.sql.hive.version
  [MINOR][CORE] Using bufferedInputStream for dataDeserializeStream
  [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas
  [SPARK-21693][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build
  [SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR
  ...
The history provider code was changed so I reimplemented the fix to
clean up empty or corrupt history files that otherwise would stay
around forever.
@ericvandenbergfb-zz
Author

See continuation of pull request at #19770
