New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21571][Scheduler] Spark history server leaves incomplete or unrea… #18791
[SPARK-21571][Scheduler] Spark history server leaves incomplete or unrea… #18791
Conversation
…dable history files around forever. Fix logic 1. checkForLogs excluded 0-size files so they stuck around forever. 2. checkForLogs / mergeApplicationListing indefinitely ignored files that were not parseable/couldn't extract an appID, so they stuck around forever. Only apply above logic if spark.history.fs.cleaner.aggressive=true. Fixed race condition in a test (SPARK-3697: ignore files that cannot be read.) where the number of mergeApplicationListings could be more than 1 since the FsHistoryProvider would spin up an executor that also calls checkForLogs in parallel with the test. Added unit test to cover all cases with aggressive and non-aggressive clean up logic.
Can one of the admins verify this patch? |
@@ -134,7 +134,8 @@ class FsHistoryProviderSuite extends SparkFunSuite with BeforeAndAfter with Matc | |||
// setReadable(...) does not work on Windows. Please refer JDK-6728842. | |||
assume(!Utils.isWindows) | |||
|
|||
class TestFsHistoryProvider extends FsHistoryProvider(createTestConf()) { | |||
class TestFsHistoryProvider extends FsHistoryProvider( | |||
createTestConf().set("spark.testing", "true")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your reasoning for adding this? It doesn't seem related to the other tests you added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test suite refused to pass on my machine so decided to fix it.
Mentioned in commit comments...
Fixed race condition in a test (SPARK-3697: ignore files that cannot be
read.) where the number of mergeApplicationListings could be more than 1
since the FsHistoryProvider would spin up an executor that also calls
checkForLogs in parallel with the test unless spark.testing=true configured.
Overall I like this as an option and the code looks good. Personally I use one log directory for all my logs (not just the SHS) so this wouldn't work for me, but I also run into dead files cluttering up my log dir and this would address that for larger servers. |
QQ: Do we want to default turn on the feature in some future version? If so, what pre-condition in your mind should be fulfilled? Currently we have had too much configurations that default to turn off, this might lead common users hard to tune their Spark cluster. |
This is something that should be defaulted to off, like I mentioned above many users like myself only use one log directory and when this is on it would delete any non-application logs in the log directory, including master and worker logs for example |
Yea, I'm just thinking whether it is possible we can have a perfect approach that we can be confident to turn it on by default. |
The default is off, so people can opt-in to more aggressive clean up. Is this okay to be merged? |
@ericvandenbergfb Could you please rebase this to the latest master so we can continue review it? Also cc @vanzin @jerryshao |
@ericvandenbergfb please also fix the PR title, thanks. |
…up.untracked.history.files * 'master' of github.com:ericvandenbergfb/spark: (637 commits) [SPARK-22469][SQL] Accuracy problem in comparison with string and numeric [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in createDataFrame with Arrow [SPARK-22514][SQL] move ColumnVector.Array and ColumnarBatch.Row to individual files [SPARK-12375][ML] VectorIndexerModel support handle unseen categories via handleInvalid [SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala [SPARK-22511][BUILD] Update maven central repo address [SPARK-22519][YARN] Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir() [SPARK-20652][SQL] Store SQL UI data in the new app status store. [SPARK-20648][CORE] Port JobsTab and StageTab to the new UI backend. [SPARK-17074][SQL] Generate equi-height histogram in column statistics [SPARK-17310][SQL] Add an option to disable record-level filter in Parquet-side [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tuning in PySpark [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in release-build.sh [SPARK-22487][SQL][FOLLOWUP] still keep spark.sql.hive.version [MINOR][CORE] Using bufferedInputStream for dataDeserializeStream [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas [SPARK-21693][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build [SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR ...
The history provider code was changed so I reimplemented the fix to clean up empty or corrupt history files that otherwise would stay around forever.
See continuation of pull request at #19770 |
…dable history files
around forever.
Fix logic
that were not parseable/couldn't extract an appID, so they stuck around
forever.
Only apply above logic if spark.history.fs.cleaner.aggressive=true.
Fixed race condition in a test (SPARK-3697: ignore files that cannot be
read.) where the number of mergeApplicationListings could be more than 1
since the FsHistoryProvider would spin up an executor that also calls
checkForLogs in parallel with the test unless spark.testing=true configured.
Added unit test to cover all cases with aggressive and non-aggressive
clean up logic.
What changes were proposed in this pull request?
The spark history server doesn't clean up certain history files outside the retention window leading to thousands of such files lingering around on our servers. The log checking and clean up logic skipped 0 byte files and expired inprogress or complete history files that weren't properly parseable (not able to extract an app id or otherwise parse...) Note these files most likely appeared to due aborted jobs or earlier spark/file system driver bugs. To mitigate this, FsHistoryProvider.checkForLogs now internally identifies these untracked files and will remove them if they expire outside the cleaner retention window.
This is currently controlled via configuration spark.history.fs.cleaner.aggressive=true to perform more aggressive cleaning.
How was this patch tested?
Implemented a unit test that exercises the above cases without and without the aggressive cleaning to ensure correct results in all cases. Note that FsHistoryProvider at one place uses the file system to get the current time and and at other times the local system time, this seems inconsistent/buggy but I did not attempt to fix in this commit. I was forced to change one of the method FsHistoryProvider.getNewLastScanTime() for the test to properly mock the clock.
Also ran a history server and touched some files to verify they were properly removed.
ericvandenberg@localhost /tmp/spark-events % ls -la
total 808K
drwxr-xr-x 8 ericvandenberg 272 Jul 31 18:22 .
drwxrwxrwt 127 root
-rw-r--r-- 1 ericvandenberg 0 Jan 1 2016 local-123.inprogress
-rwxr-x--- 1 ericvandenberg 342K Jan 1 2016 local-1501549952084
-rwxrwx--- 1 ericvandenberg 342K Jan 1 2016 local-1501549952084.inprogress
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:19 local-1501550073208
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:21 local-1501550473508.inprogress
-rw-r--r-- 1 ericvandenberg 0 Jan 1 2016 local-234
Observed in history server logs:
17/07/31 18:23:52 INFO FsHistoryProvider: Aggressively cleaned up 4 untracked history files.
ericvandenberg@localhost /tmp/spark-events % ls -la
total 120K
drwxr-xr-x 4 ericvandenberg 136 Jul 31 18:24 .
drwxrwxrwt 127 root 4.3K Jul 31 18:07 ..
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:19 local-1501550073208
-rwxrwx--- 1 ericvandenberg 59K Jul 31 18:22 local-1501550473508