[SPARK-55495][CORE] Fix EventLogFileWriters.closeWriter to handle checkError
#54280
base: master
Conversation
Could you review this PR, @HeartSaVioR, @HyukjinKwon, @viirya, @yaooqinn, @LuciferYang, @peter-toth, @pan3793?
core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala
Thank you, @pan3793. If there is any issue for any known downstream project, we may want to change it more conservatively later, like the following. In this alternative, there is no side effect for the normal successful case.
if (writer.exists(_.checkError())) {
  logWarning("Spark detects errors while closing event logs.")
  hadoopDataStream.foreach(_.close())
}
@dongjoon-hyun, yeah, I feel the alternative is better. The change makes sense to me, soft +1, because I am not experienced with file systems at large scale other than HDFS; better to let others have a look too.
Thank you for your thoughtful feedback, @pan3793. 😄 I updated this PR with the alternative, too.
hadoopDataStream.foreach(_.hflush())
// 2. Try to close and check the errors
writer.foreach(_.close())
Catch exceptions from close() directly to ensure the fallback runs?
In the case of HDFS, it should presumably not throw an exception; instead, a boolean status would be set to true. However, I'm not sure whether third-party libraries will adhere to this convention.
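For background, here is a minimal, self-contained sketch (plain java.io, not the PR's code) of why a try/catch around close() alone would not see the failure: java.io.PrintWriter swallows IOExceptions from the underlying stream and only records them in the flag that checkError() reads back.

import java.io.{IOException, OutputStream, PrintWriter}

object CheckErrorDemo {
  def main(args: Array[String]): Unit = {
    // An output stream that always fails on close, standing in for a broken
    // Hadoop/HDFS stream.
    val failing = new OutputStream {
      override def write(b: Int): Unit = ()
      override def close(): Unit = throw new IOException("simulated close failure")
    }

    val writer = new PrintWriter(failing)
    writer.println("event")

    // close() does NOT propagate the IOException; PrintWriter catches it and
    // sets an internal error flag instead.
    writer.close()

    // checkError() is the only way to observe that the close actually failed.
    println(s"checkError = ${writer.checkError()}") // expected: checkError = true
  }
}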
// 2. Try to close and check the errors
writer.foreach(_.close())
if (writer.exists(_.checkError())) {
  logWarning("Spark detects errors while closing event logs.")
Use structured logging?
This follows the structured logging recommendation for constant string messages. Did I miss something, @yaooqinn?
* Constant String Messages:
*   If you are logging a constant string message, use the log methods that accept a constant
*   string.
* <p>
*
*   logInfo("StateStore stopped")
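As a concrete illustration of that recommendation, a minimal sketch assuming only the public org.apache.spark.internal.Logging trait (the class name here is hypothetical):

import org.apache.spark.internal.Logging

// Hypothetical class, for illustration only.
class EventLogCloseReporter extends Logging {
  def reportCloseError(failed: Boolean): Unit = {
    if (failed) {
      // The message is a constant string, so the plain String overload is the
      // recommended form; no structured-logging MDC keys are needed here.
      logWarning("Spark detects errors while closing event logs.")
    }
  }
}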
override def close(): Unit = {
  if (throwOnClose) {
    throw new IOException("Simulated close error")
Shall we match the synthetic error message in the tests to make sure that we capture the right one?
The test case matches the warning message from Spark's code, like the following, instead of this test suite's string.
assert(warningMessages.exists(_.contains("Spark detects errors while flushing")),
assert(warningMessages.exists(_.contains("Spark detects errors while closing")),
if (writer.exists(_.checkError())) {
  logWarning("Spark detects errors while flushing event logs.")
}
hadoopDataStream.foreach(_.hflush())
hadoopDataStream.foreach(_.hflush()) can throw an unhandled IOException; shall we wrap it in a try-catch here? Maybe we can leverage something like Utils.logIOError for all fallback paths.
This PR aims to avoid a silent failure inside checkError. For other propagatable unhandled IOExceptions, SparkContext.stop already logs them like the following, so I intentionally didn't use try ... catch ... or Utils.tryLog....
spark/core/src/main/scala/org/apache/spark/SparkContext.scala
Lines 2380 to 2382 in 59f3a16
Utils.tryLogNonFatalError {
  _eventLogger.foreach(_.stop())
}
spark/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala
Lines 356 to 363 in 59f3a16
override def stop(): Unit = {
  closeWriter()
  val appStatusPathIncomplete = getAppStatusFilePath(logDirForAppPath, appId, appAttemptId,
    inProgress = true)
  val appStatusPathComplete = getAppStatusFilePath(logDirForAppPath, appId, appAttemptId,
    inProgress = false)
  renameFile(appStatusPathIncomplete, appStatusPathComplete, overwrite = true)
}
In my case, the silent failure happens before renameFile, so the last log file is not uploaded correctly and the .inprogress file remains. As a result, SHS always shows running stages because the application is never marked as finished.
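To make that propagation concrete, a hedged sketch assuming Spark's internal Utils.tryLogNonFatalError from the SparkContext.stop excerpt above (Utils is private[spark], so this only compiles inside Spark's own source tree; the throwing stop() is a stand-in for _eventLogger.foreach(_.stop()), not the real writer):

package org.apache.spark  // required because Utils is private[spark]

import java.io.IOException

import org.apache.spark.util.Utils

object StopPathSketch {
  // Stand-in for an event logger whose hflush/close throws an IOException.
  private def stop(): Unit = throw new IOException("hflush failed")

  def main(args: Array[String]): Unit = {
    // The IOException propagates out of stop() and is caught and logged here,
    // which is why closeWriter itself does not add another try/catch.
    Utils.tryLogNonFatalError {
      stop()
    }
  }
}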
Thank you for your review comments, @LuciferYang and @yaooqinn.
yaooqinn
left a comment
LGTM, thank you for the patch and explanation
What changes were proposed in this pull request?
This PR aims to fix EventLogFileWriters.closeWriter to handle checkError. In general, we need the following three steps (see the sketch after this list).
1. flush first before closing to isolate any problems at this layer.
2. Try PrintWriter.close and fall back to the underlying Hadoop file stream's close API.
3. Show a warning when checkError returns true.
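As referenced above, here is a hedged sketch of what the revised closeWriter roughly looks like when the diff fragments quoted in the review threads are put together (the field names writer and hadoopDataStream follow those fragments; this is an illustration, not the exact merged code):

def closeWriter(): Unit = {
  // 1. Check for flush errors first (checkError also flushes the PrintWriter),
  //    then flush the underlying Hadoop stream.
  if (writer.exists(_.checkError())) {
    logWarning("Spark detects errors while flushing event logs.")
  }
  hadoopDataStream.foreach(_.hflush())

  // 2. Try to close and check the errors.
  writer.foreach(_.close())
  if (writer.exists(_.checkError())) {
    logWarning("Spark detects errors while closing event logs.")
    // 3. Fall back to the underlying Hadoop stream's close API.
    hadoopDataStream.foreach(_.close())
  }
}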
Why are the changes needed?
Currently, Apache Spark's event log writer naively invokes PrintWriter.close() without error handling.
spark/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala
Line 80 in 4e1cb88
spark/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala
Lines 133 to 135 in 4e1cb88
However, the Java community recommends using checkError in the case of PrintWriter.flush and PrintWriter.close.
When checkError returns true, a user can lose their event log; for example, the event log silently fails to upload. Spark should show a proper warning and at least make a best effort to flush or close the underlying Hadoop file streams.
Does this PR introduce any user-facing change?
No, this is a bug fix for a corner case.
How was this patch tested?
Pass the CIs with the newly added test cases.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Opus 4.5 on Claude Code