
Conversation

@dongjoon-hyun dongjoon-hyun commented Feb 12, 2026

What changes were proposed in this pull request?

This PR aims to fix EventLogFileWriters.closeWriter to handle checkError. In general, we need the following three changes (a rough sketch is shown after this list).

  1. Flush first, before closing, to isolate any problems at this layer.
  2. Call PrintWriter.close and fall back to the underlying Hadoop file stream's close API.
  3. Show warnings properly when checkError returns true.
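
For reference, here is a rough sketch of what the reworked closeWriter could look like, pieced together from the snippets quoted in the review threads below (it assumes the existing writer: Option[PrintWriter] and hadoopDataStream: Option[FSDataOutputStream] fields; the merged patch may differ in details):

protected def closeWriter(): Unit = {
  // 1. Flush first and surface any buffered error at this layer.
  writer.foreach(_.flush())
  if (writer.exists(_.checkError())) {
    logWarning("Spark detects errors while flushing event logs.")
  }
  hadoopDataStream.foreach(_.hflush())

  // 2. Close the PrintWriter, check again, and fall back to the Hadoop stream's close.
  writer.foreach(_.close())
  if (writer.exists(_.checkError())) {
    logWarning("Spark detects errors while closing event logs.")
    hadoopDataStream.foreach(_.close())
  }
}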

Why are the changes needed?

Currently, Apache Spark's event log writer naively invokes PrintWriter.close() without error handling.

protected var writer: Option[PrintWriter] = None

protected def closeWriter(): Unit = {
  writer.foreach(_.close())
}

However, the Java community recommends calling checkError after PrintWriter.flush and PrintWriter.close, because PrintWriter suppresses IOExceptions instead of throwing them.

When checkError returns true, a user can lose their event log; for example, the event log silently fails to be uploaded. Spark should show a proper warning and make a best effort to at least flush or close the underlying Hadoop file streams.
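
As a self-contained illustration (not code from this PR), the snippet below shows how PrintWriter swallows an IOException from the underlying stream, so only checkError() reports the failure:

import java.io.{IOException, OutputStream, PrintWriter}

// A stream that always fails on write, standing in for a broken event log stream.
val failing = new OutputStream {
  override def write(b: Int): Unit = throw new IOException("disk full")
}
val pw = new PrintWriter(failing)
pw.println("some event")     // no exception is thrown; the data sits in the buffer
pw.flush()                   // the IOException from write() is swallowed internally
assert(pw.checkError())      // only checkError() reveals that the write failed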

Does this PR introduce any user-facing change?

No, this is a bug fix for a corner case.

How was this patch tested?

Pass the CIs with the newly added test cases.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Opus 4.5 on Claude Code

@dongjoon-hyun (Member Author)

Could you review this PR, @HeartSaVioR , @HyukjinKwon , @viirya , @yaooqinn , @LuciferYang , @peter-toth , @pan3793 ?

@dongjoon-hyun (Member Author)

Thank you, @pan3793. If any issue arises for known downstream projects, we may want to switch to the following more conservative change later. In this alternative, there is no side effect for the normal successful case.

if (writer.exists(_.checkError())) {
  logWarning("Spark detects errors while closing event logs.")
  hadoopDataStream.foreach(_.close())
}


pan3793 commented Feb 12, 2026

@dongjoon-hyun, yeah, I feel the alternative is better.

The change makes sense to me, soft +1. Since I am not experienced with large-scale file systems other than HDFS, it is better to let others have a look too.

@dongjoon-hyun (Member Author)

Thank you for your thoughtful feedback, @pan3793. 😄 I updated this PR with the alternative, too.

hadoopDataStream.foreach(_.hflush())

// 2. Try to close and check the errors
writer.foreach(_.close())

Member:
Catch exceptions from close() directly to ensure the fallback runs?

Contributor:
In the case of HDFS, it should presumably not throw an exception; instead, a boolean status would be set to true. However, I'm not sure whether third-party libraries will adhere to this convention.

// 2. Try to close and check the errors
writer.foreach(_.close())
if (writer.exists(_.checkError())) {
logWarning("Spark detects errors while closing event logs.")

Member:
Use structured logging?

Member Author:
This follows a structured logging recommendation for a constant string message. Did I miss something, @yaooqinn ?

* Constant String Messages:
* If you are logging a constant string message, use the log methods that accept a constant
* string.
* <p>
*
* logInfo("StateStore stopped")


override def close(): Unit = {
  if (throwOnClose) {
    throw new IOException("Simulated close error")

Member:
Shall we match the synthetic error msg in the tests to make sure that we capture the right one?

Member Author:
The test case matches Spark's warning messages, like the following, instead of this test suite's string.

assert(warningMessages.exists(_.contains("Spark detects errors while flushing")),
assert(warningMessages.exists(_.contains("Spark detects errors while closing")),
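
For context, a hypothetical shape for such a test (the actual suite may differ), assuming SparkFunSuite's LogAppender/withLogAppender helpers and a hypothetical failingWriter whose underlying stream fails on close:

val appender = new LogAppender("event log close errors")
withLogAppender(appender) {
  failingWriter.stop()  // hypothetical writer stubbed to fail when closing
}
val warningMessages = appender.loggingEvents
  .filter(_.getLevel == org.apache.logging.log4j.Level.WARN)
  .map(_.getMessage.getFormattedMessage)
assert(warningMessages.exists(_.contains("Spark detects errors while closing")))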

if (writer.exists(_.checkError())) {
logWarning("Spark detects errors while flushing event logs.")
}
hadoopDataStream.foreach(_.hflush())

Member:
hadoopDataStream.foreach(_.hflush()) can throw an unhandled IOException; shall we wrap it in a try-catch here? Maybe we can leverage something like Utils.logIOError for all the fallback paths.
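
For illustration only, the suggested wrapping could look roughly like the sketch below (this is not what the PR ends up doing, and Utils.logIOError is left out since it is only a proposed helper):

// Sketch of the reviewer's suggestion, not the PR's code.
try {
  hadoopDataStream.foreach(_.hflush())
} catch {
  case e: java.io.IOException =>
    logWarning("Failed to hflush the underlying event log stream", e)
}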

@dongjoon-hyun (Member Author) commented Feb 12, 2026

This PR aims to avoid the silent failure that only checkError can reveal. For other propagatable unhandled IOExceptions, SparkContext.stop already logs them like the following, so I intentionally didn't use try ... catch ... or Utils.tryLog....

Utils.tryLogNonFatalError {
  _eventLogger.foreach(_.stop())
}

override def stop(): Unit = {
  closeWriter()
  val appStatusPathIncomplete = getAppStatusFilePath(logDirForAppPath, appId, appAttemptId,
    inProgress = true)
  val appStatusPathComplete = getAppStatusFilePath(logDirForAppPath, appId, appAttemptId,
    inProgress = false)
  renameFile(appStatusPathIncomplete, appStatusPathComplete, overwrite = true)
}

In my case, the silent failure happened before renameFile, so the last log file was not uploaded correctly and the .inprogress file remained. As a result, SHS always shows running stages because the application appears to be unfinished.

@dongjoon-hyun (Member Author)

Thank you for your review comments, @LuciferYang and @yaooqinn .

@yaooqinn yaooqinn (Member) left a comment

LGTM, thank you for the patch and explanation
