Skip to content

Comments

[SPARK-38446][Core] Fix deadlock between ExecutorClassLoader and FileDownloadCallback caused by Log4j#35765

Closed
yaooqinn wants to merge 1 commit intoapache:masterfrom
yaooqinn:SPARK-38446
Closed

[SPARK-38446][Core] Fix deadlock between ExecutorClassLoader and FileDownloadCallback caused by Log4j#35765
yaooqinn wants to merge 1 commit intoapache:masterfrom
yaooqinn:SPARK-38446

Conversation

@yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Mar 8, 2022

What changes were proposed in this pull request?

While log4j.ignoreTCL/log4j2.ignoreTCL is false, which is the default, it uses the context ClassLoader for the current Thread, see org.apache.logging.log4j.util.LoaderUtil.loadClass. While ExecutorClassLoader try to loadClass through remotely though the FileDownload, if error occurs, we will long on debug level, and log4j...LoaderUtil will be blocked by ExecutorClassLoader acquired classloading lock.

Fortunately, it only happens when ThresholdFilter's level is debug.

or we can set log4j.ignoreTCL/log4j2.ignoreTCL to true, but I don't know what else it will cause.

So in this PR, I simply remove the debug log which cause this deadlock

Why are the changes needed?

fix deadlock

Does this PR introduce any user-facing change?

no

How was this patch tested?

apache/kyuubi#2046 (comment), with a ut in kyuubi project, resolved(https://github.com/apache/incubator-kyuubi/actions/runs/1950222737)

Additional Resources

ut.jstack.txt

@github-actions github-actions bot added the CORE label Mar 8, 2022
@yaooqinn
Copy link
Member Author

yaooqinn commented Mar 8, 2022

@dongjoon-hyun
Copy link
Member

cc @viirya

}

override def onFailure(streamId: String, cause: Throwable): Unit = {
logDebug(s"Error downloading stream $streamId.", cause)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the deadlock happens when we use DEBUG log level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the log4j will loadClass again with ExecutorClassLoader, if the loglevel does not satisfy, it is skipped

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the deadlock occurs, I see something very similar with #10337 (comment)

@cloud-fan
Copy link
Contributor

cc @Ngone51 @jiangxb1987

@viirya
Copy link
Member

viirya commented Mar 8, 2022

Hmm, after I read the description, I think shouldn't we fix it at ExecutorClassLoader? E.g., falling back to the default class loader as log4j2.ignoreTCL?

@yaooqinn
Copy link
Member Author

yaooqinn commented Mar 9, 2022

thanks, @viirya,
Firstly we cannot fall back directly before ExecutorClassLoader tries remote classes via log4j2.ignoreTCL, it fails in the middle of fetching remote classes. Then we try to log a debug message where we shall skip via log4j2.ignoreTCL. As we cannot determine in ExecutorClassLoader whether the caller for loading classes is from Log4j. I don't know how we play the trick to ignore log4j2.ignoreTCL first then respect it later.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Apr 3, 2022

cc @MaxGekk since he is a release manager.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Since this is a removal of log, there is no functional regression .
Let me merge this and I hope this can unblock Apache Kyuubi project and Apache Spark 3.3.0 release.

Thank you, @yaooqinn , @srowen , @viirya , @cloud-fan .

dongjoon-hyun pushed a commit that referenced this pull request Apr 3, 2022
…DownloadCallback caused by Log4j

### What changes were proposed in this pull request?

While `log4j.ignoreTCL/log4j2.ignoreTCL` is false, which is the default, it uses the context ClassLoader for the current Thread, see `org.apache.logging.log4j.util.LoaderUtil.loadClass`. While ExecutorClassLoader try to loadClass through remotely though the FileDownload, if error occurs, we will long on debug level, and `log4j...LoaderUtil` will be blocked by ExecutorClassLoader acquired classloading lock.

Fortunately, it only happens when ThresholdFilter's level is `debug`.

or we can set `log4j.ignoreTCL/log4j2.ignoreTCL` to true, but I don't know what else it will cause.

So in this PR, I simply remove the debug log which cause this deadlock

### Why are the changes needed?

fix deadlock

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

apache/kyuubi#2046 (comment), with a ut in kyuubi project, resolved(https://github.com/apache/incubator-kyuubi/actions/runs/1950222737)

### Additional Resources

[ut.jstack.txt](https://github.com/apache/spark/files/8206457/ut.jstack.txt)

Closes #35765 from yaooqinn/SPARK-38446.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit aef6745)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Apr 3, 2022
…DownloadCallback caused by Log4j

### What changes were proposed in this pull request?

While `log4j.ignoreTCL/log4j2.ignoreTCL` is false, which is the default, it uses the context ClassLoader for the current Thread, see `org.apache.logging.log4j.util.LoaderUtil.loadClass`. While ExecutorClassLoader try to loadClass through remotely though the FileDownload, if error occurs, we will long on debug level, and `log4j...LoaderUtil` will be blocked by ExecutorClassLoader acquired classloading lock.

Fortunately, it only happens when ThresholdFilter's level is `debug`.

or we can set `log4j.ignoreTCL/log4j2.ignoreTCL` to true, but I don't know what else it will cause.

So in this PR, I simply remove the debug log which cause this deadlock

### Why are the changes needed?

fix deadlock

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

apache/kyuubi#2046 (comment), with a ut in kyuubi project, resolved(https://github.com/apache/incubator-kyuubi/actions/runs/1950222737)

### Additional Resources

[ut.jstack.txt](https://github.com/apache/spark/files/8206457/ut.jstack.txt)

Closes #35765 from yaooqinn/SPARK-38446.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit aef6745)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Apr 3, 2022
…DownloadCallback caused by Log4j

### What changes were proposed in this pull request?

While `log4j.ignoreTCL/log4j2.ignoreTCL` is false, which is the default, it uses the context ClassLoader for the current Thread, see `org.apache.logging.log4j.util.LoaderUtil.loadClass`. While ExecutorClassLoader try to loadClass through remotely though the FileDownload, if error occurs, we will long on debug level, and `log4j...LoaderUtil` will be blocked by ExecutorClassLoader acquired classloading lock.

Fortunately, it only happens when ThresholdFilter's level is `debug`.

or we can set `log4j.ignoreTCL/log4j2.ignoreTCL` to true, but I don't know what else it will cause.

So in this PR, I simply remove the debug log which cause this deadlock

### Why are the changes needed?

fix deadlock

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

apache/kyuubi#2046 (comment), with a ut in kyuubi project, resolved(https://github.com/apache/incubator-kyuubi/actions/runs/1950222737)

### Additional Resources

[ut.jstack.txt](https://github.com/apache/spark/files/8206457/ut.jstack.txt)

Closes #35765 from yaooqinn/SPARK-38446.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit aef6745)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@yaooqinn
Copy link
Member Author

yaooqinn commented Apr 3, 2022

thanks @dongjoon-hyun and all

kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…DownloadCallback caused by Log4j

### What changes were proposed in this pull request?

While `log4j.ignoreTCL/log4j2.ignoreTCL` is false, which is the default, it uses the context ClassLoader for the current Thread, see `org.apache.logging.log4j.util.LoaderUtil.loadClass`. While ExecutorClassLoader try to loadClass through remotely though the FileDownload, if error occurs, we will long on debug level, and `log4j...LoaderUtil` will be blocked by ExecutorClassLoader acquired classloading lock.

Fortunately, it only happens when ThresholdFilter's level is `debug`.

or we can set `log4j.ignoreTCL/log4j2.ignoreTCL` to true, but I don't know what else it will cause.

So in this PR, I simply remove the debug log which cause this deadlock

### Why are the changes needed?

fix deadlock

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

apache/kyuubi#2046 (comment), with a ut in kyuubi project, resolved(https://github.com/apache/incubator-kyuubi/actions/runs/1950222737)

### Additional Resources

[ut.jstack.txt](https://github.com/apache/spark/files/8206457/ut.jstack.txt)

Closes apache#35765 from yaooqinn/SPARK-38446.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit aef6745)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants