[SPARK-43534][BUILD] Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided #41195

Closed
wangyum wants to merge 4 commits

Conversation

wangyum (Member) commented May 17, 2023

What changes were proposed in this pull request?

This PR adds `log4j-1.2-api` and `log4j-slf4j2-impl` to the classpath when the `hadoop-provided` profile is active.

Why are the changes needed?

To fix a logging issue that appears in `hadoop-provided` builds.

How to reproduce this issue:

  1. Build Spark:

    ./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided
    tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz 
    
  2. Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars:

    guava-14.0.1.jar
    hadoop-client-api-3.3.5.jar
    hadoop-client-runtime-3.3.5.jar
    hadoop-shaded-guava-1.1.1.jar
    hadoop-yarn-server-web-proxy-3.3.5.jar
    slf4j-api-2.0.7.jar
    
  3. Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf:

    rootLogger.level = info
    rootLogger.appenderRef.file.ref = File
    rootLogger.appenderRef.stderr.ref = console
    
    appender.console.type = Console
    appender.console.name = console
    appender.console.target = SYSTEM_ERR
    appender.console.layout.type = PatternLayout
    appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
    
    appender.file.type = RollingFile
    appender.file.name = File
    appender.file.fileName = /tmp/spark/logs/spark.log
    appender.file.filePattern = /tmp/spark/logs/spark.%d{yyyyMMdd-HH}.log
    appender.file.append = true
    appender.file.layout.type = PatternLayout
    appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
    appender.file.policies.type = Policies
    appender.file.policies.time.type = TimeBasedTriggeringPolicy
    appender.file.policies.time.interval = 1
    appender.file.policies.time.modulate = true
    appender.file.policies.size.type = SizeBasedTriggeringPolicy
    appender.file.policies.size.size = 256M
    appender.file.strategy.type = DefaultRolloverStrategy
    appender.file.strategy.max = 100
    
  4. Start Spark thriftserver: sbin/start-thriftserver.sh.

  5. The log file is empty: cat /tmp/spark/logs/spark.log.

  6. Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars:

    log4j-1.2-api-2.20.0.jar
    log4j-slf4j2-impl-2.20.0.jar
    
  7. Restart Spark thriftserver: sbin/start-thriftserver.sh.

  8. The log file is not empty: cat /tmp/spark/logs/spark.log.

This is because the hadoop classpath does not provide these jars, so they are still needed even when `hadoop-provided` is activated.
![image](https://github.com/apache/spark/assets/5399861/d53ff28c-05d7-40d7-891f-069449315217)
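
One way to confirm this (a sketch, assuming a local Hadoop 3.3.x installation with `hadoop` on the PATH) is to expand the hadoop classpath and filter for logging artifacts:

```
# Expand the hadoop classpath (--glob resolves wildcard entries) and
# filter for logging jars. log4j-1.2-api and log4j-slf4j2-impl should
# not appear, which is why the Spark distribution must ship them itself.
hadoop classpath --glob | tr ':' '\n' | grep -iE 'log4j|slf4j'
```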

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test.

github-actions bot added the BUILD label May 17, 2023
srowen (Member) commented May 17, 2023

Can you explain this more? Your steps to reproduce require you to remove files from the build, and that causes the problem? But the JARs were there before you removed them. How does removing the scope change the result?

wangyum (Member, Author) commented May 17, 2023

Thank you @srowen. I have updated the PR description.

srowen (Member) commented May 17, 2023

It seems weird that the log4j 2 config works if you add log4j 1.x. Maybe so; I'm just trying to figure out if this is really what's going on and whether we have to let log4j 1.x back in, because then we would have both log4j versions in the distribution.

Kimahriman (Contributor) commented:
Maybe a similar reason to why I made #37694 a while ago? Basically, Spark's logging setup assumes log4j 2, but with hadoop-provided you get 1.x from Hadoop, so you get weird logging behavior. We just have both providers on the class path, and SLF4J seems to pick the 2.x one.
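
A quick sketch to see which SLF4J providers are actually on the class path, assuming it is run from the unpacked distribution root:

```
# SLF4J 2.x discovers providers via ServiceLoader
# (META-INF/services/org.slf4j.spi.SLF4JServiceProvider); 1.x bindings
# use org.slf4j.impl.StaticLoggerBinder. Scan the jars for both.
for j in jars/*.jar; do
  unzip -l "$j" | grep -q 'org.slf4j.spi.SLF4JServiceProvider' \
    && echo "SLF4J 2.x provider: $j"
  unzip -l "$j" | grep -q 'org/slf4j/impl/StaticLoggerBinder.class' \
    && echo "SLF4J 1.x binding:  $j"
done
```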

wangyum requested a review from viirya May 18, 2023 02:38
Comment on lines 762 to 764
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-1.2-api</artifactId>
<version>${log4j.version}</version>
This looks okay. This is the bridge between log4j 1 and log4j 2. In the case of hadoop-provided, we still need it, as the hadoop classpath doesn't provide it.
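
After rebuilding with this change, a quick sanity check that the two jars now land in the distribution (a sketch, reusing the paths from the reproduction steps above):

```
# Both jars should now be present even though -Phadoop-provided was used.
ls spark-3.5.0-SNAPSHOT-bin-provided/jars | grep -E 'log4j-1.2-api|log4j-slf4j2-impl'
```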

srowen (Member) commented May 19, 2023

Seems reasonable then. Let's just get the tests to run again.

Kimahriman (Contributor) commented May 19, 2023

I actually hit a new issue related to this after finally being able to test out Spark 3.4 from the Delta release. Because of the bump to SLF4J 2, it seems `log4j-slf4j2-impl` doesn't get recognized, so it always falls back to log4j/reload4j 1.x for logging. I had to also include `slf4j-api` 2.0.6 to get Spark's built-in log handling (i.e. `SparkContext.setLogLevel`) to work correctly.
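
For reference, one way to pull that jar in (a sketch; `mvn dependency:get` downloads into the local Maven repository, and the `~/.m2` path is an assumption about a default setup):

```
# Fetch slf4j-api 2.0.6 and drop it next to the log4j 2.x jars.
mvn dependency:get -Dartifact=org.slf4j:slf4j-api:2.0.6
cp ~/.m2/repository/org/slf4j/slf4j-api/2.0.6/slf4j-api-2.0.6.jar \
   spark-3.5.0-SNAPSHOT-bin-provided/jars/
```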

wangyum (Member, Author) commented May 20, 2023

@Kimahriman Do you have a way to reproduce it?

Kimahriman (Contributor) commented:
Never mind, it looks like including `log4j-1.2-api` actually fixes the issue; I wasn't including that before. In fact, just including `log4j-1.2-api` fixes the issue I filed as https://issues.apache.org/jira/browse/SPARK-40246. That might not fix custom log4j2 configs, but it at least makes `SparkContext.setLogLevel` work with a hadoop-provided build.
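
A minimal sketch to check that `setLogLevel` actually takes effect, run from the distribution root:

```
# With the bridge wired up, switching to DEBUG should visibly increase
# log output; with the broken setup the call is silently ignored.
echo 'sc.setLogLevel("DEBUG"); spark.range(10).count()' | bin/spark-shell
```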

srowen (Member) commented May 20, 2023

Merged to master

srowen closed this in 5d03950 on May 20, 2023
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Jun 20, 2023
[SPARK-43534][BUILD] Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
Closes apache#41195 from wangyum/SPARK-43534.

Lead-authored-by: Yuming Wang <wgyumg@gmail.com>
Co-authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <srowen@gmail.com>