
NUTCH-2885 Upgrade to Log4j2 #692

Merged: 5 commits, Aug 4, 2021

lewismc (Member) commented Jul 10, 2021

PR for https://issues.apache.org/jira/browse/NUTCH-2885 ready for review.
I feel that this simplifies the logging configuration with precise XML syntax; ultimately it is less code, and I find it readable.
The configuration uses a RollingFileAppender with the cron triggering policy configured to trigger every day at midnight. Archives are stored in a directory based on the current year and month. All files under the base directory that match the */nutch-*.log.gz glob and are 60 days old or older are deleted at rollover time. Additionally, we retain the ConsoleAppender configuration so everything is also written to STDOUT.
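The setup described above can be sketched roughly as follows. This is a minimal illustration, not the `conf/log4j2.xml` from the PR: the `logs` base directory, the `nutch.log` file name, the appender names and the layout pattern are assumptions chosen for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Properties>
    <!-- Assumed base directory for log files -->
    <Property name="baseDir">logs</Property>
  </Properties>
  <Appenders>
    <!-- Everything is also written to STDOUT -->
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
    </Console>
    <!-- Archives land in a year-month subdirectory, e.g. logs/2021-07/ -->
    <RollingFile name="RollingFile" fileName="${baseDir}/nutch.log"
                 filePattern="${baseDir}/$${date:yyyy-MM}/nutch-%d{yyyy-MM-dd}.log.gz">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
      <!-- Roll over every day at midnight -->
      <CronTriggeringPolicy schedule="0 0 0 * * ?"/>
      <DefaultRolloverStrategy>
        <!-- At rollover, delete matching archives that are 60 days old or older -->
        <Delete basePath="${baseDir}" maxDepth="2">
          <IfFileName glob="*/nutch-*.log.gz"/>
          <IfLastModified age="60d"/>
        </Delete>
      </DefaultRolloverStrategy>
    </RollingFile>
  </Appenders>
  <Loggers>
    <Root level="INFO">
      <AppenderRef ref="Console"/>
      <AppenderRef ref="RollingFile"/>
    </Root>
  </Loggers>
</Configuration>
```

`maxDepth="2"` is needed so that the delete action descends into the year-month subdirectories that the `*/nutch-*.log.gz` glob refers to.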
I was motivated to work on this issue because I am performing a trade study evaluating logz.io. Specifically, this will allow configuring Nutch to use the logzio-log4j2-appender.
Comments welcome.

lewismc (Member, Author) commented Jul 10, 2021

I've verified that this patch, with some minor additional configuration, enables me to write logs locally, into logz.io and also into an enterprise instance of Splunk. The latter two really help with alerts and notifications if something goes wrong, e.g. a ParseException.

lewismc (Member, Author) commented Jul 10, 2021

I put some documentation together for this as well:
https://cwiki.apache.org/confluence/display/NUTCH/Logging

lewismc (Member, Author) commented Jul 29, 2021

Anyone able to give this a look? Thank you

sebastian-nagel (Contributor) left a comment

Thanks, @lewismc.

> enables me to write logs locally, into logz.io and also into an enterprise instance of Splunk

Sounds great!

I've found that in (pseudo-)distributed mode using Hadoop 3.3.1 there is no effect at all. I don't know why; possibly the solution of HADOOP-12956 is required.

One point: with log4j.properties, only the Nutch tools wrote to stdout, and most of the Hadoop classes were even set to log at level WARN. Now the console output includes many messages that are hardly useful for locating the usual Nutch issues, e.g.

2021-08-02 15:27:14,318 INFO o.a.h.m.t.r.MergeManagerImpl [pool-5-thread-1] Merged 2 segments, 162 bytes to disk to satisfy reduce memory limit
2021-08-02 15:27:14,318 INFO o.a.h.m.t.r.MergeManagerImpl [pool-5-thread-1] Merging 1 files, 164 bytes from disk
2021-08-02 15:27:14,319 INFO o.a.h.m.t.r.MergeManagerImpl [pool-5-thread-1] Merging 0 segments, 0 bytes from memory into reduce
2021-08-02 15:27:14,319 INFO o.a.h.m.Merger [pool-5-thread-1] Merging 1 sorted segments
2021-08-02 15:27:14,319 INFO o.a.h.m.Merger [pool-5-thread-1] Down to the last merge-pass, with 1 segments left of total size: 110 bytes
2021-08-02 15:27:14,319 INFO o.a.h.m.LocalJobRunner [pool-5-thread-1] 2 / 2 copied.
2021-08-02 15:27:14,341 INFO o.a.h.i.c.CodecPool [pool-5-thread-1] Got brand-new compressor [.deflate]

Maybe we should (optionally) provide a configuration with less verbose logging?

Review threads (outdated, resolved): ivy/ivy.xml, conf/log4j2.xml

lewismc (Member, Author) commented Aug 3, 2021

Thanks @sebastian-nagel, I addressed the following two issues:

  • formatting in ivy.xml
  • introduced logic for org.apache.hadoop classes to only log WARN and worse... I agree with you this is much better.
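
A restriction of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the change made in the PR: the `Console` appender name, the root level and the layout pattern are made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- Hadoop classes: only WARN and worse reach the appenders -->
    <Logger name="org.apache.hadoop" level="WARN"/>
    <!-- Everything else, including org.apache.nutch, logs at INFO (assumed level) -->
    <Root level="INFO">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```

Because loggers are additive by default, the `org.apache.hadoop` logger needs no appender reference of its own; events that pass its WARN threshold are handed on to the root logger's appenders.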

Regarding your comments on HADOOP-12956, I also noticed the following entries in STDOUT:

log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Does HADOOP-12956 stop us from implementing this and then revisiting it once HADOOP-12956 has been included in a forthcoming Hadoop release?

sebastian-nagel (Contributor) commented

> for org.apache.hadoop classes to only log WARN

+1

Regarding the warning about missing appenders: I've also seen it, but only in the logs of the ApplicationMaster, and the syslog file nevertheless contains log messages obviously written via log4j. So I don't know whether it's an issue at all, especially for us, because no Nutch classes are involved in the ApplicationMaster.

> Does HADOOP-12956 stop us from implementing this

I do not think so. But it looks like the upgrade has no effect in (pseudo-)distributed mode, where the log4j.properties in $HADOOP_HOME/etc/hadoop/ (or /etc/hadoop/conf/) takes precedence.

lewismc (Member, Author) commented Aug 4, 2021

Thanks for your thoughts @sebastian-nagel
I think this patch is ready to be proposed for merging then... wdyt?

- allow to set log file and directory via system properties `hadoop.log.file` and `hadoop.log.dir`

sebastian-nagel (Contributor) commented

Hi @lewismc, yes: the branch is ready to be merged. I've added a change that allows overriding the log file and directory via Java system properties.
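
An override of this kind can be sketched with Log4j2's system-property lookup. The snippet below is an assumption-laden illustration rather than the committed configuration: only the property names `hadoop.log.dir` and `hadoop.log.file` come from this thread, while the fallback values `logs` and `nutch.log`, the `logDir`/`logFile` property names and the plain `File` appender are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Properties>
    <!-- Use -Dhadoop.log.dir / -Dhadoop.log.file if set, else the assumed defaults -->
    <Property name="logDir">${sys:hadoop.log.dir:-logs}</Property>
    <Property name="logFile">${sys:hadoop.log.file:-nutch.log}</Property>
  </Properties>
  <Appenders>
    <File name="File" fileName="${logDir}/${logFile}">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
    </File>
  </Appenders>
  <Loggers>
    <Root level="INFO">
      <AppenderRef ref="File"/>
    </Root>
  </Loggers>
</Configuration>
```

With such a configuration, starting the JVM with `-Dhadoop.log.dir=/var/log/nutch -Dhadoop.log.file=crawl.log` (example values) would direct output to `/var/log/nutch/crawl.log`.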

lewismc merged commit e4b7be9 into apache:master on Aug 4, 2021
lewismc deleted the NUTCH-2885 branch on August 4, 2021 17:01
sebastian-nagel pushed a commit to sebastian-nagel/nutch that referenced this pull request Sep 22, 2021
* NUTCH-2885 Upgrade to Log4j2