NUTCH-2978 -- upgrade to log4j2 throughout #772

Merged
merged 5 commits into from Sep 17, 2023

tballison
Contributor

Thanks for your contribution to Apache Nutch! Your help is appreciated!

Before opening the pull request, please verify that

  • there is an open issue on the Nutch issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
  • the issue ID (NUTCH-2978)
    • is referenced in the title of the pull request
    • and placed in front of your commit messages surrounded by square brackets ([NUTCH-XXXX] Issue or pull request title)
  • commits are squashed into a single one (or few commits for larger changes)
  • Java source code follows Nutch Eclipse Code Formatting rules
  • Nutch is successfully built and unit tests pass by running ant clean runtime test
  • there should be no conflicts when merging the pull request branch into the recent master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch.
  • if new dependencies are added,
    • are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
    • are LICENSE-binary and NOTICE-binary updated accordingly?

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Nutch in general, please sign up for the Nutch mailing list. Thanks!

@tballison
Contributor Author

If folks could test this out on their workloads, that'd be fantastic! It works on mine, but I'm really hesitant to merge until someone else runs it. Thank you!

@tballison
Contributor Author

I'll merge this in a day or so unless anyone has objections.

@sebastian-nagel
Contributor

sebastian-nagel commented Sep 14, 2023

> I'll merge this in a day or so unless anyone has objections.

Give me a few more days, over the weekend. I'd like to test it at least on a pseudo-distributed Hadoop setup. If this is successful, then a failure on a fully distributed Hadoop cluster is rather unlikely.

Hadoop uses reload4j and likely puts its jars at the front of the classpath, so there might be some side effects. Also, Nutch task logs are necessarily created via reload4j.
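
One way to see which logging backend actually wins at runtime is to ask SLF4J for its logger factory and print where that class was loaded from. The following is only a minimal sketch: the class name `Slf4jBindingCheck` and the helper `describeBinding` are hypothetical and not part of Nutch or this PR. It calls `org.slf4j.LoggerFactory.getILoggerFactory()` via reflection so that it still compiles and runs when slf4j-api is absent.

```java
import java.security.CodeSource;

public class Slf4jBindingCheck {

    /**
     * Describes the SLF4J logger factory that is active in this JVM,
     * or explains why it could not be determined.
     * (Hypothetical helper, for illustration only.)
     */
    static String describeBinding() {
        try {
            // Look up SLF4J reflectively so this class has no compile-time dependency on it.
            Class<?> loggerFactory = Class.forName("org.slf4j.LoggerFactory");
            Object factory = loggerFactory.getMethod("getILoggerFactory").invoke(null);
            CodeSource source = factory.getClass().getProtectionDomain().getCodeSource();
            String location = (source == null) ? "unknown location" : String.valueOf(source.getLocation());
            return factory.getClass().getName() + " loaded from " + location;
        } catch (ClassNotFoundException e) {
            return "slf4j-api is not on the classpath";
        } catch (ReflectiveOperationException e) {
            return "could not query SLF4J: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(describeBinding());
    }
}
```

Run inside a task JVM, this would be expected to report a reload4j-backed factory if Hadoop's jars do come first; run from the job client, it should report whatever binding the Nutch runtime ships.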

@tballison
Contributor Author

Y, of course. That'd be fantastic. Thank you!

@sebastian-nagel
Contributor

+1

A test with the pseudo-distributed Hadoop setup was successful:

  • Nutch tools work properly, no issues
  • as expected, Hadoop puts slf4j-api-1.7.36.jar and slf4j-reload4j-1.7.36.jar in the classpath in front of the Nutch job jars
  • consequently, task logs are formatted using the format defined in $HADOOP_HOME/etc/hadoop/log4j.properties
  • (the good thing) log messages from Nutch classes appear in the task logs, e.g.
     2023-09-17 07:29:21,726 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: FetcherThread 33 fetching https://nutch.apache.org/ (queue crawl delay=5000ms)
    
  • the log format defined in $NUTCH_HOME/conf/log4j2.xml is only applied to the logs of the Yarn job client, e.g.
    2023-09-17 07:29:32,432 INFO fetcher.Fetcher: Fetcher: finished at 2023-09-17 07:29:32, elapsed: 00:00:25
    
  • in addition, I've included two PDFs, an XLSX and an ePub document to test the Tika parser: the docs were successfully parsed using Tika 2.3.0. If necessary, I can repeat the test for NUTCH-2959
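
For anyone who wants to repeat the classpath observation on their own setup, here is a rough sketch that lists the classpath entries matching a name fragment, in classpath order. It uses only the JDK. The class name `ClasspathOrder` and the helper `entriesMatching` are made up for illustration, and the fragments "slf4j" and "log4j" are just the ones relevant to this discussion.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ClasspathOrder {

    /**
     * Returns the classpath entries whose file name contains the fragment,
     * preserving the order in which they appear on the classpath.
     * (Hypothetical helper, for illustration only.)
     */
    static List<String> entriesMatching(String classpath, String fragment) {
        List<String> hits = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            // Compare against the file name only, not the directory part.
            if (new File(entry).getName().contains(fragment)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        for (String fragment : new String[] {"slf4j", "log4j"}) {
            System.out.println("Entries containing '" + fragment + "':");
            for (String entry : entriesMatching(classpath, fragment)) {
                System.out.println("  " + entry);
            }
        }
    }
}
```

Earlier entries take precedence when the same class exists in several jars, so the position of the Hadoop-provided slf4j jars relative to the Nutch job jars is what matters here. Note that `java.class.path` only reflects the JVM's startup classpath, so jars loaded through a custom class loader would not show up.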

@tballison
Contributor Author

tballison commented Sep 17, 2023 via email

@tballison tballison merged commit d81be51 into apache:master Sep 17, 2023
1 check passed
@tballison tballison deleted the NUTCH-2978 branch September 17, 2023 22:51