
Disable ingest-attachment logging #93878

Merged
10 commits merged into elastic:main on Feb 27, 2023

Conversation


@joegallo joegallo commented Feb 16, 2023

Closely related to #91964 and #93608

We run the actual text extraction part of the ingest-attachment module inside a highly restricted environment, in order to protect ourselves from security bugs in dependencies:

return AccessController.doPrivileged(
(PrivilegedExceptionAction<String>) () -> TIKA_INSTANCE.parseToString(new ByteArrayInputStream(content), metadata, limit),
RESTRICTED_CONTEXT
);
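For context, a restricted AccessControlContext like the one used above can be built from a ProtectionDomain that grants no permissions at all. The following is a minimal, self-contained sketch — the class name and the trivial action standing in for the Tika call are illustrative, not the module's actual RESTRICTED_CONTEXT:

```java
import java.security.AccessControlContext;
import java.security.AccessController;
import java.security.Permissions;
import java.security.PrivilegedAction;
import java.security.ProtectionDomain;

public class RestrictedExample {
    // A context whose single ProtectionDomain carries an empty Permissions
    // collection: code running under it is granted no permissions beyond
    // what the security policy forces through.
    private static final AccessControlContext NO_PERMISSIONS = new AccessControlContext(
        new ProtectionDomain[] { new ProtectionDomain(null, new Permissions()) }
    );

    // Runs a trivial action under the restricted context and returns its result.
    public static String runRestricted() {
        return AccessController.doPrivileged(
            (PrivilegedAction<String>) () -> "parsed text",
            NO_PERMISSIONS
        );
    }

    public static void main(String[] args) {
        System.out.println(runRestricted());
    }
}
```

With a SecurityManager installed, any privileged operation attempted inside the lambda (file access, reflection, and notably most logging I/O) would be denied against the empty permission set — which is why logging from inside the real RESTRICTED_CONTEXT fails.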

That environment is so locked down that logging from inside it won't work reliably or correctly. Additionally, our logging there is an enormous mess: some dependencies use log4j directly (as Elasticsearch itself already does), while others use slf4j or commons-logging (both entirely unconfigured).

Additionally, because of the way logging fails, it can end up repeatedly throwing exceptions internally while accomplishing very little (at great cost of time). Fixing or working around the logging issues will significantly improve the performance of text extraction.

This PR does not solve the problem of the limited permissions (that's #93714); rather, it silences all logging during text extraction. It does, however, arrange our dependencies better, so that once we fix log4j for ourselves we can use it here too, and reroute slf4j logging through log4j as well.


As a matter of inside baseball, we'd like an ordinary code review from within the @elastic/es-data-management team, but we're also going to get a 10,000 ft level review from @seanstory as well as a "no objections" review from @rjernst.

Notes from the individual commits:

  • Since we're not using that library anymore.
  • https://github.com/qos-ch/slf4j/tree/master/jcl-over-slf4j is where the implementation lives; there's a LICENSE there but no NOTICE, nor do I see a NOTICE in the directory above it.
  • https://github.com/qos-ch/slf4j/tree/master/slf4j-nop is where the implementation lives; there's a LICENSE there but no NOTICE, nor do I see a NOTICE in the directory above it.
  • Otherwise the LLRC (the low-level REST client) brings it in and we end up in jar-hell.
  • It won't work anyway, because we're in a limited context where log4j and other logging libraries don't have the permissions necessary to log successfully.
@joegallo added the >bug, :Data Management/Ingest Node, Team:Data Management, v8.7.0, v8.8.0, and v8.6.3 labels on Feb 16, 2023
@elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine (Collaborator) commented:
Hi @joegallo, I've created a changelog YAML for you.


joegallo commented Feb 16, 2023

Some dependencies are using log4j directly (like Elasticsearch actually already does), some are using slf4j or commons-logging (both entirely unconfigured).

joegallo@galactic:~/Code/elastic/elasticsearch $ cat result-log4j.txt | grep -B1 -E '(jar$|log4j.Logger)' | grep -v -- '--' | grep -B3 log4j.Logger | grep jar
distribution/archives/linux-aarch64-tar/build/distributions/elasticsearch-8.8.0-SNAPSHOT/modules/ingest-attachment/poi-5.2.3.jar
distribution/archives/linux-aarch64-tar/build/distributions/elasticsearch-8.8.0-SNAPSHOT/modules/ingest-attachment/poi-ooxml-5.2.3.jar
distribution/archives/linux-aarch64-tar/build/distributions/elasticsearch-8.8.0-SNAPSHOT/modules/ingest-attachment/poi-scratchpad-5.2.3.jar
distribution/archives/linux-aarch64-tar/build/distributions/elasticsearch-8.8.0-SNAPSHOT/modules/ingest-attachment/xmlbeans-5.1.1.jar
joegallo@galactic:~/Code/elastic/elasticsearch $ cat result-commons-logging.txt | grep -B2 -E '(jar$|commons.logging)' | grep -v -- -- | grep -B3 -E 'commons/logging' | grep jar$
distribution/archives/linux-aarch64-tar/build/distributions/elasticsearch-8.8.0-SNAPSHOT/modules/ingest-attachment/fontbox-2.0.27.jar
distribution/archives/linux-aarch64-tar/build/distributions/elasticsearch-8.8.0-SNAPSHOT/modules/ingest-attachment/pdfbox-2.0.27.jar

I did a bit of javap -v analysis of the classes in the jars in the modules/ingest-attachment directory of an unzipped Elasticsearch tar to see what's logging how, and poi and xmlbeans are logging via log4j, while pdfbox and fontbox are using commons-logging. Everything else either uses slf4j or doesn't log.
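A coarser but self-contained alternative to `javap -v` is to scan each `.class` entry in a jar for the target framework's internal class-name prefix, since constant-pool strings appear verbatim in the class-file bytes. A sketch — the class and method names here are hypothetical, not part of this PR:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarLoggerScan {
    // Returns true if any .class entry in the jar contains the given
    // internal class-name prefix (e.g. "org/apache/logging/log4j") in its bytes.
    public static boolean referencesClass(Path jar, String internalName) throws Exception {
        byte[] needle = internalName.getBytes(StandardCharsets.UTF_8);
        try (JarFile jf = new JarFile(jar.toFile())) {
            Enumeration<JarEntry> entries = jf.entries();
            while (entries.hasMoreElements()) {
                JarEntry e = entries.nextElement();
                if (!e.getName().endsWith(".class")) continue;
                try (InputStream in = jf.getInputStream(e)) {
                    if (indexOf(in.readAllBytes(), needle) >= 0) return true;
                }
            }
        }
        return false;
    }

    // Naive byte-array substring search.
    private static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        for (String path : args) {
            System.out.println(path + " -> log4j: "
                + referencesClass(Path.of(path), "org/apache/logging/log4j"));
        }
    }
}
```

Running this over the jars in modules/ingest-attachment with prefixes like `org/apache/logging/log4j`, `org/slf4j`, and `org/apache/commons/logging` would reproduce the same classification as the `javap -v` pipeline above, at the cost of occasional false positives from unrelated string constants.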

joegallo commented Feb 16, 2023

Note to self that we need to use the Co-authored-by syntax when merging this so that @masseyke gets the proper attribution.

edit: Success!

@seanstory (Member) left a comment:

@masseyke and I had a brief discussion of this PR offline. I had two questions, which he answered:

Q: Why have modules/ingest-attachment/src/main/config/log4j2.properties set all these classes to off if you're just using a no-op logger?
A: Because that config file is for log4j, and the no-op logger is just for slf4j. We aren't using a bridge to route slf4j to log4j or vice-versa.

Q: Why not bridge all log APIs to a single no-op log impl?
A: Eventually we might bridge the SLF4J APIs to the log4j implementation, but not until after the TikaImpl permissions are fixed. Even then, we'll likely still leave these specific loggers set to "off", since their output is next to useless and just adds noise. This change would be valuable even without the performance implications.
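The "off" configuration in modules/ingest-attachment/src/main/config/log4j2.properties takes roughly this shape — an illustrative sketch; the exact logger names in the real file may differ:

```properties
logger.tika.name = org.apache.tika
logger.tika.level = off
logger.pdfbox.name = org.apache.pdfbox
logger.pdfbox.level = off
logger.poi.name = org.apache.poi
logger.poi.level = off
```

Each `logger.<id>.name` / `logger.<id>.level` pair disables one package subtree; log4j applies the most specific matching logger, so anything under these packages is dropped regardless of the root logger's level.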

@masseyke (Member) commented Feb 16, 2023:

I thought I'd throw in my notes from talking with @seanstory because I probably won't remember this in 6 months:

Before this PR:

  • Some tika libraries use commons-logging, which writes to java util logging, which writes to stderr, which is captured by log4j, which fails because of restricted permissions.
  • Some tika libraries use slf4j, which writes to stderr, which is captured by log4j, which fails because of restricted permissions.
  • Some tika libraries use log4j, which fails because of restricted permissions.

With this PR:

  • Some tika libraries use commons-logging, which is bridged to slf4j, which is using a no-op implementation so the logs are lost.
  • Some tika libraries use slf4j, which is bound to a no-op implementation (slf4j-nop), so the logs are lost.
  • Some tika libraries use log4j. Since we have turned off log4j logging for all of those packages that we know about in log4j2.properties, the logs are lost.

In the future, after we have fixed #93714 and changed slf4j to use the log4j bridge instead of the no-op implementation:

  • Some tika libraries use commons-logging, which is bridged to slf4j, which is bridged to log4j. Since we have turned off log4j logging for all of those packages that we know about in log4j2.properties, the logs are lost.
  • Some tika libraries use slf4j, which is bridged to log4j. Since we have turned off log4j logging for all of those packages that we know about in log4j2.properties, the logs are lost.
  • Some tika libraries use log4j. Since we have turned off log4j logging for all of those packages that we know about in log4j2.properties, the logs are lost.

Note that even in the future state, we plan to have most tika logging turned off in log4j2.properties because it has not been very useful in the context of the ingest attachment processor. But a user could turn it back on using the cluster settings API (or editing the file directly if they have access).
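In dependency terms, the routing described above comes down to which SLF4J bridge and binding jars are on the module's classpath. A hypothetical Gradle fragment — the artifact coordinates are real SLF4J/log4j artifacts, but the versions and the fragment itself are illustrative, not this PR's actual build changes:

```groovy
dependencies {
  // route commons-logging calls onto the SLF4J API
  implementation "org.slf4j:jcl-over-slf4j:1.7.36"
  // current state: discard everything that reaches SLF4J
  implementation "org.slf4j:slf4j-nop:1.7.36"
  // future state: swap slf4j-nop for the log4j binding instead,
  // e.g. "org.apache.logging.log4j:log4j-slf4j-impl"
}
```

SLF4J allows only one binding on the classpath at a time, which is why the no-op binding and the log4j binding are an either/or choice (and why having both would reintroduce the jar-hell mentioned in the commit notes).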

Also note that this problem is exacerbated by log rollover (since it requires log4j to have additional permissions, and it slows things down a little, and it causes more stack traces). We were able to make things fail spectacularly before this PR by setting the rollover log size to something much lower than the default.

@joegallo (Contributor, Author) commented:

We were able to make things fail spectacularly before this PR by setting the rollover log size to something much lower than the default.

Note for future github archaeologists:

diff --git a/distribution/src/config/log4j2.properties b/distribution/src/config/log4j2.properties
index 36b5b03d9a1..bfdd7e215b0 100644
--- a/distribution/src/config/log4j2.properties
+++ b/distribution/src/config/log4j2.properties
@@ -18,7 +18,7 @@ appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
 appender.rolling.policies.time.interval = 1
 appender.rolling.policies.time.modulate = true
 appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
-appender.rolling.policies.size.size = 128MB
+appender.rolling.policies.size.size = 1KB
 appender.rolling.strategy.type = DefaultRolloverStrategy
 appender.rolling.strategy.fileIndex = nomax
 appender.rolling.strategy.action.type = Delete

^ this did the trick nicely.

@gmarouli gmarouli self-requested a review February 21, 2023 17:08
@gmarouli (Contributor) left a comment:
LGTM! Great job @joegallo & @masseyke! And @masseyke thank you for the walk-through.

@rjernst (Member) left a comment:

LGTM

@joegallo joegallo removed the v8.6.3 label Feb 27, 2023
@joegallo joegallo merged commit 7ae08e2 into elastic:main Feb 27, 2023
@joegallo joegallo deleted the ingest-attachment-logging-fix-nop branch February 27, 2023 14:54
joegallo added a commit that referenced this pull request Feb 27, 2023
Co-authored-by: Keith Massey <keith.massey@elastic.co>
saarikabhasi pushed a commit to saarikabhasi/elasticsearch that referenced this pull request Apr 10, 2023
Co-authored-by: Keith Massey <keith.massey@elastic.co>
pgomulka added a commit that referenced this pull request May 5, 2023
Currently, when Elasticsearch starts up, there are warnings on the console complaining that no SLF4J providers were found. This is emitted twice:
1 - when the repository-azure module is loaded
2 - when the xpack-security module (which depends on transport-netty) is loaded

This commit adds slf4j-nop to fix that warning.

related to #93714
and #93878