Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log summarised description of StartupExceptions #44536

Closed

Conversation

tvernum
Copy link
Contributor

@tvernum tvernum commented Jul 18, 2019

When a Node fails to start, it will throw a StartupException on the
main thread which will cause the process to exit.
Previously these were simply logged in the same way as any other
uncaught exception, which would potentially result in long stack
traces with the key details (the primary cause) being nested somewhere
in the middle of the log lines. This was particularly true if the
failure was due to an exception being thrown within a plugin - the
primary cause may well have been wrapped in two or three other
exceptions before it was logged.

This commit adds a new summarised description whenever there is an
uncaught StartupException. This summary is logged before and after the
standard stack trace logging to make it more prominent and increase
the likelihood that it will be noticed and understood.

The summary focuses on printing messages from ElasticsearchExceptions
as these are the most likely to hold clear, specific and actionable
information and also prints the message for each cause of the
ElasticsearchException which may contain the precise details (e.g. the
pathname in a FileNotFoundException or AccessDeniedException).

Resolves: #34895

When a Node fails to start, it will throw a StartupException on the
main thread which will cause the process to exit.
Previously these were simply logged in the same way as any other
uncaught exception, which would potentially result in long stack
traces with the key details (the primary cause) being nested somewhere
in the middle of the log lines. This was particularly true if the
failure was due to an exception being thrown within a plugin - the
primary cause may well have been wrapped in two or three other
exceptions before it was logged.

This commit adds a new summarised description whenever there is an
uncaught StartupException. This summary is logged before and after the
standard stack trace logging to make it more prominent and increase
the likelihood that it will be noticed and understood.

The summary focuses on printing messages from ElasticsearchExceptions
as these are the most likely to hold clear, specific and actionable
information and also prints the message for each cause of the
ElasticsearchException which may contain the precise details (e.g. the
pathname in a FileNotFoundException or AccessDeniedException).

Resolves: elastic#34895
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@tvernum
Copy link
Contributor Author

tvernum commented Jul 18, 2019

As an example (the test case has similar, but more contrived examples) here is the log message you get if you have the wrong password for a keystore for SSL:

[2019-07-18T10:30:34,375][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node01] an exception was thrown that prevented this node from starting (java.lang.IllegalStateException: failed to load plugin class [org.elasticsearch.xpack.core.XPackPlugin])
this was caused by:
 * failed to initialize a KeyManagerFactory (exception)
   - caused by: keystore password was incorrect (i_o_exception)
     - caused by: failed to decrypt safe contents entry: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption. (unrecoverable_key_exception)

and this is a missing CA file:

[2019-07-18T12:47:58,050][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node01] an exception was thrown that prevented this node from starting (java.lang.IllegalStateException: failed to load plugin class [org.elasticsearch.xpack.core.XPackPlugin])
this was caused by:
 * failed to initialize a TrustManagerFactory (exception)
   - caused by: /path/to/elasticsearch-8.0.0-SNAPSHOT/config/tls/ca/dne.crt (no_such_file_exception)

some of those still need tidying up - we should include the path to the keystore when we fail to decrypt it, but they're separate to this change.

@albertzaharovits
Copy link
Contributor

Getting the root ElasticsearchException (and it's 3 immediate causes) from a StartupException and formatting them as a banner message sounds like a very good strategy to me! I think that, if we find cases where this strategy does not work, we've discovered a case where we should wrap the exception in an ElasticsearchException.

Copy link
Contributor

@albertzaharovits albertzaharovits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t like the approach here of changing the uncaught exception handler, as this change overloads its purpose (from handling uncaught exceptions and deciding whether or not to exit, to handling uncaught exceptions, and formatting some of them in a special way, and also still deciding whether or not to exit). Another problem that I have with this change is that it only applies to uncaught StartupException but that doesn’t cover if we catch them and handle them specially already. If we were to make this change, I would rather see it that we customize how StartupException is displayed (it’s printStackTrace method, etc.). I say if because I still need to think about whether or not this even a change that I am in favor of, but let’s see how it looks done differently, I don’t think it’s too much to move away from changing the uncaught exception handler.

@colings86 colings86 added v7.5.0 and removed v7.4.0 labels Aug 30, 2019
@tvernum
Copy link
Contributor Author

tvernum commented Oct 18, 2019

I've finally found some cycles to come back to this, and I'm a bit stuck.
Here's where we're at.
(The summary lays out my thoughts, and should be enough info, but I wrote up all the background details in an expandable section at the end).


Key Summary (TL;DR)

  • It is very hard for plugins to report user config problems in user friendly ways. They mostly end up being screens worth of stack traces. Users struggle to read them.
  • XPack makes this worse by constructing the SSLService during plugin construction rather than createComponents. That causes the exceptions to get wrapped extra times. We should try and move it out of the constructor.
  • The way we handle & log exceptions on startup is a bit strange (we detect them in Bootstrap, which causes the node/JVM to exit, but we actually log them via ElasticsearchUncaughtExceptionHandler)
  • I can't find a way to generate user-friendly logs in ElasticsearchUncaughtExceptionHandler without causing it to take on responsibilities that we don't want it to have (which I'd argue is because relying on this class as the primary way in which we log node start up failures is already a responsibility it shouldn't have).
  • Setting validators are a partial solution to some of those problems, but they're still not great, they're currently tricky to use for complex validation, and it would take a lot of work to move all the existing config checks to run within validators.

Full details

The problem

  • If there is a configuration problem (or similar) in a plugin/module then the current behaviour is for that to be reported in console & logs as an exception with a long stack trace. As far as I can tell (and I've done a lot of looking) we don't currently have any way for a plugin to report an user error other than via that exception mechanism (with one partial exception - Setting.Validator - which is discussed below).
  • Experience show that many users struggle to read those exceptions. The stack trace is useful to us, but for some percentage of users the stack trace is so overwhelming that they can't self-diagnose and they need to come to us to explain it.
  • The stack trace is several times worse if the plugin throws an exception during construction. Exceptions thrown in createComponents are not wrapped as many times, so the stack trace is a bit more manageable
  • XPack constructs the SSLService in the plugin constructor. SSL configuration issues are one of the most common problems in security, and they cause exceptions during plugin construction. I think we need to move the SSL service to be initialised in createComponents but I need to work out why it's been done this way and how much needs to change to fix it.

The current implementation

  • The console exceptions are printed via the UncaughtExceptionHandler. In the log files, we print the exception twice, once explictly in Bootstrap, and then a second time in the ElasticsearchUncaughtExceptionHandler. On the console, we only print it once, because we have code that explicitly removes the console appender before logging the exception in Bootstrap. From the comments this seems to be related to Guice and wanting trimmed traces on Console, but full traces in the logs.
  • The trace from ElasticsearchUncaughtExceptionHandler is the last thing printed (both to Console & log files). If we want to have a summary at the end of the log file/console (so that it is the last thing the user sees, and doesn't scroll off screen when the stack trace is printed) we either need to do it with the UncaughtExceptionHandler (which could mean that the logic is somewhere else, but it is triggered by this handler), or we stop using the UncaughtExceptionHandler for these exceptions.
  • I can't find a way to get StartupException to automatically print out a summary when it is logged by ElasticsearchUncaughtExceptionHandler because log4j has its own stack trace handling, and I don't believe we can easily hook into that and do something different.

Options to resolve the problem:

  • We could tackle the "how to plugins report problems" part of the problem, which would mean going back to the original issue (Simplify messages for startup errors in plugins #34895), and approaching the question as "how should a plugin/module report an error on startup, when the cause of that error is most likely a user configuration problem?"
    • The answer to that might be "use setting validators" (This is the partial exception mentioned above). We should move more validation to settings rather than components, but that is harder than it sounds right now. (I'll explain some of the challenges below)
  • We could stop using ElasticsearchUncaughtExceptionHandler as the exception logger for startup exceptions on the main thread. This is probably as simple as not making them uncaught. Somewhere between Bootstrap and Elasticsearc.main we could handle them and log them appropriately, and trigger an exit without throwing the exception up to the JVM. This treats exceptions on node start as being special (which they are, since they stop the node) with a appropriate logging behaviour that reflect this. This make sense to me, because the ElasticsearchUncaughtExceptionHandler actually considers these exceptions to be non-fatal, even though they actually cause a JVM exit (because they're on the main thread), so the contract of the ElasticsearchUncaughtExceptionHandler is being violated.
  • Alternatively, there's probably something we could do to pull the logic that determines how to log these exceptions out of ElasticsearchUncaughtExceptionHandler so that they could still be handled there, without changing its core purpose. However, I don't think that's the direction to take.

About Setting Validators:

  • An configuration problem that is detected/handled via a setting validator gives a far better experience than exceptions that are thrown during createComponents. For example here how we report a setting validation failure from a module (I just hacked this together so it's not a great error message):

    [2019-10-16T16:28:22,992][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [node01] uncaught exception in thread [main]
    org.elasticsearch.bootstrap.StartupException: java.lang.IllegalArgumentException: Invalid LDAP filter: [not-a-valid-filter]
    	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:125) ~[elasticsearch-cli-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    Caused by: java.lang.IllegalArgumentException: Invalid LDAP filter: [not-a-valid-filter]
    	at org.elasticsearch.xpack.core.security.authc.ldap.LdapUserSearchSessionFactorySettings.lambda$static$1(LdapUserSearchSessionFactorySettings.java:40) ~[?:?]
    	at org.elasticsearch.common.settings.Setting.get(Setting.java:446) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.common.settings.Setting.get(Setting.java:427) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:559) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:475) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:446) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:417) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.common.settings.SettingsModule.<init>(SettingsModule.java:160) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.common.settings.SettingsModule.<init>(SettingsModule.java:66) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.node.Node.<init>(Node.java:353) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.node.Node.<init>(Node.java:254) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:219) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:219) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:339) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    	... 6 more
    
    • That's still long, but at least it's only one screenful and it's only wrapped once (in StartupException) so it's easier to find the key information in the trace.
    • I would still like to be able to report something more user-friendly like we do with Bootstrap checks, where failures are printed in a list in human readable form.
    • If there are multiple validation failures they end up as suppressed exceptions which means the trace ends up being longer than a page of text, and the top exception scrolls off the screen. - It's still handled by the ElasticsearchUncaughtExceptionHandler and logged as WARN even though it stops the node.
  • The API for validators is feels like it is optimized for simple validations, not the complex issues that we run into with thing like configuring SSL (these aren't hard to fix, I just want to capture them somewhere)

    • The validator doesn't get the setting key passed in, and AbstractScopedSettings.validate doesn't add the setting name to the thrown exception. So, if you want the error to explain which setting is invalid (and you should) then you need to construct a new validator instance for every Setting key, which makes Affix settings cumbersome.
    • The validator doesn't get the configPath passed in, so validation of files is tricky - you can't validate that a file (e.g. SSL certs) exist unless you know where to look for it. The plugin object does get the configPath during construction, so it's possible to store it in a field and then use that construct the setting object in getSettings(), but then Setting objects cannot be static fields (which is the common pattern).
  • Bootstrap checks run before settings validation, so if a bootstrap check uses a Setting that has a validator, and the validator fails, you end up with the original ugly stack trace.

  • Right now, XPack builds the SSLService on construction which is before setting validation. So if we put validators on SSL settings they'll fire to early and produce ugle stack traces. This is an X-Pack problem not a core problem, but it might be hard to solve.

@jimczi jimczi added v7.6.0 and removed v7.5.0 labels Nov 12, 2019
@rjernst
Copy link
Member

rjernst commented Nov 23, 2019

@tvernum In regards to one underlying issue:

Right now, XPack builds the SSLService on construction which is before setting validation.

I think we need to solve this regardless. We shouldn't be consuming settings in plugin constructors (long term we should remove this ability) since we have not yet even registered settings at that point.

Looking at the consumers of the "shared ssl service", it looks like monitoring and watcher both use it to construct http clients. I wonder if we should have this constructable in core (similar to NodeClient?), and xpack security registers a hook to make it an ssl client. In short, I think this is a solvable problem, and we should first focus on that.

@tvernum
Copy link
Contributor Author

tvernum commented Nov 26, 2019

@rjernst I assume (and PluginsService.sortBundles seems to confirm) that I can rely on x-pack-core being called before any x-pack children.
That is XPackPlugin.createComponents will be called before the equivalent method in the Monitor / Watcher / Security plugins.

If so, I think moving the creation of the SSL service to XPackPlugin.createComponents is straight forward.
The places where it is used are:

  • createComponents in Monitoring, Security, Watcher (which will be safe, assuming child module ordering is safe)
  • getRestHandlerWrapper in Security (which is called from ActionModule which is constructed after createComponents completes)
  • getTransports and getHttpTransports in Security (which are called from NetworkModule, after createComponents)
  • to construct the bootstrap checks in Security, which are incorrectly done in the constructor, but I believe can be easily deferred to createComponents or getBootstrapChecks.

Is the contract around the order in which plugin methods get called documented anywhere (even if the docs are in the form of a test)? I've run into similar questions in the past while trying to ensure that an object constructed in a particular method can depend on other objects having been constructed first.

I wonder if we should have this constructable in core (similar to NodeClient?)

From memory, core doesn't (and shouldn't) depend on HTTP client, and there's dependency conflict issues if we tried to pull it in (different plugins can use different versions of http client).

@rjernst
Copy link
Member

rjernst commented Nov 27, 2019

Is the contract around the order in which plugin methods get called documented anywhere

No, and in fact, the intention was for there to be no contract. Ordering across plugins was never meant to be guaranteed. The sorting you mentioned was specifically added to control the hierarchy of plugin classloaders specifically to allow SPI across plugins, but that order does not gaurantee anything about the iteration order when calling plugin methods.

Historically, however, we have relied on createComponents being called before certain other plugin methods, and so what you propose is ok for now at least. In fact, it used to be this way when x-pack was all one plugin.

tvernum added a commit to tvernum/elasticsearch that referenced this pull request Nov 28, 2019
XPackPlugin created an SSLService within the plugin contructor.
This has 2 negative consequences:

1. The service may be constructed based on a partial view of settings.
   Other plugins are free to add setting values via the
   additionalSettings() method, but this (necessarily) happens after
   plugins have been constructed.

2. Any exceptions thrown during the plugin construction are handled
   differently than exceptions thrown during "createComponents".
   Since SSL configurations exceptions are relatively common, it is
   far preferable for them to be thrown and handled as part of the
   createComponents flow.

This commit moves the creation of the SSLService to
XPackPlugin.createComponents, and alters the sequence of some other
steps to accommodate this change.

Relates: elastic#44536
tvernum added a commit that referenced this pull request Dec 17, 2019
XPackPlugin created an SSLService within the plugin contructor.
This has 2 negative consequences:

1. The service may be constructed based on a partial view of settings.
   Other plugins are free to add setting values via the
   additionalSettings() method, but this (necessarily) happens after
   plugins have been constructed.

2. Any exceptions thrown during the plugin construction are handled
   differently than exceptions thrown during "createComponents".
   Since SSL configurations exceptions are relatively common, it is
   far preferable for them to be thrown and handled as part of the
   createComponents flow.

This commit moves the creation of the SSLService to
XPackPlugin.createComponents, and alters the sequence of some other
steps to accommodate this change.

Relates: #44536
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Dec 30, 2019
XPackPlugin created an SSLService within the plugin contructor.
This has 2 negative consequences:

1. The service may be constructed based on a partial view of settings.
   Other plugins are free to add setting values via the
   additionalSettings() method, but this (necessarily) happens after
   plugins have been constructed.

2. Any exceptions thrown during the plugin construction are handled
   differently than exceptions thrown during "createComponents".
   Since SSL configurations exceptions are relatively common, it is
   far preferable for them to be thrown and handled as part of the
   createComponents flow.

This commit moves the creation of the SSLService to
XPackPlugin.createComponents, and alters the sequence of some other
steps to accommodate this change.

Relates: elastic#44536
Backport of: elastic#49667
@polyfractal polyfractal added v7.7.0 and removed v7.6.0 labels Jan 15, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
XPackPlugin created an SSLService within the plugin contructor.
This has 2 negative consequences:

1. The service may be constructed based on a partial view of settings.
   Other plugins are free to add setting values via the
   additionalSettings() method, but this (necessarily) happens after
   plugins have been constructed.

2. Any exceptions thrown during the plugin construction are handled
   differently than exceptions thrown during "createComponents".
   Since SSL configurations exceptions are relatively common, it is
   far preferable for them to be thrown and handled as part of the
   createComponents flow.

This commit moves the creation of the SSLService to
XPackPlugin.createComponents, and alters the sequence of some other
steps to accommodate this change.

Relates: elastic#44536
@bpintea bpintea added the v7.8.0 label Mar 25, 2020
@elasticsearchmachine
Copy link
Collaborator

Hi @tvernum, I've created a changelog YAML for you.

@tvernum tvernum closed this Jan 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Logging Log management and logging utilities >enhancement Team:Core/Infra Meta label for core/infra team v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify messages for startup errors in plugins