Interaction between flood stage and system indices #64251

jaymode · 2020-10-27T21:26:10Z

When a node hits the flood stage watermark, all indices on that node get the index.blocks.read_only_allow_delete setting applied with a value of true. This currently applies to system indices as well as data indices. When this happens, system operations that require writes begin to fail. That is acceptable for certain non-critical actions, but for critical actions we need to consider whether failure is the right thing to do. In an effort to limit the set of actions that could bypass the flood stage read-only block, I have attempted to enumerate the operations that I believe we should consider critical and that would otherwise fail.
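To make the current behaviour concrete, here is a minimal Python sketch of the rule described above. It is a toy model and not Elasticsearch code: the `Node` type and the `indices_to_block` function are hypothetical, and the 95% threshold is assumed to match the default value of `cluster.routing.allocation.disk.watermark.flood_stage`.

```python
from dataclasses import dataclass, field

# Assumption: mirrors the default percentage-based flood stage watermark.
FLOOD_STAGE_PERCENT = 95.0


@dataclass
class Node:
    """Hypothetical stand-in for a data node and the indices it hosts."""
    name: str
    disk_total_bytes: int
    disk_used_bytes: int
    indices: set = field(default_factory=set)

    def used_percent(self) -> float:
        return 100.0 * self.disk_used_bytes / self.disk_total_bytes


def indices_to_block(nodes):
    """Return every index hosted on a node that is at or over the flood stage.

    System indices are deliberately not treated specially here, which is
    the behaviour this issue is questioning.
    """
    blocked = set()
    for node in nodes:
        if node.used_percent() >= FLOOD_STAGE_PERCENT:
            blocked |= node.indices
    return blocked
```

In this model, a node at 96% disk usage hosting both `.security-7` and `logs-1` causes both indices to be blocked, regardless of which one is a system index.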

Critical Actions

Authentication

One thing that would fail once the flood stage is hit is authentication via SAML, OpenID Connect, or delegated PKI, and to a certain extent Kerberos. SAML, OpenID Connect, and delegated PKI authentication result in the generation of an access token and a refresh token that are used for subsequent access to Elasticsearch; if the token document cannot be written to the security index, the authentication fails. For Kerberos, Elasticsearch itself does not require tokens for subsequent authentication, but not using them has a significant performance impact. Kerberos authentication through Kibana requires the token service to be enabled, so to users accessing Elasticsearch through Kibana it will appear that they cannot authenticate using Kerberos.

A workaround could be to use the built-in users or a file realm user. However, built-in users can be disabled via the API, so unless we allow enabling/disabling a user to bypass the watermark, we cannot rely on them. Additionally, there is a setting that completely disables our reserved realm, which contains the built-in users; that is another reason not to rely on them being available. A file-based realm is our recommendation for recovery, but we do not require one to be enabled, and we should not make recovering from being over the flood stage more difficult than it needs to be.

Credential Invalidation / Logout

If the security system index becomes read-only, invalidation of API keys and tokens fails. We should do our best to keep these operations available, as they may be needed to stop an influx of data that is pushing the cluster to the flood stage uncontrollably.

SAML and OpenID Connect logout also need the ability to write data to an index as the tokens used are invalidated as part of the logout operation.

Disabling user

Along the same lines, it may become necessary to temporarily disable a user while getting a cluster back up and running, as a means of stopping data from coming in until the cluster can be rebalanced and given any additional resources it needs.

Identity Provider operations

There are probably some actions within this plugin that we may want to allow bypassing a watermark, but I am not familiar enough with the details of this to truly provide a recommendation. @tvernum @jkakavas any thoughts?

Proposal

I'd like to propose that we allow a system index plugin to opt in specific actions that would be allowed to bypass the flood stage block so that they can write data to an index. So far I've only identified security components as needing to bypass the flood stage (as of writing), and I currently believe that the Security plugin would opt in the following transport actions:

TransportSamlLogoutAction
TransportSamlAuthenticateAction
TransportSamlInvalidateSessionAction
TransportInvalidateApiKeyAction
TransportSetEnabledAction
TransportDelegatePkiAuthenticationAction
TransportOpenIdConnectAuthenticateAction
TransportOpenIdConnectLogoutAction
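
As a rough illustration of the opt-in idea, the following Python sketch models the check. Only the action names come from the list above; the `BYPASS_ACTIONS` registry, the `WriteRequest` type, and `is_write_allowed` are hypothetical names invented for this example, and a real implementation would live in the Java plugin and transport layers.

```python
from dataclasses import dataclass

# Actions the Security plugin would opt in (names from the list above).
# The registry itself is a hypothetical construct for illustration.
BYPASS_ACTIONS = frozenset({
    "TransportSamlLogoutAction",
    "TransportSamlAuthenticateAction",
    "TransportSamlInvalidateSessionAction",
    "TransportInvalidateApiKeyAction",
    "TransportSetEnabledAction",
    "TransportDelegatePkiAuthenticationAction",
    "TransportOpenIdConnectAuthenticateAction",
    "TransportOpenIdConnectLogoutAction",
})


@dataclass(frozen=True)
class WriteRequest:
    """Hypothetical description of a write issued by a transport action."""
    action: str
    index_is_system: bool
    index_is_flood_blocked: bool


def is_write_allowed(request: WriteRequest) -> bool:
    """Allow the write unless the flood stage block applies to it.

    A blocked index only accepts writes that target a system index and
    originate from an action that was explicitly opted in.
    """
    if not request.index_is_flood_blocked:
        return True
    return request.index_is_system and request.action in BYPASS_ACTIONS
```

Keeping the allowlist explicit and restricted to system indices means an ordinary indexing request can never ride along on the exemption.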

One item worth considering is a limit on how far critical operations may push past the flood stage. I don't think we should allow critical operations to run the disk out of space, but if the configuration uses byte values, how far past the flood stage do we allow them to go?
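
One possible shape for such a limit, sketched in Python under stated assumptions: the flood stage is configured as a minimum amount of free space in bytes, and critical operations get a fixed extra allowance below that threshold. The `critical_allowance_bytes` parameter and the `critical_write_allowed` function are hypothetical; nothing like them exists today, and the right allowance is exactly the open question.

```python
def critical_write_allowed(
    free_bytes: int,
    flood_stage_free_bytes: int,
    critical_allowance_bytes: int,
    write_size_bytes: int,
) -> bool:
    """Decide whether a critical write may proceed past the flood stage.

    flood_stage_free_bytes: free space below which the flood stage block
        applies (byte-valued watermark).
    critical_allowance_bytes: hypothetical extra room, below the flood
        stage, reserved for critical operations only.
    """
    if critical_allowance_bytes > flood_stage_free_bytes:
        raise ValueError("allowance cannot exceed the flood stage threshold")
    # Hard floor: critical writes may never leave less free space than this.
    floor = flood_stage_free_bytes - critical_allowance_bytes
    return free_bytes - write_size_bytes >= floor
```

With a 10 GiB flood stage and a 1 GiB allowance, a 100 MiB critical write is accepted while 9.5 GiB remains free, and rejected once it would leave less than 9 GiB free.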


elasticmachine · 2020-10-27T21:26:12Z

Pinging @elastic/es-core-infra (:Core/Infra/Core)

elasticmachine · 2020-10-27T21:26:12Z

Pinging @elastic/es-security (:Security/Authentication)

tvernum · 2020-10-28T05:12:44Z

Identity Provider operations

There's nothing that writes to the security indices here.

jaymode added >enhancement :Core/Infra/Core Core issues without another label :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) labels Oct 27, 2020

elasticmachine added Team:Core/Infra Meta label for core/infra team Team:Security Meta label for security team labels Oct 27, 2020

jaymode mentioned this issue Oct 27, 2020

System Indices #50251

Open

23 tasks

rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020

rjernst removed the needs:triage Requires assignment of a team area label label Dec 18, 2020
