Support for proxied PKI authentication #34396

Closed
jaymode opened this issue Oct 11, 2018 · 22 comments · Fixed by #45906
Labels
>feature :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc)

Comments

@jaymode
Member

jaymode commented Oct 11, 2018

The PKI realm currently relies on the TLS handshake with the client for authentication, which works in most cases. However, there are cases where end user requests may be proxied by another server such as Kibana. In these cases, the PKI realm would not have access to the TLS handshake with the client and the client's certificate directly. In order to enable end user PKI authentication for this use case we would need to add support for proxied PKI (PPKI MITM AAS as @jkakavas puts it). The requirements for this are:

  • The proxied client certificate must be placed in an HTTP header (see the sketch below)
  • The client connection proxying the certificate must be authenticated using PKI
  • The client subject authenticated using PKI must be allowed to proxy PKI (configured at the realm level)
  • There needs to be some level of auditing for this; can we safely re-purpose run-as?

cc @kobelb @clintongormley
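
(For illustration only: a minimal Java sketch of how a proxy could place the client certificate from its own TLS handshake into an HTTP header when forwarding a request. The header name and the PEM file path are hypothetical, not part of any proposal in this issue.)

```java
import java.io.FileInputStream;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;
import java.util.Base64;

public class ProxiedCertHeader {
    public static void main(String[] args) throws Exception {
        // Certificate the proxy received from the end user during its own TLS handshake
        // (read from a PEM file here purely to keep the example runnable).
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        X509Certificate clientCert;
        try (FileInputStream in = new FileInputStream("client.crt")) {
            clientCert = (X509Certificate) cf.generateCertificate(in);
        }
        // Base64-encode the DER bytes so the certificate can travel in an HTTP header;
        // the header name is illustrative, nothing in this issue fixes it.
        String headerValue = Base64.getEncoder().encodeToString(clientCert.getEncoded());
        System.out.println("X-Client-Certificate: " + headerValue);
        System.out.println("Proxied subject: " + clientCert.getSubjectX500Principal().getName());
    }
}
```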

@jaymode jaymode added >feature help wanted adoptme :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) labels Oct 11, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-security

@alexbrasetvik
Contributor

Some of the material here might be of interest: https://docs.google.com/presentation/d/1ko8xveyspZsZ6RWcXwHizPnt3R1EAGFdV__a2ZTGUDM/edit?usp=sharing

@tvernum
Contributor

tvernum commented Mar 25, 2019

The client connection proxying the certificate must be authenticated using PKI

Do we specifically mean that it must use the PKI realm, or simply that it must have a validated client certificate?

@jaymode
Member Author

jaymode commented Mar 25, 2019

Do we specifically mean that it must use the PKI realm, or simply that it must have a validated client certificate?

My initial thought was that we should require the connection to be authenticated with the PKI Realm and the proxied client certificate should be authenticated with the PKI realm as well. Upon further thought, I do not feel that aspect is a hard requirement. I think the following should be properties that we keep though:

  • the connection from the proxying app has authentication
  • the authenticated user from the proxying app needs to be authorized to proxy the certificate

@albertzaharovits
Contributor

albertzaharovits commented Jun 3, 2019

The description above is of a variant of run-as functionality; there is one authenticated client (Kibana) that is authorized as another principal. We already have the privilege model to define which principals can be authorized as which others (the run_as privilege). The way to convey the authorization principal is through the es-security-runas-user HTTP header.
Therefore, one option is to extend the existing run_as functionality to make it work in the case where the authorization principal is defined by an X509v3 certificate. There are de-facto standards to convey the client certificate (or fields of it) from the proxy (Kibana) to the service provider (Elasticsearch); e.g. the X-SSL-CERT family of headers in nginx.
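
(As a concrete illustration of that de-facto convention, a small Java sketch of consuming such a header on the receiving side. It assumes the proxy URL-encodes the PEM, the way nginx's escaped client-cert variable does; other proxies may pass it differently.)

```java
import java.io.ByteArrayInputStream;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;

public class SslCertHeaderParser {
    /**
     * Rebuilds the client certificate from an X-SSL-CERT style header set by a
     * TLS-terminating proxy. Assumes the proxy URL-encoded the PEM; adjust the
     * decoding if the proxy uses a different escaping scheme.
     */
    static X509Certificate parse(String headerValue) throws Exception {
        String pem = URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        return (X509Certificate) cf.generateCertificate(
                new ByteArrayInputStream(pem.getBytes(StandardCharsets.UTF_8)));
    }
}
```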

Implementation hurdles:

  • there are possibly multiple identities bound to the certificate; we need to build capabilities into ES that allow the administrator to define THE way to parse out the principal.
  • if we wish to allow extracting the principal in terms of the client's trusted certificate chain, it becomes very complex; not least because the de-facto headers do not support passing cert chains.

Qualms:

  • Kibana is more than a proxy doing authentication as part of the TLS termination. Kibana makes many more requests to ES for a single incoming request; there is also the notion of background jobs running as the user. Therefore we would be stretching X-SSL-CERT outside its de-facto use case (there is no inherent security problem with that, but if we go with a standard practice, let's follow the use case the practice is meant for; otherwise, let's find another standard).

Note that this alternative will not add or extend any authentication realm, it will extend an authorization feature (the run_as impersonation).

Another alternative is to create a new API that implements the delegation functionality of the PKI realm. In ES we broadly have two types of realms, the ones that validate secrets, such as passwords, and realms that consume authentication statements from third party services (SAML, OIDC, Kerberos). The latter do not do authentication but validate an authentication statement (claims, tickets). Upon successful validation, ES issues a token usable by kibana to authorize as the user in the claims/ticket.

In this proposal ES will be validating the certificate chain of the client that connected over TLS to kibana and afterwards will emit a token to be used by kibana to authorize as the user. Kibana then stores the token on the client as a cookie, and forwards it to ES as a header.

Implementation hurdles:

  • The same problem with extracting a principal from a certificate chain on the ES side.
  • Kibana has to make sure the token in the cookie is associated with the current TLS client certificate; the underlying TLS connection can change without a proper logout ceremony. Kibana has to detect that the token in the cookie was issued for a different client certificate than the currently used one (after a new TLS handshake) and re-request a new token.

Qualms:

  • Certificate chains may look like signed assertions of identity, but they lack the crucial short time validity. Without a limited lifetime, the cert chain used as an identity assertion provides little guarantees. Therefore, this is not a true realm, offering a false sense of security (unusable outside Kibana) and relying on Kibana to validate the association of the secret private key with the certificate (during TLS termination).

Another alternative is to completely delegate authentication to Kibana, including parsing of the principal. This is a variant of the first alternative. In this way there is little to do on the ES side other than the role-mapping to the principal (basically run-as with the option to not look up the user). The principal can be conveyed to ES via headers (X-SSL-CERT and friends) or fancy JWT tokens. Both of these require TLS between Kibana and ES. The only narrow advantage of JWT is the ability to communicate more complex attributes about the user that can be utilized in role mappings.

RFC @elastic/es-security @elastic/kibana-security

@albertzaharovits
Contributor

When discussing options, I propose we weigh what each considered option brings over simply using the existing ES run-as feature.

I will shortly describe the setup with run-as, and let us consider this as the baseline implementation.

Kibana does TLS termination, parses the principal out of the DN in the Subject attribute of the client's certificate, and then forwards that principal value in the es-security-runas-user header to ES.
Kibana has to make sure that the header value of every request to ES truthfully reflects the client cert on the TLS session between the client and Kibana. Similarly to ES's PKI realm, Kibana should permit the configuration of additional truststores that must validate the client cert, in addition to the HTTP layer validation. There ought to be mandatory TLS between Kibana and ES. ES is not aware that Kibana validates PKI certs (e.g. in audit logs) but can do the LDAP lookup of the principal by rebuilding the DN with the rules of the individual LDAP realms. ES also has to be configured to permit run_as for the Kibana principal.
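
(For concreteness, a rough Java sketch of the principal extraction in this baseline; the regex mirrors the PKI realm's default username_pattern, and the HTTP plumbing that would actually attach the es-security-runas-user header is only hinted at in a comment.)

```java
import java.security.cert.X509Certificate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunAsBaseline {
    // Mirrors the default username_pattern the ES PKI realm applies to the Subject DN.
    private static final Pattern CN_PATTERN = Pattern.compile("CN=(.*?)(?:,|$)", Pattern.CASE_INSENSITIVE);

    /** Extracts the run-as principal from the client certificate the proxy saw during its TLS handshake. */
    static String principalFrom(X509Certificate clientCert) {
        String dn = clientCert.getSubjectX500Principal().getName();
        Matcher m = CN_PATTERN.matcher(dn);
        if (m.find() == false) {
            throw new IllegalArgumentException("cannot extract a principal from DN: " + dn);
        }
        // The proxy would then authenticate to ES with its own credentials and add
        //   es-security-runas-user: <principal>
        // to every request it forwards on behalf of this client.
        return m.group(1);
    }
}
```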

I believe this satisfies most of the requirements, and I would suggest we go with this. Alternatively, let us find compelling use cases not satisfied by this setup to motivate implementation variations suggested in #34396 (comment).

@kobelb
Contributor

kobelb commented Jun 3, 2019

Previous discussions about allowing Kibana to utilize the run-as header have ended with us deciding that we shouldn't grant this privilege to the kibana_system role because of the security ramifications. If I understand correctly, the other solutions which have been proposed don't require relaxing this restriction, and prevent Kibana from being able to "run as" arbitrary users.

@albertzaharovits
Contributor

The principal that does run_as does not have to be kibana_system; it could be some dedicated user that is configured on ES to run_as a defined set of users. The username and the ES privilege could be made part of the configuration procedure.

But, in essence, the issue remains: We have to trust Kibana with forwarding the identity of the client, because the client cannot offer ES any proofs. We trust that Kibana keeps its keypair for TLS to ES a secret.

@jkakavas
Member

jkakavas commented Jun 4, 2019

A few comments for now to hopefully help move the discussion forward; I'm not sure I've reached a conclusion yet as to what I prefer from the given options:

there are possibly multiple identities bound to the certificate; we need to build capabilities into ES that allow the administrator to define THE way to parse out the principal.

Can you elaborate? Client certificates will have one subject. I get the point about parsing (CN/DNs/emails) but not the multiple identities one.

The principal that does run_as does not have to be kibana_system; it could be some dedicated user

The actual threat would remain though. Someone with access to Kibana (and thus to this user's credentials) will be allowed to run_as arbitrary users.

that is configured on ES to run_as a defined set of users

This is tricky. Wouldn't that mean that setting up PKI in Kibana would require you to know up front all your users that might authenticate with a client certificate/key and pre-configure ES to allow the service user to run_as any of these?

We trust that Kibana keeps its keypair for TLS to ES a secret.

Do you mean that we should require TLS mutual authentication from Kibana to ES? What would the benefit be for this to be a strict requirement? Or could the same be said for kibana/service user credentials?

But, in essence, the issue remains: We have to trust Kibana with forwarding the identity of the client, because the client cannot offer ES any proofs.

The difference is that in the run_as with es-security-runas-user approach, if that trust is breached, then a malicious user can impersonate any ES native user.

If, alternatively, we have Kibana only validate that the user who authenticated has access to the private key that corresponds to this certificate, and ES verify the certificate based on its configuration, a malicious user with access to Kibana would also need to forge a certificate from a CA that ES trusts (as ES will still validate the certificate signature) or steal the client certificate from a user. Given that client certificates are not meant to be secret, this is not a huge mitigation factor, but people don't usually publish their client certs.

Certificate chains may look like signed assertions of identity, but they lack the crucial short time validity. Without a limited lifetime, the cert chain used as an identity assertion provides little guarantees. Therefore, this is not a true realm, offering a false sense of security

I'm not sure I agree with this. The lack of short time validity applies to all client certificates - it is not relevant only to this implementation aspect - and, still, these are widely used for authentication. There are even PKI setups/deployments that issue short-lived certificates on demand when they are used for client authentication.

@tvernum
Contributor

tvernum commented Jun 5, 2019

Thanks @albertzaharovits, that's a very clear write up.

Disclosure: I still prefer the option to create a new API that implements the delegation functionality of the PKI realm, but I'm open to changing my mind.

there are possibly multiple identities bound to the certificate; we need to build capabilities into ES that allow the administrator to define THE way to parse out the principal

The same problem with extracting a principal from a certificate chain on the ES side.

We already have a way to do this in PKI realms (a regex on the DN). It's simple, but has not (so far) been an issue.
For me this is an argument in favour of a delegated PKI realm. If we want to improve our certificate-to-user implementation then we just do it in the PKI realm and it would be available for both Kibana and ES terminated TLS. I don't think we want to have multiple ways to solve the same underlying problem of extracting an ES-capable user identity from a certificate.

Note that this alternative will not add or extend any authentication realm

This for me is actually a negative (but it is a design choice that we could change).
Identity & Authentication in Elasticsearch is based on realms. Here we have a pseudo-authentication mechanism (pseudo in that it doesn't happen within Elasticsearch itself, but it does take place within the Elastic Stack) that doesn't use realms. I don't like the idea of having something that's different when it doesn't need to be.
And, in order for run-as to work the user has to exist in some realm. So this proposal requires that some sort of realm exist for the users that authenticate in this way. It might be native or LDAP, but it would not support ephemeral users (with role mapping) in the way Elasticsearch PKI does - unless we took the 3rd option you propose.

Kibana has to make sure the token in the cookie is associated with the current TLS client certificate

This is a good point, but is true in any of the alternatives. Kibana stores identity information in the sid cookie. And surprising, broken things will happen if the information in sid falls out of sync with the TLS client cert. I would propose putting a cryptographic hash of the cert into the cookie and, on each request, validating that the cookie value still matches the TLS value.
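
(A minimal sketch of that check, with hypothetical helper names; Kibana's real session code is of course not Java.)

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.cert.X509Certificate;
import java.util.Base64;

public class CertSessionBinding {
    /** Fingerprint of the certificate bytes, suitable for storing alongside the session cookie. */
    static String fingerprint(X509Certificate cert) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(cert.getEncoded());
        return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
    }

    /** On each request: does the cert on the current TLS session still match the cookie value? */
    static boolean sessionStillValid(String fingerprintFromCookie, X509Certificate currentTlsCert) throws Exception {
        return MessageDigest.isEqual(
                fingerprintFromCookie.getBytes(StandardCharsets.UTF_8),
                fingerprint(currentTlsCert).getBytes(StandardCharsets.UTF_8));
    }
}
```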

Even in the run_as approach, it might be possible to ensure that the connection to ES always reflects the client certificate, but that would not protect Kibana feature controls (menu options, etc). No matter what solution we come up with, Kibana will need to detect client-cert changes and force a logout-login cycle.

Certificate chains may look like signed assertions of identity, but they lack the crucial short time validity.

Since (per @jkakavas's comment above) this is fundamentally true of certificates, I assume your concern is that we don't know when Kibana performed the TLS handshake and we might be subject to replay attacks. If this is something we are particularly concerned about then we can add the necessary information into the API. The body of the "swap this certificate for a token" API could include a timestamp & a signature of some sort.
I don't know that the risk is sufficient to require that (I'd need to think some more) but it's not an unsolvable problem.

Similarly to ES's PKI realm, Kibana should permit the configuration of additional truststores that must validate the client cert, in addition to the HTTP layer validation.

This seems like a downside for this option. It would require that Kibana duplicate functionality that Elasticsearch already has.
Under option 1, Kibana must replicate all the trust options that ES has, or else Kibana PKI would be (feature wise) an inferior offering to Elasticsearch PKI. Under the 2nd option, Kibana just needs to have enough configuration to perform a handshake. That could just be the corporate root CA, and Elasticsearch can perform additional trust checks (which could include doing different role mapping, or lookup in a different LDAP directory) depending on which CA signed the client cert.

ES also has to be configured to permit run_as for the Kibana principal

In the simplest case of using the existing es-security-runas-user header for this, it effectively grants Kibana superuser privileges.
Kibana would be able to run-as elastic because run-as doesn't have restrictions on the realm or the roles that the target user has.
We could build additional controls into run-as to solve this (I would lean towards "run as any user from realm X/Y/Z"), but that is additional engineering work, and we would have to work out how you configure it. The kibana_system role is fixed, so we'd need to have some mechanism by which you could specify which realms were permitted when the kibana principal tried to run-as.
This would bring run-as more in line with authorization_realm (which specifies which realms to use for lookups).

An alternative is to take the other approach of putting the whole cert into the header. At least then it is only possible for Kibana to run-as a user that can be looked up by certificate (but then this is tying the implementation to a PKI realm, which it sounds like you want to avoid).

We have to trust Kibana with forwarding the identity of the client, because the client cannot offer ES any proofs

I think the "trust Kibana" terminology is proving to be too vague. It doesn't explicitly describe the risks we're trying to mitigate, and so it leads to all options appearing to be the same "we have to trust Kibana" even though they have different risk profiles.

The main issue that I think we should be identifying and controlling for is if an attacker manages to extract the kibana username+password from Kibana (e.g. through an RCE) and then uses that to do a privilege escalation on Elasticsearch.
If kibana has the ability to run_as any user then this escalation is trivial - you just send es-security-runas-user: elastic. So how would we protect against this? One option that you allude to is to enforce TLS from Kibana to Elasticsearch, yet that only provides any real protection if we are using Kibana's client cert as its proof of identity. TLS by itself doesn't do much there, and even requiring a client cert might not (depending on the trust anchors); it has to be locked down to a specific keypair.

It is possible to do that. It would require that we don't put any username/password into kibana.yml, ideally disable the kibana user and then map the Kibana cert's DN to the kibana_system role. But, then we've switched the secret that the RCE needs to extract from being the Kibana username+password to being the Kibana cert+key. That's better, but it still means Kibana has a secret that ends up being the keys-to-the-kingdom, and the right RCE will get you there.

What I'd like for us to have is some sort of protection that the stack admin can enable that would limit the set of users/roles that Kibana can escalate to.
Linking Kibana PKI to an ES realm provides an option for that. If the proxy_pki token grant has been configured against a proxy_pki realm, then that realm can be configured to never grant superuser (right now that means, don't create a role mapping for superuser, but we could look at putting something more explicit in if needed). While we cannot (and would not) enforce that ourselves, it does mean that the stack admin can manage this risk themselves.

There are other options that would be useful for any of the alternatives. For example IP restrictions on some part of this (e.g. the kibana user) so that the kibana credentials are not useful outside of the machine(s) on which Kibana runs.

Outstanding Issue
We still don't know how people expect to map these certificates to a user-with-roles in ES/Kibana. But for me that's an argument for aligning this with PKI realms because:

  • We have something that works already and people seem OK with it.
  • It has (slightly) more than the minimum feature set. We support role-mapping by username and/or DN, and we support LDAP/native user lookup through authorization_realms.
  • Hooking into the realm infrastructure means that any new features we build there are automatically (or mostly automatically) available to Kibana PKI (e.g. when we added authorization_realms). This would include possible future development like considering additional data within the certificate chain (as mentioned in your earlier comment), extracting additional metadata from certificates (for role mapping) or having a mapping-function for PKI-principal to LDAP-principal (for authorization_realms).

@albertzaharovits
Contributor

Thanks @tvernum and @jkakavas for the valuable feedback! There is a lot to unpack, so rather than individually answering each question and suggestion I will try to elaborate on the proposals.

Generally, I think we (me, Tim and Ioannis) are all OK with extending the PKI realm to delegate authentication inside the Stack by having Kibana pass the client cert from the TLS handshake to ES in exchange for a token, i.e. Tim's original proposal, i.e. second alternative from #34396 (comment).

I also believe that given Tim's argument in

Kibana stores identity information in the sid cookie. And surprising, broken things will happen if the information in sid falls out of sync with the TLS client cert. I would propose putting a cryptographic hash of the cert into the cookie and, on each request, validating that the cookie value still matches the TLS value.

and given that the Kibana server is more than a proxy (crafts ES requests, background jobs, etc), we can dismiss my argument that hoisting session information from the transport layer to the application layer is not required unless we go with the currently favored approach (i.e. it is required anyway).

Moreover, I also agree with

I think the "trust Kibana" terminology is proving to be too vague. It doesn't explicitly describe the risks we're trying to mitigate, and so it leads to all options appearing to be the same "we have to trust Kibana" even though they have different risk profiles.

and I believe we can all surmise that a trusted cert chain is more difficult to craft (or snoop from a compromised Kibana server) than a substring from its Subject field (i.e. the run-as principal), which would probably be scattered everywhere in Kibana's memory since it has to go with every request to ES.

What I still don't like is that, for lack of a better word, we "fake" the delegation and try to cover for it by saying it is "inside the Stack" and if we have to, we can jump in and do a threat analysis. Inside ES, all the other realms that delegate authentications (SAML, OIDC, Kerberos) accept timestamped signed assertions. Upon successful validation they release a token. Client certificates are signed identity assertions but are not time bound. This is crucial. My argument is that toggling the delegated authn on a PKI realm, as proposed herein, would be akin to disabling the timestamp validation on the SAML realm or the OIDC realm. We can put controls in place to compensate for that (Tim's IP firewall proposal) and narrow the pool of users, which leans me to OK this proposal, but this is a genuine conceptual problem with practical fixes.
The other proposals (run-as types), although inferior from the Stack perspective, acknowledge this weakness and pass the responsibility to the Kibana server to do the authentication. I think we discussed the technical problems of these alternatives (consistency with ES's PKI realm, Kibana is more exposed, passing cert chains in headers, ...) and I agree the run-as type of solution is impractical. But at least it does not have the conceptual weakness.

I wish to propose a variant of the favored approach that I believe makes the delegation genuine. Let us make Kibana construct a JWT type of JSON that contains the client certificate chain from the TLS session, a timestamp and a signature with a shared secret (could be the kibana_system user's password, or a dedicated secret between Kibana and ES) and let it exchange that for a token. We could probably standardize this as a JWT (it is pretty lax) but we don't really have to. What is important is that a standalone ES server does not delegate authentication without validating signed timestamped assertions. Maybe an attacker that can capture a client certificate and the kibana_system password also has the capability to discover the signing secret and craft the JWT, so there is no practical improvement, but ES taken separately handles delegation by the books. I feel strongly about this, what do others think?
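
(To make the shape of this proposal concrete, a hedged Java sketch of such a signed, timestamped payload. The field names, the dot-separated encoding and the HmacSHA256 choice are illustrative only; nothing here was agreed upon, and the proposal was ultimately not pursued.)

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.cert.X509Certificate;
import java.time.Instant;
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;

public class SignedDelegationPayload {
    /**
     * Builds a JWT-like payload carrying the client certificate chain and a timestamp,
     * signed with a secret shared between the proxy and ES.
     */
    static String build(List<X509Certificate> chain, byte[] sharedSecret) throws Exception {
        String encodedChain = chain.stream().map(cert -> {
            try {
                return "\"" + Base64.getEncoder().encodeToString(cert.getEncoded()) + "\"";
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).collect(Collectors.joining(","));
        String body = "{\"x509_chain\":[" + encodedChain + "],\"iat\":" + Instant.now().getEpochSecond() + "}";

        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(sharedSecret, "HmacSHA256"));
        byte[] signature = mac.doFinal(body.getBytes(StandardCharsets.UTF_8));

        // Compact "payload.signature" form, both parts base64url-encoded.
        Base64.Encoder b64 = Base64.getUrlEncoder().withoutPadding();
        return b64.encodeToString(body.getBytes(StandardCharsets.UTF_8)) + "." + b64.encodeToString(signature);
    }
}
```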

@tvernum
Contributor

tvernum commented Jun 17, 2019

Inside ES, all the other realms that delegate authentications (SAML, OIDC, Kerberos) accept timestamped signed assertions ... Client certificates are signed identity assertions but are not time bound ... toggling the delegated authn on a PKI realm, as proposed herein, would be akin to disabling the timestamp validation on the SAML realm or the OIDC realm.

I think this is mixing different concepts.

SAML assertions are typically sent via an intermediary, thus they are inherently subject to replay attacks. The design of a SAML assertion is intended to counter this risk by including time constraints that make the assertion fail if replayed at a later date, but the primary need for this is because the assertion may travel via an untrusted channel (e.g. the user's browser).

If we have a direct back channel to the authentication server (AS), then a timestamp would be technically unnecessary, as we could pass a session token to the AS and enquire whether the session is still valid. That would mitigate the same "hold and replay" attack without using a timestamp in the message (at the cost of being more chatty and having a runtime dependency on the AS).

Similarly if we obtain a message directly from the AS and we are confident in its integrity then signing is unnecessary. SAML recognises this in making signatures mandatory for the Web profile (HTTP-POST/Redirect) but accepting that these are unnecessary when the IdP and SP are communicating on a transport that has end-to-end protection.
Signing allows us to know that a message originated from a particular party even though it has passed through untrusted intermediaries.

So, if the kibana server is communicating directly with ES, over an authenticated TLS connection, then I don't see why we need signatures or timestamps.

  • The fact that the message was sent now is a sufficient timestamp to know that the certificate is valid for a current session.
  • The fact that the kibana user authenticated is sufficient evidence that it originated from kibana.
  • The TLS connection provides message integrity and prevents tampering.

I'm not in any way opposed to having a signed & timestamped protocol, but I just don't see what risk it is mitigating that isn't already covered by a direct connection with TLS + authc.

Timestamps are easy enough. If we have a reason to need one, then we can implement one. But we also need to think about clock skew, so it's not a trivial case of having Kibana set an expiry 60 seconds into the future.
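
(For illustration, a skew-tolerant timestamp check could look roughly like the following; the concrete durations would be deployment choices, not values discussed in this thread.)

```java
import java.time.Duration;
import java.time.Instant;

public class TimestampCheck {
    /**
     * Accepts a payload timestamp that is neither older than maxAge nor further in the
     * future than the allowed clock skew.
     */
    static boolean acceptable(Instant issuedAt, Instant now, Duration maxAge, Duration allowedSkew) {
        boolean notFromTheFuture = issuedAt.isAfter(now.plus(allowedSkew)) == false;
        boolean notExpired = issuedAt.isBefore(now.minus(maxAge).minus(allowedSkew)) == false;
        return notFromTheFuture && notExpired;
    }
}
```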

If we need signing, then I'd like to consider why TLS client auth is not the correct solution for it. That will provide signatures at the transport layer and we can enforce either a pre-shared key (certificate pinning) or the more typical issuer (CA) checks. Technically, of course, requests sent over TLS are already signed using a session key, but if we think we need something more than that, then I would hope that client certificates would be sufficient for that.

@tvernum
Contributor

tvernum commented Jun 17, 2019

That said, I think we're at the point where we can and should get into some implementation.
I think the only current points of contention are:

  • what metadata (timestamp) Kibana would need to send alongside the certificate chain
  • whether that payload requires a signature
  • what additional validation we would perform on
    • the payload (metadata)
    • signature (if any)
    • connection (does it require client certificates)

We can start building the basics without having answered those questions.
To start with we'd need:

  • A proxy written in Java that implements what we expect of Kibana (for integ testing)
  • An endpoint of some sort for that to call (either a new token grant_type, or a whole new endpoint)
  • The actual authc against PKI realms & exchange for the token
  • Some sort of controls to enable/disable this feature for some/all realms and some calling user/role (e.g. kibana_system). I don't think we've decided what this would look like, but I'm pretty sure we do not want to ship it as "always on", given how it changes the trust model between Kibana & ES

If we have available engineering cycles (and I think we do), then let's get on with what we do know and see how we go.

@jkakavas
Member

jkakavas commented Jul 1, 2019

I totally agree with Tim's comments on #34396 (comment).

I don't see what structuring the message from Kibana to ES as a JWT and signing this JWT buys us. Requiring authentication (the kibana user's credentials - or those of another user with the required privileges) on top of TLS offers the same guarantees with no need for additional implementation.

Client certificates are signed identity assertions but are not time bound. This is crucial.

I still can't see what is the issue with the time validity of the client certificate in this specific case. Could you please explain the threat you see and why you consider this such a high risk, or why it is not mitigated by TLS and authentication?

@albertzaharovits
Contributor

I totally agree with Tim's comments on #34396 (comment).

I also agree with Tim.

I still can't see what is the issue with the time validity of the client certificate in this specific case.
Could you please explain the threat you see and why you consider this such a high risk, or why it is not mitigated by TLS and authentication?

A certificate chain is an assertion of identity and has a time validity. SAML and OIDC identity assertions also have time validities, but those are much more restricted compared to certificates. My observation was that in the PKI delegation flow we are using certificates more like SAML and OIDC assertions. Although the twist, which Tim pointed out, is that signatures are not required (but also not forbidden) in SAML for backchannel communications.

Everything on top of TLS and authentication has confidentiality, authentication (mutual?), integrity and replay protections. There is nothing more needed.

But I would also want us to consider ES outside of the Kibana context. There is a PKI realm installed on ES which allows delegation (which is implemented as we all seem to agree). Basic credentials over TLS are crack-able, the way a pure PKI realm is not. So a PKI realm which supports delegation is more exposed. How do we limit that? Tim mentioned some alternatives, all good ones.

But one alternative is also signing the "assertion" with another secret besides the kibana_system password. This secret has another scope than that password, so it could have a larger length/entropy, is not stored in the .security index, cannot be changed by the API, could be different for different nodes, could be different for different realms, does not complicate the Kibana setup in the way mutual TLS would...

@bizybot
Contributor

bizybot commented Jul 8, 2019

We discussed one way of mitigating leaked Kibana credentials: using out-of-band authentication after the TLS mutual authentication between Kibana and the client. If the Kibana credentials were leaked, they could be used together with any user's public certificate to exchange it for a token with ES. With out-of-band authentication, ES would communicate an OTP to the client via some other channel (for example, extracting the email address from the certificate and sending the OTP via email is just one way to do this), which the client would need to use to create an assertion stating that Kibana is allowed to act on its behalf.

We have decided not to do this for the following reasons:

  • complicates the UX for the login scenario
  • in some environments, the other channel may not be available

@albertzaharovits
Contributor

TL;DR I reckon we do NOT need any restricted authn schemes for the proxying subject, for the delegate-PKI feature. But we need to investigate further authz options as part of our permission model refactoring in #44048. But this is not critically important for this feature.

Citing Jay from #34396 (comment) with

I think the following should be properties that we keep though:

  • the connection from the proxying app has authentication
  • the authenticated user from the proxying app needs to be authorized to proxy the certificate

I think we need to start discussing the authentication and authorization aspect of the PKI delegation feature. This also ties in with Tim's

Some sort of controls to enable/disable this feature for some/all realms and some calling user/role (e.g. kibana_system). I don't think we've decided what this would look like, but I'm pretty sure we do not want to ship it as "always on", given how it changes the trust model between Kibana & ES

from #34396 (comment) .

On the authentication side, I believe we have to decide whether we need to enforce any particular authentication scheme (the prominent option being TLS with client authn) and restrict the BASIC method (which kibana_system normally uses). The rationale for this is that some schemes are more secure than others (e.g. the kibana_system password can be brute forced, but a private key cannot). TLS with client authn is presumably the prominent option because, in this case, there is definitely a PKI realm configured on ES, so this strong authn scheme entails a smaller configuration burden (on the ES server at least). But this is not generally true, because there are policies around certificate handling, and it does not even consider the configuration burden for the proxying app, which could be load balanced.

Also on the authentication aspect, I had previously been peddling a form of HMAC at the application logic level. It is also a form of authentication, because it implies a shared secret between ES and the proxying app. But it is a form of authentication at the "feature level" because it does not use the authn framework in ES Security, and it would have to happen after that one (given the request processing order). There are a few benefits, all originating from the fact that the scope of the secret is at the feature level. For example, if the secret is a new realm secret setting, then realms and nodes can have different values for it, allowing finer grained control over the proxying identity. Moreover, it does not encumber kibana_system with more privileges, and it allows for different password/secret policies (the PKI delegation secret cannot be changed by the password API). However, all these benefits exist only when specifically comparing to kibana_system. In the general case, this form of "feature-level" authentication is redundant. For example, there could be a separate file-based user that the proxying app uses solely for this feature, and this would have all the advantages of any other authn layer at the "feature level".

Overall, I do not think we need any specific forms of restricted authentication for the PKI delegation feature. This is because, and this refers to the next point, the es-admin can create role-mapping rules such that only certain users will be authorized to use this feature. And the es-admin has every capacity to tune the authentication strength. Given the previous examples, the privileges could be granted only to a certain delegate-pki user which is part of some PKI or file-based realm. Moreover, if we are concerned about over-encumbering kibana_system with privileges, we should recommend a different system user for this feature on Kibana's side.

In terms of authorization, the more fine grained the better. Hence the best would be a way for the es-admin to grant the "delegate-pki" privilege to some specific principals such that they can "delegate-authenticate" as some specific principals. Given that #44106 introduces a new transport action, and using the usual role mapping rules, it is easy for the es-admin to restrict the principal doing the delegation (the proxying app), but it is not possible, given the current state of the authorization framework, to define permissions that restrict the "delegated-authenticated" user. We could "bake" this part of the authorization in at the realm level. In this case there would be a PKI realm settings namespace used to define these permissions (who can authenticate as whom, in the case of this realm, using the delegation feature). This is ugly from the es-admin pov because it splits authorization configuration across two places, and it is kludgy from an engineering perspective because it mixes authorization with application logic. We might also improve our permission model to work outside the request "boundary", such as allowing us to authorize and inspect the response as well. I think it would be wonderful if we can achieve something like that as part of the effort in #44048. My stance is that we can ship this feature without a complete authz scheme, without allowing the "authenticated-as" principals to be specified in the permission model, and work on this item separately, as a follow-up, after we have settled on #44048.

@albertzaharovits
Contributor

We have discussed the matter of restricting the delegation feature in our weekly team meeting.

There would be a cluster privilege granting the authentication delegatee user (i.e. kibana_system) the ability to get access tokens for any user that the PKI realms normally authenticate. This privilege will most likely stand on its own, not be included in any of the others apart from all.
For the es-admin to restrict the users that can be authenticated by the delegatee, there would be user metadata fields populated exclusively when authn has been performed in this way, so that role mapping rules can differentiate such users. Hence, role mapping rules can exclude particular users from being authenticated by the proxy by not mapping any roles to them.

@ShazCho

ShazCho commented Jul 23, 2019

Hi, I’d like to contribute some thoughts on this.

  1. Is the suggestion here that there would be a master certificate that connects to Elasticsearch via the TLS handshake? If so, is there a requirement for that master certificate to be authenticated via the PKI realm?
  2. Can we authenticate via another realm? Then the TLS handshake incorporates the certificate for the client connecting to Kibana.

Thanks

@ShazCho

ShazCho commented Jul 23, 2019

Specifically, I meant to mention the master account being authenticated via Kerberos, then TLS used for certificate PKI authentication during the proxy engagement.

@albertzaharovits
Contributor

Hi @ShazCho,

Yes, in the current proposal, the proxy user doing the delegation can be authenticated by "any" realm, Kerberos included, not only PKI ("any" is quoted because some realms require browser interaction (SAML, OIDC), whereas the user doing the delegation works as a system user, therefore the scheme to achieve delegation in those cases is probably wrong).

@ShazCho

ShazCho commented Jul 23, 2019

Ahh thanks. So I've been waiting for Kibana PKI for a while; however, the system_user from my perspective needs to be approved by a variety of different mechanisms depending on each implementation I do, e.g. PKI, Kerberos, SAML, OAuth. Does this mean that system_user can only be authenticated via PKI?

albertzaharovits added a commit that referenced this issue Aug 26, 2019
This commit introduces PKI realm delegation. This feature
supports the PKI authentication feature in Kibana.

In essence, this creates a new API endpoint which Kibana must
call to authenticate clients that use certificates in their TLS
connection to Kibana. The API call passes to Elasticsearch the client's
certificate chain. The response contains an access token to be further
used to authenticate as the client. The client's certificates are validated
by the PKI realms that have been explicitly configured to permit
certificates from the proxy (Kibana). The user calling the delegation
API must have the delegate_pki privilege.

Closes #34396
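
For reference, a rough sketch of the exchange this introduces, using the plain JDK HTTP client in place of Kibana. The path and request field below follow the merged implementation as I understand it (POST the base64-encoded DER certificate chain, leaf first, to the delegate PKI endpoint); error handling, JSON parsing and TLS setup are omitted.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.security.cert.X509Certificate;
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;

public class DelegatePkiCall {
    /**
     * Exchanges the end user's certificate chain (leaf first) for an access token.
     * The caller's credentials must carry the delegate_pki privilege.
     */
    static String delegate(HttpClient client, String esUrl, String basicAuth,
                           List<X509Certificate> chain) throws Exception {
        String encodedChain = chain.stream()
                .map(DelegatePkiCall::base64Der)
                .map(s -> "\"" + s + "\"")
                .collect(Collectors.joining(","));
        String body = "{\"x509_certificate_chain\":[" + encodedChain + "]}";

        HttpRequest request = HttpRequest.newBuilder(URI.create(esUrl + "/_security/delegate_pki"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + basicAuth)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        // The JSON response contains the access token used to authenticate as the end user.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    private static String base64Der(X509Certificate cert) {
        try {
            return Base64.getEncoder().encodeToString(cert.getEncoded());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```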
kaypeter87 added a commit to kaypeter87/elasticsearch that referenced this issue Sep 9, 2019
* Put error message from inside the process into the exception that is thrown when the process doesn't start correctly. (#45846)

* update bwcVersions

* [DOCS] Reformat match query (#45152)

* Fix update-by-query script examples (#43907)

Two examples had swapped the order of lang and code when creating a
script.

Relates #43884

* Adjusting ML usage object serialization bwc version (#45874)

* Fsync translog without writeLock before rolling (#45765)

Today, when rolling a new translog generation, we block all write
threads until a new generation is created. This choice is perfectly 
fine except in a highly concurrent environment with the translog 
async setting. We can reduce the blocking time by pre-sync the 
current generation without writeLock before rolling. The new step 
would fsync most of the data of the current generation without 
blocking write threads.

Close #45371

* Add node.processors setting in favor of processors (#45855)

This commit namespaces the existing processors setting under the "node"
namespace. In doing so, we deprecate the existing processors setting in
favor of node.processors.

* Remove binary file accidentally committed

🤦‍♀️

* Fix TransportSnapshotsStatusAction ThreadPool Use (#45824)

In case of an in-progress snapshot this endpoint was broken because
it tried to execute repository operations in the callback on a
transport thread which is not allowed (only generic or snapshot
pool are allowed here).

* Enable testing against JDK 14 (#45178)

This commit enables testing against JDK 14.

* [DOCS] Add anchor to  version types list. (#45886)

* Adding a warning to from-size.asciidoc

Customers occasionally discover a known behavior in Elasticsearch's pagination that does not appear to be documented. This warning is intended to educate customers of this behavior while still highlighting alternative solutions.

* Remove redundant Java check from Sys V init (#45793)

In the Sys V init scripts, we check for Java. This is not needed, since
the same check happens in elasticsearch-env when starting up. Having
this duplicate check has bitten us in the past, where we made a change
to the logic in elasticsearch-env, but missed updating it here. Since
there is no need for this duplicate check, we remove it from the Sys V
init scripts.

* Update joda to 2.10.3 (#45495)

* Allow partial request body reads in AWS S3 retries tests (#45847)

This commit changes the tests added in #45383 so that the fixture that 
emulates the S3 service now sometimes consumes all the request body 
before sending an error, sometimes consumes only a part of the request 
body and sometimes consumes nothing. The idea here is to beef up a bit 
the tests that writes blob because the client's retry logic relies on 
marking and resetting the blob's input stream.

This pull request also changes the testWriteBlobWithRetries() so that it 
(rarely) tests with a large blob (up to 1mb), which is more than the client's 
default read limit on input streams (131Kb).

Finally, it optimizes the ZeroInputStream so that it is a bit more effective 
(now works using an internal buffer and System.arraycopy() primitives).

* Move testRetentionLeasesClearedOnRestore (#45896)

* [DOCS] Reformat put mapping API docs (#45709)

* Fix RemoteClusterConnection close race (#45898)

Closing a `RemoteClusterConnection` concurrently with trying to connect
could result in double invoking the listener.

This fixes
RemoteClusterConnectionTest#testCloseWhileConcurrentlyConnecting

Closes #45845

* [ML][Transforms] fix doSaveState check (#45882)

* [ML][Transforms] fix doSaveState check

* removing unnecessary log statement

* [ML] Improve progress reportings for DF analytics (#45856)

Previously, the stats API reports a progress percentage
for DF analytics tasks that are running and are in the
`reindexing` or `analyzing` state.

This means that when the task is `stopped` there is no progress
reported. Thus, one cannot distinguish between a task that never
run to one that completed.

In addition, there are blind spots in the progress reporting.
In particular, we do not account for when data is loaded into the
process. We also do not account for when results are written.

This commit addresses the above issues. It changes progress
to being a list of objects, each one describing the phase
and its progress as a percentage. We currently have 4 phases:
reindexing, loading_data, analyzing, writing_results.

When the task stops, progress is persisted as a document in the
state index. The stats API now reports progress from in-memory
if the task is running, or returns the persisted document
(if there is one).

* Expose the ability to cancel async requests in REST high-level client (#45688)

This commits makes all the async methods in the high level client return the `Cancellable` object that the low level client now exposes.

Relates to #45379 
Closes #44802

* Fix IngestService to respect original document content type (#45799)

This PR modifies the logic in IngestService to preserve the original content type 
on the IndexRequest, such that when a document with a content type like SMILE 
is submitted to a pipeline, the resulting document that is persisted will remain in 
the original content type (SMILE in this case).

* Change `{var}` convention to `<var>` (#45904)

* Fix bugs in Painless SCatch node (#45880)

This fixes two bugs:
- A recently introduced bug where an NPE will be thrown if a catch block is 
empty.
- A long-time bug where an NPE will be thrown if multiple catch blocks in a 
row are empty for the same try block.

* Update translog checkpoint after marking ops as persisted (#45634)

If two translog syncs happen concurrently, then one can return before
its operations are marked as persisted. In general, this should not be
an issue; however, peer recoveries currently rely on this assumption.

Closes #29161

* [DOCS] Reformat get index API docs (#45758)

* [DOCS] Reformat delete index API docs (#45755)

* Handle multiple loopback addresses (#45901)

AbstractSimpleTransportTestCase.testTransportProfilesWithPortAndHost
expects a host to only have a single IPv4 loopback address, which isn't
necessarily the case. Allow for >= 1 address.

* [DOCS] Relocate Ingest API docs to REST API section (#45812)

* [ML][Transforms] adjusting when and what to audit (#45876)

* [ML][Transforms] adjusting when and what to audit

* Update DataFrameTransformTask.java

* removing unnecessary audit message

* Remove processors setting (#45905)

The processors setting was deprecated in version 7.4.0 of Elasticsearch
for removal in Elasticsearch 8.0.0. This commit removes the processors
setting.

* Remove translating processors in Docker entrypoint (#45923)

Now that processors is no longer a valid Elasticsearch setting, this
commit removes translation for it in the Docker entrypoint.

* Deprecate the pidfile setting (#45938)

This commit deprecates the pidfile setting in favor of node.pidfile.

* Adjust node.pidfile version in cluster formation

Now that the deprecation of pidfile has been backported to 7.4.0, this
commit adjusts the version-conditional logic in cluster formation tasks
for setting pidfile versus node.pidfile.

* Remove non task aware execute methods from TransportAction (#45821)

The TransportAction class has several ways to execute the action, some
of which will create a task. This commit removes those non task aware
variants in favor of handling task creation inside NodeClient for local
actions.

* Remove the pidfile setting (#45940)

The pidfile setting was deprecated in version 7.4.0 of Elasticsearch for
removal in Elasticsearch 8.0.0. This commit removes the pidfile setting.

* Allow Transport Actions to indicate authN realm (#45767)

This commit allows the Transport Actions for the SSO realms to
indicate the realm that should be used to authenticate the
constructed AuthenticationToken. This is useful in the case that
many authentication realms of the same type have been configured
and where the caller of the API(Kibana or a custom web app) already
know which realm should be used so there is no need to iterate all
the realms of the same type.
The realm parameter is added in the relevant REST APIs as optional
so as not to introduce any breaking change.

* re-enable BWC tests after merging #45767 (#45948)

* Fix plaintext on TLS port logging (#45852)

Today if non-TLS record is received on TLS port generic exception will
be logged with the stack-trace.
SSLExceptionHelper.isNotSslRecordException method does not work because
it's assuming that NonSslRecordException would be top-level.
This commit addresses the issue and the log would be more concise.

* Add Test Logging for #45953 (#45957)

Adding some logging to track down #45953 and making the failing assertion log more detail

* [DOCS] Reformat create index API docs (#45749)

* Fix SnapshotStatusApisIT (#45929)

The snapshot status when blocking can still be INIT in rare cases when
the new cluster state that has the snapshot in `STARTED` hasn't yet
become visible.
Fixes #45917

* Fix Broken HTTP Request Breaking Channel Closing (#45958)

This is essentially the same issue fixed in #43362 but for http request
version instead of the request method. We have to deal with the
case of not being able to parse the request version, otherwise
channel closing fails.

Fixes #43850

* Refactor RepositoryCredentialsTests (#45919)

This commit refactors the S3 credentials tests in 
RepositoryCredentialsTests so that it now uses a single 
node (ESSingleNodeTestCase) to test how secure/insecure 
credentials are overriding each other. Using a single node 
makes it much easier to understand what each test is actually 
testing and IMO better reflect how things are initialized.

It also allows to fold into this class the test 
testInsecureRepositoryCredentials which was wrongly located 
in S3BlobStoreRepositoryTests. By moving this test away, the 
S3BlobStoreRepositoryTests class does not need the 
allow_insecure_settings option anymore and thus can be 
executed as part of the usual gradle test task.

* [DOCS] Reformat get settings API docs (#45924)

* Better logging for TLS message on non-secure transport channel (#45835)

This commit enhances logging for 2 cases:

1. If non-TLS enabled node receives transport message from TLS enabled
node on transport port.
2. If non-TLS enabled node receives HTTPs request on transport port.

* Relax translog assertion in testRestoreLocalHistoryFromTranslog (#45943)

Since #45473, we trim translog below the local checkpoint of the safe
commit immediately if soft-deletes enabled. In
testRestoreLocalHistoryFromTranslog, we should have a safe commit after
recoverFromTranslog is called; then we will trim translog files which
contain only operations that are at most the global checkpoint.

With this change, we relax the assertion to ensure that we don't put
operations to translog while recovering history from the local translog.

* Consider artifact repositories backed by S3 secure (#45950)

Since credentials are required to access such a repository, and these
repositories are accessed over an encrypted protocol (https), this
commit adds support to consider S3-backed artifact repositories as
secure. Additionally, we add tests for this functionality.

* Build: Support `console-result` language (#45937)

This adds support for verifying that snippets with the `console-result`
language are valid json. It also switches the response snippets on the
`docs/get` page from `js` to `console-result` which will allow clients
to provide "alternatives" for them like they can now do with
`// CONSOLE` snippets.

* [DOCS] Reformat indices exists API docs (#45918)

* [DOCS] Reformat get field mapping API docs (#45700)

* Add Cumulative Cardinality agg (and Data Science plugin) (#43661)

This adds a pipeline aggregation that calculates the cumulative
cardinality of a field.  It does this by iteratively merging in the
HLL sketch from consecutive buckets and emitting the cardinality up
to that point.

This is useful for things like finding the total "new" users that have
visited a website (as opposed to "repeat" visitors).

This is a Basic+ aggregation and adds a new Data Science plugin
to house it and future advanced analytics/data science aggregations.

* [DOCS] Correct `IIF` conditional section title (#45979)

* Fix typo in plugin name, add to allowed settings

* PKI realm authentication delegation (#45906)

This commit introduces PKI realm delegation. This feature
supports the PKI authentication feature in Kibana.

In essence, this creates a new API endpoint which Kibana must
call to authenticate clients that use certificates in their TLS
connection to Kibana. The API call passes to Elasticsearch the client's
certificate chain. The response contains an access token to be further
used to authenticate as the client. The client's certificates are validated
by the PKI realms that have been explicitly configured to permit
certificates from the proxy (Kibana). The user calling the delegation
API must have the delegate_pki privilege.

Closes #34396

* [ML] fixing bug where analytics process starts with 0 rows (#45879)

The native process requires that there be a non-zero number of rows to analyze. If the flag --rows 0 is passed to the executable, it throws and does not start.

When building the configuration for the process we should not start the native process if there are no rows.

Adding some logging to indicate what is occurring.

* [ML] add supported types to no fields error message (#45926)

* [ML] add supported types to no fields error message

* adding supported types to logger debug

* Range Field support for Histogram and Date Histogram aggregations(#45395)

 * Add support for a Range field ValuesSource, including decode logic for range doc values and exposing RangeType as a first class enum
 * Provide hooks in ValuesSourceConfig for aggregations to control ValuesSource class selection on missing & script values
 * Branch aggregator creation in Histogram and DateHistogram based on ValuesSource class, to enable specialization based on type.  This is similar to how Terms aggregator works.
 * Prioritize field type when available for selecting the ValuesSource class type to use for an aggregation

* [TEST] wait for search task to be cancelled in SearchRestCancellationIT (#45978)

SearchRestCancellationIT aborts an http request, and then checks that
the corresponding search task has been cancelled on the server-side.
There are no guarantees that the task has already been marked cancelled
after the `cancel` calls returns, and there is no easy wait for that.

This commit introduces an assertBusy to try and wait for the search task
to be marked cancelled.

Closes #45911

* Remove node settings from blob store repositories (#45991)

This commit starts from the simple premise that the use of node settings
in blob store repositories is a mistake. Here we see that the node
settings are used to get default settings for store and restore throttle
rates. Yet, since there are not any node settings registered to this
effect, there can never be a default setting to fall back to there, and
so we always end up falling back to the default rate. Since this was the
only use of node settings in blob store repository, we move them. From
this, several places fall out where we were chaining settings through
only to get them to the blob store repository, so we clean these up as
well. That leaves us with the changeset in this commit.

* [DOCS] Streamline GS search topic. (#45941)

* Streamline GS search topic.

* Added missing comma.

* Update docs/reference/getting-started.asciidoc 

Co-Authored-By: István Zoltán Szabó <istvan.szabo@elastic.co>

* Add test for CopyBytesSocketChannel (#45873)

Currently we use a custom CopyBytesSocketChannel for interfacing with
netty. We have integration tests that use this channel, however we never
verify the read and write behavior in the face of potential partial
writes. This commit adds a test for this behavior.

* Do not create engine under IndexShard#mutex (#45263)

Today we create new engines under IndexShard#mutex. This is not ideal
because it can block the cluster state updates which also execute under
the same mutex. We can avoid this problem by creating new engines under
a separate mutex.

Closes #43699

* Fix compilation in CumulativeCardinalityAggregatorTests (#46000)

Some generics were specified at too fine-grained a level.

* [DOCS] Streamlined GS aggs section. (#45951)


* [DOCS] Streamlined GS aggs section.

* Update docs/reference/getting-started.asciidoc

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Don't use assemble task on root project (#45999)

The root project uses the base plugin to get a clean task, but does not
actually need the assemble task. This commit changes the root project to
use the lifecycle-base plugin, which while still creating the assemble
task, won't add any dependencies to it.

* [DOCS] Fix typo. (#46006)

* [TEST] wait for http channels to be closed in ESIntegTestCase (#45977)

We recently added a check to `ESIntegTestCase` in order to verify that
no http channels are being tracked when we close clusters and the
REST client. Close listeners though are invoked asynchronously, hence
this check may fail if we assert before the close listener that removes
the channel from the map is invoked.

With this commit we add an `assertBusy` so we try and wait for the map
to be empty.

Closes #45914
Closes #45955

* Add `manage_own_api_key` cluster privilege (#45897)

The existing privilege model for API keys, with privileges like
`manage_api_key`, `manage_security` etc., is too permissive, and
we want finer-grained control over the cluster privileges
for API keys. Previously, API keys also needed these
privileges just to get their own information.

This commit adds support for `manage_own_api_key` cluster privilege
which only allows api key cluster actions on API keys owned by the
currently authenticated user. Also adds support for retrieval of
the API key self-information when authenticating via API key
without the need for the additional API key privileges.
To support this privilege, we are introducing additional
authentication context along with the request context such that
it can be used to authorize cluster actions based on the current
user authentication.

The API key get and invalidate APIs introduce an `owner` flag
that can be set to true if the API key request (Get or Invalidate)
is for the API keys owned by the currently authenticated user only.
In that case, `realm` and `username` cannot be set as they are
assumed to be the currently authenticated ones.

The changes cover HLRC changes, documentation for the API changes.

Closes #40031
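
To illustrate the new `owner` flag (request shapes are a sketch against the existing get and invalidate API key endpoints), a user holding only `manage_own_api_key` could list and invalidate just their own keys:

```
GET /_security/api_key?owner=true

DELETE /_security/api_key
{
  "owner": true
}
```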

* Partly revert globalInfo.ready check (#45960)

This check was introduced in #41392 but had the unwanted side effect
that the keystore settings in such blocks would not be added to the
node's keystore. Given that we have a mid-term plan for FIPS testing
that would make such checks unnecessary, and that the conditional
in these two cases is not really that important, this change removes
the conditional logic so that full-cluster-restart and rolling
upgrade tests will run with PEM files for key/certificate material
whether or not we're in a FIPS JVM.

Resolves: #45475

* [ML] Add option to regression to randomize training set (#45969)

Adds a parameter `training_percent` to regression. The default
value is `100`. When the parameter is set to a value less than `100`,
we randomly choose, from the rows that can be used for training (ie. those
that have a value for the dependent variable), whether to actually use each
one for training. This enables splitting the data into a training set and
the rest, usually called the testing, validation or holdout set, which allows
for validating the model on data that have not been used for training.

Technically, the analytics process considers as training the data that
have a value for the dependent variable. Thus, when we decide a training
row is not going to be used for training, we simply clear the row's
dependent variable.
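
A hedged configuration sketch using the new parameter (job id, index and field names are illustrative):

```
PUT _ml/data_frame/analytics/house_prices
{
  "source": { "index": "houses" },
  "dest": { "index": "houses_out" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "training_percent": 80
    }
  }
}
```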

* Disallow partial results when shard unavailable (#45739)

Searching with `allowPartialSearchResults=false` could still return
partial search results during recovery. If a shard copy fails
with a "shard not available" exception, the failure would be ignored and
a partial result returned. The one case where this is known to happen
is when a shard copy is recovering when searching, since
`IllegalIndexShardStateException` is considered a "shard not available"
exception.

Relates to #42612

* [DOCS] Reformat open index API docs (#45921)

* Fix RegressionTests#fromXContent (#46029)

* The `trainingPercent` must be between `1` and `100`, not `0` and `100`, which was causing test failures

* [DOCS] Separate and reformat close index API docs (#45922)

* Remove already exist assertion while renew ccr lease (#46009)

If a CCR lease disappears while we are renewing it, then we will
issue asyncAddRetentionLease to add that lease. And if
asyncAddRetentionLease takes longer than retentionLeaseRenewInterval,
then we can issue another asyncAddRetentionLease request. One of the
asyncAddRetentionLease requests will fail with
RetentionLeaseAlreadyExistsException, hence tripping the assertion.

Closes #45192

* Watcher max_iterations with foreach action execution (#45715)

Prior to this commit, the foreach action execution had a hard-coded
limit of 100 iterations. This commit allows the maximum number of
iterations to be a configuration ('max_iterations') on the foreach
action. The default remains 100.
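
A minimal watch sketch showing the new setting (watch id, index pattern and payload paths are illustrative, using the standard Watcher `foreach` action syntax):

```
PUT _watcher/watch/log_errors
{
  "trigger": { "schedule": { "interval": "10m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": { "query": { "match": { "level": "error" } } }
      }
    }
  },
  "condition": { "compare": { "ctx.payload.hits.total": { "gt": 0 } } },
  "actions": {
    "log_each_hit": {
      "foreach": "ctx.payload.hits.hits",
      "max_iterations": 500,
      "logging": { "text": "error doc {{ctx.payload._id}}" }
    }
  }
}
```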

* [DOCS] Reformat update index settings API docs (#45931)

* Always add Java-9 style file permissions (#46050)

Java 9 removed pathname canonicalization, which means that we need to
add permissions for the path and also the real path when adding file
permissions. Since master requires a minimum runtime of JDK 11, we no
longer need conditional logic here to apply this pathname
canonicalization with our bare hands. This commit removes that
conditional pathname canonicalization.

* [ML][HLRC] Add data frame analytics regression analysis (#46024)

* [ML] Support boolean fields for DF analytics (#46037)

This commit adds support for `boolean` fields in data frame
analytics (and currently both outlier detection and regression).
The analytics process expects `boolean` fields to be encoded as
integers with 0 or 1 value.

* Add a few notes on Cancellable to the LLRC and HLRC docs. (#45912)

Add a section to both the low level and high level client documentation on asynchronous usage and `Cancellable` added for #44802 

Co-Authored-By: Lee Hinman <dakrone@users.noreply.github.com>

* [DOCS] [8.0] Add upgrade matrix to docs (#46027)

* [DOCS] Add index alias exists API docs (#46042)

* Few clean ups in ESBlobStoreRepositoryIntegTestCase (#46068)

* Add XContentType as parameter to HLRC ART#createServerTestInstance (#46036)

Add XContentType as parameter to the
AbstractResponseTestCase#createServerTestInstance method.

If a server-side response class serializes xcontent as
bytes, then the test needs to know what xcontent type was randomly selected.

This change is needed in #45970

* Fix rollover alias in SLM history index template (#46001)

This commit adds the `rollover_alias` setting required for ILM to work
correctly to the SLM history index template and adds assertions to the
SLM integration tests to ensure that it works correctly.

* Handle no-op document level failures (#46083)

Today we assume that document failures can not occur for no-ops. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.

* Remove plugins dir reference from docs (#46047)

While the plugin installation directory used to be settable, it has not
been so for several major versions. This commit removes a lingering
reference to the plugins directory in upgrade docs.

closes #45889

* Fix rest-api-spec dep for external plugins (#45949)

This commit fixes the maven coordinates for the rest-api-spec jar. It
was accidentally changed by #45107.

closes #45891

* Use float instead of double for query vectors. (#46004)

Currently, when using script_score functions like cosineSimilarity, the query
vector is treated as an array of doubles. Since the stored document vectors use
floats, it seems like the least surprising behavior for the query vectors to
also be float arrays.

In addition to improving consistency, this change may help with some
optimizations we have been considering around vector dot product.
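
For context, a hedged example of a `script_score` query using such a vector function (index, field name and vector values are illustrative, and the exact function signature has varied across versions; this sketch assumes the doc-values form):

```
GET /my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['my_vector']) + 1.0",
        "params": { "query_vector": [0.12, -0.34, 0.56] }
      }
    }
  }
}
```

With this change, the `query_vector` values are read as floats rather than doubles, matching the precision of the stored document vectors.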

* Add Circle Processor (#43851)

add circle-processor that translates circles to polygons
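
A short sketch of an ingest pipeline using the new processor (pipeline id and field name are illustrative; this assumes the processor accepts `field`, `error_distance` and `shape_type` options):

```
PUT _ingest/pipeline/polygonize_circles
{
  "description": "translate circles to polygons",
  "processors": [
    {
      "circle": {
        "field": "circle",
        "error_distance": 28.0,
        "shape_type": "geo_shape"
      }
    }
  ]
}
```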

* [ML] Throw an error when a datafeed needs CCS but it is not enabled for the node (#46044)

Though we allow CCS within datafeeds, users could prevent nodes from accessing remote clusters. This can cause mysterious errors that are difficult to troubleshoot.

This commit adds a check to verify that `cluster.remote.connect` is enabled on the current node when a datafeed is configured with a remote index pattern.

* Muting org.elasticsearch.client.MachineLearningIT.testEstimateMemoryUsage (#46099)

* [DOCS] Adds search-related query parameters to the common parameters. (#46057)

@szabosteve Merging so I can make some additions. Will incorporate the comments from @jrodewig.

* Move netty numDirectArenas to jvm.options (#46104)

We currently configure io.netty.allocator.numDirectArenas to be 0 in the
JVM ergonomics class. This is a setting that we always want to apply, so it
makes sense to move it to jvm.options.

* Handle delete document level failures (#46100)

Today we assume that document failures can not occur for deletes. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.

* [DOCS] Reformats delete by query API (#46051)

* Reformats delete by query API

* Update docs/reference/docs/delete-by-query.asciidoc

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Updated common parms includes.

* Flush engine after big merge (#46066)

Today we might carry a big merge uncommitted and therefore
occupy a significant amount of disk space for quite a long time
if, for instance, indexing load goes down and we are not quickly
reaching the translog size threshold. This change triggers a
flush if we hit a significant merge (512MB by default), which
frees disk space sooner.
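
A sketch of tuning this threshold, assuming the behaviour is controlled by an `index.flush_after_merge` index setting (the setting name is an assumption here, and the index name is illustrative):

```
PUT /my-index
{
  "settings": {
    "index.flush_after_merge": "1gb"
  }
}
```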

* Docs _cat/health verification fix (#46064)

The _cat/health call in getting-started assumes that the master task max
wait time is always 0 (-); however, the test could sometimes run into a
short wait time (a few ms). Fixed the test to allow this.

* Do not throw an exception if the process finished quickly but without any error. (#46073)

* [DOCS] Reformats URI search request (#45844)

* [DOCS] Reformats URI search request.

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

Co-Authored-By: debadair <debadair@elastic.co>

* DOC: Update SQL docs for DbVis and Workbench/J (#45981)

Refresh the setup for the new versions of DbVisualizer and SQL
Workbench/J which have Elasticsearch JDBC support out of the box.

* Upgrade to Azure SDK 8.4.0 (#46094)

* Upgrading to 8.4.0, which brings bulk delete support to be used in a follow-up PR

* Use better matchers in AbstractSimpleTransportTestCase (#45899)

Convert most of the assertions to use Hamcrest matchers, as they give much
more context if an assertion fails.

* Refactor auditor-related classes (#45893)

* Unmute the test now that the fix for the underlying cause is merged in. (#46117)

* Replace MockAmazonS3 usage in S3BlobStoreRepositoryTests by a HTTP server (#46081)

This commit removes the usage of MockAmazonS3 in S3BlobStoreRepositoryTests
and replaces it with an HttpServer that emulates the S3 service. This allows the
repository tests to use the real Amazon S3 client under the hood and will
allow testing the behavior of the snapshot/restore feature for S3 repositories by
simulating random server-side internal errors.

The HTTP server used to emulate the S3 service is intentionally simple and minimal 
to keep things understandable and maintainable. Testing full client options on the 
server side (like authentication, chunked encoding etc) remains the responsibility 
of the AmazonS3Fixture.

* Avoid overshooting watermarks during relocation (#46079)

Today the `DiskThresholdDecider` attempts to account for already-relocating
shards when deciding how to allocate or relocate a shard. Its goal is to stop
relocating shards onto a node before that node exceeds the low watermark, and
to stop relocating shards away from a node as soon as the node drops below the
high watermark.

The decider handles multiple data paths by only accounting for relocating
shards that affect the appropriate data path. However, this mechanism does not
correctly account for _new_ relocating shards, which are unwittingly ignored.
This means that we may evict far too many shards from a node above the high
watermark, and may relocate far too many shards onto a node causing it to blow
right past the low watermark and potentially other watermarks too.

There are in fact two distinct issues that this PR fixes. New incoming shards
have an unknown data path until the `ClusterInfoService` refreshes its
statistics. New outgoing shards have a known data path, but we fail to account
for the change of the corresponding `ShardRouting` from `STARTED` to
`RELOCATING`, meaning that we fail to find the correct data path and treat the
path as unknown here too.

This PR also reworks the `MockDiskUsagesIT` test to avoid using fake data paths
for all shards. With the changes here, the data paths are handled in tests as
they are in production, except that their sizes are fake.

Fixes #45177

* AwaitsFix for #46124

* Revert "Use better matchers in AbstractSimpleTransportTestCase (#45899)"

This reverts commit 38cf581d360bdf50b1b1f1b21607887d8c91cf36.

* Revert "AwaitsFix for #46124"

This reverts commit 71ead7552df1fbdfab2c0e72015496f53b29ab20.

* [DOCS] [PUT DFA] Documents inline the child params of source and dest (#45649)

* [DOCS] [PUT DFA] Documents inline the child params of source and dest.

* [DOCS] Fixes indentation issues and amends dfa definitions.

* Only verify global checkpoint if translog sync occurred (#45980)

We only sync translog if the given offset hasn't synced yet. We can't
verify the global checkpoint from the latest translog checkpoint unless
a sync has occurred.

Closes #46065
Relates #45634

* Start testing against AdoptOpenJDK (#45666)

This commit adds AdoptOpenJDK to the testing matrix.

* [DOCS] Reformats analyze API (#45986)

* [DOCS] Add get index alias API docs (#46046)

* Validate SLM policy ids strictly (#45998)

This uses strict validation for SLM policy ids, similar to what we use
for index names.

Resolves #45997

* More Efficient Ordering of Shard Upload Execution (#42791)

* Change the upload order of snapshots to work file by file in parallel on the snapshot pool instead of merely shard-by-shard
* Inspired by #39657

* [DOCS] Correct custom analyzer callouts (#46030)

* Rename `data-science` plugin to `analytics` (#46092)

This renames the "data-science" plugin to "analytics".
Also removes the enabled flag.

* [DOCS] Separate add index alias API docs (#46086)

* [DOCS] Reformat update index aliases API docs (#46093)

* [ML] Regression dependent variable must be numeric (#46072)

* [ML] Regression dependent variable must be numeric

This adds a validation that the dependent variable of a regression
analysis must be numeric.

* Address review comments and fix some problems

In addition to addressing the review comments, this
commit fixes a few issues I found during testing.

In particular:

- if there were mappings for required fields but they were
not included we were not reporting the error
- if explicitly included fields had unsupported types we were
not reporting the error

Unfortunately, I couldn't get those fixed without refactoring
the code in `ExtractedFieldsDetector`.

* Ensure top docs optimization is fully disabled for queries with unbounded max scores. (#46105)

When a query contains a mandatory clause that doesn't track the max score per
block, we disable the max score optimization. Previously, we were doing this by
wrapping the collector with a FilterCollector that always returned
ScoreMode.COMPLETE.

However we weren't adjusting totalHitsThreshold, so the collector could still
call Scorer#setMinCompetitiveScore. It is against the method contract to call
setMinCompetitiveScore when the score mode is COMPLETE, and some scorers like
ReqOptSumScorer throw an error in this case.

This commit tries to disable the optimization by always setting
totalHitsThreshold to max int, as opposed to wrapping the collector.

* [DOCS] Add "index template exists" API docs (#46095)

* [DOCS] Add "delete index template" API docs (#46101)

* Remove classic similarity (#46078)

This commit removes the `classic` similarity from code and docs in master (8.0). The `classic` similarity cannot be used on indices created after 7.0.

Closes #46058

* Add package docs for bundled jdk location (#46153)

This commit expands the documented directory layout of the rpm and deb
packages to include the bundled jdk.

closes #45150

* bump version (#46158)

* Set netty system properties in BuildPlugin (#45881)

Currently in production instances of Elasticsearch we set a couple of
system properties by default. We currently do not apply all of these
system properties in tests. This commit applies these properties in the
tests.

* Remove insecure settings (#46147)

This commit removes the oxymoron of insecure secure settings from the
code base. In particular, we remove the ability to set the access_key
and secret_key for S3 repositories inside the repository definition (in
the cluster state). Instead, these settings now must be in the
keystore. Thus, it also removes some leniency where these settings could
be placed in the elasticsearch.yml, would not be rejected there, but
would not be consumed for any purpose.

* Inject random errors in S3BlobStoreRepositoryTests (#46125)

This commit modifies the HTTP server used in S3BlobStoreRepositoryTests
so that it randomly returns server errors for any type of request executed by
the SDK client. It is now possible to verify that the repository tests complete
successfully even if one or more errors were returned by the S3
service in response to a blob upload, a blob deletion or an object listing request,
etc.

Because injecting errors forces the SDK client to retry requests, the test limits
the maximum number of errors sent per request to 3 retries.

* Forbid settings without a namespace (#45947)

This commit forbids settings that are not in any namespace, all setting
names must now contain a dot.

* Enhanced logging when transport is misconfigured to talk to HTTP port (#45964)

If a node is misconfigured to talk to a remote node's HTTP port (instead of
the transport port), it will eventually receive an HTTP response from the
remote node on the transport connection (this happens when a node
accidentally sends a line-terminating byte in a transport request).
If this happens today, it results in an unfriendly log message and a
long stack trace.
This commit adds a check for whether a malformed response is an HTTP
response; in that case, a concise log message appears instead.

* Fix wrong URL encoding in watcher HTTP client (#45894)

The test assumption was calling the wrong method, resulting in URL
encoding being applied before the data was returned.

Closes #44970

* Fix translog stats in testPrepareIndexForPeerRecovery (#46137)

When recovering a shard locally, we use a translog snapshot from
newSnapshotFromGen which consists of all readers from a certain
generation. In the test, we use newSnapshotFromMinSeqNo for the
expectation. The snapshot of this method includes only readers
containing operations in the requested range.

Closes #46022

* Make Snapshot Logic Write Metadata after Segments (#45689)

* Write metadata during snapshot finalization after segment files to prevent outdated metadata in case of dynamic mapping updates as explained in #41581
* Keep the old behavior of writing the metadata beforehand in the case of mixed version clusters for BwC reasons
   * Still overwrite the metadata in the end, so even a mixed version cluster is fixed by this change if a newer version master does the finalization
* Fixes #41581

* [TEST] Mute PinnedQueryBuilderIT.testPinnedPromotions (#46175)

Relates #46174

* Move plugin.mandatory to installing plugins docs

This commit moves the plugin.mandatory settings from the plugin
directory page in the docs to the installing plugins page in the docs.

* Move plugin.mandatory to its own page

This commit takes the reworking of plugin.mandatory docs even farther by
taking this setting to its own page.

* Add test tasks for unpooled and direct buffer pooling to netty (#46049)

Some netty behavior is controlled by system properties. While we want to
test with the defaults for Elasticsearch for most tests, within netty we
want to ensure these netty settings exhibit correct behavior. This
commit adds variants of test and integTest tasks for netty which set the
unpooled and direct buffer pooled allocators.

relates #45881

* Stabilize SLM REST Tests (#46195)

Unfortunately, #42791 destabilized SLM tests because those tests limit
the snapshot write rate to a very low value globally.
Now that the various files in a snapshot get uploaded in parallel,
a few threads in parallel can vastly overshoot the low
throughput value used by the rate limiter and then make it
wait for minutes, which times out the tests that then try to abort
the snapshot (see #21759 for details; aborting a snapshot only
happens when writing bytes to the repository).

For now the old behavior of the test from before my changes can
be restored by moving to a single threaded snapshot pool but
we should find a better way of testing the SLM behaviour here in
a follow-up.

* Clarify default behavior of auto_create_index (#46134)

Be specific about the default behaviour of `action.auto_create_index` when a list is given.
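
For example, the setting can be updated dynamically with a list of patterns; a hedged sketch (the patterns are illustrative):

```
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": "logs-*,metrics-*,-*"
  }
}
```

With a list like this, patterns are evaluated in order, so only indices matching an allowed pattern are auto-created and the trailing `-*` rejects everything else.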

* Mute SnapshotLifeCycleIT (#46207)

Relates #46205

* Remove Unused Method from BlobStoreRepository (#46204)

This method isn't used anymore and I forgot to delete it.

* Allow ingest processors access to node client. (#46077)

This is the first PR that merges changes made to server module from
the enrich branch (see #32789) into the master branch.

The plan is to merge changes made to the server module separately from
the pr that will merge enrich into master, so that these changes can
be reviewed in isolation.

* SQL: Fix issue with DataType for CASE with NULL (#46173)

Previously, if the DataType of all the WHEN conditions of a CASE
statement was NULL, then it was set to NULL even if the ELSE clause
had a non-NULL data type, e.g.:
```
CASE WHEN a = 1 THEN NULL
           WHEN a = 5 THEN NULL
ELSE 'foo'
```

Fixes: #46032

* Mute 2 tests in S3BlobStoreRepositoryTests (#46221)

Muted testSnapshotAndRestore and testMultipleSnapshotAndRollback

Relates #46218 and #46219

* Cleanup BlobStoreRepository Abort and Failure Handling (#46208)

Aborts and failures were handled in a somewhat unfortunate way in #42791:
Since the tasks for all files are generated before uploading, they are all executed when a snapshot is aborted, leading to a massive number of failures added to the original aborted exception.
The handling of failures was not very reasonable either. If one blob failed to upload, the snapshot logic would upload all the remaining files as well and then fail (when previously it would just fail all following files).
I fixed both of the above issues by short-circuiting all remaining tasks for a shard in case of an exception in any one upload.

* Test fix for PinnedQueryBuilderIT (#46187)

Fix test issue to stabilise scoring through use of DFS search mode.
Randomised index-then-delete docs introduced by the test framework likely caused an imbalance in IDF scores across shards. Also made number of shards used in test a random number for added test coverage.

Closes #46174

* Wait for all Rec. to Stop on Node Close (#46178)

* Wait for all Rec. to Stop on Node Close

* This issue is in `RecoverySourceHandler#acquireStore`. If we submit the store release to the generic threadpool while it is getting shut down, we never complete the future we wait on (in the generic pool as well) and potentially fail to ever release the store.
* Fixed by waiting for all recoveries to end on node close so that we always have a healthy thread pool here
* Closes #45956

*  Disable request throttling in S3BlobStoreRepositoryTests (#46226)

When some high values are randomly picked up - for example the number
of indices to snapshot or the number of snapshots to create - the tests in
S3BlobStoreRepositoryTests can generate a high number of requests to
the internal S3 server.

In order to test the retry logic of the S3 client, the internal server is
designed to randomly return server errors. When many requests are made,
it is possible that the S3 client reaches its maximum number of
successive retries. Then the S3 client will stop retrying requests until
enough retry attempts succeed, but it means that any request could fail
before reaching the max retries count and make the test fail too.

Closes #46217
Closes #46218
Closes #46219

* Sync translog without lock when trim unreferenced readers (#46203)

With this change, we can avoid blocking writing threads when trimming
unreferenced readers; hence improving the translog writing performance
in async durability mode.

Close #46201

* Add debug assertions for userhome not existing (#46206)

The elasticsearch user should not have a homedir, yet we have seen this
particular test fail rather frequently with a failed check that the
userhome does not exist. This commit adds some additional assertions on
the presumptive userhome to narrow down where it might be created.

relates #45903

* Remove duplicate line in SearchAfterBuilder (#45994)

* reset queryGeometry in ShapeQueryTests (#45974)

* [ML-DataFrame] Fix off-by-one error in checkpoint operations_behind (#46235)

Fixes a problem where operations_behind would be one less than
expected per shard in a new index matched by the data frame
transform source pattern.

For example, if a data frame transform had a source of foo*
and a new index foo-new was created with 2 shards and 7 documents
indexed in it then operations_behind would be 5 prior to this
change.

The problem was that an empty index has a global checkpoint
number of -1 and the sequence number of the first document that
is indexed into an index is 0, not 1.  This doesn't matter for
indices included in both the last and next checkpoints, as the
off-by-one errors cancelled, but for a new index it affected
the observed result.

* Fixed synchronizing REST API inflight breaker names with internal variable (#40878)

The internal configuration setting is named like this: network.breaker.inflight_requests.
But the exposed REST API used the name with an underscore, like this: network.breaker.in_flight_requests.
This is now corrected to the form without the underscore: network.breaker.inflight_requests.

* [DOCS] Add delete index alias API docs (#46080)

* [ML][Transforms] fixing stop on changes check bug (#46162)

* [ML][Transforms] fixing stop on changes check bug

* Adding new method finishAndCheckState to cover race conditions in early terminations

* changing stopping conditions in `onStart`

* allow indexer to finish when exiting early

* Fix testSyncFailsIfOperationIsInFlight (#46269)

testSyncFailsIfOperationIsInFlight could fail due to the index request
spawning a GCP sync (new since 7.4). The test now waits for it to finish
before testing that the flushed sync fails.

* [ML] Unmute testStopOutlierDetectionWithEnoughDocumentsToScroll (#46271)

The test seems to have been failing due to a race condition between
stopping the task and refreshing the destination index. In particular,
we were going forward with refreshing the destination index even
though the task stopped in the meantime. This was fixed in
request.

Closes #43960

* [ML][Transforms] protecting doSaveState with optimistic concurrency (#46156)

* [ML][Transforms] protecting doSaveState with optimistic concurrency

* task code cleanup

* Suppress warning from background sync on relocated primary (#46247)

If a primary is being relocated, then the global checkpoint and
retention lease background sync can emit unnecessary warning logs.
This side effect was introduced in #42241.

Relates #40800
Relates #42241

* Add CumulativeCard pipeline agg to pipeline index (#46279)

The Cumulative Cardinality docs weren't linked
from the pipeline index page

* Add more assertions and cleanup to setup passwords tests (#46289)

This commit is a followup to #46206 to continue debugging failures where an
elasticsearch homedir is being created. A couple more assertions are added,
as well as a final cleanup at the end of the test preceding the one
that fails.

* Multi-get requests should wait for search active (#46283)

When a shard has fallen search idle, and a non-realtime multi-get
request is executed, today such requests do not wait for the shard to
become search active and therefore such requests do not wait for a
refresh to see the latest changes to the index. This also prevents such
requests from triggering the shard as non-search idle, influencing the
behavior of scheduled refreshes. This commit addresses this by attaching
a listener to the shard search active state for multi-get requests. In
this way, when the next scheduled refresh is executed, the multi-get
request will then proceed.

* [ML][Transforms] fixing listener being called twice (#46284)

* Mute testRecoveryFromFailureOnTrimming

Tracked at #46267

* Move MockRespository into test framework (#46298)

This moves the `MockRespository` class into `test/framework/src/main` so
it can be used across all modules and plugins in tests.

* First round of optimizations for vector functions. (#46294)

This PR merges the `vectors-optimize-brute-force` feature branch, which makes
the following changes to how vector functions are computed:
* Precompute the L2 norm of each vector at indexing time. (#45390)
* Switch to ByteBuffer for vector encoding. (#45936)
* Decode vectors while computing the vector function. (#46103)
* Use an array instead of a List for the query vector. (#46155)
* Precompute the normalized query vector when using cosine similarity. (#46190)

Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>

* Initialize document subset bit set cache used for DLS (#46211)

This commit initializes DocumentSubsetBitsetCache even if DLS
is disabled. Previously it would throw a null pointer exception when querying
usage stats if we explicitly disabled DLS, as there would be no instance of
DocumentSubsetBitsetCache to query. It is okay to initialize an empty
DocumentSubsetBitsetCache, as the license enforcement would prevent usage of
the DLS feature, and accessing usage stats will not fail.

Closes #45147

* [ML-DataFrame] unmute tests for debugging purposes (#46121)

unmute testGetCheckpointStats

closes #45238

* SQL: Fix issue with IIF function when condition folds (#46290)

Previously, when the condition (1st argument) of the IIF function could
be evaluated (folded) to false, the `IfConditional` was eliminated, which
caused `IndexOutOfBoundsException` to be thrown when the `info()` and
`resolveType()` methods were called.

Fixes: #46268

* [DOCS] Reformats multi search API (#46256)

* [DOCS] Reformats multi search API.

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Remove stack trace logging in Security(Transport|Http)ExceptionHandler (#45966)

As per #45852 comment we no longer need to log stack-traces in
SecurityTransportExceptionHandler and SecurityHttpExceptionHandler even
if trace logging is enabled.

* [DOCS] Reformats request body search API (#46254)

* [DOCS] Reformats request body search API.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Reenable+Fix testMasterShutdownDuringFailedSnapshot (#46303)

Reenable this test since it was fixed by #45689 in production
code (specifically, the fact that we write the `snap-` blobs
without overwrite checks now).
This only required adding the assumed blocking on index file writes
to the test code for it to work properly again.

* Closes #25281

* DOCS Link to kib reference from es reference on PKI authn (#46260)

* Quote the task name in reproduction line printer (#46266)

Some task names contain `#`, for instance, which doesn't play well with some shells
(e.g. zsh)

* [DOCS] Reformats search shards API (#46240)

* [DOCS] Reformats search shards API
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Fix SearchService.createContext exception handling (#46258)

An exception from the DefaultSearchContext constructor could leak a
searcher, causing future issues like shard lock obtained exceptions. The
underlying cause of the exception in the constructor has been fixed, but
as a safety precaution we also fix the exception handling in
createContext.

Closes #45378

* Bwc testclusters all (#46265)

Convert all bwc projects to testclusters

* Adjacency_matrix aggregation optimisation. (#46257)

Avoid pre-allocating ((N * N) - N) / 2 “BitsIntersector” objects given N filters.
Most adjacency matrices will be sparse and we typically don’t need to allocate all of these objects; skipping them can save a lot of allocations when the number of filters is high.

Closes #46212
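
For reference, a minimal adjacency_matrix request of the kind this optimisation targets (index, filter names and field are illustrative):

```
GET /interactions/_search
{
  "size": 0,
  "aggs": {
    "who_talks_to_whom": {
      "adjacency_matrix": {
        "filters": {
          "alice": { "terms": { "accounts": ["alice"] } },
          "bob":   { "terms": { "accounts": ["bob"] } },
          "carol": { "terms": { "accounts": ["carol"] } }
        }
      }
    }
  }
}
```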

* [DOCS] Reformats search template and multi search template APIs (#46236)

* [DOCS] Reformats search template and multi search template APIs.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Improve documentation for X-Opaque-ID (#46167)

This field can be present in search slow logs and deprecation logs. The
docs describe how to enable this functionality and what to expect in the logs.
closes #44851

* [DOCS] Add "get index template" API docs (#46296)

* Do not send recovery requests with CancellableThreads (#46287)

Previously, we send recovery requests using CancellableThreads because
we send requests and wait for responses in a blocking manner. With async
recovery, we no longer need to do so. Moreover, if we fail to submit a
request, then we can release the Store using an interruptible thread
which can risk invalidating the node lock.

This PR is the first step to avoid forking when releasing the Store.

Relates #45409
Relates #46178

* Build: Enable testing without magic comments (#46180)

Previously we only turned on tests if we saw either `// CONSOLE` or
`// TEST`. These magic comments are difficult for the docs build to deal
with so it has moved away from using them where possible. We should
catch up. This adds another trigger to enable testing: marking a snippet
with the `console` language. It looks like this:

```
[source,console]
----
GET /
----
```

This saves a line which is nice, I guess. But it is more important to me
that this is consistent with the way the docs build works now.

Similarly this enables response testing when you mark a snippet with the
language `console-result`. That looks like:
```
[source,console-result]
----
{
  "result": "0.1"
}
----
```

`// TESTRESPONSE` is still available for situations like `// TEST`: when
the response isn't *in* the console-result language (like `_cat`) or
when you want to perform substitutions on the generated test.

Should unblock #46159.

* Docs for translog, history retention and flushing (#46245)

This commit updates the docs about translog retention and flushing to reflect
recent changes in how peer recoveries work. It also adds some docs to describe
how history is retained for replay using soft deletes and shard history
retention leases.

Relates #45473

* [DOCS] Reformat "put index template" API docs (#46297)

* Add test that get triggers shard search active (#46317)

This commit is a follow-up to a change that fixed that multi-get was not
triggering a shard to become search active. In that change, we added a
test that multi-get properly triggers a shard to become search
active. This commit is a follow-up to that change which adds a test for
the get case. While get is already handled correctly in production code,
there was not a test for it. This commit adds one. Additionally, we
factor all the search idle tests from IndexShardIT into a separate test
class, as an effort to keep related tests together instead of a single
large test class containing a jumble of tests, and also to keep test
classes smaller for better parallelization.

* Document support of OIDC Implicit flow in Kibana. (#45693)

* [DOCS] Replace "// CONSOLE" comments with [source,console] (#46159)

* [DOCS] Identify reloadable EC2 Discovery Plugin settings (#46102)

* [ML] testFullClusterRestart waiting for stable cluster (#46280)

* [ML] waiting for ml indices before waiting task assignment testFullClusterRestart

* waiting for a stable cluster after fullrestart

* removing unused imports

* [ML][Transforms] fixing rolling upgrade continuous transform test (#45823)

* [ML][Transforms] fixing rolling upgrade continuous transform test

* adjusting wait assert logic

* adjusting wait conditions

* muting test (#46343)

* Decouple shard allocation awareness from search and get requests (#45735)

With this commit, Elasticsearch will no longer prefer using shards in the same location
(with the same awareness attribute values) to process `_search` and `_get` requests.
Instead, adaptive replica selection (the default since 7.0) should route requests more efficiently
using the service time of prior inter-node communications. Clusters with big latencies between
nodes should switch to cross cluster replication to isolate nodes within the same zone.
Note that this change only targets 8.0 since it is considered breaking. However, a follow-up
PR should add an option to activate this behavior in 7.x in order to allow users to opt in early.

Closes #43453

* Revert "Sync translog without lock when trim unreferenced readers (#46203)"

Unfortunately, with this change, we won't clean up all unreferenced
generations when reopening. We assume that there's at most one
unreferenced generation when reopening translog. The previous
implementation guarantees this assumption by syncing translog every time
after we remove a translog reader. This change, however, only syncs
translog once after we have removed all unreferenced readers (can be
more than one) and breaks the assumption.

Closes #46267

This reverts commit fd8183ee51d7cf08d9def58a2ae027714beb60de.

* [DOCS] Identify reloadable S3 repository plugin settings (#46349)

* Unmute testRecoveryFromFailureOnTrimming

Tracked at #46267

* [DOCS] Identify reloadable GCS repository plugin settings (#46352)

* [DOCS] Synchs Watcher API titles with better HLRC titles (#46328)

* Add repository integration tests for Azure (#46263)

Similarly to what had been done for S3 (#46081) and GCS (#46255) 
this commit adds repository integration tests for Azure, based on an 
internal HTTP server instead of mocks.

* Replace mocked client in GCSBlobStoreRepositoryTests by HTTP server (#46255)

This commit removes the usage of MockGoogleCloudStoragePlugin in
GoogleCloudStorageBlobStoreRepositoryTests and replaces it with an
HttpServer that emulates the Storage service. This allows the repository
tests to use the real Google client under the hood and will allow
us to test the behavior of the snapshot/restore feature for GCS repositories
by simulating random server-side internal errors.

The HTTP server used to emulate the Storage service is intentionally simple 
and minimal to keep things understandable and maintainable. Testing full 
client options on the server side (like authentication, chunked encoding 
etc) remains the responsibility of the GoogleCloudStorageFixture.

* Mute failing SamlAuthenticationIT tests (#46369)

see #44410

* Enable Debug Logging for Master and Coordination Packages (#46363)

In order to track down #46091:
* Enables debug logging in REST tests for `master` and `coordination` packages
since we suspect that issues are caused by failed and then retried publications

* Quiet down shard lock failures (#46368)

These were actually never intended to be logged at the warning level but made visible by a refactoring in #19991, which introduced a new exception type but forgot to adapt some of the consumers of the exception.

* [ML][Transforms] allow executor to call start on started task (#46347)

* [DOCS] Reformat index segments API docs (#46345)

* [DOCS] Re-add versioning to put template docs (#46384)

Adds documentation for index template versioning
accidentally removed with #46297.

* [ML][Transforms] update supported aggs docs (#46388)

* Support geotile_grid aggregation in composite agg sources (#45810)

Adds support for `geotile_grid` as a source in composite aggs. 

Part of this change includes adding a new docFormat of `GEOTILE` that formats a hashed `long` value into a geotile formatting string `zoom/x/y`.
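
A hedged sketch of the new composite source (index, source and field names are illustrative):

```
GET /museums/_search
{
  "size": 0,
  "aggs": {
    "tiles": {
      "composite": {
        "sources": [
          { "tile": { "geotile_grid": { "field": "location", "precision": 8 } } }
        ]
      }
    }
  }
}
```

Each bucket key for the `tile` source is then formatted as a `zoom/x/y` string.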

* Refactor AllocatedPersistentTask#init(), move rollup logic out of ctor (#46288)

This makes the AllocatedPersistentTask#init() method protected so that
implementing classes can perform their initialization logic there,
instead of the constructor.  Rollup's task is adjusted to use this
init method.

It also slightly refactors the methods to use a static logger in the
AllocatedTask instead of passing it in via an argument. This is
simpler, logged messages come from the task instead of the
service, and it is easier for tests.

* [DOCS] Update snippets in security APIs (#46191)

* [DOCS] Identify reloadable Azure repository plugin settings (#46358)

* [DOCS] Reformats Watcher APIs using template (#46152)

* Add docs on upgrading the keystore (#46331)

This commit adds a note to the docs regarding upgrading the keystore.

* [ML] Fixing instance serialization version for bwc (#46403)

* [DOCS] Reformat index stats API docs (#46322)

* Adjusting bwc serialization after backport (#46400)

* Clarify error message on keystore write permissions (#46321)

When the Elasticsearch process does not have write permissions to
upgrade the Elasticsearch keystore, we bail with an error message that
indicates there is a filesystem permissions problem. This commit
clarifies that error message by pointing out the directory where write
permissions are required, or that the user can also run the
elasticsearch-keystore upgrade command manually before starting the
Elasticsearch process. In this case, the upgrade would not be needed at
runtime, so the permissions would not be needed then.

* Revert "Refactor AllocatedPersistentTask#init(), move rollup logic out of ctor (#46288)"

This reverts commit d999942c6dfd931266d01db24d3fb26b29cf8f64.

* reuse mock client to avoid problems with thread context closed errors (#46398)

* [DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result] (#46295)

* [ML-DataFrame] improve error message for timeout case in stop (#46131)

improve error message if stopping of transform times out.

related #45610

* Fix usage of randomIntBetween() in testWriteBlobWithRetries (#46380)

This commit fixes the usage of randomIntBetween() in the test 
testWriteBlobWithRetries, when the test generates a random array  
of a single byte.

* cleanup static member

* Resolve the incorrect scroll_current when delete or close index (#45226)

Resolve the incorrect current scroll for deleted or closed index

* [ML] Extract DataFrameAnalyticsTask into its own class (#46402)

This refactors `DataFrameAnalyticsTask` into its own class.
The task has quite a lot of functionality now and I believe it would
make code more readable to have it live as its own class rather than
an inner class of the start action class.

* Mute CcrRollingUpgradeIT.testUniDirectionalIndexFollowing and testUniDirectionalIndexFollowing (#46429)

Relates #46416

* Mute SSLClientAuthTests.testThatHttpFailsWithoutSslClientAuth()

Tracked in #46230

* Add yet more logging around index creation (#46431)

Further investigation into #46091, expanding on #46363, to add even more
detailed logging around the retry behaviour during index creation.

* [Transform] simplify class structure of indexer (#46306)

simplify transform task and indexer

 - remove redundant transform id
 - moving client data frame indexer (and builder) into a separate file

* [ML] Tolerate total_search_time_ms not mapped in get datafeed stats (#46432)

ML users who upgrade from versions prior to 7.4 to 7.4 or later
will have ML results indices that do not have mappings for the
total_search_time_ms field.  Therefore, when searching these
indices we must tolerate this field not having a mapping.

Fixes #46437

* [DOCS] Adds progress parameter description to the GET stats data frame analytics API doc. (#46434)

* [DOCS] Resort common-parms (#46419)

* [DOCS] Change // CONSOLE comments to [source,console] (#46441)

* [DOCS] Add index alias definition to glossary (#46339)

* [Docs] Fix typo in field-names-field.asciidoc (#46430)

* [DOCS] [5 of 5] Change // TESTRESPONSE comments to [source,console-results] (#46449)

* [DOCS] Correct definition for `allow_no_indices` parameter (#46450)

* Increase REST-Test Client Timeout to 60s (#46455)

We are seeing requests take more than the default 30s,
which leads to requests being retried and returning
unexpected failures like e.g. "index already exists"
because the initial requests that timed out worked
out functionally anyway.
=> double the timeout to reduce the likelihood of
the failures described in #46091
=> As suggested in the issue, we should probably turn off
retrying altogether in a follow-up

* [DOCS] Remove cat request from Index Segments API requests (#46463)

* SQL: fix scripting for grouped by datetime functions (#46421)

* Fix issue with painless scripting not being correctly generated when
datetime functions are used for GROUPing of an INTERVAL operation.

* Use `null` schema response for `SYS TABLES` command. (#46386)

* Ignore replication for noop updates (#46458)

Previously, we ignored replication for noop updates because they did not
have sequence numbers. Since #44603, we started assigning sequence
numbers to noop updates, leading them to be replicated to replicas.

This bug occurs only on 8.0, as it requires #41065 and #44603.

Closes #46366

* Strengthen testUpdate in rolling upgrade

We hit a bug where we can't partially update documents created in a
mixed cluster between 5.x and 6.x. Although this bug does not affect
7.0 or later, we should have a good test that catches this issue.

Relates #46198