Skip to content

Conversation

@piotrsulkowski-elastic
Copy link
Contributor

in case when there is a mismatch between the expected request ID (that comes from a cookie) and the actual in-response-to field in the SAML content, return more specific error to allow for better error handling at the client

closes #128179

@elasticsearchmachine elasticsearchmachine added v9.3.0 needs:triage Requires assignment of a team area label labels Nov 5, 2025
@piotrsulkowski-elastic piotrsulkowski-elastic added >enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) labels Nov 10, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @piotrsulkowski-elastic, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added Team:Security Meta label for security team and removed needs:triage Requires assignment of a team area label labels Nov 10, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@piotrsulkowski-elastic piotrsulkowski-elastic added needs:triage Requires assignment of a team area label and removed needs:triage Requires assignment of a team area label labels Nov 10, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances error handling for SAML authentication by introducing a more specific exception when the in-response-to field in a SAML response doesn't match expected request IDs. This improves client-side error handling capabilities when mismatches occur, such as when a user spends too long at the IdP site causing the original request to expire.

Key changes:

  • Added a specialized exception method for unsolicited SAML responses with metadata containing the mismatched in-response-to value
  • Updated error handling to use the new exception type instead of generic SAML exceptions
  • Added test coverage for the unsolicited response scenario

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
SamlResponseHandler.java Replaced generic SAML exception with specialized unsolicited response exception
SamlUtils.java Added new samlUnsolicitedInResponseToException method with metadata support
SamlAuthenticatorTests.java Added test case for unsolicited SAML responses and refactored token helper methods
137598.yaml Added changelog entry documenting the enhancement

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

samlContentInResponseTo,
expectedInResponseTos
);
exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, samlContentInResponseTo);
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metadata is being added as a single string value, but the test expects a list. The call should be exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, Collections.singletonList(samlContentInResponseTo)); to match the test expectation at line 1429 in SamlAuthenticatorTests.java.

Suggested change
exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, samlContentInResponseTo);
exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, java.util.Collections.singletonList(samlContentInResponseTo));

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the interface is
public void addMetadata(String key, String... values)
which apparently misguided copilot

Copy link
Contributor

@jfreden jfreden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job with this PR! I have a couple of comments but the overall flow LGTM!

I think you're missing a changelog since this is an >enhancement, maybe the autogenerated one was overwritten during a force-push?

It would be good with a more descriptive title too: "Improve SAML Error Handling by Including additional Metadata" or something like that?

}

/**
* Constructs exception for a specific case where the in-response-to value in the SAML content does not match any of the values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment states that the initial SAML request has expired but it's actually the cookie that holds the initial request id and the requested URL that has expired, the SAML request is still valid if the missing values are provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's what I tried to say. Thanks for feedback, I tried clarifying the message now to sound more explicit

* matching the initial request, however that initial request is now gone, so the client sends an empty in-response-to parameter causing
* a mismatch between the two.
*/
public static ElasticsearchSecurityException samlUnsolicitedInResponseToException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this could be package-private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed. I initially matched the neighbouring methods, but package-private makes more sense

samlContentInResponseTo,
expectedInResponseTos
);
exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, samlContentInResponseTo);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the flow to fix this is:

If Kibana detects a SAML login failure for a login that looks like an IdP-initiated login (i.e., the intermediate cookie isn't present), it might retry a POST request, but from its own "SAML retry" page that's hosted on the same domain as Kibana and hence isn't subject to the same-site cookies limitations.

What is meant by "looks like an IdP-initiated login", is that the error message is "SAML content is in-response-to [{}] but expected one of {} ". With your new code they can get the metadata field and see if they indeed intended to send a SAMLResponse for a SP-initiated login from Kibana by matching the id to what was expected by ES and retry.

Is the only reason for this change to simplify getting the IDs? The data is already available in the error message per my understanding, but it's hacky to get it them from there so I think this makes more sense, I just want to make sure I understand it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I didn't think about the retry, but it makes sense! Where did you find that comment, btw? My guess was that they'll ask user to try logging in again, but be able to explain what went wrong to motivate the user to actually retry, and to provide better ux.

Anyway, if the client flow depends on the response type, then IMO it should be based on explicit metadata, rather than guessing from textual message, which can change at any point in time - hence I didn't even ask about the details of what exactly the new flow is going to be, just know that it's going to be differentiated. I'm happy to double check that though if you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting the response id is one thing, but I think that the main reasons is actually knowing what caused the error without the need to parse the message, i.e. that it was a mismatch.

@piotrsulkowski-elastic piotrsulkowski-elastic changed the title in-response-to in saml error response Improve SAML error handling by adding metadata Nov 14, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @piotrsulkowski-elastic, I've created a changelog YAML for you.

@piotrsulkowski-elastic
Copy link
Contributor Author

I was making end-to-end testing and discovered that it wasn't working the way I expected due to 2 factors:

  • there was a null pointer exception thrown when parsing mismatched in-response-to headers - that got an easy fix (but took me a while to find it, because NPE was caught in one of the listeners and transformed into generic unauthorized exception, which looked like intentional flow)

  • it wasn't working still, because saml exceptions are gathered in a loop, iterating over all saml realms. Once all exceptions are collected, they are written to log, but the final exception is generic with all metadata excluded. I propose here a solution where the saml exception can be either terminating, or not. The terminating exceptions break the iteration over realms, and are propagated straight away to the transport and then rest responses with all metadata. The non-terminating ones behave just as they used to be before. That gives us more control and allows to capture both use cases, but unfortunately required some refactoring of code, which I appreciate may be discussible - please let me know your opinions on this approach.

Lastly, I added an integration test to check this behaviour (I found it appropriate given two non-trivial already existing problems).

@jfreden
Copy link
Contributor

jfreden commented Nov 24, 2025

This changes the existing behavior and I think it's a backwards compatibility issue. For example consider this scenario:

  1. saml_realm_1 configured with IdP-A
  2. saml_realm_2 configured with IdP-B
  3. User authenticates with IdP-B, getting in-response-to="xyz123"
  4. Token comes in with realm=null and allowedRequestIds=["xyz123"]
  5. saml_realm_1 tries first, "xyz123" is not in its expected IDs so throws a terminating exception
  6. Realm iteration stops, saml_realm_2 never gets tried
  7. Authentication fails even though saml_realm_2 would have succeeded

Is it possible to instead:

  1. Change the unsuccessful auth result method to accept metadata and populate es.security.saml.
  2. At the end of authentication when all errors are iterated in consumeNullUser introduce an allowlist of metadata keys to keep then populate the metadata with that data by realm id.

@piotrsulkowski-elastic
Copy link
Contributor Author

Thanks!

Correct me if I'm wrong, but I think that when the step 5 comes ('saml_realm_1 tries first, "xyz123" is not in its expected IDs so throws a terminating exception') then it would behave differently. It would look into token's in-response-to, which would be "xyz123" because it's encoded in the token, and it doesn't change while iterating over different realms. It would then look into allowedRequestIds, and again it would see "xyz123", because likewise, this is not changing when iterating over realms. So the in-response-to check would pass. Authentication would fail on matching realm, certificates, etc. (most likely even before it reaches the in-response-to validation), but that particular validation would pass when checked. It should only throw an exception when indeed the request is inherently inconsistent with itself, in which case it will fail for any realm - basically the token is broken. Similar situation would happen e.g. when the token cannot be unmarshalled, or it was missing content, etc. As opposed to whole other half of problems which are realm-specific, like e.g. invalid certificates, roles, metadata, assertions, etc. That was the reason why I concluded in the first place that this is not a breaking change. The only difference as perceived from the user should be the error message and metadata, in any scenario. At least that's my current understanding, but obviously I'm might have missed something.

I like your suggestion of returning errors when they are accumulated from all realms. I think though, that in this particular case of a terminating exception (which so far I still believe this should be), it's more elegant and convenient to the client to return a single error rather than a list of different (or sometimes identical) errors, because on the ES side we know exactly which error stopped us from moving forwards, while the user would have to do a bit of guessing, when given a list.

I really like the idea of whitelisting metadata which can be returned and I'm adding that right now. Let me know what you think about the previous points!

Copy link
Contributor

@jfreden jfreden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you're right, it's all supplied in the request so the concern of this validation causing early unexpected terminations isn't valid. Sorry for the confusion!

I liked Tim's suggestion of:

Can't we just say that the enhanced error message is only returned if a specific realm is targeted by the request?

So we only terminate the auth with the enhanced error message if a realm is supplied? That way we don't risk breaking any bwc, even though that seems highly unlikely. Maybe if the customer has a custom realm?

Code looks good, left some suggestions.

@@ -0,0 +1,213 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a cleaner separation of concerns would be nice here. Your new SamlInResponseToIT shouldn't have to call makeMetadataAvailable. Can all the unstable metadata code be moved back to that test and then have the base class make it "always available" as the default behaviour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved those around, so that now the subclasses by default have the metadata available. The tricky part was that during cluster init, it already communicates with the idp trying to cache the metadata, so to work around that there is an extra step in the test rule.

I've also tried to move as many arbitrary strings to base class and it's methods (e.g. getRealmName), which was possible in most cases, but there is still some grey area around e.g. idp, and acs urls, as well as available realm numbers, which might or might not be in the base class too. Moving it in, doesn't help much with readibility, so for now I left those as they were.

import static org.hamcrest.Matchers.hasKey;
import static org.hamcrest.Matchers.is;

public class SamlInResponseToIT extends SamlRestTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job with this test!

assertThat(response, hasKey("access_token"));
}

public void testInResponseToMismatch() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is a great test but it would be easier to read if the different scenarios were split into their own methods. I also wonder if you could use some more randomization to cover more scenarios?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, moved!

Reg. randomization, I added random requestIds instead of arbitrary constant values. Do you have more ideas, e.g. some methods could be called, but not all, or something like that? They don't require setup each time, but each of them makes rest calls, they take around 200-300 ms (each test) on average on my machine.

if (isSuccess(status) == false) {
throw samlException("SAML Response is not a 'success' response: {}", getStatusCodeMessage(status));
// throws a terminating exception, as status must always be success
throw new SamlAuthenticationException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this terminating because it can never be successful on any realm? I wonder if we should really change the behaviour for other paths than the in-response-to? WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is no longer relevant, I changed the behaviour as described in the thread (I'll repeat the current approach in the comment below)

public static ElasticsearchSecurityException samlException(String msg, Object... args) {
final ElasticsearchSecurityException exception = new ElasticsearchSecurityException(msg, args);
exception.addMetadata(SAML_EXCEPTION_KEY);
static SamlAuthenticationException samlUnsolicitedInResponseToException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the static factory method and have one samlException and one terminatingSamlException to keep it more explicit what we're doing. It can be difficult to spot the boolean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@piotrsulkowski-elastic
Copy link
Contributor Author

I changed the behaviour to that discussed on slack thread recently. With the current changes, that's what happens when any ElasticsearchSecurityException is thrown during saml authentication:

  • if the realm name is explicitly specified, then make it a terminating exception, because no other realm can be applicable, as names are guaranteed to be unique;
  • if the realm name is not specifically provided (i.e. null as a parameter), then the exception is treated as not-terminating, it's message added to collection, then at the end logged by elasticsearch, but the returned REST response has only generic information without per-realm error details.

I believe that covers all of the requirements and problems that came up during recent discussions, but waiting for feedback.

logger.debug("Parsed token [{}] to attributes [{}]", token, attributes);
buildUser(attributes, ActionListener.releaseAfter(listener, attributes));
} catch (ElasticsearchSecurityException e) {
if (SamlUtils.isSamlException(e)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the main change

public SamlToken(byte[] content, @Nullable List<String> allowedSamlRequestIds, @Nullable String authenticatingRealm) {
this.content = content;
this.allowedSamlRequestIds = allowedSamlRequestIds;
this.allowedSamlRequestIds = allowedSamlRequestIds != null ? allowedSamlRequestIds : Collections.emptyList();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this, saml handling was throwing NPE's whenever it encountered null requestIds, which was then wrapped in a generic Unauthorized response, looking similarly to the expected outcome

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement!

*/
public static ElasticsearchSecurityException samlException(String msg, Object... args) {
final ElasticsearchSecurityException exception = new ElasticsearchSecurityException(msg, args);
final ElasticsearchSecurityException exception = new ElasticsearchSecurityException(msg, RestStatus.UNAUTHORIZED, args);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, without this, the code handling exception in lower layers was throwing a new exception while it was validating and parsing this exception, hiding the original error

Copy link
Contributor

@jfreden jfreden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code looks great! Good job getting all the requirements in order, the result turned out really clean. Just have a discussion point.

return token instanceof SamlToken;
}

private boolean isTokenForRealm(SamlToken samlToken) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the default value here confusing. I think it would be clearer to add the extra logic in the caller.

if (SamlUtils.isSamlException(e)) {
listener.onResponse(AuthenticationResult.unsuccessful("Provided SAML response is not valid for realm " + this, e));
} else {
if (isTokenForRealm(samlToken, false)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the slack conversation you mention the options are:

  1. always continue on saml exceptions, but collect the messages and return them to the client (that's a change from the current behaviour)
  2. terminate on non-saml exceptions - just like we do now.

My understanding is that you've implemented (1). With the reasons:
a. It's clearer to the client to receive a single error, rather than a collection of errors in the case when they provide an explicit realm name in the request.
b. Saves us a bit of cpu-effort iterating over all other realms without need.
c. There is no need to treat saml-exception in any other way than a normal ElasticsearchSecurityException (which we do currently).

(a) would be true either way since you have the isTokenForRealm regardless of which approach you take per my understanding? (b) is true but I don't think it should drive which approach we take, it's very minimal and the token-service will be used after initial auth. (c) seems to be true since no tests are failing, but there might have been some motivation behind only continuing on SAML exceptions?

I would keep the current logic, so something like:

Suggested change
if (isTokenForRealm(samlToken, false)) {
if (isTokenForRealm(samlToken, false) || SamlUtils.isSamlException(e) == false) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!
My intention was a bit different, sorry if I haven't described it clearly. It's as follows:

  1. If the realm name is provided, then terminate. Regardless if it's saml or non-saml exception. This is because when a realm name is provided, then there is no chance that any other realm would be applicable, unless that realm comes from a plugin and disregards that it's not applicable, which I'd consider a bug. That would be similar to a situation where token is for saml, but accidentally some jwt plugins accepts it - would we want to support that? I think that's a reasonable assumption that your saml implementation doesn't receive tokens which were designated specifically for a different realm.
  2. If the realm is not provided, then continue. Also regardless if it's saml, or non-saml. This is because, theoretically, whatever problem occurs (within reason, e.g. OOM exceptions don't count) it might not appear for other realms, which might have completely different logic handling token authentication, hence we should not rule out other realms.

The options (1) and (2) from your message both form a single solution which is alternative to my suggested approach (they are not what I currently implemented). Instead I implemented my suggested approach, as described above here. I did that specifically for the reasons listed in (a), (b) and (c). And you're totally right that (b) is least important and on it's own it shouldn't drive the decision.

Adding the logic gate that you suggested (i.e. adding || SamlUtils.isSamlException(e) == false) would mean that the exception metadata would currently not be returned to the client. If we wanted to go this way, that's the alternative solution that I mentioned, to collect all exceptions in the list and return the whole collection - something which would definitely work, but in my opinion was less elegant for the client than the currently proposed solution. Nevertheless, I'm happy to do it this way if you think it's better for some reason!

public SamlToken(byte[] content, @Nullable List<String> allowedSamlRequestIds, @Nullable String authenticatingRealm) {
this.content = content;
this.allowedSamlRequestIds = allowedSamlRequestIds;
this.allowedSamlRequestIds = allowedSamlRequestIds != null ? allowedSamlRequestIds : Collections.emptyList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement!

Copy link
Contributor

@jfreden jfreden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@piotrsulkowski-elastic piotrsulkowski-elastic force-pushed the saml-in-response-to-error branch 6 times, most recently from 249d034 to cf5c816 Compare December 3, 2025 02:08
@piotrsulkowski-elastic piotrsulkowski-elastic enabled auto-merge (squash) December 3, 2025 02:42
@piotrsulkowski-elastic piotrsulkowski-elastic force-pushed the saml-in-response-to-error branch 3 times, most recently from bd70a35 to c772096 Compare December 3, 2025 10:07
in case when there is a mismatch between the expected request ID
(that comes from a cookie) and the actual in-response-to field in the
SAML content, return more specific error to allow for better error
handling at the client

closes elastic#128179
create a separate exception class for SAML-related errors (which
previously was handled using metadata). Also, allow this exception
to support TERMINATION/CONTINUE flag, which was previously not possible,
and CONTINUE was implicitly assumed (which resulted in wiping
all the exception metadata from the resulting transport and
rest responses).
Creates a new integration test, which checks the handling of errors in
various cases of in_response_to header vs saml token metadata mismatch.
Extracts common test base for saml IT tests from the existing
SamlServiceProviderMetadataIT
@piotrsulkowski-elastic piotrsulkowski-elastic merged commit f1641e3 into elastic:main Dec 3, 2025
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:Security Meta label for security team v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support additional SAML flow failure information

3 participants