-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Improve SAML error handling by adding metadata #137598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve SAML error handling by adding metadata #137598
Conversation
|
Hi @piotrsulkowski-elastic, I've created a changelog YAML for you. |
|
Pinging @elastic/es-security (Team:Security) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances error handling for SAML authentication by introducing a more specific exception when the in-response-to field in a SAML response doesn't match expected request IDs. This improves client-side error handling capabilities when mismatches occur, such as when a user spends too long at the IdP site causing the original request to expire.
Key changes:
- Added a specialized exception method for unsolicited SAML responses with metadata containing the mismatched
in-response-tovalue - Updated error handling to use the new exception type instead of generic SAML exceptions
- Added test coverage for the unsolicited response scenario
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| SamlResponseHandler.java | Replaced generic SAML exception with specialized unsolicited response exception |
| SamlUtils.java | Added new samlUnsolicitedInResponseToException method with metadata support |
| SamlAuthenticatorTests.java | Added test case for unsolicited SAML responses and refactored token helper methods |
| 137598.yaml | Added changelog entry documenting the enhancement |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| samlContentInResponseTo, | ||
| expectedInResponseTos | ||
| ); | ||
| exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, samlContentInResponseTo); |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metadata is being added as a single string value, but the test expects a list. The call should be exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, Collections.singletonList(samlContentInResponseTo)); to match the test expectation at line 1429 in SamlAuthenticatorTests.java.
| exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, samlContentInResponseTo); | |
| exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, java.util.Collections.singletonList(samlContentInResponseTo)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the interface is
public void addMetadata(String key, String... values)
which apparently misguided copilot
277739c to
31b51c4
Compare
43ff4ea to
e583f5d
Compare
jfreden
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job with this PR! I have a couple of comments but the overall flow LGTM!
I think you're missing a changelog since this is an >enhancement, maybe the autogenerated one was overwritten during a force-push?
It would be good with a more descriptive title too: "Improve SAML Error Handling by Including additional Metadata" or something like that?
| } | ||
|
|
||
| /** | ||
| * Constructs exception for a specific case where the in-response-to value in the SAML content does not match any of the values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment states that the initial SAML request has expired but it's actually the cookie that holds the initial request id and the requested URL that has expired, the SAML request is still valid if the missing values are provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's what I tried to say. Thanks for feedback, I tried clarifying the message now to sound more explicit
| * matching the initial request, however that initial request is now gone, so the client sends an empty in-response-to parameter causing | ||
| * a mismatch between the two. | ||
| */ | ||
| public static ElasticsearchSecurityException samlUnsolicitedInResponseToException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this could be package-private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed. I initially matched the neighbouring methods, but package-private makes more sense
| samlContentInResponseTo, | ||
| expectedInResponseTos | ||
| ); | ||
| exception.addMetadata(SAML_UNSOLICITED_RESPONSE_KEY, samlContentInResponseTo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of the flow to fix this is:
If Kibana detects a SAML login failure for a login that looks like an IdP-initiated login (i.e., the intermediate cookie isn't present), it might retry a POST request, but from its own "SAML retry" page that's hosted on the same domain as Kibana and hence isn't subject to the same-site cookies limitations.
What is meant by "looks like an IdP-initiated login", is that the error message is "SAML content is in-response-to [{}] but expected one of {} ". With your new code they can get the metadata field and see if they indeed intended to send a SAMLResponse for a SP-initiated login from Kibana by matching the id to what was expected by ES and retry.
Is the only reason for this change to simplify getting the IDs? The data is already available in the error message per my understanding, but it's hacky to get it them from there so I think this makes more sense, I just want to make sure I understand it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I didn't think about the retry, but it makes sense! Where did you find that comment, btw? My guess was that they'll ask user to try logging in again, but be able to explain what went wrong to motivate the user to actually retry, and to provide better ux.
Anyway, if the client flow depends on the response type, then IMO it should be based on explicit metadata, rather than guessing from textual message, which can change at any point in time - hence I didn't even ask about the details of what exactly the new flow is going to be, just know that it's going to be differentiated. I'm happy to double check that though if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting the response id is one thing, but I think that the main reasons is actually knowing what caused the error without the need to parse the message, i.e. that it was a mismatch.
04b64c8 to
2c675bf
Compare
|
Hi @piotrsulkowski-elastic, I've created a changelog YAML for you. |
afe18db to
6195674
Compare
|
I was making end-to-end testing and discovered that it wasn't working the way I expected due to 2 factors:
Lastly, I added an integration test to check this behaviour (I found it appropriate given two non-trivial already existing problems). |
|
This changes the existing behavior and I think it's a backwards compatibility issue. For example consider this scenario:
Is it possible to instead:
|
|
Thanks! Correct me if I'm wrong, but I think that when the step 5 comes ('saml_realm_1 tries first, "xyz123" is not in its expected IDs so throws a terminating exception') then it would behave differently. It would look into token's in-response-to, which would be "xyz123" because it's encoded in the token, and it doesn't change while iterating over different realms. It would then look into allowedRequestIds, and again it would see "xyz123", because likewise, this is not changing when iterating over realms. So the in-response-to check would pass. Authentication would fail on matching realm, certificates, etc. (most likely even before it reaches the in-response-to validation), but that particular validation would pass when checked. It should only throw an exception when indeed the request is inherently inconsistent with itself, in which case it will fail for any realm - basically the token is broken. Similar situation would happen e.g. when the token cannot be unmarshalled, or it was missing content, etc. As opposed to whole other half of problems which are realm-specific, like e.g. invalid certificates, roles, metadata, assertions, etc. That was the reason why I concluded in the first place that this is not a breaking change. The only difference as perceived from the user should be the error message and metadata, in any scenario. At least that's my current understanding, but obviously I'm might have missed something. I like your suggestion of returning errors when they are accumulated from all realms. I think though, that in this particular case of a terminating exception (which so far I still believe this should be), it's more elegant and convenient to the client to return a single error rather than a list of different (or sometimes identical) errors, because on the ES side we know exactly which error stopped us from moving forwards, while the user would have to do a bit of guessing, when given a list. I really like the idea of whitelisting metadata which can be returned and I'm adding that right now. Let me know what you think about the previous points! |
jfreden
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, you're right, it's all supplied in the request so the concern of this validation causing early unexpected terminations isn't valid. Sorry for the confusion!
I liked Tim's suggestion of:
Can't we just say that the enhanced error message is only returned if a specific realm is targeted by the request?
So we only terminate the auth with the enhanced error message if a realm is supplied? That way we don't risk breaking any bwc, even though that seems highly unlikely. Maybe if the customer has a custom realm?
Code looks good, left some suggestions.
| @@ -0,0 +1,213 @@ | |||
| /* | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a cleaner separation of concerns would be nice here. Your new SamlInResponseToIT shouldn't have to call makeMetadataAvailable. Can all the unstable metadata code be moved back to that test and then have the base class make it "always available" as the default behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved those around, so that now the subclasses by default have the metadata available. The tricky part was that during cluster init, it already communicates with the idp trying to cache the metadata, so to work around that there is an extra step in the test rule.
I've also tried to move as many arbitrary strings to base class and it's methods (e.g. getRealmName), which was possible in most cases, but there is still some grey area around e.g. idp, and acs urls, as well as available realm numbers, which might or might not be in the base class too. Moving it in, doesn't help much with readibility, so for now I left those as they were.
| import static org.hamcrest.Matchers.hasKey; | ||
| import static org.hamcrest.Matchers.is; | ||
|
|
||
| public class SamlInResponseToIT extends SamlRestTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job with this test!
| assertThat(response, hasKey("access_token")); | ||
| } | ||
|
|
||
| public void testInResponseToMismatch() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is a great test but it would be easier to read if the different scenarios were split into their own methods. I also wonder if you could use some more randomization to cover more scenarios?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, moved!
Reg. randomization, I added random requestIds instead of arbitrary constant values. Do you have more ideas, e.g. some methods could be called, but not all, or something like that? They don't require setup each time, but each of them makes rest calls, they take around 200-300 ms (each test) on average on my machine.
| if (isSuccess(status) == false) { | ||
| throw samlException("SAML Response is not a 'success' response: {}", getStatusCodeMessage(status)); | ||
| // throws a terminating exception, as status must always be success | ||
| throw new SamlAuthenticationException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this terminating because it can never be successful on any realm? I wonder if we should really change the behaviour for other paths than the in-response-to? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is no longer relevant, I changed the behaviour as described in the thread (I'll repeat the current approach in the comment below)
| public static ElasticsearchSecurityException samlException(String msg, Object... args) { | ||
| final ElasticsearchSecurityException exception = new ElasticsearchSecurityException(msg, args); | ||
| exception.addMetadata(SAML_EXCEPTION_KEY); | ||
| static SamlAuthenticationException samlUnsolicitedInResponseToException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep the static factory method and have one samlException and one terminatingSamlException to keep it more explicit what we're doing. It can be difficult to spot the boolean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
2ffb30e to
b3f8be2
Compare
|
I changed the behaviour to that discussed on slack thread recently. With the current changes, that's what happens when any ElasticsearchSecurityException is thrown during saml authentication:
I believe that covers all of the requirements and problems that came up during recent discussions, but waiting for feedback. |
| logger.debug("Parsed token [{}] to attributes [{}]", token, attributes); | ||
| buildUser(attributes, ActionListener.releaseAfter(listener, attributes)); | ||
| } catch (ElasticsearchSecurityException e) { | ||
| if (SamlUtils.isSamlException(e)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the main change
| public SamlToken(byte[] content, @Nullable List<String> allowedSamlRequestIds, @Nullable String authenticatingRealm) { | ||
| this.content = content; | ||
| this.allowedSamlRequestIds = allowedSamlRequestIds; | ||
| this.allowedSamlRequestIds = allowedSamlRequestIds != null ? allowedSamlRequestIds : Collections.emptyList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without this, saml handling was throwing NPE's whenever it encountered null requestIds, which was then wrapped in a generic Unauthorized response, looking similarly to the expected outcome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvement!
| */ | ||
| public static ElasticsearchSecurityException samlException(String msg, Object... args) { | ||
| final ElasticsearchSecurityException exception = new ElasticsearchSecurityException(msg, args); | ||
| final ElasticsearchSecurityException exception = new ElasticsearchSecurityException(msg, RestStatus.UNAUTHORIZED, args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly, without this, the code handling exception in lower layers was throwing a new exception while it was validating and parsing this exception, hiding the original error
jfreden
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code looks great! Good job getting all the requirements in order, the result turned out really clean. Just have a discussion point.
| return token instanceof SamlToken; | ||
| } | ||
|
|
||
| private boolean isTokenForRealm(SamlToken samlToken) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the default value here confusing. I think it would be clearer to add the extra logic in the caller.
| if (SamlUtils.isSamlException(e)) { | ||
| listener.onResponse(AuthenticationResult.unsuccessful("Provided SAML response is not valid for realm " + this, e)); | ||
| } else { | ||
| if (isTokenForRealm(samlToken, false)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the slack conversation you mention the options are:
- always continue on saml exceptions, but collect the messages and return them to the client (that's a change from the current behaviour)
- terminate on non-saml exceptions - just like we do now.
My understanding is that you've implemented (1). With the reasons:
a. It's clearer to the client to receive a single error, rather than a collection of errors in the case when they provide an explicit realm name in the request.
b. Saves us a bit of cpu-effort iterating over all other realms without need.
c. There is no need to treat saml-exception in any other way than a normal ElasticsearchSecurityException (which we do currently).
(a) would be true either way since you have the isTokenForRealm regardless of which approach you take per my understanding? (b) is true but I don't think it should drive which approach we take, it's very minimal and the token-service will be used after initial auth. (c) seems to be true since no tests are failing, but there might have been some motivation behind only continuing on SAML exceptions?
I would keep the current logic, so something like:
| if (isTokenForRealm(samlToken, false)) { | |
| if (isTokenForRealm(samlToken, false) || SamlUtils.isSamlException(e) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
My intention was a bit different, sorry if I haven't described it clearly. It's as follows:
- If the realm name is provided, then terminate. Regardless if it's saml or non-saml exception. This is because when a realm name is provided, then there is no chance that any other realm would be applicable, unless that realm comes from a plugin and disregards that it's not applicable, which I'd consider a bug. That would be similar to a situation where token is for saml, but accidentally some jwt plugins accepts it - would we want to support that? I think that's a reasonable assumption that your saml implementation doesn't receive tokens which were designated specifically for a different realm.
- If the realm is not provided, then continue. Also regardless if it's saml, or non-saml. This is because, theoretically, whatever problem occurs (within reason, e.g. OOM exceptions don't count) it might not appear for other realms, which might have completely different logic handling token authentication, hence we should not rule out other realms.
The options (1) and (2) from your message both form a single solution which is alternative to my suggested approach (they are not what I currently implemented). Instead I implemented my suggested approach, as described above here. I did that specifically for the reasons listed in (a), (b) and (c). And you're totally right that (b) is least important and on it's own it shouldn't drive the decision.
Adding the logic gate that you suggested (i.e. adding || SamlUtils.isSamlException(e) == false) would mean that the exception metadata would currently not be returned to the client. If we wanted to go this way, that's the alternative solution that I mentioned, to collect all exceptions in the list and return the whole collection - something which would definitely work, but in my opinion was less elegant for the client than the currently proposed solution. Nevertheless, I'm happy to do it this way if you think it's better for some reason!
| public SamlToken(byte[] content, @Nullable List<String> allowedSamlRequestIds, @Nullable String authenticatingRealm) { | ||
| this.content = content; | ||
| this.allowedSamlRequestIds = allowedSamlRequestIds; | ||
| this.allowedSamlRequestIds = allowedSamlRequestIds != null ? allowedSamlRequestIds : Collections.emptyList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvement!
jfreden
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
249d034 to
cf5c816
Compare
bd70a35 to
c772096
Compare
in case when there is a mismatch between the expected request ID (that comes from a cookie) and the actual in-response-to field in the SAML content, return more specific error to allow for better error handling at the client closes elastic#128179
create a separate exception class for SAML-related errors (which previously was handled using metadata). Also, allow this exception to support TERMINATION/CONTINUE flag, which was previously not possible, and CONTINUE was implicitly assumed (which resulted in wiping all the exception metadata from the resulting transport and rest responses).
Creates a new integration test, which checks the handling of errors in various cases of in_response_to header vs saml token metadata mismatch. Extracts common test base for saml IT tests from the existing SamlServiceProviderMetadataIT
This reverts commit 6195674.
c772096 to
9971bc5
Compare
in case when there is a mismatch between the expected request ID (that comes from a cookie) and the actual in-response-to field in the SAML content, return more specific error to allow for better error handling at the client
closes #128179