-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not fail node if SAML HTTP metadata is unavailable #92810
Conversation
This commit changes the SAML realm to use placeholder metadata (`UnresolvedEntity`) when the real metadata cannot be loaded over HTTPS - if `metadata.http.lenient` is set to true. All future use of the realm will fail until the metadata is available, but this change allows the node to bootstrap successfully. Relates: elastic#37608
Hi @tvernum, I've created a changelog YAML for you. |
+1 to high level change to allow nodes to start if they can not download the metadata. I haven't dug too deep so maybe this is already handled...but I do think we need a retry strategy such that if the HTTP issue is transient it will resolve itself without any intervention. I think the 1 hour long default retry is too long in this case (and it is not obvious if the placeholder would get correctly updated), so we may want a hard coded shorter interval until the configured interval kicks in (i.e. with a default of 1 hour, retry every 1 minute during that hour). I don't see a need to expose this leniency behind a configuration option. If anything, I think the default should be reversed such that if set it will fail startup but only if it is configured to do so. Additionally, I think we should to emit a warn message if the http variant is used and the config is not set to fail fast to discourage usage of the http variant (mostly for trying to inform users setting this up for the first time to use a file). Also, is persisting this metadata in the .security index plausible/worth it such that we could use that as a fallback ? seemingly random things that break on restarts are the worst ! |
I agree. I haven't tested whether OpenSAML does what we want yet, but it might. If not we will need to configure the refresh internal lower until we get the first successful metadata.
I think I agree, but it's hard to know what people really want.
Probably. |
This is ready for review now. Mostly it's self explanatory, but there's a few bits that could use some additional details:
I still have 2 more tasks to do, but I'd like to go through a first round of reviews first:
|
Pinging @elastic/es-security (Team:Security) |
Hi @tvernum, I've updated the changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change are looking good. I like the placeholder and refresh strategy. A couple minor questions and will review the tests when complete.
public static final Setting.AffixSetting<TimeValue> IDP_METADATA_HTTP_MIN_REFRESH = Setting.affixKeySetting( | ||
RealmSettings.realmSettingPrefix(TYPE), | ||
IDP_METADATA_SETTING_PREFIX + "http.minimum_refresh", | ||
key -> Setting.timeSetting(key, TimeValue.timeValueMinutes(5), Setting.Property.NodeScope) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- should we also set a min value for this setting to prevent absurdly low values
- maybe we should we add setting validation that this value is < http.refresh rather than setting them equal at runtime, i.e. prefer strict validation vs. lenient behavior ? (it looks like the library has similar validation but haven't tested to see how that would manifest in ES .. i.e. if we want to be strict then that misconfig should prevent startup )
should this be dynamic such that you can bump it way down in uptime if needed ?... After more review, do I understand that dynamic is not needed to help with failed on startup since on each supplier.get() we will manually try to get the metadata -> if cached from prior scheduled operation that will return, else we will call refresh (only 1 outstanding manual call at a time) which will kick the start the process based on demand (not just relying on the schedule) ?- should we deprecate the
http.refresh
in favor ofhttp.max_refresh
and/or mirror the 4 hour default of the library ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also set a min value for this setting to prevent absurdly low values
We can. I'm not sure what would constitute absurdly low though. I guess "minutes" is sensible, "seconds" is acceptable and "milliseconds" is a problem. I'm not sure what the precise dividing line would be (my inclination is 5s)
maybe we should we add setting validation that this value is < http.refresh rather than setting them equal at runtime, i.e. prefer strict validation vs. lenient behavior
I considered that, but it means either a potential breaking change or a slightly more complex behaviour.
If an admin has set http.refresh
to 4 minutes and http.minimum_refresh
defaults to 5 minutes then the introduction of this new setting would prevent their node from starting.
To avoid that we need to only apply the validation if http.minimum_refresh
has an explicit value, and keep the current behaviour otherwise.
should this be dynamic?
It would be helpful if http.refresh
was dynamic, because if you knew that you had published updated (remote) metadata, you could set the refresh
time down to force it to be reloaded more quickly.
It's not so important to set http.minimum_refresh
dynamically because of the automatic reload behaviour you mention.
should we deprecate the http.refresh in favor of http.max_refresh?
We could. It's arguably more consistent, but my gut is that it's one of the annoying changes that isn't worth it.
I don't think having http.refresh
and http.minimum_refresh
is terribly confusing, even if it isn't consistent - in fact I expect most admins who need to fiddle with refresh times should just set http.refresh
and ignore http.minimum_refresh
anyway.
On balance, forcing every admin who has configured http.refresh
to go through the deprecation process doesn't seem justified just to make things a bit more consistent - it doesn't feel like it meets the necessary threshold to force that pain onto users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the precise dividing line would be (my inclination is 5s)
I ended up setting the minimum to 500ms. Mostly because I had a test that relied on setting it to a very low value and making sure it did a refresh in the background. If the limit was 5s then the test would take >5s to run (for no great reason).
500ms still stops people setting it to tiny values, and if they really want it to refresh twice every second, I guess they can.
Could you please help me understand the following behaviour?
What if the refresh at expiration time fails? Does OpenSAML fallback to refresh every 5 minutes? Also, what will be the return value of |
The underlying code in OpenSAML is complex with many separate execution paths. My best understanding is:
It should never cause us any problems - once we've loaded metadata we never lose it, though it could expire while unsuccessfully attempting to load new metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I left some comments, but do not need to look at the changes again. Thanks!
IDP_METADATA_SETTING_PREFIX + "http.fail_on_error", | ||
key -> Setting.boolSetting(key, false, Setting.Property.NodeScope) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this could be categorized as a breaking change. But thought I'd just mention for completeness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, @jakelandis and I discussed earlier. My first version reversed the default behavior and would fail unless you made it lenient. We agreed that it was better to change the behavior with an option to revert and that we didn't consider it breaking.
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/saml/SamlRealm.java
Show resolved
Hide resolved
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/saml/SamlRealm.java
Outdated
Show resolved
Hide resolved
if (config.hasSetting(IDP_METADATA_HTTP_MIN_REFRESH)) { | ||
throw new SettingsException( | ||
"the value ({}) for [{}] cannot be greater than the value ({}) for [{}]", | ||
minRefresh.getStringRep(), | ||
RealmSettings.getFullSettingKey(config, IDP_METADATA_HTTP_MIN_REFRESH), | ||
maxRefresh.getStringRep(), | ||
RealmSettings.getFullSettingKey(config, IDP_METADATA_HTTP_REFRESH) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It is also possible that user explicitly configures the max refresh to be lower than the default min refresh. In that case, this check won't be able to catch it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's intentional for BWC.
If someone has an existing config with .refresh
set to 4 minutes, then we don't want the node to fail during bootstrap, even though their refresh interval conflicts with the default for .minimum_refresh
throw SamlUtils.samlException("Cannot find metadata for entity [{}] in [{}]", entityId, sourceLocation); | ||
} else { | ||
logger.warn( | ||
"cannot load SAML metadata for [{}] from [{}]; SAML authentication for this realm will fail", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: might want to include realm name here. It can be derived by the entityId and url. But it is convenient for readers to just call it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, we don't know the realm name here and it's tricky to resolve that.
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/saml/SamlRealm.java
Show resolved
Hide resolved
* has been resolved (although if metadata is loaded from a local file we monitor it for changes anyway, so this refresh | ||
* is unlikely to have any benefit). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we check the resolver
object type and only perform refresh if it is a HTTPMetadataResolver
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, do we care enough about that to make the code more complex?
...gin/security/src/main/java/org/elasticsearch/xpack/security/authc/saml/UnresolvedEntity.java
Outdated
Show resolved
Hide resolved
...lugin/security/src/test/java/org/elasticsearch/xpack/security/authc/saml/SamlRealmTests.java
Outdated
Show resolved
Hide resolved
...lugin/security/src/test/java/org/elasticsearch/xpack/security/authc/saml/SamlRealmTests.java
Outdated
Show resolved
Hide resolved
I'm going to move ahead with docs + a QA test on this since it seems like we have consensus on the approach. |
Hi @tvernum, I've updated the changelog YAML for you. |
@elasticmachine run elasticsearch-ci/part-3 please |
@jakelandis, @ywangd I've added a QA test & docs for the new settings. Do either of you want to review those? |
...RestTest/java/org/elasticsearch/xpack/security/authc/saml/SamlServiceProviderMetadataIT.java
Outdated
Show resolved
Hide resolved
I had a brief look at the doc and test update. They look good to me. The way you setup the tests is quite intriguing. It will be a great reference when I need to juggle so much of SAML messages :) |
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (and really nice work on the integration test and setup!)
test/framework/src/main/java/org/elasticsearch/test/junit/RunnableTestRuleAdapter.java
Show resolved
Hide resolved
This commit changes the SAML realm to use placeholder metadata (UnresolvedEntity) when the real metadata cannot be loaded over HTTPS - unless metadata.http.fail_on_error is set to true. All future use of the realm will fail until the metadata is available, but this change allows the node to bootstrap successfully.
This commit changes the SAML realm to use placeholder metadata (UnresolvedEntity) when the real metadata cannot be loaded over HTTPS - unless metadata.http.fail_on_error is set to true. All future use of the realm will fail until the metadata is available, but this change allows the node to bootstrap successfully.
This commit changes the SAML realm to use placeholder metadata (UnresolvedEntity) when the real metadata cannot be loaded over HTTPS - unless metadata.http.fail_on_error is set to true. All future use of the realm will fail until the metadata is available, but this change allows the node to bootstrap successfully.
This commit changes the SAML realm to use placeholder metadata (UnresolvedEntity) when the real metadata cannot be loaded over HTTPS - unless metadata.http.fail_on_error is set to true. All future use of the realm will fail until the metadata is available, but this change allows the node to bootstrap successfully.
This commit changes the SAML realm to use placeholder metadata (
UnresolvedEntity
) when the real metadata cannot be loaded over HTTPS - unlessmetadata.http.fail_on_error
is set to true.All future use of the realm will fail until the metadata is available, but this change allows the node to bootstrap successfully.
Closes: #37608