Do not fail node if SAML HTTP metadata is unavailable #92810

Merged
merged 24 commits into from
Feb 16, 2023

Conversation

tvernum
Contributor

@tvernum tvernum commented Jan 11, 2023

This commit changes the SAML realm to use placeholder metadata (UnresolvedEntity) when the real metadata cannot be loaded over HTTPS - unless metadata.http.fail_on_error is set to true.

All future use of the realm will fail until the metadata is available, but this change allows the node to bootstrap successfully.
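
The placeholder idea can be illustrated with a minimal, self-contained Java sketch. All names here (`IdpEntity`, `ResolvedEntity`, `UnresolvedPlaceholder`, `PlaceholderDemo.loadOrPlaceholder`) are hypothetical and are not the classes in this PR; the real `UnresolvedEntity` works against OpenSAML types that are not reproduced here.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the metadata entity a SAML realm depends on. */
interface IdpEntity {
    String singleSignOnUrl();
}

/** An entity backed by metadata that was actually loaded. */
final class ResolvedEntity implements IdpEntity {
    private final String ssoUrl;

    ResolvedEntity(String ssoUrl) {
        this.ssoUrl = ssoUrl;
    }

    @Override
    public String singleSignOnUrl() {
        return ssoUrl;
    }
}

/** Placeholder for unavailable metadata: creating it succeeds, using it fails. */
final class UnresolvedPlaceholder implements IdpEntity {
    private final String entityId;
    private final String source;

    UnresolvedPlaceholder(String entityId, String source) {
        this.entityId = entityId;
        this.source = source;
    }

    @Override
    public String singleSignOnUrl() {
        throw new IllegalStateException("metadata for [" + entityId + "] from [" + source + "] is not available");
    }
}

final class PlaceholderDemo {
    /** Returns the loaded entity; on failure returns a placeholder unless failOnError is set. */
    static IdpEntity loadOrPlaceholder(Supplier<IdpEntity> loader, String entityId, String source, boolean failOnError) {
        try {
            return loader.get();
        } catch (RuntimeException e) {
            if (failOnError) {
                throw e; // strict mode: the caller (node startup) sees the failure
            }
            return new UnresolvedPlaceholder(entityId, source);
        }
    }

    public static void main(String[] args) {
        IdpEntity entity = loadOrPlaceholder(() -> {
            throw new RuntimeException("connection refused");
        }, "https://idp.example/", "https://idp.example/metadata", false);
        // Startup got this far; the failure only surfaces when the realm is used.
        try {
            entity.singleSignOnUrl();
        } catch (IllegalStateException e) {
            System.out.println("realm unusable: " + e.getMessage());
        }
    }
}
```

The design choice this sketch highlights is that callers never have to check for `null` or an empty `Optional`; a broken realm fails loudly at the point of use instead.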

Closes: #37608

This commit changes the SAML realm to use placeholder metadata
(`UnresolvedEntity`) when the real metadata cannot be loaded over
HTTPS - if `metadata.http.lenient` is set to true.

All future use of the realm will fail until the metadata is available,
but this change allows the node to bootstrap successfully.

Relates: elastic#37608
@tvernum tvernum added >enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) labels Jan 11, 2023
@elasticsearchmachine
Collaborator

Hi @tvernum, I've created a changelog YAML for you.

@jakelandis
Contributor

+1 to the high-level change to allow nodes to start if they cannot download the metadata.

I haven't dug too deep so maybe this is already handled...but I do think we need a retry strategy such that if the HTTP issue is transient it will resolve itself without any intervention. I think the 1 hour long default retry is too long in this case (and it is not obvious if the placeholder would get correctly updated), so we may want a hard coded shorter interval until the configured interval kicks in (i.e. with a default of 1 hour, retry every 1 minute during that hour).

I don't see a need to expose this leniency behind a configuration option. If anything, I think the default should be reversed, so that startup fails only if the realm is explicitly configured to do so. Additionally, I think we should emit a warning message if the HTTP variant is used and the config is not set to fail fast, to discourage usage of the HTTP variant (mostly to steer users who are setting this up for the first time toward using a file).

Also, is persisting this metadata in the .security index plausible/worth it, such that we could use that as a fallback? Seemingly random things that break on restarts are the worst!

@tvernum
Contributor Author

tvernum commented Jan 12, 2023

I haven't dug too deep so maybe this is already handled...but I do think we need a retry strategy such that if the HTTP issue is transient it will resolve itself without any intervention. I think the 1 hour long default retry is too long in this case (and it is not obvious if the placeholder would get correctly updated), so we may want a hard coded shorter interval until the configured interval kicks in (i.e. with a default of 1 hour, retry every 1 minute during that hour)

I agree. I haven't tested whether OpenSAML does what we want yet, but it might. If not, we will need to configure the refresh interval lower until we get the first successful metadata.

I don't see a need to expose this leniency behind a configuration option

I think I agree, but it's hard to know what people really want.
The main downside of leniency is if you misconfigure the URL then the node starts up OK but SAML doesn't work and it might not be obvious why that is - although the volume of log messages that are generated should be an indication.
I've come to the conclusion that we probably want to default to leniency (despite it being abhorrent).

Also, is persisting this metadata in the .security index plausible/worth it such that we could use that as a fallback ?

Probably.
It's got the pain of having to coordinate between nodes & index health changes, but it should be possible - the metadata isn't huge.
I probably wouldn't try and do it in this PR because it would probably mean we can't ship as quickly, but I think it is possible to make it work.

@tvernum tvernum changed the title [WIP] Allow leniency in SAML HTTP metadata loading Do not fail node if SAML HTTP metadata is unavailable Jan 19, 2023
@tvernum
Contributor Author

tvernum commented Jan 19, 2023

This is ready for review now.

Mostly it's self-explanatory, but there are a few bits that could use some additional details:

  • In order to make this change as small as possible, I added a new UnresolvedEntity that is used as a placeholder whenever we can't load a real entity from the metadata URL. We could just use null or introduce an Optional, but then we would need to review all the places where the entity is used and make them handle null (or empty). The UnresolvedEntity throws exceptions when you try to use it, so anything that tries to use the realm will fail at runtime (hopefully) with a reasonable message.
  • If the initial load of metadata fails, we want to keep trying (more frequently than the default 1h refresh). That was solved in two complementary ways:
    1. I added a new metadata.http.minimum_refresh setting, which defaults to 5 minutes. The way it's used in OpenSAML changes the semantics of the realm slightly, but mostly in helpful ways. If metadata has not been loaded, it will be retried every 5 minutes. If metadata has been loaded, and the URL has an Expires header that is less than 1 hour in the future, then it will be refreshed when it expires (with a minimum of 5 minutes).
    2. When we request the metadata from the resolver, if it returns null (which means we never received valid metadata for this entity) we automatically force a refresh (once). An AtomicBoolean ensures that only 1 thread per node will refresh at a time, the others will fail with "no metadata" as they always have.
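
The on-demand refresh described in item 2 can be sketched roughly as follows. This is an illustrative sketch only: `OnDemandRefreshResolver`, its `Supplier`-based cache, and the `Runnable` refresh hook are hypothetical names chosen for the example, not the PR's code or OpenSAML's API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical resolver wrapper: forces one on-demand refresh when nothing is cached. */
final class OnDemandRefreshResolver {
    private final Supplier<String> cachedMetadata; // returns null until a load has succeeded
    private final Runnable refresh;                // attempts to (re)load the metadata
    private final AtomicBoolean refreshing = new AtomicBoolean(false);

    OnDemandRefreshResolver(Supplier<String> cachedMetadata, Runnable refresh) {
        this.cachedMetadata = cachedMetadata;
        this.refresh = refresh;
    }

    String resolve() {
        String metadata = cachedMetadata.get();
        if (metadata == null && refreshing.compareAndSet(false, true)) {
            // Only the thread that wins the compare-and-set performs the refresh.
            try {
                refresh.run();
                metadata = cachedMetadata.get();
            } finally {
                refreshing.set(false);
            }
        }
        if (metadata == null) {
            // Threads that lost the race, and callers after a failed refresh, end up here.
            throw new IllegalStateException("no metadata available");
        }
        return metadata;
    }

    public static void main(String[] args) {
        AtomicReference<String> cache = new AtomicReference<>();
        OnDemandRefreshResolver resolver = new OnDemandRefreshResolver(cache::get, () -> cache.set("<EntityDescriptor/>"));
        System.out.println(resolver.resolve());
    }
}
```

Using `compareAndSet` rather than a lock means concurrent callers never queue up behind a slow HTTP request; they fail fast with the same "no metadata" error as before.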

I still have 2 more tasks to do, but I'd like to go through a first round of reviews first:

  1. Update docs
  2. Write a test for the case where there are 2 SAML realms - 1 has valid metadata, the other is in a failed state. We need to ensure that authentication with the valid realm works even if the other realm is broken.

@tvernum tvernum marked this pull request as ready for review January 19, 2023 08:14
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Jan 19, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine
Collaborator

Hi @tvernum, I've updated the changelog YAML for you.

Contributor

@jakelandis jakelandis left a comment

Changes are looking good. I like the placeholder and refresh strategy. A couple of minor questions, and I will review the tests when complete.

public static final Setting.AffixSetting<TimeValue> IDP_METADATA_HTTP_MIN_REFRESH = Setting.affixKeySetting(
    RealmSettings.realmSettingPrefix(TYPE),
    IDP_METADATA_SETTING_PREFIX + "http.minimum_refresh",
    key -> Setting.timeSetting(key, TimeValue.timeValueMinutes(5), Setting.Property.NodeScope)
Contributor

@jakelandis jakelandis Jan 19, 2023

  1. should we also set a min value for this setting to prevent absurdly low values
  2. maybe we should add setting validation that this value is < http.refresh rather than setting them equal at runtime, i.e. prefer strict validation vs. lenient behavior? (it looks like the library has similar validation, but I haven't tested to see how that would manifest in ES, i.e. if we want to be strict then that misconfig should prevent startup)
  3. should this be dynamic such that you can bump it way down during uptime if needed? ... After more review, do I understand correctly that dynamic is not needed to help with failures on startup, since on each supplier.get() we will manually try to get the metadata: if it is cached from a prior scheduled operation it will be returned, else we will call refresh (only 1 outstanding manual call at a time), which will kick-start the process based on demand (not just relying on the schedule)?
  4. should we deprecate the http.refresh in favor of http.max_refresh and/or mirror the 4 hour default of the library ?

Contributor Author

should we also set a min value for this setting to prevent absurdly low values

We can. I'm not sure what would constitute absurdly low though. I guess "minutes" is sensible, "seconds" is acceptable and "milliseconds" is a problem. I'm not sure what the precise dividing line would be (my inclination is 5s)

maybe we should add setting validation that this value is < http.refresh rather than setting them equal at runtime, i.e. prefer strict validation vs. lenient behavior

I considered that, but it means either a potential breaking change or a slightly more complex behaviour.
If an admin has set http.refresh to 4 minutes and http.minimum_refresh defaults to 5 minutes then the introduction of this new setting would prevent their node from starting.

To avoid that we need to only apply the validation if http.minimum_refresh has an explicit value, and keep the current behaviour otherwise.
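
A rough sketch of that rule, using plain `java.time.Duration` rather than Elasticsearch's settings classes: validation is strict only when the minimum was set explicitly, and otherwise the default minimum is quietly lowered to match the configured refresh. `RefreshBounds` and its members are hypothetical names for illustration, and the 5 minute default is taken from the discussion above.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical holder for the effective min/max refresh intervals. */
final class RefreshBounds {
    static final Duration DEFAULT_MIN_REFRESH = Duration.ofMinutes(5);

    final Duration min;
    final Duration max;

    private RefreshBounds(Duration min, Duration max) {
        this.min = min;
        this.max = max;
    }

    /**
     * @param explicitMin the configured minimum, or empty if the admin did not set one
     * @param maxRefresh  the configured (or default) refresh interval
     */
    static RefreshBounds of(Optional<Duration> explicitMin, Duration maxRefresh) {
        if (explicitMin.isPresent()) {
            Duration min = explicitMin.get();
            if (min.compareTo(maxRefresh) > 0) {
                // Strict: an explicit, contradictory value is a configuration error.
                throw new IllegalArgumentException(
                    "minimum refresh (" + min + ") cannot be greater than refresh (" + maxRefresh + ")"
                );
            }
            return new RefreshBounds(min, maxRefresh);
        }
        // Lenient: an existing config with a short refresh must keep working,
        // so the default minimum is capped at the configured refresh.
        Duration min = DEFAULT_MIN_REFRESH.compareTo(maxRefresh) > 0 ? maxRefresh : DEFAULT_MIN_REFRESH;
        return new RefreshBounds(min, maxRefresh);
    }

    public static void main(String[] args) {
        RefreshBounds legacy = of(Optional.empty(), Duration.ofMinutes(4));
        System.out.println("min=" + legacy.min + " max=" + legacy.max);
    }
}
```

With this shape, an admin who already runs with a 4 minute refresh sees no change in startup behaviour, while a newly written contradictory pair of values is rejected.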

should this be dynamic?

It would be helpful if http.refresh was dynamic, because if you knew that you had published updated (remote) metadata, you could set the refresh time down to force it to be reloaded more quickly.
It's not so important to set http.minimum_refresh dynamically because of the automatic reload behaviour you mention.

should we deprecate the http.refresh in favor of http.max_refresh?

We could. It's arguably more consistent, but my gut is that it's one of the annoying changes that isn't worth it.

I don't think having http.refresh and http.minimum_refresh is terribly confusing, even if it isn't consistent - in fact I expect most admins who need to fiddle with refresh times should just set http.refresh and ignore http.minimum_refresh anyway.

On balance, forcing every admin who has configured http.refresh to go through the deprecation process doesn't seem justified just to make things a bit more consistent - it doesn't feel like it meets the necessary threshold to force that pain onto users.

Contributor Author

I'm not sure what the precise dividing line would be (my inclination is 5s)

I ended up setting the minimum to 500ms. Mostly because I had a test that relied on setting it to a very low value and making sure it did a refresh in the background. If the limit was 5s then the test would take >5s to run (for no great reason).

500ms still stops people setting it to tiny values, and if they really want it to refresh twice every second, I guess they can.

@ywangd
Member

ywangd commented Jan 23, 2023

Could you please help me understand the following behaviour?

If metadata has been loaded, and the URL has an Expires header that is less than 1 hour in the future, then it will be refreshed when it expires (with a minimum of 5 minutes)

What if the refresh at expiration time fails? Does OpenSAML fall back to refreshing every 5 minutes? Also, what will be the return value of resolveSingle()? Is it null?

@tvernum
Contributor Author

tvernum commented Jan 24, 2023

What if the refresh at expiration time fails? Does OpenSAML fall back to refreshing every 5 minutes? Also, what will be the return value of resolveSingle()? Is it null?

The underlying code in OpenSAML is complex with many separate execution paths. My best understanding is:

  1. If there is any failure to refresh the metadata, we will continue to use the old metadata (if we successfully loaded metadata at some point). The exception would be if the "failure" is actually a 200 response that parses successfully as XML metadata (which is highly implausible).
  2. The min/max refresh times just control how frequently OpenSAML will attempt to refresh. It will attempt to refresh at least every maxRefresh time-units, and never attempt to refresh more often than minRefresh time-units. Within those boundaries it attempts to pick the best "nextScheduledRefresh" based on whether it ever received a successful HTTP response, when that was, what Expires header it received (I believe) and validUntil (etc) attributes inside the metadata itself.

It should never cause us any problems - once we've loaded metadata we never lose it, though it could expire while unsuccessfully attempting to load new metadata.
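
As a simplified illustration of the min/max bounds only (this is not OpenSAML's actual scheduling algorithm, which weighs more inputs than shown here), the next refresh delay can be thought of as an expiry hint clamped between the two settings. `RefreshSchedule.nextDelay` and its parameters are hypothetical.

```java
import java.time.Duration;

/** Hypothetical helper: clamps the next refresh delay between the min and max intervals. */
final class RefreshSchedule {
    /**
     * @param untilExpiry time until the loaded metadata expires, or null if nothing was loaded
     * @param minRefresh  never refresh more often than this
     * @param maxRefresh  always refresh at least this often
     */
    static Duration nextDelay(Duration untilExpiry, Duration minRefresh, Duration maxRefresh) {
        if (untilExpiry == null) {
            // Nothing loaded yet: retry as soon as the minimum interval allows.
            return minRefresh;
        }
        if (untilExpiry.compareTo(minRefresh) < 0) {
            return minRefresh;
        }
        if (untilExpiry.compareTo(maxRefresh) > 0) {
            return maxRefresh;
        }
        return untilExpiry;
    }

    public static void main(String[] args) {
        Duration min = Duration.ofMinutes(5);
        Duration max = Duration.ofHours(1);
        System.out.println(nextDelay(null, min, max));                   // initial load failed
        System.out.println(nextDelay(Duration.ofMinutes(20), min, max)); // expiry within bounds
        System.out.println(nextDelay(Duration.ofHours(6), min, max));    // expiry far in the future
    }
}
```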

Member

@ywangd ywangd left a comment

LGTM

I left some comments, but do not need to look at the changes again. Thanks!

Comment on lines +58 to +59
IDP_METADATA_SETTING_PREFIX + "http.fail_on_error",
key -> Setting.boolSetting(key, false, Setting.Property.NodeScope)
Member

I don't think this could be categorized as a breaking change. But thought I'd just mention for completeness.

Contributor Author

Yeah, @jakelandis and I discussed earlier. My first version reversed the default behavior and would fail unless you made it lenient. We agreed that it was better to change the behavior with an option to revert and that we didn't consider it breaking.

Comment on lines +697 to +704
if (config.hasSetting(IDP_METADATA_HTTP_MIN_REFRESH)) {
    throw new SettingsException(
        "the value ({}) for [{}] cannot be greater than the value ({}) for [{}]",
        minRefresh.getStringRep(),
        RealmSettings.getFullSettingKey(config, IDP_METADATA_HTTP_MIN_REFRESH),
        maxRefresh.getStringRep(),
        RealmSettings.getFullSettingKey(config, IDP_METADATA_HTTP_REFRESH)
    );
Member

Nit: It is also possible that a user explicitly configures the max refresh to be lower than the default min refresh. In that case, this check won't be able to catch it.

Contributor Author

@tvernum tvernum Feb 6, 2023

Yes, that's intentional for BWC.
If someone has an existing config with .refresh set to 4 minutes, then we don't want the node to fail during bootstrap, even though their refresh interval conflicts with the default for .minimum_refresh.

throw SamlUtils.samlException("Cannot find metadata for entity [{}] in [{}]", entityId, sourceLocation);
} else {
logger.warn(
"cannot load SAML metadata for [{}] from [{}]; SAML authentication for this realm will fail",
Member

Nit: might want to include the realm name here. It can be derived from the entityId and URL, but it is convenient for readers to just call it out.

Contributor Author

Unfortunately, we don't know the realm name here and it's tricky to resolve that.

Comment on lines +827 to +828
* has been resolved (although if metadata is loaded from a local file we monitor it for changes anyway, so this refresh
* is unlikely to have any benefit).
Member

Can we check the resolver object type and only perform refresh if it is a HTTPMetadataResolver?

Contributor Author

We can, but do we care enough about that to make the code more complex?

@tvernum
Contributor Author

tvernum commented Feb 7, 2023

I'm going to move ahead with docs + a QA test on this since it seems like we have consensus on the approach.

@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
@elasticsearchmachine
Collaborator

Hi @tvernum, I've updated the changelog YAML for you.

@tvernum
Contributor Author

tvernum commented Feb 10, 2023

@elasticmachine run elasticsearch-ci/part-3 please

@tvernum
Contributor Author

tvernum commented Feb 10, 2023

@jakelandis, @ywangd I've added a QA test & docs for the new settings. Do either of you want to review those?

@ywangd
Member

ywangd commented Feb 10, 2023

I had a brief look at the doc and test updates. They look good to me. The way you set up the tests is quite intriguing. It will be a great reference when I need to juggle so many SAML messages :)

@tvernum
Contributor Author

tvernum commented Feb 13, 2023

@elasticmachine update branch

Contributor

@jakelandis jakelandis left a comment

LGTM (and really nice work on the integration test and setup!)

@tvernum tvernum merged commit 34c270c into elastic:main Feb 16, 2023
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Feb 16, 2023
carlosdelest pushed a commit to carlosdelest/elasticsearch that referenced this pull request Feb 21, 2023
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Feb 23, 2023
saarikabhasi pushed a commit to saarikabhasi/elasticsearch that referenced this pull request Apr 10, 2023
Labels
>enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:Security Meta label for security team v8.8.0
Successfully merging this pull request may close these issues.

Be more lenient when remotely hosted SAML IdP metadata is unavailable
6 participants