
Improve threadpool usage and error handling for API key validation #58090

Merged
merged 17 commits into from
Jul 6, 2020

Conversation

ywangd
Member

@ywangd ywangd commented Jun 15, 2020

The PR introduces the following two changes:

  • Move API key validation into a new separate threadpool
  • Return more informative response on threadpool saturation

The new threadpool is created separately with half of the available processors and a queue size of 1000. We could combine it with the existing TokenService's threadpool. Technically that is straightforward, but I am not sure whether it would be a rushed optimization, since I am not clear about the potential impact on the token service.

On threadpool saturation, it now fails with EsRejectedExecutionException, which in turn gives back a 429 status code to users. Note this is also a subtle behaviour change: previously, any failure during API key validation was translated into "unsuccessful but continue to realm authentication". After the change, a threadpool saturation error is translated into "unsuccessful and terminate authentication". The difference manifests itself when a user sends in two sets of credentials, e.g. one for API key and one for basic auth. Before the change, authentication would continue with the basic auth and, if it was valid, end up successful. After the change, authentication stops at the API key step when the pool is saturated and does not proceed further. When the threadpool is saturated, it is highly likely that users do want API key authentication (otherwise the pool would not be saturated in the first place), so I doubt any user really depends on the existing behaviour. (edit: this is not a concern since the code does not allow multiple Authorization headers. Thanks @jkakavas)
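
A minimal standalone sketch of the saturation behaviour described above (plain JDK, purely illustrative; the class and method names below are not from this PR, and the real code completes authentication asynchronously via listeners rather than returning a status code directly):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bounded fixed pool: extra work is rejected instead of queuing without limit, and the
// rejection is what gets reported to the client as HTTP 429 (Too Many Requests).
class SaturationSketch {
    private static final int THREADS = (Runtime.getRuntime().availableProcessors() + 1) / 2;
    private static final ExecutorService VALIDATION_POOL = new ThreadPoolExecutor(
        THREADS, THREADS, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(1000),                 // queue size 1000, as in the PR
        new ThreadPoolExecutor.AbortPolicy());          // throw instead of blocking when full

    static int submitValidation(Runnable expensiveHashCheck) {
        try {
            VALIDATION_POOL.execute(expensiveHashCheck);
            return 200;                                  // accepted; real code completes async
        } catch (RejectedExecutionException e) {
            return 429;                                  // saturated: tell the client to back off
        }
    }
}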

Resolves: #58088

Also return 429 when either GET or the hashing thread pool is saturated.
@ywangd ywangd added >enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) v8.0.0 v7.9.0 labels Jun 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-security (:Security/Authentication)

@elasticmachine elasticmachine added the Team:Security Meta label for security team label Jun 15, 2020
@ywangd
Member Author

ywangd commented Jun 15, 2020

Just realised that shunting everything after the GetDoc call (mainly ApiKeyService#validateApiKeyCredentials) to a new thread pool does not solve the problem that cached API key auth is blocked by uncached API key auth, i.e. existing indexing operations are made unstable because new clients try to connect.

So I made changes to only push ApiKeyService#verifyKeyAgainstHash to the new thread pool and leave cached API key auth on the same GET thread pool. According to the performance tests, once all API keys are cached, the GET thread pool can sustain at least 5000 auth requests per second. So this change seems to be the best of both worlds.
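
For illustration, a standalone sketch of that split (plain JDK, not the actual ApiKeyService code; the verify helpers are placeholders): cached keys are verified inline on the calling (GET) thread, while previously unseen keys are handed to the dedicated pool for the expensive check.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Cached keys: cheap verification on the current thread. Uncached keys: CPU-heavy
// bcrypt/PBKDF2-style verification isolated on a smaller dedicated pool.
class CachedVsUncachedSketch {
    private final Map<String, String> cachedHashes = new ConcurrentHashMap<>();
    private final ExecutorService cryptoPool =
        Executors.newFixedThreadPool((Runtime.getRuntime().availableProcessors() + 1) / 2);

    CompletableFuture<Boolean> verify(String keyId, char[] secret) {
        String cached = cachedHashes.get(keyId);
        if (cached != null) {
            // fast path: microseconds of work, fine to keep on the current thread
            return CompletableFuture.completedFuture(cheapVerify(secret, cached));
        }
        // slow path: milliseconds of pure CPU, hand it to the crypto pool
        return CompletableFuture.supplyAsync(() -> expensiveVerify(keyId, secret), cryptoPool);
    }

    // placeholders standing in for the real hash comparisons
    private boolean cheapVerify(char[] secret, String cachedHash) { return cachedHash != null; }
    private boolean expensiveVerify(String keyId, char[] secret) { return secret.length > 0; }
}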

@jkakavas
Member

We could combine it with the existing TokenService's threadpool. Technically that is straightforward, but I am not sure whether it would be a rushed optimization, since I am not clear about the potential impact on the token service.

Agreed. I don't see any obvious benefit to this, or any actual problem it would solve right now, but I can see a clear negative impact on token-based authentication, which we should avoid.

As discussed in Slack, I'm not particularly worried about multiple Authorization headers, as we only handle the first one right now.

Contributor

@albertzaharovits albertzaharovits left a comment


Overall this looks good!
I have raised two topics for discussion:

  • I have a suggestion about what I believe is a better place for handling EsRejectedExecutionException
  • I think it's better to enqueue only expensive hashing on the new thread pool, and not all (most) API verifications.

new FixedExecutorBuilder(settings, TokenService.THREAD_POOL_NAME, 1, 1000,
"xpack.security.authc.token.thread_pool", false),
new FixedExecutorBuilder(settings, ApiKeyService.THREAD_POOL_NAME,
(allocatedProcessors + 1) / 2, 1000,
Member Author


Note that by using half of the allocated processors, initial authentication of API keys, i.e. warming up the cache, could be up to twice as slow as the current implementation.
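
As a quick sanity check of the rounding in the snippet above: (allocatedProcessors + 1) / 2 in integer arithmetic is a ceiling halving, so the pool never drops below one thread. A small standalone check:

// Rounding behaviour of the pool size expression used in the FixedExecutorBuilder above
public class PoolSizeCheck {
    public static void main(String[] args) {
        for (int allocatedProcessors : new int[] {1, 2, 7, 8, 16}) {
            System.out.println(allocatedProcessors + " -> " + (allocatedProcessors + 1) / 2);
        }
        // prints: 1 -> 1, 2 -> 1, 7 -> 4, 8 -> 4, 16 -> 8
    }
}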

Contributor


The maximum theoretical throughput of new (previously unseen) API key validations will indeed halve, but I'm not personally worried about it. We're talking about a theoretical maximum which, because of the contention on the get threadpool and because many thread pools overbook the available processors, is not something I would consider practically important.

Still, a possible mitigation would be to decrease (halve) the default hashing cost factor for API keys. Given the length of the random API key secret that we generate, we would still remain outside the range where brute force is plausible.

Member


That's quite a large number of new threads to be adding to the system, especially threads that could end up all being busy at once for a long time authenticating a lot of clients. They're then stealing CPU time from other components of the system. What work have we done to justify that the thread pool needs to be this large (the ceiling of half the number of cores)?

Member Author


A few reasons:

  • It has fewer threads than the GET thread pool - the current code performs the hashing in the GET thread pool, which has as many threads as cores. The new thread pool should in fact help with resource stealing. Besides, saturation of the GET thread pool has wider negative impacts on both uncached and cached authentication; a new pool isolates the impact to only new clients trying to connect.
  • Balance between throughput and resource contention - the expensive hashing operations are unavoidable. Performance tests show about 5000 authentications per minute (8 cores). With half of the threads, that number halves as well. Assuming a use case of 25K clients (50K keys), it will take 20 minutes for all keys to be cached (a back-of-envelope check of these numbers follows this list). Reducing the number of threads much further would push the warm-up time even higher, so using half the cores feels like a reasonable middle ground.
  • Hashing in the generic thread pool (which uses all cores) does not show an obvious performance hit - I tried simply pushing the hashing to the generic thread pool, which then used all cores for the hashing, and did not observe any obvious performance issue. In fact, the initial warm-up stage is smoother and has better throughput than using the GET thread pool. So using half of the cores should be relatively safe and still performant. I do plan to run a few more rounds of performance tests after all scheduled v7.9 API key improvements, so we can be confident in this.
  • The new thread pool will be shared by all security-related expensive operations - currently its only usage is hashing for API key auth, so this is not a very strong reason yet. But it could soon see another use case: API key creation also requires expensive hashing, which currently runs on the transport_worker thread. I need to do some more investigation, but the result is very likely that we need to move it to this new thread pool as well.
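
A back-of-envelope check of the numbers in the second point, using only the figures quoted above (the two-keys-per-client split is implied by the 25K clients / 50K keys figures):

// Rough check of the warm-up estimate: ~5000 hash verifications per minute with the full
// 8-core pool, roughly halved with half the threads, applied to 50K uncached keys.
public class WarmUpEstimate {
    public static void main(String[] args) {
        int hashesPerMinuteFullPool = 5000;                            // measured, 8 cores
        int hashesPerMinuteHalfPool = hashesPerMinuteFullPool / 2;     // ~2500 with 4 threads
        int totalKeys = 50_000;                                        // 25K clients, 2 keys each
        System.out.println("warm-up ~ " + (totalKeys / hashesPerMinuteHalfPool) + " minutes");  // ~ 20
    }
}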

Contributor


My thinking in approving the PR was that we operate under some rough target for the rate of key-based authentication (what Yang describes in the second point above). By switching to a new pool (given that the get operation is so much faster than the hash, and that the new pool is smaller than the get pool), all the threads in the new pool will be busy at peak rate, so we can estimate that peak rate (assuming a relatively tranquil system where the pool threads keep running). The original peak rate (using the get pool) was noisy because any get requests in the system interfered with it (in addition to the obvious flip side that authn hashing interfered with get requests).

When the system is busy, all threads are working off the pool queues, and no queue is empty, it all boils down to the processors' queues. In a sense, all thread pool queues "merge" into processor queues, with priority given by the relative number of threads assigned to each pool. So introducing a new pool decreases the throughput of existing pools in proportion to the relative sizes, in exchange for a guaranteed minimum throughput on the new pool, because the new queue is not populated with any other types of work (compared to reusing an existing pool that handles other work types).

To summarise, I would generally introduce a thread pool when I want more control over the rate of some work (both the minimum and the maximum rate).

That's the thought process about thread pool queues on which I've approved the PR. I believe a new thread pool is the right decision, but I concede that we've based the sizing decision on rough external authn rates without also consulting the folks who tune the existing thread pools.

Member


Okay, thanks for the great explanations to help me understand the perspective here. One more question: is it enough to have a simple "security" thread pool? Do we need a token and a crypto thread pool and maybe another thread pool for other security aspects in the future?

Member Author


TL;DR: We are not considering another thread pool for other security aspects.

Our intention for the new thread pool is to focus on "CPU-intensive" security computations, namely hashing and encryption, especially in the scope of authentication. Given their cost and importance, we'd like to control their impact in both directions: they should not take over all system resources, but at the same time they should maintain a reasonable level of throughput. We use the following criteria to decide whether an operation should be added to this pool:

  • Is it expensive? The standard here is somewhat subjective. Based on performance tests, I would personally look at anything longer than 1 ms.
  • Is it authentication/authorization related? We'd like to prioritize these operations since they usually need to complete in a timely manner.
  • Does it need to scale, i.e. could it see bursts in volume? When there is a large increase in concurrency, we'd like to protect the system from being completely flooded.

With the above criteria, we come up with the following:

  • Hashing for uncached results is a great candidate. This includes API key auth (current PR, v7.9), API key creation (v7.9?), tokens, and username/password (planned, v7.10).
  • Signing and encryption for SAML assertions and OIDC claims are potential candidates (pending further performance checks).
  • Hashing for cached results does not qualify for the new thread pool since it is fast (SHA-256, ~3 microseconds; see the short illustration after this list).
  • Manipulation of security documents does not qualify either.
  • Encrypted snapshots, though potentially expensive, are not auth related and are unlikely to need to scale, so they do not qualify either.
  • An interesting case is reads/writes of the security index. When the index is not on the local node, these operations can be expensive (above 1 ms), and the case also matches the other two criteria. However, the system indices work (System index reads in separate threadpool #57936) is supposed to cover it, so we are not considering it here either.
  • Any other security aspects are not qualified.
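
A short standalone illustration of why the cached path is so cheap (plain JDK; the exact cache hasher Elasticsearch uses is configurable, and plain SHA-256 here is just for illustration):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// A single digest plus a constant-time comparison is why the cached path is cheap enough
// to stay on the GET thread.
public class CachedHashCheck {
    static boolean matchesCachedHash(char[] secret, byte[] cachedDigest) throws NoSuchAlgorithmException {
        byte[] presented = MessageDigest.getInstance("SHA-256")
            .digest(new String(secret).getBytes(StandardCharsets.UTF_8));
        return MessageDigest.isEqual(presented, cachedDigest);   // constant-time comparison
    }
}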

Now the question is whether the above unqualified security operations are worth having their own separate thread pool.

When security is enabled, security-related operations are ubiquitous, e.g. every request needs to go through the authentication and authorization flow. Currently our general practice is to "just let them run in the same thread", which could be get, transport_worker, etc. This strategy has served us reasonably well so far. But with the upcoming change to expose the service to an unprecedentedly large number of clients, we started noticing some issues revealed by the performance tests, which in turn led to this PR and a few other pieces of work in the pipeline. At this stage, based on what we know, we still believe the existing strategy works fine for security operations that do not qualify for the new thread pool, hence we do not plan to add another one.

For the sake of completeness, there are exceptions to the existing strategy. Outbound LDAP calls are submitted via the generic pool to avoid the risk of deadlock if they were submitted via the same pool used by the outbound requests. There is no sign of this being an issue so far; we had a brief discussion about it and decided to evaluate it more closely at a later time.

Contributor

@albertzaharovits albertzaharovits left a comment


LGTM! Besides a minor thing in tests, this is ready to 🚢.
Great job Yang!


@jaymode
Member

jaymode commented Jun 23, 2020

Was any consideration given to having a single thread pool for hashing/expensive operations within security?

@ywangd
Member Author

ywangd commented Jun 23, 2020

Was any consideration given to having a single thread pool for hashing/expensive operations within security?

I thought about it but didn't have an open discussion within the team. My concern was that initial authentication would be too slow, but this could also be a bias from doing performance tests, where the initial warm-up is evident. It might not be as big a concern for real use cases? Do you have an argument in favor of it?

@jaymode
Member

jaymode commented Jun 23, 2020

It might not be as a big concern for real use cases? Do you have an argument in favor of it?

In general, we have been very judicious about adding additional thread pools. This PR targets API keys, but we also have similar issues with native users, since we use a get request and then validate on the get thread pool there as well. Some hash verification can also occur on network threads in the case of the file realm. The reserved realm will also perform the hash verification on a network thread or a thread from the get thread pool, depending on the state of the security index. These operations are designed to take time, so if we are going to introduce a new thread pool, I think it would be wise to use a single one, to move these expensive operations out of the other pools and limit the amount of CPU that can be scheduled to service them.

Contributor

@tvernum tvernum left a comment


I'd like to see a test somewhere for the 429 response code behaviour.

We have a test for the ApiKeyService, but nothing that shows that EsRejectedExecutionException bubbles up into a 429 response. I think we need that.
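
One way to shape such a test, sketched standalone with plain JDK (illustrative only; the real test would drive ApiKeyService with a saturated crypto pool and assert on the REST response): a pool with one busy worker and a full one-slot queue must reject the next task, and that rejection is what gets reported to the client as 429.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Saturate a tiny pool deterministically, then confirm the next submission is rejected.
public class RejectionTestSketch {
    public static void main(String[] args) {
        ThreadPoolExecutor saturated = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        CountDownLatch release = new CountDownLatch(1);
        saturated.execute(() -> awaitQuietly(release));  // occupies the only worker
        saturated.execute(() -> { });                    // fills the only queue slot
        boolean rejected = false;
        try {
            saturated.execute(() -> { });                // must be rejected now
        } catch (RejectedExecutionException e) {
            rejected = true;                             // the real test asserts a 429 here
        }
        System.out.println("rejected = " + rejected);    // expected: true
        release.countDown();
        saturated.shutdown();
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}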

return List.of(
new FixedExecutorBuilder(settings, TokenService.THREAD_POOL_NAME, 1, 1000,
"xpack.security.authc.token.thread_pool", false),
new FixedExecutorBuilder(settings, ApiKeyService.THREAD_POOL_NAME,
Contributor


I think we should name this thread pool in a more generic way to reflect the intent that it be used for all password hashing.
Something like security-password-hash feels better to me than security-api-key

Member Author


As discussed, renamed it to security-crypto.

@tvernum
Contributor

tvernum commented Jun 24, 2020

Was any consideration given to having a single thread pool for hashing/expensive operations within security?

I thought about it but didn't have an open discussion within the team

I recall having exactly that conversation and thought that we agreed to move in that direction (not in this PR, but as a followup).

@ywangd
Member Author

ywangd commented Jun 24, 2020

Was any consideration given to having a single thread pool for hashing/expensive operations within security?

I thought about it but didn't have an open discussion within the team

I recall having exactly that conversation and thought that we agreed to move in that direction (not in this PR, but as a followup).

Sorry @jaymode, I misread your message. I thought you were asking for a thread pool with a single thread.... Tim is right. We have discussed this and agreed that we will consolidate security-related expensive operations into a single pool.

@ywangd
Member Author

ywangd commented Jun 29, 2020

@tvernum Thank you for your suggestion; a test for the 429 status code is now added.

@ywangd ywangd requested a review from tvernum June 29, 2020 06:48
Contributor

@tvernum tvernum left a comment


LGTM

Member

@jasontedor jasontedor left a comment


LGTM.

@ywangd ywangd merged commit 7dcfd45 into elastic:master Jul 6, 2020
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jul 6, 2020
…lastic#58090)

The PR introduces the following two changes:

Move API key validation into a new separate threadpool. The new threadpool is created separately with half of the available processors and a queue size of 1000. We could combine it with the existing TokenService's threadpool. Technically that is straightforward, but I am not sure whether it would be a rushed optimization, since I am not clear about the potential impact on the token service.

On threadpool saturation, it now fails with EsRejectedExecutionException, which in turn gives back a 429 instead of a 401 status code to users.
ywangd added a commit that referenced this pull request Jul 6, 2020
A small follow up for #58090 to correct the settings prefix
ywangd added a commit that referenced this pull request Jul 6, 2020
…58090) (#59047)
ywangd added a commit that referenced this pull request Jun 22, 2021
The changes in #74106 make API keys cached at creation time. This helps avoid the
expensive hashing operation on initial authentication when a request using the
key hits the same node that created the key. Since the more expensive hashing
at authentication time is handled by a dedicated "crypto" thread pool (#58090),
usage of the "crypto" thread pool is expected to be reduced.

This PR moves the hashing at creation time to the "crypto" thread pool so that
a usage level of the "crypto" thread pool similar to before #74106 is maintained. It
also has the benefit of avoiding costly operations on the transport_worker thread,
which is generally preferred.

Relates: #74106
Labels
>enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:Security Meta label for security team v7.9.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Avoid saturating GET thread pool with API key hash verification
8 participants