
Improve threadpool usage and error handling for API key validation #58090

Merged
merged 17 commits into from
Jul 6, 2020

Conversation

ywangd
Member

@ywangd ywangd commented Jun 15, 2020

The PR introduces the following two changes:

  • Move API key validation into a new separate threadpool
  • Return more informative response on threadpool saturation

The new threadpool is created separately with half of the available processors and a queue size of 1000. We could combine it with the existing TokenService's threadpool. Technically that is straightforward, but I am not sure whether it would be a rushed optimization, since I am not clear about the potential impact on the token service.

On threadpool saturation, it now fails with EsRejectedExecutionException, which in turn gives back a 429 status code to users. Note this is also a subtle behaviour change: previously, any failure during API key validation was translated into "unsuccessful but continue to realm authentication". After the change, a threadpool saturation error is translated into "unsuccessful and terminate authentication". The difference manifests itself when a user sends in two sets of credentials, e.g. one for API key and one for basic auth. Before the change, authentication would continue with the basic auth and, if it was valid, end up successful. After the change, authentication stops at the API key step when the pool is saturated and does not proceed further. When the threadpool is saturated, it is highly likely that users do want API key authentication (otherwise the pool would not be saturated in the first place), so I doubt any user really depends on the existing behaviour. (edit: this is not a concern since the code does not allow multiple Authorization headers. Thanks @jkakavas)
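
A minimal standalone sketch of the saturation behaviour described above (plain JDK, purely illustrative; the class and method names below are not from this PR, and the real code completes authentication asynchronously via listeners rather than returning a status code directly):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bounded fixed pool: extra work is rejected instead of queuing without limit, and the
// rejection is what gets reported to the client as HTTP 429 (Too Many Requests).
class SaturationSketch {
    private static final int THREADS = (Runtime.getRuntime().availableProcessors() + 1) / 2;
    private static final ExecutorService VALIDATION_POOL = new ThreadPoolExecutor(
        THREADS, THREADS, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(1000),                 // queue size 1000, as in the PR
        new ThreadPoolExecutor.AbortPolicy());          // throw instead of blocking when full

    static int submitValidation(Runnable expensiveHashCheck) {
        try {
            VALIDATION_POOL.execute(expensiveHashCheck);
            return 200;                                  // accepted; real code completes async
        } catch (RejectedExecutionException e) {
            return 429;                                  // saturated: tell the client to back off
        }
    }
}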

Resolves: #58088

Also return 429 when either GET or the hashing thread pool is saturated.
@ywangd ywangd added >enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) v8.0.0 v7.9.0 labels Jun 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-security (:Security/Authentication)

@elasticmachine elasticmachine added the Team:Security Meta label for security team label Jun 15, 2020
@ywangd
Member Author

ywangd commented Jun 15, 2020

Just realised that shunting everything after the GetDoc call (mainly ApiKeyService#validateApiKeyCredentials) to a new thread pool does not solve the problem that cached API key auth is blocked by uncached API key auth, i.e. existing indexing operations are made unstable because new clients try to connect.

So I made changes to only push ApiKeyService#verifyKeyAgainstHash to the new thread pool and leave cached API key auth on the same GET thread pool. According to the performance tests, once all API keys are cached, the GET thread pool can sustain at least 5000 auth requests per second. So this change seems to be the best of both worlds.
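
For illustration, a standalone sketch of that split (plain JDK, not the actual ApiKeyService code; the verify helpers are placeholders): cached keys are verified inline on the calling (GET) thread, while previously unseen keys are handed to the dedicated pool for the expensive check.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Cached keys: cheap verification on the current thread. Uncached keys: CPU-heavy
// bcrypt/PBKDF2-style verification isolated on a smaller dedicated pool.
class CachedVsUncachedSketch {
    private final Map<String, String> cachedHashes = new ConcurrentHashMap<>();
    private final ExecutorService cryptoPool =
        Executors.newFixedThreadPool((Runtime.getRuntime().availableProcessors() + 1) / 2);

    CompletableFuture<Boolean> verify(String keyId, char[] secret) {
        String cached = cachedHashes.get(keyId);
        if (cached != null) {
            // fast path: microseconds of work, fine to keep on the current thread
            return CompletableFuture.completedFuture(cheapVerify(secret, cached));
        }
        // slow path: milliseconds of pure CPU, hand it to the crypto pool
        return CompletableFuture.supplyAsync(() -> expensiveVerify(keyId, secret), cryptoPool);
    }

    // placeholders standing in for the real hash comparisons
    private boolean cheapVerify(char[] secret, String cachedHash) { return cachedHash != null; }
    private boolean expensiveVerify(String keyId, char[] secret) { return secret.length > 0; }
}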

@jkakavas
Member

We could combine it with the existing TokenService's threadpool. Technically that is straightforward, but I am not sure whether it would be a rushed optimization, since I am not clear about the potential impact on the token service.

Agreed. I don't see any obvious benefit to this, or any actual problem it would solve right now, but I can see a clear negative impact on token-based authentication, which we should avoid.

As discussed in Slack, I'm not particularly worried about multiple Authorization headers, as we only handle the first one right now.

Contributor

@albertzaharovits albertzaharovits left a comment


Overall this looks good!
I have raised two topics for discussion:

  • I have a suggestion about what I believe is a better place for handling EsRejectedExecutionException
  • I think it's better to enqueue only expensive hashing on the new thread pool, and not all (most) API verifications.

new FixedExecutorBuilder(settings, TokenService.THREAD_POOL_NAME, 1, 1000,
"xpack.security.authc.token.thread_pool", false),
new FixedExecutorBuilder(settings, ApiKeyService.THREAD_POOL_NAME,
(allocatedProcessors + 1) / 2, 1000,
Member Author


Note that by using half of the allocated processors, initial authentication of API keys, i.e. warming up the cache, could be up to twice as slow as the current implementation.
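
As a quick sanity check of the rounding in the snippet above: (allocatedProcessors + 1) / 2 in integer arithmetic is a ceiling halving, so the pool never drops below one thread. A small standalone check:

// Rounding behaviour of the pool size expression used in the FixedExecutorBuilder above
public class PoolSizeCheck {
    public static void main(String[] args) {
        for (int allocatedProcessors : new int[] {1, 2, 7, 8, 16}) {
            System.out.println(allocatedProcessors + " -> " + (allocatedProcessors + 1) / 2);
        }
        // prints: 1 -> 1, 2 -> 1, 7 -> 4, 8 -> 4, 16 -> 8
    }
}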

Contributor


The maximum theoretical throughput of new (previously unseen) API key validations will indeed halve, but I'm not personally worried about it. We're talking about a theoretical maximum which, because of the contention on the get threadpool and because many thread pools overbook the available processors, is not something I would consider practically important.

Still, a possible mitigation would be to decrease (halve) the default hashing cost factor for API keys. Given the length of the random API key secret that we generate, we would still remain outside the range where brute force is plausible.

Member


That's quite a large number of new threads to be adding to the system, especially threads that could end up all being busy at once for a long time authenticating a lot of clients. They're then stealing CPU time from other components of the system. What work have we done to justify that the thread pool needs to be this large (the ceiling of half the number of cores)?

Member Author


A few reasons:

  • It has fewer threads than the GET thread pool - the current code performs the hashing in the GET thread pool, which has as many threads as cores. The new thread pool should in fact help with resource stealing. Besides, saturation of the GET thread pool has wider negative impacts on both uncached and cached authentication; a new pool isolates the impact to only new clients trying to connect.
  • Balance between throughput and resource contention - the expensive hashing operations are unavoidable. Performance tests show about 5000 authentications per minute (8 cores). With half of the threads, that number halves as well. Assuming a use case of 25K clients (50K keys), it will take 20 minutes for all keys to be cached (a back-of-envelope check of these numbers follows this list). Reducing the number of threads much further would push the warm-up time even higher, so using half the cores feels like a reasonable middle ground.
  • Hashing in the generic thread pool (which uses all cores) does not show an obvious performance hit - I tried simply pushing the hashing to the generic thread pool, which then used all cores for the hashing, and did not observe any obvious performance issue. In fact, the initial warm-up stage is smoother and has better throughput than using the GET thread pool. So using half of the cores should be relatively safe and still performant. I do plan to run a few more rounds of performance tests after all scheduled v7.9 API key improvements, so we can be confident in this.
  • The new thread pool will be shared by all security-related expensive operations - currently its only usage is hashing for API key auth, so this is not a very strong reason yet. But it could soon see another use case: API key creation also requires expensive hashing, which currently runs on the transport_worker thread. I need to do some more investigation, but the result is very likely that we need to move it to this new thread pool as well.
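
A back-of-envelope check of the numbers in the second point, using only the figures quoted above (the two-keys-per-client split is implied by the 25K clients / 50K keys figures):

// Rough check of the warm-up estimate: ~5000 hash verifications per minute with the full
// 8-core pool, roughly halved with half the threads, applied to 50K uncached keys.
public class WarmUpEstimate {
    public static void main(String[] args) {
        int hashesPerMinuteFullPool = 5000;                            // measured, 8 cores
        int hashesPerMinuteHalfPool = hashesPerMinuteFullPool / 2;     // ~2500 with 4 threads
        int totalKeys = 50_000;                                        // 25K clients, 2 keys each
        System.out.println("warm-up ~ " + (totalKeys / hashesPerMinuteHalfPool) + " minutes");  // ~ 20
    }
}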

Contributor


My thinking in approving the PR was that we operate under some rough target for the rate of key-based authentication (what Yang describes in the second point above). By switching to a new pool (given that the get operation is so much faster than the hash, and that the new pool is smaller than the get pool), all the threads in the new pool will be busy at peak rate, so we can estimate that peak rate (assuming a relatively tranquil system where the pool threads keep running). The original peak rate (using the get pool) was noisy because any get requests in the system interfered with it (in addition to the obvious flip side that authn hashing interfered with get requests).

When the system is busy, all threads are working off the pool queues, and no queue is empty, it all boils down to the processors' queues. In a sense, all thread pool queues "merge" into processor queues, with priority given by the relative number of threads assigned to each pool. So introducing a new pool decreases the throughput of existing pools in proportion to the relative sizes, in exchange for a guaranteed minimum throughput on the new pool, because the new queue is not populated with any other types of work (compared to reusing an existing pool that handles other work types).

To summarise, I would generally introduce a thread pool when I want more control over the rate of some work (both the minimum and the maximum rate).

That's the thought process about thread pool queues on which I've approved the PR. I believe a new thread pool is the right decision, but I concede that we've based the sizing decision on rough external authn rates without also consulting the folks who tune the existing thread pools.

Member


Okay, thanks for the great explanations to help me understand the perspective here. One more question: is it enough to have a simple "security" thread pool? Do we need a token and a crypto thread pool and maybe another thread pool for other security aspects in the future?

Member Author


TL;DR: We are not considering another thread pool for other security aspects.

Our intention for the new thread pool is to focus on "CPU-intensive" security computations, namely hashing and encryption, especially in the scope of authentication. Given their cost and importance, we'd like to control their impact in both directions: they should not take over all system resources, but at the same time they should maintain a reasonable level of throughput. We use the following criteria to decide whether an operation should be added to this pool:

  • Is it expensive? The standard here is somewhat subjective. Based on performance tests, I would personally look at anything longer than 1 ms.
  • Is it authentication/authorization related? We'd like to prioritize these operations since they usually need to complete in a timely manner.
  • Does it need to scale, i.e. could it see bursts in volume? When there is a large increase in concurrency, we'd like to protect the system from being completely flooded.

With the above criteria, we come up with the following:

  • Hashing for uncached results is a great candidate. This includes API key auth (current PR, v7.9), API key creation (v7.9?), tokens, and username/password (planned, v7.10).
  • Signing and encryption for SAML assertions and OIDC claims are potential candidates (pending further performance checks).
  • Hashing for cached results does not qualify for the new thread pool since it is fast (SHA-256, ~3 microseconds; see the short illustration after this list).
  • Manipulation of security documents does not qualify either.
  • Encrypted snapshots, though potentially expensive, are not auth related and are unlikely to need to scale, so they do not qualify either.
  • An interesting case is reads/writes of the security index. When the index is not on the local node, these operations can be expensive (above 1 ms), and the case also matches the other two criteria. However, the system indices work (System index reads in separate threadpool #57936) is supposed to cover it, so we are not considering it here either.
  • Any other security aspects are not qualified.
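
A short standalone illustration of why the cached path is so cheap (plain JDK; the exact cache hasher Elasticsearch uses is configurable, and plain SHA-256 here is just for illustration):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// A single digest plus a constant-time comparison is why the cached path is cheap enough
// to stay on the GET thread.
public class CachedHashCheck {
    static boolean matchesCachedHash(char[] secret, byte[] cachedDigest) throws NoSuchAlgorithmException {
        byte[] presented = MessageDigest.getInstance("SHA-256")
            .digest(new String(secret).getBytes(StandardCharsets.UTF_8));
        return MessageDigest.isEqual(presented, cachedDigest);   // constant-time comparison
    }
}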

Now the question is whether the above unqualified security operations are worth having their own separate thread pool.

When security is enabled, security-related operations are ubiquitous, e.g. every request needs to go through the authentication and authorization flow. Currently our general practice is to "just let them run in the same thread", which could be get, transport_worker, etc. This strategy has served us reasonably well so far. But with the upcoming change to expose the service to an unprecedentedly large number of clients, we started noticing some issues revealed by the performance tests, which in turn led to this PR and a few other pieces of work in the pipeline. At this stage, based on what we know, we still believe the existing strategy works fine for security operations that do not qualify for the new thread pool, hence we do not plan to add another one.

For the sake of completeness, there are exceptions to the existing strategy. Outbound LDAP calls are submitted via the generic pool to avoid the risk of deadlock if they were submitted via the same pool used by the outbound requests. There is no sign of this being an issue so far; we had a brief discussion about it and decided to evaluate it more closely at a later time.

Contributor

@albertzaharovits albertzaharovits left a comment


LGTM! Besides a minor thing in tests, this is ready to 🚢.
Great job Yang!


@jaymode
Member

jaymode commented Jun 23, 2020

Was any consideration given to having a single thread pool for hashing/expensive operations within security?

@ywangd
Member Author

ywangd commented Jun 23, 2020

Was any consideration given to having a single thread pool for hashing/expensive operations within security?

I thought about it but didn't have an open discussion within the team. My concern was that initial authentication would be too slow, but this could also be a bias from doing performance tests, where the initial warm-up is evident. It might not be as big a concern for real use cases? Do you have an argument in favor of it?

@jaymode
Member

jaymode commented Jun 23, 2020

It might not be as a big concern for real use cases? Do you have an argument in favor of it?

In general, we have been very judicious about adding additional thread pools. This PR targets API keys, but we also have similar issues with native users, since we use a get request and then validate on the get thread pool there as well. Some hash verification can also occur on network threads in the case of the file realm. The reserved realm will also perform the hash verification on a network thread or a thread from the get thread pool, depending on the state of the security index. These operations are designed to take time, so if we are going to introduce a new thread pool, I think it would be wise to use a single one, to move these expensive operations out of the other pools and limit the amount of CPU that can be scheduled to service them.

Contributor

@tvernum tvernum left a comment


I'd like to see a test somewhere for the 429 response code behaviour.

We have a test for the ApiKeyService, but nothing that shows that EsRejectedExecutionException bubbles up into a 429 response. I think we need that.
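
One way to shape such a test, sketched standalone with plain JDK (illustrative only; the real test would drive ApiKeyService with a saturated crypto pool and assert on the REST response): a pool with one busy worker and a full one-slot queue must reject the next task, and that rejection is what gets reported to the client as 429.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Saturate a tiny pool deterministically, then confirm the next submission is rejected.
public class RejectionTestSketch {
    public static void main(String[] args) {
        ThreadPoolExecutor saturated = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        CountDownLatch release = new CountDownLatch(1);
        saturated.execute(() -> awaitQuietly(release));  // occupies the only worker
        saturated.execute(() -> { });                    // fills the only queue slot
        boolean rejected = false;
        try {
            saturated.execute(() -> { });                // must be rejected now
        } catch (RejectedExecutionException e) {
            rejected = true;                             // the real test asserts a 429 here
        }
        System.out.println("rejected = " + rejected);    // expected: true
        release.countDown();
        saturated.shutdown();
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}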

return List.of(
new FixedExecutorBuilder(settings, TokenService.THREAD_POOL_NAME, 1, 1000,
"xpack.security.authc.token.thread_pool", false),
new FixedExecutorBuilder(settings, ApiKeyService.THREAD_POOL_NAME,
Contributor


I think we should name this thread pool in a more generic way to reflect the intent that it be used for all password hashing.
Something like security-password-hash feels better to me than security-api-key

Member Author


As discussed, renamed it to security-crypto.

@tvernum
Contributor

tvernum commented Jun 24, 2020

Was any consideration given to having a single thread pool for hashing/expensive operations within security?

I thought about it but didn't have an open discussion within the team

I recall having exactly that conversation and thought that we agreed to move in that direction (not in this PR, but as a followup).

@ywangd
Member Author

ywangd commented Jun 24, 2020

Was any consideration given to having a single thread pool for hashing/expensive operations within security?

I thought about it but didn't have an open discussion within the team

I recall having exactly that conversation and thought that we agreed to move in that direction (not in this PR, but as a followup).

Sorry @jaymode, I misread your message. I thought you were asking for a thread pool with a single thread.... Tim is right. We have discussed this and agreed that we will consolidate security-related expensive operations into a single pool.

@ywangd
Member Author

ywangd commented Jun 29, 2020

@tvernum Thank you for your suggestion; a test for the 429 status code is now added.

@ywangd ywangd requested a review from tvernum June 29, 2020 06:48
Contributor

@tvernum tvernum left a comment


LGTM

Member

@jasontedor jasontedor left a comment


LGTM.

@ywangd ywangd merged commit 7dcfd45 into elastic:master Jul 6, 2020
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jul 6, 2020
…lastic#58090)

The PR introduces the following two changes:

Move API key validation into a new separate threadpool. The new threadpool is created separately with half of the available processors and a queue size of 1000. We could combine it with the existing TokenService's threadpool. Technically that is straightforward, but I am not sure whether it would be a rushed optimization, since I am not clear about the potential impact on the token service.

On threadpool saturation, it now fails with EsRejectedExecutionException, which in turn gives back a 429 instead of a 401 status code to users.
ywangd added a commit that referenced this pull request Jul 6, 2020
A small follow up for #58090 to correct the settings prefix
ywangd added a commit that referenced this pull request Jul 6, 2020
…58090) (#59047)
ywangd added a commit that referenced this pull request Jun 22, 2021
The changes in #74106 make API keys cached at creation time. This helps avoid the
expensive hashing operation on initial authentication when a request using the
key hits the same node that created the key. Since the more expensive hashing
at authentication time is handled by a dedicated "crypto" thread pool (#58090),
usage of the "crypto" thread pool is expected to be reduced.

This PR moves the hashing at creation time to the "crypto" thread pool so that
a usage level of the "crypto" thread pool similar to before #74106 is maintained. It
also has the benefit of avoiding costly operations on the transport_worker thread,
which is generally preferred.

Relates: #74106
Labels
>enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:Security Meta label for security team v7.9.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Avoid saturating GET thread pool with API key hash verification
8 participants