Grace period for user profile activation #89566

ywangd · 2022-08-23T23:54:40Z

The user profile document is updated on each activate call even when
there is no actual content change because it always updates the
last_synchronized timestamp. This behaviour is intentional to track the
user's last login time (since Kibana calls to the activate API on user
login). Client must explicitly handle retries for version conflicts.
This is generally desirable. However, on each login there are often
multiple web components trying to call this API concurrently. This
results into more frequent version conflicts. Since these updates
occur in a short period of time, updating last_synchronized for each of
them does not really contribute a lot for tracking user login.

This PR introduces a grace period for the update behaviour (30 seconds
non-configurable) so that the update (on activate) is only performed
when either of the following is true:

There are actual content changes
Or it has been more than 30 seconds since last update

The user profile document is updated on each activate call even when there is no actual content change because it always updates the last_synchronized timestamp. This behaviour is intentional to track the user's last login time (since Kibana calls to the activate API on user login). Client must explicitly handls retry for version conflict error. This is generally desirable. However, on each login there are often multiple web components trying to call this API concurrently This results into more frequent version conflict errors. Since these updates occur in a short period of time, updating last_synchronized for each of them does not really contribute too much for tracking user login. This PR introduces a grace period for the update behaviour (30 seconds non-configurable) so that the update (on activate) is only performed when either of the following is true: * There are actual content changes * Or it has been more than 30 seconds since last update

elasticsearchmachine · 2022-08-23T23:55:03Z

Pinging @elastic/es-security (Team:Security)

elasticsearchmachine · 2022-08-23T23:55:04Z

Hi @ywangd, I've created a changelog YAML for you.

ywangd · 2022-08-24T09:39:05Z

...k/plugin/security/src/main/java/org/elasticsearch/xpack/security/profile/ProfileService.java

            buildUpdateRequest(
-                profileDocument.uid(),
-                wrapProfileDocumentWithoutApplicationData(profileDocument),
-                RefreshPolicy.WAIT_UNTIL,
-                versionedDocument.primaryTerm,
-                versionedDocument.seqNo
+                newProfileDocument.uid(),
+                wrapProfileDocumentWithoutApplicationData(newProfileDocument),
+                RefreshPolicy.WAIT_UNTIL
            ),


No behaviour change here. The existing code does not fetch primaryTerm and seqNo for search result. Hence the net effect is that the update request just runs without a particular expectation for primaryTerm and seqNo. This is desirable because activate should just rely on the regular update mechanism instead of tying to any specific version. The change here is to make the intention explicit.

ywangd · 2022-08-24T13:38:48Z

@albertzaharovits I just came to realise there are additional complexities in this change. So I changed this PR to draft. You don't need to review it for now. I'll ping you again once it's ready. Sorry for the inconvenience.

...k/plugin/security/src/main/java/org/elasticsearch/xpack/security/profile/ProfileService.java

ywangd · 2022-08-26T05:33:43Z

The options we have discussed so far (including the current on in this PR) are:

For the update operation, catch version conflict exception. If an exception is caught, GET the document and check whether it requires update. If it does not require update, return the doc as is. Otherwise, re-throw the exception.
GET the document right before the update, check whether it requires update. If it requires no update, return the doc as is. Otherweise, go ahead the update.
Use scripted update which essentially move the "grace period" logic into the script.

One thing I'd like to re-emphasize is that the issue is not to avoid "version conflict" because Kibana will retry by default. The issue is to avoid "multiple consecutive version conflict" which can lead Kibana to eventually fail after exhausting all retries. Therefore all of the options can still encounter "version conflict" on first attempt. But their goal is to eliminate "version conflict" on the 2nd attempts so that Kibana is guaranteed to succeed with 3 attempts (1 + 2 retries).

albertzaharovits

In principle I think this is a reasonable approach, but I do have a questions.
First, let me check if I got the problem statement correctly: the problem is that the _security/profile/_activate API fails too often with document update conflic errors.
Hence we need to make the internal profile doc update request not fail in order for the _security/profile/_activate to be successful.
Also, the doc update must be conditional on the primary_term/seq_no so that the API can work concurrently.

The proosed solution is twofold:

use the update timestamp field from the profile doc in order to avoid updating the doc in case it was recently updated
use get after search to retrieve a more recent profile doc which has, in principle, a lower chance for update conflicts

My question is around the second point: if the doc updates use the wait_until refresh policy, shouldn't the search already return a recent doc?
In other words, why do we both need the doc get and wait_until refresh policy for updates?

ywangd · 2022-08-29T03:20:30Z

if the doc updates use the wait_until refresh policy, shouldn't the search already return a recent doc?

The search returns the updated doc only if the searh request comes after the wait_util is completed. If a concurrent search comes in when another thread is still waiting for its wait_until. The search can get back a stale document. The sequence is something like the follows (timeline denoted as "T + X.Y seconds"):

T-60.0: The profile document got updated (last_synchronized: T-60.0)
T+0.0: Thread_1 tries to update the document
T+0.1: The document is updated by Thread_1 (last_synchronized: T+0.1). Thread_1 is still waiting for the operation to complete (wait_until) so that the document is not visible to search yet (up to 1 second).
T+0.2: Both Thread_2 and Thread_3 come in and try to update the document. They use search to get ahold of the document which is still the old version (last_synchronized: T-60.0).
T+0.3: Both threads (2 & 3) decide to update the document because they see an old timestamp.
T+0.4: One of Thread_2 and Thread_3 can possibly run into version conflict exception because they try to update the document concurrently. If there are more concurrent threads, the chance of version conflict would be higher (this is what happens with the load tests). This is what turns a "simple" grace period change into a more complicated concurrency issue and where my focus is on.

So the current change proposes that we GET the document in between step 3 and 4. Since GET is realtime, at step 4, both Thread_2 and Thread_3 should find out that the document is in fact recently updated (last_synchronized: T+0.1) and skip the update operations, thus avoid version conflict.

Other alternatives are listed here. At this point, I am actually thinking option 1 could be better. Because it seems easier to explain. Essentially it add a Step 6 to catch the version conflict and GET the document to decide whether it can be suppressed if the document already has the required changes.

albertzaharovits · 2022-08-29T16:12:04Z

So the current change proposes that we GET the document in between step 3 and 4. Since GET is realtime, at step 4, both Thread_2 and Thread_3 should find out that the document is in fact recently updated (last_synchronized: T+0.1) and skip the update operations, thus avoid version conflict.

I see.
So this solves the version conflicts in the meantime between the successful profile activation and the time when the activation becomes visible to search.

At this point, I am actually thinking option 1 could be better. Because it seems easier to explain. Essentially it add a Step 6 to catch the version conflict and GET the document to decide whether it can be suppressed if the document already has the required changes.

I agree. I think it would be clearer what's the internal state that the get is designed to serve.
I must admit that, in this draft PR, the search followed by a get looks like a patchwork/best-effort/hope-this-works type of lenient logic.
Option 1) should also be more performant, assuming the time that a profile stays in the "activated but not yet visible for search" state is relatively small.

ywangd · 2022-08-30T02:37:32Z

Thanks Albert! I will change this PR to go with option 1 and ping you again when it is ready.

Option 1) should also be more performant, assuming the time that a profile stays in the "activated but not yet visible for search" state is relatively small.

I don't think it would be more performant. Though it will have fewer reads (GET), it will lead to more (unnecessary) writes to the document. In the load test, the difference is 300 vs 14. More specifically, the setup is:

Check grace period with the search result
Go ahead with update if the document hasn't been updated for the grace period (based on the search result)
Update can be scucessful or it can fail with version conflict

The number of successful update is 300 and the number of version conflict is 400. When the code changes to GET doc before update, the number of successful update is only 14 and there is no version conflict. So the difference between Option 1 and 2 should roughly be:

Option 1(update then catch & get): 300 updates, 400 reads (for error handling)
Option 2 (get then udpate/skip): 14 updates, 700 reads (before update)

Intuitively, I'd think Option 2 is more performant (especially because every update is a Get+Index). I don't think it is a sufficient argument for Option 2. But I'd like to make the point clear. In theory, Option 3 (scripted update) would be the best of worlds if it is not because script can be affected by other settings.

…-profile-activation

albertzaharovits · 2022-08-30T17:23:51Z

Intuitively, I'd think Option 2 is more performant (especially because every update is a Get+Index). I don't think it is a sufficient argument for Option 2. But I'd like to make the point clear.

Hmm, Option 1 trades failed writes, because of if_seq_no condition, for doc get. We might need to race them, but my intuition is that the failed write is cheaper.

albertzaharovits · 2022-08-30T17:25:21Z

I will change this PR to go with option 1 and ping you again when it is ready.

Thank you, I really think it would be an improvement.

…-profile-activation

ywangd · 2022-08-31T02:38:02Z

@albertzaharovits Thanks again for helping me out on shaping this PR. It's in a much better state now and ready for review!

I ran the load test locally for the current change and I cannot tell any real performance difference. The number of successful logins from Kibana is very close (less than 20 with a total of ~8800). The new approach has much better semantics which is a strong reason for choosing it over the other option. Thanks for the suggestion!

albertzaharovits

LGTM
Nice tests, as always.

albertzaharovits · 2022-08-31T16:58:02Z

...k/plugin/security/src/main/java/org/elasticsearch/xpack/security/profile/ProfileService.java

+    // to avoid potential excessive version conflicts
+    boolean shouldSkipUpdateForActivate(ProfileDocument currentProfileDocument, ProfileDocument newProfileDocument) {
+        assert newProfileDocument.enabled() : "new profile document must be enabled";
+        if (newProfileDocument.user().equals(currentProfileDocument.user())


I'm OK with the decision to not skip the profile activation if not through the same ES node.
I would've opted for the opposite, I don't think the node that's hit by the activate call is relevant profile information. I acknowledge that implementing it, in this PR, would be nasty.

We can definitely adjust this in a future PR if necessary. I chose the current behaviour for safety over efficiency. Also like you said, the code looks quite ugly when we cannot simply rely on user.equals. The logic is simple enough though. What we have in this PR has done the harder part. So it should be fairly easy to adjust for the equality check. I'll circle this back to the kibana team and maybe raise a follow-up PR. Thanks!

...k/plugin/security/src/main/java/org/elasticsearch/xpack/security/profile/ProfileService.java

...ity/qa/profile/src/javaRestTest/java/org/elasticsearch/xpack/security/profile/ProfileIT.java

...ternalClusterTest/java/org/elasticsearch/xpack/security/profile/ProfileDomainIntegTests.java

Relates: #89566

ywangd added 3 commits August 24, 2022 09:23

more test

8909181

tweak

e32506a

ywangd added >enhancement :Security/Security Security issues without another label v8.5.0 labels Aug 23, 2022

ywangd requested a review from albertzaharovits August 23, 2022 23:54

elasticsearchmachine added the Team:Security Meta label for security team label Aug 23, 2022

Update docs/changelog/89566.yaml

081bb1b

tweak

a7eb57e

ywangd commented Aug 24, 2022

View reviewed changes

ywangd marked this pull request as draft August 24, 2022 13:36

GET document again before update to check whether skip is possible.

854562f

ywangd commented Aug 25, 2022

View reviewed changes

...k/plugin/security/src/main/java/org/elasticsearch/xpack/security/profile/ProfileService.java Outdated Show resolved Hide resolved

albertzaharovits reviewed Aug 26, 2022

View reviewed changes

Merge remote-tracking branch 'origin/main' into grace-period-for-user…

573c18e

…-profile-activation

ywangd added 2 commits August 31, 2022 12:16

address feedback

23a33b7

Merge remote-tracking branch 'origin/main' into grace-period-for-user…

859252e

…-profile-activation

ywangd marked this pull request as ready for review August 31, 2022 02:32

ywangd requested a review from albertzaharovits August 31, 2022 02:33

albertzaharovits approved these changes Aug 31, 2022

View reviewed changes

ywangd added 2 commits September 1, 2022 11:10

address feedback

d4a0511

tweak

ce34a32

ywangd added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 1, 2022

elasticsearchmachine merged commit 69a31da into elastic:main Sep 1, 2022

ywangd deleted the grace-period-for-user-profile-activation branch September 1, 2022 01:56

ywangd added a commit that referenced this pull request Sep 1, 2022

[Test Mute] AwaitsFix #89764

936cde0

Relates: #89566

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grace period for user profile activation #89566

Grace period for user profile activation #89566

ywangd commented Aug 23, 2022

elasticsearchmachine commented Aug 23, 2022

elasticsearchmachine commented Aug 23, 2022

ywangd Aug 24, 2022

ywangd commented Aug 24, 2022

ywangd commented Aug 26, 2022

albertzaharovits left a comment

ywangd commented Aug 29, 2022 •

edited

Loading

albertzaharovits commented Aug 29, 2022

ywangd commented Aug 30, 2022

albertzaharovits commented Aug 30, 2022 •

edited

Loading

albertzaharovits commented Aug 30, 2022

ywangd commented Aug 31, 2022

albertzaharovits left a comment

albertzaharovits Aug 31, 2022

ywangd Sep 1, 2022

Grace period for user profile activation #89566

Grace period for user profile activation #89566

Conversation

ywangd commented Aug 23, 2022

elasticsearchmachine commented Aug 23, 2022

elasticsearchmachine commented Aug 23, 2022

ywangd Aug 24, 2022

Choose a reason for hiding this comment

ywangd commented Aug 24, 2022

ywangd commented Aug 26, 2022

albertzaharovits left a comment

Choose a reason for hiding this comment

ywangd commented Aug 29, 2022 • edited Loading

albertzaharovits commented Aug 29, 2022

ywangd commented Aug 30, 2022

albertzaharovits commented Aug 30, 2022 • edited Loading

albertzaharovits commented Aug 30, 2022

ywangd commented Aug 31, 2022

albertzaharovits left a comment

Choose a reason for hiding this comment

albertzaharovits Aug 31, 2022

Choose a reason for hiding this comment

ywangd Sep 1, 2022

Choose a reason for hiding this comment

ywangd commented Aug 29, 2022 •

edited

Loading

albertzaharovits commented Aug 30, 2022 •

edited

Loading