Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grace period for user profile activation #89566

Conversation

ywangd
Copy link
Member

@ywangd ywangd commented Aug 23, 2022

The user profile document is updated on each activate call even when
there is no actual content change because it always updates the
last_synchronized timestamp. This behaviour is intentional to track the
user's last login time (since Kibana calls to the activate API on user
login). Client must explicitly handle retries for version conflicts.
This is generally desirable. However, on each login there are often
multiple web components trying to call this API concurrently. This
results into more frequent version conflicts. Since these updates
occur in a short period of time, updating last_synchronized for each of
them does not really contribute a lot for tracking user login.

This PR introduces a grace period for the update behaviour (30 seconds
non-configurable) so that the update (on activate) is only performed
when either of the following is true:

  • There are actual content changes
  • Or it has been more than 30 seconds since last update

The user profile document is updated on each activate call even when
there is no actual content change because it always updates the
last_synchronized timestamp. This behaviour is intentional to track the
user's last login time (since Kibana calls to the activate API on user
login). Client must explicitly handls retry for version conflict error.
This is generally desirable. However, on each login there are often
multiple web components trying to call this API concurrently This
results into more frequent version conflict errors. Since these updates
occur in a short period of time, updating last_synchronized for each of
them does not really contribute too much for tracking user login.

This PR introduces a grace period for the update behaviour (30 seconds
non-configurable) so that the update (on activate) is only performed
when either of the following is true:

* There are actual content changes
* Or it has been more than 30 seconds since last update
@ywangd ywangd added >enhancement :Security/Security Security issues without another label v8.5.0 labels Aug 23, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Aug 23, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

Comment on lines 712 to 716
buildUpdateRequest(
profileDocument.uid(),
wrapProfileDocumentWithoutApplicationData(profileDocument),
RefreshPolicy.WAIT_UNTIL,
versionedDocument.primaryTerm,
versionedDocument.seqNo
newProfileDocument.uid(),
wrapProfileDocumentWithoutApplicationData(newProfileDocument),
RefreshPolicy.WAIT_UNTIL
),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No behaviour change here. The existing code does not fetch primaryTerm and seqNo for search result. Hence the net effect is that the update request just runs without a particular expectation for primaryTerm and seqNo. This is desirable because activate should just rely on the regular update mechanism instead of tying to any specific version. The change here is to make the intention explicit.

@ywangd ywangd marked this pull request as draft August 24, 2022 13:36
@ywangd
Copy link
Member Author

ywangd commented Aug 24, 2022

@albertzaharovits I just came to realise there are additional complexities in this change. So I changed this PR to draft. You don't need to review it for now. I'll ping you again once it's ready. Sorry for the inconvenience.

@ywangd
Copy link
Member Author

ywangd commented Aug 26, 2022

The options we have discussed so far (including the current on in this PR) are:

  1. For the update operation, catch version conflict exception. If an exception is caught, GET the document and check whether it requires update. If it does not require update, return the doc as is. Otherwise, re-throw the exception.
  2. GET the document right before the update, check whether it requires update. If it requires no update, return the doc as is. Otherweise, go ahead the update.
  3. Use scripted update which essentially move the "grace period" logic into the script.

One thing I'd like to re-emphasize is that the issue is not to avoid "version conflict" because Kibana will retry by default. The issue is to avoid "multiple consecutive version conflict" which can lead Kibana to eventually fail after exhausting all retries. Therefore all of the options can still encounter "version conflict" on first attempt. But their goal is to eliminate "version conflict" on the 2nd attempts so that Kibana is guaranteed to succeed with 3 attempts (1 + 2 retries).

Copy link
Contributor

@albertzaharovits albertzaharovits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle I think this is a reasonable approach, but I do have a questions.
First, let me check if I got the problem statement correctly: the problem is that the _security/profile/_activate API fails too often with document update conflic errors.
Hence we need to make the internal profile doc update request not fail in order for the _security/profile/_activate to be successful.
Also, the doc update must be conditional on the primary_term/seq_no so that the API can work concurrently.

The proosed solution is twofold:

  • use the update timestamp field from the profile doc in order to avoid updating the doc in case it was recently updated
  • use get after search to retrieve a more recent profile doc which has, in principle, a lower chance for update conflicts

My question is around the second point: if the doc updates use the wait_until refresh policy, shouldn't the search already return a recent doc?
In other words, why do we both need the doc get and wait_until refresh policy for updates?

@ywangd
Copy link
Member Author

ywangd commented Aug 29, 2022

if the doc updates use the wait_until refresh policy, shouldn't the search already return a recent doc?

The search returns the updated doc only if the searh request comes after the wait_util is completed. If a concurrent search comes in when another thread is still waiting for its wait_until. The search can get back a stale document. The sequence is something like the follows (timeline denoted as "T + X.Y seconds"):

  1. T-60.0: The profile document got updated (last_synchronized: T-60.0)
  2. T+0.0: Thread_1 tries to update the document
  3. T+0.1: The document is updated by Thread_1 (last_synchronized: T+0.1). Thread_1 is still waiting for the operation to complete (wait_until) so that the document is not visible to search yet (up to 1 second).
  4. T+0.2: Both Thread_2 and Thread_3 come in and try to update the document. They use search to get ahold of the document which is still the old version (last_synchronized: T-60.0).
  5. T+0.3: Both threads (2 & 3) decide to update the document because they see an old timestamp.
  6. T+0.4: One of Thread_2 and Thread_3 can possibly run into version conflict exception because they try to update the document concurrently. If there are more concurrent threads, the chance of version conflict would be higher (this is what happens with the load tests). This is what turns a "simple" grace period change into a more complicated concurrency issue and where my focus is on.

So the current change proposes that we GET the document in between step 3 and 4. Since GET is realtime, at step 4, both Thread_2 and Thread_3 should find out that the document is in fact recently updated (last_synchronized: T+0.1) and skip the update operations, thus avoid version conflict.

Other alternatives are listed here. At this point, I am actually thinking option 1 could be better. Because it seems easier to explain. Essentially it add a Step 6 to catch the version conflict and GET the document to decide whether it can be suppressed if the document already has the required changes.

@albertzaharovits
Copy link
Contributor

So the current change proposes that we GET the document in between step 3 and 4. Since GET is realtime, at step 4, both Thread_2 and Thread_3 should find out that the document is in fact recently updated (last_synchronized: T+0.1) and skip the update operations, thus avoid version conflict.

I see.
So this solves the version conflicts in the meantime between the successful profile activation and the time when the activation becomes visible to search.

At this point, I am actually thinking option 1 could be better. Because it seems easier to explain. Essentially it add a Step 6 to catch the version conflict and GET the document to decide whether it can be suppressed if the document already has the required changes.

I agree. I think it would be clearer what's the internal state that the get is designed to serve.
I must admit that, in this draft PR, the search followed by a get looks like a patchwork/best-effort/hope-this-works type of lenient logic.
Option 1) should also be more performant, assuming the time that a profile stays in the "activated but not yet visible for search" state is relatively small.

@ywangd
Copy link
Member Author

ywangd commented Aug 30, 2022

Thanks Albert! I will change this PR to go with option 1 and ping you again when it is ready.

Option 1) should also be more performant, assuming the time that a profile stays in the "activated but not yet visible for search" state is relatively small.

I don't think it would be more performant. Though it will have fewer reads (GET), it will lead to more (unnecessary) writes to the document. In the load test, the difference is 300 vs 14. More specifically, the setup is:

  • Check grace period with the search result
  • Go ahead with update if the document hasn't been updated for the grace period (based on the search result)
  • Update can be scucessful or it can fail with version conflict

The number of successful update is 300 and the number of version conflict is 400. When the code changes to GET doc before update, the number of successful update is only 14 and there is no version conflict. So the difference between Option 1 and 2 should roughly be:

  • Option 1(update then catch & get): 300 updates, 400 reads (for error handling)
  • Option 2 (get then udpate/skip): 14 updates, 700 reads (before update)

Intuitively, I'd think Option 2 is more performant (especially because every update is a Get+Index). I don't think it is a sufficient argument for Option 2. But I'd like to make the point clear. In theory, Option 3 (scripted update) would be the best of worlds if it is not because script can be affected by other settings.

@albertzaharovits
Copy link
Contributor

albertzaharovits commented Aug 30, 2022

Intuitively, I'd think Option 2 is more performant (especially because every update is a Get+Index). I don't think it is a sufficient argument for Option 2. But I'd like to make the point clear.

Hmm, Option 1 trades failed writes, because of if_seq_no condition, for doc get. We might need to race them, but my intuition is that the failed write is cheaper.

@albertzaharovits
Copy link
Contributor

I will change this PR to go with option 1 and ping you again when it is ready.

Thank you, I really think it would be an improvement.

@ywangd ywangd marked this pull request as ready for review August 31, 2022 02:32
@ywangd
Copy link
Member Author

ywangd commented Aug 31, 2022

@albertzaharovits Thanks again for helping me out on shaping this PR. It's in a much better state now and ready for review!

I ran the load test locally for the current change and I cannot tell any real performance difference. The number of successful logins from Kibana is very close (less than 20 with a total of ~8800). The new approach has much better semantics which is a strong reason for choosing it over the other option. Thanks for the suggestion!

Copy link
Contributor

@albertzaharovits albertzaharovits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Nice tests, as always.

// to avoid potential excessive version conflicts
boolean shouldSkipUpdateForActivate(ProfileDocument currentProfileDocument, ProfileDocument newProfileDocument) {
assert newProfileDocument.enabled() : "new profile document must be enabled";
if (newProfileDocument.user().equals(currentProfileDocument.user())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the decision to not skip the profile activation if not through the same ES node.
I would've opted for the opposite, I don't think the node that's hit by the activate call is relevant profile information. I acknowledge that implementing it, in this PR, would be nasty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can definitely adjust this in a future PR if necessary. I chose the current behaviour for safety over efficiency. Also like you said, the code looks quite ugly when we cannot simply rely on user.equals. The logic is simple enough though. What we have in this PR has done the harder part. So it should be fairly easy to adjust for the equality check. I'll circle this back to the kibana team and maybe raise a follow-up PR. Thanks!

@ywangd ywangd added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 1, 2022
@elasticsearchmachine elasticsearchmachine merged commit 69a31da into elastic:main Sep 1, 2022
@ywangd ywangd deleted the grace-period-for-user-profile-activation branch September 1, 2022 01:56
ywangd added a commit that referenced this pull request Sep 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement :Security/Security Security issues without another label Team:Security Meta label for security team v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants