Grace period for user profile activation #89566
The user profile document is updated on each activate call even when there is no actual content change, because the call always updates the last_synchronized timestamp. This behaviour is intentional: it tracks the user's last login time (Kibana calls the activate API on user login). Clients must explicitly handle retries for version conflict errors. This is generally desirable. However, on each login there are often multiple web components calling this API concurrently, which results in more frequent version conflict errors. Since these updates occur within a short period of time, updating last_synchronized for each of them contributes little to tracking user login.
This PR introduces a grace period for the update behaviour (30 seconds, non-configurable) so that the update (on activate) is only performed when either of the following is true:
* There are actual content changes
* Or it has been more than 30 seconds since the last update
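The two conditions above can be sketched as a single predicate. This is a minimal illustration only, not the code in this PR: the class name `ActivateGracePeriodSketch`, the method `shouldSkipUpdate`, and the use of a plain string for the profile content are all assumptions made for the example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Objects;

// Hypothetical sketch; none of these names come from ProfileService.
final class ActivateGracePeriodSketch {
    // The PR description states a fixed, non-configurable 30-second grace period.
    static final Duration GRACE_PERIOD = Duration.ofSeconds(30);

    /**
     * Returns true when an activate call may skip writing the profile document:
     * the content is unchanged and the stored document was synchronized no more
     * than 30 seconds before "now".
     */
    static boolean shouldSkipUpdate(String currentContent, Instant lastSynchronized, String newContent, Instant now) {
        if (!Objects.equals(currentContent, newContent)) {
            return false; // actual content changes are always written
        }
        Duration sinceLastUpdate = Duration.between(lastSynchronized, now);
        // Update again only once more than 30 seconds have passed since the last update.
        return sinceLastUpdate.compareTo(GRACE_PERIOD) <= 0;
    }

    public static void main(String[] args) {
        Instant lastSync = Instant.parse("2022-08-24T00:00:00Z");
        System.out.println(shouldSkipUpdate("alice", lastSync, "alice", lastSync.plusSeconds(10)));
        System.out.println(shouldSkipUpdate("alice", lastSync, "alice", lastSync.plusSeconds(31)));
    }
}
```

With this shape, several concurrent activate calls issued during one login would collapse into at most one write per 30-second window, unless the user content itself differs.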
Pinging @elastic/es-security (Team:Security)
Hi @ywangd, I've created a changelog YAML for you.
  buildUpdateRequest(
-     profileDocument.uid(),
-     wrapProfileDocumentWithoutApplicationData(profileDocument),
-     RefreshPolicy.WAIT_UNTIL,
-     versionedDocument.primaryTerm,
-     versionedDocument.seqNo
+     newProfileDocument.uid(),
+     wrapProfileDocumentWithoutApplicationData(newProfileDocument),
+     RefreshPolicy.WAIT_UNTIL
  ),
No behaviour change here. The existing code does not fetch primaryTerm and seqNo for the search result. Hence the net effect is that the update request just runs without a particular expectation for primaryTerm and seqNo. This is desirable because activate should just rely on the regular update mechanism instead of tying itself to any specific version. The change here is to make the intention explicit.
@albertzaharovits I just came to realise there are additional complexities in this change. So I changed this PR to draft. You don't need to review it for now. I'll ping you again once it's ready. Sorry for the inconvenience. |
...k/plugin/security/src/main/java/org/elasticsearch/xpack/security/profile/ProfileService.java
The options we have discussed so far (including the current one in this PR) are:
One thing I'd like to re-emphasize is that the goal is not to avoid "version conflict" altogether, because Kibana retries by default. The goal is to avoid "multiple consecutive version conflicts", which can lead Kibana to eventually fail after exhausting all retries. Therefore all of the options can still encounter a "version conflict" on the first attempt. But their goal is to eliminate "version conflict" on the 2nd attempt so that Kibana is guaranteed to succeed within 3 attempts (1 + 2 retries).
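To make the "1 + 2 retries" budget concrete, here is a small sketch of a client-side retry loop. It is not Kibana's actual retry code: `RetrySketch`, `callWithRetries`, and `VersionConflictException` are hypothetical names introduced only for illustration.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a client that retries on version conflicts.
final class RetrySketch {
    /** Hypothetical stand-in for a version conflict error returned by the API. */
    static final class VersionConflictException extends RuntimeException {
        VersionConflictException(String message) {
            super(message);
        }
    }

    /** Runs the action up to maxAttempts times in total, retrying only on version conflicts. */
    static <T> T callWithRetries(Supplier<T> action, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        VersionConflictException lastConflict = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (VersionConflictException e) {
                lastConflict = e; // conflict: loop again if attempts remain
            }
        }
        throw lastConflict; // every attempt conflicted, so the caller sees a failure
    }
}
```

Under this model, a single conflict on the first attempt is harmless. The failure mode discussed here is the one where all three attempts conflict and the last exception is rethrown.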
In principle I think this is a reasonable approach, but I do have a question.
First, let me check if I got the problem statement correctly: the problem is that the _security/profile/_activate API fails too often with document update conflict errors.
Hence we need to make the internal profile doc update request not fail in order for the _security/profile/_activate call to be successful.
Also, the doc update must be conditional on the primary_term/seq_no so that the API can work concurrently.
The proposed solution is twofold:
- use the update timestamp field from the profile doc in order to avoid updating the doc in case it was recently updated
- use get after search to retrieve a more recent profile doc which has, in principle, a lower chance for update conflicts
My question is around the second point: if the doc updates use the wait_until refresh policy, shouldn't the search already return a recent doc?
In other words, why do we need both the doc get and the wait_until refresh policy for updates?
The search returns the updated doc only if the search request comes after the
So the current change proposes that we GET the document in between step 3 and 4. Since GET is realtime, at step 4, both Thread_2 and Thread_3 should find out that the document is in fact recently updated ( Other alternatives are listed here. At this point, I am actually thinking option 1 could be better, because it seems easier to explain. Essentially it adds a Step 6 to catch the version conflict and GET the document to decide whether the conflict can be suppressed because the document already has the required changes.
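The "catch the version conflict, then GET" idea can be sketched against a tiny in-memory store. Everything here is an assumption made for illustration: `ConflictSuppressionSketch`, `Store`, `Doc`, and the explicit version check are hypothetical and do not mirror the actual ProfileService code or the Elasticsearch update API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Objects;

// Hypothetical sketch of suppressing a version conflict after a realtime read.
final class ConflictSuppressionSketch {
    static final Duration GRACE_PERIOD = Duration.ofSeconds(30);

    static final class VersionConflictException extends RuntimeException {}

    /** Hypothetical versioned document. */
    static final class Doc {
        final String content;
        final Instant lastSynchronized;
        final long version;

        Doc(String content, Instant lastSynchronized, long version) {
            this.content = content;
            this.lastSynchronized = lastSynchronized;
            this.version = version;
        }
    }

    /** Hypothetical single-document store with an optimistic version check. */
    static final class Store {
        private Doc doc;

        Store(Doc initial) {
            this.doc = initial;
        }

        synchronized Doc get() {
            return doc;
        }

        synchronized void update(String content, Instant now, long expectedVersion) {
            if (doc.version != expectedVersion) {
                throw new VersionConflictException();
            }
            doc = new Doc(content, now, doc.version + 1);
        }
    }

    /** Returns true if this call wrote the document, false if a conflict was suppressed. */
    static boolean activate(Store store, Doc knownDoc, String content, Instant now) {
        try {
            store.update(content, now, knownDoc.version);
            return true;
        } catch (VersionConflictException e) {
            Doc current = store.get(); // realtime read after the conflict
            boolean sameContent = Objects.equals(current.content, content);
            boolean recentlyUpdated = Duration.between(current.lastSynchronized, now).compareTo(GRACE_PERIOD) <= 0;
            if (sameContent && recentlyUpdated) {
                return false; // a concurrent writer already applied the required changes
            }
            throw e; // a genuine conflict still surfaces to the caller
        }
    }
}
```

In this sketch the extra read is only paid on the conflict path, while the happy path stays a single write attempt.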
I see.
I agree. I think it would be clearer what's the internal state that the get is designed to serve. |
Thanks Albert! I will change this PR to go with option 1 and ping you again when it is ready.
I don't think it would be more performant. Though it will have fewer reads (GET), it will lead to more (unnecessary) writes to the document. In the load test, the difference is 300 vs 14. More specifically, the setup is:
The number of successful updates is 300 and the number of version conflicts is 400. When the code changes to GET the doc before update, the number of successful updates is only 14 and there are no version conflicts. So the difference between Option 1 and 2 should roughly be:
Intuitively, I'd think Option 2 is more performant (especially because every update is a Get+Index). I don't think it is a sufficient argument for Option 2, but I'd like to make the point clear. In theory, Option 3 (scripted update) would be the best of both worlds, were it not that scripts can be affected by other settings.
…-profile-activation
Hmm, Option 1 trades failed writes, because of |
Thank you, I really think it would be an improvement. |
@albertzaharovits Thanks again for helping me shape this PR. It's in a much better state now and ready for review! I ran the load test locally for the current change and I cannot tell any real performance difference. The number of successful logins from Kibana is very close (a difference of less than 20 out of a total of ~8800). The new approach has much better semantics, which is a strong reason for choosing it over the other option. Thanks for the suggestion!
LGTM
Nice tests, as always.
// to avoid potential excessive version conflicts
boolean shouldSkipUpdateForActivate(ProfileDocument currentProfileDocument, ProfileDocument newProfileDocument) {
    assert newProfileDocument.enabled() : "new profile document must be enabled";
    if (newProfileDocument.user().equals(currentProfileDocument.user())
I'm OK with the decision to not skip the profile activation if not through the same ES node.
I would've opted for the opposite, I don't think the node that's hit by the activate call is relevant profile information. I acknowledge that implementing it, in this PR, would be nasty.
We can definitely adjust this in a future PR if necessary. I chose the current behaviour for safety over efficiency. Also, like you said, the code looks quite ugly when we cannot simply rely on user.equals. The logic is simple enough though. What we have in this PR has done the harder part, so it should be fairly easy to adjust the equality check. I'll circle back with the Kibana team on this and maybe raise a follow-up PR. Thanks!
The user profile document is updated on each activate call even when
there is no actual content change because it always updates the
last_synchronized timestamp. This behaviour is intentional to track the
user's last login time (since Kibana calls the activate API on user
login). Clients must explicitly handle retries for version conflicts.
This is generally desirable. However, on each login there are often
multiple web components trying to call this API concurrently. This
results in more frequent version conflicts. Since these updates
occur in a short period of time, updating last_synchronized for each of
them does not really contribute a lot to tracking user login.
This PR introduces a grace period for the update behaviour (30 seconds
non-configurable) so that the update (on activate) is only performed
when either of the following is true: