3.5.5: client auth failures on new member first startup #14571
Comments
Is it possible that a missing authStore.refreshRangePermCache() call in authStore.Recover() triggers the issue on first startup?
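A minimal sketch of the suspected gap, using simplified stand-in types rather than etcd's actual authStore internals: the in-memory range permission cache is rebuilt when auth state changes, but nothing rebuilds it when state is recovered from a snapshot, so a freshly added member starts with an empty cache.

```go
// Illustrative sketch only; names and fields are simplified stand-ins,
// not etcd's real auth store.
package authsketch

import "sync"

type permCache map[string][]string // user -> readable key prefixes

type authStore struct {
	mu        sync.Mutex
	enabled   bool
	users     map[string][]string // persisted user -> granted key prefixes
	rangePerm permCache           // in-memory cache derived from users
}

// refreshRangePermCache rebuilds the in-memory permission cache from the
// persisted users, analogous in spirit to etcd's authStore.refreshRangePermCache.
func (as *authStore) refreshRangePermCache() {
	cache := make(permCache, len(as.users))
	for user, prefixes := range as.users {
		cache[user] = append([]string(nil), prefixes...)
	}
	as.rangePerm = cache
}

// Recover restores auth state after a snapshot is applied. The point of this
// comment is that, without the refresh call at the end, rangePerm stays empty
// on a newly added member until the process is restarted.
func (as *authStore) Recover(users map[string][]string, enabled bool) {
	as.mu.Lock()
	defer as.mu.Unlock()
	as.users = users
	as.enabled = enabled
	as.refreshRangePermCache() // the call suggested here
}
```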
Seems like it fixed the issue. During troubleshooting I found that this issue leads to data inconsistency in the cluster.
@mitake @ahrtr @serathius Please have a look; it looks like 3.5.5 is prone to data inconsistencies in newly added members if auth is enabled. It is a regression compared to 3.5.4.
@veshij Thanks for raising this issue. Good catch! Your proposed change looks good to me. We previously fixed a similar issue in #14358, but missed this case. Please feel free to deliver a PR for this. Please note that this issue can be easily worked around. FYI: #14355 (comment)
Thank you, will cut a PR a bit later today.
@ahrtr Can we add more tests for auth-enabled mode? I looked at a couple of integration tests, but it seems none of them has auth enabled, e.g. https://github.com/etcd-io/etcd/blob/main/tests/integration/cluster_test.go
Please see #14574 (comment). The auth-related test cases should be included in
@euroelessar @yishuT @veshij Sorry for the issue... and thanks for fixing it. Let me check the PR.
Note that this is an important issue even though the fix is simple, because a newly added etcd member may run into a data inconsistency issue if auth is enabled. Note that a network partition might also lead to this issue: if users set up & enable auth during a network partition, then when the isolated member rejoins the cluster it may recover from a snapshot and so run into this issue as well. Usually I don't think users will do this (set up & enable auth during a network partition), so the chance of hitting this issue via a network partition should be low. Please refer to my summary in https://github.com/ahrtr/etcd-issues/tree/master/issues/14571
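To illustrate why the impact is a hard failure rather than a partial one: if the permission check consults only the in-memory cache (as the reference to #14227 in the original report suggests for 3.5.5), an empty cache after snapshot recovery denies every request. The names below are hypothetical, not etcd's actual implementation.

```go
// Hypothetical cache-only permission check, used for illustration.
package authsketch

import "strings"

type rangePermCache map[string][]string // user -> readable key prefixes

func (c rangePermCache) rangeOpPermitted(user, key string) bool {
	prefixes, ok := c[user]
	if !ok {
		// A member that recovered from a snapshot without refreshing the cache
		// lands here for every user, so all client requests are rejected until
		// a restart rebuilds the cache through the normal startup path.
		return false
	}
	for _, p := range prefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}
```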
Next steps:
1. Backport #14589 to 3.5 and probably 3.4. @biosvs Please let me know if you have bandwidth to do this.
2. Backport the e2e test case in #14574 to 3.5 and probably 3.4. @veshij Please let me know if you have bandwidth to do this. Of course, you need to do it after the above task is done.
Yep, I’ll backport once test cases are backported.
Oleg Guba
Just added an e2e test for 3.5. FYI: #14656
FYI, @chaochn47 and I are working on adding linearizability test cases for authorization. They should allow us to test other scenarios where an injected failure could cause an authorization issue. Following up on known data inconsistencies should also allow us to focus our work on features that had issues in the past.
Looks good to me. All related PRs are merged. I think we're ready to release 3.4.22 and 3.5.6. I will release 3.4.22 sometime in the following week. Please add your comments or concerns in #14651. @mitake @veshij @serathius @spzala @ptabor
FYI, v3.4.22 has just been released, and it includes the fix for this issue: https://github.com/etcd-io/etcd/releases/tag/v3.4.22 @serathius I suggest releasing 3.5.6 as well.
What happened?
After a cluster upgrade from 3.5.4 to 3.5.5 we are seeing auth issues on new nodes after first startup.
The cluster runs with auth enabled.
A new node joins the cluster and fetches the snapshot, but fails all client requests with:
An etcd restart fixes the issue and no requests fail after that.
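A rough sketch (not from the original report) of how the symptom can be checked from a client using clientv3; the endpoint, credentials, and key are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint and credentials for the newly added member.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://new-member.example.com:2379"},
		Username:    "root",
		Password:    "changeme",
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// On an affected 3.5.5 member this read fails with an auth error;
	// after restarting that member the same read succeeds.
	if _, err := cli.Get(ctx, "some-key"); err != nil {
		fmt.Println("request against new member failed:", err)
		return
	}
	fmt.Println("request against new member succeeded")
}
```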
What did you expect to happen?
The new node joins the cluster and starts serving requests.
How can we reproduce it (as minimally and precisely as possible)?
WIP
Anything else we need to know?
Might be related to https://github.com/etcd-io/etcd/pull/14227/files?
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response