Watch dropping an event when compacting on delete #18089
Comments
cc @MadhavJivrajani @siyuanfoundation @fuweid Can someone take a look and double check my findings? |
/assign |
I reran the report on the latest version of #17833 and it found that linearization was broken. Looks like a false positive due to some bug. |
Ok, ok I encountered another case. Failpoint used:
Report: 1717749910517706368.tar.gz
Found when adding compaction to operations: serathius@5959110
Logs showing the broken watch guarantee due to a missing log entry:
Screenshot showing a transaction for revision 141 and the fact that key "key6" was present, so the delete really happened. Watch response showing the missing delete event. Complete log:
|
@fuweid Think this is related to #17780 See the following logs:
Looks like we lost the first delete after restoring from the "last compact revision". |
Yes. Let me try to reproduce it in my local. Update it later. |
I managed to reproduce it without a multi-op TXN, just with a plain delete request. This extends the impact to Kubernetes. Again, the common thread is that etcd misses a watch event even though it restores the kvstore at a revision before the delete. In the last repro:
etcd bootstrap log:
|
Something similar reproduced on v3.4; however, this time it happened long after the crash, and the test properly flagged the resumable guarantee as broken. That just means etcd provided an incorrect event on a newly opened watch that specified a start revision.
|
For this one, the tombstone of
|
Correct. :( Currently, the watcher isn't able to send out all the possible delete events. |
Did the client side receive an ErrCompacted? If yes, then it should be expected behaviour.
|
No. If it had, it would be recorded in the report. |
@fuweid, I don't understand the statement. A distributed system needs to uphold its guarantees even if a node is down. Events in the WAL used for replay should match what users observe. For watch, the client should observe either the full history or get ErrCompacted, as @ahrtr mentioned. |
One thing I noticed is that the WatchChan is not broken during etcd downtime, so the downtime is short enough that the etcd client code retries the request transparently. There might be a bug there too. Still, there is a possibility of a bug in recording client responses; it is a little more complicated to record the history of watch than of KV requests.
etcd/tests/robustness/client/client.go Lines 268 to 304 in 8a0054f
|
@fuweid Please feel free to ping me on slack if you want a quick discussion on this issue. |
It's expected behaviour and by design as long as etcdserver doesn't return any error (e.g ErrCompacted). |
Added logs to ensure that the issue is not in the robustness test itself, and got: They clearly show that revision 125 was not provided to the client.
|
Where did you add the log? I also suggest providing a clear summary of the exact steps you took. |
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Code #18145 |
TL;DR
Delete events may be missing in the watch response. This issue can only be reproduced when the compact revision happens to be a delete operation.
How to reproduce this issue?
Obviously the delete operation at revision 5 is missing. Alternatively, if you want to try the robustness test, run the command below on #18145,
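The exact reproduction commands were not captured in this excerpt. As a stand-in, here is a hedged etcdctl session sketch matching the scenario described: a key deleted at revision 5, then compacted exactly at the delete revision. It assumes a fresh single-node etcd on the default endpoint, where the first put lands on revision 2; revision numbers will differ on an existing data dir.

```shell
# Assumption: fresh local etcd, default endpoint, clean data dir.
etcdctl put key1 v1   # revision 2
etcdctl put key1 v2   # revision 3
etcdctl put key1 v3   # revision 4
etcdctl del key1      # revision 5 -- the tombstone
etcdctl compact 5     # compact exactly at the delete revision

# With the bug, this watch silently never delivers the DELETE event at
# revision 5, and it does not return a compaction error either:
etcdctl watch key1 --rev=5
```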
Expected behaviour
As mentioned in #18089 (comment), some revisions may be compacted, and that may impact watch clients. But etcd should never drop any events quietly. Expected behaviour:
Root cause
Usually etcd keeps the compact revision (see code below). For example, when you compact revision 5, etcd actually only compacts revisions <= 4.
etcd/server/storage/mvcc/key_index.go Line 256 in 8a0054f
But if the compact revision is a delete operation, then etcd doesn't keep it (in other words, etcd will compact/remove it) (see code below).
etcd/server/storage/mvcc/key_index.go Lines 244 to 247 in 8a0054f
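The two behaviours above can be illustrated with a minimal, self-contained Go sketch. This is a deliberately simplified model, not etcd's actual keyIndex code; keyHistory, compact, and the field names are all hypothetical.

```go
package main

import "fmt"

// keyHistory is a simplified model of etcd's per-key revision bookkeeping.
type keyHistory struct {
	revs        []int64        // revisions of this key, ascending
	isTombstone map[int64]bool // true if the revision was a delete
}

// compact removes revisions <= atRev. As described above, the newest
// revision <= atRev is normally kept so the key is still readable at the
// compact revision -- unless that revision is a tombstone, in which case it
// is removed entirely. That removed tombstone is exactly the delete event a
// slow watcher can lose without ever seeing an error.
func (h *keyHistory) compact(atRev int64) (kept []int64, droppedTombstone bool) {
	var newest int64 = -1
	for _, r := range h.revs {
		if r <= atRev {
			newest = r // candidate to keep
		} else {
			kept = append(kept, r)
		}
	}
	if newest != -1 {
		if h.isTombstone[newest] {
			droppedTombstone = true // the delete at the compact revision vanishes
		} else {
			kept = append([]int64{newest}, kept...)
		}
	}
	h.revs = kept
	return kept, droppedTombstone
}

func main() {
	// A key written at revision 3 and deleted at revision 5,
	// then compacted exactly at the delete revision.
	h := &keyHistory{revs: []int64{3, 5}, isTombstone: map[int64]bool{5: true}}
	kept, dropped := h.compact(5)
	fmt.Println(kept, dropped) // [] true -- the tombstone at revision 5 is gone
}
```

Compacting the same history at revision 4 instead would keep revision 3 (the newest non-tombstone revision <= 4) plus the tombstone at 5, and no event would be lost.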
But the problem is that etcdserver doesn't return ErrCompacted in this case.
etcd/server/storage/mvcc/watcher_group.go Line 253 in 8a0054f
Note this isn't a regression; it's an old issue.
Potential solution
It's open to discussion. When executing the compaction operation, we should only keep the latest revision for each key. For example, if the latest revision of "k1" is 10 and the user compacts 10, then we should keep 10; otherwise, if the user compacts any revision <= 9, we just compact it, since there is no point in keeping it. Note that different keys may have different latest revisions. The
etcd/server/storage/mvcc/key_index.go Line 208 in 8a0054f
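The keep-only-the-latest-revision rule proposed above can be sketched in Go. compactKeepLatest is a hypothetical helper written for this illustration, not etcd's implementation; it operates on a key's ascending revision list.

```go
package main

import "fmt"

// compactKeepLatest sketches the proposed rule: when compacting at atRev,
// keep only the key's latest revision, and only when compaction would
// otherwise erase it (latest <= atRev); older revisions <= atRev are
// simply removed.
func compactKeepLatest(revs []int64, atRev int64) []int64 {
	if len(revs) == 0 {
		return nil
	}
	latest := revs[len(revs)-1] // revs is ascending, so the last entry is newest
	var kept []int64
	for _, r := range revs {
		if r > atRev {
			kept = append(kept, r)
		}
	}
	if latest <= atRev {
		// Keep the latest revision even if it is a tombstone, so a watcher
		// resuming at atRev can still learn that the key was deleted.
		kept = append([]int64{latest}, kept...)
	}
	return kept
}

func main() {
	// "k1" has revisions 3 and 10; 10 is its latest revision.
	fmt.Println(compactKeepLatest([]int64{3, 10}, 10)) // [10] -- latest is kept
	fmt.Println(compactKeepLatest([]int64{3, 10}, 5))  // [10] -- 3 is compacted away
}
```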
We also need to change the if condition to
etcd/server/storage/mvcc/watcher_group.go Line 253 in 8a0054f
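The effect of the condition change can be shown with a small sketch. checkWatcher is a hypothetical helper, not etcd's real watcher_group code: with the historical condition minRev < compactRev, a watcher whose next revision equals compactRev is treated as servable, yet when that exact revision was a delete whose tombstone compaction removed, the event is silently skipped. Tightening the check to minRev <= compactRev surfaces the compaction error instead, so the client knows it must resync.

```go
package main

import (
	"errors"
	"fmt"
)

// errCompacted stands in for etcd's mvcc.ErrCompacted.
var errCompacted = errors.New("mvcc: required revision has been compacted")

// checkWatcher compares the current (exclusive) condition with the
// proposed (inclusive) one for a watcher that next needs minRev.
func checkWatcher(minRev, compactRev int64, inclusive bool) error {
	if inclusive && minRev <= compactRev { // proposed condition
		return errCompacted
	}
	if !inclusive && minRev < compactRev { // current condition
		return errCompacted
	}
	return nil
}

func main() {
	// A watcher wants revision 5, and revision 5 itself was compacted away.
	fmt.Println(checkWatcher(5, 5, false)) // current: no error, event silently lost
	fmt.Println(checkWatcher(5, 5, true))  // proposed: compaction error returned
}
```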
So the watch client will get an ErrCompacted.
Impact
All versions (including 3.4, 3.5 and main) are impacted. Major, but it might not be easy to reproduce.
Immediate action
Create a simple e2e test to reliably reproduce this issue. Please refer to the section "How to reproduce this issue?" above.
Brainstorm the solution. |
Hi @serathius, sorry for the late reply. In this case, 1717749910517706368.tar.gz, both
IMO, the robustness replay assumes that the delete-operation event is sent out right after being applied to the db. Even if the compact request follows that delete request, the robustness replay always assumes that the client should receive that delete-operation event. However, in the real world, the delete-operation event depends on the compactor, which runs in the background and which we can't control in the replay. That's why I said that, in the current design, watch can't guarantee the client sees all the events.
@ahrtr will ping you tomorrow. Thanks! |
This should be pretty common; delete operations are pretty frequent. For example, pods are an immutable resource, meaning they are frequently created and then deleted, and never resurrected.
Compaction runs once every 5 minutes; ask anyone if they want to build a distributed system that makes a mistake every 5 minutes. I think the only saving grace is that this requires a slow watch that is reading events from before the compaction window. It should be pretty rare for a watch to be 5 minutes behind; however, I have recently seen a case where lock contention on the watch cache caused it to be behind by as much as 2 hours. |
@ahrtr confirmed the bug is present on both the release-3.4 and release-3.5 branches. Can someone check whether the bug was introduced recently (in one of the patch versions) or is an old one (tracing back to before v3.4)? It's important to know whether this is a regression in some newer version or was always there. If this is a new bug, we might need to consider an announcement. |
Obviously this isn't a regression. It's an old issue. Also updated my above comment.
Please see the updated section "Impact" above. added
|
Following the clear manual |
Thanks, @jmhbnz, for confirming this on v3.4.0! I appreciate everyone's insights so far. To clarify my earlier question, it's sometimes valuable to confirm even seemingly obvious details, just to ensure we're all on the same page and can make the most informed decisions going forward. Now that we know this isn't a recent regression, we can focus on understanding the underlying cause and potential solutions. |
Just had a quick talk with @serathius: we need to clearly define the behaviour of compaction, and also its impact on watch. Please feel free to comment in https://docs.google.com/document/d/1APJE38DxENsRLLVapRz1CQ6UD4-uYaFutO12Y01VcNw/edit Eventually we need to document the behaviour in etcd's official documentation. |
+1 on successfully reproducing the issue |
Actions
|
The code to handle compacted revisions in watch was written 8 years ago, and it never handled the
etcd/server/storage/mvcc/watcher_group.go Lines 253 to 262 in 5790774
|
Just as a thought exercise, how can an admin of an etcd cluster be alerted about dropped watch events in general? In kubernetes/kubernetes#123072,
However, this issue seems difficult to catch with alarms because just one delete event is missed. A diff of 1 could be caused by clients restarting at different times. I'm trying to monitor this in production and measure how many Kubernetes clusters are impacted. Any insights are much appreciated! |
Excellent question! Unfortunately, there's no simple way for etcd cluster admins to be directly alerted about dropped watch events in a production environment. The robustness tests achieve this by collecting comprehensive data (requests from all clients, WAL entries, etc.) and performing in-depth analysis. This level of data collection and analysis isn't feasible in production due to the sheer volume of information involved. This highlights the importance of rigorous, continuous robustness testing with diverse traffic types and failure injection. It allows us to thoroughly analyze smaller chunks of data in a controlled environment, identifying issues that might otherwise go unnoticed in the vastness of production data. On the Kubernetes side, I have some plans to improve watch cache observability. Stay tuned! |
Tried replicating #18089 (comment) in an e2e here: #18201 |
Thanks @MadhavJivrajani. We need to add more tests (or update existing tests); please see all the proposed test cases mentioned in #18162 (comment). Note I also added the test cases to the doc (under option 2). @fuweid @MadhavJivrajani Please put together a complete plan to resolve this issue, and feel free to ping me on Slack for any discussion. |
… other issues Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Disable robustness test detection of #18089 to allow detecting other issues
… other issues Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
…etecting other issues" This reverts commit 4fe227c. Signed-off-by: Wei Fu <fuweid89@gmail.com>
Bug report criteria
What happened?
When testing #17833 I encountered watch breaking reliable guarantee.
Report included
1716928418720891814.zip
What did you expect to happen?
Watch should not break guarantee
How can we reproduce it (as minimally and precisely as possible)?
TODO
Anything else we need to know?
No response
Etcd version (please run commands below)
v3.5
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output