Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Code based member for cancel reason in WatchResponse #15058

Open
mitake opened this issue Dec 31, 2022 · 25 comments
Open

Add a Code based member for cancel reason in WatchResponse #15058

mitake opened this issue Dec 31, 2022 · 25 comments

Comments

@mitake
Copy link
Contributor

mitake commented Dec 31, 2022

What would you like to be added?

Currently WatchResponse message has a string type member CancelReason. The member is used for storing an error which happened during watch.
It's better to add a new Code based member for representing an error in the watch API.

Why is this needed?

Using the existing member for error handling code isn't good because of fragility and hard to track behavior of the etcd wire protocol.

See also:

@lavacat
Copy link

lavacat commented Jan 1, 2023

@mitake seams like you've been working on this code before, are you planning to contribute this change if approved?

How are you planning to handle different values of CancelReason? I don't think they map to grpc codes directly.

@mitake
Copy link
Contributor Author

mitake commented Jan 4, 2023

@lavacat Yes I'm planning to work on, but if other people can work on this issue, it's fine for me.
I checked grpc codes again, and yeah it doesn't seem suitable for the purpose. Having a new int32 member and define etcd specific error codes might be good? If you have suggestions, I'd like to know.

@ahrtr
Copy link
Member

ahrtr commented Jan 6, 2023

The direction of adding a Code (e.g. ReasonCode) looks good to me. Please evaluate the impact and provide a design firstly.

Since it changes the protocol between client and server, so we need K8s folks to review the changes as well. cc @ptabor @serathius @liggitt @dims @lavalamp @deads2k as well.

@liggitt
Copy link
Contributor

liggitt commented Jan 6, 2023

a new optional (since old servers won't send it) code field the client reacts to if present seems like a minimal and safe approach

@ahrtr
Copy link
Member

ahrtr commented Jan 6, 2023

a new optional (since old servers won't send it) code field the client reacts to if present seems like a minimal and safe approach

Agree. It would be something like below,

$ git diff
diff --git a/api/etcdserverpb/rpc.proto b/api/etcdserverpb/rpc.proto
index 9cdc0b37f..7b37b1556 100644
--- a/api/etcdserverpb/rpc.proto
+++ b/api/etcdserverpb/rpc.proto
@@ -834,6 +834,8 @@ message WatchResponse {
   // framgment is true if large watch response was split over multiple responses.
   bool fragment = 7 [(versionpb.etcd_version_field)="3.4"];
 
+  optional uint32 reason_code = 8 [(gogoproto.nullable) = true];
+
   repeated mvccpb.Event events = 11;
 }

@ahrtr
Copy link
Member

ahrtr commented Jan 7, 2023

@mitake @lavacat or anyone else please feel free to work on this task if you have bandwidth and be interested. Please drop a message before you get started.

@lavacat
Copy link

lavacat commented Jan 8, 2023

@mitake if you don't mind, I'll work on this task, since we have an internal use case for it and we also might contribute retry logic to jetcd client.

@mitake
Copy link
Contributor Author

mitake commented Jan 9, 2023

@lavacat Thanks a lot! If you can work on it, it's really helpful.

@lavacat
Copy link

lavacat commented Jan 21, 2023

@mitake @ahrtr I've tried implementing reason codes, but it feels cumbersome to have both reasonCode and cancelReason string. Here are some concerns:

  1. All other response objects have string error. StatusResponse, LeaseGrantResponse. So adding codes for WatchResponse isn't consistent.
  2. errors defined in mvcc/watcher.go I've marked with InternalWatchError code. Not sure it's worth defining a code for each of those.

How about we add retry flag to WatchResponse to signal that watch can be retried without exposing codes?

I've added both reason_code and retry in this branch for demo. I'm ok with proceeding with original plan but want to run retry flag idea by you.

@ahrtr
Copy link
Member

ahrtr commented Jan 21, 2023

My immediate feeling is adding retry is a good idea because It's much simpler. It should have no any impact on K8s either. It's false by default, and only explicitly set to true on etcdserver side when needed.

@mitake
Copy link
Contributor Author

mitake commented Jan 23, 2023

I also feel retry might be fine, but adding reasonCode to the response and making cancelReason not a suggested thing to access will be good for letting clients not rely on cancelReason in future (I'm not sure it's valuable to introduce similar change to other response messages). What do you think @lavacat @ahrtr ?

@ahrtr
Copy link
Member

ahrtr commented Jan 23, 2023

I also feel retry might be fine, but adding reasonCode to the response and making cancelReason not a suggested thing to access will be good for letting clients not rely on cancelReason in future (I'm not sure it's valuable to introduce similar change to other response messages). What do you think @lavacat @ahrtr ?

I think adding field something like errorCode for all messages may be a long-term goal, but it's a big change, and we should document all the available Codes as well, just in the same way as HTTP does, e.g 4xx means bad request, and 5xx means service side internal errors.

Note Kubernetes describes working against a storage layer that implements the etcd wire protocol. FYI. kubernetes/kubernetes#114403 (comment)

Even we only add reasonCode for WatchResponse, we also need to expose and clearly document the reason Codes. Once we add them, then we can't easily change them in order to be backward compatible. The good side is it's extensible and flexible. On the contrary, adding retry is a good workaround for this case, because it's much simpler and just a minor change.

In summary, if we follow the direction of adding field something like errorCode or reasonCode for all messages, we need to evaluate the impact and efforts, and also whether it's really needed or necessary. Otherwise, we'd better add retry or reasonCode only for WatchResponse for now.

@lavacat
Copy link

lavacat commented Jan 24, 2023

@mitake main problem with codes is that they add extra duplication to exposed rpctypes. For example there will be new AuthOldRevision code and existing ErrAuthOldRevision. Also errors from mvcc/watcher.go will have to get new codes or be aggregated under InternalWatchError. See lavacat@e4e4e71

From client perspective the only actionable use-case is retry on InvalidAuthToken and AuthOldRevision, so exposing that as a flag is simpler. I agree, it's less flexible, but are there other hypothetical scenarios when code will be useful?

I'm open to reasonCode route, but just want to call out extra boilerplate it will add.

@mitake
Copy link
Contributor Author

mitake commented Jan 24, 2023

@ahrtr @lavacat Thanks for sharing your thoughts. The current concrete issue and possible action for clients is retrying on InvalidAuthToken and AuthOldRevision, and maintaining the codes will be costly. I agree with the approach based on retry.

lavacat pushed a commit to lavacat/etcd that referenced this issue Feb 7, 2023
Problem: client relies on string value of CancelReason to
determine if watcher should be retried. This is too fragile.
See etcd-io#14992

Solution: add explicit `retry` flag

fixes: etcd-io#15058

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
lavacat pushed a commit to lavacat/etcd that referenced this issue Feb 8, 2023
Problem: client relies on string value of CancelReason to
determine if watcher should be retried. This is too fragile.
See etcd-io#14992

Solution: add explicit `retry` flag

fixes: etcd-io#15058

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
@serathius
Copy link
Member

cc @logicalhan

@serathius
Copy link
Member

I don't treat couple of comments as a proper agreement on what we should do next, especially when the proposal changed mid way. Let's at least to semi-formal enhancement proposal before moving forward.

Please write a dedicated comment that will include all relevant information about the current proposal. How exactly you want to change proto, how you expect it will be used by users of new client, what are the alternatives, why your solution is better than alternatives. If you want to propose a backport like in #15253, there should a clear in proposal which should include asses impact of backport. There should be clear agreement from all stakeholders that this is best way to move forward.

@lavacat
Copy link

lavacat commented Feb 9, 2023

Sounds good. I think this requires google doc to summarize everything and allow folks to comment. I'll put it together.

@chaochn47
Copy link
Member

chaochn47 commented Mar 27, 2023

Is there any update on this? @lavacat

Related issue: #7988 when refactoring on common auth tests on watch API.

@lavacat
Copy link

lavacat commented Mar 30, 2023

@chaochn47 I've created a proposal. I'd like to polish but didn't have time.
https://docs.google.com/document/d/12uoUM6tyn5BX9LvYrewnEuR0U9MpiVUdv3agumrPlj0/edit#heading=h.kuoiyzt3m5ff

Feel free to comment what you'd like to see there.

@stale
Copy link

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@mitake
Copy link
Contributor Author

mitake commented Dec 25, 2023

hi @lavacat , do you have time for working on the proposal and #15253 ? It's great if we can finish before 3.6 because the change adds a new field to protobuf. If you are busy, I'd like to help.

@lavacat
Copy link

lavacat commented Jan 2, 2024

@mitake yes, I'll work on proposal this week or next. Agreed about finishing it before 3.6. Would appreciate your review.

@mitake
Copy link
Contributor Author

mitake commented Jan 9, 2024

@lavacat Thanks a lot for working on this! I think proposal 1 for 3.5 or earlier and proposal 3 for 3.6+ for keeping compatibility and simplifying the usage would be good.

@mitake
Copy link
Contributor Author

mitake commented Jan 28, 2024

hi @lavacat , do you have updates? Please let me know if I can help.

@ahrtr
Copy link
Member

ahrtr commented Feb 6, 2024

See the rough summary/plan in #17384 (comment)

Workarounds for the issue (use only one of the following two solutions):

  • Try to use TLS CN based auth
  • If the client is still using username:password, then re-create the client when running into the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

6 participants