Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use gRPC standard status error model #115

Merged
merged 1 commit into from
Nov 11, 2017

Conversation

saad-ali
Copy link
Member

@saad-ali saad-ali commented Oct 2, 2017

As discussed in #23

@akutz
Copy link
Contributor

akutz commented Oct 2, 2017

Hi @saad-ali,

A few things from a brief glance:

  1. I personally like PR descriptions to have more than just a link to the original issue, but I understand why people do it. Still, a short problem summary is always helpful.

  2. I would call this out in the description wayyy up front: Default value for backwards compatibility. I didn't see those in the diff; not until I viewed the file sans the diff mode. It's probably just me, but it was very confusing seeing those old error enums at first.

  3. It would also be useful to have the PR's description show the old behavior vs. the new behavior from a client's, and possibly server's, perspective.

Thanks for this hard work. This is going to make CSI even better and set a strong foundation on which to build future features!

@jieyu jieyu added this to the v0.1 milestone Oct 2, 2017
@cpuguy83
Copy link
Contributor

cpuguy83 commented Oct 3, 2017

We discussed some issues with a couple of cases that you felt didn't quite line up with grpc error codes, can you explain them?

We also discussed that while the protocol supports it, the various libraries will not let you return an error and some object (like more detailed/structured error data, or a volume handle for the AlreadyExists case).
It looks like the go library handles metadata via the context object (which you can set values on for requests and responses).
I'm not sure if metadata is the right place for this or should be considered at all (@stevvooe ?)

csi.proto Outdated
Result result = 1;
Error error = 2;
}
Result result = 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you have an empty message, go ahead and use google.protobuf.Empty.

Was there a reason for the extra encapsulation here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

spec.md Outdated
### Error Type/Codes
### Error Handling

All CSI API calls defined in this spec MUST return a [standard gRPC status](https://github.com/grpc/grpc/blob/master/src/proto/grpc/status/status.proto).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double check, but I think this is implied by the protocol on error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anywhere in the gRPC docs where it is prescribed, so I want it to be very explicit for CSI.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

All RPCs started at a client return a status object composed of an integer code and a string message.

There might be something more normative, elsewhere.

@stevvooe
Copy link

stevvooe commented Oct 3, 2017

@cpuguy83 You use the status package. It places a detail object on the header (effectively, metadata from grpc perspective) but it is transparent. Here is the same piece in java. Looks like python has support with Call.details().

We can figure out how to upstream changes for other client packages, if it is missing.

Let me know if that addresses the concern.

@saad-ali
Copy link
Member Author

saad-ali commented Oct 3, 2017

@akutz:

I personally like PR descriptions to have more than just a link to the original issue, but I understand why people do it. Still, a short problem summary is always helpful.

I agree! Wanted to get this out in front of everyone as a work-in-progress ASAP.

It would also be useful to have the PR's description show the old behavior vs. the new behavior from a client's, and possibly server's, perspective.

Yep, will do as soon as this is no longer WIP.

I would call this out in the description wayyy up front: Default value for backwards compatibility. I didn't see those in the diff; not until I viewed the file sans the diff mode. It's probably just me, but it was very confusing seeing those old error enums at first.

Not sure I follow.

@cpuguy83:

+1 to @stevvooe. As for the cases that we need to think through, the ones that involve idempotency: specifically CreateVolume, if it returns an error (AlreadyExists), then we may need to introduce a GetVolumeInfo(...) call (I'm leaning towards this)

Everyone:
I'm looking in to resusing google/rpc/error_details.proto for error details, instead of defining our own custom error details. Any objections to this?

@stevvooe
Copy link

stevvooe commented Oct 3, 2017

+1 to @stevvooe. As for the cases that we need to think through, the ones that involve idempotency: specifically CreateVolume, if it returns an error (AlreadyExists), then we may need to introduce a GetVolumeInfo(...) call (I'm leaning towards this)

I'd recommend adding that anyways. ;)

I'm looking in to resusing google/rpc/error_details.proto for error details, instead of defining our own custom error details. Any objections to this?

Check with the grpc team and see what the scope of this is. This looks like it is generally usable, but they may have other ideas.

spec.md Outdated
The status `message` MUST contain a human readable description of error, if the status `code` is not `OK`.
This string MAY be surfaced by CO to end users.
The status `details` MUST return the machine-parsable `Error` protobuf message defined below, if the status `code` is not `OK`.
This allows CO to implement smarter error handling and fault resolution.

```protobuf
message Error {
// General Error that MAY be returned by any RPC.
message GeneralError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we don't need to define the enum for general errors as we'll use the general grpc error code. Here is my suggestion:

  1. For each general error (that applies to all RPCs) code from grpc that we'll use, have a text to explain what that means in this section (like what we have for now), and define the corresponding error details struct. We can define a GeneralErrorDetails message that'll be reused by all general errors.
  2. For RPC specific errors, we should place the corresponding error explanation close to where the RPC is. We might want to define RPC specific error details type.

@stevvooe
Copy link

stevvooe commented Oct 6, 2017

@saad-ali While we're not using error_details.proto yet, containerd/containerd#1597 is an example of where we brought those protobufs in for use. Your project is a little different because you have generated go and protos in different places, but hopefully that helps with some hints to get started.

@akutz
Copy link
Contributor

akutz commented Oct 25, 2017

Hi All,

This, to me, is the most impactful, outstanding PR scheduled for inclusion in v0.1.0. If the latter is going to have any hope of being tagged in time for COs like Mesos and K8s to be compatible with plug-ins being actively developed, this PR must be resolved ASAP.

@jieyu
Copy link
Member

jieyu commented Oct 26, 2017

+1 to @akutz said. @saad-ali do you need any help here? I'd like to see this get merged asap.

@stevvooe
Copy link

@jieyu It looks like the only outstanding thing is the removal of the extra error codes that overlap with the standard error codes. Did you still need any help on my end?

@lpabon
Copy link

lpabon commented Nov 1, 2017

Some extra info for those googling this: http://avi.im/grpc-errors/

@lpabon
Copy link

lpabon commented Nov 2, 2017

Looks like the csi.pb.go needs to be updated from the csi.proto. Example:

csi.pb.go : csi.pb.go

@vladimirvivien
Copy link

So, if this PR adopts gRPC's standard error-signaling and error-handling types/mechanism, @saad-ali (et al) will the multi-level nested response (with Response, Reply) also go away in favor of a flatter model ?

I am guessing the oneof reply approach was adopted so that either a valid response or an error could be returned in one response as shown below.

message GetSupportedVersionsResponse {
  message Result {
    repeated Version supported_versions = 1;
  }

  // One of the following fields MUST be specified.
  oneof reply {
    Result result = 1;
    Error error = 2;
  }
}

So, if the Error message will no longer be part of the xxxResponse message, will responses be flatter as in something like:

message GetSupportedVersionsResponse {
    repeated Version supported_versions = 1;
}

@saad-ali
Copy link
Member Author

So, if this PR adopts gRPC's standard error-signaling and error-handling types/mechanism, @saad-ali (et al) will the multi-level nested response (with Response, Reply) also go away in favor of a flatter model ?

Yes!

@saad-ali
Copy link
Member Author

Ok, I finally got around to this PR! Apologies for the delay.

For now I decided not to have any custom error details proto. Instead I specified standard gRPC error codes for all existing error conditions. In the future, if we decide that error details are critical we can introduce them. I also changed the idempotent retries (e.g. volume already exists) to success (0 OK) instead of an error code.

PTAL, I'd like to get this sucker merged once and for all.

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Volume already exists | 0 OK | Indicates that a volume corresponding to the specified volume `name` already exists. | Caller MUST assume the `CreateVolume` call succeeded. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return an already exists error code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is on purpose. For idempotency if the volume already exists the request should succeed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on what @saad-ali said

If AlreadyExists is used, the CO might not be able to figure out the VolumeInfo for the already created volume. cc @julian-hj

Copy link
Contributor

@akutz akutz Nov 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree (or rather I agree with @cpuguy83). One part of the purpose behind this new error model was the ability to return both valid data and associated error codes. I want the RPC to return the error that indicates the volume already exists and the VolumeInfo for that volume in a single response.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly either way. Since both @akutz and @cpuguy83 perfer the error code. I'll change it to that.

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Unsupported `volume_capability` | 0 OK | Indicates that one or more of the specified `volume_capability` options (mount flags, volume type, FS type, etc.) are not supported. `ValidateVolumeCapabilitiesResponse.supported` MUST be `false`. `ValidateVolumeCapabilitiesResponse.message` must detail what volume capabilities are incompatible. | Caller MUST fix the `volume_capability` before retrying. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is OK correct here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@saad-ali saad-ali changed the title WIP: Use gRPC standard status error model Use gRPC standard status error model Nov 10, 2017
spec.md Outdated
@@ -248,13 +248,13 @@ service Controller {
returns (CreateVolumeResponse) {}

rpc DeleteVolume (DeleteVolumeRequest)
returns (DeleteVolumeResponse) {}
returns (Empty) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually prefer still keeping the empty DeleteVolumeResponse (and others). The reason for that is for future upgrade and backward compatibility. Say in the future, we want to add some field to the response of this RPC. Changing from Empty to DeleteVolumeResponse won't be backwards compatible, but add an optional field to DeleteVolumeResponse will be backward compatible (without the need for bumping a CSI version).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm conflicted on this. While I agree with much of @jieyu's reasoning, this bit strikes me in a funny way:

add an optional field to DeleteVolumeResponse will be backward compatible (without the need for bumping a CSI version).

While this is technically true, if there is additional information added to otherwise empty responses, is bumping the version such a bad thing? Otherwise the version wouldn't reflect the addition of new information. Let's look at the stated behavior of SemVer:

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner, and
PATCH version when you make backwards-compatible bug fixes.

It would seem then that a MINOR version change would be required in the case of keeping DeleteVolumeResponse for now in order to possibly add new data to it in the future. However, I'm not sure I agree with this. To me a MINOR change is valid if we, for example, added a new RPC that provides extended functionality. However, changing the response data for an RPC, even if technically backwards compatible, is to me a MAJOR change as it affects an existing RPC. Again, I understand it is technically a MINOR change, but I'm talking about in practice, not in theory.

Therefore I'm fine leaving this items Empty for now and then requiring a MAJOR version bump if we should ever change them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would only be backwards incompatible at the library layer, right?
The proto/wire format should be fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i think wire format should be the same. Just i feel the benefit of doing this is thin. It's nice to keep request/response symmetric.

Copy link
Contributor

@akutz akutz Nov 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would only be backwards incompatible at the library layer, right?

That's correct. Truth be told though, adding a new RPC, while a MINOR SemVer bump, would be backwards incompatible for any Go library.

I think we have to agree (or not) that compatibility is with regards to the spec's wire format, or should also include one or more language binding/implementations of the spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to agree (or not) that compatibility is with regards to the spec, not with regards to a specific language binding/implementation of the spec.

Actually, I think the above is a good topic for next week's meeting. Is Go prevalent enough that we should consider it when considering SemVer for the spec?

Copy link
Member Author

@saad-ali saad-ali Nov 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly either way. Since @jieyu strongly prefers symmetry, I'll go ahead and make it symmetric.


Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Unsupported request version | 3 INVALID_ARGUMENT | Indicates that the version specified in the request is not supported by the Plugin. | Caller MUST NOT retry; caller SHOULD call `GetSupportedVersions` to discover which CSI versions the Plugin supports. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was debating if this should be InvalidArgument or FailedPrecondition.

// InvalidArgument indicates client specified an invalid argument.
// Note that this differs from FailedPrecondition. It indicates arguments
// that are problematic regardless of the state of the system
// (e.g., a malformed file name).
InvalidArgument Code = 3

It looks to me that for this case, it depends on the state of the SP (support that version or not). However, on the other hand, if we implicitly assume that CO should not retry on InvalidArgument and can retry on FailedPrecondition, then InvalidArgument is better here.

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This is tricky. As @jieyu points out, the gRPC error code docs state:

// Note that this differs from FailedPrecondition. It indicates arguments
// that are problematic regardless of the state of the system

However, the FailedPrecondition doc says:

// FailedPrecondition indicates operation was rejected because the
// system is not in a state required for the operation's execution.

I read this as a FailedPrecondition being an argument's relationship to the state of the system. The question becomes -- are the versions supported by a CSI SP part of the system's state? I don't know. I could go either way here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave it as InvalidArgument because I don't want the caller to retry without human intervention.

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Volume already exists | 0 OK | Indicates that a volume corresponding to the specified volume `name` already exists. | Caller MUST assume the `CreateVolume` call succeeded. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on what @saad-ali said

If AlreadyExists is used, the CO might not be able to figure out the VolumeInfo for the already created volume. cc @julian-hj


The status `code` MUST contain a [canonical error code](https://github.com/grpc/grpc-go/blob/master/codes/codes.go). COs must handle all valid error codes. Each RPC defines a set of gRPC error codes that MUST be returned by the plugin when specified conditions are encountered. In addition to those, if the conditions defined below are encountered, the plugin MUST return the associated gRPC error code.

Condition | gRPC Code | Description | Recovery Behavior
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also include a line for the following in the current spec.

// Indicates that a required field is missing from the request.
// More human-readable information MAY be provided in the
// `error_description` field. The `caller_must_not_retry` field
// MUST be set to true.
//
// Recovery behavior: Caller MUST fix the request by adding the
// missing required field before retrying.
MISSING_REQUIRED_FIELD = 3;

I think using INVALID_ARGUMENT is more appropriate here because CO should not retry in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

spec.md Outdated
Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Volume already exists | 0 OK | Indicates that a volume corresponding to the specified volume `name` already exists. | Caller MUST assume the `CreateVolume` call succeeded. |
| Invalid `name` | 3 INVALID_ARGUMENT | Indicates that the specified volume name is not allowed by the Plugin. More human-readable information MAY be provided in the gRPC `status.message` field. | Caller MUST fix the name before retrying. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to have a general "Invalid or unsupported field in the request" condition (INVALID_ARGUMENT), and put that in the general error section above so that we don't have to mention them here for each call. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to leave them split up so that if we want we can convert these in to detailed error codes in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One second thought, I'll go ahead and change it as you described. We can always re-add these in the future, if needed.

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Unsupported `volume_capability` | 3 INVALID_ARGUMENT | Indicates that one or more of the specified `volume_capability` options (mount flags, volume type, FS type, etc.) are not supported. | Caller MUST fix the `volume_capability` before retrying. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fits into the general invalid/unsupported fields condition? Consider remove it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Bad plugin config | 3 INVALID_ARGUMENT | Indicates that the plugin is misconfigured. | Caller MUST assume the plugin is not healthy. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INVALID_ARGUMENT here is weird because we don't really specify any argument for the call :)
I prefer FailedPrecondition here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur 100%. This is the entire purpose of FailedPrecondition as it relates to the state of the system.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Brain farted on that one.

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Unsupported `volume_capability` | 3 INVALID_ARGUMENT | Indicates that one or more of the specified `volume_capability` options (mount flags, volume type, FS type, etc.) are not supported. | Caller MUST fix the `volume_capability` before retrying. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Unsupported `volume_capability` | 3 INVALID_ARGUMENT | Indicates that one or more of the specified `volume_capability` options (mount flags, volume type, FS type, etc.) are not supported. | Caller MUST fix the `volume_capability` before retrying. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Bad plugin config | 3 INVALID_ARGUMENT | Indicates that the plugin is misconfigured. | Caller MUST assume the plugin is not healthy. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. "FAILED_PRECONDITION" ?

spec.md Outdated
@@ -248,13 +248,13 @@ service Controller {
returns (CreateVolumeResponse) {}

rpc DeleteVolume (DeleteVolumeRequest)
returns (DeleteVolumeResponse) {}
returns (Empty) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm conflicted on this. While I agree with much of @jieyu's reasoning, this bit strikes me in a funny way:

add an optional field to DeleteVolumeResponse will be backward compatible (without the need for bumping a CSI version).

While this is technically true, if there is additional information added to otherwise empty responses, is bumping the version such a bad thing? Otherwise the version wouldn't reflect the addition of new information. Let's look at the stated behavior of SemVer:

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner, and
PATCH version when you make backwards-compatible bug fixes.

It would seem then that a MINOR version change would be required in the case of keeping DeleteVolumeResponse for now in order to possibly add new data to it in the future. However, I'm not sure I agree with this. To me a MINOR change is valid if we, for example, added a new RPC that provides extended functionality. However, changing the response data for an RPC, even if technically backwards compatible, is to me a MAJOR change as it affects an existing RPC. Again, I understand it is technically a MINOR change, but I'm talking about in practice, not in theory.

Therefore I'm fine leaving this items Empty for now and then requiring a MAJOR version bump if we should ever change them.


Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Unsupported request version | 3 INVALID_ARGUMENT | Indicates that the version specified in the request is not supported by the Plugin. | Caller MUST NOT retry; caller SHOULD call `GetSupportedVersions` to discover which CSI versions the Plugin supports. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This is tricky. As @jieyu points out, the gRPC error code docs state:

// Note that this differs from FailedPrecondition. It indicates arguments
// that are problematic regardless of the state of the system

However, the FailedPrecondition doc says:

// FailedPrecondition indicates operation was rejected because the
// system is not in a state required for the operation's execution.

I read this as a FailedPrecondition being an argument's relationship to the state of the system. The question becomes -- are the versions supported by a CSI SP part of the system's state? I don't know. I could go either way here.

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Volume already exists | 0 OK | Indicates that a volume corresponding to the specified volume `name` already exists. | Caller MUST assume the `CreateVolume` call succeeded. |
Copy link
Contributor

@akutz akutz Nov 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree (or rather I agree with @cpuguy83). One part of the purpose behind this new error model was the ability to return both valid data and associated error codes. I want the RPC to return the error that indicates the volume already exists and the VolumeInfo for that volume in a single response.

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Unsupported `volume_capability` | 0 OK | Indicates that one or more of the specified `volume_capability` options (mount flags, volume type, FS type, etc.) are not supported. `ValidateVolumeCapabilitiesResponse.supported` MUST be `false`. `ValidateVolumeCapabilitiesResponse.message` must detail what volume capabilities are incompatible. | Caller MUST fix the `volume_capability` before retrying. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I'm wondering. I don't think an error code of 0 is appropriate if one of the options are not supported.

spec.md Outdated

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Bad plugin config | 3 INVALID_ARGUMENT | Indicates that the plugin is misconfigured. | Caller MUST assume the plugin is not healthy. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur 100%. This is the entire purpose of FailedPrecondition as it relates to the state of the system.

@akutz
Copy link
Contributor

akutz commented Nov 10, 2017

If this PR is no longer going to return ALREADY_EXISTS error codes along with the data for the existing object, then I fail to see the value in this PR. I was under the impression the guiding purpose of this PR was to be able to return both informative error codes and related data concurrently as the result of a single call. Otherwise this is back to the same discussion regarding whether to return an error or the result in an idempotent situation.

@akutz
Copy link
Contributor

akutz commented Nov 10, 2017

Hi @saad-ali,

If this PR is no longer going to return ALREADY_EXISTS error codes along with the data for the existing object, then I fail to see the value in this PR. I was under the impression the guiding purpose of this PR was to be able to return both informative error codes and related data concurrently as the result of a single call. Otherwise this is back to the same discussion regarding whether to return an error or the result in an idempotent situation.

The more I think about what I wrote above, the more I must propose we delay this PR to post v0.1. If this PR isn't going to introduce the ability to receive both error codes and data as the result of a single call then I do not think the change this PR introduces is worth the trouble to adopt it so close to when people expect to be able to use CSI.

I will have to go back and find our discussion, but we all agreed that COs may both wish to know if an object does or doesn't already exist and then still have that object's data returned. I disagree with @jieyu's statement:

If AlreadyExists is used, the CO might not be able to figure out the VolumeInfo for the already created volume. cc @julian-hj

Of course the CO can figure that out. There are many examples of when a client/receiver must consider more than just 0 as a successful exit code. This is the same as that. For example:

if err == nil || err.Code == ALREADY_EXISTS

The above Go-based pseudo-code demonstrates how simple this is.

If this PR will not return error codes related to idempotency along with existing object data then I propose this PR be delayed until after v0.1.

@jieyu
Copy link
Member

jieyu commented Nov 10, 2017

@akutz I think setting both gRPC status and response sounds good to me too. The CO will have slightly more logic to deal with that, but that should be small like you said.

@saad-ali
Copy link
Member Author

Feedback addressed. PTAL

Copy link
Member

@jieyu jieyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM! Thanks @saad-ali !

In fact, given that we don't have other machine parsable information about the error other than the status code, the retry/recovery behavior has to be consistent for each error code. For instance, for each RPC, the recovery behaviors for all cases whereFAILED_PRECONDITION will be returned should be consistent. Otherwise, CO does not have other information to tell how to perform the recovery behavior differently.

Therefore, one thing that might worth doing in the future is to structure the error code description like the following:

RPC 1: (e.g., CreateVolume)

status code 1 (e.g., FAILED_PRECONDITION)
  - description
  - recovery behavior
  - conditions:
      - condition 1:
      - condition 2:

Copy link
Contributor

@akutz akutz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for this tremendous effort @saad-ali!

@saad-ali saad-ali merged commit 51e48d3 into container-storage-interface:master Nov 11, 2017
@akutz
Copy link
Contributor

akutz commented Nov 13, 2017

FWIW, I'm already 95% of the way to adopting it in rexray/gocsi#32. Except for the client, the new error model has been pushed into the interceptors, mock SP, tests, etc. Working very, very well. Thank you again @saad-ali!


Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Volume already exists | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified volume `name` already exists. Plugin MUST also return a valid `CreateVolumeResponse`. | Caller MUST assume the `CreateVolume` call succeeded. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is possible, at least for golang-based servers. Has anyone tested this approach? I'm looking at the golang grpc code and it seems like if an RPC generates an error, then the xxxResponse is dropped and just "status" is returned to the client: https://github.com/grpc/grpc-go/blob/8ff8683602df4fb6a85c22f2da944bfc8fc73c50/server.go#L877

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarification: xxxResponse being the non-status payload

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you do this with status.New(code, msg).WithDetails(proto)... and then s.Err() to get the underlying error implementation.

Copy link
Contributor

@chhsia0 chhsia0 Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a response with a non-OK status is not recommended from the discussion of the open issue here: grpc/grpc#12824 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the example above the detailed error is encoded with the status message and can be decoded on the other side (or at least that's how I understand the API, I haven't tried to use it yet).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chhsia0,

"Client side SHOULD drop any response messages received with a non-OK response."

If the above is in fact true then I agree that the error details is the correct location for any idempotent data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this has been discussed, but another possible option (if we want to avoid using the error details for any reason) is to return an OK and introduce a boolean field in the response to indicate if a volume of the same capabilities/parameters has been created before, and use ALREADY_EXISTS as an error suggested by @julian-hj.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chhsia0,

I don't like that approach because it subverts (or compliments in a way that replaces) the existing error model that is designed to handle the case of something that already exists. I recommend using the gRPC error details as you've already suggested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand it, if a function is idempotent, then calling it twice with the same inputs should not generate an error code. So if "ALREADY_EXISTS" is an error, we should not use it for the case where the client just got impatient and repeated an RPC call, or repeated a call after a dropped connection.

What I am not clear on is whether "ALREADY_EXISTS" is deemed as an error condition or not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine either way.
For me idempotent means you can run the same thing multiple times without side-effects. The action can be idempotent (nothing changed) while still providing more information to the caller (this thing already exists) instead of "yep we did that".

However there is a case where a volume could exist but the state of it conflicts with the passed in options. In which case ALREADY_EXISTS could work... but also maybe a different error code applies to this.

| Volume already exists | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified volume `name` already exists. Plugin MUST also return a valid `CreateVolumeResponse`. | Caller MUST assume the `CreateVolume` call succeeded. |
| Operation pending for volume | 9 FAILED_PRECONDITION | Indicates that there is a already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
| Unsupported `capacity_range` | 11 OUT_OF_RANGE | Indicates that the capacity range is not allowed by the Plugin. More human-readable information MAY be provided in the gRPC `status.message` field. | Caller MUST fix the capacity range before retrying. |
| Call not implemented | 12 UNIMPLEMENTED | CreateVolume call is not implemented by the plugin or disabled in the Plugin's current mode of operation. | Caller MUST not retry caller MAY call `ControllerGetCapabilities` or `NodeGetCapabilities` to discover Plugin capabilities. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UNIMPLEMENTED is also returned by the grpc libs for other cases. Maybe consider another, non-overlapping error code for this? see https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for all other uses of UNIMPLEMENTED by this spec

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest ABORTED instead

| Volume does not exists | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Node does not exists | 5 NOT_FOUND | Indicates that a node corresponding to the specified `node_id` does not exist. | Caller MUST verify that the `node_id` is correct and that the node is available and has not been terminated or deleted before retrying with exponential backoff. |
| Volume published to another node | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` is already attached to another node and does not support multi-node attach. If this error code is returned, the Plugin SHOULD specify the `node_id` of the node the volume is already attached to as part of the gRPC `status.message`. | Caller SHOULD ensure the specified volume is not attached to any other node before retrying with exponential back off. |
| Max volumes attached | 8 RESOURCE_EXHAUSTED | Indicates that the maximum supported number of volumes that can be attached to the specified node are already attached. Therefore, this operation will fail until at least one of the existing attached volumes is detached from the node. | Caller MUST ensure that the number of volumes already attached to the node is less then the maximum supported number of volumes before retrying with exponential backoff. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RESOURCE_EXHAUSTED is also returned for other reasons by the grpc library: https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest FAILED_PRECONDITION instead

// in such cases the CO SHALL NOT specify this field.
//
// If `GetNodeID` does not omit node ID from a successful `Result`,
// If `GetNodeID` does not omit node ID from a successful response,
// the CO MAY omit this field as well, indicating that it does not
// know which node the volume was previously used. The Plugin SHOULD
// return an Error if this is not supported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Error/error/


Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Invalid `starting_token` | 10 ABORTED | Indicates that `starting_token` is not valid. | Caller SHOULD start the `ListVolumes` operation again with an empty `starting_token`. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no UNIMPLEMENTED row here, like for the other optional calls. suggest adding another row for this but as ABORTED (as suggested above).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants