Error codes, the wire protocol, and clients #462

cole-miller · 2023-01-25T16:50:14Z

Right now, dqlite reports errors to the client through the FAILURE response, which consists of an integer error code and a string. This is a reasonable design, but there are some problems with the implementation:

The interpretation of the error code is not established anywhere. There are several potential sources of integer error codes (SQLite, dqlite, raft), which overlap with each other, and it's not documented which of these sets of constants are applicable to the value that appears on the wire. My sense is that the dqlite server code is not completely consistent about what kind of error codes it passes to the client (see also Public functions mix different kinds of error code #379).
We would like for the C client library functions to generally return SQLite result codes, to make it easy to migrate existing applications from SQLite. In some cases, the server will get an error code from an sqlite3 function, and we should make sure that that code is passed all the way back to the client; in other cases, when the failure originates in dqlite or raft, we'll have to use a "synthetic" SQLite code like the generic SQLITE_ERROR, and perhaps expose a way for the client to access the underlying dqlite or raft code. Right now, the single hard-to-interpret integer that comes over the wire is not enough for the C client to do this effectively.

One thing I'd like to do to start addressing this would be to make the DQLITE_* and RAFT_* error codes disjoint from each other and from the SQLite codes (and the libuv codes), so that we can unambiguously interpret the return value of any fallible function in the codebase. That's a backwards-incompatible change, because these constants appear in public header files, but I don't expect any breakage in practice if we were to implement this. Then we could keep the current design of sending a single error code in the FAILURE response, and the client would have enough information to discern e.g. whether it's an SQLite error that should be passsed back to the caller as-is.

A backwards-compatible alternative, suggested by @stgraber, would be to add some fields to the end of the FAILURE response that hold error codes from a specific, well-defined domain. I don't think any existing clients actually look at the existing, generic error code, but if they do we could migrate them to examining these specific fields instead. If one of the new fields were reserved for an SQLite code, that would make life pretty easy for the C client.

freeekanayaka · 2023-01-25T19:56:26Z

The current state of mixing error codes from various sources is indeed unfortunate. Since the integer contained in the FAILURE response is a 64-bit one, can't we reserve the first 8 bits to include a sub-code identifying the source? The remaining 56 bits should be enough to hold any SQLITE, DQLITE, Raft or libuv code (please correct me if I'm wrong).

I agree that backward compatibility shouldn't be a big issue, but there are few codes used in go-dqlite and perhaps LXD so please double check to convert those at the same time.

cole-miller · 2023-01-25T20:00:45Z

The current state of mixing error codes from various sources is indeed unfortunate. Since the integer contained in the FAILURE response is a 64-bit one, can't we reserve the first 8 bits to include a sub-code identifying the source? The remaining 56 bits should be enough to hold any SQLITE, DQLITE, Raft or libuv code (please correct me if I'm wrong).

Yeah, something like this is what I had in mind when I proposed making the code ranges disjoint.

I agree that backward compatibility shouldn't be a big issue, but there are few codes used in go-dqlite and perhaps LXD so please double check to convert those at the same time.

Noted and will do.

cole-miller · 2023-01-25T20:54:35Z

Here's a concrete proposal for allocating the 64-bit space of return codes that's available for the FAILURE response. Note that we use int as the return type for fallible functions throughout the dqlite and raft codebases, and I don't relish the prospect of switching to a 64-bit type everywhere, so I think we should stick to codes that are representable by a 32-bit signed integer. (Do we care about portability to platforms where an int is only 16 bits?)

0 means success, as always.
Nonzero SQLITE_* result codes are positive 32-bit signed integers, and the largest one currently allocated is less than 10000. We should preserve all these values and reserve values <= 1 << 30 for any codes that might be added in the future.
libuv error codes are negated errno values. We should map these into the corresponding int64_t values (sign-extension).
raft should own the range 1<<30 < i <= (1<<30 + 1<<29); we can just translate the existing codes.
dqlite should own the range (1<<30 + 1<<29) < i < 1<<31; we can just translate the existing codes.

freeekanayaka · 2023-01-26T10:05:09Z

I think it's fine to assume that int is at least 32 bits, and discard 16-bit platforms.

Thinking about this a bit more though, I'm not entirely sure that we should expose libuv/raft errors. Ideally the integer returned to the client should be helpful to make decisions about what to do, or help a human understand what went wrong (together with the string error message). Having the wire protocol return, say, EINVAL is not particularly useful to the client I think.

Perhaps what we should return is basically a superset of SQLite codes, augmented with codes that are specific to dqlite, either because they are related to wire protocol operations unrelated to SQLite (e.g. REQUEST_TRANSFER) or because they represent failure modes which are not part of the SQLite semantics and that require special care from the client. One important example of the latter type are the SQLITE_IOERR_NOT_LEADER and SQLITE_IOERR_LEADERSHIP_LOST errors defined in leader.h in the dqlite code base. They effectively augment the the SQLite error codes set as I mentioned.

If I recall correctly what I described was actually the original design, although the actual implementation didn't quite match it.

In general I believe a library or subsystem should abstract away the errors coming from lower layers, the stdlib does that with the kernel/hardware, raft does that with the stdlib/libuv (or should), SQLite does that with the stdlib (it does not "pass through" any error). This is helpful because the library should decide which errors and failure modes fit the model at hand, and ideally also provide documentation about what the client should do for each failure mode (for example if it should retry in some form).

Sorry to derail the conversation a bit, just my 2 cents, YMMV.

freeekanayaka · 2023-01-26T10:11:15Z

Note also that I believe DQLITE_ error codes were meant for just the C part of the dqlite API (so basically for all functions defined in dqlite.h), while the wire protocol errors were meant to be a superset of SQLITE_ errors, as mentioned above.

cole-miller · 2023-01-26T16:44:44Z

Perhaps what we should return is basically a superset of SQLite codes, augmented with codes that are specific to dqlite, either because they are related to wire protocol operations unrelated to SQLite (e.g. REQUEST_TRANSFER) or because they represent failure modes which are not part of the SQLite semantics and that require special care from the client.

I definitely agree that this is the kind of information that should ultimately be returned by the C client library functions. The question is where in the code we do the collapsing from a large, variegated space of error codes down into "SQLite codes plus". I'm not particularly attached to doing this collapsing on the client side, but I think it should happen centrally, in exactly one place in the combined server+client codebase. Everywhere else we should just propagate error codes from other APIs. One reason to do it on the client side is that the client will have its own error conditions, so it already has to do some munging to translate errors into SQLit-ese; if we put the rest of the munging in the same place it might be easier to reason about.

I think any error constants we define should start with DQLITE_, not SQLITE_.

freeekanayaka · 2023-01-26T19:19:02Z

Perhaps what we should return is basically a superset of SQLite codes, augmented with codes that are specific to dqlite, either because they are related to wire protocol operations unrelated to SQLite (e.g. REQUEST_TRANSFER) or because they represent failure modes which are not part of the SQLite semantics and that require special care from the client.

I definitely agree that this is the kind of information that should ultimately be returned by the C client library functions. The question is where in the code we do the collapsing from a large, variegated space of error codes down into "SQLite codes plus". I'm not particularly attached to doing this collapsing on the client side, but I think it should happen centrally, in exactly one place in the combined server+client codebase.

In my mind it's more "layering" than "collapsing" (but maybe it's just a matter of terminology).

Let's take a concrete example from the raft library (in src/uv_writer.c):

static int uvWriterIoSetup(unsigned n, aio_context_t *ctx, char *errmsg)
{
    int rv;
    rv = UvOsIoSetup(n, ctx);
    if (rv != 0) {
        switch (rv) {
            case UV_EAGAIN:
                ErrMsgPrintf(errmsg, "AIO events user limit exceeded");
                rv = RAFT_TOOMANY;
                break;
            default:
                UvOsErrMsg(errmsg, "io_setup", rv);
                rv = RAFT_IOERR;
                break;
        }
        return rv;
    }
    return 0;
}

In this case the errors happening in the libuv layer get handled in the raft layer: they are mapped into the raft error codes set and augmented with a more specific description. The client code consuming the raft library could then choose to handle RAFT_TOOMANY and RAFT_IOERR differently, because RAFT_TOOMANY is retriable. Such client code does not need to know anything about libuv error codes, and more importantly it can rely on a clear meaning of the raft codes (currently the documentation of RAFT_ error codes is scarce, but that's another issue).

If we were returning UV_EAGAIN directly instead of RAFT_TOOMANY, the client code would not be in the position to decide what to do, because for instance there may be other spots in the raft code base were we return UV_EAGAIN as well but in that case the error is not retriable because of raft-specific reasons (this is just hypothetical, for the sake of the explanation).

So, no libuv error codes should be returned by raft, only RAFT_ ones. Hence dqlite should not have to deal with libuv codes at all.

Similarly, when dqlite hits a raft error, it should do the same as the raft code above is doing when hitting a libuv error: it should handle it and map it to its own error codes (the augmented SQLITE_ ones), possibly prepending a prefix to the raft error message string, to provide more context (like raft is doing in the code snippet above).

In this sense there would be no "central" place where a collapsing happens, because each error that dqlite receives from a bottom layer (e.g. raft) needs to be handled on the spot, because that's where we have the context to take a meaningful decision about what dqlite-level error code to return and what error message string to generate.

Everywhere else we should just propagate error codes from other APIs. One reason to do it on the client side is that the client will have its own error conditions, so it already has to do some munging to translate errors into SQLit-ese; if we put the rest of the munging in the same place it might be easier to reason about.

I think the reasoning above applies to the dqlite client code as well. The errors it receives from the wire-level protocol are basically the equivalent of a lower layer (think dqlite handling raft errors, or raft handling libuv errors), they need to be handled, possibly translating/augmenting them (in case of the wire-level errors most of the time I guess they will be just propagated verbatim, but there might be a few exceptions). On top of that there will be the errors from its own error conditions, as you mention.

Even in this case the handling/munging/translation happens right on the spot when the error occurs.

I think any error constants we define should start with DQLITE_, not SQLITE_.

Are you talking about wire protocol error constants or about int values returned by the dqlite C client functions?

For the wire protocol I believe we agree that those constants should be a super set of SQLITE_ codes right? We probably don't need to export symbols for them, perhaps just document that the error codes are the same as SQLite and enumerate what are the additional ones. For the int values returned by the dqlite C client, I'm not sure, I'm fine translating the SQLITE_ prefix into DQLITE_.

cole-miller · 2023-01-27T23:13:45Z

My concern with the "layering" approach is that we have to decide (and ideally document) on a function-by-function basis what set of return codes is being used. Then, whenever you call a function from another translation unit (or even another function in the same file), you have to check the semantics of its return code in order to determine whether to propagate it or turn it into something else. Once we've got the conventions established it's not so bad, but getting to that state from the current situation where for many functions it's not immediately clear (to me at least) what return codes are used seems like a non-trivial amount of work. But I take your point about masking errors from lower levels of the implementation that might not be useful to callers.

freeekanayaka · 2023-01-28T08:19:24Z

My concern with the "layering" approach is that we have to decide (and ideally document) on a function-by-function basis what set of return codes is being used. Then, whenever you call a function from another translation unit (or even another function in the same file), you have to check the semantics of its return code in order to determine whether to propagate it or turn it into something else.

In my mind if a function in raft calls a libuv function it should immediately handle and possibly translate the error code it gets. That being said, we have just one exception to that in the code base, that I know of, which is the uv_os.c file. That's basically a thin shim around libuv AND the stdlib, in order to have a more uniform API (e.g. return negated error codes). So essentially the only additional rule here is that if you call a function from uv_os.h it's basically like you call a libuv function directly.

With the above in mind, there shouldn't be cases where you need to check the semantics of the function to determine the return codes set.

Once we've got the conventions established it's not so bad, but getting to that state from the current situation where for many functions it's not immediately clear (to me at least) what return codes are used seems like a non-trivial amount of work.

As I said, I think for raft we are already good. It should be apparent by grepping the code for uv_, you shouldn't find cases that fall outside the rule above. In case there are, they should be few and easy to fix.

For dqlite, I'm not entirely sure, but I wouldn't expect the situation to be a lot different, perhaps it will need some more love.

But I take your point about masking errors from lower levels of the implementation that might not be useful to callers.

I believe the most useful thing for callers is to have a uniform (same error codes set), bounded (you know exactly which errors from that set could be returned and under which conditions) and documented list of failure modes, precisely as the stdlib functions.

We'll probably not get to that ideal any time soon, but we can opportunistically move towards that vision.

freeekanayaka · 2023-01-28T10:46:44Z

PS: I've doubled check in the raft code, by grepping uv_, and I didn't find any case where the return code from libuv is propagated verbatim, it's always checked, possibly handed, and an appropriate RAFT_ code is returned. The only exception are all functions in uv_os.h which as said above are just shims around libuv/stdlib (probably this bit should be documented in a comment in uv_os.h.

cole-miller · 2023-01-30T18:18:13Z

Thanks -- I'm getting a clearer picture of the status quo. Would you say that all fallible internal dqlite functions should use either SQLite error codes (plus our dqlite-specific additions) or the dqlite set, i.e. all raft/libuv/stdlib errors should be "masked" before returning from the containing function? (With the idea that dqlite codes will eventually surface as the return value of a dqlite.h function, while SQLite-ish codes will be send over the wire.) What are the principles you'd use to determine which of these two sets a particular function should use?

freeekanayaka · 2023-01-30T19:00:44Z

Thanks -- I'm getting a clearer picture of the status quo. Would you say that all fallible internal dqlite functions should use either SQLite error codes (plus our dqlite-specific additions) or the dqlite set, i.e. all raft/libuv/stdlib errors should be "masked" before returning from the containing function?

Yes, I'd say that's the idea.

(With the idea that dqlite codes will eventually surface as the return value of a dqlite.h function, while SQLite-ish codes will be send over the wire.) What are the principles you'd use to determine which of these two sets a particular function should use?

I'd distinguish two types of functions, the ones that are being called during the handling of a request (essentially anything called directly or indirectly by gateway__handle() in gateway.c) and everything else.

The first type should return SQLite-sh error codes because those codes will be encoded in the response and sent to the client (so not surfaced by any dqlite.h user function or other internal dqlite engine stuff).

The second type should return dqlite-specific error codes, tailored to make sense for the dqlite.h domain (not the database/wire-protocol domain).

MathieuBordere added the Feature New feature, not a bug label Jun 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error codes, the wire protocol, and clients #462

Error codes, the wire protocol, and clients #462

cole-miller commented Jan 25, 2023

freeekanayaka commented Jan 25, 2023

cole-miller commented Jan 25, 2023 •

edited

Loading

cole-miller commented Jan 25, 2023

freeekanayaka commented Jan 26, 2023

freeekanayaka commented Jan 26, 2023

cole-miller commented Jan 26, 2023 •

edited

Loading

freeekanayaka commented Jan 26, 2023

cole-miller commented Jan 27, 2023

freeekanayaka commented Jan 28, 2023

freeekanayaka commented Jan 28, 2023

cole-miller commented Jan 30, 2023

freeekanayaka commented Jan 30, 2023 •

edited

Loading

Error codes, the wire protocol, and clients #462

Error codes, the wire protocol, and clients #462

Comments

cole-miller commented Jan 25, 2023

freeekanayaka commented Jan 25, 2023

cole-miller commented Jan 25, 2023 • edited Loading

cole-miller commented Jan 25, 2023

freeekanayaka commented Jan 26, 2023

freeekanayaka commented Jan 26, 2023

cole-miller commented Jan 26, 2023 • edited Loading

freeekanayaka commented Jan 26, 2023

cole-miller commented Jan 27, 2023

freeekanayaka commented Jan 28, 2023

freeekanayaka commented Jan 28, 2023

cole-miller commented Jan 30, 2023

freeekanayaka commented Jan 30, 2023 • edited Loading

cole-miller commented Jan 25, 2023 •

edited

Loading

cole-miller commented Jan 26, 2023 •

edited

Loading

freeekanayaka commented Jan 30, 2023 •

edited

Loading