Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix trailers race condition #26

Merged
merged 4 commits into from
Apr 20, 2023
Merged

Fix trailers race condition #26

merged 4 commits into from
Apr 20, 2023

Conversation

doctor-pi
Copy link
Collaborator

@doctor-pi doctor-pi commented Apr 7, 2023

We recently noticed that we were getting Status = Unknown with the default message "Server did not return grpc-status", when a server was in fact responding properly with a non-default value.

We now reproduced the issue in both grpc-async and grpc-lwt, using a variation of the greeter client example, with repeated requests. It seems that we randomly fail to get the real status, and we are filling in the default.

This seems to be a race condition due to the way this is implemented, by checking if a future is filled when in fact it is simply not filled yet.

How to reproduce

See branch reproduce/unknown-status-issue for a reproduction of the issue.
(https://github.com/dialohq/ocaml-grpc/tree/reproduce/unknown-status-issue.)

Run this in one terminal:

  • dune exec -- examples/greeter-server-lwt/greeter_server_lwt.exe

Then run either of the following and it should blow up at some random (numbered) message:

  • dune exec -- examples/greeter-test-async/test_grpc_status_async.exe test
  • dune exec -- examples/greeter-test-lwt/test_grpc_status_lwt.exe test

Check solution

See branch test/unknown-status-issue, which contains the same commits that are in this branch (the solution), on top of the code to reproduce the issue.
(https://github.com/dialohq/ocaml-grpc/tree/test/unknown-status-issue.)

Run the same commands as above and after 100_000 iterations the programs should terminate without error.
(Feel free to remove or increase the limit, I tested without it but it's convenient.)

Assumptions

We assume that trailers only responses are not an issue, and we should never hang when there is no body in the response. This is consistent with expectations of HTTP2, so any hangs would probably indicate an issue with the underlying HTTP2 library and should probably be addressed there.

Should be tested with the etcd example, see #1.
Could it be an etcd / proxy issue? I think there might be a missing trailer handler, but the situation needs some confirmation.

Update (13-Apr-2023)

Refactored the code a bit.

  • Renamed the reproduction tests to avoid compilation errors, adding suffixes _lwt, _async. See above.
  • The code to extract the status from headers does not depend on Lwt / Async any more, so it was moved to status.ml.
  • In my testing, when there was an immediate error we hanged waiting for the status to be filled from the trailers handler, which never happened. Now we only wait for this on Ok responses (which, of course, can contain a proper gRPC error status).
  • When the response is Ok, we first check for the presence of grpc status in the headers, and if we don't find it there we block waiting for the trailers status.

See branch reproduce/unknown-status-issue.

We assume that trailers only responses are not an issue, and we should
never hang when there is no body in the response. This is consistent
with expectations of HTTP2, so any hangs would probably indicate an
issue with the underlying HTTP2 library.

Should be tested with the etcd example.
@doctor-pi doctor-pi requested a review from quernd April 7, 2023 01:31
@quernd
Copy link
Collaborator

quernd commented Apr 7, 2023

Thanks, great catch! This definitely makes sense. I will test it for our main use case (streaming).

@quernd
Copy link
Collaborator

quernd commented Apr 12, 2023

It works well for our use case, however your intuition to check with the etcd example was right because this change does indeed seem to introduce a regression (see #1, e.g. when dialing a non-existing method it will hang again). That's because we still need to get the gRPC status from the headers when the body is empty. I will investigate some more tomorrow, thanks again for reporting and the reproduction!

@doctor-pi
Copy link
Collaborator Author

It works well for our use case, however your intuition to check with the etcd example was right because this change does indeed seem to introduce a regression (see #1, e.g. when dialing a non-existing method it will hang again). That's because we still need to get the gRPC status from the headers when the body is empty. I will investigate some more tomorrow, thanks again for reporting and the reproduction!

You are right. I just tested in our codebase as well, and it hanged on a non-existing method.

I'll also have a look.

Tested to not hang on unknown methods, but may need some improvements.
@doctor-pi
Copy link
Collaborator Author

It works well for our use case, however your intuition to check with the etcd example was right because this change does indeed seem to introduce a regression (see #1, e.g. when dialing a non-existing method it will hang again). That's because we still need to get the gRPC status from the headers when the body is empty. I will investigate some more tomorrow, thanks again for reporting and the reproduction!

I adapted the solution, and at least in my test it won't hang when the method is changed to a non-registered one.
For example ~rpc:"SayHello" -> ~rpc:"SayHello1" will now result in an error with 404 status.

The server does not return gprc headers in that case so I may have to refine the solution a bit, let me know how it works for you.

@doctor-pi
Copy link
Collaborator Author

@quernd i suspect the etcd example might behave as expected now. (After the latest commit: fefc3d8)

@doctor-pi doctor-pi changed the title Draft: Fix trailers race condition Fix trailers race condition Apr 13, 2023
@doctor-pi
Copy link
Collaborator Author

doctor-pi commented Apr 13, 2023

We should probably also probe for the presence of grpc-status in the response.status = Ok case, before waiting for the trailers ivar which blocks.

I added a commit that implements this solution.

@quernd
Copy link
Collaborator

quernd commented Apr 15, 2023

Thanks again @doctor-pi, it all looks good in my tests now. I'll review and merge soon.

Copy link
Collaborator

@quernd quernd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. Thanks again for your contribution!

@doctor-pi
Copy link
Collaborator Author

It looks good to me. Thanks again for your contribution!

Nice. I'll merge.

@doctor-pi doctor-pi merged commit 6ab4178 into main Apr 20, 2023
1 check passed
@doctor-pi
Copy link
Collaborator Author

It looks good to me. Thanks again for your contribution!

I guess I should have squashed the commits?

@quernd
Copy link
Collaborator

quernd commented Apr 21, 2023

I guess I should have squashed the commits?

Don't worry about that, we don't have any such policy (yet).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants