Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Troubleshoot and fix issues in grpc-web JS conformance client #831

Merged
merged 2 commits into from
Mar 15, 2024

Conversation

jhump
Copy link
Member

@jhump jhump commented Mar 15, 2024

There were a couple of bugs in the conformance client that:

  1. Were causing the tests to outright hang. That is why I had to add "--skip" flags in the Makefile, so we could at
    least run all of the test cases that didn't hang. It turns out this was due to two issues:
    • Sometimes (rarely, just in two cases) the "status" event is never invoked for a unary RPC (only happens with certain kinds of failing RPCs). This was the event where the code was resolving the promise.
    • In at least one case, the metadata field of an error object was undefined. This tickled a bug that produced an uncaught error, that in turn prevented the result promise from ever being resolved.
  2. Were causing a few of the unary RPC test cases to report "no error", even though an error was in fact reported. This is because, in apparently very rare circumstances, a "stream" event is observed before the callback is invoked. So we resolved the promise prematurely with a result that had no error, even though knowledge of the error was coming afterwards.

Previously, the unary case always resolved the promise from the "status" event, and the server-stream case would resolve the promise from the "error" or "end" event (whichever occurred first). I've changed them to now be consistent: we resolve the RPC from the "error" or "end" event (in the case of a unary RPC, it treats the result callback as the "error" event if it supplies an error instead of a message). Also, since we're always resolving from an "error" event, which almost always happens before a "status" event, we now pull the trailers from the error object.

This also adds some other logging and safeguards so that we can hopefully more easily detect problems in the future. So if there is ever an uncaught error, we'll log it. And we also forcibly resolve promises after 15s (with an appropriate error message), so that a failure to resolve the promise from the RPC callbacks won't cause the whole thing to hang.

I added lots of temporary instrumentation and logging in the client to learn the ins-and-outs of what was going on. Here's a summary from that instrumentation:

Out of 71 unary RPCs (of which 57 failed):

  • 0 successful RPCs did not report a 'status' event
  • 0 successful RPCs did not report an 'end' event
  • 2 failed RPCs did not report a 'status' event
  • 50 failed RPCs did not report an 'end' event
  • 3 failed RPCs observed status before error

Out of 62 server-stream RPCs (of which 44 failed):

  • 1 successful RPC did not report a 'status' event
  • 0 successful RPCs did not report an 'end' event
  • 0 failed RPCs did not report a 'status' event
  • 3 failed RPCs did not report an 'end' event
  • 0 failed RPCs observed status before error

I went ahead a pushed a couple of branches with some exploratory code related to my investigation. The instrumentation that produced the above stats is here. And I also experimented with the use of a non-headless browser, so I code poke at the JavaScript console and see what was happening; that code is here.

@jhump jhump requested a review from smaye81 March 15, 2024 17:16
// One might expect that the "status" event is always delivered (since
// consistency would make it much easier to implement interceptors or
// decorators, to instrument all RPCs with cross-cutting concerns, like
// metrics, logging, etc). But one would be wrong: as of 3/15/2024, there
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤣

testing/grpcwebclient/browserscript.ts Outdated Show resolved Hide resolved
@@ -110,11 +110,7 @@ runclienttests: $(BIN)/connectconformance $(BIN)/referenceclient $(BIN)/grpcclie
$(BIN)/connectconformance -v --conf ./testing/grpc-impls-config.yaml --mode client --trace \
--known-failing @./testing/grpcclient-known-failing.txt \
-- $(BIN)/grpcclient
@# Note that use of --skip is discouraged, but if we don't skip them the client crashes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the test cases don't actually cause anything to crash, per se. What was happening is that a test case was taking too long (test runner imposes a 20 second timeout). After that time elapsed, the test runner would attempt to stop the client by sending it a SIGTERM. That signal was then propagated to the Chrome child process (maybe node does that automatically, or maybe puppeteer adds some sort of shutdown hook or signal handler?). So Chrome then terminated, which causes the code that was awaiting the result of the hung test to unblock with an error. The script would then try to shutdown, and log a nasty error when it tried to call page.close() on a browser that was already disconnected/killed. I had misinterpreted that nasty error to mean maybe the browser had crashed, vs. it being caused by the test runner trying to stop the program under test.

@jhump jhump enabled auto-merge (squash) March 15, 2024 17:56
@jhump jhump merged commit 89e2c05 into main Mar 15, 2024
4 checks passed
@jhump jhump deleted the jh/grpcwebclient-minimal-fixes branch March 15, 2024 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants