Troubleshoot and fix issues in grpc-web JS conformance client #831

jhump · 2024-03-15T17:16:52Z

There were a couple of bugs in the conformance client that:

Were causing the tests to outright hang. That is why I had to add "--skip" flags in the Makefile, so we could at
least run all of the test cases that didn't hang. It turns out this was due to two issues:
- Sometimes (rarely, just in two cases) the "status" event is never invoked for a unary RPC (only happens with certain kinds of failing RPCs). This was the event where the code was resolving the promise.
- In at least one case, the metadata field of an error object was undefined. This tickled a bug that produced an uncaught error, that in turn prevented the result promise from ever being resolved.
Were causing a few of the unary RPC test cases to report "no error", even though an error was in fact reported. This is because, in apparently very rare circumstances, a "stream" event is observed before the callback is invoked. So we resolved the promise prematurely with a result that had no error, even though knowledge of the error was coming afterwards.

Previously, the unary case always resolved the promise from the "status" event, and the server-stream case would resolve the promise from the "error" or "end" event (whichever occurred first). I've changed them to now be consistent: we resolve the RPC from the "error" or "end" event (in the case of a unary RPC, it treats the result callback as the "error" event if it supplies an error instead of a message). Also, since we're always resolving from an "error" event, which almost always happens before a "status" event, we now pull the trailers from the error object.

This also adds some other logging and safeguards so that we can hopefully more easily detect problems in the future. So if there is ever an uncaught error, we'll log it. And we also forcibly resolve promises after 15s (with an appropriate error message), so that a failure to resolve the promise from the RPC callbacks won't cause the whole thing to hang.

I added lots of temporary instrumentation and logging in the client to learn the ins-and-outs of what was going on. Here's a summary from that instrumentation:

Out of 71 unary RPCs (of which 57 failed):

0 successful RPCs did not report a 'status' event
0 successful RPCs did not report an 'end' event
2 failed RPCs did not report a 'status' event
50 failed RPCs did not report an 'end' event
3 failed RPCs observed status before error

Out of 62 server-stream RPCs (of which 44 failed):

1 successful RPC did not report a 'status' event
0 successful RPCs did not report an 'end' event
0 failed RPCs did not report a 'status' event
3 failed RPCs did not report an 'end' event
0 failed RPCs observed status before error

I went ahead a pushed a couple of branches with some exploratory code related to my investigation. The instrumentation that produced the above stats is here. And I also experimented with the use of a non-headless browser, so I code poke at the JavaScript console and see what was happening; that code is here.

smaye81 · 2024-03-15T17:20:36Z

testing/grpcwebclient/browserscript.ts

+    // One might expect that the "status" event is always delivered (since
+    // consistency would make it much easier to implement interceptors or
+    // decorators, to instrument all RPCs with cross-cutting concerns, like
+    // metrics, logging, etc). But one would be wrong: as of 3/15/2024, there


testing/grpcwebclient/browserscript.ts

jhump · 2024-03-15T17:43:43Z

Makefile

@@ -110,11 +110,7 @@ runclienttests: $(BIN)/connectconformance $(BIN)/referenceclient $(BIN)/grpcclie
 	$(BIN)/connectconformance -v --conf ./testing/grpc-impls-config.yaml --mode client --trace \
 		--known-failing @./testing/grpcclient-known-failing.txt \
 		-- $(BIN)/grpcclient
-	@# Note that use of --skip is discouraged, but if we don't skip them the client crashes.


BTW, the test cases don't actually cause anything to crash, per se. What was happening is that a test case was taking too long (test runner imposes a 20 second timeout). After that time elapsed, the test runner would attempt to stop the client by sending it a SIGTERM. That signal was then propagated to the Chrome child process (maybe node does that automatically, or maybe puppeteer adds some sort of shutdown hook or signal handler?). So Chrome then terminated, which causes the code that was awaiting the result of the hung test to unblock with an error. The script would then try to shutdown, and log a nasty error when it tried to call page.close() on a browser that was already disconnected/killed. I had misinterpreted that nasty error to mean maybe the browser had crashed, vs. it being caused by the test runner trying to stop the program under test.

troubleshoot and fix grpc-web-client issues

4c8b2c5

jhump requested a review from smaye81 March 15, 2024 17:16

smaye81 approved these changes Mar 15, 2024

View reviewed changes

jhump commented Mar 15, 2024

View reviewed changes

rename initTests -> addErrorListeners; fix up some comments

bcc3eb8

jhump enabled auto-merge (squash) March 15, 2024 17:56

jhump merged commit 89e2c05 into main Mar 15, 2024
4 checks passed

jhump deleted the jh/grpcwebclient-minimal-fixes branch March 15, 2024 18:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Troubleshoot and fix issues in grpc-web JS conformance client #831

Troubleshoot and fix issues in grpc-web JS conformance client #831

jhump commented Mar 15, 2024

smaye81 Mar 15, 2024

jhump Mar 15, 2024

Troubleshoot and fix issues in grpc-web JS conformance client #831

Troubleshoot and fix issues in grpc-web JS conformance client #831

Conversation

jhump commented Mar 15, 2024

smaye81 Mar 15, 2024

Choose a reason for hiding this comment

jhump Mar 15, 2024

Choose a reason for hiding this comment