[bsr-383] Fix flaky studio agent test #1306

unmultimedio · 2022-07-29T00:11:25Z

Apparently, when doing conn.Close() in the test, to fake a server-closed connection, it's mapped sometimes as connect.CodeUnavailable instead of connect.CodeUnknown, that's what's causing the flakiness.

Dig into connect-go, and it's being returned in the envelope, by reading the error from the request.

The error that's being triggered is on the Connect's httpClient.Do invocation, a net/url.Error:

error(*net/url.Error) *{Op: "Post", URL: "http://127.0.0.1:56188", Err: error(*errors.errorString) *{s: "io: read/write on closed pipe"}}

Which later is unwrapped into a regular error string "io: read/write on closed pipe", and then wrapped again into a connect.CodeUnavailable.

That's the one that's being returned, but when doing conn.Close() the test is expecting a BadGateway.

Not sure if this is expected, or what's the original intention of the test t.Run("invalid_upstream",. Ticket BSR-383

tbh, not sure if the fix is here, or in Connect by returning a CodeUnknown instead of CodeUnavailable on the catch-all non-Connect error. Related commit.

bsr-383

doriable · 2022-07-29T17:45:32Z

Took a look at this and had a quick chat about this with @unmultimedio offline, summarising our points here:

I think fix should definitely be on this side of things, since the flakiness seems to be caused by the fact that conn.Close() is called in a go routine
We're going to look to clean-up the flow of the test, but it is also worth noting/considering that CodeUnavailable should maybe also be a BadGateway since it should deter retry mechanisms

private/bufpkg/bufstudioagent/bufstudioagent_test.go

akshayjshah

Caught up with Doria about this today.

I think that this approach to producing errors will always be flaky to one extent or another. The core of the problem is that it's very difficult to synchronize between the accept goroutine and the net/http code in HTTPClient.Do - even with the proposed changes, we don't have any guarantee that the connection is closed before the request issues the syscall to read data. connect-go tries to inspect the Go errors from HTTPClient.Do and assign meaningful error codes, but it's never going to be perfect.

In short, it's difficult to reliably make an HTTP request using a TCP connection which the remote has already closed. I can imagine some slightly more elaborate code to improve what we have, but I don't think it's worthwhile.

To me, the tests would be simpler to read, have no flakes, and be more exhaustive if we factored the error-to-HTTP-status mapping code out of plainPostHandler.ServeHTTP to a helper function. We can then easily test handling of every connect error code, which serves as documentation of our intentions. (Reading this code, for example, I can't tell why we don't also map ResourceExhausted to 503 - did we just forget, or do we specifically not want to do that?) This approach sacrifices some "realism," but it doesn't look to me like that realism is particularly valuable here.

bsr-383

unmultimedio · 2022-08-02T21:20:03Z

private/bufpkg/bufstudioagent/bufstudioagent_test.go

-		go func() {
-			conn, err := listener.Accept()
-			require.NoError(t, err)
-			require.NoError(t, conn.Close())


This line is inherently flaky because of the behavior of http clients handling closed connections. So we can write a similiar test instead, to return the Unknown code, and test the same behavior.

unmultimedio · 2022-08-02T21:21:10Z

private/bufpkg/bufstudioagent/plain_post_handler.go

+		// Any issue connecting that is not a Connect error is assumed to be because
+		// the server is in some kind of bad and/or unreachable state.


No need for checking net.OpErr or url.Error, all of those cases were doing the same thing as this catch-all. Same for the Unknown and CodeUnavailable connect codes.

unmultimedio · 2022-08-02T21:23:36Z

@doriable @akshayjshah thanks for the revisions and help. As right now the only two codes that cannot be handled with an invokeResponse and err.Meta() are Unavailable and Unknown, it's easier to have those two conditions in the code inlined. But if that list grows, and other cases appear, we should definitely take Akshay's approach to having a map of connect codes to http status codes for the studio agent.

pkwarren · 2022-08-05T18:08:07Z

private/bufpkg/bufstudioagent/bufstudioagent_test.go

@@ -236,6 +227,13 @@ func newTestConnectServer(t *testing.T, tls bool) *httptest.Server {
 		},
 		connect.WithCodec(&bufferCodec{name: "proto"}),
 	))
+	mux.Handle(unknownPath, connect.NewUnaryHandler(
+		errorPath,


Should this be unknownPath?

~~Yes! will fix~~ Fixed in the latest commit, thx.

doriable

This seems reasonable to me.
I think given the slight behavioural change of catching connect.CodeUnavailable, we may want to add a changelog entry for the buf beta studio-agent command, but I'm going to approve first on the code :)

pkwarren · 2022-08-05T18:21:29Z

private/bufpkg/bufstudioagent/plain_post_handler.go

-		if urlErr := new(url.Error); errors.As(err, &urlErr) {
-			http.Error(w, err.Error(), http.StatusBadGateway)
-			return
-		}


I wonder if we need https://github.com/bufbuild/connect-go/issues/222 to accurately determine what to return here? I don't like that to get a test passing we're treating any CodeUnavailable errors (whether from client connection or returned from server) as StatusBadGateway.

I'd feel better if this fix was restricted to the unit tests and didn't change this logic.

If you check carefully the code that was written, the only logic I changed was adding connect.CodeUnavailable as another connect error code to trigger an http.StatusBadGateway. All the other cases: net.OpError, url.Error, or the catch-all in L196, all of them were doing the same thing, triggering an http.StatusBadGateway. All I did was simplify that code into a single catch-all in new L186.

The main difference from before is that if you had a handler that returned CodeUnavailable (perhaps with error details) and it didn't wrap net.OpError or url.Error, we'd call i.writeProtoMessage with the error metadata.

Now we're treating CodeUnavailable (whether from a network error or from the handler) as StatusBadGateway. This means that people who use CodeUnavailable for whatever reason in their handler won't be able to dig into more details in the error.

As I understand it, this PR is purely attempting to improve test flakiness. Is that correct? If so, it's important but not super urgent.

@pkwarren is correct about the error handling gotchas we're introducing here. If this change can wait a few days, or if we're comfortable reverting it in a few days, let's address bufbuild/connect-go#222 and use that logic here.

Just had a zoom with @pkwarren, explained me the error logic here, thank you! I'll update this PR to just be skipping the flaky test, and improving comments.

unmultimedio · 2022-08-05T18:26:46Z

CHANGELOG.md

+- Change default for `--timeout` flag of `buf beta studio-agent` to `0` (no timeout). Before it was
+  `2m` (the default for all the other `buf` commands).


Missing from https://github.com/bufbuild/buf/pull/1305/files

bsr-409

unmultimedio · 2022-08-05T18:37:33Z

we may want to add a changelog entry for the buf beta studio-agent command

@doriable Added in the latest commit :)

bsr-383

* Sometimes conn.Close is mapped as unavailable? bsr-383 * Comment for CodeUnavailable, and sync test * test * Setting custom timeout to the agent client * Rewrite how StatusBadGateway is triggered bsr-383 * Remove agent server timeout * Changelog and test path fix bsr-409 * Revert changes, improve code comments bsr-383

Sometimes conn.Close is mapped as unavailable?

bfa9bc4

bsr-383

unmultimedio requested review from pkwarren, robbertvanginkel and akshayjshah July 29, 2022 00:11

unmultimedio self-assigned this Jul 29, 2022

unmultimedio requested a review from doriable July 29, 2022 00:12

Comment for CodeUnavailable, and sync test

7eaddba

unmultimedio added 2 commits July 29, 2022 12:58

test

4bb2aa7

Setting custom timeout to the agent client

103de2b

unmultimedio commented Jul 29, 2022

View reviewed changes

private/bufpkg/bufstudioagent/bufstudioagent_test.go Outdated Show resolved Hide resolved

akshayjshah reviewed Aug 2, 2022

View reviewed changes

Rewrite how StatusBadGateway is triggered

7446a8a

bsr-383

unmultimedio commented Aug 2, 2022

View reviewed changes

Merge branch 'main' into jfigueroa/bsr-383-fix-flaky-test

a40dafc

unmultimedio marked this pull request as ready for review August 2, 2022 21:23

unmultimedio requested a review from akshayjshah August 2, 2022 21:23

Remove agent server timeout

e25ca0d

pkwarren reviewed Aug 5, 2022

View reviewed changes

doriable approved these changes Aug 5, 2022

View reviewed changes

pkwarren reviewed Aug 5, 2022

View reviewed changes

unmultimedio commented Aug 5, 2022

View reviewed changes

unmultimedio force-pushed the jfigueroa/bsr-383-fix-flaky-test branch 2 times, most recently from 54dcfbf to 5488985 Compare August 5, 2022 18:29

Changelog and test path fix

902ddba

bsr-409

unmultimedio force-pushed the jfigueroa/bsr-383-fix-flaky-test branch from 5488985 to 902ddba Compare August 5, 2022 18:35

doriable approved these changes Aug 5, 2022

View reviewed changes

unmultimedio force-pushed the jfigueroa/bsr-383-fix-flaky-test branch from 22ad791 to e90ee43 Compare August 5, 2022 21:27

Revert changes, improve code comments

bc7617e

bsr-383

unmultimedio force-pushed the jfigueroa/bsr-383-fix-flaky-test branch from e90ee43 to bc7617e Compare August 5, 2022 21:28

pkwarren approved these changes Aug 5, 2022

View reviewed changes

unmultimedio merged commit 27930ca into main Aug 5, 2022

unmultimedio deleted the jfigueroa/bsr-383-fix-flaky-test branch August 5, 2022 21:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bsr-383] Fix flaky studio agent test #1306

[bsr-383] Fix flaky studio agent test #1306

unmultimedio commented Jul 29, 2022 •

edited

Loading

doriable commented Jul 29, 2022

akshayjshah left a comment

unmultimedio Aug 2, 2022 •

edited

Loading

unmultimedio Aug 2, 2022

unmultimedio commented Aug 2, 2022

pkwarren Aug 5, 2022

unmultimedio Aug 5, 2022 •

edited

Loading

doriable left a comment

pkwarren Aug 5, 2022

unmultimedio Aug 5, 2022

pkwarren Aug 5, 2022

akshayjshah Aug 5, 2022

unmultimedio Aug 5, 2022

unmultimedio Aug 5, 2022

unmultimedio commented Aug 5, 2022

		// Any issue connecting that is not a Connect error is assumed to be because
		// the server is in some kind of bad and/or unreachable state.

		- Change default for `--timeout` flag of `buf beta studio-agent` to `0` (no timeout). Before it was
		`2m` (the default for all the other `buf` commands).

[bsr-383] Fix flaky studio agent test #1306

[bsr-383] Fix flaky studio agent test #1306

Conversation

unmultimedio commented Jul 29, 2022 • edited Loading

doriable commented Jul 29, 2022

akshayjshah left a comment

Choose a reason for hiding this comment

unmultimedio Aug 2, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unmultimedio commented Aug 2, 2022

Choose a reason for hiding this comment

unmultimedio Aug 5, 2022 • edited Loading

Choose a reason for hiding this comment

doriable left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unmultimedio commented Aug 5, 2022

unmultimedio commented Jul 29, 2022 •

edited

Loading

unmultimedio Aug 2, 2022 •

edited

Loading

unmultimedio Aug 5, 2022 •

edited

Loading