-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bsr-383] Fix flaky studio agent test #1306
Conversation
Took a look at this and had a quick chat about this with @unmultimedio offline, summarising our points here:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caught up with Doria about this today.
I think that this approach to producing errors will always be flaky to one extent or another. The core of the problem is that it's very difficult to synchronize between the accept goroutine and the net/http
code in HTTPClient.Do
- even with the proposed changes, we don't have any guarantee that the connection is closed before the request issues the syscall to read data. connect-go
tries to inspect the Go errors from HTTPClient.Do
and assign meaningful error codes, but it's never going to be perfect.
In short, it's difficult to reliably make an HTTP request using a TCP connection which the remote has already closed. I can imagine some slightly more elaborate code to improve what we have, but I don't think it's worthwhile.
To me, the tests would be simpler to read, have no flakes, and be more exhaustive if we factored the error-to-HTTP-status mapping code out of plainPostHandler.ServeHTTP
to a helper function. We can then easily test handling of every connect error code, which serves as documentation of our intentions. (Reading this code, for example, I can't tell why we don't also map ResourceExhausted
to 503 - did we just forget, or do we specifically not want to do that?) This approach sacrifices some "realism," but it doesn't look to me like that realism is particularly valuable here.
go func() { | ||
conn, err := listener.Accept() | ||
require.NoError(t, err) | ||
require.NoError(t, conn.Close()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is inherently flaky because of the behavior of http clients handling closed connections. So we can write a similiar test instead, to return the Unknown code, and test the same behavior.
// Any issue connecting that is not a Connect error is assumed to be because | ||
// the server is in some kind of bad and/or unreachable state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for checking net.OpErr
or url.Error
, all of those cases were doing the same thing as this catch-all. Same for the Unknown
and CodeUnavailable
connect codes.
@doriable @akshayjshah thanks for the revisions and help. As right now the only two codes that cannot be handled with an |
@@ -236,6 +227,13 @@ func newTestConnectServer(t *testing.T, tls bool) *httptest.Server { | |||
}, | |||
connect.WithCodec(&bufferCodec{name: "proto"}), | |||
)) | |||
mux.Handle(unknownPath, connect.NewUnaryHandler( | |||
errorPath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be unknownPath
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! will fix Fixed in the latest commit, thx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable to me.
I think given the slight behavioural change of catching connect.CodeUnavailable
, we may want to add a changelog entry for the buf beta studio-agent
command, but I'm going to approve first on the code :)
if urlErr := new(url.Error); errors.As(err, &urlErr) { | ||
http.Error(w, err.Error(), http.StatusBadGateway) | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need https://github.com/bufbuild/connect-go/issues/222 to accurately determine what to return here? I don't like that to get a test passing we're treating any CodeUnavailable errors (whether from client connection or returned from server) as StatusBadGateway.
I'd feel better if this fix was restricted to the unit tests and didn't change this logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you check carefully the code that was written, the only logic I changed was adding connect.CodeUnavailable
as another connect error code to trigger an http.StatusBadGateway
. All the other cases: net.OpError, url.Error, or the catch-all in L196, all of them were doing the same thing, triggering an http.StatusBadGateway
. All I did was simplify that code into a single catch-all in new L186.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main difference from before is that if you had a handler that returned CodeUnavailable (perhaps with error details) and it didn't wrap net.OpError or url.Error, we'd call i.writeProtoMessage
with the error metadata.
Now we're treating CodeUnavailable (whether from a network error or from the handler) as StatusBadGateway. This means that people who use CodeUnavailable for whatever reason in their handler won't be able to dig into more details in the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, this PR is purely attempting to improve test flakiness. Is that correct? If so, it's important but not super urgent.
@pkwarren is correct about the error handling gotchas we're introducing here. If this change can wait a few days, or if we're comfortable reverting it in a few days, let's address bufbuild/connect-go#222 and use that logic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had a zoom with @pkwarren, explained me the error logic here, thank you! I'll update this PR to just be skipping the flaky test, and improving comments.
- Change default for `--timeout` flag of `buf beta studio-agent` to `0` (no timeout). Before it was | ||
`2m` (the default for all the other `buf` commands). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing from https://github.com/bufbuild/buf/pull/1305/files
54dcfbf
to
5488985
Compare
5488985
to
902ddba
Compare
@doriable Added in the latest commit :) |
22ad791
to
e90ee43
Compare
e90ee43
to
bc7617e
Compare
* Sometimes conn.Close is mapped as unavailable? bsr-383 * Comment for CodeUnavailable, and sync test * test * Setting custom timeout to the agent client * Rewrite how StatusBadGateway is triggered bsr-383 * Remove agent server timeout * Changelog and test path fix bsr-409 * Revert changes, improve code comments bsr-383
Apparently, when doing
conn.Close()
in the test, to fake a server-closed connection, it's mapped sometimes asconnect.CodeUnavailable
instead ofconnect.CodeUnknown
, that's what's causing the flakiness.Dig into
connect-go
, and it's being returned in the envelope, by reading the error from the request.The error that's being triggered is on the Connect's
httpClient.Do
invocation, anet/url.Error
:Which later is unwrapped into a regular error string
"io: read/write on closed pipe"
, and then wrapped again into aconnect.CodeUnavailable
.That's the one that's being returned, but when doing
conn.Close()
the test is expecting aBadGateway
.Not sure if this is expected, or what's the original intention of the test
t.Run("invalid_upstream",
. Ticket BSR-383tbh, not sure if the fix is here, or in Connect by returning a
CodeUnknown
instead ofCodeUnavailable
on the catch-all non-Connect error. Related commit.