Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bsr-383] Fix flaky studio agent test #1306

Merged
merged 9 commits into from
Aug 5, 2022

Conversation

unmultimedio
Copy link
Member

@unmultimedio unmultimedio commented Jul 29, 2022

Apparently, when doing conn.Close() in the test, to fake a server-closed connection, it's mapped sometimes as connect.CodeUnavailable instead of connect.CodeUnknown, that's what's causing the flakiness.

Dig into connect-go, and it's being returned in the envelope, by reading the error from the request.

The error that's being triggered is on the Connect's httpClient.Do invocation, a net/url.Error:

error(*net/url.Error) *{Op: "Post", URL: "http://127.0.0.1:56188", Err: error(*errors.errorString) *{s: "io: read/write on closed pipe"}}

Which later is unwrapped into a regular error string "io: read/write on closed pipe", and then wrapped again into a connect.CodeUnavailable.

That's the one that's being returned, but when doing conn.Close() the test is expecting a BadGateway.

Not sure if this is expected, or what's the original intention of the test t.Run("invalid_upstream",. Ticket BSR-383

tbh, not sure if the fix is here, or in Connect by returning a CodeUnknown instead of CodeUnavailable on the catch-all non-Connect error. Related commit.

@doriable
Copy link
Member

Took a look at this and had a quick chat about this with @unmultimedio offline, summarising our points here:

  • I think fix should definitely be on this side of things, since the flakiness seems to be caused by the fact that conn.Close() is called in a go routine
  • We're going to look to clean-up the flow of the test, but it is also worth noting/considering that CodeUnavailable should maybe also be a BadGateway since it should deter retry mechanisms

Copy link
Member

@akshayjshah akshayjshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught up with Doria about this today.

I think that this approach to producing errors will always be flaky to one extent or another. The core of the problem is that it's very difficult to synchronize between the accept goroutine and the net/http code in HTTPClient.Do - even with the proposed changes, we don't have any guarantee that the connection is closed before the request issues the syscall to read data. connect-go tries to inspect the Go errors from HTTPClient.Do and assign meaningful error codes, but it's never going to be perfect.

In short, it's difficult to reliably make an HTTP request using a TCP connection which the remote has already closed. I can imagine some slightly more elaborate code to improve what we have, but I don't think it's worthwhile.

To me, the tests would be simpler to read, have no flakes, and be more exhaustive if we factored the error-to-HTTP-status mapping code out of plainPostHandler.ServeHTTP to a helper function. We can then easily test handling of every connect error code, which serves as documentation of our intentions. (Reading this code, for example, I can't tell why we don't also map ResourceExhausted to 503 - did we just forget, or do we specifically not want to do that?) This approach sacrifices some "realism," but it doesn't look to me like that realism is particularly valuable here.

go func() {
conn, err := listener.Accept()
require.NoError(t, err)
require.NoError(t, conn.Close())
Copy link
Member Author

@unmultimedio unmultimedio Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is inherently flaky because of the behavior of http clients handling closed connections. So we can write a similiar test instead, to return the Unknown code, and test the same behavior.

Comment on lines 184 to 185
// Any issue connecting that is not a Connect error is assumed to be because
// the server is in some kind of bad and/or unreachable state.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for checking net.OpErr or url.Error, all of those cases were doing the same thing as this catch-all. Same for the Unknown and CodeUnavailable connect codes.

@unmultimedio
Copy link
Member Author

@doriable @akshayjshah thanks for the revisions and help. As right now the only two codes that cannot be handled with an invokeResponse and err.Meta() are Unavailable and Unknown, it's easier to have those two conditions in the code inlined. But if that list grows, and other cases appear, we should definitely take Akshay's approach to having a map of connect codes to http status codes for the studio agent.

@unmultimedio unmultimedio marked this pull request as ready for review August 2, 2022 21:23
@@ -236,6 +227,13 @@ func newTestConnectServer(t *testing.T, tls bool) *httptest.Server {
},
connect.WithCodec(&bufferCodec{name: "proto"}),
))
mux.Handle(unknownPath, connect.NewUnaryHandler(
errorPath,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be unknownPath?

Copy link
Member Author

@unmultimedio unmultimedio Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! will fix Fixed in the latest commit, thx.

Copy link
Member

@doriable doriable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me.
I think given the slight behavioural change of catching connect.CodeUnavailable, we may want to add a changelog entry for the buf beta studio-agent command, but I'm going to approve first on the code :)

if urlErr := new(url.Error); errors.As(err, &urlErr) {
http.Error(w, err.Error(), http.StatusBadGateway)
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need https://github.com/bufbuild/connect-go/issues/222 to accurately determine what to return here? I don't like that to get a test passing we're treating any CodeUnavailable errors (whether from client connection or returned from server) as StatusBadGateway.

I'd feel better if this fix was restricted to the unit tests and didn't change this logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you check carefully the code that was written, the only logic I changed was adding connect.CodeUnavailable as another connect error code to trigger an http.StatusBadGateway. All the other cases: net.OpError, url.Error, or the catch-all in L196, all of them were doing the same thing, triggering an http.StatusBadGateway. All I did was simplify that code into a single catch-all in new L186.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main difference from before is that if you had a handler that returned CodeUnavailable (perhaps with error details) and it didn't wrap net.OpError or url.Error, we'd call i.writeProtoMessage with the error metadata.

Now we're treating CodeUnavailable (whether from a network error or from the handler) as StatusBadGateway. This means that people who use CodeUnavailable for whatever reason in their handler won't be able to dig into more details in the error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, this PR is purely attempting to improve test flakiness. Is that correct? If so, it's important but not super urgent.

@pkwarren is correct about the error handling gotchas we're introducing here. If this change can wait a few days, or if we're comfortable reverting it in a few days, let's address bufbuild/connect-go#222 and use that logic here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a zoom with @pkwarren, explained me the error logic here, thank you! I'll update this PR to just be skipping the flaky test, and improving comments.

Comment on lines +6 to +7
- Change default for `--timeout` flag of `buf beta studio-agent` to `0` (no timeout). Before it was
`2m` (the default for all the other `buf` commands).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unmultimedio unmultimedio force-pushed the jfigueroa/bsr-383-fix-flaky-test branch 2 times, most recently from 54dcfbf to 5488985 Compare August 5, 2022 18:29
@unmultimedio unmultimedio force-pushed the jfigueroa/bsr-383-fix-flaky-test branch from 5488985 to 902ddba Compare August 5, 2022 18:35
@unmultimedio
Copy link
Member Author

we may want to add a changelog entry for the buf beta studio-agent command

@doriable Added in the latest commit :)

@unmultimedio unmultimedio force-pushed the jfigueroa/bsr-383-fix-flaky-test branch from 22ad791 to e90ee43 Compare August 5, 2022 21:27
@unmultimedio unmultimedio force-pushed the jfigueroa/bsr-383-fix-flaky-test branch from e90ee43 to bc7617e Compare August 5, 2022 21:28
@unmultimedio unmultimedio merged commit 27930ca into main Aug 5, 2022
@unmultimedio unmultimedio deleted the jfigueroa/bsr-383-fix-flaky-test branch August 5, 2022 21:44
Monirul1 pushed a commit to Monirul1/buf that referenced this pull request Apr 30, 2023
* Sometimes conn.Close is mapped as unavailable?

bsr-383

* Comment for CodeUnavailable, and sync test

* test

* Setting custom timeout to the agent client

* Rewrite how StatusBadGateway is triggered

bsr-383

* Remove agent server timeout

* Changelog and test path fix

bsr-409

* Revert changes, improve code comments

bsr-383
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants