bulk: don't return error in the case of closed/draining consumer #77938
Conversation
My main concern here is whether the retry behaviour is acceptable. While we return the list of files just from the successful retry as the response to the query, the external storage will have stray files from the failed attempt. For example, in an export I did when testing, we emitted up to 5 files before the retry got hit:
So users who have grown accustomed to being able to process all of the files written into the directory after a successful export would potentially see a behaviour change.
This seems like an improvement! I wonder if there's a clean way to wipe the files from the failed export attempt? I guess a user would not want this if they process the csv files as they get written to external storage. We could also add a new option to export,
Historically, I think we have maintained that the list of files returned to the user is the only list to be trusted/consumed. I remember previous escalations that led us to add this blurb in the docs - https://www.cockroachlabs.com/docs/stable/export.html#export-file-url
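To make that docs guidance concrete: a client that treats the returned file list as the source of truth would intersect whatever it finds in the bucket with the names EXPORT reported, rather than globbing the destination. This is a hedged sketch of such a client-side helper; `trustedFiles` and the file names are illustrative, not part of any CockroachDB API.

```go
package main

import "fmt"

// trustedFiles keeps only the files that EXPORT actually reported in its
// result rows, dropping any stray files left behind by failed attempts.
// Hypothetical client-side helper; names are illustrative.
func trustedFiles(reported, inBucket []string) []string {
	ok := make(map[string]bool, len(reported))
	for _, f := range reported {
		ok[f] = true
	}
	var out []string
	for _, f := range inBucket {
		if ok[f] {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	reported := []string{"export16f2b0a9.csv"}
	inBucket := []string{"export16f2b0a9.csv", "export0deadbee.csv"} // second is a stray
	fmt.Println(trustedFiles(reported, inBucket)) // [export16f2b0a9.csv]
}
```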
drive-by comment: i was browsing around the repo and saw this. does it need the
@rafiss Definitely does, thanks
@yuzefovich Are we guaranteed to have something else return the real error if we pretend this case isn't an error and exit quietly? My reluctance to do this originally was that this code has hit an unexpected/error condition and should error rather than claim it completed successfully. Now, of course, this isn't the error, and ideally whatever error bubbles up to the user or the retry loop should be the other error that led to this, so returning an error here shouldn't matter if we take the right one, but it seems like we don't take the right one? If we're sure we'll always get the real error in the right place with this silently claiming success, then I guess it is okay to do this?
IMO it'd be good to refactor the export processors to not use EmitRow directly since they are the only ones using it - I'd remove the usage of NoMetadataRowSource and use rowexec.emitHelper instead.
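The core idea behind an emit-helper style loop can be sketched in isolation: push rows only while the consumer still wants them, and treat drain/closed as a signal to stop producing rather than as an error in its own right (the real error, if any, is already held by the receiver). This is a toy model; the names mimic but are not the actual execinfra types.

```go
package main

import "fmt"

// consumerStatus mirrors the distsql notion of how much more data the
// consumer wants. Illustrative only; the real constants live in execinfra.
type consumerStatus int

const (
	needMoreRows consumerStatus = iota
	drainRequested
	consumerClosed
)

// emitRow is a toy version of what an emit helper does: append the row only
// while the consumer still wants rows, and treat drain/closed as "stop
// producing", NOT as an error to surface.
func emitRow(out *[]string, status consumerStatus, row string) (keepGoing bool) {
	switch status {
	case needMoreRows:
		*out = append(*out, row)
		return true
	default:
		// Drain or closed: stop quietly instead of returning a synthetic
		// "unexpected closure of consumer" error.
		return false
	}
}

func main() {
	var rows []string
	fmt.Println(emitRow(&rows, needMoreRows, "a"))   // true
	fmt.Println(emitRow(&rows, drainRequested, "b")) // false, "b" is dropped
	fmt.Println(len(rows))                           // 1
}
```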
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @msbutler, and @stevendanna)
pkg/sql/importer/exportcsv.go, line 216 at r1 (raw file):
break } row, err := input.NextRow()
I currently don't understand how we could replace ReadWithinUncertaintyIntervalError with the custom "unexpected closure". If I'm reading the code right, then ReadWithinUncertaintyIntervalError must be encountered by our input, pushed into the output, and then returned as err here. What am I missing?
@yuzefovich I don't think we get an error at that point on all processors. Here is the story I told myself based on some traces but I certainly could have missed something: Processors on at least one node do get the
I see, indeed, this explanation makes sense to me. Just adding a bit more detail on how that happens: a remote processor encounters the RWUI error, pushes it to the outbox, the outbox then sends it across the wire, and on the gateway it is pushed into the RowChannel (the Push happens in processProducerMessage in inbound.go). DistSQLReceiver gets the error, stores it, and then, because it's not a context cancellation, it transitions into DrainRequested status. The updated status is propagated to the RowChannel, and we send the drain signal to the remote node. However, the processor running on the gateway will only observe the updated consumer status of the RowChannel when a new row is pushed into it.
Thus, answering David's question, I think it is guaranteed that there is an error on the DistSQLReceiver whenever a status other than "more rows needed" is observed by the gateway export processor.
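The subtle point above — that the producer only notices the drain request on its next push — can be modelled in a few lines. This is a toy stand-in for the real RowChannel, assuming only that status is a flag the consumer flips and the producer polls on Push.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rowChannel is a toy stand-in for distsql's RowChannel: the consumer flips
// the status, but the producer only observes it on its next Push.
type rowChannel struct {
	status int32 // 0 = need more rows, 1 = drain requested
	rows   []string
}

func (rc *rowChannel) requestDrain() { atomic.StoreInt32(&rc.status, 1) }

// Push returns false once the consumer has asked for a drain.
func (rc *rowChannel) Push(row string) bool {
	if atomic.LoadInt32(&rc.status) == 1 {
		return false
	}
	rc.rows = append(rc.rows, row)
	return true
}

func main() {
	rc := &rowChannel{}
	fmt.Println(rc.Push("row1")) // true: consumer still wants rows
	rc.requestDrain()            // the receiver saw the RWUI error and stored it
	fmt.Println(rc.Push("row2")) // false: producer finds out on the next push
}
```

Because the stored error sits with the receiver, a producer that sees `Push` return false can simply drain; surfacing its own "unexpected closure" error would mask the real one.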
Thanks for taking on the suggestion of using emitHelper. I think we should extract the second commit (with more unification between the two exporters) into a separate PR and not backport it; let's keep only the first commit, which will be backported, in this PR.
Reviewed 2 of 2 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dt, @msbutler, and @stevendanna)
-- commits, line 14 at r1:
nit: missing category, i.e. Release note (bug fix).
pkg/sql/importer/exportcsv.go, line 293 at r2 (raw file):
return nil }() execinfra.DrainAndClose(ctx, sp.output, err, pushTrailingMeta /* pushTrailingMeta */, sp.input)
nit: no need for the inlined comment anymore.
pkg/sql/importer/exportcsv.go, line 296 at r2 (raw file):
} func (sp *csvWriter) writeToExternalStorage(
nit: this new method could easily be made without a receiver and reused between the two exporters.
pkg/sql/rowexec/processors.go, line 25 at r2 (raw file):
) // EmitHelper is a utility wrapper on top of ProcOutputHelper.EmitRow().
nit: let's unexport the ProcOutputHelper.EmitRow method now.
I guess it'd be good to add a regression test for the first commit which exercises exactly the scenario described by Steven. For an example on how to inject a RWUI error you could take a look at TestDrainingProcessorSwallowsUncertaintyError in rowexec/processors_test.go.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dt, @msbutler, and @stevendanna)
I think we should extract the second commit (with more unification between the two exporters) into a separate PR and not backport it; let's keep only the first commit, which will be backported, in this PR.
I agree. I've removed the second commit and will follow up with a different PR.
I guess it'd be good to add a regression test for the first commit which exercises exactly the scenario described by Steven. For an example on how to inject a RWUI error you could take a look at TestDrainingProcessorSwallowsUncertaintyError in rowexec/processors_test.go.
Thanks for this pointer, that was very helpful. I've added a test that confirms we see the error in the case that we have emitted rows to the client, and a test that confirms that we retry if we haven't emitted rows.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt, @msbutler, @stevendanna, and @yuzefovich)
Reviewed 3 of 7 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dt, @msbutler, @stevendanna, and @yuzefovich)
pkg/sql/importer/exportcsv_test.go, line 558 at r4 (raw file):
} // Test that processors either returns or retries. a ReadWithinUncertaintyIntervalError encountered
nit: s/retries. a/retries a.
pkg/sql/importer/exportcsv_test.go, line 691 at r4 (raw file):
// The export is issued on node 0. // Node 1 is blocked on its read. // Node 0 is allowed to read and write a single file to the sink. We do this to prevent an internal retry of the RWUI.
nit: we try to keep comments within an 80-character cap.
pkg/sql/importer/exportcsv_test.go, line 692 at r4 (raw file):
// Node 1 is blocked on its read. // Node 0 is allowed to read and write a single file to the sink. We do this to prevent an internal retry of the RWUI. // Node 0 is then be blocked writing the next file
nit: s/be blocked/blocked, also missing period.
pkg/sql/importer/exportcsv_test.go, line 706 at r4 (raw file):
origDB0.Exec("SET CLUSTER SETTING sql.defaults.results_buffer.size = '0'") require.NoError(t, err) // Create a new connection that will use this new buffer size default
nit: missing period.
pkg/sql/importer/exportcsv_test.go, line 739 at r4 (raw file):
t.Run("before result rows are emitted retries", func(t *testing.T) { // The export is issued on node 0. // Node 1 will immediately return a RWUI
nit: missing period here and two lines below.
@dt Happy to add tests for other cases. I dug through the git history a bit and it looks to me like this error return has existed since the original commit of the feature. There was a lot of discussion around this line of code in the original pull request, but reading through that discussion didn't yield much new insight other than reinforcing that it may be time to refactor this processor to move away from using EmitRow().
@stevendanna the user story you pointed to here still needs to be addressed, yeah?
A few options we could consider:
(3) is definitely what I would do if the only thing we were worrying about was filesystem storage with unix-like rename semantics. But, since the common case is writing to cloud storage providers (which often don't have atomic "folder" rename) and since the user specifies the destination and may not expect any writes outside of that given path/prefix, it doesn't seem clear cut to me. (4) would likely mean that we never retry RWUI errors because the gateway itself typically is going to be able to write one file before receiving the error. It would still get raised to the user, which would be more useful than the current behaviour, though. I think that for any of 2, 3, or 4 to be correct in the face of multiple versions, we likely need a version gate, which will make backporting a problem -- although perhaps there is something I just overlooked there since I haven't implemented them yet.
I'm in favor of option 1, plus adding an optional I also am not sure if we should backport this fix if it changes default UX.
We could potentially pair this with some check to see if we are already draining before writing a file.
I like (1). IIRC we have the flow ID in the file names, so you can use .* if you want, as long as you use the flow ID from the names we returned and only get the results from the one that succeeded. It is unfortunate that we can end up retrying -- and writing -- forever with no indication of why, though. If only this were a job.
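If the flow ID really is embedded in the file names as recalled above, a consumer could filter a bucket listing down to the successful attempt like this. A hedged sketch: the naming scheme and `filesForFlow` helper are assumptions for illustration; check the actual names EXPORT returns before relying on this.

```go
package main

import (
	"fmt"
	"strings"
)

// filesForFlow keeps only files whose name contains the flow ID taken from
// the names returned by the successful EXPORT attempt. The naming scheme is
// illustrative, not a documented guarantee.
func filesForFlow(all []string, flowID string) []string {
	var out []string
	for _, f := range all {
		if strings.Contains(f, flowID) {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// One file from the successful flow, one stray from a failed attempt.
	all := []string{"export16f2-n1.0.csv", "exportdead-n1.0.csv"}
	fmt.Println(filesForFlow(all, "16f2")) // [export16f2-n1.0.csv]
}
```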
Given that this is something I'd like to backport to release branches, I don't think it is something we need to push into 22.1.0. But, I would like to wrap something up this week.
👍 Sounds like you are good with this change as-is then?
I can look at adding an option here but my guess is that it might require some larger changes to the structure of the processors that I'm not sure would be worth backporting.
Yeah, maybe it's not worth adding an option to export. Hopefully the FlowID is in the file name, which we should then document.
Just noticed that this hasn't been merged - should we merge it?
@yuzefovich Thanks, I'll rebase and merge this today.
EXPORTs could encounter an error such as ReadWithinUncertaintyIntervalError but only receive the "unexpected closure of consumer" error previously returned in the modified branches. Not returning an error in this case is consistent with other callers of EmitRow.

Not returning the error means that in some cases a retriable error encountered during export is now retried while it wasn't in the past. Note that this retry can happen after some files have already been written to external storage.

Release note (bug fix): Fix issue where some exports would receive "unexpected closure of consumer" rather than the actual error the export encountered.
bors r=yuzefovich
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from 0545796 to blathers/backport-release-21.1-77938: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.1.x failed. See errors above.

error creating merge commit from 0545796 to blathers/backport-release-21.2-77938: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above.

error creating merge commit from 0545796 to blathers/backport-release-22.1-77938: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Users could hit errors such as ReadWithinUncertaintyIntervalError but
would only receive the "unexpected closure of consumer" error previously
returned by the export processors.
Not returning an error in this case is consistent with other callers of
EmitRow.
Not returning the error means that in some cases a retriable error
encountered during export is now retried while it wasn't in the
past. Note that this retry can happen after some files have already
been written to external storage.
Fixes #79229
Release note (bug fix): Fix issue where some exports would receive "unexpected
closure of consumer" rather than the actual error the export
encountered.