
bulk: don't return error in the case of closed/draining consumer #77938

Merged

Conversation

@stevendanna (Collaborator) commented Mar 16, 2022

Users could hit errors such as ReadWithinUncertaintyIntervalError but
would only receive the "unexpected closure of consumer" error previously
returned by the export processors.

Not returning an error in this case is consistent with other callers of
EmitRow.

Not returning the error means that in some cases a retriable error
encountered during export is now retried while it wasn't in the
past. Note that this retry can happen after some files have already
been written to external storage.
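
A minimal sketch of the behaviour change, with illustrative names only (not the actual execinfra types or the real diff in the export processor):

package main

import "fmt"

// consumerStatus mimics, in name only, the status a consumer reports back to a
// producer: keep sending rows, start draining, or fully closed.
type consumerStatus int

const (
	needMoreRows consumerStatus = iota
	drainRequested
	consumerClosed
)

// emitRows shows the pattern: when the consumer is draining or closed, stop
// producing and return nil rather than a synthetic "unexpected closure of
// consumer" error, so the real error already held by the receiver (e.g. a
// ReadWithinUncertaintyIntervalError) is what surfaces or drives a retry.
func emitRows(rows []string, emit func(string) consumerStatus) error {
	for _, r := range rows {
		switch emit(r) {
		case needMoreRows:
			continue
		case drainRequested, consumerClosed:
			return nil // previously: return a custom "unexpected closure of consumer" error
		}
	}
	return nil
}

func main() {
	calls := 0
	emit := func(string) consumerStatus {
		calls++
		if calls > 1 {
			return drainRequested // the consumer started draining after the first row
		}
		return needMoreRows
	}
	fmt.Println(emitRows([]string{"a", "b", "c"}, emit)) // <nil>
}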

Fixes #79229

Release note: Fix issue where some exports would receive "unexpected
closure of consumer" rather than the actual error the export
encountered.

@stevendanna stevendanna requested review from a team and msbutler and removed request for a team March 16, 2022 14:05
@cockroach-teamcity (Member)

This change is Reviewable

@stevendanna (Collaborator, Author)

My main concern here is whether the retry behaviour is acceptable. While we return the list of files just from the successful retry as a response to the query, the external storage will have stray files from the failed attempt. For example, in an export I did when testing, we emitted up to 5 files before the retry kicked in:

ls /mnt/data1/cockroach/extern/exports/ | cut -d '-' -f1 | sort | uniq -c
      5 export16dce13864dabd490000000000000001
   2259 export16dce138996394450000000000000001

So users who have grown accustomed to being able to process all of the files written into the directory after a successful export would potentially see a behaviour change.

@msbutler (Collaborator) commented Mar 16, 2022

This seems like an improvement! I wonder if there's a clean way to wipe the files from the failed export attempt? I guess a user would not want this if they process the csv files as they get written to external storage.

We could also add a new option to export, retry, for which we can clearly state the expected behavior.

@adityamaru (Contributor)

My main concern here is whether the retry behaviour is acceptable

Historically, I think we have maintained that the list of files returned to the user is the only list to be trusted/consumed. I remember previous escalations that led us to add this blurb in the docs - https://www.cockroachlabs.com/docs/stable/export.html#export-file-url

@rafiss (Collaborator) commented Mar 16, 2022

drive-by comment: i was browsing around the repo and saw this. does it need the backport-22.1 label too?

@stevendanna (Collaborator, Author)

@rafiss Definitely does, thanks

@dt (Member) commented Mar 16, 2022

@yuzefovich Are we guaranteed to have something else return the real error if we pretend this case isn't an error and exit quietly?

My reluctance to do this originally was that this code has hit an unexpected/error condition and should error rather than claim it completed successfully. Of course this isn't the real error, and ideally whatever error bubbles up to the user or the retry loop should be the other error that led to this, so returning an error here shouldn't matter if we take the right one, but it seems like we don't take the right one? If we're sure we'll always get the real error in the right place with this silently claiming success, then I guess it is okay to do this?

@yuzefovich (Member) left a comment

IMO it'd be good to refactor the export processors to not use EmitRow directly since they are the only ones using it - I'd remove the usage of NoMetadataRowSource and use rowexec.emitHelper instead.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @msbutler, and @stevendanna)


pkg/sql/importer/exportcsv.go, line 216 at r1 (raw file):

					break
				}
				row, err := input.NextRow()

I currently don't understand how we could replace ReadWithinUncertaintyIntervalError with the custom "unexpected closure". If I'm reading the code right, then ReadWithinUncertaintyIntervalError must be encountered by our input, pushed into the output, and then returned as err here. What am I missing?

@stevendanna (Collaborator, Author) commented Mar 16, 2022

@yuzefovich I don't think we get an error at that point on all processors. Here is the story I told myself based on some traces, but I certainly could have missed something: Processors on at least one node do get the ReadWithinUncertaintyIntervalError from the input, nearly immediately. But the processor on the gateway is unlikely to see a ReadWithinUncertaintyIntervalError, so it is happily reading and eventually goes to write the file. Once written, we EmitRow() and at that point learn that we are supposed to be draining. We then return our custom error, which gets pushed to the DistSQLReceiver. Since it is the higher priority error, it is the one ultimately returned.

@yuzefovich (Member) left a comment

I see, indeed, this explanation makes sense to me. Just adding a bit more detail on how that happens: a remote processor encounters a RWUI error and pushes it to the outbox, the outbox then sends it across the wire, and on the gateway it is pushed into the RowChannel (the Push happens in processProducerMessage in inbound.go). The DistSQLReceiver gets the error, stores it, and then, because it's not a context cancellation, it transitions into DrainRequested status. The updated status is propagated to the RowChannel, and we send the drain signal to the remote node. However, the processor running on the gateway will only observe the updated consumer status of the RowChannel when a new row is pushed into it.

Thus, answering David's question, I think it is guaranteed that there is an error on the DistSQLReceiver whenever a status other than "more rows needed" is observed by the gateway export processor.
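
A simplified, self-contained model of that timing, with hypothetical types (not the actual RowChannel/DistSQLReceiver code): the producer only sees the updated consumer status at the moment it pushes its next row.

package main

import (
	"fmt"
	"sync/atomic"
)

type status int32

const (
	needMoreRows status = iota
	drainRequested
)

// rowChannel stands in for the gateway-side row channel: the consumer flips a
// status flag when it wants to drain, but the producer only reads that flag
// inside Push, i.e. when it tries to emit its next row.
type rowChannel struct {
	st atomic.Int32
}

func (rc *rowChannel) Push(row string) status { return status(rc.st.Load()) }

func (rc *rowChannel) requestDrain() { rc.st.Store(int32(drainRequested)) }

func main() {
	rc := &rowChannel{}
	fmt.Println(rc.Push("part1.csv") == needMoreRows) // true: the processor is still happily reading and writing
	rc.requestDrain()                                 // the receiver stored a RWUI error pushed from a remote node
	fmt.Println(rc.Push("part2.csv") == drainRequested) // true: the export processor only learns about it now
}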

Thanks for taking on the suggestion for using emitHelper. I think we should extract the second commit (with more unification between the two exporters) into a separate PR that will not be backported, and keep only the first commit in this PR so that it can be backported.

The first commit is :lgtm:

Reviewed 2 of 2 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt, @msbutler, and @stevendanna)


-- commits, line 14 at r1:
nit: missing category, i.e. Release note (bug fix).


pkg/sql/importer/exportcsv.go, line 293 at r2 (raw file):

		return nil
	}()
	execinfra.DrainAndClose(ctx, sp.output, err, pushTrailingMeta /* pushTrailingMeta */, sp.input)

nit: no need for the inlined comment anymore.


pkg/sql/importer/exportcsv.go, line 296 at r2 (raw file):

}

func (sp *csvWriter) writeToExternalStorage(

nit: this new method could easily be made without a receiver and reused between two exporters.


pkg/sql/rowexec/processors.go, line 25 at r2 (raw file):

)

// EmitHelper is a utility wrapper on top of ProcOutputHelper.EmitRow().

nit: let's unexport ProcOutputHelper.EmitRow method now.

@yuzefovich (Member) left a comment

I guess it'd be good to add a regression test for the first commit which exercises exactly the scenario described by Steven. For an example on how to inject a RWUI error you could take a look at TestDrainingProcessorSwallowsUncertaintyError in rowexec/processors_test.go.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt, @msbutler, and @stevendanna)

@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch 2 times, most recently from 75b85c6 to e188601 Compare March 17, 2022 15:47
@stevendanna (Collaborator, Author) left a comment

I think we should extract the second commit (with more unification between the two exporters) into a separate PR that will not be backported, and keep only the first commit in this PR so that it can be backported.

I agree. I've removed the second commit and will follow up with a different PR.

I guess it'd be good to add a regression test for the first commit which exercises exactly the scenario described by Steven. For an example on how to inject a RWUI error you could take a look at TestDrainingProcessorSwallowsUncertaintyError in rowexec/processors_test.go.

Thanks for this pointer, that was very helpful. I've added a test that confirms we see the error in the case that we have emitted rows to the client, and a test that confirms that we retry if we haven't emitted rows.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt, @msbutler, @stevendanna, and @yuzefovich)

@yuzefovich (Member) left a comment

Nice! :lgtm:

Reviewed 3 of 7 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt, @msbutler, @stevendanna, and @yuzefovich)


pkg/sql/importer/exportcsv_test.go, line 558 at r4 (raw file):

}

// Test that processors either returns or retries. a ReadWithinUncertaintyIntervalError encountered

nit: s/retries. a/retries a.


pkg/sql/importer/exportcsv_test.go, line 691 at r4 (raw file):

		// The export is issued on node 0.
		// Node 1 is blocked on its read.
		// Node 0 is allowed to read and write a single file to the sink. We do this to prevent an internal retry of the RWUI.

nit: we try to keep the comments with 80 character cap.


pkg/sql/importer/exportcsv_test.go, line 692 at r4 (raw file):

		// Node 1 is blocked on its read.
		// Node 0 is allowed to read and write a single file to the sink. We do this to prevent an internal retry of the RWUI.
		// Node 0 is then be blocked writing the next file

nit: s/be blocked/blocked, also missing period.


pkg/sql/importer/exportcsv_test.go, line 706 at r4 (raw file):

		origDB0.Exec("SET CLUSTER SETTING sql.defaults.results_buffer.size = '0'")
		require.NoError(t, err)
		// Create a new connection that will use this new buffer size default

nit: missing period.


pkg/sql/importer/exportcsv_test.go, line 739 at r4 (raw file):

	t.Run("before result rows are emitted retries", func(t *testing.T) {
		// The export is issued on node 0.
		// Node 1 will immediately return a RWUI

nit: missing period here and two lines below.

@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch 2 times, most recently from 55056b2 to c683e34 Compare March 17, 2022 16:30
@stevendanna (Collaborator, Author)

My reluctance to do this originally was that this code has hit an unexpected/error condition and should error rather than claim it completed successfully.

@dt Happy to add tests for other cases. I dug through the git history a bit and it looks to me that this error return has existed from the original commit of the feature. There was a lot of discussion around this line of code in the original pull request; but reading through that discussion didn't yield much new insight other than reinforcing that it may be time to refactor this processor to move away from using EmitRow().

@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch 2 times, most recently from 48822e2 to ce3c125 Compare March 18, 2022 08:37
@msbutler (Collaborator)

@stevendanna the user story you pointed to here still needs to be addressed, yeah?

users who have grown accustomed to being able to process all of the files written into the directory after a successful export would potentially see a behaviour change

@stevendanna (Collaborator, Author) commented Mar 21, 2022

@stevendanna the user story you pointed to (https://github.com/cockroachdb/cockroach/pull/77938#issuecomment-1069185884) still needs to be addressed, yeah?

A few options we could consider:

  1. Do nothing. This is arguably correct since the query returns the list of files.
  2. Attempt to clean up the files on failure.
  3. Write the files first into a temporary location and then move them all into place once all of the uploads are complete.
  4. Add some API that allows us to communicate back to the conn executor that some external side-effect has been performed and that the transaction should not be retried.

(3) is definitely what I would do if the only thing we were worrying about was filesystem storage with unix-like rename semantics. But, since the common case is writing to cloud storage providers (which often don't have atomic "folder" rename) and since the user specifies the destination and may not expect any writes outside of that given path/prefix, it doesn't seem clear cut to me.
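
A minimal sketch of what (3) could look like, assuming a unix-like filesystem with atomic rename; everything here (paths, helper names) is hypothetical:

package main

import (
	"os"
	"path/filepath"
)

// stageThenPublish sketches option (3) for the unix-like filesystem case only:
// write everything into a temporary sibling directory, then rename it into
// place once all writes succeed, so a failed or retried attempt never leaves
// partial files under the user-visible destination. Most cloud object stores
// have no atomic "folder" rename, which is exactly the objection above.
func stageThenPublish(dest string, write func(dir string) error) error {
	staging := dest + ".tmp"
	if err := os.MkdirAll(staging, 0755); err != nil {
		return err
	}
	if err := write(staging); err != nil {
		_ = os.RemoveAll(staging) // best-effort cleanup of the failed attempt
		return err
	}
	return os.Rename(staging, dest)
}

func main() {
	_ = stageThenPublish("/tmp/exports/run1", func(dir string) error {
		return os.WriteFile(filepath.Join(dir, "part1.csv"), []byte("a,b\n"), 0644)
	})
}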

(4) would likely mean that we never retry RWUI errors because the gateway itself typically is going to be able to write one file before receiving the error. It would still get raised to the user which would be more useful than the current behaviour though.

I think that for any of 2, 3, or 4 to be correct in the face of multiple versions, we likely need a version gate which will make backporting a problem -- although perhaps there is something I just overlooked there since I haven't implemented them yet.

@msbutler (Collaborator) commented Mar 25, 2022

I'm in favor of option 1, plus adding an optional retry parameter to the EXPORT command. Without the retry param, EXPORT behaves as it did before this pr (we could tell the user in the error msg here to consider using the retry parameter). Adding this default off parameter would prevent a surprise UX change.

I also am not sure if we should backport this fix if it changes default UX.

@stevendanna (Collaborator, Author)

(4) would likely mean that we never retry RWUI errors because the gateway itself typically is going to be able to write one file before receiving the error. It would still get raised to the user which would be more useful than the current behaviour though.

We could potentially pair this with some check to see if we are already draining before writing a file.

@dt (Member) commented Mar 30, 2022

I like (1). IIRC we have the flow ID in the file names, so you can use .* if you want as long as you use the flow ID from the names we returned and only get the results from the one that succeeded.

It is unfortunate that we can end up retrying -- and writing -- forever with no indication of why though. If only this were a job.
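
A sketch of the workflow described above, under the assumption (consistent with the ls output earlier in this thread) that each exported file name starts with an export<flowID>- prefix; the helper and file names are illustrative:

package main

import (
	"fmt"
	"strings"
)

// trustedFiles filters a directory listing down to the files produced by the
// attempt whose names the successful EXPORT actually returned, keyed on the
// "export<flowID>-" prefix. The exact file-name format is an assumption here.
func trustedFiles(returned, onDisk []string) []string {
	prefixes := map[string]bool{}
	for _, name := range returned {
		if i := strings.Index(name, "-"); i > 0 {
			prefixes[name[:i]] = true
		}
	}
	var keep []string
	for _, name := range onDisk {
		if i := strings.Index(name, "-"); i > 0 && prefixes[name[:i]] {
			keep = append(keep, name)
		}
	}
	return keep
}

func main() {
	returned := []string{"export16dce13899-n1.1.csv"}
	onDisk := []string{
		"export16dce13864-n1.1.csv", // stray file from the failed attempt
		"export16dce13899-n1.1.csv", // file from the attempt that succeeded
	}
	fmt.Println(trustedFiles(returned, onDisk)) // [export16dce13899-n1.1.csv]
}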

@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch from ce3c125 to e8adfaf Compare April 26, 2022 09:12
@stevendanna (Collaborator, Author)

Given that this is something I'd like to backport to release branches, I don't think it is something we need to push into 22.1.0. But, I would like to wrap something up this week.

I like (1). IIRC we have the flow ID in the file names, so you can use .* if you want as long as you use the flow ID from the names we returned and only get the results from the one that succeeded.

👍 Sounds like you are good with this change as-is then?

I'm in favor of option 1, plus adding an optional retry parameter to the EXPORT command. Without the retry param, EXPORT behaves as it did before this pr (we could tell the user in the error msg here to consider using the retry parameter). Adding this default off parameter would prevent a surprise UX change.

I can look at adding an option here but my guess is that it might require some larger changes to the structure of the processors that I'm not sure would be worth backporting.

@msbutler (Collaborator)

Yeah, maybe it's not worth adding an option to export. Hopefully the FlowID is in the file name, which we should then document.

@yuzefovich (Member)

Just noticed that this hasn't been merged - should we merge it?

@stevendanna (Collaborator, Author)

@yuzefovich Thanks, I'll rebase and merge this today.

EXPORTs could encounter an error such as
ReadWithinUncertaintyIntervalError but only receive the "unexpected
closure of consumer" error previously returned in the modified
branches.

Not returning an error in this case is consistent with other callers
of EmitRow.

Not returning the error means that in some cases a retriable error
encountered during export is now retried while it wasn't in the
past. Note that this retry can happen after some files have already
been written to external storage.

Release note (bug fix): Fix issue where some exports would receive
"unexpected closure of consumer" rather than the actual error the
export encountered.
@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch from e8adfaf to 0545796 Compare June 30, 2022 19:55
@stevendanna stevendanna requested a review from a team June 30, 2022 19:55
@stevendanna (Collaborator, Author)

bors r=yuzefovich

@craig bot (Contributor) commented Jul 1, 2022

Build succeeded:

@craig craig bot merged commit d45ca84 into cockroachdb:master Jul 1, 2022
@blathers-crl bot commented Jul 1, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0545796 to blathers/backport-release-21.1-77938: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.1.x failed. See errors above.


error creating merge commit from 0545796 to blathers/backport-release-21.2-77938: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from 0545796 to blathers/backport-release-22.1-77938: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
