
bulk: don't return error in the case of closed/draining consumer #77938

Merged

Conversation

@stevendanna (Collaborator) commented Mar 16, 2022

Users could hit errors such as ReadWithinUncertaintyIntervalError but
would only receive the "unexpected closure of consumer" error previously
returned by the export processors.

Not returning an error in this case is consistent with other callers of
EmitRow.

Not returning the error means that in some cases a retriable error
encountered during export is now retried while it wasn't in the
past. Note that this retry can happen after some files have already
been written to external storage.
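
A minimal sketch of the behaviour change, with illustrative names only (not the actual execinfra types or the real diff in the export processor):

package main

import "fmt"

// consumerStatus mimics, in name only, the status a consumer reports back to a
// producer: keep sending rows, start draining, or fully closed.
type consumerStatus int

const (
	needMoreRows consumerStatus = iota
	drainRequested
	consumerClosed
)

// emitRows shows the pattern: when the consumer is draining or closed, stop
// producing and return nil rather than a synthetic "unexpected closure of
// consumer" error, so the real error already held by the receiver (e.g. a
// ReadWithinUncertaintyIntervalError) is what surfaces or drives a retry.
func emitRows(rows []string, emit func(string) consumerStatus) error {
	for _, r := range rows {
		switch emit(r) {
		case needMoreRows:
			continue
		case drainRequested, consumerClosed:
			return nil // previously: return a custom "unexpected closure of consumer" error
		}
	}
	return nil
}

func main() {
	calls := 0
	emit := func(string) consumerStatus {
		calls++
		if calls > 1 {
			return drainRequested // the consumer started draining after the first row
		}
		return needMoreRows
	}
	fmt.Println(emitRows([]string{"a", "b", "c"}, emit)) // <nil>
}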

Fixes #79229

Release note: Fix issue where some exports would receive "unexpected
closure of consumer" rather than the actual error the export
encountered.

@stevendanna stevendanna requested review from a team and msbutler and removed request for a team March 16, 2022 14:05
@cockroach-teamcity (Member)

This change is Reviewable

@stevendanna (Collaborator, Author)

My main concern here is whether the retry behaviour is acceptable. While we return the list of files just from the successful retry as a response to the query, the external storage will have stray files from the failed attempt. For example, in an export I did when testing, we emitted up to 5 files before the retry kicked in:

ls /mnt/data1/cockroach/extern/exports/ | cut -d '-' -f1 | sort | uniq -c
      5 export16dce13864dabd490000000000000001
   2259 export16dce138996394450000000000000001

So users who have grown accustomed to being able to process all of the files written into the directory after a successful export would potentially see a behaviour change.

@msbutler (Collaborator) commented Mar 16, 2022

This seems like an improvement! I wonder if there's a clean way to wipe the files from the failed export attempt? I guess a user would not want this if they process the csv files as they get written to external storage.

We could also add a new option to export, retry, for which we can clearly state the expected behavior.

@adityamaru (Contributor)

My main concern here is whether the retry behaviour is acceptable

Historically, I think we have maintained that the list of files returned to the user is the only list to be trusted/consumed. I remember previous escalations that led us to add this blurb in the docs - https://www.cockroachlabs.com/docs/stable/export.html#export-file-url

@rafiss (Collaborator) commented Mar 16, 2022

drive-by comment: i was browsing around the repo and saw this. does it need the backport-22.1 label too?

@stevendanna (Collaborator, Author)

@rafiss Definitely does, thanks

@dt (Member) commented Mar 16, 2022

@yuzefovich Are we guaranteed to have something else return the real error if we pretend this case isn't an error and exit quietly?

My reluctance to do this originally was that this code has hit an unexpected/error condition and should error rather than claim it completed successfully. Of course this isn't the real error, and ideally whatever error bubbles up to the user or the retry loop should be the other error that led to this, so returning an error here shouldn't matter if we take the right one, but it seems like we don't take the right one? If we're sure we'll always get the real error in the right place with this silently claiming success, then I guess it is okay to do this?

@yuzefovich (Member) left a comment

IMO it'd be good to refactor the export processors to not use EmitRow directly since they are the only ones using it - I'd remove the usage of NoMetadataRowSource and use rowexec.emitHelper instead.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @msbutler, and @stevendanna)


pkg/sql/importer/exportcsv.go, line 216 at r1 (raw file):

					break
				}
				row, err := input.NextRow()

I currently don't understand how we could replace ReadWithinUncertaintyIntervalError with the custom "unexpected closure". If I'm reading the code right, then ReadWithinUncertaintyIntervalError must be encountered by our input, pushed into the output, and then returned as err here. What am I missing?

@stevendanna (Collaborator, Author) commented Mar 16, 2022

@yuzefovich I don't think we get an error at that point on all processors. Here is the story I told myself based on some traces, but I certainly could have missed something: Processors on at least one node do get the ReadWithinUncertaintyIntervalError from the input, nearly immediately. But the processor on the gateway is unlikely to see a ReadWithinUncertaintyIntervalError, so it is happily reading and eventually goes to write the file. Once written, we EmitRow() and at that point learn that we are supposed to be draining. We then return our custom error, which gets pushed to the DistSQLReceiver. Since it is the higher priority error, it is the one ultimately returned.

@yuzefovich (Member) left a comment

I see, indeed, this explanation makes sense to me. Just adding a bit more detail on how that happens: a remote processor encounters a RWUI error and pushes it to the outbox, the outbox then sends it across the wire, and on the gateway it is pushed into the RowChannel (the Push happens in processProducerMessage in inbound.go). The DistSQLReceiver gets the error, stores it, and then, because it's not a context cancellation, it transitions into DrainRequested status. The updated status is propagated to the RowChannel, and we send the drain signal to the remote node. However, the processor running on the gateway will only observe the updated consumer status of the RowChannel when a new row is pushed into it.

Thus, answering David's question, I think it is guaranteed that there is an error on the DistSQLReceiver whenever a status other than "more rows needed" is observed by the gateway export processor.
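
A simplified, self-contained model of that timing, with hypothetical types (not the actual RowChannel/DistSQLReceiver code): the producer only sees the updated consumer status at the moment it pushes its next row.

package main

import (
	"fmt"
	"sync/atomic"
)

type status int32

const (
	needMoreRows status = iota
	drainRequested
)

// rowChannel stands in for the gateway-side row channel: the consumer flips a
// status flag when it wants to drain, but the producer only reads that flag
// inside Push, i.e. when it tries to emit its next row.
type rowChannel struct {
	st atomic.Int32
}

func (rc *rowChannel) Push(row string) status { return status(rc.st.Load()) }

func (rc *rowChannel) requestDrain() { rc.st.Store(int32(drainRequested)) }

func main() {
	rc := &rowChannel{}
	fmt.Println(rc.Push("part1.csv") == needMoreRows) // true: the processor is still happily reading and writing
	rc.requestDrain()                                 // the receiver stored a RWUI error pushed from a remote node
	fmt.Println(rc.Push("part2.csv") == drainRequested) // true: the export processor only learns about it now
}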

Thanks for taking on the suggestion for using emitHelper. I think we should extract the second commit (with more unification between the two exporters) into a separate PR that will not be backported, and keep only the first commit in this PR so that it can be backported.

The first commit is :lgtm:

Reviewed 2 of 2 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt, @msbutler, and @stevendanna)


-- commits, line 14 at r1:
nit: missing category, i.e. Release note (bug fix).


pkg/sql/importer/exportcsv.go, line 293 at r2 (raw file):

		return nil
	}()
	execinfra.DrainAndClose(ctx, sp.output, err, pushTrailingMeta /* pushTrailingMeta */, sp.input)

nit: no need for the inlined comment anymore.


pkg/sql/importer/exportcsv.go, line 296 at r2 (raw file):

}

func (sp *csvWriter) writeToExternalStorage(

nit: this new method could easily be made without a receiver and reused between two exporters.


pkg/sql/rowexec/processors.go, line 25 at r2 (raw file):

)

// EmitHelper is a utility wrapper on top of ProcOutputHelper.EmitRow().

nit: let's unexport ProcOutputHelper.EmitRow method now.

@yuzefovich (Member) left a comment

I guess it'd be good to add a regression test for the first commit which exercises exactly the scenario described by Steven. For an example on how to inject a RWUI error you could take a look at TestDrainingProcessorSwallowsUncertaintyError in rowexec/processors_test.go.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt, @msbutler, and @stevendanna)

@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch 2 times, most recently from 75b85c6 to e188601 Compare March 17, 2022 15:47
@stevendanna (Collaborator, Author) left a comment

I think we should extract the second commit (with more unification between the two exporters) into a separate PR that will not be backported, and keep only the first commit in this PR so that it can be backported.

I agree. I've removed the second commit and will follow up with a different PR.

I guess it'd be good to add a regression test for the first commit which exercises exactly the scenario described by Steven. For an example on how to inject a RWUI error you could take a look at TestDrainingProcessorSwallowsUncertaintyError in rowexec/processors_test.go.

Thanks for this pointer, that was very helpful. I've added a test that confirms we see the error in the case that we have emitted rows to the client, and a test that confirms that we retry if we haven't emitted rows.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt, @msbutler, @stevendanna, and @yuzefovich)

@yuzefovich (Member) left a comment

Nice! :lgtm:

Reviewed 3 of 7 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt, @msbutler, @stevendanna, and @yuzefovich)


pkg/sql/importer/exportcsv_test.go, line 558 at r4 (raw file):

}

// Test that processors either returns or retries. a ReadWithinUncertaintyIntervalError encountered

nit: s/retries. a/retries a.


pkg/sql/importer/exportcsv_test.go, line 691 at r4 (raw file):

		// The export is issued on node 0.
		// Node 1 is blocked on its read.
		// Node 0 is allowed to read and write a single file to the sink. We do this to prevent an internal retry of the RWUI.

nit: we try to keep the comments with 80 character cap.


pkg/sql/importer/exportcsv_test.go, line 692 at r4 (raw file):

		// Node 1 is blocked on its read.
		// Node 0 is allowed to read and write a single file to the sink. We do this to prevent an internal retry of the RWUI.
		// Node 0 is then be blocked writing the next file

nit: s/be blocked/blocked, also missing period.


pkg/sql/importer/exportcsv_test.go, line 706 at r4 (raw file):

		origDB0.Exec("SET CLUSTER SETTING sql.defaults.results_buffer.size = '0'")
		require.NoError(t, err)
		// Create a new connection that will use this new buffer size default

nit: missing period.


pkg/sql/importer/exportcsv_test.go, line 739 at r4 (raw file):

	t.Run("before result rows are emitted retries", func(t *testing.T) {
		// The export is issued on node 0.
		// Node 1 will immediately return a RWUI

nit: missing period here and two lines below.

@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch 2 times, most recently from 55056b2 to c683e34 Compare March 17, 2022 16:30
@stevendanna (Collaborator, Author)

My reluctance to do this originally was that this code has hit an unexpected/error condition and should error rather than claim it completed successfully.

@dt Happy to add tests for other cases. I dug through the git history a bit and it looks to me that this error return has existed from the original commit of the feature. There was a lot of discussion around this line of code in the original pull request; but reading through that discussion didn't yield much new insight other than reinforcing that it may be time to refactor this processor to move away from using EmitRow().

@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch 2 times, most recently from 48822e2 to ce3c125 Compare March 18, 2022 08:37
@msbutler (Collaborator)

@stevendanna the user story you pointed to here still needs to be addressed, yeah?

users who have grown accustomed to being able to process all of the files written into the directory after a successful export would potentially see a behaviour change

@stevendanna (Collaborator, Author) commented Mar 21, 2022

@stevendanna the user story you pointed to (https://github.com/cockroachdb/cockroach/pull/77938#issuecomment-1069185884) still needs to be addressed, yeah?

A few options we could consider:

  1. Do nothing. This is arguably correct since the query returns the list of files.
  2. Attempt to clean up the files on failure.
  3. Write the files first into a temporary location and then move them all into place once all of the uploads are complete.
  4. Add some API that allows us to communicate back to the conn executor that some external side-effect has been performed and that the transaction should not be retried.

(3) is definitely what I would do if the only thing we were worrying about was filesystem storage with unix-like rename semantics. But, since the common case is writing to cloud storage providers (which often don't have atomic "folder" rename) and since the user specifies the destination and may not expect any writes outside of that given path/prefix, it doesn't seem clear cut to me.
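
A minimal sketch of what (3) could look like, assuming a unix-like filesystem with atomic rename; everything here (paths, helper names) is hypothetical:

package main

import (
	"os"
	"path/filepath"
)

// stageThenPublish sketches option (3) for the unix-like filesystem case only:
// write everything into a temporary sibling directory, then rename it into
// place once all writes succeed, so a failed or retried attempt never leaves
// partial files under the user-visible destination. Most cloud object stores
// have no atomic "folder" rename, which is exactly the objection above.
func stageThenPublish(dest string, write func(dir string) error) error {
	staging := dest + ".tmp"
	if err := os.MkdirAll(staging, 0755); err != nil {
		return err
	}
	if err := write(staging); err != nil {
		_ = os.RemoveAll(staging) // best-effort cleanup of the failed attempt
		return err
	}
	return os.Rename(staging, dest)
}

func main() {
	_ = stageThenPublish("/tmp/exports/run1", func(dir string) error {
		return os.WriteFile(filepath.Join(dir, "part1.csv"), []byte("a,b\n"), 0644)
	})
}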

(4) would likely mean that we never retry RWUI errors because the gateway itself typically is going to be able to write one file before receiving the error. It would still get raised to the user which would be more useful than the current behaviour though.

I think that for any of 2, 3, or 4 to be correct in the face of multiple versions, we likely need a version gate which will make backporting a problem -- although perhaps there is something I just overlooked there since I haven't implemented them yet.

@msbutler (Collaborator) commented Mar 25, 2022

I'm in favor of option 1, plus adding an optional retry parameter to the EXPORT command. Without the retry param, EXPORT behaves as it did before this pr (we could tell the user in the error msg here to consider using the retry parameter). Adding this default off parameter would prevent a surprise UX change.

I also am not sure if we should backport this fix if it changes default UX.

@stevendanna (Collaborator, Author)

(4) would likely mean that we never retry RWUI errors because the gateway itself typically is going to be able to write one file before receiving the error. It would still get raised to the user which would be more useful than the current behaviour though.

We could potentially pair this with some check to see if we are already draining before writing a file.

@dt (Member) commented Mar 30, 2022

I like (1). IIRC we have the flow ID in the file names, so you can use .* if you want as long as you use the flow ID from the names we returned and only get the results from the one that succeeded.

It is unfortunate that we can end up retrying -- and writing -- forever with no indication of why though. If only this were a job.
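
A sketch of the workflow described above, under the assumption (consistent with the ls output earlier in this thread) that each exported file name starts with an export<flowID>- prefix; the helper and file names are illustrative:

package main

import (
	"fmt"
	"strings"
)

// trustedFiles filters a directory listing down to the files produced by the
// attempt whose names the successful EXPORT actually returned, keyed on the
// "export<flowID>-" prefix. The exact file-name format is an assumption here.
func trustedFiles(returned, onDisk []string) []string {
	prefixes := map[string]bool{}
	for _, name := range returned {
		if i := strings.Index(name, "-"); i > 0 {
			prefixes[name[:i]] = true
		}
	}
	var keep []string
	for _, name := range onDisk {
		if i := strings.Index(name, "-"); i > 0 && prefixes[name[:i]] {
			keep = append(keep, name)
		}
	}
	return keep
}

func main() {
	returned := []string{"export16dce13899-n1.1.csv"}
	onDisk := []string{
		"export16dce13864-n1.1.csv", // stray file from the failed attempt
		"export16dce13899-n1.1.csv", // file from the attempt that succeeded
	}
	fmt.Println(trustedFiles(returned, onDisk)) // [export16dce13899-n1.1.csv]
}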

@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch from ce3c125 to e8adfaf Compare April 26, 2022 09:12
@stevendanna (Collaborator, Author)

Given that this is something I'd like to backport to release branches, I don't think it is something we need to push into 22.1.0. But, I would like to wrap something up this week.

I like (1). IIRC we have the flow ID in the file names, so you can use .* if you want as long as you use the flow ID from the names we returned and only get the results from the one that succeeded.

👍 Sounds like you are good with this change as-is then?

I'm in favor of option 1, plus adding an optional retry parameter to the EXPORT command. Without the retry param, EXPORT behaves as it did before this pr (we could tell the user in the error msg here to consider using the retry parameter). Adding this default off parameter would prevent a surprise UX change.

I can look at adding an option here but my guess is that it might require some larger changes to the structure of the processors that I'm not sure would be worth backporting.

@msbutler (Collaborator)

Yeah, maybe it's not worth adding an option to export. Hopefully the FlowID is in the file name, which we should then document.

@yuzefovich (Member)

Just noticed that this hasn't been merged - should we merge it?

@stevendanna (Collaborator, Author)

@yuzefovich Thanks, I'll rebase and merge this today.

EXPORTs could encounter an error such as
ReadWithinUncertaintyIntervalError but only receive the "unexpected
closure of consumer" error previously returned in the modified
branches.

Not returning an error in this case is consistent with other callers
of EmitRow.

Not returning the error means that in some cases a retriable error
encountered during export is now retried while it wasn't in the
past. Note that this retry can happen after some files have already
been written to external storage.

Release note (bug fix): Fix issue where some exports would receive
"unexpected closure of consumer" rather than the actual error the
export encountered.
@stevendanna stevendanna force-pushed the export-no-return-error-on-close branch from e8adfaf to 0545796 Compare June 30, 2022 19:55
@stevendanna stevendanna requested a review from a team June 30, 2022 19:55
@stevendanna (Collaborator, Author)

bors r=yuzefovich

@craig bot (Contributor) commented Jul 1, 2022

Build succeeded:

@craig craig bot merged commit d45ca84 into cockroachdb:master Jul 1, 2022
@blathers-crl bot commented Jul 1, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0545796 to blathers/backport-release-21.1-77938: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.1.x failed. See errors above.


error creating merge commit from 0545796 to blathers/backport-release-21.2-77938: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from 0545796 to blathers/backport-release-22.1-77938: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
