Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cdc: Treat node draining errors as retryable. #49743

Merged
merged 1 commit into from
Jun 2, 2020

Conversation

miretskiy
Copy link
Contributor

@miretskiy miretskiy commented May 31, 2020

Fixes #46515
Fixes #43771

Handle flow registration errors due to draining node as retryable.

Release notes (reliability): Treat errors due to draining nodes
as retryable when starting CDC.

@miretskiy miretskiy requested review from ajwerner, a team and pbardea and removed request for a team May 31, 2020 20:04
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@miretskiy miretskiy force-pushed the changefeeed branch 2 times, most recently from 5387253 to fc4a300 Compare May 31, 2020 20:06
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: thanks for writing the test

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


pkg/ccl/changefeedccl/changefeed_stmt.go, line 583 at r1 (raw file):

				// Instead, we want to make sure that the changefeed job is not marked failed
				// due to a transient, retryable error.
				err = jobs.NewRetryJobError(fmt.Sprintf("retryable flow error: %+v", err))

Is there a way to use https://godoc.org/github.com/cockroachdb/errors#CombineErrors to retain the structured error?


pkg/ccl/changefeedccl/changefeed_test.go, line 2866 at r1 (raw file):

	// Even though we disabled merges via the store testing knob, we must also
	// disable the setting in order for manual splits to be allowed.
	sqlDB.Exec(t, "SET CLUSTER SETTING kv.range_merge.queue_enabled = false")

Is this true? I thought since 19.2 you can do manual splits with the queue enabled. You may need to specify an expiration using a WITH EXPIRATION clause.

@pbardea
Copy link
Contributor

pbardea commented May 31, 2020

Just a note that I think the first issue linked the PR looks like it's not the right issue. Perhaps you meant #46515?

@miretskiy miretskiy requested a review from ajwerner June 1, 2020 12:57
Copy link
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @miretskiy, and @pbardea)


pkg/ccl/changefeedccl/changefeed_stmt.go, line 583 at r1 (raw file):

Previously, ajwerner wrote…

Is there a way to use https://godoc.org/github.com/cockroachdb/errors#CombineErrors to retain the structured error?

Probably... This jobs error api is not great. I think we need to take a pass to change type retry-able job error type to be something other than the string. I think it's better to take this cleanup as a separate PR.

Copy link
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Updated.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @pbardea)


pkg/ccl/changefeedccl/changefeed_test.go, line 2866 at r1 (raw file):

Previously, ajwerner wrote…
	// Even though we disabled merges via the store testing knob, we must also
	// disable the setting in order for manual splits to be allowed.
	sqlDB.Exec(t, "SET CLUSTER SETTING kv.range_merge.queue_enabled = false")

Is this true? I thought since 19.2 you can do manual splits with the queue enabled. You may need to specify an expiration using a WITH EXPIRATION clause.

I just copied this from another test. Removing it didn't change anything. So, it's gone.

@miretskiy
Copy link
Contributor Author

Hmmm.. Strangely, your lg didn't make it from reviewable to github. Can I ask for another lg from you @pbardea and @ajwerner ?

Copy link
Contributor

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @pbardea)

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/changefeed_stmt.go, line 583 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Probably... This jobs error api is not great. I think we need to take a pass to change type retry-able job error type to be something other than the string. I think it's better to take this cleanup as a separate PR.

Works for me. Committer's discretion.

@miretskiy miretskiy force-pushed the changefeeed branch 2 times, most recently from 11d7475 to 816c314 Compare June 2, 2020 19:14
@miretskiy miretskiy requested a review from a team June 2, 2020 19:14
@miretskiy miretskiy requested review from a team as code owners June 2, 2020 19:14
Handle flow registration errors due to draining node as retryable.

Release notes (reliability): Treat errors due to draining nodes
as retryable when starting CDC.
@miretskiy
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 2, 2020

Build succeeded

@craig craig bot merged commit a18b32f into cockroachdb:master Jun 2, 2020
craig bot pushed a commit that referenced this pull request Jun 11, 2020
50088: release-20.1: cdc: Treat node draining errors as retryable. r=miretskiy a=miretskiy

Backport 1/1 commits from #49743.

/cc @cockroachdb/release

---

Fixes #46515
Fixes #43771

Handle flow registration errors due to draining node as retryable.

Release notes (reliability): Treat errors due to draining nodes
as retryable when starting CDC.


Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
@miretskiy miretskiy deleted the changefeeed branch July 12, 2020 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants