importer: add checks and validations for row count validation failures#168607
importer: add checks and validations for row count validation failures#168607bghal wants to merge 1 commit intocockroachdb:masterfrom
Conversation
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
abf1228 to
bba169f
Compare
spilchen
left a comment
There was a problem hiding this comment.
@spilchen made 3 comments.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on bghal and michae2).
pkg/sql/importer/import_job.go line 478 at r1 (raw file):
switch validationMode { case ImportRowCountValidationSync: if err := p.ExecCfg().JobRegistry.WaitForJobsIgnoringJobErrors(ctx, []jobspb.JobID{inspectJob.ID()}); err != nil {
the old API would return an error if the job was either paused or cancelled. I think this new API (WaitForJobsIgnoringJobError) returns success in those cases, which if true could be a problem.
pkg/sql/importer/import_job.go line 503 at r1 (raw file):
// Run a count(*) to independently validate the inspect row // count. besteffort.Warning(ctx, "import-expected-count-validation", func(ctx context.Context) error {
I'm not sure if besteffort.Warning is the correct thing to do. It seems like it will only run the closure 50% of time (based on a random seed). I think we always want to run it if we hit an inspect error.
pkg/sql/importer/import_job.go line 518 at r1 (raw file):
return err } actualCount := uint64(tree.MustBeDInt(row[0]))
nit: we should check if row == nil, just in case the query didn't return any rows.
bba169f to
27a52e5
Compare
bghal
left a comment
There was a problem hiding this comment.
@bghal made 3 comments and resolved 1 discussion.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on michae2 and spilchen).
pkg/sql/importer/import_job.go line 478 at r1 (raw file):
Previously, spilchen wrote…
the old API would return an error if the job was either paused or cancelled. I think this new API (
WaitForJobsIgnoringJobError) returns success in those cases, which if true could be a problem.
Separated the validator into its own scope so it should be easier to clean up.
pkg/sql/importer/import_job.go line 503 at r1 (raw file):
Previously, spilchen wrote…
I'm not sure if
besteffort.Warningis the correct thing to do. It seems like it will only run the closure 50% of time (based on a random seed). I think we always want to run it if we hit an inspect error.
Moved it out so it'll surface as an assertion error.
pkg/sql/importer/import_job.go line 518 at r1 (raw file):
Previously, spilchen wrote…
nit: we should check if
row == nil, just in case the query didn't return any rows.
Done.
spilchen
left a comment
There was a problem hiding this comment.
@spilchen made 2 comments and resolved 1 discussion.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on bghal and michae2).
pkg/sql/importer/import_job.go line 478 at r1 (raw file):
Previously, bghal (Brendan) wrote…
Separated the validator into its own scope so it should be easier to clean up.
It's actually the double wait (WaitForJobsIgnoringJobErrors + WaitForJob) that allows to handle paused/cancelled jobs. Is that right? If so, we should have a comment for the WaitForJob call to clarify that it's needed for this case. Otherwise, someone may opt to remove it.
pkg/sql/importer/import_job.go line 518 at r2 (raw file):
) if err != nil { return err
we are already in an error state, I am wondering if we should preserve the decodedErr and combine them together. Can we use errors.CombineErrors? Similar comment for other spots. One way that we can avoid duplicating the combine logic is to move this chunk of code (extra select count(*) validation) into it's own separate function, then have the caller combine the errors once.
27a52e5 to
220113d
Compare
bghal
left a comment
There was a problem hiding this comment.
@bghal made 2 comments.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on michae2 and spilchen).
pkg/sql/importer/import_job.go line 478 at r1 (raw file):
Previously, spilchen wrote…
It's actually the double wait (WaitForJobsIgnoringJobErrors + WaitForJob) that allows to handle paused/cancelled jobs. Is that right? If so, we should have a comment for the WaitForJob call to clarify that it's needed for this case. Otherwise, someone may opt to remove it.
Added the comment.
pkg/sql/importer/import_job.go line 518 at r2 (raw file):
Previously, spilchen wrote…
we are already in an error state, I am wondering if we should preserve the
decodedErrand combine them together. Can we useerrors.CombineErrors? Similar comment for other spots. One way that we can avoid duplicating the combine logic is to move this chunk of code (extra select count(*) validation) into it's own separate function, then have the caller combine the errors once.
Done. What's one more scope.
This makes two temporary changes with the aim to debug cockroachdb#168396: After the `INSPECT` job finds inconsistencies, the `IMPORT` job runs an independent `SELECT count(*)` on the target table. This determines if the row-count discrepancy is due to the `INSPECT` job's row counting over the spans or the `IMPORT` job's calculation of the expected row count. The `INSPECT` resumer asserts that the spans it delegates to workers are not overlapping which would cause overcounting of rows in the overlap. Informs: cockroachdb#168396 Release note: None
220113d to
bd19280
Compare
spilchen
left a comment
There was a problem hiding this comment.
@spilchen reviewed all commit messages, made 2 comments, and resolved 1 discussion.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on bghal and michae2).
pkg/sql/importer/import_job.go line 569 at r3 (raw file):
if rowCountTableErr != nil { return errors.CombineErrors(errors.WithHintf(rowCountTableErr, "Run 'SHOW INSPECT ERRORS FOR JOB %d WITH DETAILS' for more information.",
I think this hint should always be attached. Regardless if the separate count(*) validation found anything or not. We get in this path if INSPECT found an issue.
pkg/sql/importer/import_job.go line 582 at r3 (raw file):
// The broader second wait captures errors from the job being // cancelled or paused that the `validateInspectRowCount` debug
do we know if this is needed for job cancellation? It might only be for paused jobs. I was looking at FinalResumeError and it's set whenever the job has to revert, which I believe includes cancellation.
Doesn't this mean the flow for cancellation will do the 'SELECT COUNT(*)'? And we'd also surface an error saying "Run 'SHOW INSPECT ERRORS FOR JOB %d WITH DETAILS' for more information.", which isn't right also.
This makes two temporary changes with the aim to debug #168396:
After the
INSPECTjob finds inconsistencies, theIMPORTjob runs anindependent
SELECT count(*)on the target table. This determines ifthe row-count discrepancy is due to the
INSPECTjob's row countingover the spans or the
IMPORTjob's calculation of the expected rowcount.
The
INSPECTresumer asserts that the spans it delegates to workers arenot overlapping which would cause overcounting of rows in the overlap.
Informs: #168396
Release note: None