importer: add checks and validations for row count validation failures #168607

Open
bghal wants to merge 1 commit into cockroachdb:master from bghal:import-inspect-inconsistencies

Conversation

@bghal
Contributor

@bghal bghal commented Apr 17, 2026

This makes two temporary changes to help debug #168396:

After the INSPECT job finds inconsistencies, the IMPORT job runs an
independent SELECT count(*) on the target table. This determines if
the row-count discrepancy is due to the INSPECT job's row counting
over the spans or the IMPORT job's calculation of the expected row
count.
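
The three counts involved can be compared as in the following minimal sketch. The helper `diagnose` and its messages are hypothetical and are not code from this PR; it only illustrates how an independent count could point at one side or the other.

```go
package main

import "fmt"

// diagnose is a hypothetical helper, not code from this PR. It compares the
// row count IMPORT expected, the count INSPECT reported, and the count from an
// independent SELECT count(*) on the target table.
func diagnose(expected, inspected, actual uint64) string {
	switch {
	case expected == inspected:
		return "no discrepancy"
	case actual == expected:
		// The table agrees with IMPORT, so INSPECT miscounted.
		return "INSPECT's span counting is suspect"
	case actual == inspected:
		// The table agrees with INSPECT, so IMPORT's expectation was off.
		return "IMPORT's expected count is suspect"
	default:
		return "inconclusive: all three counts differ"
	}
}

func main() {
	fmt.Println(diagnose(100, 104, 100))
}
```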

The INSPECT resumer asserts that the spans it delegates to workers do
not overlap, since rows in an overlap would be counted more than once.
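
A minimal sketch of such an overlap check, assuming half-open key ranges. The `span` type and `firstOverlap` function are hypothetical stand-ins, not the types or assertion used by the resumer.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// span is a hypothetical half-open key range [Key, EndKey). It stands in for
// whatever span type the INSPECT resumer actually uses.
type span struct {
	Key, EndKey []byte
}

// firstOverlap sorts a copy of the spans by start key and returns the first
// adjacent pair that overlaps, if any. Once sorted by start key, any overlap
// in the set implies an overlap between some adjacent pair.
func firstOverlap(spans []span) (a, b span, found bool) {
	sorted := append([]span(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool {
		return bytes.Compare(sorted[i].Key, sorted[j].Key) < 0
	})
	for i := 1; i < len(sorted); i++ {
		// With half-open ranges, two spans overlap when the later one starts
		// strictly before the earlier one ends.
		if bytes.Compare(sorted[i].Key, sorted[i-1].EndKey) < 0 {
			return sorted[i-1], sorted[i], true
		}
	}
	return span{}, span{}, false
}

func main() {
	spans := []span{
		{Key: []byte("a"), EndKey: []byte("c")},
		{Key: []byte("b"), EndKey: []byte("d")},
	}
	if a, b, ok := firstOverlap(spans); ok {
		fmt.Printf("overlap: [%s,%s) and [%s,%s)\n", a.Key, a.EndKey, b.Key, b.EndKey)
	}
}
```

Adjacent spans that merely touch, such as [a,b) and [b,c), are not reported, because the end key is exclusive.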

Informs: #168396

Release note: None

@trunk-io
Contributor

trunk-io Bot commented Apr 17, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@blathers-crl

blathers-crl Bot commented Apr 17, 2026

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@bghal bghal force-pushed the import-inspect-inconsistencies branch from abf1228 to bba169f Compare April 17, 2026 18:45
@bghal bghal marked this pull request as ready for review April 17, 2026 18:45
@bghal bghal requested review from a team as code owners April 17, 2026 18:45
@bghal bghal requested review from michae2 and removed request for a team April 17, 2026 18:45
Contributor

@spilchen spilchen left a comment

@spilchen made 3 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on bghal and michae2).


pkg/sql/importer/import_job.go line 478 at r1 (raw file):

			switch validationMode {
			case ImportRowCountValidationSync:
				if err := p.ExecCfg().JobRegistry.WaitForJobsIgnoringJobErrors(ctx, []jobspb.JobID{inspectJob.ID()}); err != nil {

The old API would return an error if the job was either paused or cancelled. I think the new API (WaitForJobsIgnoringJobErrors) returns success in those cases, which, if true, could be a problem.


pkg/sql/importer/import_job.go line 503 at r1 (raw file):

					// Run a count(*) to independently validate the inspect row
					// count.
					besteffort.Warning(ctx, "import-expected-count-validation", func(ctx context.Context) error {

I'm not sure besteffort.Warning is the right thing to use here. It seems like it will only run the closure 50% of the time (based on a random seed). I think we always want to run it if we hit an inspect error.


pkg/sql/importer/import_job.go line 518 at r1 (raw file):

							return err
						}
						actualCount := uint64(tree.MustBeDInt(row[0]))

nit: we should check if row == nil, just in case the query didn't return any rows.
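
A minimal sketch of the guard being suggested. The `countFromRow` helper is hypothetical, and the row is modelled as a plain slice of values rather than the datum row type used in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// countFromRow is a hypothetical helper. It models a single-row query result
// as a slice of values and extracts a non-negative count from it, returning
// an error instead of panicking when the row is missing or malformed.
func countFromRow(row []any) (uint64, error) {
	if row == nil {
		return 0, errors.New("count query returned no rows")
	}
	if len(row) != 1 {
		return 0, fmt.Errorf("expected 1 column, got %d", len(row))
	}
	n, ok := row[0].(int64)
	if !ok || n < 0 {
		return 0, fmt.Errorf("unexpected count value %v", row[0])
	}
	return uint64(n), nil
}

func main() {
	if _, err := countFromRow(nil); err != nil {
		fmt.Println("error:", err)
	}
}
```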

@bghal bghal force-pushed the import-inspect-inconsistencies branch from bba169f to 27a52e5 Compare April 17, 2026 21:20
Contributor Author

@bghal bghal left a comment

@bghal made 3 comments and resolved 1 discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on michae2 and spilchen).


pkg/sql/importer/import_job.go line 478 at r1 (raw file):

Previously, spilchen wrote…

The old API would return an error if the job was either paused or cancelled. I think the new API (WaitForJobsIgnoringJobErrors) returns success in those cases, which, if true, could be a problem.

Separated the validator into its own scope so it should be easier to clean up.


pkg/sql/importer/import_job.go line 503 at r1 (raw file):

Previously, spilchen wrote…

I'm not sure besteffort.Warning is the right thing to use here. It seems like it will only run the closure 50% of the time (based on a random seed). I think we always want to run it if we hit an inspect error.

Moved it out so it'll surface as an assertion error.


pkg/sql/importer/import_job.go line 518 at r1 (raw file):

Previously, spilchen wrote…

nit: we should check if row == nil, just in case the query didn't return any rows.

Done.

Contributor

@spilchen spilchen left a comment

@spilchen made 2 comments and resolved 1 discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on bghal and michae2).


pkg/sql/importer/import_job.go line 478 at r1 (raw file):

Previously, bghal (Brendan) wrote…

Separated the validator into its own scope so it should be easier to clean up.

It's actually the double wait (WaitForJobsIgnoringJobErrors + WaitForJob) that allows us to handle paused/cancelled jobs. Is that right? If so, we should add a comment on the WaitForJob call to clarify that it's needed for this case. Otherwise, someone may opt to remove it.


pkg/sql/importer/import_job.go line 518 at r2 (raw file):

						)
						if err != nil {
							return err

We are already in an error state, so I am wondering if we should preserve the decodedErr and combine the two. Can we use errors.CombineErrors? Similar comment for other spots. One way to avoid duplicating the combine logic is to move this chunk of code (the extra select count(*) validation) into its own separate function, then have the caller combine the errors once.

@bghal bghal force-pushed the import-inspect-inconsistencies branch from 27a52e5 to 220113d Compare April 21, 2026 21:38
Contributor Author

@bghal bghal left a comment

@bghal made 2 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on michae2 and spilchen).


pkg/sql/importer/import_job.go line 478 at r1 (raw file):

Previously, spilchen wrote…

It's actually the double wait (WaitForJobsIgnoringJobErrors + WaitForJob) that allows us to handle paused/cancelled jobs. Is that right? If so, we should add a comment on the WaitForJob call to clarify that it's needed for this case. Otherwise, someone may opt to remove it.

Added the comment.


pkg/sql/importer/import_job.go line 518 at r2 (raw file):

Previously, spilchen wrote…

We are already in an error state, so I am wondering if we should preserve the decodedErr and combine the two. Can we use errors.CombineErrors? Similar comment for other spots. One way to avoid duplicating the combine logic is to move this chunk of code (the extra select count(*) validation) into its own separate function, then have the caller combine the errors once.

Done. What's one more scope.

@bghal bghal requested a review from spilchen April 22, 2026 21:42
@bghal bghal force-pushed the import-inspect-inconsistencies branch from 220113d to bd19280 Compare April 22, 2026 21:54
Contributor

@spilchen spilchen left a comment

@spilchen reviewed all commit messages, made 2 comments, and resolved 1 discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on bghal and michae2).


pkg/sql/importer/import_job.go line 569 at r3 (raw file):

						if rowCountTableErr != nil {
							return errors.CombineErrors(errors.WithHintf(rowCountTableErr,
								"Run 'SHOW INSPECT ERRORS FOR JOB %d WITH DETAILS' for more information.",

I think this hint should always be attached, regardless of whether the separate count(*) validation found anything. We get into this path whenever INSPECT found an issue.


pkg/sql/importer/import_job.go line 582 at r3 (raw file):

				// The broader second wait captures errors from the job being
				// cancelled or paused that the `validateInspectRowCount` debug

Do we know if this is needed for job cancellation? It might only be needed for paused jobs. I was looking at FinalResumeError, and it's set whenever the job has to revert, which I believe includes cancellation.

Doesn't this mean the cancellation flow will run the 'SELECT COUNT(*)'? We'd also surface an error saying "Run 'SHOW INSPECT ERRORS FOR JOB %d WITH DETAILS' for more information.", which isn't right either.


3 participants