inspect: fix false row count mismatch errors after IMPORT#168158

Merged
trunk-io[bot] merged 1 commit intocockroachdb:masterfrom
spilchen:gh-168001/260410/1511/inspect/fix-row-count
Apr 10, 2026

Conversation

@spilchen
Contributor

IMPORT validates data integrity by running an inspect job that checks the row count against the expected number of imported rows. When the primary index has an empty span (a range with no rows), the inspect job incorrectly inflates the accumulated row count, producing a spurious RowCountMismatch error. This makes it appear that the imported data is corrupt when the data is actually correct.

The root cause is in getPredicateAndQueryArgs, which returned an empty predicate for empty spans. Downstream queries then ran without a WHERE clause, counting all rows in the table instead of zero for the empty span. The fix derives query bounds directly from the span boundaries when no rows exist, keeping all queries properly scoped. The function now also returns a hasRows bool so callers can distinguish empty spans without relying on a sentinel empty string.
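
To make the failure mode concrete, here is a minimal, self-contained sketch of the accounting problem. It is not the code in pkg/sql/inspect: `span`, `predicateForSpan`, `legacyPredicateForSpan`, `countRows`, and `totals` are hypothetical names, integer keys stand in for real index keys, and a Go closure stands in for the SQL WHERE clause. It only illustrates why an unbounded query on an empty span inflates the accumulated count.

```go
package main

import "fmt"

// span is a hypothetical half-open key range [start, end) over integer keys.
type span struct{ start, end int }

// predicateForSpan is a simplified stand-in for the fixed behavior: the
// predicate is always bounded by the span, and hasRows reports whether any
// key actually falls inside it.
func predicateForSpan(keys []int, sp span) (pred func(int) bool, hasRows bool) {
	pred = func(k int) bool { return k >= sp.start && k < sp.end }
	for _, k := range keys {
		if pred(k) {
			hasRows = true
			break
		}
	}
	return pred, hasRows
}

// legacyPredicateForSpan models the old behavior (an assumption based on the
// description above): an empty span yields an "empty" predicate, which
// matches every row in the table.
func legacyPredicateForSpan(keys []int, sp span) func(int) bool {
	pred, hasRows := predicateForSpan(keys, sp)
	if !hasRows {
		return func(int) bool { return true }
	}
	return pred
}

// countRows counts the keys accepted by pred.
func countRows(keys []int, pred func(int) bool) int {
	n := 0
	for _, k := range keys {
		if pred(k) {
			n++
		}
	}
	return n
}

// totals accumulates per-span counts for a five-row table split into three
// spans, the middle of which is empty.
func totals() (legacy, fixed int) {
	keys := []int{1, 2, 3, 10, 11}
	spans := []span{{0, 5}, {5, 10}, {10, 20}}
	for _, sp := range spans {
		legacy += countRows(keys, legacyPredicateForSpan(keys, sp))
		pred, _ := predicateForSpan(keys, sp)
		fixed += countRows(keys, pred)
	}
	return legacy, fixed
}

func main() {
	legacy, fixed := totals()
	fmt.Printf("legacy=%d fixed=%d\n", legacy, fixed) // prints "legacy=10 fixed=5"
}
```

In this toy table the single empty span contributes the whole table's five rows a second time, so the legacy total is exactly double the true count.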

Closes #168001

Release note: None

@spilchen spilchen self-assigned this Apr 10, 2026
@trunk-io
Contributor

trunk-io bot commented Apr 10, 2026

😎 Merged successfully - details.

@cockroach-teamcity
Member

This change is Reviewable

@spilchen spilchen force-pushed the gh-168001/260410/1511/inspect/fix-row-count branch from 7a0be44 to 4ef9b62 Compare April 10, 2026 18:30
@spilchen spilchen marked this pull request as ready for review April 10, 2026 18:31
@spilchen spilchen requested a review from a team as a code owner April 10, 2026 18:31
Collaborator

@fqazi fqazi left a comment

:lgtm:

Nice find @spilchen

@fqazi reviewed 8 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on spilchen).

Contributor

@bghal bghal left a comment

Stamping to unblock the release but had some questions on the interface.

Comment thread pkg/sql/inspect/uniqueness_check.go Outdated
Comment on lines -167 to -170
// If no rows exist in the primary index span, we still need to check for dangling
// secondary index entries. We run the check with an empty predicate, which will
// scan the entire secondary index within the span. Any secondary index entries found
// will be dangling since there are no corresponding primary index rows.
Contributor

super nit: I feel like some detail is lost here.

Contributor Author

Done.

log.Dev.Infof(ctx, "skipping hash precheck for index %s: column type not compatible with datums_to_bytes",
c.secIndex.GetName())
} else {
match, rowCount, hashErr := c.hashesMatch(ctx, allColNames, predicate, queryArgs)
Contributor

Is the overly broad count here? Would it make sense for the inspectCheckRowCount interface to expose an invalid count rather than relying on its user doing its own scan validation?

Contributor

@bghal bghal Apr 10, 2026

I can make the interface changes in a subsequent diff if they're sensible and not worth holding off on.

Contributor Author

Yes, this is where we double counted. The query had no predicates so scanned the entire table.

I'm not 100% sure what you are suggesting for the interface change. Can you clarify? It sounds like something to handle in a follow-on as it doesn't impact the fix.

Contributor

I'm thinking that c.rowCount should never be filled if we know it's a bad count from hasRows.

Noted down my thoughts but best as a follow-up.

@cockroach-teamcity cockroach-teamcity added the X-perf-gain Microbenchmarks CI: Added if a performance gain is detected label Apr 10, 2026
@spilchen spilchen force-pushed the gh-168001/260410/1511/inspect/fix-row-count branch from 4ef9b62 to 86bc0d2 Compare April 10, 2026 20:00
Contributor Author

@spilchen spilchen left a comment

Thanks for the reviews

@spilchen made 4 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on bghal and fqazi).

IMPORT validates data integrity by running an inspect job that checks
the row count against the expected number of imported rows. When the
primary index has an empty span (a range with no rows), the inspect job
incorrectly inflates the accumulated row count, producing a spurious
RowCountMismatch error. This makes it appear that the imported data is
corrupt when the data is actually correct.

The root cause is in getPredicateAndQueryArgs, which returned an empty
predicate for empty spans. Downstream queries then ran without a WHERE
clause, counting all rows in the table instead of zero for the empty
span. The fix derives query bounds directly from the span boundaries
when no rows exist, keeping all queries properly scoped. The function
now also returns a hasRows bool so callers can distinguish empty spans
without relying on a sentinel empty string.

Closes cockroachdb#168001

Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@spilchen spilchen force-pushed the gh-168001/260410/1511/inspect/fix-row-count branch from 86bc0d2 to d558c05 Compare April 10, 2026 20:11
// its primary validation.
type inspectCheckRowCount interface {
	// RowCount returns the number of rows counted by the check.
	RowCount() uint64
Contributor

Suggested change
RowCount() *uint64

The interface can indicate there is no valid row count for the span.

Comment on lines 129 to 132
if check, ok := check.(inspectCheckRowCount); ok {
	data.SpanRowCount = check.RowCount()
	return nil
}
Contributor

Suggested change
if check, ok := check.(inspectCheckRowCount); ok {
	if rowCount := check.RowCount(); rowCount != nil {
		data.SpanRowCount = *rowCount
		return nil
	}
}

Same behavior but the row count check never gets an incorrect value from the consistency check.
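
A rough, standalone sketch of how the suggested interface could behave end to end. Only `inspectCheckRowCount`, `RowCount`, and `SpanRowCount` come from the snippets above; `spanData`, `indexCheck`, and `recordRowCount` are hypothetical stand-ins, and the assumption that `SpanRowCount` is a plain `uint64` is inferred from the original assignment rather than confirmed.

```go
package main

import "fmt"

// inspectCheckRowCount mirrors the suggested signature: a nil result means
// the check has no valid row count for the span.
type inspectCheckRowCount interface {
	RowCount() *uint64
}

// spanData is a hypothetical stand-in for the per-span result record.
type spanData struct{ SpanRowCount uint64 }

// indexCheck is a hypothetical check that leaves rowCount nil whenever it
// knows its count would be untrustworthy.
type indexCheck struct{ rowCount *uint64 }

func (c *indexCheck) RowCount() *uint64 { return c.rowCount }

// recordRowCount copies the count into data only when the check reports a
// valid one, and returns whether it did so.
func recordRowCount(check interface{}, data *spanData) bool {
	rc, ok := check.(inspectCheckRowCount)
	if !ok {
		return false
	}
	n := rc.RowCount()
	if n == nil {
		return false
	}
	data.SpanRowCount = *n
	return true
}

func main() {
	three := uint64(3)

	var data spanData
	fmt.Println(recordRowCount(&indexCheck{rowCount: &three}, &data), data.SpanRowCount) // true 3

	data = spanData{}
	fmt.Println(recordRowCount(&indexCheck{}, &data), data.SpanRowCount) // false 0
}
```

With a pointer return, a caller cannot confuse "zero rows counted" with "no count available", which is the distinction the thread above is after.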

@spilchen spilchen added the backport-26.2.x Flags PRs that need to be backported to 26.2 label Apr 10, 2026
@spilchen
Contributor Author

TFTR!

/trunk merge

@trunk-io trunk-io bot merged commit aa58e85 into cockroachdb:master Apr 10, 2026
30 checks passed
@rafiss
Collaborator

rafiss commented Apr 10, 2026

blathers backport 26.2.0-rc

@blathers-crl

blathers-crl bot commented Apr 10, 2026

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #168001: branch-release-26.2.0-rc.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

spilchen added a commit to spilchen/cockroach that referenced this pull request Apr 15, 2026
IMPORT row count validation has been failing on drt-chaos-aws with
multiple distinct failure modes (off-by-one, 2x doubling, deficit)
despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an
accounting problem and not actual data corruption.

Change the default value of `bulkio.import.row_count_validation.mode`
from `async` to `off` to prevent spurious INSPECT errors from surfacing
in customer environments. The setting remains user-configurable and can
be re-enabled once the root cause of cockroachdb#168396 is identified and fixed.

Metamorphic test builds continue to randomly exercise all three modes
(off, async, sync).

Fixes: cockroachdb#168400
Epic: none
Release note: none

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Labels

backport-26.2.x Flags PRs that need to be backported to 26.2 target-release-26.3.0 X-perf-gain Microbenchmarks CI: Added if a performance gain is detected

Development

Successfully merging this pull request may close these issues.

sql: import validation job fails with 2x as many rows

5 participants