inspect: fix false row count mismatch errors after IMPORT#168158
Conversation
|
😎 Merged successfully - details. |
7a0be44 to
4ef9b62
Compare
fqazi
left a comment
There was a problem hiding this comment.
Nice find @spilchen
@fqazi reviewed 8 files and all commit messages, and made 1 comment.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on spilchen).
bghal
left a comment
There was a problem hiding this comment.
Stamping to unblock the release but had some questions on the interface.
| // If no rows exist in the primary index span, we still need to check for dangling | ||
| // secondary index entries. We run the check with an empty predicate, which will | ||
| // scan the entire secondary index within the span. Any secondary index entries found | ||
| // will be dangling since there are no corresponding primary index rows. |
There was a problem hiding this comment.
super nit: I feel like some detail is lost here.
| log.Dev.Infof(ctx, "skipping hash precheck for index %s: column type not compatible with datums_to_bytes", | ||
| c.secIndex.GetName()) | ||
| } else { | ||
| match, rowCount, hashErr := c.hashesMatch(ctx, allColNames, predicate, queryArgs) |
There was a problem hiding this comment.
Is the overly broad count here? Would it make sense for the inspectCheckRowCount interface to expose an invalid count rather than relying on its user doing the its own scan validation?
There was a problem hiding this comment.
I can make the interface changes in a subsequent diff if they're sensible and not worth holding off on.
There was a problem hiding this comment.
Yes, this is where we double counted. The query had no predicates so scanned the entire table.
I'm not sure 100% sure what you are suggesting for the interface change. Can you clarify? It sounds like something to handle in a follow-on as it doesn't impact the fix.
There was a problem hiding this comment.
I'm thinking that c.rowCount should never be filled if we know it's a bad count from hasRows.
Noted down my thoughts but best as a follow-up.
4ef9b62 to
86bc0d2
Compare
spilchen
left a comment
There was a problem hiding this comment.
Thanks for the reviews
@spilchen made 4 comments.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on bghal and fqazi).
| // If no rows exist in the primary index span, we still need to check for dangling | ||
| // secondary index entries. We run the check with an empty predicate, which will | ||
| // scan the entire secondary index within the span. Any secondary index entries found | ||
| // will be dangling since there are no corresponding primary index rows. |
| log.Dev.Infof(ctx, "skipping hash precheck for index %s: column type not compatible with datums_to_bytes", | ||
| c.secIndex.GetName()) | ||
| } else { | ||
| match, rowCount, hashErr := c.hashesMatch(ctx, allColNames, predicate, queryArgs) |
There was a problem hiding this comment.
Yes, this is where we double counted. The query had no predicates so scanned the entire table.
I'm not sure 100% sure what you are suggesting for the interface change. Can you clarify? It sounds like something to handle in a follow-on as it doesn't impact the fix.
IMPORT validates data integrity by running an inspect job that checks the row count against the expected number of imported rows. When the primary index has an empty span (a range with no rows), the inspect job incorrectly inflates the accumulated row count, producing a spurious RowCountMismatch error. This makes it appear that the imported data is corrupt when the data is actually correct. The root cause is in getPredicateAndQueryArgs, which returned an empty predicate for empty spans. Downstream queries then ran without a WHERE clause, counting all rows in the table instead of zero for the empty span. The fix derives query bounds directly from the span boundaries when no rows exist, keeping all queries properly scoped. The function now also returns a hasRows bool so callers can distinguish empty spans without relying on a sentinel empty string. Closes cockroachdb#168001 Release note: None Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
86bc0d2 to
d558c05
Compare
| // its primary validation. | ||
| type inspectCheckRowCount interface { | ||
| // Rows returns the number of rows counted by the check. | ||
| RowCount() uint64 |
There was a problem hiding this comment.
| RowCount() *uint64 |
The interface can indicate there is no valid row count for the span.
| if check, ok := check.(inspectCheckRowCount); ok { | ||
| data.SpanRowCount = check.RowCount() | ||
| return nil | ||
| } |
There was a problem hiding this comment.
| if check, ok := check.(inspectCheckRowCount); ok { | |
| if rowCount := check.RowCount(); rowCount != nil { | |
| data.SpanRowCount = check.RowCount() | |
| return nil | |
| } | |
| } |
Same behavior but the row count check never gets an incorrect value from the consistency check.
| log.Dev.Infof(ctx, "skipping hash precheck for index %s: column type not compatible with datums_to_bytes", | ||
| c.secIndex.GetName()) | ||
| } else { | ||
| match, rowCount, hashErr := c.hashesMatch(ctx, allColNames, predicate, queryArgs) |
There was a problem hiding this comment.
I'm thinking that c.rowCount should never be filled if we know it's a bad count from hasRows.
Noted down my thoughts but best as a follow-up.
|
TFTR! /trunk merge |
|
blathers backport 26.2.0-rc |
|
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #168001: branch-release-26.2.0-rc. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IMPORT validates data integrity by running an inspect job that checks the row count against the expected number of imported rows. When the primary index has an empty span (a range with no rows), the inspect job incorrectly inflates the accumulated row count, producing a spurious RowCountMismatch error. This makes it appear that the imported data is corrupt when the data is actually correct.
The root cause is in getPredicateAndQueryArgs, which returned an empty predicate for empty spans. Downstream queries then ran without a WHERE clause, counting all rows in the table instead of zero for the empty span. The fix derives query bounds directly from the span boundaries when no rows exist, keeping all queries properly scoped. The function now also returns a hasRows bool so callers can distinguish empty spans without relying on a sentinel empty string.
Closes #168001
Release note: None