importer: fix expected row count after ingestWithRetry retries#165697
Conversation
|
😎 Merged successfully - details. |
Add `TestImportIntoRowCountCheckAfterIngestRetry` which verifies that INSPECT row count validation passes after `ingestWithRetry` retries a failed `distImport`. The test injects a retryable error (via a new `afterDistImport` testing knob) after the first successful `distImport` call, forcing a retry. On retry, all files are at `MaxInt64` resume pos so 0 new rows are ingested. The test covers both empty-table and non-empty-table cases. The `afterDistImport` knob is called after each successful `distImport` inside `ingestWithRetry`. If it returns an error, that error replaces the nil result and triggers the retry path. Epic: None Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
48778f2 to
d2963a1
Compare
bghal
left a comment
There was a problem hiding this comment.
@bghal made 2 comments.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on msbutler, mw5h, and spilchen).
pkg/sql/importer/import_into_test.go line 599 at r1 (raw file):
runner.Exec(t, fmt.Sprintf(`CREATE TABLE %s (k INT PRIMARY KEY, v INT)`, tc.name))
Maybe worth adding an assertion here that the once function was invoked?
rafiss
left a comment
There was a problem hiding this comment.
great find!
@rafiss made 2 comments.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on msbutler, mw5h, and spilchen).
pkg/sql/importer/import_job.go line 347 at r2 (raw file):
// Reload the job to pick up progress updates that may have been // written to a reloaded job pointer during ingestWithRetry retries. if reloadedJob, reloadErr := p.ExecCfg().JobRegistry.LoadClaimedJob(ctx, r.job.ID()); reloadErr == nil {
nit: i think we should first check reloadErr != nil and return the error in this case. but if there's a reason to ignore it, a comment would help.
The expected row count for IMPORT's post-ingest validation was computed from the return value of the last distImport attempt. This correctly handled retries that re-entered Resume (e.g. pause/resume), but not retries within ingestWithRetry, where rows from a prior attempt persist in KV but aren't reflected in the final attempt's result. Fix this by reading the total imported row count from the job progress summary, which accumulates across all attempts. Also add a final checkpoint persist in distImport to ensure the summary is flushed before it returns. Fixes: cockroachdb#165540 Release note: None Epic: None
d2963a1 to
2a2feae
Compare
spilchen
left a comment
There was a problem hiding this comment.
@spilchen made 2 comments.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on bghal, msbutler, mw5h, and rafiss).
pkg/sql/importer/import_job.go line 347 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
nit: i think we should first check
reloadErr != niland return the error in this case. but if there's a reason to ignore it, a comment would help.
Yeah, good point. I'll check the error and return it.
pkg/sql/importer/import_into_test.go line 599 at r1 (raw file):
Previously, bghal (Brendan Gerrity) wrote…
Maybe worth adding an assertion here that the
oncefunction was invoked?
Done.
|
TFTRs! /trunk merge |
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IMPORT row count validation has been failing on drt-chaos-aws with multiple distinct failure modes (off-by-one, 2x doubling, deficit) despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an accounting problem and not actual data corruption. Change the default value of `bulkio.import.row_count_validation.mode` from `async` to `off` to prevent spurious INSPECT errors from surfacing in customer environments. The setting remains user-configurable and can be re-enabled once the root cause of cockroachdb#168396 is identified and fixed. Metamorphic test builds continue to randomly exercise all three modes (off, async, sync). Fixes: cockroachdb#168400 Epic: none Release note: none Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The row count validation added for IMPORT INTO accounted for retries that re-enter Resume (pause/resume, node restart) but not retries within ingestWithRetry. When a transient error triggers a retry, rows from the successful first attempt persist in KV, but the result only reflects the last attempt. This caused a row count mismatch.
Fix by using the job progress summary as the source of truth for total imported rows, and ensure distImport always flushes the summary before returning.
Fixes: #165540
Release note: None
Epic: None