Skip to content

importer: fix expected row count after ingestWithRetry retries#165697

Merged
trunk-io[bot] merged 2 commits intocockroachdb:masterfrom
spilchen:gh-165540/260312/2133/inspect/fix-double-row-count
Mar 17, 2026
Merged

importer: fix expected row count after ingestWithRetry retries#165697
trunk-io[bot] merged 2 commits intocockroachdb:masterfrom
spilchen:gh-165540/260312/2133/inspect/fix-double-row-count

Conversation

@spilchen
Copy link
Copy Markdown
Contributor

@spilchen spilchen commented Mar 13, 2026

The row count validation added for IMPORT INTO accounted for retries that re-enter Resume (pause/resume, node restart) but not retries within ingestWithRetry. When a transient error triggers a retry, rows from the successful first attempt persist in KV, but the result only reflects the last attempt. This caused a row count mismatch.

Fix by using the job progress summary as the source of truth for total imported rows, and ensure distImport always flushes the summary before returning.

Fixes: #165540
Release note: None
Epic: None

@spilchen spilchen self-assigned this Mar 13, 2026
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io bot commented Mar 13, 2026

😎 Merged successfully - details.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Add `TestImportIntoRowCountCheckAfterIngestRetry` which verifies that
INSPECT row count validation passes after `ingestWithRetry` retries a
failed `distImport`. The test injects a retryable error (via a new
`afterDistImport` testing knob) after the first successful `distImport`
call, forcing a retry. On retry, all files are at `MaxInt64` resume pos
so 0 new rows are ingested. The test covers both empty-table and
non-empty-table cases.

The `afterDistImport` knob is called after each successful `distImport`
inside `ingestWithRetry`. If it returns an error, that error replaces
the nil result and triggers the retry path.

Epic: None
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@spilchen spilchen force-pushed the gh-165540/260312/2133/inspect/fix-double-row-count branch 3 times, most recently from 48778f2 to d2963a1 Compare March 16, 2026 18:50
@spilchen spilchen marked this pull request as ready for review March 16, 2026 20:24
@spilchen spilchen requested review from a team as code owners March 16, 2026 20:24
@spilchen spilchen requested review from a team, bghal, msbutler and mw5h and removed request for a team March 16, 2026 20:24
Copy link
Copy Markdown
Contributor

@bghal bghal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@bghal made 2 comments.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on msbutler, mw5h, and spilchen).


pkg/sql/importer/import_into_test.go line 599 at r1 (raw file):

			runner.Exec(t, fmt.Sprintf(`CREATE TABLE %s (k INT PRIMARY KEY, v INT)`, tc.name))

Maybe worth adding an assertion here that the once function was invoked?

Copy link
Copy Markdown
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great find!

@rafiss made 2 comments.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on msbutler, mw5h, and spilchen).


pkg/sql/importer/import_job.go line 347 at r2 (raw file):

	// Reload the job to pick up progress updates that may have been
	// written to a reloaded job pointer during ingestWithRetry retries.
	if reloadedJob, reloadErr := p.ExecCfg().JobRegistry.LoadClaimedJob(ctx, r.job.ID()); reloadErr == nil {

nit: i think we should first check reloadErr != nil and return the error in this case. but if there's a reason to ignore it, a comment would help.

The expected row count for IMPORT's post-ingest validation was computed
from the return value of the last distImport attempt. This correctly
handled retries that re-entered Resume (e.g. pause/resume), but not
retries within ingestWithRetry, where rows from a prior attempt persist
in KV but aren't reflected in the final attempt's result.

Fix this by reading the total imported row count from the job progress
summary, which accumulates across all attempts. Also add a final
checkpoint persist in distImport to ensure the summary is flushed
before it returns.

Fixes: cockroachdb#165540
Release note: None
Epic: None
@spilchen spilchen force-pushed the gh-165540/260312/2133/inspect/fix-double-row-count branch from d2963a1 to 2a2feae Compare March 17, 2026 11:34
Copy link
Copy Markdown
Contributor Author

@spilchen spilchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spilchen made 2 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on bghal, msbutler, mw5h, and rafiss).


pkg/sql/importer/import_job.go line 347 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: i think we should first check reloadErr != nil and return the error in this case. but if there's a reason to ignore it, a comment would help.

Yeah, good point. I'll check the error and return it.


pkg/sql/importer/import_into_test.go line 599 at r1 (raw file):

Previously, bghal (Brendan Gerrity) wrote…

Maybe worth adding an assertion here that the once function was invoked?

Done.

@spilchen
Copy link
Copy Markdown
Contributor Author

TFTRs!

/trunk merge

@trunk-io trunk-io bot merged commit 7b8cebd into cockroachdb:master Mar 17, 2026
27 checks passed
spilchen added a commit to spilchen/cockroach that referenced this pull request Apr 15, 2026
IMPORT row count validation has been failing on drt-chaos-aws with
multiple distinct failure modes (off-by-one, 2x doubling, deficit)
despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an
accounting problem and not actual data corruption.

Change the default value of `bulkio.import.row_count_validation.mode`
from `async` to `off` to prevent spurious INSPECT errors from surfacing
in customer environments. The setting remains user-configurable and can
be re-enabled once the root cause of cockroachdb#168396 is identified and fixed.

Metamorphic test builds continue to randomly exercise all three modes
(off, async, sync).

Fixes: cockroachdb#168400
Epic: none
Release note: none

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spilchen added a commit to spilchen/cockroach that referenced this pull request Apr 15, 2026
IMPORT row count validation has been failing on drt-chaos-aws with
multiple distinct failure modes (off-by-one, 2x doubling, deficit)
despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an
accounting problem and not actual data corruption.

Change the default value of `bulkio.import.row_count_validation.mode`
from `async` to `off` to prevent spurious INSPECT errors from surfacing
in customer environments. The setting remains user-configurable and can
be re-enabled once the root cause of cockroachdb#168396 is identified and fixed.

Metamorphic test builds continue to randomly exercise all three modes
(off, async, sync).

Fixes: cockroachdb#168400
Epic: none
Release note: none

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spilchen added a commit to spilchen/cockroach that referenced this pull request Apr 15, 2026
IMPORT row count validation has been failing on drt-chaos-aws with
multiple distinct failure modes (off-by-one, 2x doubling, deficit)
despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an
accounting problem and not actual data corruption.

Change the default value of `bulkio.import.row_count_validation.mode`
from `async` to `off` to prevent spurious INSPECT errors from surfacing
in customer environments. The setting remains user-configurable and can
be re-enabled once the root cause of cockroachdb#168396 is identified and fixed.

Metamorphic test builds continue to randomly exercise all three modes
(off, async, sync).

Fixes: cockroachdb#168400
Epic: none
Release note: none

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spilchen added a commit to spilchen/cockroach that referenced this pull request Apr 15, 2026
IMPORT row count validation has been failing on drt-chaos-aws with
multiple distinct failure modes (off-by-one, 2x doubling, deficit)
despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an
accounting problem and not actual data corruption.

Change the default value of `bulkio.import.row_count_validation.mode`
from `async` to `off` to prevent spurious INSPECT errors from surfacing
in customer environments. The setting remains user-configurable and can
be re-enabled once the root cause of cockroachdb#168396 is identified and fixed.

Metamorphic test builds continue to randomly exercise all three modes
(off, async, sync).

Fixes: cockroachdb#168400
Epic: none
Release note: none

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spilchen added a commit to spilchen/cockroach that referenced this pull request Apr 15, 2026
IMPORT row count validation has been failing on drt-chaos-aws with
multiple distinct failure modes (off-by-one, 2x doubling, deficit)
despite prior fixes (cockroachdb#165697, cockroachdb#168158). This is suspected to be an
accounting problem and not actual data corruption.

Change the default value of `bulkio.import.row_count_validation.mode`
from `async` to `off` to prevent spurious INSPECT errors from surfacing
in customer environments. The setting remains user-configurable and can
be re-enabled once the root cause of cockroachdb#168396 is identified and fixed.

Metamorphic test builds continue to randomly exercise all three modes
(off, async, sync).

Fixes: cockroachdb#168400
Epic: none
Release note: none

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: import validation jobs report 2x expected row count on drt-chaos-aws

4 participants