Skip to content

sql/bulkingest: add ingest processor for distributed SST ingestion #156659

@spilchen

Description

@spilchen

This PR adds the Ingest Processor, which is responsible for the final phase of the distributed import pipeline: ingesting pre-merged SSTs into CockroachDB’s KV layer using AddSSTable.

It consumes metadata produced by the distributed merge phase, splits and scatters ranges as needed, and ingests each SST directly into Pebble. This replaces the placeholder import ingestion logic with a real, distributed AddSST flow.

This PR implements the Ingest phase, which follows the distributed merge stage introduced in #PR-6.

After the merge processor produces range-aligned, sorted SSTs:

  • The ingest phase pre-splits ranges (using splitAndScatterSpans() from #PR-2).
  • Each processor ingests SSTs directly into Pebble via DB.AddSSTable().
  • The entire flow executes as a DistSQL job, parallelized across nodes.

This has been implemented already in the prototype fork at: jeffswenson/cockroach@feature-distributed-merge. The goal of this issue is to pull it out into a smaller PR thats easier to review.

Goal

Implement an ingestion stage that:

  1. Takes merged SST metadata as input (URIs, start/end keys).
  2. Splits and scatters KV ranges before ingestion.
  3. Efficiently ingests SST files via DB.AddSSTable().
  4. Handles hundreds of SSTs in parallel across nodes.

This processor is designed to be used both by:

  • The distributed import pipeline (final ingestion phase).
  • Future bulk restore or rebalancing operations that ingest pre-built SSTs.

Implementation Highlights

ingestFileProcessor

  • New DistSQL processor implemented under pkg/sql/bulkingest/ingest_file_processor.go.

  • Accepts a BulkMergeSpec_Output (list of SSTs with start/end keys) via the DistSQL flow.

  • For each SST:

    1. Opens it via the CloudStorageMux (sql/bulkutil: add CloudStorageMux for managing multi-node external storage #156587)
    2. Reads SST bytes into memory with ioctx.ReadAllWithScratch().
    3. Calls db.AddSSTable(ctx, startKey, endKey, data, ...).
    4. Closes and releases all handles.
  • Uses Pebble’s built-in validation to ensure SSTs are well-formed and range-aligned.

Range Preparation

Cleanup and Safety

  • Each SST read and write is scoped to a context; all readers are closed promptly.
  • Calls defer reader.Close(ctx) and releases buffers when complete.
  • Uses CloudStorageMux.Close() to release all cached ExternalStorage instances at the end of execution.

Planning and Flow Integration

  • Integrated into DistSQL via rowexec.NewIngestFileProcessor.
  • Can be scheduled as a final stage after the merge coordinator processor.

Testing Plan

Unit Tests

  • ingest_test.go
    • Writes synthetic SSTs via bulksst.Writer.
    • Runs a mock ingestion flow using the new processor.
    • Verifies:
      • SSTs are successfully ingested into KV.
      • Data can be queried and matches expected key/value pairs.
      • No range overlap or data loss occurs.

Integration Tests

  • Multi-node cluster test:
    • Generates SSTs for a 3-node cluster using the merge pipeline.
    • Runs the ingest phase to AddSST each file.
    • Confirms:
      • All expected ranges exist (via SHOW RANGES).
      • All data is visible in SQL queries.
      • No “range missing” or “duplicate key” errors appear.

Dependencies

Jira issue: CRDB-56097

Epic CRDB-48845

Metadata

Metadata

Assignees

Labels

C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)T-sql-foundationsSQL Foundations Team (formerly SQL Schema + SQL Sessions)target-release-26.1.0

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions