
[SPARK-56232][SQL][SS] V2 streaming read for FileTable#55231

Draft
LuciferYang wants to merge 10 commits into apache:master from LuciferYang:SPARK-56232

Conversation

@LuciferYang
Contributor

What changes were proposed in this pull request?

Implements MicroBatchStream support for V2 file tables, enabling structured streaming reads through the V2 data source path.

  • New FileMicroBatchStream (430 lines) implementing MicroBatchStream, SupportsAdmissionControl, and SupportsTriggerAvailableNow — handles file discovery, offset management via FileStreamSourceLog, dedup via SeenFilesMap, rate limiting (maxFilesPerTrigger / maxBytesPerTrigger), and cross-batch file caching
  • Override FileScan.toMicroBatchStream() to create FileMicroBatchStream
  • Add withFileIndex method to FileScan and all 6 concrete scans for creating batch-specific scans in planInputPartitions
  • Add MICRO_BATCH_READ to FileTable.CAPABILITIES
  • Update ResolveDataSource to allow FileDataSourceV2 into the V2 streaming path, respecting USE_V1_SOURCE_LIST for backward compatibility
  • Remove the FileTable streaming fallback in FindDataSourceTable
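The admission-control flow in the first bullet can be sketched in a few lines. Everything below is a simplified stand-in — FileEntry, FileOffset, and the class are illustrative types, not Spark's actual MicroBatchStream interfaces — but it shows how latestOffset applies the rate limit and how an offset range is later resolved back to concrete files:

```scala
// Illustrative sketch: rate-limited offset advancement for a file stream.
case class FileEntry(path: String, size: Long, modTime: Long)
case class FileOffset(logIndex: Long) // index into the append-only file log

class FileMicroBatchStreamSketch(maxFilesPerTrigger: Option[Int]) {
  // Stands in for the file-source log of all files discovered so far.
  private var discovered = Vector.empty[FileEntry]

  def addFiles(files: Seq[FileEntry]): Unit = discovered ++= files

  // Admission control: advance the offset by at most maxFilesPerTrigger files.
  def latestOffset(start: FileOffset): FileOffset = {
    val unprocessed = discovered.size - start.logIndex
    val take = maxFilesPerTrigger
      .map(m => math.min(m.toLong, unprocessed))
      .getOrElse(unprocessed)
    FileOffset(start.logIndex + take)
  }

  // Partition planning: resolve the (start, end] range back to concrete files.
  def filesForBatch(start: FileOffset, end: FileOffset): Seq[FileEntry] =
    discovered.slice(start.logIndex.toInt, end.logIndex.toInt)
}
```

With maxFilesPerTrigger = Some(2) and five discovered files, repeated latestOffset calls advance through offsets 2, 4, 5 — one micro-batch per step.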

Reuses V1 infrastructure for checkpoint compatibility: FileStreamSourceLog (metadata tracking), FileStreamSourceOffset (offset type), SeenFilesMap (dedup). Existing streaming queries can upgrade from V1 to V2 without checkpoint migration.
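The SeenFilesMap dedup that the stream reuses can be illustrated with a toy version — a simplified stand-in, not Spark's implementation (the real class also tracks an explicit purge timestamp and a configurable max file age):

```scala
// Illustrative sketch of modification-time-based file dedup.
class SeenFilesSketch(maxAgeMs: Long) {
  private val seen = scala.collection.mutable.Map.empty[String, Long] // path -> modTime
  private var latestTime = 0L

  def add(path: String, modTime: Long): Unit = {
    seen(path) = modTime
    latestTime = math.max(latestTime, modTime)
  }

  // A file is new if it is recent enough to still be tracked and we have not
  // already recorded it at this (or a later) modification time.
  def isNew(path: String, modTime: Long): Boolean =
    modTime >= latestTime - maxAgeMs && !seen.get(path).exists(_ >= modTime)

  // Entries that fell out of the tracked window can be forgotten: isNew would
  // reject files that old anyway, so the map stays bounded.
  def purge(): Unit =
    seen.filterInPlace { case (_, t) => t >= latestTime - maxAgeMs }
}
```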

Why are the changes needed?

File streaming reads currently fall back to the V1 FileStreamSource, which blocks deprecation of the V1 file source code. This is part of SPARK-56170, which aims to make V2 the default path for all file source operations.

Does this PR introduce any user-facing change?

No. By default, USE_V1_SOURCE_LIST includes all file formats, so streaming reads still use V1. Users can opt into V2 by clearing the list (spark.sql.sources.useV1SourceList=""). Existing checkpoints are compatible.
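For illustration, the opt-in looks like the following (not runnable as-is: it assumes a live Spark session, and inputSchema and the path are placeholders):

```scala
// Illustrative only: route file streaming reads through the V2 path by
// clearing the V1 fallback list for the session.
spark.conf.set("spark.sql.sources.useV1SourceList", "")

val stream = spark.readStream
  .format("parquet")
  .schema(inputSchema)      // placeholder schema
  .load("/path/to/input")   // placeholder directory
```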

How was this patch tested?

New FileStreamV2ReadSuite with 6 E2E tests: basic streaming read, file discovery across batches, maxFilesPerTrigger rate limiting, checkpoint recovery, V2 path verification (MicroBatchScanExec), and JSON format. Existing FileStreamSourceSuite (76 tests) passes with V1 forced via USE_V1_SOURCE_LIST. Total: 82 streaming file tests pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

…Frame API writes and delete FallBackFileSourceV2

Key changes:
- FileWrite: added partitionSchema, customPartitionLocations,
  dynamicPartitionOverwrite, isTruncate; path creation and truncate
  logic; dynamic partition overwrite via FileCommitProtocol
- FileTable: createFileWriteBuilder with SupportsDynamicOverwrite
  and SupportsTruncate; capabilities now include TRUNCATE and
  OVERWRITE_DYNAMIC; fileIndex skips file existence checks when
  userSpecifiedSchema is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use
  createFileWriteBuilder with partition/truncate/overwrite support
- DataFrameWriter.lookupV2Provider: enabled FileDataSourceV2 for
  non-partitioned Append and Overwrite via df.write.save(path)
- DataFrameWriter.insertInto: V1 fallback for file sources
  (TODO: SPARK-56175)
- DataFrameWriter.saveAsTable: V1 fallback for file sources
  (TODO: SPARK-56230, needs StagingTableCatalog)
- DataSourceV2Utils.getTableProvider: V1 fallback for file sources
  (TODO: SPARK-56175)
- Removed FallBackFileSourceV2 rule
- V2SessionCatalog.createTable: V1 FileFormat data type validation
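The write-mode plumbing above (truncate vs. dynamic partition overwrite) boils down to which existing partitions survive a write. A toy model, with illustrative names and a partition reduced to a name mapped to its rows:

```scala
// Illustrative: what remains in the table after an overwrite write.
def resultingPartitions(
    existing: Map[String, Seq[Int]],
    written: Map[String, Seq[Int]],
    dynamicPartitionOverwrite: Boolean): Map[String, Seq[Int]] =
  if (dynamicPartitionOverwrite) existing ++ written // only written partitions are replaced
  else written                                       // static overwrite truncates the rest
```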

…catalog table loading, and gate removal

Key changes:
- FileTable extends SupportsPartitionManagement with createPartition,
  dropPartition, listPartitionIdentifiers, partitionSchema
- Partition operations sync to catalog metastore (best-effort)
- V2SessionCatalog.loadTable returns FileTable instead of V1Table,
  sets catalogTable and useCatalogFileIndex on FileTable
- V2SessionCatalog.getDataSourceOptions includes storage.properties
  for proper option propagation (header, ORC bloom filter, etc.)
- V2SessionCatalog.createTable validates data types via FileTable
- FileTable.columns() restores NOT NULL constraints from catalogTable
- FileTable.partitioning() falls back to userSpecifiedPartitioning
  or catalog partition columns
- FileTable.fileIndex uses CatalogFileIndex when catalog has
  registered partitions (custom partition locations)
- FileTable.schema checks column name duplication for non-catalog
  tables only
- DataSourceV2Utils.getTableProvider: removed FileDataSourceV2 gate
- DataFrameWriter.insertInto: enabled V2 for file sources
- DataFrameWriter.saveAsTable: V1 fallback (TODO: SPARK-56230)
- ResolveSessionCatalog: V1 fallback for FileTable-backed commands
  (AnalyzeTable, AnalyzeColumn, TruncateTable, TruncatePartition,
  ShowPartitions, RecoverPartitions, AddPartitions, RenamePartitions,
  DropPartitions, SetTableLocation, CREATE TABLE validation,
  REPLACE TABLE blocking)
- FindDataSourceTable: streaming V1 fallback for FileTable
  (TODO: SPARK-56233)
- DataSource.planForWritingFileFormat: graceful V2 handling
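The SupportsPartitionManagement methods named above can be sketched with an in-memory map standing in for the metastore (illustrative only — the PR syncs these operations to the real catalog best-effort):

```scala
// Illustrative in-memory stand-in for the partition-management surface:
// a partition identifier is a sequence of values, one per partition column.
class PartitionedTableSketch {
  private val partitions =
    scala.collection.mutable.Map.empty[Seq[String], String] // ident -> location

  def createPartition(ident: Seq[String], location: String): Unit =
    partitions(ident) = location

  def dropPartition(ident: Seq[String]): Boolean =
    partitions.remove(ident).isDefined

  // Prefix match supports listing e.g. all months under one year.
  def listPartitionIdentifiers(prefix: Seq[String]): Seq[Seq[String]] =
    partitions.keys.filter(_.startsWith(prefix)).toSeq
}
```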

Enable bucketed writes for V2 file tables via catalog BucketSpec.

Key changes:
- FileWrite: add bucketSpec field, use V1WritesUtils.getWriterBucketSpec()
  instead of hardcoded None
- FileTable: createFileWriteBuilder passes catalogTable.bucketSpec
  to the write pipeline
- FileDataSourceV2: getTable uses collect to skip BucketTransform
  (handled via catalogTable.bucketSpec instead)
- FileWriterFactory: use DynamicPartitionDataConcurrentWriter for
  bucketed writes since V2's RequiresDistributionAndOrdering cannot
  express hash-based ordering
- All 6 format Write/Table classes updated with BucketSpec parameter

Note: bucket pruning and bucket join (read-path optimization) are
not included in this patch (tracked under SPARK-56231).
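The writer choice above can be illustrated with a toy concurrent bucket writer: because V2's RequiresDistributionAndOrdering cannot ask for bucket-hash-ordered input, one writer (here a buffer) per bucket stays open and rows are routed as they arrive. A plain hashCode stands in for Spark's Murmur3 bucket hash:

```scala
case class BucketRow(key: Int, value: String)

// Illustrative: rows arrive in arbitrary bucket order, so a writer per bucket
// is kept open concurrently instead of requiring sorted input.
class ConcurrentBucketWriterSketch(numBuckets: Int) {
  private val buffers =
    Array.fill(numBuckets)(scala.collection.mutable.ArrayBuffer.empty[BucketRow])

  private def bucketId(r: BucketRow): Int =
    ((r.key.hashCode % numBuckets) + numBuckets) % numBuckets

  def write(r: BucketRow): Unit = buffers(bucketId(r)) += r

  // "Close" all writers, yielding the rows each bucket file would contain.
  def close(): Map[Int, Seq[BucketRow]] =
    buffers.zipWithIndex.collect { case (b, i) if b.nonEmpty => i -> b.toSeq }.toMap
}
```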

Add RepairTableExec to sync filesystem partition directories with
catalog metastore for V2 file tables.

Key changes:
- New RepairTableExec: scans filesystem partitions via
  FileTable.listPartitionIdentifiers(), compares with catalog,
  registers missing partitions and drops orphaned entries
- DataSourceV2Strategy: route RepairTable and RecoverPartitions
  for FileTable to new V2 exec node
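The reconciliation itself is a set difference in both directions; a minimal sketch, with partition identifiers reduced to strings for illustration:

```scala
// Illustrative: filesystem partitions missing from the catalog get registered,
// catalog entries with no backing directory get dropped.
def repairPlan(
    fsPartitions: Set[String],
    catalogPartitions: Set[String]): (Set[String], Set[String]) = {
  val toRegister = fsPartitions diff catalogPartitions
  val toDrop = catalogPartitions diff fsPartitions
  (toRegister, toDrop)
}
```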

Implement SupportsOverwriteV2 for V2 file tables to support static
partition overwrite (INSERT OVERWRITE TABLE t PARTITION(p=1) SELECT ...).

Key changes:
- FileTable: replace SupportsTruncate with SupportsOverwriteV2 on
  WriteBuilder, implement overwrite(predicates)
- FileWrite: extend toBatch() to delete only the matching partition
  directory, ordered by partitionSchema
- FileTable.CAPABILITIES: add OVERWRITE_BY_FILTER
- All 6 format Write/Table classes: plumb overwritePredicates parameter

This is a prerequisite for SPARK-56304 (ifPartitionNotExists).
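The overwrite-by-filter behaviour can be sketched by matching static equality predicates against partition values (illustrative types — the real code works with V2 Predicates and on-disk partition directories):

```scala
case class EqPredicate(column: String, value: String)

// Illustrative: only partitions matching every static predicate are cleared
// before the new data lands; no predicates means a full truncate.
def partitionsToClear(
    partitions: Seq[Map[String, String]],
    predicates: Seq[EqPredicate]): Seq[Map[String, String]] =
  partitions.filter { part =>
    predicates.forall(p => part.get(p.column).contains(p.value))
  }
```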
@LuciferYang LuciferYang marked this pull request as draft April 7, 2026 06:59
@LuciferYang
Contributor Author

This is the 10th PR for SPARK-56170. The commit 775c9ae contains the changes for this patch.

@LuciferYang LuciferYang changed the title [SPARK-56232][SQL][SS] V2 streaming read for FileTable (MICRO_BATCH_READ) [SPARK-56232][SQL][SS] V2 streaming read for FileTable Apr 8, 2026