Spark: Support long-valued streaming max rows per micro-batch #16571
Open
colinre wants to merge 2 commits into
Problem
Spark Structured Streaming row-based micro-batch planning was effectively capped at Integer.MAX_VALUE rows. This made very large initial streaming backfills impractical because streams over multi-trillion-row tables could require thousands of micro-batches before reaching ongoing incremental ingestion.
As a result, users may need to operate a separate batch backfill implementation and then hand off to streaming, even though a single streaming job should be able to handle both phases.
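The scale of the problem can be sketched with a rough calculation. The snippet below is only an illustration: the 5-trillion-row table and the 500-billion-row limit are hypothetical figures, and the class and method names are made up for this example. Because the row limit is a soft limit, real batch counts would differ somewhat.

```java
/** Back-of-the-envelope count of micro-batches needed to drain a backfill. */
final class MicroBatchEstimate {

  // Ceiling division for non-negative totals: the number of batches of at
  // most rowsPerBatch rows needed to cover totalRows rows.
  static long batchesNeeded(long totalRows, long rowsPerBatch) {
    if (totalRows < 0 || rowsPerBatch <= 0) {
      throw new IllegalArgumentException("totalRows must be >= 0 and rowsPerBatch > 0");
    }
    return totalRows / rowsPerBatch + (totalRows % rowsPerBatch == 0 ? 0 : 1);
  }

  public static void main(String[] args) {
    long totalRows = 5_000_000_000_000L; // hypothetical 5-trillion-row table

    // With the old effective cap of Integer.MAX_VALUE rows per batch.
    System.out.println(batchesNeeded(totalRows, Integer.MAX_VALUE)); // 2329

    // With a hypothetical long-valued limit of 500 billion rows per batch.
    System.out.println(batchesNeeded(totalRows, 500_000_000_000L)); // 10
  }
}
```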
Root Cause
streaming-max-rows-per-micro-batch was parsed and propagated as an int, and planner defaults initialized the effective row limit to Integer.MAX_VALUE even when no row limit was configured.
Change
This updates Spark streaming row-limit handling to use long values internally and uses Long.MAX_VALUE as the unconfigured row-limit sentinel. ReadLimit.maxRows(...) now receives the long-valued limit without narrowing, and unconfigured streaming reads no longer get an implicit Integer.MAX_VALUE row cap.
The legacy SparkReadConf.maxRecordsPerMicroBatch() int accessor is retained for source and binary compatibility, deprecated, and supplemented with maxRecordsPerMicroBatchLong() for the long-valued behavior.
File-count rate limiting, offsets, checkpoint representation, and complete-file soft-limit semantics are unchanged.
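The idea of long-valued parsing with a Long.MAX_VALUE sentinel can be sketched in plain Java, independent of the codebase. This is a minimal illustration, not the code in this PR: the class RowLimitParsing and its methods are hypothetical, and only the option name and the sentinel value come from the description above.

```java
import java.util.Map;

/** Hypothetical sketch of reading a long-valued row limit from options. */
final class RowLimitParsing {
  // Option name as given in the PR description.
  static final String MAX_ROWS_OPTION = "streaming-max-rows-per-micro-batch";

  // Sentinel meaning "no row limit configured".
  static final long UNCONFIGURED = Long.MAX_VALUE;

  // Parses the option as a long, so values above Integer.MAX_VALUE survive.
  static long maxRows(Map<String, String> options) {
    String raw = options.get(MAX_ROWS_OPTION);
    if (raw == null) {
      return UNCONFIGURED;
    }
    return Long.parseLong(raw.trim());
  }

  // Reports whether an int-based parse of the same string would succeed.
  static boolean fitsInInt(String raw) {
    try {
      Integer.parseInt(raw.trim());
      return true;
    } catch (NumberFormatException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    Map<String, String> configured = Map.of(MAX_ROWS_OPTION, "5000000000");

    System.out.println(fitsInInt("5000000000"));          // false
    System.out.println(maxRows(configured));               // 5000000000
    System.out.println(maxRows(Map.of()) == UNCONFIGURED); // true
  }
}
```

In this sketch an int-based parse rejects 5000000000 outright, while the long-based parse keeps it and an absent option maps to the sentinel rather than to an implicit int-sized cap.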
This is a Codex change. I'm unfamiliar with this codebase.
Tests
Added or updated coverage for:
- Integer.MAX_VALUE,
- Integer.MAX_VALUE row cap,
- int accessor compatibility and new long accessor behavior.
Validated with:
- compileTestJava for Spark 3.5, 4.0, and 4.1
- TestSparkReadConf.testMaxRecordsPerMicroBatch* tests for Spark 3.5, 4.0, and 4.1
- spotlessJavaApply
- git diff --check
Compatibility
Existing option names and configurations remain valid. Values at or below Integer.MAX_VALUE preserve existing behavior.
The existing int maxRecordsPerMicroBatch() method remains available to avoid breaking downstream callers, but is deprecated because it cannot represent newly supported values above Integer.MAX_VALUE. New callers should use maxRecordsPerMicroBatchLong().
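Why the int accessor cannot carry the new values can be shown with a small stand-in class. This is a sketch under stated assumptions: ReadConfSketch is hypothetical and only borrows the two accessor names from this description. In particular, saturating at Integer.MAX_VALUE is an assumption made for the example; the description does not say how the real legacy accessor treats larger values.

```java
/** Hypothetical stand-in for a read configuration holding a long row limit. */
final class ReadConfSketch {
  private final long maxRows;

  ReadConfSketch(long maxRows) {
    this.maxRows = maxRows;
  }

  /** Long-valued accessor; represents every configured value exactly. */
  long maxRecordsPerMicroBatchLong() {
    return maxRows;
  }

  /**
   * Legacy int accessor kept so existing callers still compile and link.
   * ASSUMPTION: this sketch saturates at Integer.MAX_VALUE for larger values.
   */
  @Deprecated
  int maxRecordsPerMicroBatch() {
    return (int) Math.min(maxRows, (long) Integer.MAX_VALUE);
  }

  public static void main(String[] args) {
    ReadConfSketch conf = new ReadConfSketch(3_000_000_000L);

    System.out.println(conf.maxRecordsPerMicroBatchLong()); // 3000000000
    System.out.println(conf.maxRecordsPerMicroBatch());     // 2147483647 in this sketch

    // A plain narrowing cast silently wraps, which is why int cannot be trusted here.
    System.out.println((int) 3_000_000_000L);               // -1294967296
  }
}
```

Whatever the real accessor does for out-of-range values, an int return type has no way to report 3000000000 faithfully, which is the reason new callers are pointed at the long accessor.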