[Spark] Move changelog tests to spark-unified to fix Spark 4.2 classpath#6830
Conversation
f10d479 to
5a7bdd5
Compare
johanl-db
left a comment
There was a problem hiding this comment.
The PR description is really verbose, trim it down so that it's easy to grasp the essential information
| // Mirror what `sparkDependentSettings` does for the version-specific test shim directory, | ||
| // but root it at this module's baseDirectory instead of `Test / baseDirectory` (which points | ||
| // at sparkV1). Without this, version-gated tests that live under spark-unified (e.g. the | ||
| // changelog tests under `scala-shims/spark-4.2`) are never compiled, and tests that exist | ||
| // only for Spark 4.2 would also be compiled against 4.0/4.1 if the path were unconditional. |
There was a problem hiding this comment.
nit: more simply:
Add version-specific test shim directory, e.g. src/test/scala-shims/spark-4.2
| * {@code DeltaV2TestBase} lives in {@code sparkV2}'s test scope and is not visible from {@code | ||
| * spark-unified}. | ||
| */ | ||
| public abstract class DeltaChangelogTestBase { |
There was a problem hiding this comment.
I assume spark-unified test target doesn't depend on sparkV2 test target so that you can't use DeltaV2TestBase here?
Might be helpful to be able to reuse things from sparkV2 tests though. Not super important here, but if it's easily doable to have this dependency it would be worth doing rather than redefining every helper
There was a problem hiding this comment.
Done — added ExclusionRule("io.delta", "delta-spark_2.13") to excludeDependencies to drop the released delta-spark from spark-unified's test classpath (totherwise would re-introduce the cast/NoSuchMethodError), now extending DeltaV2TestBase again.
864480e to
e15b8b8
Compare
|
Changelog tests run and pass in private repo CI: |
ac2c631 to
7ebe52e
Compare
johanl-db
left a comment
There was a problem hiding this comment.
Looks good to me, can you have @murali-db take a look also since he's working on the 4.2 build?
| throw new RuntimeException("Failed to process CDC commit actions", e); | ||
| // Include the inner message so AnalysisException error classes | ||
| // (e.g. DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE) reach the user. | ||
| throw new RuntimeException( |
There was a problem hiding this comment.
(can be done as follow-up): It would be better to surface proper spark exceptions, with an error class, rather than generic runtime errors
There was a problem hiding this comment.
Added to the backlog
762a421 to
72926a6
Compare
72926a6 to
f8ef032
Compare
f8ef032 to
9f32bc8
Compare
<!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Follow-up to [#6794](#6794). Adds Deletion Vector (DV) support to the V2 changelog read path so `CHANGES FROM VERSION x TO VERSION y` returns correct results on tables with `delta.enableDeletionVectors=true`. **In short,** I apply change_type delete to every removeFile and change_type insert to every row in addFile per commit. A carry-over pair (delete+insert) will be filtered out as carry-over, but with the DVs we apply the DV as filter to the AddFile, leading to delete rows without insert partner, which represent actual delets. ### Implementation `DeltaChangelogBatch.java`: - `CDCInputPartition` carries a per-file `deletionVectorInfo: String` (the base64-serialized DV descriptor, or `null` when the file has no DV). - `planInputPartitions` extracts the DV per `AddFile`/`RemoveFile` via `DeletionVectorDescriptor::serializeToBase64` and threads it onto the partition. - `createReader` uses the new public `PartitionUtils.buildDvMetadata(...)` helper to set `FILE_ROW_INDEX_FILTER_ID_ENCODED` and `FILE_ROW_INDEX_FILTER_TYPE` together (with `RowIndexFilterType.IF_CONTAINED`) on the per-file metadata, so the existing `DropMarkedRowsFilter` machinery picks up the DV and produces the correct `delete` change rows. - Wrapper `RuntimeException` messages now include the inner cause's `getMessage()` so `AnalysisException` error classes (e.g. `DELTA_CHANGELOG_*`) reach the user instead of being hidden behind a generic "Failed to process CDC commit actions". `PartitionUtils.java`: - New public 1-arg `buildDvMetadata(String dvBase64)` overload that takes the base64 string directly and uses `IF_CONTAINED` semantics. Returns `java.util.Map` for ergonomic iteration from the changelog caller. Existing private overloads are unchanged. ### Tests `DeltaChangelogDirectBatchExecutionTest.java`: - `testDirectBatchExecutionWithExplicitExpectedRows` and `testUpdateProducesPairedDeleteAndInsert` are now parameterized via `@ParameterizedTest @valuesource(booleans = {false, true})` so each runs once with DVs off (rewrite-on-delete) and once with DVs on (DV-on-delete). The expected row sets are identical, so the same assertions cover both physical layouts. `DeltaChangelogDvTest.java` (new): DV-specific scenarios that don't fit the parameterized variants. - `test_mixedDvDelete_perFileBranching`: two `AddFile`s in one commit, a single `DELETE` touching both; asserts the per-file DV branch in `planInputPartitions` sets the DV constants independently per file. - `test_multiVersionDvDeletes_perCommitIsolation`: two `DELETE` commits in sequence on a DV-enabled table; asserts Catalyst's batch CDC post-processor partitions on `(row_id, version)` so each commit's carry-overs collapse independently. ## Note on Spark 4.2 lane The Spark 4.2 lane is non-blocking (`continue-on-error: true`); a follow-up PR ([#6830](#6830)) relocates these tests to `spark-unified/test` where the in-tree `DeltaCatalog` is on the classpath, making the 4.2 lane green for these tests too. ## How was this patch tested? Locally against Spark 4.0 / 4.1 (default lane, green) and Spark 4.2 (with PR #6830 applied to verify the tests pass end-to-end). 25 / 25 changelog tests pass with both PRs combined. ## Does this PR introduce _any_ user-facing changes? The V2 changelog path is gated behind `spark.databricks.delta.changelogV2.enabled` and not yet on a released code path. With this PR, when enabled, it now returns correct results on DV-enabled tables.
and its helpers configure
to the unified Java in
, but they live in . cannot depend
on (circular), so the in-tree unified DeltaCatalog is
not on the test classpath. The class with the same FQN is instead
pulled in via:
sparkV2/test
-> kernelUnityCatalog(test->test)
-> kernelDefaults(test->test)
-> io.delta %% delta-spark % 4.0.0 % test
That released jar contains the pre-delta-io#5320 (a single
class extending directly, without the
ancestor). At runtime the JVM mixes the released
4.0.0 catalog with the current-build self-typed
trait, and the trait dispatch's synthetic cast to
fails with . After the trait inline, the same
released jar surfaces as because its bytecode targets the pre-Spark-4.2 6-arg signature.
Move the changelog tests to . spark-unified has the
hybrid DeltaCatalog in its sources, so the test classpath
resolves to the in-tree class. spark-unified depends on only
at , so the kernelUnityCatalog -> kernelDefaults ->
released-delta-spark-4.0.0 chain does not apply, and the released jar is
not on the test classpath at all.
Three deltas, no production code change:
- Move the three test files from
to
.
- Add JUnit Jupiter to . The
changelog tests use Jupiter (, , ,
); spark-unified previously had no Jupiter deps
because its other tests are scalatest.
- Make self-contained. It used to extend
from ; that base is not visible from
since spark-unified does not depend on sparkV2 at
. Inline the static / fields, the
/ lifecycle, and the (x2)//
helpers actually used by the changelog tests.
Run with and a Spark 4.2 preview pinned in
[Spark] Move changelog tests to spark-unified to fix Spark 4.2 classpath
and its helpers configure
to the unified Java in
, but they live in . cannot depend
on (circular), so the in-tree unified DeltaCatalog is
not on the test classpath. The class with the same FQN is instead
pulled in via:
sparkV2/test
-> kernelUnityCatalog(test->test)
-> kernelDefaults(test->test)
-> io.delta %% delta-spark % 4.0.0 % test
That released jar contains the pre-delta-io#5320 (a single
class extending directly, without the
ancestor). At runtime the JVM mixes the released
4.0.0 catalog with the current-build self-typed
trait, and the trait dispatch's synthetic cast to
fails with . After the trait inline, the same
released jar surfaces as because its bytecode targets the pre-Spark-4.2 6-arg signature.
Move the changelog tests to . spark-unified has the
hybrid DeltaCatalog in its sources, so the test classpath
resolves to the in-tree class. spark-unified depends on only
at , so the kernelUnityCatalog -> kernelDefaults ->
released-delta-spark-4.0.0 chain does not apply, and the released jar is
not on the test classpath at all.
Three deltas, no production code change:
- Move the three test files from
to
.
- Add JUnit Jupiter to . The
changelog tests use Jupiter (, , ,
); spark-unified previously had no Jupiter deps
because its other tests are scalatest.
- Make self-contained. It used to extend
from ; that base is not visible from
since spark-unified does not depend on sparkV2 at
. Inline the static / fields, the
/ lifecycle, and the (x2)//
helpers actually used by the changelog tests.
Run with and a Spark 4.2 preview pinned in
Three follow-up fixes after the test relocation in 5a7bdd5. Without them the tests are either undiscovered, or hit wrapped error messages that hide the underlying error class assertions. 1. build.sbt (spark-unified test source dirs) spark-unified pins Test/baseDirectory to sparkV1's directory, so CrossSparkVersions.sparkDependentSettings() adds the version-shim test dir (scala-shims/spark-4.2) under spark/, not under spark-unified/. The relocated changelog tests sit at spark-unified/src/test/scala-shims/spark-4.2/... and would not be discovered. Add unifiedDir/src/test/scala-shims/spark-4.2 to the explicit Test/unmanagedSourceDirectories list so the shim dir is visible from both module bases. 2. DeltaChangelogScanBuilder.java (boundary RT-disabled error class) The eager start-snapshot row-tracking check threw DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE, but that error class is for the per-commit loop in DeltaChangelogBatch that detects an RT toggle inside the range. A boundary snapshot without RT is the feature-missing case and should match the end-snapshot check, which throws DELTA_CHANGELOG_REQUIRES_ROW_TRACKING. Align both boundary checks to use REQUIRES_ROW_TRACKING; the in-range error class stays for the mid-range toggle case only. 3. DeltaChangelogBatch.java (preserve AnalysisException error class) The generic catch (Exception e) inside planInputPartitions wraps DeltaAnalysisException (checked Exception, not RuntimeException) in a new RuntimeException("Failed to process CDC commit actions"), discarding the inner getMessage(). The error class (e.g. DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE, DELTA_CHANGELOG_SCHEMA_CHANGE_IN_RANGE) no longer reaches the user. Append e.getMessage() to the wrapper message so the original formatted error string surfaces
Per PR review feedback. Co-authored-by: Isaac
The unscoped excludeDependencies blocked MiMa's previous-ABI resolution during the compile-time backward-compat check (the spark/mimaPreviousClassfiles task could not find io.delta:delta-spark_2.13:4.2.0). Restrict the exclusion to the Test config so MiMa sees the artifact in Compile/runtime contexts while the test classpath still drops the released jar that shadows the in-tree unified DeltaCatalog. Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
The start-snapshot row-tracking check throws throwChangelogRowTrackingDisabledInRange(startVersion) (not the generic throwChangelogRequiresRowTracking) to distinguish: - table never had row tracking (end-boundary fails) - tell user to enable RT. - table has RT now but startVersion predates the toggle (start-boundary fails) - tell user which version is the problem. A previous rebase had collapsed both branches onto the generic error, losing the start-version context. Restore parity with master. Co-authored-by: Isaac
When a table never had row tracking, both the start and end boundary checks fail. Previously the start-boundary check fired first and threw DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE, telling the user that RT toggled mid-range. This is misleading because RT was never enabled. Reorder so the end-boundary check runs first: - end has no RT -> DELTA_CHANGELOG_REQUIRES_ROW_TRACKING (table needs RT). - end has RT, start has no RT -> DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE (toggle within range, points at the offending startVersion). This surfaces only with the assumeFalse skip in DeltaChangelogTestBase removed (which is also part of this PR), so the testChangelogRejectsTableWithoutRowTracking case starts exercising the boundary logic on Spark 4.2 preview5. Co-authored-by: Isaac
9f32bc8 to
26cba7f
Compare
test_multiVersionDvDeletes_perCommitIsolation passes locally on Spark 4.2 preview5 but fails on CI runs with the same artifact. Same-path Add+Remove from a DV-DELETE produces carry-over delete/insert pairs that Spark Catalyst Phase-1 cancels for two out of three (row_id, _commit_version) partitions, but leaves the third surviving on CI. The values look identical to the canceled partitions, so the divergence between local and CI environments has no obvious source yet. Disabling so the rest of the V2 changelog suite stays green on CI 4.2 preview5 while the root cause is investigated. Re-enable when fixed. Co-authored-by: Isaac
| .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin) | ||
| .settings ( | ||
| name := "delta-spark", | ||
| // Exclude the released `delta-spark` jar that sparkV2's test deps |
There was a problem hiding this comment.
i didnt get what is doing? released, non-test delta-spark jar will not have test dependencies. are you talking about the delta-spark tests jar? do we release that?
There was a problem hiding this comment.
Hi TD, thanks for taking a look.
I want to admit that build changes are a completly unknown terretory for me, hence those are very LLM supported changes.
When building with Spark 4.2 I kept receiving a ClassCastException at AbstractDeltaCatalog.scala:1118. This was due to an older released delta-spark jar which was added to the test classpath via transitive test dependencies and conflicting with the in-tree DeltaCatalog. To fix this I moved the tests to spark-unified and AFAIU there was then a race condition between the older and newer version of DeltaCatalog.class which this addition then decides. Thats how I understand it.
The issue seems to be in this line:
// test->test so spark-unified test sources can reuse helpers like
// DeltaV2TestBase from sparkV2's test scope.
.dependsOn(sparkV2 % "compile->compile;test->test")
It adds a dependency from spark-unified / test on sparkV2 / test . Then it goes down the chain of test dependencies as noted by the comment below (e.g. sparkV2 / test -> kernelUnityCatalog / test here) which ends up pulling an old version of delta-spark for testing that conflicts.
Seing the test->test dependency chain, don't you think this is a reasonable solution? Alternatively I could remove the dependency and instead duplicate around ~160 lines of helpers that I reuse in my tests. That could be a clear separation, but introduces code duplication. Not really pretty.
LLM response for maybe more correctness
> You're right that the released delta-spark binary itself doesn't carry test deps -- the conflict is transitive. sparkV2's `test->test` chain reaches kernelDefaults/test, which declares `delta-spark % 4.0.0 % test` (build.sbt:1166) for its own catalog tests. Once spark-unified inherits sparkV2's test classpath, that 4.0.0 jar lands on it and its pre-#5320 `DeltaCatalog.class` (no `ChangelogSupport` mixin) wins classloader resolution against the in-tree class -- V2 changelog reads then fall back to the legacy `CDCReader`. The Test-scoped exclusion forces the in-tree class to win; MiMa's previous-ABI lookup still resolves the released jar in Compile/runtime contexts. The alternative is to drop `dependsOn(sparkV2 % "test->test")` and duplicate ~160 lines of DeltaV2TestBase into spark-unified -- removes the chain at the cost of duplicated test infrastructure.There was a problem hiding this comment.
okay.. lets debug. what happens if you remove these lines now?
There was a problem hiding this comment.
specifically test exclusion lines.
There was a problem hiding this comment.
I encountered issues whilst removing the dependencies, as 2 tests started to fail with assertion failures. Moving the tests to the original sparkV2 test path lead to the ClassCastException mentioned above. Due to time constraints, I opted to continue with removing this part and introducing the helpers in spark-unified. With this the tests pass again.
…y DeltaV2TestBase Drops both: - `dependsOn(sparkV2 % "test->test")` (the test->test chain) - `Test / excludeDependencies += ExclusionRule(io.delta, delta-spark_2.13)` and adds a local copy of DeltaV2TestBase to spark-unified test sources. Hypothesis: without the test->test chain, the released delta-spark 4.0.0 jar never reaches spark-unified's test classpath, so neither the ClassCastException at AbstractDeltaCatalog nor the phantom-carryover events should occur on CI. Local result: 24/25 pass, 1 skipped (@disabled). CI run will validate. Co-authored-by: Isaac
The full delta-spark test suite against Spark 4.2 includes pre-existing UC connector incompatibilities (UCProxy bytecode targets Spark 4.0's 6-arg CatalogStorageFormat.copy, not 4.2's 7-arg signature). Those failures are masked by continue-on-error on the master matrix lane in spark_test.yaml, but they obscure the signal we actually want: does THIS PR's changelog code path work on Spark 4.2? Add a narrow workflow that runs only spark/testOnly io.delta.spark.internal.v2.read.changelog.* against Spark 4.2-preview4. Green = the PR is unblocked on 4.2 even though the broader matrix is unstable. Lives on the fork-CI branches only. Co-authored-by: Isaac
Master now pins fullVersion=4.2.0-preview5 directly with MASTER=None and uses '4.2' as the matrix alias. The 'master' alias no longer resolves. Switch the focused workflow to -DsparkVersion=4.2. Co-authored-by: Isaac
…on route fixes it CI on the duplication route was green for 24/25 with one @disabled. Now re-enable that test to confirm whether dropping the test->test chain (and thus keeping the released delta-spark 4.0.0 jar off the test classpath) also fixes the phantom-carryover bug it was @disabled for. Co-authored-by: Isaac
|
|
||
| CrossSparkVersions.sparkDependentSettings(sparkVersion), | ||
|
|
||
| // Add version-specific test shim directory, e.g. `src/test/scala-shims/spark-4.2`. |
There was a problem hiding this comment.
can you please merge this section with the existing Test / unmanagedSourceDirectories section 10 lines above
…edSourceDirectories block Per TD's review, the second `Test / unmanagedSourceDirectories ++=` block can be folded into the existing `:=` block 10 lines above. Single block, lazily computes the shim dir via sparkVersion.value plus SparkVersionSpec.ALL_SPECS, and appends it to the base test source dirs. Net: -4 lines, no behavior change. Co-authored-by: Isaac
Per TD's review, list the shim dir as a regular Seq element instead of appending via `++ shimDir`. Safe to `.get` because every supported Spark version's spec defines additionalSourceDir. Co-authored-by: Isaac
Co-authored-by: Isaac
875a529 to
bd26ec2
Compare
Which Delta project/connector is this regarding?
Description
Move the changelog tests to
spark-unified/testas we had a circular dependency in the tests from sparkV2 to spark-unified and from spark-unified to sparkV2. Fixes theClassCastExceptionatAbstractDeltaCatalog.scala:1118.How was this patch tested?
See "How I build" at the bottom for the full command. Ran
DeltaChangelogTestBase.java. Result on Spark 4.2-preview4:Furthermore ran my tests in private repo CI https://github.com/SanJSp/delta/actions/runs/26287328831/job/77378150469?pr=2
Does this PR introduce any user-facing changes?
No. The two small Java edits in
DeltaChangelogScanBuilder.javaandDeltaChangelogBatch.javaaffect the V2 changelog code path's error-reporting (which error class surfaces to the user when row tracking is misconfigured), but the V2 changelog path remains gated behindspark.databricks.delta.changelogV2.enabledand is not used in any released code path today.How I build
Quick reference: minimal cherry-pick to build + test against Spark 4.2
Verified: 21 / 21 changelog tests pass with this setup. Without PR #6657, compile fails on Spark 4.2 ABI shifts. Without this PR (#6830), tests fail at the first
spark.sql(...)call withClassCastExceptionfromSupportsPathIdentifier, followed byNoSuchMethodError: CatalogStorageFormat.copy(6 args)once that cast is fixed.Expected:
The
DeltaChangelogTestBase.javaconflict from the PR #6657 merge has to be resolved withgit rm(PR #6830 deletes the file fromsparkV2/test, PR #6657 modifies it):# If the merge of pr-6830 reports a CONFLICT (modify/delete) on DeltaChangelogTestBase.java: git rm spark/v2/src/test/scala-shims/spark-4.2/io/delta/spark/internal/v2/read/changelog/DeltaChangelogTestBase.java git commit --no-edit