[SPARK-57029][SQL][TESTS] Add byte-level visibility golden for ICU collation sort keys by yaooqinn · Pull Request #56096 · apache/spark

yaooqinn · 2026-05-25T05:16:17Z

What changes were proposed in this pull request?

Add a test-only visibility golden suite for ICU collation sort keys:

New test: sql/core/src/test/scala/org/apache/spark/sql/ICUCollationSortKeyGoldenSuite.scala
New golden: sql/core/src/test/resources/collations/ICU-collations-sort-keys.md (38 cells, ~1900 bytes)

The suite snapshots (collation, input) -> hex(CollationKey) for 14 dimensions covering the ICU surface Spark uses: UCA primary / tertiary case / secondary diacritic; NFC vs NFD canonical equivalence; combining-mark reorder visibility; SMP surrogate path; BMP precomposed Hangul; ASCII punct / space at primary; Turkish locale tailoring (en_USA + tr); CJK Han implicit weighting; empty string boundary; U+FFFD; C0 controls; variation selectors.
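The `(collation, input) -> hex(CollationKey)` mapping can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the suite's code: it uses the JDK's `java.text.Collator` as a stand-in so it runs without icu4j on the classpath, whereas the suite goes through `com.ibm.icu.text.Collator.getCollationKey()`. The object `SortKeyHexSketch` and its helpers are hypothetical names, and the bytes it prints will not match the golden file.

```scala
import java.text.Collator
import java.util.Locale

// Sketch only. java.text.Collator is a stand-in for com.ibm.icu.text.Collator;
// the key bytes it yields differ from ICU's and from the golden file.
object SortKeyHexSketch {
  /** Lower-case hex, two digits per byte. */
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString

  /** Raw sort-key bytes of `input` under the collator for `locale`. */
  def sortKey(locale: Locale, input: String): Array[Byte] =
    Collator.getInstance(locale).getCollationKey(input).toByteArray

  /** Unsigned, most-significant-byte-first comparison of two keys. */
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int =
    a.zip(b)
      .collectFirst { case (x, y) if x != y => (x & 0xff) - (y & 0xff) }
      .getOrElse(a.length - b.length)

  def main(args: Array[String]): Unit = {
    val locales = Seq(Locale.US, Locale.forLanguageTag("tr"))
    for (locale <- locales; input <- Seq("a", "A", "i", "I")) {
      // One "cell": (collation, input) -> hex(sort key)
      println(s"${locale.toLanguageTag}\t$input\t${hex(sortKey(locale, input))}")
    }
  }
}
```

Comparing two keys byte by byte gives the same ordering as comparing the strings with the collator, which is why a change in the recorded bytes can signal a change in sort order.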

Each test asserts a contract on the recorded bytes: row existence, non-empty hex, level segmentation for NON_IGNORABLE alternate handling, prefix-share invariants for Turkish tailoring, and the ICU compressed-sortkey lead byte for CJK implicit weights. Tests for drift-prone dimensions fail with named-condition messages if Spark's ICU configuration or library version changes the semantics; tests for stable dimensions fail if a regression silently drops or folds a cell.
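A level-segmentation contract of this kind could look roughly like the sketch below. It relies on the ICU sort-key layout, where `0x01` separates levels and `0x00` terminates the key. The object `SortKeyLevelsSketch` and its helper names are hypothetical, and the hex strings in the usage note are illustrative values in the shape of ICU keys, not rows from the golden file.

```scala
// Sketch only: helper names are hypothetical, and the hex strings used with
// them are illustrative, not rows from the golden file.
object SortKeyLevelsSketch {
  /** Parses an even-length hex string into bytes. */
  def parseHex(hex: String): Array[Byte] = {
    require(hex.length % 2 == 0, s"odd-length hex: $hex")
    hex.grouped(2).map(pair => Integer.parseInt(pair, 16).toByte).toArray
  }

  /**
   * Splits an ICU sort key into its levels. ICU separates levels with 0x01
   * and terminates the key with 0x00.
   */
  def levels(hex: String): Vector[Vector[Byte]] = {
    val bytes = parseHex(hex)
    val body = if (bytes.nonEmpty && bytes.last == 0) bytes.dropRight(1) else bytes
    body.foldLeft(Vector(Vector.empty[Byte])) { (acc, b) =>
      if (b == 1) acc :+ Vector.empty[Byte] // separator: start the next level
      else acc.init :+ (acc.last :+ b)      // append to the current level
    }
  }

  /** Contract: the primary level (first segment) carries at least one byte. */
  def hasPrimaryWeight(hex: String): Boolean = levels(hex).head.nonEmpty

  /** Contract: two keys agree on the primary level. */
  def samePrimary(a: String, b: String): Boolean = levels(a).head == levels(b).head
}
```

For example, `SortKeyLevelsSketch.levels("2a0105010500")` yields three one-byte segments. Under NON_IGNORABLE alternate handling, a space or punctuation cell would be expected to satisfy `hasPrimaryWeight`, and a prefix-share check in the spirit of the Turkish cells could be built on `samePrimary`.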

The pattern mirrors ICUCollationsMapSuite (which lists the ICU locale surface) and is scoped to ICU-backed collations only. UTF8_LCASE is out of scope -- it does not go through com.ibm.icu.text.Collator.getCollationKey() and is already covered by CollationFactorySuite.

Why are the changes needed?

icu4j upgrades can silently change ORDER BY ... COLLATE semantics across Spark versions. Past upgrade PRs (e.g. SPARK-50189, SPARK-52038, SPARK-54447, SPARK-55308, SPARK-56397) touch only the dependency file and benchmark results -- they ship no byte-level regression check on sort output, so a CLDR re-weighting can land in master without any reviewer signal.

Empirical evidence from a local cross-version probe (icu4j 72.1 through 78.3, 33 test cells covering Latin / Turkish / zh_CN): the icu4j 75 → 76 transition altered 23/33 cell sortkeys (UCA primary base shift, e.g. en_US 'a': 0x2a → 0x2b); 77.1 → 78.3 (Spark 4.1 → master, SPARK-52038 → SPARK-56397) altered 4/33 cells silently. None of these drifts surfaced in PR review.

This suite makes such drift visible during ICU upgrade review: any change to the recorded bytes shows up as a golden diff that a reviewer must explicitly accept. It is not a stability contract -- the disclaimer at golden line 1, the GOLDEN_DISCLAIMER constant (and the line-1 assert that pins it), and the suite scaladoc all state that downstream consumers MUST NOT rely on byte equality across Spark versions. The file is a review-trigger snapshot, nothing more.
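The kind of drift summary quoted above (for example "23/33 cells") can be derived from two snapshots with a few lines of code. The sketch below is only an illustration: the `Snapshot` type, the object `SortKeyDriftSketch`, and the in-memory representation are assumptions, since the actual golden file layout is not shown here.

```scala
// Sketch only: the cell type and object name are hypothetical.
object SortKeyDriftSketch {
  /** A cell is identified by (collation, input); the value is the hex sort key. */
  type Snapshot = Map[(String, String), String]

  /** Cells present in both snapshots whose recorded bytes differ, in stable order. */
  def changed(before: Snapshot, after: Snapshot): Seq[(String, String)] =
    before.keySet.intersect(after.keySet).toSeq
      .filter(cell => before(cell) != after(cell))
      .sorted

  /** Cells dropped by the newer snapshot. */
  def dropped(before: Snapshot, after: Snapshot): Seq[(String, String)] =
    (before.keySet -- after.keySet).toSeq.sorted

  /** One-line summary in a "changed/total" form. */
  def summary(before: Snapshot, after: Snapshot): String =
    s"${changed(before, after).size}/${before.size} cells changed, " +
      s"${dropped(before, after).size} dropped"
}
```

In practice the golden diff in the PR plays this role for the reviewer; the sketch only shows that the comparison is a plain per-cell byte equality check, with no tolerance or normalization.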

Reviewer note: when this golden file changes on a PR that does not bump icu4j, please request a revert -- regeneration belongs in the ICU upgrade PR.

Does this PR introduce any user-facing change?

No. Test-only; no SQLConf, no public API, no production code path.

How was this patch tested?

New suite ICUCollationSortKeyGoldenSuite (16 tests). Local 16/16 PASS on master, two-pass deterministic: regenerate the golden, then re-run from disk -- bytes match.
Regenerate the golden with SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.ICUCollationSortKeyGoldenSuite"; the suite enforces idempotency and that on-disk bytes match the regen output.
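The regenerate-then-verify flow described above can be sketched as follows. This is a minimal sketch, not the suite's implementation: `GoldenFileSketch`, `regenerateRequested`, and `writeOrVerify` are hypothetical names, and treating the value `1` of `SPARK_GENERATE_GOLDEN_FILES` as the switch is an assumption taken from the command line shown above.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

// Sketch only: names are hypothetical; the env-var check is an assumption
// based on the regeneration command in the PR description.
object GoldenFileSketch {
  /** True when the regeneration switch is set to "1". */
  def regenerateRequested(env: Map[String, String] = sys.env): Boolean =
    env.get("SPARK_GENERATE_GOLDEN_FILES").contains("1")

  /**
   * When regenerating, writes `rendered` to `golden` first. In both modes,
   * re-reads the file from disk and reports whether it matches byte for byte.
   */
  def writeOrVerify(golden: Path, rendered: String, regenerate: Boolean): Boolean = {
    val expected = rendered.getBytes(UTF_8)
    if (regenerate) Files.write(golden, expected)
    java.util.Arrays.equals(Files.readAllBytes(golden), expected)
  }
}
```

A two-pass check in this style would call `writeOrVerify(path, render(), regenerate = true)` once and then `writeOrVerify(path, render(), regenerate = false)`, where `render()` stands for whatever produces the golden text; both calls returning `true` indicates the output is deterministic and matches what is on disk.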

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7


dongjoon-hyun

Is there a way for GitHub not to consider this MD file as a binary, @yaooqinn ?

yaooqinn · 2026-05-26T01:24:53Z

Hi @dongjoon-hyun, I had the same concern until Copilot pointed me to this sibling file:
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/collations/ICU-collations-map.md

If you think we should use .txt like the SQL golden files, I can switch to a txt-based golden.

dongjoon-hyun · 2026-05-26T01:54:38Z

Got it. Never mind my previous comment.

cc @cloud-fan

yaooqinn force-pushed the spark-icu-sortkey-golden branch from b0fe8cc to 8c38b17 Compare May 25, 2026 10:54

yaooqinn marked this pull request as ready for review May 25, 2026 14:01

yaooqinn requested a review from dongjoon-hyun May 25, 2026 14:55

dongjoon-hyun reviewed May 26, 2026

yaooqinn wants to merge 1 commit into apache:master from yaooqinn:spark-icu-sortkey-golden
