Skip to content

[SPARK-57029][SQL][TESTS] Add byte-level visibility golden for ICU collation sort keys#56096

Open
yaooqinn wants to merge 1 commit into
apache:masterfrom
yaooqinn:spark-icu-sortkey-golden
Open

[SPARK-57029][SQL][TESTS] Add byte-level visibility golden for ICU collation sort keys#56096
yaooqinn wants to merge 1 commit into
apache:masterfrom
yaooqinn:spark-icu-sortkey-golden

Conversation

@yaooqinn
Copy link
Copy Markdown
Member

@yaooqinn yaooqinn commented May 25, 2026

What changes were proposed in this pull request?

Add a test-only visibility golden suite for ICU collation sort keys:

  • New test: sql/core/src/test/scala/org/apache/spark/sql/ICUCollationSortKeyGoldenSuite.scala
  • New golden: sql/core/src/test/resources/collations/ICU-collations-sort-keys.md (38 cells, ~1900 bytes)

The suite snapshots (collation, input) -> hex(CollationKey) for 14 dimensions covering the ICU surface Spark uses: UCA primary / tertiary case / secondary diacritic; NFC vs NFD canonical equivalence; combining-mark reorder visibility; SMP surrogate path; BMP precomposed Hangul; ASCII punct / space at primary; Turkish locale tailoring (en_USA + tr); CJK Han implicit weighting; empty string boundary; U+FFFD; C0 controls; variation selectors.

Each test asserts a contract on the recorded bytes: row existence, non-empty hex, level segmentation for NON_IGNORABLE alternate handling, prefix-share invariants for Turkish tailoring, and the ICU compressed-sortkey lead byte for CJK implicit weights. Drift-prone dims fire with named-condition messages if Spark's ICU configuration or library version changes the semantic; stable dims fire if a regression silently drops or folds a cell.

The pattern mirrors ICUCollationsMapSuite (which lists the ICU locale surface) and is scoped to ICU-backed collations only. UTF8_LCASE is out of scope -- it does not go through com.ibm.icu.text.Collator.getCollationKey() and is already covered by CollationFactorySuite.

Why are the changes needed?

icu4j upgrades silently change ORDER BY ... COLLATE semantics across Spark versions. Past upgrade PRs (e.g. SPARK-50189, SPARK-52038, SPARK-54447, SPARK-55308, SPARK-56397) touch only the dependency file and benchmark results -- they ship no byte-level regression on sort output, so a CLDR re-weighting can land in master without any reviewer signal.

Empirical evidence from a local cross-version probe (icu4j 72.1 through 78.3, 33 test cells covering Latin / Turkish / zh_CN): the icu4j 75 → 76 transition altered 23/33 cell sortkeys (UCA primary base shift, e.g. en_US 'a': 0x2a → 0x2b); 77.1 → 78.3 (Spark 4.1 → master, SPARK-52038SPARK-56397) altered 4/33 cells silently. None of these drifts surfaced in PR review.

This suite makes such drift visible during ICU upgrade review: any change to the recorded bytes shows up as a golden diff that a reviewer must explicitly accept. It is not a stability contract -- the disclaimer at golden line 1, the GOLDEN_DISCLAIMER constant (and the line-1 assert that pins it), and the suite scaladoc all state that downstream consumers MUST NOT rely on byte equality across Spark versions. The file is a review-trigger snapshot, nothing more.

Reviewer note: when this golden file changes on a PR that does not bump icu4j, please request a revert -- regeneration belongs in the ICU upgrade PR.

Does this PR introduce any user-facing change?

No. Test-only; no SQLConf, no public API, no production code path.

How was this patch tested?

  • New suite ICUCollationSortKeyGoldenSuite (16 tests). Local 16/16 PASS on master, two-pass deterministic: regenerate the golden, then re-run from disk -- bytes match.
  • Regenerate the golden with SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.ICUCollationSortKeyGoldenSuite"; the suite enforces idempotency and that on-disk bytes match the regen output.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

…llation sort keys

### What changes were proposed in this pull request?

Add a test-only visibility golden suite for ICU collation sort keys:

- New test: `sql/core/src/test/scala/org/apache/spark/sql/ICUCollationSortKeyGoldenSuite.scala`
- New golden: `sql/core/src/test/resources/collations/ICU-collations-sort-keys.md` (38 cells, ~1900 bytes)

The suite snapshots `(collation, input) -> hex(CollationKey)` for 14 dimensions covering the ICU surface Spark uses: UCA primary / tertiary case / secondary diacritic; NFC vs NFD canonical equivalence; combining-mark reorder visibility; SMP surrogate path; BMP precomposed Hangul; ASCII punct / space at primary; Turkish locale tailoring (en_USA + tr); CJK Han implicit weighting; empty string boundary; U+FFFD; C0 controls; variation selectors.

Each test asserts a contract on the recorded bytes: row existence, non-empty hex, level segmentation for NON_IGNORABLE alternate handling, prefix-share invariants for Turkish tailoring, and the ICU compressed-sortkey lead byte for CJK implicit weights. Drift-prone dims fire with named-condition messages if Spark's ICU configuration or library version changes the semantic; stable dims fire if a regression silently drops or folds a cell.

The pattern mirrors `ICUCollationsMapSuite` (which lists the ICU locale surface) and is scoped to ICU-backed collations only. `UTF8_LCASE` is out of scope -- it does not go through `com.ibm.icu.text.Collator.getCollationKey()` and is already covered by `CollationFactorySuite`.

### Why are the changes needed?

icu4j upgrades silently change `ORDER BY ... COLLATE` semantics across Spark versions. Past upgrade PRs (e.g. SPARK-50189, SPARK-52038, SPARK-54447, SPARK-55308, SPARK-56397) touch only the dependency file and benchmark results -- they ship no byte-level regression on sort output, so a CLDR re-weighting can land in master without any reviewer signal.

Empirical evidence from a local cross-version probe (icu4j 72.1 through 78.3, 33 test cells covering Latin / Turkish / zh_CN): the icu4j 75 → 76 transition altered 23/33 cell sortkeys (UCA primary base shift, e.g. `en_US 'a': 0x2a → 0x2b`); 77.1 → 78.3 (Spark 4.1 → master, SPARK-52038 → SPARK-56397) altered 4/33 cells silently. None of these drifts surfaced in PR review.

This suite makes such drift visible during ICU upgrade review: any change to the recorded bytes shows up as a golden diff that a reviewer must explicitly accept. It is **not** a stability contract -- the disclaimer at golden line 1, the `GOLDEN_DISCLAIMER` constant (and the line-1 assert that pins it), and the suite scaladoc all state that downstream consumers MUST NOT rely on byte equality across Spark versions. The file is a review-trigger snapshot, nothing more.

Reviewer note: when this golden file changes on a PR that does not bump `icu4j`, please request a revert -- regeneration belongs in the ICU upgrade PR.

### Does this PR introduce _any_ user-facing change?

No. Test-only; no SQLConf, no public API, no production code path.

### How was this patch tested?

- New suite `ICUCollationSortKeyGoldenSuite` (16 tests). Local 16/16 PASS on master, two-pass deterministic: regenerate the golden, then re-run from disk -- bytes match.
- Regenerate the golden with `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.ICUCollationSortKeyGoldenSuite"`; the suite enforces idempotency and that on-disk bytes match the regen output.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
@yaooqinn yaooqinn force-pushed the spark-icu-sortkey-golden branch from b0fe8cc to 8c38b17 Compare May 25, 2026 10:54
@yaooqinn yaooqinn marked this pull request as ready for review May 25, 2026 14:01
@yaooqinn yaooqinn requested a review from dongjoon-hyun May 25, 2026 14:55
Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way for GitHub not to consider this MD file as a binary, @yaooqinn ?

Image Image

@yaooqinn
Copy link
Copy Markdown
Member Author

Hi @dongjoon-hyun, I shared the same concern with you until copilot showed me this sibling file.
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/collations/ICU-collations-map.md

If you think we shall use txt like sql golden files, I can switch to txt based.

@dongjoon-hyun
Copy link
Copy Markdown
Member

Got it. Never mind for my previous comment.

cc @cloud-fan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants