[SPARK-47327][SQL] Fix thread safety issue in ICU Collator #45436

stefankandic · 2024-03-08T13:37:31Z

What changes were proposed in this pull request?

Freezing the ICU collator upon creation.

Why are the changes needed?

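To prevent multiple threads from writing to the collation buffer while collation sort keys are being generated, which results in data corruption and an internal error.
You can read more about collator thread safety in the ICU threading model documentation: https://unicode-org.github.io/icu/userguide/icu/design.html#icu-threading-model-and-open-and-close-model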

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit test

Was this patch authored or co-authored using generative AI tooling?

no

uros-db · 2024-03-08T17:43:17Z

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java

@@ -138,11 +138,13 @@ public Collation(
    collationTable[2] = new Collation(
      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
    collationTable[2].collator.setStrength(Collator.TERTIARY);
+    collationTable[2].collator.freeze();


I've never seen "freeze" before, so I'm wondering how this affects us.

As I understood from the ICU docs: Once frozen, an object can never be unfrozen, so it is thread-safe from that point onward.

So what are the drawbacks of this approach?

This is interesting, I thought Collator just held a bunch of functions, but it has mutable state?

@cloud-fan the collator uses a buffer while writing collation keys; freezing it makes this operation safe by wrapping it in a reentrant lock (source)

this of course raises performance issues which we should probably discuss, because now we can't generate sort keys in parallel on a single collator
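
For readers unfamiliar with the pattern, the locking behaviour described above can be sketched roughly as follows. This is not ICU code: `FreezableKeyGenerator`, its `scratch` buffer, and the `ignoreCase` setting are hypothetical names made up for illustration, and the "sort key" is just the UTF-8 bytes of the input. The sketch only shows the shape of the idea: one shared buffer, a one-way `freeze()` switch, and a `ReentrantLock` taken around buffer writes once the object is frozen.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a collator that owns one scratch buffer; not ICU code.
final class FreezableKeyGenerator {
    private final byte[] scratch = new byte[64];            // shared mutable state
    private final ReentrantLock lock = new ReentrantLock();
    private volatile boolean frozen = false;
    private boolean ignoreCase = false;

    // Configuration is only allowed before freeze().
    void setIgnoreCase(boolean value) {
        if (frozen) {
            throw new UnsupportedOperationException("generator is frozen");
        }
        ignoreCase = value;
    }

    // One-way switch: there is deliberately no unfreeze().
    FreezableKeyGenerator freeze() {
        frozen = true;
        return this;
    }

    boolean isFrozen() {
        return frozen;
    }

    byte[] sortKey(String s) {
        if (!frozen) {
            return writeKey(s);                              // unguarded: single-threaded use only
        }
        lock.lock();                                         // frozen: serialize access to the buffer
        try {
            return writeKey(s);
        } finally {
            lock.unlock();
        }
    }

    // Writes the key into the shared buffer, then copies it out.
    private byte[] writeKey(String s) {
        String normalized = ignoreCase ? s.toLowerCase(Locale.ROOT) : s;
        byte[] utf8 = normalized.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > scratch.length) {
            throw new IllegalArgumentException("input too long for this sketch");
        }
        System.arraycopy(utf8, 0, scratch, 0, utf8.length);
        return Arrays.copyOf(scratch, utf8.length);
    }
}
```

In this sketch every call to `sortKey` on a frozen instance goes through the same lock, which is the source of the performance concern raised in this thread: key generation on a single instance becomes serialized.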

they should use one buffer per thread... Anyway this is out of our control and calling freeze LGTM

Yeah, as soon as we get benchmarks working we should revisit this decision.
One option that we also prototyped is to keep the Collator in ThreadLocal fields, which also solved the problem. But freeze is a bit cleaner, and we don't have microbenchmarks yet, so we can't make a data-driven decision at this point.
LGTM.
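
A minimal sketch of the ThreadLocal alternative mentioned above, for comparison. It is not the prototype from this PR: `PerThreadCollator` is a hypothetical name, and the JDK's `java.text.Collator` is used as a stand-in so the example runs without the ICU dependency. With ICU4J the initializer would presumably build the ICU collator instead.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical holder; the JDK Collator stands in for the ICU one.
final class PerThreadCollator {
    // Each thread lazily builds its own instance, so no buffer is shared across threads.
    private static final ThreadLocal<Collator> COLLATOR = ThreadLocal.withInitial(() -> {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setStrength(Collator.TERTIARY);
        return c;
    });

    // Returns the calling thread's private collator instance.
    static Collator current() {
        return COLLATOR.get();
    }

    // Generates a sort key using only the calling thread's collator.
    static byte[] sortKey(String s) {
        return current().getCollationKey(s).toByteArray();
    }
}
```

The trade-off is roughly memory and setup cost per thread in exchange for lock-free key generation, which is why benchmarks would be needed to choose between the two approaches.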

uros-db · 2024-03-08T17:46:49Z

sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala

+    // generating ICU sort keys is not thread-safe by default so this should fail
+    // if we don't handle the concurrency properly on Collator level
+
+    for (_ <- 1 to 100) {


just wondering: have you tried 1000, 10000, etc. or is 100 proven to be sufficient?

uros-db · 2024-03-08T17:47:49Z

sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala

@@ -438,6 +438,39 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
    }
  }

+  test("test concurrently running aggregates") {


on another note, do we have any truly multithreaded tests?

MaxGekk · 2024-03-08T19:31:18Z

sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala

+        ("unicode_CI", Seq("aaa", "aaa"), Seq(Row(2, "aaa"))),
+        ("unicode_CI", Seq("AAA", "aaa"), Seq(Row(2, "AAA"))),
+        ("unicode_CI", Seq("aaa", "bbb"), Seq(Row(1, "aaa"), Row(1, "bbb")))
+      ).foreach {


How does this sequential execution prove the fix? Maybe use a parallel collection?

that's a very good point, the test would fail simply because it would be run 100 times and at least one of those executions would have a data race - I improved it now to just call getCollationKey in a parallel foreach and not rely on Spark's execution of the aggregate query
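
The shape of such a check, sketched in Java rather than with Scala parallel collections. This is not the test added in this PR: `ConcurrentSortKeyCheck` and `keysStableUnderConcurrency` are hypothetical names, and the JDK's `java.text.Collator` is used as a stand-in so the sketch runs without ICU. The idea is to compute a reference key once, then regenerate it from many threads against the same collator instance and compare the bytes.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.IntStream;

// Hypothetical helper; the JDK Collator stands in for the shared ICU collator.
final class ConcurrentSortKeyCheck {
    // Returns true when every concurrently generated key matches the reference key.
    static boolean keysStableUnderConcurrency(Collator collator, String input, int iterations) {
        byte[] expected = collator.getCollationKey(input).toByteArray();
        return IntStream.range(0, iterations)
            .parallel()   // run the key generation on multiple worker threads
            .allMatch(i -> Arrays.equals(expected, collator.getCollationKey(input).toByteArray()));
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        System.out.println(keysStableUnderConcurrency(collator, "aaa", 10_000));
    }
}
```

A check like this exercises the collator directly from several threads, so it does not depend on how a query engine happens to schedule an aggregate.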

HyukjinKwon

PySpark test failures look unrelated. I think #45436 (comment) has a point - might need to double check to make sure.

Otherwise, LGTM

dbatomic · 2024-03-11T08:59:09Z

sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala

@@ -438,6 +439,19 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
    }
  }

+  test("test concurrently generating collation keys") {


Can this test go to CollationFactorySuite?

it could but I decided to put it here because it would require adding a new dependency for parallel collections which I'd like to avoid

HyukjinKwon · 2024-03-11T23:46:51Z

Merged to master.

### What changes were proposed in this pull request?
Freezing the ICU collator upon creation.

### Why are the changes needed?
To prevent multiple threads from writing to the collation buffer while collation sort keys are being generated, which results in data corruption and an internal error.
You can read more about collator thread safety [here](https://unicode-org.github.io/icu/userguide/icu/design.html#icu-threading-model-and-open-and-close-model)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New unit test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#45436 from stefankandic/icuConcurrencyIssue.

Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

fix concurrency issue by freezing the collator

d234423

github-actions bot added the SQL label Mar 8, 2024

uros-db reviewed Mar 8, 2024

stefankandic mentioned this pull request Mar 8, 2024

[SPARK-46834][SQL][Collations] Support for aggregates #45290

Closed

uros-db reviewed Mar 8, 2024

MaxGekk reviewed Mar 8, 2024

improve test for collator concurrency

8ec9efd

HyukjinKwon approved these changes Mar 10, 2024

dbatomic reviewed Mar 11, 2024

dbatomic approved these changes Mar 11, 2024

HyukjinKwon closed this in c1b9f28 Mar 11, 2024

stefankandic mentioned this pull request Mar 15, 2024

[SPARK-47327][SQL] Move sort keys concurrency test to CollationFactorySuite #45501

Closed
