Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47327][SQL] Fix thread safety issue in ICU Collator #45436

Closed
wants to merge 2 commits into from

Conversation

stefankandic
Copy link
Contributor

@stefankandic stefankandic commented Mar 8, 2024

What changes were proposed in this pull request?

Freezing the ICU collator upon creation.

Why are the changes needed?

In order to avoid multiple threads writing to the collation buffer during the generation of collation sort keys which then results in data corruption and an internal error.
You can read more about collator thread safety here

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unti test

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Mar 8, 2024
@@ -138,11 +138,13 @@ public Collation(
collationTable[2] = new Collation(
"UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
collationTable[2].collator.setStrength(Collator.TERTIARY);
collationTable[2].collator.freeze();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never seen "freeze" before, so I'm wondering how this affects us.

As I understood from the ICU docs: Once frozen, an object can never be unfrozen, so it is thread-safe from that point onward.

So what are the drawbacks of this apprach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting, I thought Collator is just holding a bunch of functions, but it has mutable states?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan collator uses a buffer while writing collation keys, freezing it makes this operation safe by using a reentrant lock around it (source)

this of course raises performance issues which we should probably discuss, because now we can't generate sort keys in parallel on a single collator

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they should use one buffer per thread... Anyway this is out of our control and calling freeze LGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as soon as we get benchmarks working we should revisit this decision.
One option that we also prototyped is to keep Collator in ThreadLocal fields, which also solved the problem. But freeze is a bit cleaner and we don't have microbenchmarks yet so we can't make data driven decision at this point.
LGTM.

// generating ICU sort keys is not thread-safe by default so this should fail
// if we don't handle the concurrency properly on Collator level

for (_ <- 1 to 100) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering: have you tried 1000, 10000, etc. or is 100 proven to be sufficient?

@@ -438,6 +438,39 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
}
}

test("test concurrently running aggregates") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on another note, do we have some kind of truly multithread tests?

("unicode_CI", Seq("aaa", "aaa"), Seq(Row(2, "aaa"))),
("unicode_CI", Seq("AAA", "aaa"), Seq(Row(2, "AAA"))),
("unicode_CI", Seq("aaa", "bbb"), Seq(Row(1, "aaa"), Row(1, "bbb")))
).foreach {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How this sequential execution proofs the fix? Maybe use a parallel collection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a very good point, the test would fail simply because it would be ran 100 times and at least one of those execution would have a data race - i improved it now to just call getCollationKey in a parallel for each and not really on spark's execution of the aggregate query

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PySpark test failures look unrelated. I think #45436 (comment) has a point - might need to double check to make sure.

Otherwise, LGTM

@@ -438,6 +439,19 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
}
}

test("test concurrently generating collation keys") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this test go to CollationFactorySuite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could but I decided to put it here because it would require adding a new dependency for parallel collections which I'd like to avoid

@HyukjinKwon
Copy link
Member

Merged to master.

jpcorreia99 pushed a commit to jpcorreia99/spark that referenced this pull request Mar 12, 2024
### What changes were proposed in this pull request?

Freezing the ICU collator upon creation.

### Why are the changes needed?

In order to avoid multiple threads writing to the collation buffer during the generation of collation sort keys which then results in data corruption and an internal error.
You can read more about collator thread safety [here](https://unicode-org.github.io/icu/userguide/icu/design.html#icu-threading-model-and-open-and-close-model)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unti test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#45436 from stefankandic/icuConcurrencyIssue.

Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
### What changes were proposed in this pull request?

Freezing the ICU collator upon creation.

### Why are the changes needed?

In order to avoid multiple threads writing to the collation buffer during the generation of collation sort keys which then results in data corruption and an internal error.
You can read more about collator thread safety [here](https://unicode-org.github.io/icu/userguide/icu/design.html#icu-threading-model-and-open-and-close-model)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unti test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#45436 from stefankandic/icuConcurrencyIssue.

Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants