[SPARK-46831][SQL] Collations - Extending StringType and PhysicalStringType with collationId field #44901

dbatomic · 2024-01-26T11:21:29Z

What changes were proposed in this pull request?

This PR is the initial change for introducing the concept of collations into the Spark engine. For a higher-level overview, please take a look at the umbrella JIRA.

This PR extends both StringType and PhysicalStringType with a collationId field. At this point it is just a no-op field. In follow-up PRs, this field will be used to fetch the right UTF8String comparison rules from a global collation table.
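As a rough illustration of what fetching comparison rules by id could look like, here is a minimal sketch. It is not code from this PR: `Collation`, `CollationTable`, `fetch`, and the second table entry are hypothetical names made up for the example, and plain `String` stands in for UTF8String.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.{Arrays, Locale}

// Hypothetical holder for the comparison rules of one collation.
final case class Collation(name: String, comparator: Ordering[String])

// Hypothetical global table: the collation id is simply an index into it.
object CollationTable {
  // Byte-for-byte comparison of the UTF-8 encoding (the existing behaviour).
  private val binary: Ordering[String] = new Ordering[String] {
    def compare(a: String, b: String): Int =
      Arrays.compareUnsigned(a.getBytes(UTF_8), b.getBytes(UTF_8))
  }

  // Illustrative second entry: compare after lowercasing both sides.
  private val lowercase: Ordering[String] =
    Ordering.by[String, String](_.toLowerCase(Locale.ROOT))(binary)

  private val table: Vector[Collation] = Vector(
    Collation("UTF8_BINARY", binary),      // id 0, the default
    Collation("LOWERCASE_DEMO", lowercase) // id 1, made up for this sketch
  )

  // Look up the rules for a collation id; unknown ids are rejected.
  def fetch(id: Int): Collation =
    table.lift(id).getOrElse(
      throw new IllegalArgumentException(s"Unknown collation id: $id"))
}
```

With a table like this, a type only has to carry the integer id, and the comparison code resolves the rules at the point of use.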

The goal is to keep backwards compatibility. This is ensured by keeping the singleton object StringType, which inherits from StringType(DEFAULT_COLLATION_ID). DEFAULT_COLLATION_ID represents the UTF8 binary collation rules (i.e. the byte-for-byte comparison that Spark already uses). Hence, any code that relies on StringType will stay binary compatible with this version.
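The compatibility pattern described above can be sketched in isolation as follows. This is a simplified stand-in, not the actual Spark source: `AtomicType` here is an empty placeholder, and the `apply`, `equals`, `hashCode`, and `isDefaultCollation` members are assumptions added so the example is self-contained.

```scala
// Empty placeholder for Spark's AtomicType (assumption for this sketch).
abstract class AtomicType

// A string type that carries a collation id. The constructor is private,
// so instances can only be created through the companion object.
class StringType private (val collationId: Int) extends AtomicType {
  // Hypothetical helper, not part of the PR.
  def isDefaultCollation: Boolean =
    collationId == StringType.DEFAULT_COLLATION_ID

  // Two string types are equal when their collation ids match.
  override def equals(other: Any): Boolean = other match {
    case s: StringType => s.collationId == collationId
    case _             => false
  }
  override def hashCode(): Int = collationId.hashCode
}

// The singleton is itself the default-collation instance, so existing code
// that uses `StringType` as a value keeps compiling and behaving as before.
// The literal 0 is used because an object cannot pass its own member to
// its super constructor.
object StringType extends StringType(0) {
  val DEFAULT_COLLATION_ID = 0
  def apply(collationId: Int): StringType = new StringType(collationId)
}
```

Under these assumptions, `StringType` on its own still denotes the binary-collation string type, while `StringType(1)` would denote a string type with a different collation.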

It may be hard to see the end state from just this initial PR. Reviewers who want to see how this fits into the end state can take a look at this draft PR.

Why are the changes needed?

Please refer to the umbrella JIRA ticket for the collation effort.

Does this PR introduce any user-facing change?

At this point, no.

How was this patch tested?

This initial PR doesn't introduce any surface-level changes.

Was this patch authored or co-authored using generative AI tooling?

No

sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala

MaxGekk · 2024-01-27T12:59:38Z

sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala

 */
 @Stable
-class StringType private() extends AtomicType {
+class StringType private(val collationId: Int) extends AtomicType {


@dbatomic Could you clarify a little bit more why the type of collationId is Int rather than a trait/class Collation or an enum?

Sure. The overall design is captured in the design doc that comes with the JIRA ticket, but let me write the reasoning here as well.

The reasons are the following:

The collationId needs to be serializable. Once we get to the point of marking a column with a collation, that information will need to be persisted.

In the future there will be thousands of possible collation combinations (all locales (800+) × case sensitivity × accent sensitivity × trimming). We could go with an enum, but I think enums are not well suited to such large collections.

This will have to work with Photon as well, or any other engine. A simple integer that points to the collation rules is a simple implementation that can easily be mimicked in other engines.

Of course, this is just my reasoning. I would appreciate your thoughts on this.
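To make the "thousands of combinations in one integer" argument concrete, here is a minimal sketch of packing such a combination into an Int. The bit layout, `CollationSpec`, `encode`, and `decode` are all hypothetical and chosen only for illustration; this is not the encoding proposed in the design doc or used by Spark.

```scala
// Hypothetical bit layout, for illustration only:
//   bits 0-11 : locale index (room for 4096 locales)
//   bit 12    : case-insensitive flag
//   bit 13    : accent-insensitive flag
//   bit 14    : trim flag
final case class CollationSpec(
    localeIndex: Int,
    caseInsensitive: Boolean,
    accentInsensitive: Boolean,
    trim: Boolean)

object CollationSpec {
  private val LocaleBits = 12
  private val LocaleMask = (1 << LocaleBits) - 1

  // Pack a spec into a single Int that is easy to persist or hand to another engine.
  def encode(spec: CollationSpec): Int = {
    require(spec.localeIndex >= 0 && spec.localeIndex <= LocaleMask,
      "locale index out of range")
    val flags =
      (if (spec.caseInsensitive) 1 else 0) |
      (if (spec.accentInsensitive) 2 else 0) |
      (if (spec.trim) 4 else 0)
    (flags << LocaleBits) | spec.localeIndex
  }

  // Recover the spec from its integer id.
  def decode(id: Int): CollationSpec = {
    val flags = id >>> LocaleBits
    CollationSpec(
      localeIndex = id & LocaleMask,
      caseInsensitive = (flags & 1) == 1,
      accentInsensitive = (flags & 2) == 2,
      trim = (flags & 4) == 4)
  }
}
```

In this sketch the all-default combination (locale index 0, no flags) encodes to 0, so a default id of 0 would fall out naturally. An enum covering the same space would need one case per combination.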

MaxGekk · 2024-01-29T14:15:02Z

+1, LGTM. Merging to master.
Thank you, @dbatomic and @cloud-fan for review.

yaooqinn · 2024-01-30T07:01:11Z

The design document for the umbrella that this pull request belongs to is private. Can we make it open to make it compliant with ASF policies?

dbatomic · 2024-01-30T20:16:30Z

> The design document for the umbrella that this pull request belongs to is private. Can we make it open to make it compliant with ASF policies?

I updated the permissions against the doc. Please try again. Sorry for the inconvenience.

yaooqinn · 2024-01-31T01:39:32Z

Thank you @dbatomic for the update

dongjoon-hyun · 2024-02-08T17:57:00Z

project/MimaExcludes.scala

@@ -107,6 +107,8 @@ object MimaExcludes {

    // SPARK-46410: Assign error classes/subclasses to JdbcUtils.classifyException
    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.JdbcDialect.classifyException"),
+    // [SPARK-464878][CORE][SQL] (false alert). Invalid rule for StringType extension.


Hi, @dbatomic . Is this wrong JIRA ID intentional because this was a false alert?

You are right, it is supposed to be 46878. Will fix it. Thanks.

Thank you for the confirmation. Let me update my PR already, @dbatomic

dongjoon-hyun · 2024-02-08T18:00:26Z

I made a follow-up, @dbatomic , @MaxGekk , @cloud-fan .

[SPARK-46831][INFRA][FOLLOWUP] Fix a wrong JIRA ID in MimaExcludes #45071

### What changes were proposed in this pull request?

This is a follow-up of #44901.

### Why are the changes needed?

To fix the wrong JIRA ID information.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45071 from dongjoon-hyun/SPARK-46831.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

dbatomic added 2 commits January 25, 2024 17:39

Initial commit

351adb2

MimaExcludes fix - mima incorrectly raises an error.

d93d3ca

github-actions bot added SQL BUILD labels Jan 26, 2024

MaxGekk reviewed Jan 26, 2024

sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala (resolved)

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala (outdated, resolved)

PR comments fixes.

38e0631

MaxGekk reviewed Jan 27, 2024

MaxGekk approved these changes Jan 29, 2024

cloud-fan approved these changes Jan 29, 2024

MaxGekk closed this in e211dbd Jan 29, 2024

dongjoon-hyun reviewed Feb 8, 2024

dongjoon-hyun mentioned this pull request Feb 8, 2024

[SPARK-46831][INFRA][FOLLOWUP] Fix a wrong JIRA ID in MimaExcludes #45071

Closed
