Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46831][SQL] Collations - Extending StringType and PhysicalStringType with collationId field #44901

Closed
wants to merge 3 commits into from

Conversation

dbatomic
Copy link
Contributor

What changes were proposed in this pull request?

This PR represents initial change for introducing collation concept into Spark engine. For higher level overview please take a look at the umbrella JIRA.

This PR extends both StringType and PhysicalStringType with collationId field. At this point this is just a noop field. In the following PRs this field will be used for fetching right UTF8String comparison rules from global collation table.

Goal is to make sure that we keep backwards compatibility - this is ensured by keeping singleton object StringType that inherits StringType(DEFAULT_COLLATION_ID). DEFAULT_COLLATION_ID represents UTF8 Binary collation rules (i.e. byte for byte comparison, that is already used in Spark). Hence, any code that relies on StringType will stay binary compatible with this version.

It may be hard to see end state from just this initial PR. For reviewers who want to see how this will fit in the end state, please take a look at this draft PR.

Why are the changes needed?

Please refer to umbrella JIRA ticket for collation effort.

Does this PR introduce any user-facing change?

At this point No.

How was this patch tested?

This initial PR doesn't introduce any surface level changes.

Was this patch authored or co-authored using generative AI tooling?

No

*/
@Stable
class StringType private() extends AtomicType {
class StringType private(val collationId: Int) extends AtomicType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbatomic Could you clarify a little bit more why the type of collationId is Int but not a trait/class Collation or an enum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, overall design is captured in the design doc that comes with JIRA ticket, but let me write reasoning here as well.

Reasons are following:

  1. CollationId will be serializable. When we get to the point of marking column with collation, information will need to be persisted.
  2. In future there will be thousands of possible collation combinations (all locales (800+) X case sensitivity X accent sensitivity X trimming).
  3. We could go with an enum, but I think that enums are not well suited for such large collections.
  4. This will have to work with Photon as well, or any other engine - having simple integer that points to the collation rules looks like simple implementation that can be easily mimicked in other engines.

Of course, this is just my reasoning. I would appreciate your thoughts on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@MaxGekk
Copy link
Member

MaxGekk commented Jan 29, 2024

+1, LGTM. Merging to master.
Thank you, @dbatomic and @cloud-fan for review.

@MaxGekk MaxGekk closed this in e211dbd Jan 29, 2024
@yaooqinn
Copy link
Member

The design document for the umbrella that this pull request belongs to is private. Can we make it open to make it compliant with ASF policies?

@dbatomic
Copy link
Contributor Author

The design document for the umbrella that this pull request belongs to is private. Can we make it open to make it compliant with ASF policies?

I updated the permissions against the doc. Please try again. Sorry for the inconvenience.

@yaooqinn
Copy link
Member

Thank you @dbatomic for the update

@@ -107,6 +107,8 @@ object MimaExcludes {

// SPARK-46410: Assign error classes/subclasses to JdbcUtils.classifyException
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.JdbcDialect.classifyException"),
// [SPARK-464878][CORE][SQL] (false alert). Invalid rule for StringType extension.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @dbatomic . Is this wrong JIRA ID intentional because this was a false alert?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right - it is supposed to be 46878. Will fix it. THanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the confirmation. Let me update my PR already, @dbatomic

@dongjoon-hyun
Copy link
Member

dongjoon-hyun added a commit that referenced this pull request Feb 8, 2024
### What changes were proposed in this pull request?

This is a follow-up of
- #44901

### Why are the changes needed?

To fix the wrong JIRA ID information.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45071 from dongjoon-hyun/SPARK-46831.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants