New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-46831][SQL] Collations - Extending StringType and PhysicalStringType with collationId field #44901
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala
Outdated
Show resolved
Hide resolved
*/ | ||
@Stable | ||
class StringType private() extends AtomicType { | ||
class StringType private(val collationId: Int) extends AtomicType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbatomic Could you clarify a little bit more why the type of collationId
is Int
but not a trait/class Collation
or an enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, overall design is captured in the design doc that comes with JIRA ticket, but let me write reasoning here as well.
Reasons are following:
- CollationId will be serializable. When we get to the point of marking column with collation, information will need to be persisted.
- In future there will be thousands of possible collation combinations (all locales (800+) X case sensitivity X accent sensitivity X trimming).
- We could go with an enum, but I think that enums are not well suited for such large collections.
- This will have to work with Photon as well, or any other engine - having simple integer that points to the collation rules looks like simple implementation that can be easily mimicked in other engines.
Of course, this is just my reasoning. I would appreciate your thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
+1, LGTM. Merging to master. |
The design document for the umbrella that this pull request belongs to is private. Can we make it open to make it compliant with ASF policies? |
I updated the permissions against the doc. Please try again. Sorry for the inconvenience. |
Thank you @dbatomic for the update |
@@ -107,6 +107,8 @@ object MimaExcludes { | |||
|
|||
// SPARK-46410: Assign error classes/subclasses to JdbcUtils.classifyException | |||
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.JdbcDialect.classifyException"), | |||
// [SPARK-464878][CORE][SQL] (false alert). Invalid rule for StringType extension. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @dbatomic . Is this wrong JIRA ID intentional because this was a false alert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right - it is supposed to be 46878. Will fix it. THanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the confirmation. Let me update my PR already, @dbatomic
I made a follow-up, @dbatomic , @MaxGekk , @cloud-fan . |
### What changes were proposed in this pull request? This is a follow-up of - #44901 ### Why are the changes needed? To fix the wrong JIRA ID information. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45071 from dongjoon-hyun/SPARK-46831. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This PR represents initial change for introducing collation concept into Spark engine. For higher level overview please take a look at the umbrella JIRA.
This PR extends both
StringType
andPhysicalStringType
with collationId field. At this point this is just a noop field. In the following PRs this field will be used for fetching right UTF8String comparison rules from global collation table.Goal is to make sure that we keep backwards compatibility - this is ensured by keeping singleton
object StringType
that inheritsStringType(DEFAULT_COLLATION_ID)
. DEFAULT_COLLATION_ID represents UTF8 Binary collation rules (i.e. byte for byte comparison, that is already used in Spark). Hence, any code that relies onStringType
will stay binary compatible with this version.It may be hard to see end state from just this initial PR. For reviewers who want to see how this will fit in the end state, please take a look at this draft PR.
Why are the changes needed?
Please refer to umbrella JIRA ticket for collation effort.
Does this PR introduce any user-facing change?
At this point No.
How was this patch tested?
This initial PR doesn't introduce any surface level changes.
Was this patch authored or co-authored using generative AI tooling?
No