dejankrak-db
Contributor

@dejankrak-db dejankrak-db commented Feb 3, 2025

What changes were proposed in this pull request?

This PR partially reverts PR #48962, which introduced resolution of the default session-level collation for DDL and DML queries.
The default collation resolution for DML queries is reverted. The resolution for DDL queries is kept, because it is required to apply the object-level collation introduced in PR #49084. Object-level collation is now applied to DDL queries accordingly, with the main logic implemented in the ResolveDefaultStringTypes.stringTypeForDDLCommand() method.
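
As a rough illustration of what applying an object-level collation to a DDL command means, here is a minimal sketch in plain Python. It is not Spark's implementation: `ColumnSpec`, `resolve_ddl_collations`, and the `UTF8_BINARY` fallback are names and assumptions made up for this example. The idea shown is only that string columns without an explicit collation pick up the default declared on the table or view, while explicitly collated columns are left alone.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed fallback when neither the column nor the object names a collation.
FALLBACK_COLLATION = "UTF8_BINARY"


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical column definition as it might appear in a CREATE TABLE."""
    name: str
    data_type: str                   # e.g. "STRING", "INT"
    collation: Optional[str] = None  # None means "not specified in the DDL"


def resolve_ddl_collations(
    columns: List[ColumnSpec], object_default: Optional[str] = None
) -> List[ColumnSpec]:
    """Give every STRING column without an explicit collation the object default."""
    default = object_default or FALLBACK_COLLATION
    resolved = []
    for col in columns:
        if col.data_type == "STRING" and col.collation is None:
            resolved.append(ColumnSpec(col.name, col.data_type, default))
        else:
            # Non-string columns and explicitly collated columns are unchanged.
            resolved.append(col)
    return resolved


columns = [
    ColumnSpec("id", "INT"),
    ColumnSpec("name", "STRING"),
    ColumnSpec("code", "STRING", "UTF8_LCASE"),
]
resolved = resolve_ddl_collations(columns, object_default="UNICODE")
for col in resolved:
    print(col.name, col.data_type, col.collation)
```

In this sketch only `name` changes: `id` is not a string column, and `code` keeps the collation it was given explicitly.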

Why are the changes needed?

Attempts to merge the functionality from PR #48962 on the Delta side ran into unresolved technical issues caused by its effect on DML queries, so it was decided to pause this functionality for now and revert the unused parts to keep the code cleaner going forward.
This is also in line with customer feedback, where object-level collation is requested much more often. The focus is therefore on resolving object-level collation for DDL queries instead, which allows a collation to be specified per table or view when it is created or modified, and propagates that default collation to subsequent queries on top of those entities.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests that cover the collation functionality, as well as new dedicated tests for applying object-level collation to the underlying columns.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Feb 3, 2025
@dejankrak-db
Contributor Author

@cloud-fan, @stefankandic, please take a look. This is just a revert of PR #48962, as we decided not to proceed with session-level collations for now; we will do a follow-up to apply object-level collations for queries.

Member

@dongjoon-hyun dongjoon-hyun left a comment

For the other audience, could you provide a link for this decision, @dejankrak-db ?

> The decision has since been made not to ship this functionality for now,

@dejankrak-db dejankrak-db changed the title from "[SPARK-51067][SQL] Revert session level collation changes" to "[SPARK-51067][SQL] Partially revert session level collation as object level collation will be used instead" Feb 4, 2025
@dejankrak-db
Contributor Author

dejankrak-db commented Feb 4, 2025

> For the other audience, could you provide a link for this decision, @dejankrak-db ?
>
> The decision has since been made not to ship this functionality for now,

@dongjoon-hyun, there are two main reasons for this decision:

  1. Attempts to merge the original PR's functionality on the Delta side ran into unresolved technical issues, caused by its effect on DML queries when the underlying collation is changed in this way.
  2. According to customer feedback gathered so far, object-level collation is requested much more often, and there have been no explicit requests for a default session-level collation. The focus has therefore shifted to resolving object-level collation for DDL queries instead, which allows a collation to be specified per table or view when it is created or modified, and propagates that default collation to subsequent queries on top of those entities.

Therefore, it was decided to pause the session-level collation functionality for now and partially revert the unused parts of the original PR to keep the code cleaner going forward, while keeping the parts required to support object-level collation resolution. I hope this clarifies the reasoning. I have also updated the PR description with this info, thanks!

@cloud-fan
Contributor

I'm good with removing this hacky feature. It's too fragile to use object StringType as undetermined string collation, and hard for third party Spark extensions to follow.
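
One way to read the fragility concern, sketched in plain Python rather than Spark's Scala: if a single shared object doubles as the marker for "collation not decided yet", that marker can only be recognized by identity, and any code that rebuilds the type from its fields silently turns it into an ordinary, explicit type. The `StringType` class, the `UNDETERMINED` sentinel, and `rebuild` below are hypothetical stand-ins for this illustration, not Spark's classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringType:
    """Toy string type; equality is by value (the collation name)."""
    collation: str = "UTF8_BINARY"


# Hypothetical sentinel: this one instance means "collation not decided yet".
UNDETERMINED = StringType()


def is_undetermined(t: StringType) -> bool:
    # Identity check: only the sentinel object itself counts.
    return t is UNDETERMINED


def rebuild(t: StringType) -> StringType:
    # Stand-in for any code path (copying, serialization, an extension's own
    # rewrite rule) that reconstructs a type from its fields.
    return StringType(t.collation)


explicit = StringType("UTF8_BINARY")
round_tripped = rebuild(UNDETERMINED)

print(is_undetermined(UNDETERMINED))   # the sentinel is recognized
print(explicit == UNDETERMINED)        # equal by value to an explicit type
print(is_undetermined(round_tripped))  # the marker is lost after rebuilding
```

Because the sentinel and an explicitly declared type compare equal, nothing flags the loss; code outside the core project would have to know to preserve object identity everywhere.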

@dejankrak-db dejankrak-db changed the title from "[SPARK-51067][SQL] Revert session level collation as object level collation will be used instead" to "[SPARK-51067][SQL] Revert session level collation for DML queries and apply object level collation for DDL queries" Feb 9, 2025
@dejankrak-db
Contributor Author

@stefankandic, when you find some time, please take a look at the latest logic for DDL collation resolution and at the complete removal of DML collation resolution, as discussed.

Contributor

@stefankandic stefankandic left a comment

Looks good!

dejankrak-db and others added 2 commits February 10, 2025 19:26
…ysis/ResolveDDLCommandStringTypes.scala

Co-authored-by: Stefan Kandic <154237371+stefankandic@users.noreply.github.com>
@dejankrak-db
Contributor Author

dejankrak-db commented Feb 11, 2025

> I'm good with removing this hacky feature. It's too fragile to use object StringType as undetermined string collation, and hard for third party Spark extensions to follow.

@cloud-fan, I have removed the entire session-level collation feature and all the associated workarounds and code. Please take a look and check whether the implementation looks good now; we would like to support collation resolution for DDL commands with these changes.

@cloud-fan
Contributor

thanks, merging to master/4.0!

@cloud-fan cloud-fan closed this in e92e12a Feb 13, 2025
cloud-fan pushed a commit that referenced this pull request Feb 13, 2025
… apply object level collation for DDL queries

Closes #49772 from dejankrak-db/revert-session-collations.

Lead-authored-by: Dejan Krakovic <dejan.krakovic@databricks.com>
Co-authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e92e12a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
MaxGekk pushed a commit that referenced this pull request Mar 11, 2025
… in the given schema"

### What changes were proposed in this pull request?
After removing session-level collation (#49772), we can also revert the PR that changed the `from_json` and `from_xml` expressions to use the json rather than the sql type representation under the hood (#48750).

### Why are the changes needed?
Now that session-level collation no longer causes correctness problems, using `sql` instead of `json` gives a smaller and more efficient type representation.
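
To give a feel for the size difference, here is a small plain-Python sketch that renders the same two-field schema in a SQL-like form and in a JSON form. The two helper functions and the exact output shapes are approximations written for this example (they only mimic the general look of the two representations), so treat the numbers as illustrative rather than as Spark's actual output.

```python
import json

# Hypothetical schema: (field name, type name) pairs.
fields = [("name", "string"), ("score", "double")]


def to_sql_like(schema):
    """Compact, DDL-style rendering of a struct type."""
    inner = ", ".join(f"{name}: {type_name.upper()}" for name, type_name in schema)
    return f"STRUCT<{inner}>"


def to_json_like(schema):
    """Verbose rendering where every field carries its own attribute map."""
    return json.dumps(
        {
            "type": "struct",
            "fields": [
                {"name": name, "type": type_name, "nullable": True, "metadata": {}}
                for name, type_name in schema
            ],
        },
        separators=(",", ":"),
    )


sql_repr = to_sql_like(fields)
json_repr = to_json_like(fields)
print(len(sql_repr), sql_repr)
print(len(json_repr), json_repr)
```

The JSON form repeats keys such as `nullable` and `metadata` for every field, which is why it grows faster than the SQL-like string as the schema gets wider.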

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50234 from stefankandic/revertFromJsonChange.

Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Mar 11, 2025
… in the given schema"

(cherry picked from commit 0094f44)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
anoopj pushed a commit to anoopj/spark that referenced this pull request Mar 15, 2025
… in the given schema"
