
[SPARK-48698][SQL] Support analyze column stats for tables with collated columns#47072

Closed

nikolamand-db wants to merge 7 commits into apache:master from nikolamand-db:SPARK-48698-analyze-collated


Conversation

@nikolamand-db
Contributor

What changes were proposed in this pull request?

The following sequence fails:

```sql
> create table t(s string collate utf8_lcase) using parquet;
> insert into t values ('A');
> analyze table t compute statistics for all columns;
[UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE] The feature is not supported: The ANALYZE TABLE FOR COLUMNS command does not support the type "STRING COLLATE UTF8_LCASE" of the column `s` in the table `spark_catalog`.`default`.`t`. SQLSTATE: 0A000
```

Users should be able to run ANALYZE (column stats computation) commands on tables which have columns with collated type.

Add support for column stats computation by:

  • Updating pattern matching to include all StringType subtypes in the stats computation execution code
  • Updating HyperLogLogPlusPlus to support calculating approximate distinct counts for collated strings as well; this is one of the statistics computed by the ANALYZE command
  • Adding tests to check the new collated HyperLogLogPlusPlus behavior
  • Adding tests to check statistics computation over collated data
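The HyperLogLogPlusPlus change above hinges on hashing values consistently with the collation's notion of equality: strings that compare equal under the collation must land in the same sketch bucket, or the approximate distinct count will be inflated. A minimal Python sketch of the idea (illustrative only, not Spark's implementation; `utf8_lcase_key` is a hypothetical helper standing in for a real collation-key function):

```python
# Illustrative sketch: for a case-insensitive collation such as UTF8_LCASE,
# distinct-count estimation must hash a collation key rather than the raw
# bytes, so that 'A' and 'a' are counted as one value.

def utf8_lcase_key(s: str) -> bytes:
    # Hypothetical collation-key function: UTF8_LCASE compares strings
    # case-insensitively, so lowercasing yields a usable key here.
    return s.lower().encode("utf-8")

values = ["A", "a", "Spark", "SPARK", "sql"]

# Hashing raw bytes treats 'A' and 'a' as distinct.
raw_distinct = len({v.encode("utf-8") for v in values})       # 5

# Hashing the collation key collapses case variants.
collated_distinct = len({utf8_lcase_key(v) for v in values})  # 3

print(raw_distinct, collated_distinct)
```

The same principle applies whether the distinct count is exact (as here) or approximate (as in HyperLogLog++): only the input to the hash changes.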

Why are the changes needed?

To properly support statistics computation for collated columns.

Does this PR introduce any user-facing change?

Yes, it changes how statistics computation behaves when performed on collated columns.

How was this patch tested?

Added checks to CollationSuite and CollationSQLExpressionsSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 24, 2024
```scala
// Find the origin column from dataSchema by column name.
val originColumn = findColumnByName(table.dataSchema, columnName, resolver)
val validType = canEvolveType(originColumn, newColumn)
val collationChanged = validType && originColumn.dataType != newColumn.dataType
```
Contributor

How do you know that this is about a collation change? From the code, this only means that dataType is different.

Contributor Author

The only possible data type change is a collation change, or recursively changing subtype collations. Please check the canEvolveType function.

Contributor

It just feels wrong to write code that is specific to collations when the wrapping function is called AlterTableChangeColumnCommand, which is a pretty generic name. Once we try to extend support for changing column types to anything else (e.g. setting different timezone information, changing decimal precision...), this collationChanged will no longer make sense.

Contributor Author

@nikolamand-db nikolamand-db Jun 26, 2024

Changed collationChanged to columnTypeChanged to make it more generic.

Contributor

@uros-db uros-db left a comment

collationChanged thing may be a bit dodgy, but otherwise lgtm

```scala
catalog.alterTableDataSchema(tableName, StructType(newDataSchema))
// Update table stats after collation change.
if (columnTypeChanged) {
  CommandUtils.updateTableStats(sparkSession, table)
```
Contributor

does it require running a query?

@cloud-fan
Contributor

This PR seems to be all about calculating the min/max stats for string collation columns; what about the query optimization code that leverages the min/max stats? Does it need an update?
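For context on why this question matters: a min/max computed under binary (UTF8_BINARY) ordering can differ from the min/max under a case-insensitive collation, so optimizer rules that prune partitions or filters using these stats would need collation-aware ordering to stay correct. A small illustrative Python sketch (assumption: lowercasing approximates UTF8_LCASE ordering; this is not Spark code):

```python
# Illustrative sketch: the minimum of the same values differs between
# binary byte ordering and a case-insensitive ordering, so stats-based
# pruning must use the ordering that matches the column's collation.
values = ["B", "a"]

binary_min = min(values)                    # "B": uppercase bytes sort first in UTF-8
collated_min = min(values, key=str.lower)   # "a": under a lowercase-style ordering

print(binary_min, collated_min)
```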

@nikolamand-db
Contributor Author

Outdated, closing.

@nikolamand-db nikolamand-db deleted the SPARK-48698-analyze-collated branch September 3, 2024 10:57