
[SPARK-48698][SQL] Support analyze column stats for tables with collated columns#47072

Closed

nikolamand-db wants to merge 7 commits into apache:master from nikolamand-db:SPARK-48698-analyze-collated


Conversation

@nikolamand-db
Contributor

What changes were proposed in this pull request?

The following sequence fails:

```sql
> create table t(s string collate utf8_lcase) using parquet;
> insert into t values ('A');
> analyze table t compute statistics for all columns;
[UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE] The feature is not supported: The ANALYZE TABLE FOR COLUMNS command does not support the type "STRING COLLATE UTF8_LCASE" of the column `s` in the table `spark_catalog`.`default`.`t`. SQLSTATE: 0A000
```

Users should be able to run ANALYZE (column stats computation) commands on tables which have columns with collated type.

Add support for column stats computation by:

  • Updating pattern matching to include all StringType subtypes in the stats computation execution code
  • Updating HyperLogLogPlusPlus to support calculating approximate distinct counts for collated strings as well; this is one of the statistics computed by the ANALYZE command
  • Adding tests to check the new collated HyperLogLogPlusPlus behavior
  • Adding tests to check statistics computation over collated data
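The HyperLogLogPlusPlus change above hinges on hashing values consistently with the collation's notion of equality: strings that compare equal under the collation must land in the same sketch bucket, or the approximate distinct count will be inflated. A minimal Python sketch of the idea (illustrative only, not Spark's implementation; `utf8_lcase_key` is a hypothetical helper standing in for a real collation-key function):

```python
# Illustrative sketch: for a case-insensitive collation such as UTF8_LCASE,
# distinct-count estimation must hash a collation key rather than the raw
# bytes, so that 'A' and 'a' are counted as one value.

def utf8_lcase_key(s: str) -> bytes:
    # Hypothetical collation-key function: UTF8_LCASE compares strings
    # case-insensitively, so lowercasing yields a usable key here.
    return s.lower().encode("utf-8")

values = ["A", "a", "Spark", "SPARK", "sql"]

# Hashing raw bytes treats 'A' and 'a' as distinct.
raw_distinct = len({v.encode("utf-8") for v in values})       # 5

# Hashing the collation key collapses case variants.
collated_distinct = len({utf8_lcase_key(v) for v in values})  # 3

print(raw_distinct, collated_distinct)
```

The same principle applies whether the distinct count is exact (as here) or approximate (as in HyperLogLog++): only the input to the hash changes.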

Why are the changes needed?

To properly support statistics computation for collated columns.

Does this PR introduce any user-facing change?

Yes, it changes how statistics computation behaves when performed on collated columns.

How was this patch tested?

Added checks to CollationSuite and CollationSQLExpressionsSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 24, 2024
```scala
// Find the origin column from dataSchema by column name.
val originColumn = findColumnByName(table.dataSchema, columnName, resolver)
val validType = canEvolveType(originColumn, newColumn)
val collationChanged = validType && originColumn.dataType != newColumn.dataType
```
Contributor

How do you know that this is about a collation change? From the code, this only means that dataType is different.

Contributor Author

The only possible data type change is a collation change, or recursively changing subtype collations. Please check the canEvolveType function.

Contributor

It just feels wrong to write code that is specific to collations when the wrapping function is called AlterTableChangeColumnCommand, which is a pretty generic name. Once we try to extend support for changing column types to anything else (e.g. setting different timezone information, changing decimal precision...), this collationChanged will no longer make sense.

Contributor Author

@nikolamand-db nikolamand-db Jun 26, 2024

Changed collationChanged to columnTypeChanged to make it more generic.

Contributor

@uros-db uros-db left a comment

collationChanged thing may be a bit dodgy, but otherwise lgtm

```scala
catalog.alterTableDataSchema(tableName, StructType(newDataSchema))
// Update table stats after collation change.
if (columnTypeChanged) {
  CommandUtils.updateTableStats(sparkSession, table)
```
Contributor

does it require running a query?

@cloud-fan
Contributor

This PR seems to be all about calculating the min/max stats for string collation columns; what about the query optimization code that leverages the min/max stats? Does it need an update?
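For context on why this question matters: a min/max computed under binary (UTF8_BINARY) ordering can differ from the min/max under a case-insensitive collation, so optimizer rules that prune partitions or filters using these stats would need collation-aware ordering to stay correct. A small illustrative Python sketch (assumption: lowercasing approximates UTF8_LCASE ordering; this is not Spark code):

```python
# Illustrative sketch: the minimum of the same values differs between
# binary byte ordering and a case-insensitive ordering, so stats-based
# pruning must use the ordering that matches the column's collation.
values = ["B", "a"]

binary_min = min(values)                    # "B": uppercase bytes sort first in UTF-8
collated_min = min(values, key=str.lower)   # "a": under a lowercase-style ordering

print(binary_min, collated_min)
```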

@nikolamand-db
Contributor Author

Outdated, closing.

@nikolamand-db nikolamand-db deleted the SPARK-48698-analyze-collated branch September 3, 2024 10:57