[SPARK-48698][SQL] Support analyze column stats for tables with collated columns #47072
nikolamand-db wants to merge 7 commits into apache:master
// Find the origin column from dataSchema by column name.
val originColumn = findColumnByName(table.dataSchema, columnName, resolver)
val validType = canEvolveType(originColumn, newColumn)
val collationChanged = validType && originColumn.dataType != newColumn.dataType
How do you know that this is about collation change? From the code this means that only dataType is different.
The only possible data type change is collation change or recursively changing subtype collations. Please check canEvolveType function.
It just feels wrong to write code that is specific to collations when the wrapping function is called AlterTableChangeColumnCommand, which is a pretty generic name. Once we try to extend support for changing column types to anything else (e.g. setting different timezone information, changing decimal precision...), this collationChanged will no longer make sense.
Changed collationChanged to columnTypeChanged to feel more generic.
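To make the point of this thread concrete, here is a minimal sketch of a check that treats two types as compatible only when they differ in string collation, possibly at a nested level. It uses a toy type model: `DataType`, `StringType`, `ArrayType`, `StructType`, `sameUpToCollation` and `TypeEvolution` below are hypothetical stand-ins written for illustration, not Spark's classes and not the actual `canEvolveType` implementation.

```scala
// Toy model of a type tree. These are NOT Spark's classes.
sealed trait DataType
case object IntType extends DataType
final case class StringType(collation: String) extends DataType
final case class ArrayType(element: DataType) extends DataType
final case class StructType(fields: List[(String, DataType)]) extends DataType

object TypeEvolution {
  // True when both types have the same shape and differ, at most,
  // in the collation of string types at any nesting depth.
  def sameUpToCollation(from: DataType, to: DataType): Boolean = (from, to) match {
    case (StringType(_), StringType(_)) => true
    case (ArrayType(a), ArrayType(b))   => sameUpToCollation(a, b)
    case (StructType(fa), StructType(fb)) =>
      fa.length == fb.length && fa.zip(fb).forall {
        case ((nameA, typeA), (nameB, typeB)) =>
          nameA == nameB && sameUpToCollation(typeA, typeB)
      }
    case _ => from == to
  }

  // Same shape as the check in the diff: the change is allowed
  // AND the data type actually differs.
  def columnTypeChanged(from: DataType, to: DataType): Boolean =
    sameUpToCollation(from, to) && from != to
}
```

Under this model the flag is true only for collation changes, which is why the original name was accurate today. The generic name stays correct if the "allowed change" predicate is later widened to other kinds of type evolution.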
uros-db left a comment:
collationChanged thing may be a bit dodgy, but otherwise lgtm
catalog.alterTableDataSchema(tableName, StructType(newDataSchema))
// Update table stats after collation change.
if (columnTypeChanged) {
  CommandUtils.updateTableStats(sparkSession, table)
does it require running a query?
This PR seems all about calculating the min/max stats for string collation columns, how about the query optimization code that leverages the min/max stats? Do they need update?
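As background for this question, the sketch below shows why min/max values depend on the ordering used to compute them. It is plain Scala, not Spark code: `CollatedMinMax` and `caseInsensitive` are hypothetical names, and a case-insensitive comparison is used only as an assumed stand-in for a collation-aware ordering.

```scala
object CollatedMinMax {
  // Assumption: case-insensitive comparison stands in for a collation-aware ordering.
  val caseInsensitive: Ordering[String] =
    Ordering.fromLessThan[String](_.compareToIgnoreCase(_) < 0)

  // Code-unit ordering of java.lang.String, used here as a stand-in for binary ordering.
  val binary: Ordering[String] = Ordering.String

  // Returns (min, max) of a non-empty sequence under the given ordering.
  def minMax(values: Seq[String], ord: Ordering[String]): (String, String) =
    (values.min(ord), values.max(ord))
}
```

For the values "apple", "Banana", "cherry", the binary ordering yields "Banana" as the minimum because uppercase letters sort before lowercase ones, while the case-insensitive ordering yields "apple". Any consumer that compares a predicate against stored min/max would need to use the same ordering that produced them.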
Outdated, closing.
What changes were proposed in this pull request?
Following sequence fails:
Users should be able to run ANALYZE (column stats computation) commands on tables which have columns with collated type.
Add support for column stats computation by:
- handling StringType subtypes in stats computation execution code
- extending HyperLogLogPlusPlus to support calculating approximate count for collated strings as well; this is one of the computed statistics in the ANALYZE command
- HyperLogLogPlusPlus behavior
Why are the changes needed?
To properly support statistics computation for collated columns.
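The distinct-count part of the description can be illustrated with a small sketch. It is not the PR's code and it does not implement HyperLogLog++: it counts exactly, with a set, to show only the idea that values should be mapped to a collation key before counting. `CollatedDistinct` and `lcaseKey` are hypothetical names, and lowercasing is an assumed stand-in for a case-insensitive collation key.

```scala
import java.util.Locale

object CollatedDistinct {
  // Hypothetical stand-in for a case-insensitive collation key.
  def lcaseKey(s: String): String = s.toLowerCase(Locale.ROOT)

  // Exact distinct count over raw values (binary equality).
  def distinctBinary(values: Seq[String]): Int =
    values.distinct.size

  // Exact distinct count over collation keys: values that are equal
  // under the collation collapse to a single key before counting.
  def distinctCollated(values: Seq[String], key: String => String): Int =
    values.map(key).distinct.size
}
```

For the column values "A", "a", "B", "b", "c", binary counting gives 5 distinct values and counting over the lowercase key gives 3. An approximate sketch such as HyperLogLog++ would follow the same principle by hashing the collation key instead of the raw string bytes.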
Does this PR introduce any user-facing change?
Yes, it changes how statistics computation behaves when being performed on collated columns.
How was this patch tested?
Added checks to CollationSuite and CollationSQLExpressionsSuite.
Was this patch authored or co-authored using generative AI tooling?
No.