
[WIP][SPARK-40822][SQL] Stable derived column aliases #39332

Closed
wants to merge 33 commits

Conversation

@MaxGekk (Member) commented on Jan 1, 2023

What changes were proposed in this pull request?

In this PR, I propose to change the auto-generation of column aliases (the case when a user doesn't assign an alias explicitly). Before the changes, Spark SQL generates such an alias by pretty-printing the Expression; this PR proposes to take the Antlr4 parse tree (the parser output) instead and generate the alias from the terminal tokens in the tree.

A new helper function, ParserUtils.toExprAlias, takes an Antlr4 ParseTree and converts it to a String using the following simple rules:

  1. Add a gap after every terminal node (TerminalNodeImpl), except after (, < and [.
  2. Remove the gap before (, ), <, >, [, ] and ,.

For example, the sequence of tokens "(", "columnA", "+", "1", ")" is converted to the alias "(columnA + 1)".
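
A minimal, self-contained sketch of these two spacing rules, operating on an already-flattened sequence of terminal-token texts (the real ParserUtils.toExprAlias walks the Antlr4 ParseTree itself; the name joinTokens and the exact rule sets below are illustrative assumptions, not the PR's code):

// Hypothetical stand-in for ParserUtils.toExprAlias: joins terminal-token texts
// using the two spacing rules above.
def joinTokens(tokens: Seq[String]): String = {
  val noGapAfter = Set("(", "<", "[")                       // rule 1 exceptions
  val noGapBefore = Set("(", ")", "<", ">", "[", "]", ",")  // rule 2
  tokens.zipWithIndex.map { case (tok, i) =>
    if (i == 0 || noGapAfter.contains(tokens(i - 1)) || noGapBefore.contains(tok)) tok
    else " " + tok
  }.mkString
}

// joinTokens(Seq("(", "columnA", "+", "1", ")"))              == "(columnA + 1)"
// joinTokens(Seq("substring", "(", "'hello'", ",", "5", ")")) == "substring('hello', 5)"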

Why are the changes needed?

To improve the user experience with Spark SQL. It is best practice to name the result of every expression in a query's SELECT list if one plans to reference it later; this yields the most readable and stable results. However, sometimes queries are generated, or we're just lazy and trust the auto-generated names. The problem is that these names are produced by pretty-printing the expression tree, which, while "generally" readable, is not meant to be stable over long periods of time. For example:

spark-sql> DESC SELECT substring('hello', 5);
substring(hello, 5, 2147483647)	string

the auto-generated column alias substring(hello, 5, 2147483647) contains non-obvious elements, such as the implicit default length argument 2147483647 and the unquoted string literal hello.

Does this PR introduce any user-facing change?

Yes, the auto-generated aliases of columns that are not explicitly aliased change.

How was this patch tested?

By running the modified test suites:

$ build/sbt "test:testOnly *DDLParserSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *SparkThriftServerProtocolVersionsSuite"
$ build/sbt "test:testOnly *AvroV2Suite"
$ build/sbt "test:testOnly *SQLQuerySuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveDDLSuite"
$ build/sbt -Phive-2.3 "test:testOnly *HiveOrcSourceSuite"
$ build/sbt "test:testOnly *DatasetSuite"
$ build/sbt "test:testOnly *InsertSuite"

and by running the TPC-DS benchmarks against regenerated golden files. First, build the TPC-DS kit and generate the data:

$ git clone https://github.com/databricks/tpcds-kit.git
$ cd tpcds-kit/tools
$ make OS=MACOS
$ cd $SPARK_HOME
$ build/sbt "sql/Test/runMain org.apache.spark.sql.GenTPCDSData --dsdgenDir `pwd`/../tpcds-kit/tools --location `pwd`/tpcds-sf-1 --scaleFactor 1 --numPartitions 1 --overwrite"

then regenerate the TPC-DS golden files:

$ SPARK_TPCDS_DATA=`pwd`/tpcds-sf-1 SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.TPCDSQueryTestSuite"

and regenerate the SQL golden files:

$ SPARK_GENERATE_GOLDEN_FILES=1 PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStability*Suite"
$ SPARK_GENERATE_GOLDEN_FILES=1 SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly *PlanStability*Suite"

@github-actions github-actions bot added the SQL label Jan 1, 2023
@github-actions github-actions bot added the AVRO label Jan 8, 2023
@MaxGekk (Member, Author) commented on Jan 8, 2023

@cloud-fan @srielau Could you review the generation of column aliases, please?

  }
  case e if optGenAliasFunc.isDefined =>
    Alias(child, optGenAliasFunc.get.apply(e))()
- case l: Literal => Alias(l, toPrettySQL(l))()
+ case l: Literal => Alias(l, getAlias(l, optGenAliasFunc))()
Contributor:

We don't need this change, as the case above (case e if optGenAliasFunc.isDefined) already covers it.

Member Author (@MaxGekk):

@cloud-fan At the moment (since the PR is still a draft), I simply replaced all toPrettySQL calls to exclude any fallback to the old way of alias generation.

I am mostly interested in feedback on how the aliases are generated, because updating all the tests and regenerating the golden files is quite time-consuming. We need to agree on the alias-generation rules first of all; the other things are minor for now, IMHO.
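
To make the ordering concrete, a standalone toy sketch (Expr, Lit and aliasFor are stand-ins I made up, not Spark's Expression/Literal/Alias classes): when optGenAliasFunc is defined, the guarded case matches every expression, literals included, so the Literal case only acts as the old fallback.

// Toy model of the match ordering in the Analyzer snippet above; illustration only.
sealed trait Expr
final case class Lit(value: Any) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

def aliasFor(e: Expr, optGenAliasFunc: Option[Expr => String]): String = e match {
  case e if optGenAliasFunc.isDefined =>
    // When the generator hook is supplied it wins for every expression,
    // including literals, so the Literal case below needs no change.
    optGenAliasFunc.get.apply(e)
  case Lit(v) =>
    // Old fallback, reached only when no hook is supplied.
    v.toString
  case other =>
    other.toString
}

// aliasFor(Lit(1), Some(_ => "custom")) == "custom"
// aliasFor(Lit(1), None)                == "1"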

@@ -494,15 +498,17 @@ class Analyzer(override val catalogManager: CatalogManager)
case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
case e: ExtractValue =>
Contributor:

We can handle ExtractValue better (see the sketch after this list):

  1. For field access like some_expr.c, use the field name as the alias.
  2. For a map lookup with a string-literal key like map_col['key'], use the key string as the alias.
  3. For anything else, invoke optGenAliasFunc.
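
A hedged sketch of these three rules. The Catalyst names used here (ExtractValue, GetStructField.extractFieldName, GetMapValue.key, Literal) are recalled from memory, and the helper's shape is my own assumption rather than code from this PR:

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.StringType

// Sketch only: derive an alias for an ExtractValue expression per the three rules.
def extractValueAlias(
    e: ExtractValue,
    optGenAliasFunc: Option[Expression => String],
    fallback: Expression => String): String = e match {
  // 1. Struct field access like some_expr.c: reuse the field name.
  case g: GetStructField => g.extractFieldName
  // 2. Map lookup with a string-literal key like map_col['key']: reuse the key string.
  case m: GetMapValue if m.key.isInstanceOf[Literal] && m.key.dataType == StringType =>
    m.key.asInstanceOf[Literal].value.toString
  // 3. Everything else: defer to the generator hook, else keep the old behaviour.
  case other => optGenAliasFunc.map(_.apply(other)).getOrElse(fallback(other))
}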

@@ -661,16 +661,16 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit
  }

  override def visitNamedExpressionSeq(
-     ctx: NamedExpressionSeqContext): Seq[Expression] = {
+     ctx: NamedExpressionSeqContext): Seq[(Expression, String)] = {
Contributor:

This is visiting named expressions; do we still need to return an alias?

Member Author (@MaxGekk):

@cloud-fan This function handles not only named expressions; try even this simple example:

select 1
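
To illustrate the intent, a sketch of what the new return shape could look like inside AstBuilder (the body, including the call to the new ParserUtils.toExprAlias helper described above, is my assumption rather than the PR's exact code; typedVisit and the Java/Scala collection converters come from AstBuilder itself):

// Sketch: pair every expression with text derived from its parse tree, so the
// caller can use that text as the alias when the user gave no explicit alias,
// e.g. for "select 1" or "select col + 1".
override def visitNamedExpressionSeq(
    ctx: NamedExpressionSeqContext): Seq[(Expression, String)] = {
  ctx.namedExpression.asScala.toSeq.map { neCtx =>
    (typedVisit[Expression](neCtx), ParserUtils.toExprAlias(neCtx))
  }
}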

@@ -2,61 +2,61 @@
## Schema of Built-in Functions
| Class name | Function name or alias | Query example | Output schema |
Contributor:

this is a lot of change... Shall we upper case everything except for string literals?

Contributor:

Yeah, I agree: a lot of the diffs seem related to this, as well as to the whitespace after , and around -/+, etc.
I think our first goal is stability; our second is reducing the diff against the existing algorithm.
Note that Oracle does not put in ANY whitespace; it simply squishes everything together.

Member Author (@MaxGekk):

this is a lot of change... Shall we upper case everything except for string literals?

If you look at other places (not just the diff with the existing pretty printing), simply outputting the original SQL text looks more natural from my point of view.
