
[WIP][SPARK-40822][SQL] Stable derived column aliases #39332

Closed
wants to merge 33 commits

Conversation

@MaxGekk (Member) commented on Jan 1, 2023

What changes were proposed in this pull request?

In this PR, I propose to change the auto-generation of column aliases (the case when a user doesn't assign an alias explicitly). Before the changes, Spark SQL generates such an alias by pretty-printing the Expression; this PR proposes to take the Antlr4 parse tree (the parser output) instead and generate the alias from the terminal tokens in the tree.

A new helper function, ParserUtils.toExprAlias, takes an Antlr4 ParseTree and converts it to a String using the following simple rules:

  1. Add a gap after every terminal node (TerminalNodeImpl), except after (, < and [.
  2. Remove the gap before (, ), <, >, [, ] and ,.

For example, the sequence of tokens "(", "columnA", "+", "1", ")" is converted to the alias "(columnA + 1)".
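
A minimal, self-contained sketch of these two spacing rules, operating on an already-flattened sequence of terminal-token texts (the real ParserUtils.toExprAlias walks the Antlr4 ParseTree itself; the name joinTokens and the exact rule sets below are illustrative assumptions, not the PR's code):

// Hypothetical stand-in for ParserUtils.toExprAlias: joins terminal-token texts
// using the two spacing rules above.
def joinTokens(tokens: Seq[String]): String = {
  val noGapAfter = Set("(", "<", "[")                       // rule 1 exceptions
  val noGapBefore = Set("(", ")", "<", ">", "[", "]", ",")  // rule 2
  tokens.zipWithIndex.map { case (tok, i) =>
    if (i == 0 || noGapAfter.contains(tokens(i - 1)) || noGapBefore.contains(tok)) tok
    else " " + tok
  }.mkString
}

// joinTokens(Seq("(", "columnA", "+", "1", ")"))              == "(columnA + 1)"
// joinTokens(Seq("substring", "(", "'hello'", ",", "5", ")")) == "substring('hello', 5)"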

Why are the changes needed?

To improve the user experience with Spark SQL. It is best practice to name the result of every expression in a query's SELECT list if one plans to reference it later; this yields the most readable and stable results. However, sometimes queries are generated, or we're just lazy and trust the auto-generated names. The problem is that these names are produced by pretty-printing the expression tree, which, while "generally" readable, is not meant to be stable over long periods of time. For example:

spark-sql> DESC SELECT substring('hello', 5);
substring(hello, 5, 2147483647)	string

the auto-generated column alias substring(hello, 5, 2147483647) contains non-obvious elements, such as the implicit default length argument 2147483647 and the unquoted string literal hello.

Does this PR introduce any user-facing change?

Yes, the auto-generated aliases of columns that are not explicitly aliased change.

How was this patch tested?

By running the modified test suites:

$ build/sbt "test:testOnly *DDLParserSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *SparkThriftServerProtocolVersionsSuite"
$ build/sbt "test:testOnly *AvroV2Suite"
$ build/sbt "test:testOnly *SQLQuerySuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveDDLSuite"
$ build/sbt -Phive-2.3 "test:testOnly *HiveOrcSourceSuite"
$ build/sbt "test:testOnly *DatasetSuite"
$ build/sbt "test:testOnly *InsertSuite"

and by running the TPC-DS benchmarks against regenerated golden files. First, build the TPC-DS kit and generate the data:

$ git clone https://github.com/databricks/tpcds-kit.git
$ cd tpcds-kit/tools
$ make OS=MACOS
$ cd $SPARK_HOME
$ build/sbt "sql/Test/runMain org.apache.spark.sql.GenTPCDSData --dsdgenDir `pwd`/../tpcds-kit/tools --location `pwd`/tpcds-sf-1 --scaleFactor 1 --numPartitions 1 --overwrite"

then regenerate the TPC-DS golden files:

$ SPARK_TPCDS_DATA=`pwd`/tpcds-sf-1 SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.TPCDSQueryTestSuite"

and regenerate the SQL golden files:

$ SPARK_GENERATE_GOLDEN_FILES=1 PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStability*Suite"
$ SPARK_GENERATE_GOLDEN_FILES=1 SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly *PlanStability*Suite"

@github-actions github-actions bot added the SQL label Jan 1, 2023
@github-actions github-actions bot added the AVRO label Jan 8, 2023
@MaxGekk (Member, Author) commented on Jan 8, 2023

@cloud-fan @srielau Could you review the generation of column aliases, please?

  }
  case e if optGenAliasFunc.isDefined =>
    Alias(child, optGenAliasFunc.get.apply(e))()
- case l: Literal => Alias(l, toPrettySQL(l))()
+ case l: Literal => Alias(l, getAlias(l, optGenAliasFunc))()
Contributor:

We don't need this change, as the case above (case e if optGenAliasFunc.isDefined) already covers it.

Member Author (@MaxGekk):

@cloud-fan At the moment (since the PR is still a draft), I simply replaced all toPrettySQL calls to exclude any fallback to the old way of alias generation.

I am mostly interested in feedback on how the aliases are generated, because updating all the tests and regenerating the golden files is quite time-consuming. We need to agree on the alias-generation rules first of all; the other things are minor for now, IMHO.
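
To make the ordering concrete, a standalone toy sketch (Expr, Lit and aliasFor are stand-ins I made up, not Spark's Expression/Literal/Alias classes): when optGenAliasFunc is defined, the guarded case matches every expression, literals included, so the Literal case only acts as the old fallback.

// Toy model of the match ordering in the Analyzer snippet above; illustration only.
sealed trait Expr
final case class Lit(value: Any) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

def aliasFor(e: Expr, optGenAliasFunc: Option[Expr => String]): String = e match {
  case e if optGenAliasFunc.isDefined =>
    // When the generator hook is supplied it wins for every expression,
    // including literals, so the Literal case below needs no change.
    optGenAliasFunc.get.apply(e)
  case Lit(v) =>
    // Old fallback, reached only when no hook is supplied.
    v.toString
  case other =>
    other.toString
}

// aliasFor(Lit(1), Some(_ => "custom")) == "custom"
// aliasFor(Lit(1), None)                == "1"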

@@ -494,15 +498,17 @@ class Analyzer(override val catalogManager: CatalogManager)
case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
case e: ExtractValue =>
Contributor:

We can handle ExtractValue better (see the sketch after this list):

  1. For field access like some_expr.c, use the field name as the alias.
  2. For a map lookup with a string-literal key like map_col['key'], use the key string as the alias.
  3. For anything else, invoke optGenAliasFunc.
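
A hedged sketch of these three rules. The Catalyst names used here (ExtractValue, GetStructField.extractFieldName, GetMapValue.key, Literal) are recalled from memory, and the helper's shape is my own assumption rather than code from this PR:

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.StringType

// Sketch only: derive an alias for an ExtractValue expression per the three rules.
def extractValueAlias(
    e: ExtractValue,
    optGenAliasFunc: Option[Expression => String],
    fallback: Expression => String): String = e match {
  // 1. Struct field access like some_expr.c: reuse the field name.
  case g: GetStructField => g.extractFieldName
  // 2. Map lookup with a string-literal key like map_col['key']: reuse the key string.
  case m: GetMapValue if m.key.isInstanceOf[Literal] && m.key.dataType == StringType =>
    m.key.asInstanceOf[Literal].value.toString
  // 3. Everything else: defer to the generator hook, else keep the old behaviour.
  case other => optGenAliasFunc.map(_.apply(other)).getOrElse(fallback(other))
}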

@@ -661,16 +661,16 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit
  }

  override def visitNamedExpressionSeq(
-     ctx: NamedExpressionSeqContext): Seq[Expression] = {
+     ctx: NamedExpressionSeqContext): Seq[(Expression, String)] = {
Contributor:

This is visiting named expressions; do we still need to return an alias?

Member Author (@MaxGekk):

@cloud-fan This function handles not only named expressions; try even this simple example:

select 1
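
To illustrate the intent, a sketch of what the new return shape could look like inside AstBuilder (the body, including the call to the new ParserUtils.toExprAlias helper described above, is my assumption rather than the PR's exact code; typedVisit and the Java/Scala collection converters come from AstBuilder itself):

// Sketch: pair every expression with text derived from its parse tree, so the
// caller can use that text as the alias when the user gave no explicit alias,
// e.g. for "select 1" or "select col + 1".
override def visitNamedExpressionSeq(
    ctx: NamedExpressionSeqContext): Seq[(Expression, String)] = {
  ctx.namedExpression.asScala.toSeq.map { neCtx =>
    (typedVisit[Expression](neCtx), ParserUtils.toExprAlias(neCtx))
  }
}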

@@ -2,61 +2,61 @@
## Schema of Built-in Functions
| Class name | Function name or alias | Query example | Output schema |
Contributor:

this is a lot of change... Shall we upper case everything except for string literals?

Contributor:

Yeah, I agree: a lot of the diffs seem related to this, as well as to the whitespace after , and around -/+, etc.
I think our first goal is stability; our second is reducing the diff against the existing algorithm.
Note that Oracle does not put in ANY whitespace; it simply squishes everything together.

Member Author (@MaxGekk):

this is a lot of change... Shall we upper case everything except for string literals?

If you look at other places (not just the diff with the existing pretty printing), simply outputting the original SQL text looks more natural from my point of view.
