[SPARK-48831][CONNECT] Make default column name of `cast` compatible with Spark Classic #47249
Conversation
Good catch!
```scala
val dataType = CharVarcharUtils.replaceCharVarcharWithStringForCast(rawDataType)
val castExpr = cast.getEvalMode match {
  case proto.Expression.Cast.EvalMode.EVAL_MODE_LEGACY =>
    Cast(transformExpression(cast.getExpr), dataType, None, EvalMode.LEGACY)
```
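The dispatch on `cast.getEvalMode` can be sketched with a small, self-contained toy model. This is a hedged illustration only (the enum and function names below mirror the excerpt but are hypothetical Python stand-ins, not the actual Spark Connect planner code), showing a protobuf-style eval-mode enum being mapped onto an internal eval mode, with unspecified falling back to a session default (an assumption here):

```python
# Toy sketch of mapping a proto cast eval mode to an internal EvalMode.
# Hypothetical names; not the actual Spark Connect implementation.
from enum import Enum

class ProtoEvalMode(Enum):
    EVAL_MODE_UNSPECIFIED = 0
    EVAL_MODE_LEGACY = 1
    EVAL_MODE_ANSI = 2
    EVAL_MODE_TRY = 3

class EvalMode(Enum):
    LEGACY = 1
    ANSI = 2
    TRY = 3

def to_eval_mode(proto_mode: ProtoEvalMode,
                 session_default: EvalMode = EvalMode.ANSI) -> EvalMode:
    # Explicit modes map one-to-one; anything else (e.g. UNSPECIFIED)
    # falls back to the session default -- an assumption in this sketch.
    mapping = {
        ProtoEvalMode.EVAL_MODE_LEGACY: EvalMode.LEGACY,
        ProtoEvalMode.EVAL_MODE_ANSI: EvalMode.ANSI,
        ProtoEvalMode.EVAL_MODE_TRY: EvalMode.TRY,
    }
    return mapping.get(proto_mode, session_default)

print(to_eval_mode(ProtoEvalMode.EVAL_MODE_LEGACY).name)  # LEGACY
```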
I am wondering why changing it this way fixes the schema difference. The cast expression looks the same as before?
I guess it is affected by `castExpr.setTagValue(Cast.USER_SPECIFIED_CAST, ())`:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
Lines 93 to 110 in 0487507

```scala
// Replaces attributes, string literals, complex type extractors with their pretty form so that
// generated column names don't contain back-ticks or double-quotes.
def usePrettyExpression(e: Expression): Expression = e transform {
  case a: Attribute => new PrettyAttribute(a)
  case Literal(s: UTF8String, StringType) => PrettyAttribute(s.toString, StringType)
  case Literal(v, t: NumericType) if v != null => PrettyAttribute(v.toString, t)
  case Literal(null, dataType) => PrettyAttribute("NULL", dataType)
  case e: GetStructField =>
    val name = e.name.getOrElse(e.childSchema(e.ordinal).name)
    PrettyAttribute(usePrettyExpression(e.child).sql + "." + name, e.dataType)
  case e: GetArrayStructFields =>
    PrettyAttribute(s"${usePrettyExpression(e.child)}.${e.field.name}", e.dataType)
  case r: InheritAnalysisRules =>
    PrettyAttribute(r.makeSQLString(r.parameters.map(toPrettySQL)), r.dataType)
  case c: Cast if c.getTagValue(Cast.USER_SPECIFIED_CAST).isEmpty =>
    PrettyAttribute(usePrettyExpression(c.child).sql, c.dataType)
  case p: PythonFuncExpression => PrettyPythonUDF(p.name, p.dataType, p.children)
}
```
Merged to master.
### What changes were proposed in this pull request?
The `cast` issue has been resolved in #47249, so we can re-enable a group of doctests.

### Why are the changes needed?
Test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47302 from zhengruifeng/enable_more_doctest.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
I think there are two issues regarding the default column name of `cast`:

1. It is unclear when the name is the input column versus `CAST(...)`, e.g. in Spark Classic:

```
scala> spark.range(1).select(col("id").cast("string"), lit(1).cast("string"), col("id").cast("long"), lit(1).cast("long")).printSchema
root
 |-- id: string (nullable = false)
 |-- CAST(1 AS STRING): string (nullable = false)
 |-- id: long (nullable = false)
 |-- CAST(1 AS BIGINT): long (nullable = false)
```

2. The column name is not consistent between Spark Connect and Spark Classic.

This PR aims to resolve the second issue, that is, making the default column name of `cast` compatible with Spark Classic, by comparing with the classic implementation:

https://github.com/apache/spark/blob/9cf6dc873ff34412df6256cdc7613eed40716570/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1208-L1212

spark/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
Lines 1208 to 1212 in 9cf6dc8

### Why are the changes needed?
The default column name is not consistent with Spark Classic.

### Does this PR introduce _any_ user-facing change?
Yes.

Spark Classic:
```
In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+-------------------------+-------------------+-------------------+-------------------+
|CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
+-------------------------+-------------------+-------------------+-------------------+
|                      123|                123|                123|              123.0|
+-------------------------+-------------------+-------------------+-------------------+
```

Spark Connect (before):
```
In [3]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+---------+---+---+-----+
|X'313233'|123|123|  123|
+---------+---+---+-----+
|      123|123|123|123.0|
+---------+---+---+-----+
```

Spark Connect (after):
```
In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+-------------------------+-------------------+-------------------+-------------------+
|CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
+-------------------------+-------------------+-------------------+-------------------+
|                      123|                123|                123|              123.0|
+-------------------------+-------------------+-------------------+-------------------+
```

### How was this patch tested?
Added test.

### Was this patch authored or co-authored using generative AI tooling?
No.