
[SPARK-48831][CONNECT] Make default column name of cast compatible with Spark Classic #47249

Closed
wants to merge 4 commits into apache:master from zhengruifeng:py_fix_cast

Conversation

@zhengruifeng (Contributor) commented on Jul 8, 2024

What changes were proposed in this pull request?

I think there are two issues regarding the default column name of `cast`:
1. It is unclear when the default name is the input column name and when it is `CAST(...)`; e.g., in Spark Classic:

```
scala> spark.range(1).select(col("id").cast("string"), lit(1).cast("string"), col("id").cast("long"), lit(1).cast("long")).printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- id: string (nullable = false)
 |-- CAST(1 AS STRING): string (nullable = false)
 |-- id: long (nullable = false)
 |-- CAST(1 AS BIGINT): long (nullable = false)
```

2. The column name is not consistent between Spark Connect and Spark Classic.

This PR resolves the second issue: it makes the default column name of `cast` compatible with Spark Classic, following the classic implementation:

```
def cast(to: DataType): Column = withExpr {
  val cast = Cast(expr, CharVarcharUtils.replaceCharVarcharWithStringForCast(to))
  cast.setTagValue(Cast.USER_SPECIFIED_CAST, ())
  cast
}
```
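For context, a minimal sketch of what the corresponding change on the Spark Connect planner side could look like. The surrounding `match` shape is taken from the review excerpt quoted further down; treat this as an illustration of the idea (tagging the cast the same way Classic does), not the exact patch:

```
// Hypothetical sketch: build the Cast from the proto message as before, then mark it
// as user-specified so default column-name generation matches Spark Classic.
val dataType = CharVarcharUtils.replaceCharVarcharWithStringForCast(rawDataType)
val castExpr = cast.getEvalMode match {
  case proto.Expression.Cast.EvalMode.EVAL_MODE_LEGACY =>
    Cast(transformExpression(cast.getExpr), dataType, None, EvalMode.LEGACY)
  case _ =>
    // other eval modes would be handled analogously
    Cast(transformExpression(cast.getExpr), dataType)
}
// The key addition, mirroring Column.cast in Spark Classic:
castExpr.setTagValue(Cast.USER_SPECIFIED_CAST, ())
castExpr
```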

Why are the changes needed?

The default column name is not consistent with Spark Classic.

Does this PR introduce any user-facing change?

Yes.

spark classic:

```
In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+-------------------------+-------------------+-------------------+-------------------+
|CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
+-------------------------+-------------------+-------------------+-------------------+
|                      123|                123|                123|              123.0|
+-------------------------+-------------------+-------------------+-------------------+
```

spark connect (before):

```
In [3]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+---------+---+---+-----+
|X'313233'|123|123|  123|
+---------+---+---+-----+
|      123|123|123|123.0|
+---------+---+---+-----+
```

spark connect (after):

```
In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+-------------------------+-------------------+-------------------+-------------------+
|CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
+-------------------------+-------------------+-------------------+-------------------+
|                      123|                123|                123|              123.0|
+-------------------------+-------------------+-------------------+-------------------+
```

How was this patch tested?

Added a test.

Was this patch authored or co-authored using generative AI tooling?

No.

@zhengruifeng changed the title from "[SPARK-48831][PYTHON][CONNECT] Fix the default column name of cast" to "[SPARK-48831][CONNECT] Fix the default column name of cast" on Jul 8, 2024
@zhengruifeng changed the title from "[SPARK-48831][CONNECT] Fix the default column name of cast" to "[SPARK-48831][CONNECT] Make default column name of cast compatible with Spark Classic" on Jul 8, 2024
@allisonwang-db (Contributor) left a comment:


Good catch!

```
val dataType = CharVarcharUtils.replaceCharVarcharWithStringForCast(rawDataType)
val castExpr = cast.getEvalMode match {
  case proto.Expression.Cast.EvalMode.EVAL_MODE_LEGACY =>
    Cast(transformExpression(cast.getExpr), dataType, None, EvalMode.LEGACY)
```
Contributor:

I am wondering why changing it this way fixes the schema difference. It looks like the cast expression is the same as before?

@zhengruifeng (Contributor, Author) replied:

I guess it is affected by `castExpr.setTagValue(Cast.USER_SPECIFIED_CAST, ())`:

```
// Replaces attributes, string literals, complex type extractors with their pretty form so that
// generated column names don't contain back-ticks or double-quotes.
def usePrettyExpression(e: Expression): Expression = e transform {
  case a: Attribute => new PrettyAttribute(a)
  case Literal(s: UTF8String, StringType) => PrettyAttribute(s.toString, StringType)
  case Literal(v, t: NumericType) if v != null => PrettyAttribute(v.toString, t)
  case Literal(null, dataType) => PrettyAttribute("NULL", dataType)
  case e: GetStructField =>
    val name = e.name.getOrElse(e.childSchema(e.ordinal).name)
    PrettyAttribute(usePrettyExpression(e.child).sql + "." + name, e.dataType)
  case e: GetArrayStructFields =>
    PrettyAttribute(s"${usePrettyExpression(e.child)}.${e.field.name}", e.dataType)
  case r: InheritAnalysisRules =>
    PrettyAttribute(r.makeSQLString(r.parameters.map(toPrettySQL)), r.dataType)
  case c: Cast if c.getTagValue(Cast.USER_SPECIFIED_CAST).isEmpty =>
    PrettyAttribute(usePrettyExpression(c.child).sql, c.dataType)
  case p: PythonFuncExpression => PrettyPythonUDF(p.name, p.dataType, p.children)
}
```
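For illustration, the observable effect of that rule in a spark-shell (matching the `printSchema` output quoted in the description): with the tag set, casting a plain column keeps the input column name, while casting a literal keeps the `CAST(...)` form:

```
// Spark Classic behaviour, and Spark Connect behaviour after this change:
import org.apache.spark.sql.functions.{col, lit}

spark.range(1)
  .select(col("id").cast("string"), lit(1).cast("long"))
  .schema
  .fieldNames
  .foreach(println)
// id
// CAST(1 AS BIGINT)
```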

@HyukjinKwon (Member): Merged to master.

@zhengruifeng deleted the py_fix_cast branch on July 9, 2024 at 12:27.
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
dongjoon-hyun pushed a commit that referenced this pull request Jul 11, 2024
### What changes were proposed in this pull request?
The `cast` issue has been resolved in #47249, so we can re-enable a group of doctests.

### Why are the changes needed?
Test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47302 from zhengruifeng/enable_more_doctest.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024