
[SPARK-48831][CONNECT] Make default column name of cast compatible with Spark Classic #47249

Closed
wants to merge 4 commits into apache:master from zhengruifeng:py_fix_cast

Conversation

@zhengruifeng (Contributor) commented on Jul 8, 2024

What changes were proposed in this pull request?

I think there are two issues regarding the default column name of `cast`:
1. It is unclear when the default name is the input column name and when it is `CAST(...)`; e.g., in Spark Classic:

```
scala> spark.range(1).select(col("id").cast("string"), lit(1).cast("string"), col("id").cast("long"), lit(1).cast("long")).printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- id: string (nullable = false)
 |-- CAST(1 AS STRING): string (nullable = false)
 |-- id: long (nullable = false)
 |-- CAST(1 AS BIGINT): long (nullable = false)
```

2. The column name is not consistent between Spark Connect and Spark Classic.

This PR resolves the second issue: it makes the default column name of `cast` compatible with Spark Classic, following the classic implementation:

```
def cast(to: DataType): Column = withExpr {
  val cast = Cast(expr, CharVarcharUtils.replaceCharVarcharWithStringForCast(to))
  cast.setTagValue(Cast.USER_SPECIFIED_CAST, ())
  cast
}
```
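For context, a minimal sketch of what the corresponding change on the Spark Connect planner side could look like. The surrounding `match` shape is taken from the review excerpt quoted further down; treat this as an illustration of the idea (tagging the cast the same way Classic does), not the exact patch:

```
// Hypothetical sketch: build the Cast from the proto message as before, then mark it
// as user-specified so default column-name generation matches Spark Classic.
val dataType = CharVarcharUtils.replaceCharVarcharWithStringForCast(rawDataType)
val castExpr = cast.getEvalMode match {
  case proto.Expression.Cast.EvalMode.EVAL_MODE_LEGACY =>
    Cast(transformExpression(cast.getExpr), dataType, None, EvalMode.LEGACY)
  case _ =>
    // other eval modes would be handled analogously
    Cast(transformExpression(cast.getExpr), dataType)
}
// The key addition, mirroring Column.cast in Spark Classic:
castExpr.setTagValue(Cast.USER_SPECIFIED_CAST, ())
castExpr
```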

Why are the changes needed?

The default column name is not consistent with Spark Classic.

Does this PR introduce any user-facing change?

Yes.

spark classic:

```
In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+-------------------------+-------------------+-------------------+-------------------+
|CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
+-------------------------+-------------------+-------------------+-------------------+
|                      123|                123|                123|              123.0|
+-------------------------+-------------------+-------------------+-------------------+
```

spark connect (before):

```
In [3]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+---------+---+---+-----+
|X'313233'|123|123|  123|
+---------+---+---+-----+
|      123|123|123|123.0|
+---------+---+---+-----+
```

spark connect (after):

```
In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+-------------------------+-------------------+-------------------+-------------------+
|CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
+-------------------------+-------------------+-------------------+-------------------+
|                      123|                123|                123|              123.0|
+-------------------------+-------------------+-------------------+-------------------+
```

How was this patch tested?

Added a test.

Was this patch authored or co-authored using generative AI tooling?

No.

@zhengruifeng changed the title from "[SPARK-48831][PYTHON][CONNECT] Fix the default column name of cast" to "[SPARK-48831][CONNECT] Fix the default column name of cast" on Jul 8, 2024
@zhengruifeng changed the title from "[SPARK-48831][CONNECT] Fix the default column name of cast" to "[SPARK-48831][CONNECT] Make default column name of cast compatible with Spark Classic" on Jul 8, 2024
@allisonwang-db (Contributor) left a comment:


Good catch!

```
val dataType = CharVarcharUtils.replaceCharVarcharWithStringForCast(rawDataType)
val castExpr = cast.getEvalMode match {
  case proto.Expression.Cast.EvalMode.EVAL_MODE_LEGACY =>
    Cast(transformExpression(cast.getExpr), dataType, None, EvalMode.LEGACY)
```
Contributor:

I am wondering why changing it this way fixes the schema difference. It looks like the cast expression is the same as before?

@zhengruifeng (Contributor, Author) replied:

I guess it is affected by `castExpr.setTagValue(Cast.USER_SPECIFIED_CAST, ())`:

```
// Replaces attributes, string literals, complex type extractors with their pretty form so that
// generated column names don't contain back-ticks or double-quotes.
def usePrettyExpression(e: Expression): Expression = e transform {
  case a: Attribute => new PrettyAttribute(a)
  case Literal(s: UTF8String, StringType) => PrettyAttribute(s.toString, StringType)
  case Literal(v, t: NumericType) if v != null => PrettyAttribute(v.toString, t)
  case Literal(null, dataType) => PrettyAttribute("NULL", dataType)
  case e: GetStructField =>
    val name = e.name.getOrElse(e.childSchema(e.ordinal).name)
    PrettyAttribute(usePrettyExpression(e.child).sql + "." + name, e.dataType)
  case e: GetArrayStructFields =>
    PrettyAttribute(s"${usePrettyExpression(e.child)}.${e.field.name}", e.dataType)
  case r: InheritAnalysisRules =>
    PrettyAttribute(r.makeSQLString(r.parameters.map(toPrettySQL)), r.dataType)
  case c: Cast if c.getTagValue(Cast.USER_SPECIFIED_CAST).isEmpty =>
    PrettyAttribute(usePrettyExpression(c.child).sql, c.dataType)
  case p: PythonFuncExpression => PrettyPythonUDF(p.name, p.dataType, p.children)
}
```
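For illustration, the observable effect of that rule in a spark-shell (matching the `printSchema` output quoted in the description): with the tag set, casting a plain column keeps the input column name, while casting a literal keeps the `CAST(...)` form:

```
// Spark Classic behaviour, and Spark Connect behaviour after this change:
import org.apache.spark.sql.functions.{col, lit}

spark.range(1)
  .select(col("id").cast("string"), lit(1).cast("long"))
  .schema
  .fieldNames
  .foreach(println)
// id
// CAST(1 AS BIGINT)
```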

@HyukjinKwon (Member): Merged to master.

@zhengruifeng deleted the py_fix_cast branch on July 9, 2024 at 12:27.
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
dongjoon-hyun pushed a commit that referenced this pull request Jul 11, 2024
### What changes were proposed in this pull request?
The `cast` issue has been resolved in #47249, so we can re-enable a group of doctests.

### Why are the changes needed?
Test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47302 from zhengruifeng/enable_more_doctest.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024