
[SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show #40922

Closed
cloud-fan wants to merge 2 commits into apache:master from cloud-fan:pretty

Conversation

@cloud-fan (Contributor) commented Apr 24, 2023

What changes were proposed in this pull request?

This is a followup of #40699 to avoid changing the Cast behavior. It pulls the cast-to-string code out into a base trait and adds a new expression, `ToPrettyString`, that extends this trait with a little customization.

It also handles binary values inside array/struct/map so that they are printed in hex format as well, for `df.show` only, not `Cast`.

Why are the changes needed?

To avoid a behavior change.

Does this PR introduce any user-facing change?

It changes back the behavior of casting array/map/struct to string with regard to null elements. It was `null`, then changed to `NULL` in #40699, and is `null` again after this PR.

How was this patch tested?

Existing tests.

@github-actions bot added the SQL label Apr 24, 2023
@cloud-fan (Contributor, Author)

cc @yikf @sadikovi @gengliangwang

@sadikovi (Contributor)

I think the PR still introduces user-facing changes.
Also, would it be possible to not make any changes in Cast and do everything in df.show method?

@cloud-fan (Contributor, Author)

> would it be possible to not make any changes in Cast and do everything in df.show method?

We could, by duplicating the code of Cast, but I don't think that's a good idea.

@sadikovi (Contributor)

I suppose it is fine to have changes in Cast. Would it be possible to check the example queries in my comment on #40699 and see what results they return? Or let me know when this is ready for review, and I can check out the PR and try those examples on my machine.

@sadikovi (Contributor)

Does this PR need #40699? I was under the assumption that we had to revert the original patch and have another solution instead.

@cloud-fan (Contributor, Author) commented Apr 26, 2023

Most of the changes in #40699 update tests, and we still need them since we don't revert the behavior change of df.show. The behavior change of Cast is reverted, as the tests show: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala

It's ready for review now.

```
@@ -67,6 +67,8 @@ object StringUtils extends Logging {
    "(?s)" + out.result() // (?s) enables dotall mode, causing "." to match new lines
  }

  def getHexString(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString("[", " ", "]")
```
Contributor:

It seems all functions in this file have javadoc. Shall we also add one for this method?

```
        SchemaUtils.escapeMetaCharacters(cell.toString)
      }
      // Escapes meta-characters not to break the `showString` format
      val str = SchemaUtils.escapeMetaCharacters(cell.toString)
```
Contributor:

This could potentially throw a NullPointerException if cell is null. This was handled in the original code with a match-case statement.

We could replace it with s"$cell", or handle null separately.

Contributor (Author):

This should never be null as ToPrettyString.nullable is false. I can add an assert to enforce it.


```
override def eval(input: InternalRow): Any = {
  val v = child.eval(input)
  if (v == null) UTF8String.fromString("NULL") else castFunc(v)
```
Contributor:

I think this should be `UTF8String.fromString(nullString)`.

```
|${childCode.code}
|UTF8String ${ev.value};
|if (${childCode.isNull}) {
|  ${ev.value} = UTF8String.fromString("NULL");
```
Contributor:

Same here: `nullString` instead of `"NULL"`.



```
override protected def useDecimalPlainString: Boolean = true

override protected def useHexFormatForBinary: Boolean = true
```
@yaooqinn (Member) commented Apr 27, 2023:

Will this be applied to the Thrift server, or will it still keep the current string representation?

Contributor (Author):

I don't think so. Thriftserver still calls HiveResult to generate strings. @yikf can you confirm?

Contributor:

Sorry for the late reply, I was busy last week.

Yes, the spark-sql CLI and Thriftserver still call hiveResultString to generate the strings.

```
/**
 * Returns a pretty string of the byte array, which prints each byte as two hex digits and adds
 * spaces between them. For example, [1A C0].
 */
def getHexString(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString("[", " ", "]")
```
Member:

Any reason to use uppercase here?

FYI, JDK 17 has HexFormat, and its default is to use the lowercase characters "0-9", "a-f".

Member:

Oh, I missed that this is an existing behavior.

@cloud-fan (Contributor, Author)

Thanks for the review, merging to master!

@LuciferYang (Contributor)

+1, late LGTM

LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
[SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show


Closes apache#40922 from cloud-fan/pretty.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
```
builder.append(keyToUTF8String(keyArray.get(0, kt)).asInstanceOf[UTF8String])
builder.append(" ->")
if (valueArray.isNullAt(0)) {
  if (nullString.nonEmpty) builder.append(nullString)
```
Member:

Sorry for the late review.

This seems to cause a bug, because previously we had the following:

```
if (!legacyCastToStr) builder.append(" NULL")
```

In other words, we had a space.

```
scala> spark.version
res9: String = 3.4.0

scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
| {k -> null}|
+------------+
```

Now, the master branch shows the following; there is no space between `->` and `NULL`.

```
scala> spark.version
res9: String = 3.5.0-SNAPSHOT

scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
|  {k ->NULL}|
+------------+
```

HyukjinKwon pushed a commit that referenced this pull request Jun 2, 2023
### What changes were proposed in this pull request?

This is a follow-up of #40922. This PR aims to add a space between `->` and value.

It seems to have been missed here because the original PR already had the same code pattern in another place.

https://github.com/apache/spark/blob/74b04eeffdc4765f56fe3a9e97165b15ed4e2c73/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala#L114

### Why are the changes needed?

**BEFORE**
```
scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
|  {k ->NULL}|
+------------+
```

**AFTER**
```
scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
| {k -> NULL}|
+------------+
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Manual review.

Closes #41432 from dongjoon-hyun/SPARK-43063.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
vkorukanti added a commit to vkorukanti/delta that referenced this pull request Oct 6, 2023
The following are the changes needed:
* PySpark 3.5 has deprecated support for Python 3.7. This required changes to the Delta test infra to install the appropriate Python version and other packages. The `Dockerfile` used for running tests is also updated to have the required Python version and packages, and uses the same base image as the PySpark test infra in Apache Spark.
* `StructType.toAttributes` and `StructType.fromAttributes` methods are moved into a utility class `DataTypeUtils`.
* The `iceberg` module is disabled, as there is no released version of `iceberg` that works with Spark 3.5 yet.
* Remove the URI path hack used in `DMLWithDeletionVectorsHelper` to get around a bug in Spark 3.4.
* Remove unrelated tutorial in `delta/examples/tutorials/saiseu19`
* Test failure fixes
   * `org.apache.spark.sql.delta.DeltaHistoryManagerSuite` - Error message has changed
   * `org.apache.spark.sql.delta.DeltaOptionSuite` - The Parquet file name using the LZ4 codec has changed due to apache/parquet-java#1000 in the `parquet-mr` dependency.
   * `org.apache.spark.sql.delta.deletionvectors.DeletionVectorsSuite` - Parquet now generates a `row-index` whenever the `_metadata` column is selected; however, Spark 3.5 has a bug where a row group containing more than 2bn rows fails. For now, don't return any `row-index` column in `_metadata`, by overriding `metadataSchemaFields: Seq[StructField]` in `DeltaParquetFileFormat`.
   * `org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQuerySuite` - A behavior change by apache/spark#40922: in Spark plans, a new function called `ToPrettyString` is used instead of `cast(aggExpr To STRING)` when `Dataset.show()` is used.
   * `org.apache.spark.sql.delta.DeltaCDCStreamDeletionVectorSuite` and `org.apache.spark.sql.delta.DeltaCDCStreamSuite`: Regression in Spark 3.5 RC fixed by apache/spark#42774 before the Spark 3.5 release

Closes delta-io#1986

GitOrigin-RevId: b0e4a81b608a857e45ecba71b070309347616a30
scottsand-db pushed a commit to delta-io/delta that referenced this pull request Oct 6, 2023
xupefei pushed a commit to xupefei/delta that referenced this pull request Oct 31, 2023