
[SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show #40922

Closed
cloud-fan wants to merge 2 commits into apache:master from cloud-fan:pretty

Conversation

@cloud-fan (Contributor) commented Apr 24, 2023

What changes were proposed in this pull request?

This is a followup of #40699 to avoid changing the Cast behavior. It pulls the cast-to-string code out into a base trait and adds a new expression, `ToPrettyString`, that extends this trait with a little customization.

It also handles binary values inside array/struct/map so that they are printed in hex format as well, for `df.show` only, not `Cast`.

Why are the changes needed?

To avoid a behavior change.

Does this PR introduce any user-facing change?

It changes back the behavior of casting array/map/struct to string with regard to null elements. It was `null`, then changed to `NULL` in #40699, and is `null` again after this PR.

How was this patch tested?

Existing tests.

@github-actions bot added the SQL label Apr 24, 2023
@cloud-fan (Contributor, Author)

cc @yikf @sadikovi @gengliangwang

@sadikovi (Contributor)

I think the PR still introduces user-facing changes.
Also, would it be possible to not make any changes in Cast and do everything in df.show method?

@cloud-fan (Contributor, Author)

> would it be possible to not make any changes in Cast and do everything in df.show method?

We could, by duplicating the code of Cast, but I don't think that's a good idea.

@sadikovi (Contributor)

I suppose it is fine to have changes in Cast. Would it be possible to check the example queries in my comment on #40699 and see what results they return? Or let me know when this is ready for review, and I can check out the PR and try those examples on my machine.

@sadikovi (Contributor)

Does this PR need #40699? I was under the assumption that we had to revert the original patch and have another solution instead.

@cloud-fan (Contributor, Author) commented Apr 26, 2023

Most of the changes in #40699 update tests, and we still need them since we don't revert the behavior change of df.show. The behavior change of Cast is reverted, as the tests show: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala

It's ready for review now.

```
@@ -67,6 +67,8 @@ object StringUtils extends Logging {
    "(?s)" + out.result() // (?s) enables dotall mode, causing "." to match new lines
  }

  def getHexString(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString("[", " ", "]")
```
Contributor:

It seems all functions in this file have javadoc. Shall we also add one for this method?

```
        SchemaUtils.escapeMetaCharacters(cell.toString)
      }
      // Escapes meta-characters not to break the `showString` format
      val str = SchemaUtils.escapeMetaCharacters(cell.toString)
```
Contributor:

This could potentially throw a NullPointerException if cell is null. This was handled in the original code with a match-case statement.

We could replace it with s"$cell", or handle null separately.

Contributor (Author):

This should never be null as ToPrettyString.nullable is false. I can add an assert to enforce it.


```
override def eval(input: InternalRow): Any = {
  val v = child.eval(input)
  if (v == null) UTF8String.fromString("NULL") else castFunc(v)
```
Contributor:

I think this should be `UTF8String.fromString(nullString)`.

```
|${childCode.code}
|UTF8String ${ev.value};
|if (${childCode.isNull}) {
|  ${ev.value} = UTF8String.fromString("NULL");
```
Contributor:

Same here: `nullString` instead of `"NULL"`.



```
override protected def useDecimalPlainString: Boolean = true

override protected def useHexFormatForBinary: Boolean = true
```
@yaooqinn (Member) commented Apr 27, 2023:

Will this be applied to the Thrift server, or will it still keep the current string representation?

Contributor (Author):

I don't think so. Thriftserver still calls HiveResult to generate strings. @yikf can you confirm?

Contributor:

Sorry for the late reply, I was busy last week.

Yes, the spark-sql CLI and Thriftserver still call hiveResultString to generate the strings.

```
/**
 * Returns a pretty string of the byte array, which prints each byte as two hex digits and adds
 * spaces between them. For example, [1A C0].
 */
def getHexString(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString("[", " ", "]")
```
Member:

Any reason to use uppercase here?

FYI, JDK 17 has HexFormat, and its default is to use the lowercase characters "0-9", "a-f".

Member:

Oh, I missed that this is an existing behavior.

@cloud-fan (Contributor, Author)

Thanks for the review, merging to master!

@LuciferYang (Contributor)

+1, late LGTM

LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
[SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show


Closes apache#40922 from cloud-fan/pretty.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
```
builder.append(keyToUTF8String(keyArray.get(0, kt)).asInstanceOf[UTF8String])
builder.append(" ->")
if (valueArray.isNullAt(0)) {
  if (nullString.nonEmpty) builder.append(nullString)
```
Member:

Sorry for the late review.

This seems to cause a bug, because previously we had the following:

```
if (!legacyCastToStr) builder.append(" NULL")
```

In other words, we had a space.

```
scala> spark.version
res9: String = 3.4.0

scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
| {k -> null}|
+------------+
```

Now, the master branch shows the following; there is no space between `->` and `NULL`.

```
scala> spark.version
res9: String = 3.5.0-SNAPSHOT

scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
|  {k ->NULL}|
+------------+
```

HyukjinKwon pushed a commit that referenced this pull request Jun 2, 2023
### What changes were proposed in this pull request?

This is a follow-up of #40922. This PR aims to add a space between `->` and value.

It seems to have been missed here because the original PR already had the same code pattern in another place.

https://github.com/apache/spark/blob/74b04eeffdc4765f56fe3a9e97165b15ed4e2c73/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala#L114

### Why are the changes needed?

**BEFORE**
```
scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
|  {k ->NULL}|
+------------+
```

**AFTER**
```
scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
| {k -> NULL}|
+------------+
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Manual review.

Closes #41432 from dongjoon-hyun/SPARK-43063.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
vkorukanti added a commit to vkorukanti/delta that referenced this pull request Oct 6, 2023
The following are the changes needed:
* PySpark 3.5 has deprecated support for Python 3.7. This required changes to the Delta test infra to install the appropriate Python version and other packages. The `Dockerfile` used for running tests is also updated to have the required Python version and packages, and uses the same base image as the PySpark test infra in Apache Spark.
* `StructType.toAttributes` and `StructType.fromAttributes` methods are moved into a utility class `DataTypeUtils`.
* The `iceberg` module is disabled, as there is no released version of `iceberg` that works with Spark 3.5 yet.
* Remove the URI path hack used in `DMLWithDeletionVectorsHelper` to get around a bug in Spark 3.4.
* Remove unrelated tutorial in `delta/examples/tutorials/saiseu19`
* Test failure fixes
   * `org.apache.spark.sql.delta.DeltaHistoryManagerSuite` - Error message has changed
   * `org.apache.spark.sql.delta.DeltaOptionSuite` - The Parquet file name using the LZ4 codec has changed due to apache/parquet-java#1000 in the `parquet-mr` dependency.
   * `org.apache.spark.sql.delta.deletionvectors.DeletionVectorsSuite` - Parquet now generates a `row-index` whenever the `_metadata` column is selected; however, Spark 3.5 has a bug where a row group containing more than 2bn rows fails. For now, don't return any `row-index` column in `_metadata`, by overriding `metadataSchemaFields: Seq[StructField]` in `DeltaParquetFileFormat`.
   * `org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQuerySuite` - A behavior change by apache/spark#40922: in Spark plans, a new function called `ToPrettyString` is used instead of `cast(aggExpr To STRING)` when `Dataset.show()` is used.
   * `org.apache.spark.sql.delta.DeltaCDCStreamDeletionVectorSuite` and `org.apache.spark.sql.delta.DeltaCDCStreamSuite`: Regression in Spark 3.5 RC fixed by apache/spark#42774 before the Spark 3.5 release

Closes delta-io#1986

GitOrigin-RevId: b0e4a81b608a857e45ecba71b070309347616a30
scottsand-db pushed a commit to delta-io/delta that referenced this pull request Oct 6, 2023
xupefei pushed a commit to xupefei/delta that referenced this pull request Oct 31, 2023