[SPARK-43384][SQL] Make `df.show` print a nice string for `MapType`. #41065

yikf · 2023-05-05T11:48:50Z

What changes were proposed in this pull request?

This PR aims to make `df.show` print a nice string for `MapType`.

Consider the following example:

spark.sql("SELECT map(1,1.1, 2,2.2) AS col").show(false)

Before this change, it prints:

+--------------------+
|col                 |
+--------------------+
|{1 -> 1.1, 2 -> 2.2}|
+--------------------+

Now it prints the following, which is consistent with the spark-sql CLI:

+-------------+
|col          |
+-------------+
|{1:1.1,2:2.2}|
+-------------+
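The difference between the two renderings comes down to two separators: the one between a key and its value, and the one between entries. A minimal Python sketch of that idea follows. The function `render_map` and its parameter names are illustrative assumptions, not Spark's implementation.

```python
def render_map(entries, kv_sep, elem_sep, left="{", right="}"):
    """Join key/value pairs with the given separators inside brackets."""
    body = elem_sep.join(f"{k}{kv_sep}{v}" for k, v in entries.items())
    return f"{left}{body}{right}"


data = {1: 1.1, 2: 2.2}

# Separators matching the "before" table above.
old_style = render_map(data, kv_sep=" -> ", elem_sep=", ")
# Separators matching the "now" table above (the spark-sql CLI style).
new_style = render_map(data, kv_sep=":", elem_sep=",")

print(old_style)  # {1 -> 1.1, 2 -> 2.2}
print(new_style)  # {1:1.1,2:2.2}
```

The narrower column in the second table follows directly from the shorter separators.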

Why are the changes needed?

To make `df.show` print a more readable string for `MapType`, consistent with the spark-sql CLI.

Does this PR introduce any user-facing change?

Yes. Users will see a more readable string representation of `MapType` values in `df.show` output.

How was this patch tested?

Existing tests.

yikf · 2023-05-05T11:51:39Z

cc @cloud-fan

Hisoka-X · 2023-05-06T01:57:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToPrettyString.scala

@@ -45,6 +45,10 @@ case class ToPrettyString(child: Expression, timeZoneId: Option[String] = None)
  override protected def leftBracket: String = "{"
  override protected def rightBracket: String = "}"

+  override def kvPairSeparator: String = ":"
+
+  override protected def elementSeparator: String = ","


Suggested change

-  override protected def elementSeparator: String = ","
+  override protected def elementSeparator: String = ", "

Adding a space between key-value pairs may be better.

Yeah, I think `, ` is better, but spark-sql uses `,`. Any suggestions? @cloud-fan

cloud-fan · 2023-05-06T03:47:46Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala

@@ -828,7 +828,7 @@ abstract class CastSuiteBase extends SparkFunSuite with ExpressionEvalHelper {
        val ret2 = cast(
          Literal.create(Map("1" -> "a".getBytes, "2" -> null, "3" -> "c".getBytes)),
          StringType)
-        checkEvaluation(ret2, s"${lb}1 -> a, 2 ->${if (legacyCast) "" else " null"}, 3 -> c$rb")
+        checkEvaluation(ret2, s"${lb}1 -> a, 2 -> ${if (legacyCast) "" else "null"}, 3 -> c$rb")


Can we avoid changing the cast behavior?

I think it's a reasonable change. MapToString has three elements to print: the key, the value, and the separator, in the form `k separator v`, so this approach is more unified. For Cast, the separator is " -> ", and for ToPrettyString, it is ":".

This affects only one behavior of Cast: when the value of the first element is an empty string, it was previously printed as `k ->` and is now printed as `k -> ` (with a trailing space). I think `k -> ` is more reasonable, because it is consistent with how the elements other than the first are printed.

For example, for map(k1,"",k2,v2), the output was `k1 ->, k2 -> v2` before and is now `k1 -> , k2 -> v2`.
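To make the trailing-space difference concrete, here is a small Python sketch. It is a simplified model of the behavior as described in this comment, not Spark's actual Cast code, and the function names `old_style` and `new_style` are hypothetical.

```python
def old_style(pairs):
    # Old shape: emit "k ->" first, then " v" only when there is text to print.
    return ", ".join(f"{k} ->" + (f" {v}" if v != "" else "") for k, v in pairs)


def new_style(pairs, sep=" -> "):
    # New shape: always "k<sep>v", even when v renders as empty text.
    return ", ".join(f"{k}{sep}{v}" for k, v in pairs)


pairs = [("k1", ""), ("k2", "v2")]
before = old_style(pairs)
after = new_style(pairs)

print(repr(before))  # 'k1 ->, k2 -> v2'
print(repr(after))   # 'k1 -> , k2 -> v2'
```

In the unified form the separator is a single unit, so the same code path can serve both `" -> "` and `":"`.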

Looks reasonable. @sadikovi what do you think?

HyukjinKwon · 2023-05-08T01:32:32Z

cc @sadikovi FYI

dongjoon-hyun

This sounds like a controversial issue. For example, I'm not sure why this is a nice string for MapType, @yikf. To be honest, I prefer the original style because it resembles Scala style.

Map(1 -> 2, 2 -> 3)

Is this PR aiming to make Spark more friendly to Python users?

dongjoon-hyun

If we need this change, why don't we make it configurable? I believe the default should be the same as the old Spark behavior, while users can opt in to the new output by enabling the configuration manually.

yikf · 2023-05-25T03:10:48Z

Thanks @dongjoon-hyun for your point : )

The intent of this PR is to change how `df.show` represents the map data type, which currently differs from some mainstream databases. We also want `df.show` and the spark-sql CLI to be as consistent as possible, since both are Spark command-line interfaces. So we think the representation after this PR might be a better one.

Adding configuration is also a good idea.

dongjoon-hyun · 2023-05-31T00:45:51Z

Any update on adding the config, @yikf?

yikf · 2023-05-31T05:39:38Z

Added a config to control the style. Should we use the new output by default and restore the old output when the legacy configuration is enabled?

dongjoon-hyun · 2023-05-31T06:02:34Z

Since this is not a bug fix, please use the existing behavior as the default. The new feature can be enabled and tested at least during Spark 3.5.0. Then we can switch the default in Spark 3.6.0.

yikf · 2023-06-01T09:27:34Z

@dongjoon-hyun Thank you, that sounds reasonable. I also restored the default behavior that was changed in #40699. For new representations of null, MapType, or other future data types, we use the existing behavior as the default. The new representation can be enabled by configuration.
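The opt-in arrangement agreed on here can be sketched in Python as below. The configuration key `show.compactMapStyle` and the names `MapStyle`, `pick_style`, and `render` are made up for illustration. They are not the key or classes used in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapStyle:
    kv_sep: str
    elem_sep: str


LEGACY = MapStyle(kv_sep=" -> ", elem_sep=", ")  # existing behavior, kept as default
COMPACT = MapStyle(kv_sep=":", elem_sep=",")     # new behavior, opt-in only


def pick_style(conf):
    # "show.compactMapStyle" is a hypothetical key; an absent key means legacy output.
    enabled = conf.get("show.compactMapStyle", "false").lower() == "true"
    return COMPACT if enabled else LEGACY


def render(entries, conf):
    style = pick_style(conf)
    body = style.elem_sep.join(f"{k}{style.kv_sep}{v}" for k, v in entries.items())
    return "{" + body + "}"


default_out = render({1: 1.1, 2: 2.2}, {})
opt_in_out = render({1: 1.1, 2: 2.2}, {"show.compactMapStyle": "true"})

print(default_out)  # {1 -> 1.1, 2 -> 2.2}
print(opt_in_out)   # {1:1.1,2:2.2}
```

Keeping the legacy style as the fallback means existing output stays unchanged unless a user sets the flag explicitly.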

dongjoon-hyun · 2023-06-01T17:02:15Z

python/pyspark/ml/feature.py

@@ -5313,7 +5313,7 @@ class VectorAssembler(
    +---+---+----+-------------+
    |  a|  b|   c|     features|
    +---+---+----+-------------+
-    |1.0|2.0|NULL|[1.0,2.0,NaN]|
+    |1.0|2.0|null|[1.0,2.0,NaN]|


This is another difference from the existing behavior, @yikf.

The existing behavior should be `null`; `NULL` was introduced in #40699. I think we should do the same as with the map type and default to `null` instead of `NULL`.

Thank you for the pointer. I'm not sure I can agree with #40699 there. I'm tracking the comment on that PR.

github-actions · 2023-09-11T00:17:31Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added CORE PYTHON SQL labels May 5, 2023

yikf force-pushed the map-to-string branch from c70a004 to 7744244 Compare May 5, 2023 11:50

Hisoka-X reviewed May 6, 2023

cloud-fan reviewed May 6, 2023

dongjoon-hyun previously requested changes May 23, 2023

dongjoon-hyun reviewed May 23, 2023

yikf force-pushed the map-to-string branch from 7744244 to 430cd1e Compare May 31, 2023 02:56

yikf force-pushed the map-to-string branch from 430cd1e to d7cd36a Compare June 1, 2023 09:21

github-actions bot added CONNECT ML PANDAS API ON SPARK labels Jun 1, 2023

yikf force-pushed the map-to-string branch from d7cd36a to 7fabd6c Compare June 1, 2023 12:51

dongjoon-hyun reviewed Jun 1, 2023

Make nice string for MapType

7a8e7e3

yikf force-pushed the map-to-string branch from 7fabd6c to 7a8e7e3 Compare June 2, 2023 06:08

yikf mentioned this pull request Jun 2, 2023

[SPARK-43063][SQL][FOLLOWUP] Add a space between -> and value #41432

Closed

github-actions bot added the Stale label Sep 11, 2023

github-actions bot closed this Sep 12, 2023
