Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43384][SQL] Make df.show print a nice string for MapType. #41065

Closed
wants to merge 1 commit into from

Conversation

yikf
Copy link
Contributor

@yikf yikf commented May 5, 2023

What changes were proposed in this pull request?

This PR aims to make df.show print a nice string for MapType.

Let's say have an example like this:

spark.sql("SELECT map(1,1.1, 2,2.2) AS col").show(false)

Before, it print

+--------------------+
|col                 |
+--------------------+
|{1 -> 1.1, 2 -> 2.2}|
+--------------------+

Now, it prints as follows, that's consistent with spark-sql CLI.

+-------------+
|col          |
+-------------+
|{1:1.1,2:2.2}|
+-------------+

Why are the changes needed?

Make df.show print a nice string for MapType.

Does this PR introduce any user-facing change?

Yes, They will face better nice strings representation for MapType.

How was this patch tested?

Exist tests.

@yikf
Copy link
Contributor Author

yikf commented May 5, 2023

cc @cloud-fan

@@ -45,6 +45,10 @@ case class ToPrettyString(child: Expression, timeZoneId: Option[String] = None)
override protected def leftBracket: String = "{"
override protected def rightBracket: String = "}"

override def kvPairSeparator: String = ":"

override protected def elementSeparator: String = ","
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
override protected def elementSeparator: String = ","
override protected def elementSeparator: String = ", "

Add space between kv maybe better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think , is better, But spark-sql use ,, Any suggestions? @cloud-fan

@@ -828,7 +828,7 @@ abstract class CastSuiteBase extends SparkFunSuite with ExpressionEvalHelper {
val ret2 = cast(
Literal.create(Map("1" -> "a".getBytes, "2" -> null, "3" -> "c".getBytes)),
StringType)
checkEvaluation(ret2, s"${lb}1 -> a, 2 ->${if (legacyCast) "" else " null"}, 3 -> c$rb")
checkEvaluation(ret2, s"${lb}1 -> a, 2 -> ${if (legacyCast) "" else "null"}, 3 -> c$rb")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid changing the cast behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a reasonable change, MapToString has three elements to print, key, value, and separator. that is k separator v, this approach is more unified. For Cast, the separator is " -> ", and for ToPrettyString, it is ":".

This affects only one behavior of the Cast, that is, when v of the first element is an empty string, it was printed as k -> before and now is k -> . I think k -> is more reasonable, it is consistent with the other elements except the first element.

For example, map(k1,"",k2,v2), before it's k1 ->, k2 -> v2, now it's k1 -> , k2 -> v2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable. @sadikovi what do you think?

@HyukjinKwon
Copy link
Member

cc @sadikovi FYI

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a controversial issue. For example, I'm not sure why this is a nice string for MapType, @yikf . To be honest, I prefer the original style because it resembles a Scala style.

Map(1 -> 2, 2 -> 3)

Is this PR aiming to change Spark for more Python-user-friendly?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need this change, why don't we make this configurable? I believe the default should be the same with old Spark behavior while the user can use new output via enabling the configuration manually.

@yikf
Copy link
Contributor Author

yikf commented May 25, 2023

Thanks @dongjoon-hyun for your point : )

The intent of this PR is to change the way df.show represents the map data type, which is currently different from some mainstream databases. We also want df.show and spark-sql CLI to be as consistent as possible, since they are both spark CLI. So we think the representation after PR might be a better representation.

Adding configuration is also a good idea.

@dongjoon-hyun
Copy link
Member

Any update for adding config, @yikf ?

@yikf
Copy link
Contributor Author

yikf commented May 31, 2023

Add a config to control style. Should we use the new output by default and restore the old output when the legacy configuration is enabled?

@dongjoon-hyun
Copy link
Member

Since this is not a bug fix, please use the existing behavior as the default. The new feature can be enabled and tested at least during Spark 3.5.0. Then, we can switch it at Spark 3.6.0.

@yikf
Copy link
Contributor Author

yikf commented Jun 1, 2023

@dongjoon-hyun Thank you, that sounds reasonable. And i restored the default behavior of #40699. for new representations of null, mapType, or other future data types, we use the existing behavior as the default. the new representation can be enabled by configuration.

@@ -5313,7 +5313,7 @@ class VectorAssembler(
+---+---+----+-------------+
| a| b| c| features|
+---+---+----+-------------+
|1.0|2.0|NULL|[1.0,2.0,NaN]|
|1.0|2.0|null|[1.0,2.0,NaN]|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another difference from the existing behavior, @yikf .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing behavior should be null, NULL is changed in #40699, I think we should do the same with map type, default to null instead of NULL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the pointer. I'm not sure I can agree with #40699 there. I'm tracking the comment on that PR.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 11, 2023
@github-actions github-actions bot closed this Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants