
[SPARK-43063][SQL] df.show handle null should print NULL instead of null #40699

Closed

Conversation


@yikf yikf commented Apr 7, 2023

What changes were proposed in this pull request?

df.show should print null values as NULL instead of null, for consistent behavior;

For example, the following behavior is currently inconsistent:

scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle') as result").show(false)
+------+
|result|
+------+
|null  |
+------+
spark-sql> DESC FUNCTION EXTENDED decode;
function_desc
Function: decode
Class: org.apache.spark.sql.catalyst.expressions.Decode
Usage:
    decode(bin, charset) - Decodes the first argument using the second argument character set.

    decode(expr, search, result [, search, result ] ... [, default]) - Compares expr
      to each search value in order. If expr is equal to a search value, decode returns
      the corresponding result. If no match is found, then it returns default. If default
      is omitted, it returns null.

Extended Usage:
    Examples:
      > SELECT decode(encode('abc', 'utf-8'), 'utf-8');
       abc
      > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
       San Francisco
      > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
       Non domestic
      > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle');
       NULL

    Since: 3.2.0

Time taken: 0.074 seconds, Fetched 4 row(s)
spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle');
NULL

Why are the changes needed?

To keep df.show consistent with the spark-sql CLI when handling null values.
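The intended behavior can be illustrated with a small sketch (in Python, and not Spark's actual implementation): a show()-style renderer that prints null cells as NULL, the way this PR proposes. The function names here are hypothetical.

```python
# Illustrative sketch, not Spark's code: render null cells as "NULL"
# in a df.show(false)-style ASCII table.
def format_cell(value, null_token="NULL"):
    """Render a single cell, using null_token for missing values."""
    return null_token if value is None else str(value)

def show(rows, headers):
    """Render rows as a simple ASCII table, like df.show(false)."""
    cells = [[format_cell(v) for v in row] for row in rows]
    widths = [max(len(h), *(len(r[i]) for r in cells))
              for i, h in enumerate(headers)]
    sep = "+" + "+".join("-" * w for w in widths) + "+"
    def line(vals):
        return "|" + "|".join(v.ljust(w) for v, w in zip(vals, widths)) + "|"
    return "\n".join([sep, line(headers), sep]
                     + [line(r) for r in cells] + [sep])

print(show([[None]], ["result"]))
# +------+
# |result|
# +------+
# |NULL  |
# +------+
```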

Does this PR introduce any user-facing change?

Yes, null values will be displayed as NULL instead of null.

How was this patch tested?

GA

@github-actions github-actions bot added the SQL label Apr 7, 2023
@LuciferYang

@yikf Can you re-trigger GA?


@LuciferYang LuciferYang left a comment


+1, LGTM if test passed


@AngersZhuuuu AngersZhuuuu left a comment


LGTM, cc @cloud-fan

@cloud-fan

The change LGTM. Can we also check other systems like Presto/Trino/Impala and see how they display null values?

cc @HyukjinKwon @viirya

@LuciferYang

> The change LGTM. Can we also check other systems like Presto/Trino/Impala and see how they display null values?

@yikf Can you help check Presto/Trino? Also check PostgreSQL?

@yikf

yikf commented Apr 12, 2023

> The change LGTM. Can we also check other systems like Presto/Trino/Impala and see how they display null values?
>
> cc @HyukjinKwon @viirya

> @yikf Can you help check Presto/Trino? Also check PostgreSQL?

Sure. As follows:
Trino

trino> select ARRAY[1, null];
   _col0
-----------
 [1, NULL]
(1 row)

Query 20230412_023104_00007_unnu6, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.03 [0 rows, 0B] [0 rows/s, 0B/s]

trino> select null;
 _col0
-------
 NULL
(1 row)

Query 20230412_023108_00008_unnu6, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.07 [0 rows, 0B] [0 rows/s, 0B/s]

PostgreSQL

postgres=# select array[1, null];
  array
----------
 {1,NULL}
(1 row)

@cloud-fan

thanks, merging to master!

@sadikovi

sadikovi commented Apr 21, 2023

Actually, this is a breaking change for the CAST expression. It affects the string representation of complex types, not just df.show, which could cause issues when comparing the string representation of a row:

select cast(struct(null) as string) as c0, null as c1
-- df.show()
+------+----+
|    c0|  c1|
+------+----+
|{NULL}|NULL|
+------+----+

-- df.collect.toList
List([{NULL},null])

It also affects CSV writes:

val df = spark.sql(
  """
  select cast(array(1,2) as string) as c0, 1 as c1
  union all
  select cast(array(3,null) as string) as c0, null as c1
  """
).coalesce(1)
df.write.csv("file:/tmp/ivan_test.csv")

Before this PR:

"[1, 2]",1
"[3, null]",

After this PR:

"[1, 2]",1
"[3, NULL]",

And there is no option to re-enable the old null format.
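The CSV concern can be sketched as follows (in Python, not Spark's CSV writer; the helper names are hypothetical): once an array is cast to a string, the null token becomes part of the cell's data, so changing the token from null to NULL changes the bytes written to the file.

```python
# Illustrative sketch, not Spark's code: the null token chosen by the
# array-to-string cast is embedded in the value the CSV writer receives.
import csv
import io

def array_to_string(values, null_token):
    """Mimic casting an array to its string form, e.g. [3, null]."""
    return "[" + ", ".join(null_token if v is None else str(v)
                           for v in values) + "]"

def write_row(values, null_token):
    """Write one stringified-array cell as a CSV row and return the line."""
    buf = io.StringIO()
    csv.writer(buf).writerow([array_to_string(values, null_token)])
    return buf.getvalue().strip()

print(write_row([3, None], "null"))  # before the PR: "[3, null]"
print(write_row([3, None], "NULL"))  # after the PR:  "[3, NULL]"
```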

On a broader note: it would be good to clarify why the spark-sql shell is expected to be visually compatible with spark-shell; those are two different REPLs with their own ways of displaying data. For example, we don't display a table the same way in those REPLs.

@sadikovi

cc @cloud-fan @dongjoon-hyun

@cloud-fan

@sadikovi CSV does not accept struct/array/map columns, your example generates a string column and writes it to CSV. That said, any behavior change of that "string generation" function will change the value we write to CSV, but I wouldn't call it a CSV behavior change.

It's intentional that the cast struct/array/map to string behavior is changed w.r.t. nulls. If third-party libraries rely on this behavior, then we need to revisit this PR.

BTW, I think the PR description should not say it's for consistency with the spark-sql shell. df.show displays a top-level null value as NULL, but an inner field's null value as null. This is the major motivation of the fix.

@sadikovi

I just tried without the change and df.show() seems to return correct results:

spark.sql(
  """
  select struct('a'), map('a', 1), array(1, 2), 'a'
  union all
  select struct(null), map('a', null), array(null, null), null 
  """).show()
+---------+-----------+------------+----+
|struct(a)|  map(a, 1)| array(1, 2)|   a|
+---------+-----------+------------+----+
|      {a}|   {a -> 1}|      [1, 2]|   a|
|   {null}|{a -> null}|[null, null]|null|
+---------+-----------+------------+----+

Could you provide an example of inconsistency in df.show() that this PR addresses?

The reason I brought up the CSV example is that this is typically how users write structs/arrays/maps to a CSV file: they transform them to strings, potentially via CAST. That is why I think this is a behaviour change in writing a CSV file.

@yikf

yikf commented Apr 21, 2023

> BTW, I think PR description should not say it's for consistency with spark-sql shell. df.show displays top-level null value as NULL, but inner field null value as null. This is the major motivation of the fix.

Actually, previously the top-level display was consistent with the inner null display: both showed null. This PR was mainly based on the view that mainstream databases use NULL rather than null when displaying nulls, and that NULL is a nicer string representation.

If it affects something else, I think we should look at it again, sorry ~ :)

> For example, we don't display a table the same way in those REPLs.

BTW, for Apache Spark, spark-sql and spark-shell are two different REPLs, but both are just user interaction interfaces. Shouldn't the displayed content be the same, at least in terms of the result set?

@yikf

yikf commented Apr 21, 2023

> which could cause issues when comparing the string representation of a row:

Sorry, I'm not sure what that means. Does it mean compared to what was written before this PR? Are there any problems with cast(null as string) displaying NULL? I see that Trino also shows NULL:

trino> select cast(null as varchar) as c0, null as c1;
  c0  |  c1
------+------
 NULL | NULL
(1 row)

@cloud-fan

@sadikovi Sorry I was wrong, Dataset.show also explicitly prints null as null before this PR.

Being consistent with other databases is good, but if people rely on the cast behavior to write struct/array/map to CSV, then it's more important to not break it.

Shall we revert this PR then? We can change spark-sql shell to print null to be consistent with df.show instead.

@yikf

yikf commented Apr 21, 2023

+1 for the revert.

BTW, if spark-sql is made consistent with df.show, that also seems to break a lot of things, since spark-sql previously used hiveResultString, which affects more than just the null display.

@cloud-fan

If people programmatically parse the output of spark-sql shell, then we shouldn't change it. If it's only for display and consumed by humans, I think it's OK to change.

@yikf

yikf commented Apr 21, 2023

> If people programmatically parse the output of spark-sql shell

I'd prefer to keep it the way it is. Let's see what other people think.

@sadikovi

sadikovi commented Apr 21, 2023

To be honest, I don't understand why spark-sql shell is expected to be consistent with spark-shell or pyspark shell. Can someone elaborate? I can see making spark-sql shell consistent with Presto/Trino/MySQL/Postgres, etc. but I don't understand why Scala REPL should be visually consistent with SQL terminal in terms of displaying results - they serve different purposes.

If you would like to have a consistent visual behaviour for NULLs/nulls, it is also fine just as long as it does not break other features like Cast or collect.toString. Maybe we could simply add a conversion method to display values in a DataFrame in whatever format we need when calling .show instead of changing Cast. In fact, we can refactor it into a separate class and reuse it in spark-sql and spark-shell.

cloud-fan added a commit that referenced this pull request Apr 27, 2023
…t.show

### What changes were proposed in this pull request?

This is a followup of #40699 to avoid changing the Cast behavior. It pulls the cast-to-string code out into a base trait, and adds a new Expression `ToPrettyString` that extends this trait with a little customization.

It also handles binary values inside array/struct/map so that they print in hex format, for `df.show` only, not `Cast`.

### Why are the changes needed?

Avoid a behavior change.

### Does this PR introduce _any_ user-facing change?

Changes back the behavior of casting array/map/struct to string with regard to null elements. It was `null`, was changed to `NULL` in #40699, and is `null` again after this PR.

### How was this patch tested?

existing tests

Closes #40922 from cloud-fan/pretty.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
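The shape of that refactor can be sketched roughly (in Python, not the actual Scala trait; the class names here mirror the commit message but the code is hypothetical): shared to-string logic lives in one base, and a "pretty" variant overrides only the null token, so Cast is untouched while df.show gets NULL.

```python
# Rough sketch of the refactor's shape, not Spark's actual implementation:
# a shared to-string base plus a pretty variant that only swaps the null token.
class ToStringBase:
    null_token = "null"  # Cast keeps the original token

    def to_string(self, value):
        if value is None:
            return self.null_token
        if isinstance(value, list):
            return "[" + ", ".join(self.to_string(v) for v in value) + "]"
        return str(value)

class ToPrettyString(ToStringBase):
    null_token = "NULL"  # used by df.show only

print(ToStringBase().to_string([3, None]))    # → [3, null]
print(ToPrettyString().to_string([3, None]))  # → [3, NULL]
```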
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023