
[SPARK-43063][SQL] df.show handle null should print NULL instead of null #40699

Closed

Conversation


@yikf yikf commented Apr 7, 2023

What changes were proposed in this pull request?

df.show should print null values as NULL instead of null, for consistent behavior;

For example, the following behavior is currently inconsistent:

scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle') as result").show(false)
+------+
|result|
+------+
|null  |
+------+
spark-sql> DESC FUNCTION EXTENDED decode;
function_desc
Function: decode
Class: org.apache.spark.sql.catalyst.expressions.Decode
Usage:
    decode(bin, charset) - Decodes the first argument using the second argument character set.

    decode(expr, search, result [, search, result ] ... [, default]) - Compares expr
      to each search value in order. If expr is equal to a search value, decode returns
      the corresponding result. If no match is found, then it returns default. If default
      is omitted, it returns null.

Extended Usage:
    Examples:
      > SELECT decode(encode('abc', 'utf-8'), 'utf-8');
       abc
      > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
       San Francisco
      > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic');
       Non domestic
      > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle');
       NULL

    Since: 3.2.0

Time taken: 0.074 seconds, Fetched 4 row(s)
spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle');
NULL

Why are the changes needed?

To keep df.show consistent with the spark-sql CLI when handling null values.
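The intended behavior can be illustrated with a small sketch (in Python, and not Spark's actual implementation): a show()-style renderer that prints null cells as NULL, the way this PR proposes. The function names here are hypothetical.

```python
# Illustrative sketch, not Spark's code: render null cells as "NULL"
# in a df.show(false)-style ASCII table.
def format_cell(value, null_token="NULL"):
    """Render a single cell, using null_token for missing values."""
    return null_token if value is None else str(value)

def show(rows, headers):
    """Render rows as a simple ASCII table, like df.show(false)."""
    cells = [[format_cell(v) for v in row] for row in rows]
    widths = [max(len(h), *(len(r[i]) for r in cells))
              for i, h in enumerate(headers)]
    sep = "+" + "+".join("-" * w for w in widths) + "+"
    def line(vals):
        return "|" + "|".join(v.ljust(w) for v, w in zip(vals, widths)) + "|"
    return "\n".join([sep, line(headers), sep]
                     + [line(r) for r in cells] + [sep])

print(show([[None]], ["result"]))
# +------+
# |result|
# +------+
# |NULL  |
# +------+
```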

Does this PR introduce any user-facing change?

Yes, null values will be displayed as NULL instead of null.

How was this patch tested?

GA

@github-actions github-actions bot added the SQL label Apr 7, 2023
@LuciferYang

@yikf Can you re-trigger GA?


@LuciferYang LuciferYang left a comment


+1, LGTM if test passed


@AngersZhuuuu AngersZhuuuu left a comment


LGTM, cc @cloud-fan

@cloud-fan

The change LGTM. Can we also check other systems like Presto/Trino/Impala and see how they display null values?

cc @HyukjinKwon @viirya

@LuciferYang

> The change LGTM. Can we also check other systems like Presto/Trino/Impala and see how they display null values?

@yikf Can you help check Presto/Trino? Also check PostgreSQL?

@yikf

yikf commented Apr 12, 2023

> The change LGTM. Can we also check other systems like Presto/Trino/Impala and see how they display null values?
>
> cc @HyukjinKwon @viirya

> @yikf Can you help check Presto/Trino? Also check PostgreSQL?

Sure. As follows:
Trino

trino> select ARRAY[1, null];
   _col0
-----------
 [1, NULL]
(1 row)

Query 20230412_023104_00007_unnu6, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.03 [0 rows, 0B] [0 rows/s, 0B/s]

trino> select null;
 _col0
-------
 NULL
(1 row)

Query 20230412_023108_00008_unnu6, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.07 [0 rows, 0B] [0 rows/s, 0B/s]

PostgreSQL

postgres=# select array[1, null];
  array
----------
 {1,NULL}
(1 row)

@cloud-fan

thanks, merging to master!

@sadikovi

sadikovi commented Apr 21, 2023

Actually, this is a breaking change for the CAST expression. It affects the string representation of complex types, not just df.show, which could cause issues when comparing the string representation of a row:

select cast(struct(null) as string) as c0, null as c1
-- df.show()
+------+----+
|    c0|  c1|
+------+----+
|{NULL}|NULL|
+------+----+

-- df.collect.toList
List([{NULL},null])

It also affects CSV writes:

val df = spark.sql(
  """
  select cast(array(1,2) as string) as c0, 1 as c1
  union all
  select cast(array(3,null) as string) as c0, null as c1
  """
).coalesce(1)
df.write.csv("file:/tmp/ivan_test.csv")

Before this PR:

"[1, 2]",1
"[3, null]",

After this PR:

"[1, 2]",1
"[3, NULL]",

And there is no option to re-enable the old null format.
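The CSV concern can be sketched as follows (in Python, not Spark's CSV writer; the helper names are hypothetical): once an array is cast to a string, the null token becomes part of the cell's data, so changing the token from null to NULL changes the bytes written to the file.

```python
# Illustrative sketch, not Spark's code: the null token chosen by the
# array-to-string cast is embedded in the value the CSV writer receives.
import csv
import io

def array_to_string(values, null_token):
    """Mimic casting an array to its string form, e.g. [3, null]."""
    return "[" + ", ".join(null_token if v is None else str(v)
                           for v in values) + "]"

def write_row(values, null_token):
    """Write one stringified-array cell as a CSV row and return the line."""
    buf = io.StringIO()
    csv.writer(buf).writerow([array_to_string(values, null_token)])
    return buf.getvalue().strip()

print(write_row([3, None], "null"))  # before the PR: "[3, null]"
print(write_row([3, None], "NULL"))  # after the PR:  "[3, NULL]"
```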

On a broader note: it would be good to clarify why the spark-sql shell is expected to be visually compatible with spark-shell; those are two different REPLs with their own ways of displaying data. For example, we don't display a table the same way in those REPLs.

@sadikovi

cc @cloud-fan @dongjoon-hyun

@cloud-fan

@sadikovi CSV does not accept struct/array/map columns, your example generates a string column and writes it to CSV. That said, any behavior change of that "string generation" function will change the value we write to CSV, but I wouldn't call it a CSV behavior change.

It's intentional that the cast struct/array/map to string behavior is changed w.r.t. nulls. If third-party libraries rely on this behavior, then we need to revisit this PR.

BTW, I think the PR description should not say it's for consistency with the spark-sql shell. df.show displays a top-level null value as NULL, but an inner field's null value as null. This is the major motivation of the fix.

@sadikovi

I just tried without the change and df.show() seems to return correct results:

spark.sql(
  """
  select struct('a'), map('a', 1), array(1, 2), 'a'
  union all
  select struct(null), map('a', null), array(null, null), null 
  """).show()
+---------+-----------+------------+----+
|struct(a)|  map(a, 1)| array(1, 2)|   a|
+---------+-----------+------------+----+
|      {a}|   {a -> 1}|      [1, 2]|   a|
|   {null}|{a -> null}|[null, null]|null|
+---------+-----------+------------+----+

Could you provide an example of inconsistency in df.show() that this PR addresses?

The reason I brought up the CSV example is that this is typically how users write structs/arrays/maps to a CSV file: they transform them to strings, potentially via CAST. That is why I think this is a behaviour change in writing a CSV file.

@yikf

yikf commented Apr 21, 2023

> BTW, I think PR description should not say it's for consistency with spark-sql shell. df.show displays top-level null value as NULL, but inner field null value as null. This is the major motivation of the fix.

Actually, previously the top-level display was consistent with the inner null display: both showed null. This PR was mainly based on the view that mainstream databases use NULL rather than null when displaying nulls, and that NULL is a nicer string representation.

If it affects something else, I think we should look at it again, sorry ~ :)

> For example, we don't display a table the same way in those REPLs.

BTW, for Apache Spark, spark-sql and spark-shell are two different REPLs, but both are just user interaction interfaces. Shouldn't the displayed content be the same, at least in terms of the result set?

@yikf

yikf commented Apr 21, 2023

> which could cause issues when comparing the string representation of a row:

Sorry, I'm not sure what that means. Does it mean compared to what was written before this PR? Are there any problems with cast(null as string) displaying NULL? I see that Trino also shows NULL:

trino> select cast(null as varchar) as c0, null as c1;
  c0  |  c1
------+------
 NULL | NULL
(1 row)

@cloud-fan

@sadikovi Sorry I was wrong, Dataset.show also explicitly prints null as null before this PR.

Being consistent with other databases is good, but if people rely on the cast behavior to write struct/array/map to CSV, then it's more important to not break it.

Shall we revert this PR then? We can change spark-sql shell to print null to be consistent with df.show instead.

@yikf

yikf commented Apr 21, 2023

+1 for the revert.

BTW, if spark-sql is made consistent with df.show, that also seems to break a lot of things, since spark-sql previously used hiveResultString, which affects more than just the null display.

@cloud-fan

If people programmatically parse the output of spark-sql shell, then we shouldn't change it. If it's only for display and consumed by humans, I think it's OK to change.

@yikf

yikf commented Apr 21, 2023

> If people programmatically parse the output of spark-sql shell

I'd prefer to keep it the way it is. Let's see what other people think.

@sadikovi

sadikovi commented Apr 21, 2023

To be honest, I don't understand why spark-sql shell is expected to be consistent with spark-shell or pyspark shell. Can someone elaborate? I can see making spark-sql shell consistent with Presto/Trino/MySQL/Postgres, etc. but I don't understand why Scala REPL should be visually consistent with SQL terminal in terms of displaying results - they serve different purposes.

If you would like to have a consistent visual behaviour for NULLs/nulls, it is also fine just as long as it does not break other features like Cast or collect.toString. Maybe we could simply add a conversion method to display values in a DataFrame in whatever format we need when calling .show instead of changing Cast. In fact, we can refactor it into a separate class and reuse it in spark-sql and spark-shell.

cloud-fan added a commit that referenced this pull request Apr 27, 2023
…t.show

### What changes were proposed in this pull request?

This is a followup of #40699 to avoid changing the Cast behavior. It pulls the cast-to-string code out into a base trait, and adds a new Expression `ToPrettyString` that extends this trait with a little customization.

It also handles binary values inside array/struct/map so that they print in hex format, for `df.show` only, not `Cast`.

### Why are the changes needed?

Avoid a behavior change.

### Does this PR introduce _any_ user-facing change?

Changes back the behavior of casting array/map/struct to string with regard to null elements. It was `null`, was changed to `NULL` in #40699, and is `null` again after this PR.

### How was this patch tested?

existing tests

Closes #40922 from cloud-fan/pretty.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
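The shape of that refactor can be sketched roughly (in Python, not the actual Scala trait; the class names here mirror the commit message but the code is hypothetical): shared to-string logic lives in one base, and a "pretty" variant overrides only the null token, so Cast is untouched while df.show gets NULL.

```python
# Rough sketch of the refactor's shape, not Spark's actual implementation:
# a shared to-string base plus a pretty variant that only swaps the null token.
class ToStringBase:
    null_token = "null"  # Cast keeps the original token

    def to_string(self, value):
        if value is None:
            return self.null_token
        if isinstance(value, list):
            return "[" + ", ".join(self.to_string(v) for v in value) + "]"
        return str(value)

class ToPrettyString(ToStringBase):
    null_token = "NULL"  # used by df.show only

print(ToStringBase().to_string([3, None]))    # → [3, null]
print(ToPrettyString().to_string([3, None]))  # → [3, NULL]
```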
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023