New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-43063][SQL] df.show
handle null should print NULL instead of null
#40699
Conversation
@yikf Can you re-trigger GA? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM if test passed
12a5155
to
0b95c6c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, cc @cloud-fan
a31686b
to
b82fc6c
Compare
10f2437
to
85c58e3
Compare
The change LGTM. Can we also check other systems like Preso/Trino/Impala and see how they display null values? |
@yikf Can you help check Preso/Trino ? Also check PostgreSQL? |
Sure, As following, trino> select ARRAY[1, null];
_col0
-----------
[1, NULL]
(1 row)
Query 20230412_023104_00007_unnu6, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.03 [0 rows, 0B] [0 rows/s, 0B/s]
trino> select null;
_col0
-------
NULL
(1 row)
Query 20230412_023108_00008_unnu6, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.07 [0 rows, 0B] [0 rows/s, 0B/s] PostgreSQL postgres=# select array[1, null];
array
----------
{1,NULL}
(1 row) |
thanks, merging to master! |
Actually, this is a breaking change for CAST expression. It affects the string representation of complex types and not just select cast(struct(null) as string) as c0, null as c1
It also affects CSV writes: val df = spark.sql(
"""
select cast(array(1,2) as string) as c0, 1 as c1
union all
select cast(array(3,null) as string) as c0, null as c1
"""
).coalesce(1)
df.write.csv("file:/tmp/ivan_test.csv") Before this PR:
After this PR:
And there is no option to re-enable null format. On a broader side of things: It would be good to clarify why spark-sql shell is expected to be visually compatible with spark-shell, those are two different REPLs with their own ways of displaying data. For example, we don't display a table the same way in those REPLs. |
@sadikovi CSV does not accept struct/array/map columns, your example generates a string column and writes it to CSV. That said, any behavior change of that "string generation" function will change the value we write to CSV, but I wouldn't call it a CSV behavior change. It's intentional that the cast struct/array/map to string behavior is changed w.r.t. nulls. If third-party libraries rely on this behavior, then we need to revisit this PR. BTW, I think PR description should not say it's for consistency with spark-sql shell. |
I just tried without the change and df.show() seems to return correct results: spark.sql(
"""
select struct('a'), map('a', 1), array(1, 2), 'a'
union all
select struct(null), map('a', null), array(null, null), null
""").show()
Could you provide an example of inconsistency in df.show() that this PR addresses? The reason I brought the CSV example is because that is typically how users would write structs/arrays/maps to a CSV file, they would transform that to a string, potentially via CAST. That is why I think this is a behaviour change in writing a CSV file. |
Actually, previously, the top-level is consistent with the inner null display with If it's affecting something else, I think we should look at it again, sorry ~ : )
BTW, For apache spark, spark-sql and spark-shell are two different REPLs, but they are only user interaction interfaces. Shouldn't the displayed content be the same? At least in terms of the result set. |
Sorry, I'm not sure what that means, does it mean compared to what was written before this PR? Any problems if cast(null as string) with NULL, I see that trino is also NULL trino> select cast(null as varchar) as c0, null as c1;
c0 | c1
------+------
NULL | NULL
(1 row) |
@sadikovi Sorry I was wrong, Being consistent with other databases is good, but if people rely on the cast behavior to write struct/array/map to CSV, then it's more important to not break it. Shall we revert this PR then? We can change spark-sql shell to print |
+1 for the revert. BTW, Should spark-sql keep consistent with df.show, this also seems to break a lot of things, since spark-sql previously used hiveResultString, not only null display. |
If people programmatically parse the output of spark-sql shell, then we shouldn't change it. If it's only for display and consumed by humans, I think it's OK to change. |
I want to keep it the way it is. Look at what other people think. |
To be honest, I don't understand why spark-sql shell is expected to be consistent with spark-shell or pyspark shell. Can someone elaborate? I can see making spark-sql shell consistent with Presto/Trino/MySQL/Postgres, etc. but I don't understand why Scala REPL should be visually consistent with SQL terminal in terms of displaying results - they serve different purposes. If you would like to have a consistent visual behaviour for NULLs/nulls, it is also fine just as long as it does not break other features like Cast or |
…t.show ### What changes were proposed in this pull request? This is a followup of #40699 to avoid changing the Cast behavior. It pulls out the cast-to-string code into a base trait, and add a new Expression `ToPrettyString` to extend this trait with a little customization. It also handles binary value inside array/struct/map to also print hex format, for `df.show` only, not `Cast`. ### Why are the changes needed? avoid behavior change ### Does this PR introduce _any_ user-facing change? change back the behavior of casting array/map/struct to string regarding null elements. It was `null`, then changed to `NULL` in #40699 , and is `null` again after this PR. ### How was this patch tested? existing tests Closes #40922 from cloud-fan/pretty. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…t.show ### What changes were proposed in this pull request? This is a followup of apache#40699 to avoid changing the Cast behavior. It pulls out the cast-to-string code into a base trait, and add a new Expression `ToPrettyString` to extend this trait with a little customization. It also handles binary value inside array/struct/map to also print hex format, for `df.show` only, not `Cast`. ### Why are the changes needed? avoid behavior change ### Does this PR introduce _any_ user-facing change? change back the behavior of casting array/map/struct to string regarding null elements. It was `null`, then changed to `NULL` in apache#40699 , and is `null` again after this PR. ### How was this patch tested? existing tests Closes apache#40922 from cloud-fan/pretty. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
df.show
handle null should print NULL instead of null to consistent behavior;Like as the following behavior is currently inconsistent:
Why are the changes needed?
df.show
keep consistent behavior when handlenull
with spark-sql CLI.Does this PR introduce any user-facing change?
Yes,
null
will display NULL instead of null.How was this patch tested?
GA