-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Docs and a test for unnest / Proposed consistent API #7044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| let results = df.unnest_column("tags")?.collect().await?; | ||
| let expected = vec![ | ||
| "+----------+-------+", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct test:
postgres=# select * from unnest_example2;
c1 | c2
----+---------------
1 | {tag11,tag12}
2 | {tag21,tag22}
3 | {tag31,tag32}
4 | {tag41,tag42}
5 | {tag51,tag52}
6 | {tag61,tag62}
(6 rows)
postgres=# select c1, unnest(c2) from unnest_example2;
c1 | unnest
----+--------
1 | tag11
1 | tag12
2 | tag21
2 | tag22
3 | tag31
3 | tag32
4 | tag41
4 | tag42
5 | tag51
5 | tag52
6 | tag61
6 | tag62
(12 rows)
| use super::DisplayAs; | ||
|
|
||
| /// Unnest the given column by joining the row with each value in the nested type. | ||
| /// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the point of view of all our reference databases, this output is incorrect:
PostgreSQL:
postgres=# select * from unnest_example;
c1 | c2
-------+----
{1,2} | A
| B
{} | D
{3} | E
(4 rows)
postgres=# select unnest(c1), c2 from unnest_example;
unnest | c2
--------+----
1 | A
2 | A
3 | E
(3 rows)
DuckDB:
D select * from unnest_example;
┌─────────┬─────────┐
│ c1 │ c2 │
│ int32[] │ varchar │
├─────────┼─────────┤
│ [1, 2] │ A │
│ │ B │
│ [] │ D │
│ [3] │ E │
└─────────┴─────────┘
D select unnest(c1), c2 from unnest_example;
┌────────────┬─────────┐
│ unnest(c1) │ c2 │
│ int32 │ varchar │
├────────────┼─────────┤
│ 1 │ A │
│ 2 │ A │
│ 3 │ E │
└────────────┴─────────┘
ClickHouse:
SELECT *
FROM unnest_example
Query id: be77ac76-c0ee-454a-968c-fa07afcf732d
┌─c1────┬─c2─┐
│ [] │ B │
│ [] │ D │
│ [1,2] │ A │
│ [3] │ E │
└───────┴────┘
4 rows in set. Elapsed: 0.001 sec.
DESKTOP-HM8OC4I. :) select arrayJoin(c1), c2 from unnest_example;
SELECT
arrayJoin(c1),
c2
FROM unnest_example
Query id: 3b8d0451-54f5-4a9a-8938-5505fe75a9df
┌─arrayJoin(c1)─┬─c2─┐
│ 1 │ A │
│ 2 │ A │
│ 3 │ E │
└───────────────┴────┘
3 rows in set. Elapsed: 0.001 sec.
|
|
||
| /// Unnest a column that contains a nested list type. | ||
| /// | ||
| /// For example, calling unnest(c1) results in the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the point of view of all our reference databases, this output is incorrect:
PostgreSQL:
postgres=# select * from unnest_example;
c1 | c2
-------+----
{1,2} | A
| B
{} | D
{3} | E
(4 rows)
postgres=# select unnest(c1), c2 from unnest_example;
unnest | c2
--------+----
1 | A
2 | A
3 | E
(3 rows)
DuckDB:
D select * from unnest_example;
┌─────────┬─────────┐
│ c1 │ c2 │
│ int32[] │ varchar │
├─────────┼─────────┤
│ [1, 2] │ A │
│ │ B │
│ [] │ D │
│ [3] │ E │
└─────────┴─────────┘
D select unnest(c1), c2 from unnest_example;
┌────────────┬─────────┐
│ unnest(c1) │ c2 │
│ int32 │ varchar │
├────────────┼─────────┤
│ 1 │ A │
│ 2 │ A │
│ 3 │ E │
└────────────┴─────────┘
ClickHouse:
SELECT *
FROM unnest_example
Query id: be77ac76-c0ee-454a-968c-fa07afcf732d
┌─c1────┬─c2─┐
│ [] │ B │
│ [] │ D │
│ [1,2] │ A │
│ [3] │ E │
└───────┴────┘
4 rows in set. Elapsed: 0.001 sec.
DESKTOP-HM8OC4I. :) select arrayJoin(c1), c2 from unnest_example;
SELECT
arrayJoin(c1),
c2
FROM unnest_example
Query id: 3b8d0451-54f5-4a9a-8938-5505fe75a9df
┌─arrayJoin(c1)─┬─c2─┐
│ 1 │ A │
│ 2 │ A │
│ 3 │ E │
└───────────────┴────┘
3 rows in set. Elapsed: 0.001 sec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the current implementation adds a row if there's a null value and would likely skip the row if the array is empty.
I can work on fixing this if that's needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes when I implemented unnest_column I followed the logic in Pandas and Polars:
ints shape_id
[3, 88, 94] 1
[73] 2
None 3
[43, 97] 4
None 5
[65] 6
by unnesting ints we get:
ints shape_id
3 1
88 1
94 1
73 2
None 3
43 4
97 4
None 5
65 6
that I think is better because you preserve all the information that was in the nested data and if then you want to nest the data again you get back the same data you started with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you decide to implement the same behavior you have in DuckDB would be nice to have an option to preserve the old behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the code ready to change this but need someone to confirm if this is what we want.
To support both ways, we could introduce some extra boolean flag like keep_nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a keep_nulls flag sounds like a good idea to me
So is the consensus for "duckdb/clickhouse" behavior by default, with a flag for keep_nulls? If so, I can update the examples in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe to avoid breaking API changes to the LogicalPlanBuilder::unnest_column interface we can add a LogicalPlanBuilder::unnest_with_options method that will set the flag in the LogicalPlan::Unnest struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have a play with the API and make a proposal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a proposed API -- please let me know what you think: #7088
unnest consistent with DuckDB/ClickHouse, add option for preserve_nulls, update docs
unnest consistent with DuckDB/ClickHouse, add option for preserve_nulls, update docs |
Lets continue the discussion on #7088 |
Which issue does this PR close?
While reviewing #7002 from @smiklos I kept getting confused about what unnest actually did.
cc @izveigor and @jayzhan211 -- could you please verify I got this right?
Rationale for this change
make the code eaiser to understand
What changes are included in this PR?
Add docstrings (and the new test from #7002 while i was at it)
Are these changes tested?
yes
Are there any user-facing changes?