Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Jul 20, 2023

Which issue does this PR close?

While reviewing #7002 from @smiklos I kept getting confused about what unnest actually did.

cc @izveigor and @jayzhan211 -- could you please verify I got this right?

Rationale for this change

make the code eaiser to understand

What changes are included in this PR?

Add docstrings (and the new test from #7002 while i was at it)

Are these changes tested?

yes

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Jul 20, 2023
@alamb alamb force-pushed the alamb/unnest_docs branch from 0489dfe to 4c271a2 Compare July 20, 2023 19:06

let results = df.unnest_column("tags")?.collect().await?;
let expected = vec![
"+----------+-------+",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct test:

postgres=# select * from unnest_example2;
 c1 |      c2       
----+---------------
  1 | {tag11,tag12}
  2 | {tag21,tag22}
  3 | {tag31,tag32}
  4 | {tag41,tag42}
  5 | {tag51,tag52}
  6 | {tag61,tag62}
(6 rows)

postgres=# select c1, unnest(c2) from unnest_example2;
 c1 | unnest 
----+--------
  1 | tag11
  1 | tag12
  2 | tag21
  2 | tag22
  3 | tag31
  3 | tag32
  4 | tag41
  4 | tag42
  5 | tag51
  5 | tag52
  6 | tag61
  6 | tag62
(12 rows)

use super::DisplayAs;

/// Unnest the given column by joining the row with each value in the nested type.
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the point of view of all our reference databases, this output is incorrect:
PostgreSQL:

postgres=# select * from unnest_example;
  c1   | c2 
-------+----
 {1,2} | A
       | B
 {}    | D
 {3}   | E
(4 rows)

postgres=# select unnest(c1), c2 from unnest_example;
 unnest | c2 
--------+----
      1 | A
      2 | A
      3 | E
(3 rows)

DuckDB:

D select * from unnest_example;
┌─────────┬─────────┐
│   c1    │   c2    │
│ int32[] │ varchar │
├─────────┼─────────┤
│ [1, 2]  │ A       │
│         │ B       │
│ []      │ D       │
│ [3]     │ E       │
└─────────┴─────────┘
D select unnest(c1), c2 from unnest_example;
┌────────────┬─────────┐
│ unnest(c1) │   c2    │
│   int32    │ varchar │
├────────────┼─────────┤
│          1 │ A       │
│          2 │ A       │
│          3 │ E       │
└────────────┴─────────┘

ClickHouse:

SELECT *
FROM unnest_example

Query id: be77ac76-c0ee-454a-968c-fa07afcf732d

┌─c1────┬─c2─┐
│ []    │ B  │
│ []    │ D  │
│ [1,2] │ A  │
│ [3]   │ E  │
└───────┴────┘

4 rows in set. Elapsed: 0.001 sec. 

DESKTOP-HM8OC4I. :) select arrayJoin(c1), c2 from unnest_example;

SELECT
    arrayJoin(c1),
    c2
FROM unnest_example

Query id: 3b8d0451-54f5-4a9a-8938-5505fe75a9df

┌─arrayJoin(c1)─┬─c2─┐
│             1 │ A  │
│             2 │ A  │
│             3 │ E  │
└───────────────┴────┘

3 rows in set. Elapsed: 0.001 sec.


/// Unnest a column that contains a nested list type.
///
/// For example, calling unnest(c1) results in the following:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the point of view of all our reference databases, this output is incorrect:
PostgreSQL:

postgres=# select * from unnest_example;
  c1   | c2 
-------+----
 {1,2} | A
       | B
 {}    | D
 {3}   | E
(4 rows)

postgres=# select unnest(c1), c2 from unnest_example;
 unnest | c2 
--------+----
      1 | A
      2 | A
      3 | E
(3 rows)

DuckDB:

D select * from unnest_example;
┌─────────┬─────────┐
│   c1    │   c2    │
│ int32[] │ varchar │
├─────────┼─────────┤
│ [1, 2]  │ A       │
│         │ B       │
│ []      │ D       │
│ [3]     │ E       │
└─────────┴─────────┘
D select unnest(c1), c2 from unnest_example;
┌────────────┬─────────┐
│ unnest(c1) │   c2    │
│   int32    │ varchar │
├────────────┼─────────┤
│          1 │ A       │
│          2 │ A       │
│          3 │ E       │
└────────────┴─────────┘

ClickHouse:

SELECT *
FROM unnest_example

Query id: be77ac76-c0ee-454a-968c-fa07afcf732d

┌─c1────┬─c2─┐
│ []    │ B  │
│ []    │ D  │
│ [1,2] │ A  │
│ [3]   │ E  │
└───────┴────┘

4 rows in set. Elapsed: 0.001 sec. 

DESKTOP-HM8OC4I. :) select arrayJoin(c1), c2 from unnest_example;

SELECT
    arrayJoin(c1),
    c2
FROM unnest_example

Query id: 3b8d0451-54f5-4a9a-8938-5505fe75a9df

┌─arrayJoin(c1)─┬─c2─┐
│             1 │ A  │
│             2 │ A  │
│             3 │ E  │
└───────────────┴────┘

3 rows in set. Elapsed: 0.001 sec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current implementation adds a row if there's a null value and would likely skip the row if the array is empty.

I can work on fixing this if that's needed

Copy link
Contributor

@vincev vincev Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes when I implemented unnest_column I followed the logic in Pandas and Polars:

          ints  shape_id
[3, 88, 94]         1
       [73]         2
       None         3
   [43, 97]         4
       None         5
       [65]         6

by unnesting ints we get:

   ints  shape_id
   3         1
  88         1
  94         1
  73         2
None         3
  43         4
  97         4
None         5
  65         6

that I think is better because you preserve all the information that was in the nested data and if then you want to nest the data again you get back the same data you started with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you decide to implement the same behavior you have in DuckDB would be nice to have an option to preserve the old behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the code ready to change this but need someone to confirm if this is what we want.
To support both ways, we could introduce some extra boolean flag like keep_nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a keep_nulls flag sounds like a good idea to me

So is the consensus for "duckdb/clickhouse" behavior by default, with a flag for keep_nulls? If so, I can update the examples in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe to avoid breaking API changes to the LogicalPlanBuilder::unnest_column interface we can add a LogicalPlanBuilder::unnest_with_options method that will set the flag in the LogicalPlan::Unnest struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a play with the API and make a proposal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a proposed API -- please let me know what you think: #7088

@alamb
Copy link
Contributor Author

alamb commented Jul 22, 2023

Thank you for the research @izveigor and the constructive discussion @vincev and @smiklos

@alamb alamb changed the title Minor: Add docs and a test for unnest Make unnest consistent with DuckDB/ClickHouse, add option for preserve_nulls, update docs Jul 25, 2023
@alamb alamb changed the title Make unnest consistent with DuckDB/ClickHouse, add option for preserve_nulls, update docs Minor: Add docs and a test for unnest Jul 25, 2023
@alamb alamb force-pushed the alamb/unnest_docs branch from e81752e to 4c271a2 Compare July 25, 2023 14:19
@alamb alamb changed the title Minor: Add docs and a test for unnest Docs and a test for unnest / Proposed consistent API Jul 27, 2023
@alamb
Copy link
Contributor Author

alamb commented Jul 30, 2023

Lets continue the discussion on #7088

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants