-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Docs and a test for unnest / Proposed consistent API #7044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1114,6 +1114,71 @@ async fn unnest_fixed_list() -> Result<()> { | |
| Ok(()) | ||
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn unnest_fixed_list_nonull() -> Result<()> { | ||
| let mut shape_id_builder = UInt32Builder::new(); | ||
| let mut tags_builder = FixedSizeListBuilder::new(StringBuilder::new(), 2); | ||
|
|
||
| for idx in 0..6 { | ||
| // Append shape id. | ||
| shape_id_builder.append_value(idx as u32 + 1); | ||
|
|
||
| tags_builder | ||
| .values() | ||
| .append_value(format!("tag{}1", idx + 1)); | ||
| tags_builder | ||
| .values() | ||
| .append_value(format!("tag{}2", idx + 1)); | ||
| tags_builder.append(true); | ||
| } | ||
|
|
||
| let batch = RecordBatch::try_from_iter(vec![ | ||
| ("shape_id", Arc::new(shape_id_builder.finish()) as ArrayRef), | ||
| ("tags", Arc::new(tags_builder.finish()) as ArrayRef), | ||
| ])?; | ||
|
|
||
| let ctx = SessionContext::new(); | ||
| ctx.register_batch("shapes", batch)?; | ||
| let df = ctx.table("shapes").await?; | ||
|
|
||
| let results = df.clone().collect().await?; | ||
| let expected = vec![ | ||
| "+----------+----------------+", | ||
| "| shape_id | tags |", | ||
| "+----------+----------------+", | ||
| "| 1 | [tag11, tag12] |", | ||
| "| 2 | [tag21, tag22] |", | ||
| "| 3 | [tag31, tag32] |", | ||
| "| 4 | [tag41, tag42] |", | ||
| "| 5 | [tag51, tag52] |", | ||
| "| 6 | [tag61, tag62] |", | ||
| "+----------+----------------+", | ||
| ]; | ||
| assert_batches_sorted_eq!(expected, &results); | ||
|
|
||
| let results = df.unnest_column("tags")?.collect().await?; | ||
| let expected = vec![ | ||
| "+----------+-------+", | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's correct test: |
||
| "| shape_id | tags |", | ||
| "+----------+-------+", | ||
| "| 1 | tag11 |", | ||
| "| 1 | tag12 |", | ||
| "| 2 | tag21 |", | ||
| "| 2 | tag22 |", | ||
| "| 3 | tag31 |", | ||
| "| 3 | tag32 |", | ||
| "| 4 | tag41 |", | ||
| "| 4 | tag42 |", | ||
| "| 5 | tag51 |", | ||
| "| 5 | tag52 |", | ||
| "| 6 | tag61 |", | ||
| "| 6 | tag62 |", | ||
| "+----------+-------+", | ||
| ]; | ||
| assert_batches_sorted_eq!(expected, &results); | ||
|
|
||
| Ok(()) | ||
| } | ||
| #[tokio::test] | ||
| async fn unnest_aggregate_columns() -> Result<()> { | ||
| const NUM_ROWS: usize = 5; | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -120,7 +120,8 @@ pub enum LogicalPlan { | |
| Ddl(DdlStatement), | ||
| /// Describe the schema of table | ||
| DescribeTable(DescribeTable), | ||
| /// Unnest a column that contains a nested list type. | ||
| /// Unnest a column that contains a nested list type. See | ||
| /// [`Unnest`] for more details. | ||
| Unnest(Unnest), | ||
| } | ||
|
|
||
|
|
@@ -1753,6 +1754,23 @@ pub enum Partitioning { | |
| } | ||
|
|
||
| /// Unnest a column that contains a nested list type. | ||
| /// | ||
| /// For example, calling unnest(c1) results in the following: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. From the point of view of all our reference databases, this output is incorrect: DuckDB: ClickHouse:
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the current implementation adds a row if there's a null value and would likely skip the row if the array is empty. I can work on fixing this if that's needed
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes when I implemented by unnesting that I think is better because you preserve all the information that was in the nested data and if then you want to nest the data again you get back the same data you started with.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you decide to implement the same behavior you have in DuckDB would be nice to have an option to preserve the old behavior.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have the code ready to change this but need someone to confirm if this is what we want.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think a So is the consensus for "duckdb/clickhouse" behavior by default, with a flag for
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe to avoid breaking API changes to the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll have a play with the API and make a proposal
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here is a proposed API -- please let me know what you think: #7088 |
||
| /// | ||
| /// ```text | ||
| /// ┌─────────┐ ┌─────┐ ┌─────────┐ ┌─────┐ | ||
| /// │ {1, 2} │ │ A │ Unnest │ 1 │ │ A │ | ||
| /// ├─────────┤ ├─────┤ ├─────────┤ ├─────┤ | ||
| /// │ null │ │ B │ │ 2 │ │ A │ | ||
| /// ├─────────┤ ├─────┤ ────────────▶ ├─────────┤ ├─────┤ | ||
| /// │ {} │ │ D │ │ null │ │ B │ | ||
| /// ├─────────┤ ├─────┤ ├─────────┤ ├─────┤ | ||
| /// │ {3} │ │ E │ │ null │ │ D │ | ||
| /// └─────────┘ └─────┘ ├─────────┤ ├─────┤ | ||
| /// c1 c2 │ 3 │ │ E │ | ||
| /// └─────────┘ └─────┘ | ||
| /// c1 c2 | ||
| /// ``` | ||
| #[derive(Debug, Clone, PartialEq, Eq, Hash)] | ||
| pub struct Unnest { | ||
| /// The incoming logical plan | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the point of view of all our reference databases, this output is incorrect:
PostgreSQL:
DuckDB:
ClickHouse: