Add support for UNION sql #1029

xudong963 · 2021-09-20T16:51:55Z

Which issue does this PR close?

Closes #998

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

xudong963 · 2021-09-20T16:54:50Z

Some tests may not pass yet; I will fix them tomorrow. I probably also need to add a test for UNION.

Dandandan · 2021-09-20T16:58:12Z

datafusion/src/dataframe.rs

    /// # }
    /// ```
-    fn union(&self, dataframe: Arc<dyn DataFrame>) -> Result<Arc<dyn DataFrame>>;
+    fn union(


I think a new method, e.g. union_distinct, would make more sense - having booleans in APIs hurts readability.

Alternatively, what about df.union(df2)?.distinct(), without exposing a new method?
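To make the two API shapes in this comment concrete, here is a minimal sketch on a toy type. `Frame` and its methods are hypothetical stand-ins written for illustration; they are not DataFusion's `DataFrame` trait, and the real methods return `Result<Arc<dyn DataFrame>>` rather than plain values.

```rust
use std::collections::HashSet;

/// Toy stand-in for a dataframe: a single column of integers.
/// (Hypothetical type for illustration; not DataFusion's `DataFrame`.)
#[derive(Debug, Clone, PartialEq)]
pub struct Frame {
    rows: Vec<i64>,
}

impl Frame {
    pub fn new(rows: Vec<i64>) -> Self {
        Frame { rows }
    }

    /// UNION ALL: concatenate both inputs and keep duplicates.
    pub fn union(&self, other: &Frame) -> Frame {
        let mut rows = self.rows.clone();
        rows.extend_from_slice(&other.rows);
        Frame { rows }
    }

    /// Remove duplicate rows, keeping the first occurrence of each value.
    pub fn distinct(&self) -> Frame {
        let mut seen = HashSet::new();
        let rows: Vec<i64> = self
            .rows
            .iter()
            .copied()
            .filter(|value| seen.insert(*value))
            .collect();
        Frame { rows }
    }

    /// First suggestion: a dedicated method whose name says what it does,
    /// instead of a boolean parameter on `union`.
    pub fn union_distinct(&self, other: &Frame) -> Frame {
        self.union(other).distinct()
    }

    pub fn rows(&self) -> &[i64] {
        &self.rows
    }
}
```

With this shape, a call site reads either `a.union(&b).distinct()` or `a.union_distinct(&b)`, and neither needs a bare `true`/`false` argument whose meaning has to be looked up.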

Dandandan · 2021-09-20T17:07:41Z

datafusion/src/physical_plan/mod.rs

+                        array_str += &*array_value_to_string(column, 1)?;
+                        vec_array.push(column.clone());
+                    }
+                    if vec_str.contains(&array_str) {


Because the code keeps every row in a Vec and checks membership with contains, it scales badly: O(n^2) instead of O(n). Besides that, it probably has very high overhead in string formatting and in memory usage, since all batches plus all rows converted to strings are kept in memory. Formatting to a string is also not very robust: we don't guarantee that two different values won't be formatted as the same string.
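The two concerns above can be illustrated with a small sketch. It assumes a toy row representation (`Row` as a list of string cells) and hypothetical helper names; it is not the code from this PR. The `concatenated_key` helper mimics appending cell strings with no separator, which is how the reviewed snippet appears to build its key.

```rust
use std::collections::HashSet;

/// A row modelled as a list of string cells (toy representation).
type Row = Vec<String>;

/// Quadratic: each new row is compared against every row kept so far.
pub fn dedup_with_vec(rows: &[Row]) -> Vec<Row> {
    let mut kept: Vec<Row> = Vec::new();
    for row in rows {
        if !kept.contains(row) {
            kept.push(row.clone());
        }
    }
    kept
}

/// Expected linear time: membership checks go through a hash set.
pub fn dedup_with_hash_set(rows: &[Row]) -> Vec<Row> {
    let mut seen: HashSet<&Row> = HashSet::new();
    let mut kept = Vec::new();
    for row in rows {
        // `insert` returns true only the first time a row is seen.
        if seen.insert(row) {
            kept.push(row.clone());
        }
    }
    kept
}

/// Builds a key by appending the cell strings with no separator.
/// Different rows can produce the same key.
pub fn concatenated_key(row: &Row) -> String {
    row.concat()
}
```

Both functions return the same rows in the same order; only the cost differs. The key helper shows the robustness problem: the rows `["ab", "c"]` and `["a", "bc"]` are different, yet both concatenate to `"abc"`, so a string-keyed check would wrongly treat one of them as a duplicate.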

I think a cleaner and more efficient way to go now would be to reuse the implementations we already have to drive the execution and only change the query planning.
#998 (comment)

For some inspiration, here is the current SELECT DISTINCT implementation:
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/sql/planner.rs#L777

alamb · 2021-09-20

I agree with @Dandandan -- specifically, I think you could implement SELECT x FROM foo UNION SELECT x FROM bar by creating effectively the same plan as

SELECT DISTINCT x FROM (SELECT x FROM foo UNION ALL SELECT x FROM bar)

You can see the plan that gets made by running EXPLAIN:

```
explain select distinct x from ( select 1 as x UNION ALL select 1 as x);
+---------------+------------------------------------------------------------------------------+
| plan_type     | plan                                                                         |
+---------------+------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[#x]], aggr=[[]]                                         |
|               |   Union                                                                      |
|               |     Projection: Int64(1) AS x                                                |
|               |       EmptyRelation                                                          |
|               |     Projection: Int64(1) AS x                                                |
|               |       EmptyRelation                                                          |
| physical_plan | HashAggregateExec: mode=FinalPartitioned, gby=[x@0 as x], aggr=[]            |
|               |   CoalesceBatchesExec: target_batch_size=4096                                |
|               |     RepartitionExec: partitioning=Hash([Column { name: "x", index: 0 }], 16) |
|               |       HashAggregateExec: mode=Partial, gby=[x@0 as x], aggr=[]               |
|               |         UnionExec                                                            |
|               |           RepartitionExec: partitioning=RoundRobinBatch(16)                  |
|               |             ProjectionExec: expr=[1 as x]                                    |
|               |               EmptyExec: produce_one_row=true                                |
|               |           RepartitionExec: partitioning=RoundRobinBatch(16)                  |
|               |             ProjectionExec: expr=[1 as x]                                    |
|               |               EmptyExec: produce_one_row=true                                |
+---------------+------------------------------------------------------------------------------+
```

(so use a `UnionExec` followed by `HashAggregateExec`)
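The planning-only approach suggested here can be sketched with a toy logical plan. Everything below is hypothetical and written for illustration: `Plan`, `SetQuantifier`, and `plan_union` are not DataFusion types or functions, and the real planner works on a much richer `LogicalPlan`. The sketch only shows the rewrite itself: UNION ALL stays a plain union, and UNION wraps that same union in a grouping over all output columns with no aggregate expressions.

```rust
/// Toy logical plan (hypothetical; far simpler than DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// Reads the named columns from a table.
    Scan { table: String, columns: Vec<String> },
    /// UNION ALL: concatenates the inputs, duplicates included.
    Union(Vec<Plan>),
    /// Grouping with no aggregate expressions, which acts as DISTINCT.
    Aggregate { group_by: Vec<String>, input: Box<Plan> },
}

impl Plan {
    /// Output column names of this plan node.
    pub fn columns(&self) -> Vec<String> {
        match self {
            Plan::Scan { columns, .. } => columns.clone(),
            // A union exposes the columns of its first input.
            Plan::Union(inputs) => inputs
                .first()
                .map(|input| input.columns())
                .unwrap_or_default(),
            Plan::Aggregate { group_by, .. } => group_by.clone(),
        }
    }
}

/// Whether the SQL said `UNION ALL` or `UNION [DISTINCT]`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SetQuantifier {
    All,
    Distinct,
}

/// Plans a union without touching execution: DISTINCT is expressed as
/// an aggregate on top of the existing UNION ALL plan.
pub fn plan_union(left: Plan, right: Plan, quantifier: SetQuantifier) -> Plan {
    let union_all = Plan::Union(vec![left, right]);
    match quantifier {
        SetQuantifier::All => union_all,
        SetQuantifier::Distinct => Plan::Aggregate {
            group_by: union_all.columns(),
            input: Box::new(union_all),
        },
    }
}
```

The resulting shape mirrors the `Aggregate` over `Union` pair in the logical plan shown by EXPLAIN, so no new execution code is needed for the distinct case.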

Dandandan · 2021-09-20T17:10:44Z

datafusion/src/physical_plan/mod.rs

-    let stream = execute_stream(plan).await?;
-    common::collect(stream).await
+    let stream = execute_stream(plan.clone()).await?;
+    let any_plan = plan.as_any().downcast_ref::<UnionExec>();


The code that executes the UnionExec, if it needs to change, should be changed there.
I suggest implementing it using the plan we have. If a more efficient implementation is possible, I think the best way would be to put it in a new node, i.e. UnionDistinctExec, and

Dandandan · 2021-09-20T17:11:48Z

datafusion/src/physical_plan/union.rs

    /// Execution metrics
    metrics: ExecutionPlanMetricsSet,
+    /// Union ALL or Union
+    is_all: bool,


UNION ALL and UNION DISTINCT are quite different, so if we need to add or change implementations, I think it makes sense to add a new node instead, such as UnionDistinctExec.
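As a rough sketch of what "a new node instead of a flag" could look like, the toy operators below keep the two behaviours in separate types. The `ExecNode` trait and the `Values` leaf are hypothetical and much simpler than DataFusion's `ExecutionPlan`; only the node names `UnionExec` and `UnionDistinctExec` come from this thread, and the bodies here are illustrative rather than the real implementations.

```rust
use std::collections::HashSet;

/// Minimal stand-in for a physical operator
/// (hypothetical; not DataFusion's `ExecutionPlan`).
pub trait ExecNode {
    fn name(&self) -> &'static str;
    fn execute(&self) -> Vec<i64>;
}

/// Leaf node that returns fixed rows.
pub struct Values(pub Vec<i64>);

impl ExecNode for Values {
    fn name(&self) -> &'static str {
        "Values"
    }
    fn execute(&self) -> Vec<i64> {
        self.0.clone()
    }
}

/// UNION ALL: concatenates its inputs and carries no `is_all` flag.
pub struct UnionExec {
    pub inputs: Vec<Box<dyn ExecNode>>,
}

impl ExecNode for UnionExec {
    fn name(&self) -> &'static str {
        "UnionExec"
    }
    fn execute(&self) -> Vec<i64> {
        self.inputs.iter().flat_map(|input| input.execute()).collect()
    }
}

/// UNION DISTINCT as its own node, so its logic never leaks into `UnionExec`.
pub struct UnionDistinctExec {
    pub inputs: Vec<Box<dyn ExecNode>>,
}

impl ExecNode for UnionDistinctExec {
    fn name(&self) -> &'static str {
        "UnionDistinctExec"
    }
    fn execute(&self) -> Vec<i64> {
        let mut seen = HashSet::new();
        self.inputs
            .iter()
            .flat_map(|input| input.execute())
            // Keep only the first occurrence of each value.
            .filter(|value| seen.insert(*value))
            .collect()
    }
}
```

Keeping the nodes separate means the plan printed by EXPLAIN names the behaviour directly, and the existing UNION ALL path stays untouched when the distinct variant changes.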

Dandandan · 2021-09-20T17:12:42Z

@xudong963 thanks for opening this PR! I have some comments on the current approach / direction, let me know what you think.

xudong963 · 2021-09-21T11:23:12Z

> thanks for opening this PR! I have some comments on the current approach / direction, let me know what you think.

@Dandandan I agree with your idea in #998 (comment). After finishing the current approach, I also felt it was not good. Thanks for your comments and guidance; I will rework the PR along the lines of #998 (comment). 💪🏻

Dandandan · 2021-09-21T17:35:58Z

> > thanks for opening this PR! I have some comments on the current approach / direction, let me know what you think.
>
> @Dandandan I agree with your idea in #998 (comment). After finishing the current approach, I also felt it was not good. Thanks for your comments and guidance; I will rework the PR along the lines of #998 (comment). 💪🏻

No problem! Thanks for trying and looking forward to the next iteration 👍

xudong963 · 2021-10-01T10:53:00Z

I've finished my awful 24-hour on-call shift. My one-week holiday is coming up, so I'll be able to focus on this PR!

xudong963 · 2021-10-03T14:53:02Z

Closing this PR; the new one is #1068.

* Attempt at caching Jstrings as GlobalRefs in a HashMap to reduce reallocations. I need to confirm 1) there's actually a performance benefit to this, and 2) these GlobalRefs are being released when I want them to be.
* Minor refactor and added more docs.
* Undo import reordering to reduce diff.
* Docs.
* Avoid get() by just cloning the Arc to globalref on insert.
* Store jstring cache in ExecutionContext.

impl union

7169bfc

github-actions bot added datafusion sql SQL Planner labels Sep 20, 2021

Dandandan reviewed Sep 20, 2021

houqp added the enhancement New feature or request label Sep 20, 2021

Dandandan reviewed Sep 20, 2021

alamb changed the title ~~impl union~~ Add support for UNION sql Sep 20, 2021

xudong963 mentioned this pull request Oct 2, 2021

UNION ALL bug: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', ./src/datatypes/schema.rs:165:10 #1064

Closed

xudong963 closed this Oct 3, 2021

alamb mentioned this pull request Oct 4, 2021

Add support for UNION [DISTINCT] sql #1068

Merged
