ARROW-10666: [Rust][DataFusion] Support nested SELECT statements. #8727

drusso · 2020-11-20T20:01:30Z

ARROW-10666 This PR enables nested SELECT statements. Note that table aliases remain unsupported, and no optimizations are made during the planning stages.

github-actions · 2020-11-20T20:13:09Z

https://issues.apache.org/jira/browse/ARROW-10666

jorgecarleitao

Such a great feature with so little code change! Thanks a lot, @drusso!

Could you change the README line `- [ ] Subqueries` to `- [x] Subqueries`? :D

Btw, I think that the optimizations are being applied: these are done after the SQL is planned.

The general flow is:

SQL -- parsing -> Logical Plan -- Optimizers -> Optimized Logical plan -- Physical planner -> Physical plan

So, the plans should be optimized :)
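To make the flow above concrete, here is a minimal sketch with toy types. Every name in it (`LogicalPlan`, `plan_nested_select`, `optimize`, `physical_steps`) is hypothetical and is not DataFusion's actual API. It assumes projections only list plain column names, so a projection stacked directly on another projection can be collapsed into the outer one, which is roughly the shape a nested `SELECT` produces.

```rust
// Toy model of the planning stages; all names are hypothetical,
// not DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    // Read every column of a named table.
    Scan { table: String },
    // Keep only the listed columns of the input plan.
    Projection { columns: Vec<String>, input: Box<LogicalPlan> },
}

// Stand-in for "SQL -- parsing -> Logical Plan": builds the plan for
// SELECT a FROM (SELECT a, b FROM t) by hand instead of parsing text.
fn plan_nested_select() -> LogicalPlan {
    LogicalPlan::Projection {
        columns: vec!["a".to_string()],
        input: Box::new(LogicalPlan::Projection {
            columns: vec!["a".to_string(), "b".to_string()],
            input: Box::new(LogicalPlan::Scan { table: "t".to_string() }),
        }),
    }
}

// Stand-in for a single optimizer rule: a projection directly over
// another projection collapses into the outer projection alone.
// Assumes the outer columns are a subset of the inner ones.
fn optimize(plan: LogicalPlan) -> LogicalPlan {
    match plan {
        LogicalPlan::Projection { columns, input } => match optimize(*input) {
            LogicalPlan::Projection { input: inner, .. } => {
                LogicalPlan::Projection { columns, input: inner }
            }
            other => LogicalPlan::Projection { columns, input: Box::new(other) },
        },
        scan => scan,
    }
}

// Stand-in for the physical planner: lists operators in execution order.
fn physical_steps(plan: &LogicalPlan) -> Vec<String> {
    match plan {
        LogicalPlan::Scan { table } => vec![format!("scan {}", table)],
        LogicalPlan::Projection { columns, input } => {
            let mut steps = physical_steps(input);
            steps.push(format!("project {}", columns.join(", ")));
            steps
        }
    }
}
```

Calling `physical_steps(&optimize(plan_nested_select()))` runs the three stages in order; because the optimizer runs on the logical plan, it applies regardless of whether the plan came from a flat or a nested query.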

andygrove

LGTM. I'm curious to know if you tried adding support for table aliases and ran into issues with that?

drusso · 2020-11-22T14:33:22Z

@jorgecarleitao I was pleasantly surprised by how few changes were required to get this working! I've updated the README.

@andygrove I haven't looked into adding support for table aliasing, which I think is most useful in the context of joins. Since joins are now in master, it's probably a good time to add support.

drusso · 2020-11-22T15:09:37Z

On the topic of table aliasing:

For example:

let df_source = ctx.read_parquet(&parquet_source())?;
let df_in1 = df_source.select_columns(vec!["string_col", "int_col"])?;
let df_in2 = df_source.select_columns(vec!["string_col", "int_col"])?;
let df_join = df_in1.join(df_in2, JoinType::Inner, &["string_col"], &["string_col"])?;
let results = df_join.collect().await?;

Will yield:

Error: Plan("The left schema and the right schema have the following columns with the same name without being on the ON statement: {\"int_col\"}. Consider aliasing them.")

Of course, the workaround is to alias the columns. Are there any plans to handle disambiguation? In PySpark, for example, the equivalent version of the example above would be valid, and columns can be disambiguated with df_in1.int_col and df_in2.int_col.

I ask about plans to handle this in the DataFrame API because the solution there might influence the implementation in the SQL layer.
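For illustration, the kind of check that produces an error like the one above can be sketched as follows. This is a simplified model in which a schema is just a list of column names; `colliding_columns` and `check_join` are hypothetical helpers, not DataFusion functions, and the error text is made up for the sketch.

```rust
use std::collections::BTreeSet;

// Columns that appear on both sides of the join but are not join keys.
fn colliding_columns(left: &[&str], right: &[&str], on: &[&str]) -> BTreeSet<String> {
    let right_set: BTreeSet<&str> = right.iter().copied().collect();
    left.iter()
        .copied()
        .filter(|c| right_set.contains(c) && !on.contains(c))
        .map(String::from)
        .collect()
}

// Rejects the join when any non-key column name is ambiguous.
fn check_join(left: &[&str], right: &[&str], on: &[&str]) -> Result<(), String> {
    let duplicates = colliding_columns(left, right, on);
    if duplicates.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "columns {:?} appear on both sides without being join keys; consider aliasing them",
            duplicates
        ))
    }
}
```

With both inputs carrying `string_col` and `int_col` and a join on `string_col`, `check_join` returns an error naming `int_col`; renaming one side's `int_col` beforehand makes it pass.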

jorgecarleitao · 2020-11-22T15:29:30Z

AFAIK pyspark does not disambiguate:

import pyspark

with pyspark.SparkContext() as sc:
    spark = pyspark.sql.SQLContext(sc)

    df = spark.createDataFrame([
        [1, 2],
        [2, 3],
    ], schema=["id", "id1"])

    df1 = spark.createDataFrame([
        [1, 2],
        [1, 3],
    ], schema=["id", "id1"])

    df.join(df1, on="id").show()

yields

+---+---+---+
| id|id1|id1|
+---+---+---+
|  1|  2|  2|
|  1|  2|  3|
+---+---+---+

on pyspark==2.4.6

In pyspark, writing df.join(df1, on="id").select("id1") errors because the select can't tell which column to select. This IMO is poor judgment: the join itself does not crash, but operating on the resulting table crashes.

I am generally against disambiguation because doing so changes the schema only when columns collide (or do we always add some left_ prefix?). In general, colliding columns require the user to disambiguate them, either before the statement (via alias) or after the statement (via ?.column_name). Raising an error IMO is the best possible outcome as it requires the user to be explicit about what they want.

jorgecarleitao · 2020-11-22T15:31:56Z

Note that this does not impact SQL, as in SQL all tables are named and columns are referred to via a qualified name (e.g. t1.name).

drusso · 2020-11-22T21:31:05Z

Sounds good.

In case it might be of interest, dplyr's inner_join() will add a suffix to any non-joined columns that collide. The suffixes can be explicitly passed as part of the function arguments.
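As a rough sketch of that suffixing behaviour, again treating a schema as a plain list of column names: `suffixed_schema` is a hypothetical helper, and the output column ordering is a simplification rather than a faithful reproduction of dplyr. The `.x` and `.y` values in the usage note are dplyr's default suffixes.

```rust
// Output column names of a join where colliding non-key columns get a
// per-side suffix instead of triggering an error.
fn suffixed_schema(
    left: &[&str],
    right: &[&str],
    on: &[&str],
    suffixes: (&str, &str),
) -> Vec<String> {
    let mut out = Vec::new();
    // Left side: keep join keys as-is, suffix only colliding non-key columns.
    for c in left {
        if !on.contains(c) && right.contains(c) {
            out.push(format!("{}{}", c, suffixes.0));
        } else {
            out.push(c.to_string());
        }
    }
    // Right side: join keys were already emitted once, so skip them.
    for c in right {
        if on.contains(c) {
            continue;
        }
        if left.contains(c) {
            out.push(format!("{}{}", c, suffixes.1));
        } else {
            out.push(c.to_string());
        }
    }
    out
}
```

For the `id`/`id1` tables from the pyspark example, `suffixed_schema(&["id", "id1"], &["id", "id1"], &["id"], (".x", ".y"))` yields `id`, `id1.x`, `id1.y`. Note that this is exactly the schema change on collision that the error-raising approach avoids.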

jorgecarleitao · 2020-11-23T18:17:06Z

@drusso, could you rebase this? We had some CI issues that have since been addressed, so this should now run cleanly on CI.

drusso · 2020-11-24T22:42:11Z

@jorgecarleitao Sure thing, I've rebased the changes.

[ARROW-10666](https://issues.apache.org/jira/browse/ARROW-10666) This PR enables nested `SELECT` statements. Note that table aliases remain unsupported, and no optimizations are made during the planning stages.

Closes apache#8727 from drusso/ARROW-10666

Authored-by: Daniel Russo <danrusso@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

github-actions bot added Component: Rust - DataFusion Component: Rust labels Nov 20, 2020

jorgecarleitao approved these changes Nov 20, 2020

andygrove approved these changes Nov 21, 2020

drusso added 2 commits November 24, 2020 17:34

ARROW-10666: [Rust][DataFusion] Support nested SELECT statements.

e4d7764

ARROW-10666: [Rust][DataFusion] Mark subqueries as completed.

1d0ef88

drusso force-pushed the ARROW-10666 branch from 4626c3f to 1d0ef88 Compare November 24, 2020 22:35

jorgecarleitao closed this in c0a6ab9 Nov 25, 2020

asfimport mentioned this pull request Nov 25, 2020

[Rust] [DataFusion] Support nested SELECT statements #26620

Closed
