Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10666: [Rust][DataFusion] Support nested SELECT statements. #8727

Closed
wants to merge 2 commits into from

Conversation

drusso
Copy link
Contributor

@drusso drusso commented Nov 20, 2020

ARROW-10666 This PR enables nested SELECT statements. Note that table aliases remain unsupported, and no optimizations are made during the planning stages.

@github-actions
Copy link

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a great feature with some little code change! Thanks a lot, @drusso !

Could you change the README line - [ ] Subqueries to - [x] Subqueries ? :D

Btw, I think that the optimizations are being applied: these are done after the SQL is planned.

The general flow is:

SQL -- parsing -> Logical Plan -- Optimizers -> Optimized Logical plan -- Physical planner -> Physical plan

So, the plans should be optimized :)

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm curious to know if you tried adding support for table aliases and ran into issues with that?

@drusso
Copy link
Contributor Author

drusso commented Nov 22, 2020

@jorgecarleitao I was pleasantly surprised by how few changes were required to get this working! I've updated the README.

@andygrove I haven't looked into adding support for table aliasing, which I think is most useful in the context of joins. Since the feature is now in master, it's probably a good time to add support.

@drusso
Copy link
Contributor Author

drusso commented Nov 22, 2020

On the topic of table aliasing:

For example:

let df_source = ctx.read_parquet(&parquet_source())?;
let df_in1 = df_source.select_columns(vec!["string_col", "int_col"])?;
let df_in2 = df_source.select_columns(vec!["string_col", "int_col"])?;
let df_join = df_in1.join(df_in2, JoinType::Inner, &["string_col"], &["string_col"])?;
let results = df_join.collect().await?;

Will yield:

Error: Plan("The left schema and the right schema have the following columns with the same name without being on the ON statement: {\"int_col\"}. Consider aliasing them.")

Of course the workaround is to the alias the columns. Are there any plans to handle disambiguation? In PySpark, for example, the equivalent version of the example above would be valid, and columns can be disambiguated with df_in1.int_col and df_in2.int_col.

The reason I ask about plans to handle this in the DataFrame API is because the solution there might influence the implementation in the SQL layer.

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Nov 22, 2020

AFAIK pyspark does not desambiguate:

import pyspark

with pyspark.SparkContext() as sc:
    spark = pyspark.sql.SQLContext(sc)

    df = spark.createDataFrame([
        [1, 2],
        [2, 3],
    ], schema=["id", "id1"])

    df1 = spark.createDataFrame([
        [1, 2],
        [1, 3],
    ], schema=["id", "id1"])

    df.join(df1, on="id").show()

yields

+---+---+---+                                                                   
| id|id1|id1|
+---+---+---+
|  1|  2|  2|
|  1|  2|  3|
+---+---+---+

on pyspark==2.4.6

In pyspark, writing df.join(df1, on="id").select("id1") errors because the select can't tell which column to select. This IMO is poor judgment: the join itself does not crash, but operating on the resulting table crashes.

I am generally against desambiguation because doing so changes the schema only when columns collide (or do we always add some left_?) In general, colliding columns requires the user to always desambiguate them, either before the statement (via alias) or after the statement (via ?.column_name). Raising an error IMO is the best possible outcome as it requires the user to be explicit about what they want.

@jorgecarleitao
Copy link
Member

Note that this does not impact SQL, as SQL all tables are named and columns are referred via a qualified name (e.g. t1.name)

@drusso
Copy link
Contributor Author

drusso commented Nov 22, 2020

Sounds good.

In case it might be of interest, dplyr's inner_join() will add a suffix to any non-joined column that collide. The suffixes can be explicitly passed as part of the function arguments.

@jorgecarleitao
Copy link
Member

@drusso , could you rebase this? We had some issues with the CI that were addressed, so you should be able to have this run on CI clean now.

@drusso
Copy link
Contributor Author

drusso commented Nov 24, 2020

@jorgecarleitao Sure thing, I've rebased the changes.

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
[ARROW-10666](https://issues.apache.org/jira/browse/ARROW-10666) This PR enables nested `SELECT` statements. Note that table aliases remain unsupported, and no optimizations are made during the planning stages.

Closes apache#8727 from drusso/ARROW-10666

Authored-by: Daniel Russo <danrusso@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants