ARROW-10666: [Rust][DataFusion] Support nested SELECT statements. #8727
Conversation
Such a great feature with so little code change! Thanks a lot, @drusso!

Could you change the README line `- [ ] Subqueries` to `- [x] Subqueries`? :D
Btw, I think that the optimizations are being applied: these are done after the SQL is planned.
The general flow is:
`SQL -- parsing --> Logical Plan -- optimizers --> Optimized Logical Plan -- physical planner --> Physical Plan`
So, the plans should be optimized :)
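As a toy illustration of the flow above (plain-Python stand-ins, not DataFusion's actual API; `plan_query` and the stage functions are hypothetical), the pipeline is just a sequence of transformations applied in order, with the optimizers running on the logical plan before physical planning:

```python
# Toy sketch of the planning pipeline described above.
# The stage functions are hypothetical stand-ins, not DataFusion's API.
def plan_query(sql, parse, optimizers, physical_planner):
    logical_plan = parse(sql)             # SQL -> logical plan
    for optimize in optimizers:           # logical optimizations run here,
        logical_plan = optimize(logical_plan)  # after parsing/planning
    return physical_planner(logical_plan)      # optimized plan -> physical plan

# Example with trivial stand-in stages:
stages = plan_query(
    "SELECT a FROM t",
    parse=lambda sql: ["Projection", "TableScan"],
    optimizers=[lambda plan: plan],            # identity "optimizer"
    physical_planner=lambda plan: plan + ["ExecutionPlan"],
)
```

The point is only the ordering: whatever the SQL planner produces is fed through the optimizers before a physical plan is built, so nested `SELECT`s get optimized like any other plan.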
LGTM. I'm curious to know if you tried adding support for table aliases and ran into issues with that?
@jorgecarleitao I was pleasantly surprised by how few changes were required to get this working! I've updated the README. @andygrove I haven't looked into adding support for table aliasing, which I think is most useful in the context of joins. Since the feature is now in master, it's probably a good time to add support.
On the topic of table aliasing, for example:

```rust
let df_source = ctx.read_parquet(&parquet_source())?;
let df_in1 = df_source.select_columns(vec!["string_col", "int_col"])?;
let df_in2 = df_source.select_columns(vec!["string_col", "int_col"])?;
let df_join = df_in1.join(df_in2, JoinType::Inner, &["string_col"], &["string_col"])?;
let results = df_join.collect().await?;
```

will yield:
Of course the workaround is to alias the columns. Are there any plans to handle disambiguation? In PySpark, for example, the equivalent version of the example above would be valid, and the colliding columns can still be disambiguated. The reason I ask about plans to handle this in the DataFrame API is that the solution there might influence the implementation in the SQL layer.
AFAIK pyspark does not disambiguate:

```python
import pyspark

with pyspark.SparkContext() as sc:
    spark = pyspark.sql.SQLContext(sc)
    df = spark.createDataFrame([
        [1, 2],
        [2, 3],
    ], schema=["id", "id1"])
    df1 = spark.createDataFrame([
        [1, 2],
        [1, 3],
    ], schema=["id", "id1"])
    df.join(df1, on="id").show()
```

yields a result with two `id1` columns; only the join key `id` appears once. I am generally against disambiguation because doing so changes the schema only when columns collide (or do we always add some suffix?).
Note that this does not impact SQL: in SQL all tables are named and columns are referred to via a qualified name (e.g. `table.column`).
Sounds good. In case it might be of interest, dplyr's `inner_join()` will add a suffix to any non-joined columns that collide. The suffixes can be passed explicitly as function arguments.
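A minimal sketch of that dplyr-style suffixing in plain Python (`suffix_collisions` is a hypothetical helper for illustration, not part of dplyr or DataFusion):

```python
def suffix_collisions(left_cols, right_cols, join_keys, suffixes=(".x", ".y")):
    """Rename non-join columns that appear on both sides, dplyr-style."""
    # Columns present on both sides but not used as join keys collide.
    collisions = (set(left_cols) & set(right_cols)) - set(join_keys)
    lsuf, rsuf = suffixes
    left = [c + lsuf if c in collisions else c for c in left_cols]
    right = [c + rsuf if c in collisions else c for c in right_cols]
    return left, right

# With the schemas from the pyspark example above:
left, right = suffix_collisions(["id", "id1"], ["id", "id1"], join_keys=["id"])
# left  -> ["id", "id1.x"]
# right -> ["id", "id1.y"]
```

The join key keeps its name while the colliding `id1` columns get distinct suffixes, so the output schema is always unambiguous; the trade-off jorgecarleitao raises is that column names then depend on whether a collision occurred.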
@drusso, could you rebase this? We had some issues with the CI that have since been addressed, so you should get a clean CI run now.
@jorgecarleitao Sure thing, I've rebased the changes.
[ARROW-10666](https://issues.apache.org/jira/browse/ARROW-10666) This PR enables nested `SELECT` statements. Note that table aliases remain unsupported, and no optimizations are made during the planning stages. Closes apache#8727 from drusso/ARROW-10666 Authored-by: Daniel Russo <danrusso@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>