Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly #8910

Closed
wants to merge 2 commits into from

Conversation

rdettai
Copy link
Contributor

@rdettai rdettai commented Dec 14, 2020

Currently, the TableScan logical plan is quite complex. It can either reference a provider or a table name that is registered to the context.

This issue is about linking the logical plan to the TableProvider directly upon parsing of the SQL. This allows to simplify greatly the code and also makes the TableScan plan easier to use by external query plan manipulations.

https://issues.apache.org/jira/browse/ARROW-10900

@rdettai
Copy link
Contributor Author

rdettai commented Dec 14, 2020

Notes about the code:

  • we should create a helper that projects a schema on Option<Vec<usize>> to avoid having the same code duplicated all over the place, but I think this should be done in a separate PR
  • I removed the schema_name field as it isn't used anywhere (yagni)
  • I created an EmptyTable, which is convenient for tests but might also be useful for all kinds of plan manipulation where the actual data is not important.
  • I think the most debatable change is about the SchemaProvider trait in the sql planner. It was actually not a schema provider as it could already provide other info about UDFs / UDAFs. So I completely renamed it to ContextProvider, and instead of giving access to only the schema part of the table, if gives access to the entire TableProvider.

@github-actions
Copy link

@codecov-io
Copy link

Codecov Report

Merging #8910 (a551cc6) into master (5353c28) will increase coverage by 0.02%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8910      +/-   ##
==========================================
+ Coverage   75.48%   75.51%   +0.02%     
==========================================
  Files         181      182       +1     
  Lines       41649    41633      -16     
==========================================
  Hits        31439    31439              
+ Misses      10210    10194      -16     
Impacted Files Coverage Δ
rust/datafusion/src/datasource/empty.rs 0.00% <0.00%> (ø)
rust/datafusion/src/execution/context.rs 0.00% <0.00%> (ø)
rust/datafusion/src/logical_plan/builder.rs 0.00% <0.00%> (ø)
rust/datafusion/src/logical_plan/plan.rs 0.00% <0.00%> (ø)
...t/datafusion/src/optimizer/projection_push_down.rs 0.00% <ø> (ø)
rust/datafusion/src/physical_plan/planner.rs 0.00% <0.00%> (ø)
rust/datafusion/src/sql/planner.rs 0.00% <0.00%> (ø)
rust/datafusion/tests/dataframe.rs 0.00% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8ae596a...a551cc6. Read the comment docs.

@rdettai
Copy link
Contributor Author

rdettai commented Dec 14, 2020

@XiaokunDing @seddonm1 @Dandandan this might conflict a bit with your changes around statistics

@jorgecarleitao
Copy link
Member

Funny thing, I was planning to work on this exact same change ❤️ (removing the TableProvider, offering the provider directly, and making the planner call provider.scan). I agree with this design ^_^

Rename also looks good to me.

@rdettai
Copy link
Contributor Author

rdettai commented Dec 14, 2020

I was planning to work on this exact same change

great!

removing the TableProvider, offering the provider directly, and making the planner call provider.scan

Not sure what you mean here! Isn't TableProvider what makes the datasource part of the query engine easily extensible?

Side note: the aim of this work is to prepare https://issues.apache.org/jira/browse/ARROW-10902. Feel free to take a look at it!

@rdettai rdettai changed the title ARROW-10783: [Rust] [DataFusion] Resolve TableScan provider eagerly ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly Dec 14, 2020
@jorgecarleitao
Copy link
Member

Not sure what you mean here!

I meant TableSource :/

@github-actions
Copy link


/// Scan an empty data source, mainly used in tests
pub fn scan_empty(
name: &str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can remove this parameter and provide scan() a placeholder like ""?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I did originally, to match the other similar methods in the block (scan_csv, scan_memory...). But this one is used by many tests in the codebase, and often they are using names, so we would have needed to edit asserts all over the place. I find that it does not do much harm to leave the it here, but if there is a consensus on the fact that its better removed, I don't mind!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I hadn't considered that. This change can be done in future PR if needed, so I agree we can keep them here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping the name is reasonable here -- sometimes the name is needed when referring to the table in SQL, however, as this is part of the PlanBuilder I agree it isn't obviously going to be useful

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Will wait for @alamb as it touches code that he is familiar with.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the PR and I agree it looks good -- thank you @rdettai


/// Scan an empty data source, mainly used in tests
pub fn scan_empty(
name: &str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping the name is reasonable here -- sometimes the name is needed when referring to the table in SQL, however, as this is part of the PlanBuilder I agree it isn't obviously going to be useful

@alamb alamb closed this in 26aef88 Dec 15, 2020
alamb pushed a commit that referenced this pull request Dec 19, 2020
…for TableProvider implementations

I've got a use case for this with a custom TableProvider implementation, so thought I'd give this a go :)

This PR allows TableProviders to optionally indicate that they support handling filter expressions either:
- Inexactly, to simply optimise data retrieval in an approximate fashion; e.g. pruning in your classic chunked storage system with min/max column metadata stored per chunk
- Exactly, in which case the relevant filter plan nodes can be optimised out entirely

Some preemptive concerns from my side:
- Most of these concepts could probably have better names, open to suggestions here.
- I'm not sure whether expressions are the correct thing to be pushing down to the provider.
- I've had to update quite a few `scan` callsites with empty filter lists. Could this be handled in a better way?
- Currently, only table scans using TableSource::FromProvider are supported, because we need a reference to the provider at optimisation time. #8910 removes the provider/named-based reference distinction entirely so I can rebase this once that's merged and add an extra test using an ordinary sql statement, rather than just a `ctx.read_table(provider)` call.

I'd appreciate any thoughts or feedback!

Closes #8917 from returnString/table_provider_pushdown

Authored-by: Ruan Pearce-Authers <ruanpa@outlook.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
> Currently, the TableScan logical plan is quite complex. It can either reference a provider or a table name that is registered to the context.
>
> This issue is about linking the logical plan to the TableProvider directly upon parsing of the SQL. This allows to simplify greatly the code and also makes the TableScan plan easier to use by external query plan manipulations.

https://issues.apache.org/jira/browse/ARROW-10900

Closes apache#8910 from rdettai/ARROW-10900-eager-tablescan

Authored-by: rdettai <rdettai@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…for TableProvider implementations

I've got a use case for this with a custom TableProvider implementation, so thought I'd give this a go :)

This PR allows TableProviders to optionally indicate that they support handling filter expressions either:
- Inexactly, to simply optimise data retrieval in an approximate fashion; e.g. pruning in your classic chunked storage system with min/max column metadata stored per chunk
- Exactly, in which case the relevant filter plan nodes can be optimised out entirely

Some preemptive concerns from my side:
- Most of these concepts could probably have better names, open to suggestions here.
- I'm not sure whether expressions are the correct thing to be pushing down to the provider.
- I've had to update quite a few `scan` callsites with empty filter lists. Could this be handled in a better way?
- Currently, only table scans using TableSource::FromProvider are supported, because we need a reference to the provider at optimisation time. apache#8910 removes the provider/named-based reference distinction entirely so I can rebase this once that's merged and add an extra test using an ordinary sql statement, rather than just a `ctx.read_table(provider)` call.

I'd appreciate any thoughts or feedback!

Closes apache#8917 from returnString/table_provider_pushdown

Authored-by: Ruan Pearce-Authers <ruanpa@outlook.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants