ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly #8910
Notes about the code:
Codecov Report
@@ Coverage Diff @@
## master #8910 +/- ##
==========================================
+ Coverage 75.48% 75.51% +0.02%
==========================================
Files 181 182 +1
Lines 41649 41633 -16
==========================================
Hits 31439 31439
+ Misses 10210 10194 -16
@XiaokunDing @seddonm1 @Dandandan this might conflict a bit with your changes around statistics.
Funny thing, I was planning to work on this exact same change ❤️ (removing the Rename also looks good to me).
great!
Not sure what you mean here! Isn't
Side note: the aim of this work is to prepare https://issues.apache.org/jira/browse/ARROW-10902. Feel free to take a look at it!
I meant
/// Scan an empty data source, mainly used in tests
pub fn scan_empty(
    name: &str,
Maybe we can remove this parameter and provide `scan()` a placeholder like `""`?
This is what I did originally, to match the other similar methods in the block (`scan_csv`, `scan_memory`...). But this one is used by many tests in the codebase, and they often rely on names, so we would have needed to edit asserts all over the place. I find that it does not do much harm to leave it here, but if there is a consensus that it's better removed, I don't mind!
Makes sense, I hadn't considered that. This change can be done in a future PR if needed, so I agree we can keep it here.
I think keeping the name is reasonable here -- sometimes the name is needed when referring to the table in SQL. However, as this is part of the PlanBuilder, I agree it isn't obviously going to be useful.
LGTM. Will wait for @alamb as it touches code that he is familiar with.
I went through the PR and I agree it looks good -- thank you @rdettai
…for TableProvider implementations

I've got a use case for this with a custom TableProvider implementation, so thought I'd give this a go :)

This PR allows TableProviders to optionally indicate that they support handling filter expressions either:
- Inexactly, to simply optimise data retrieval in an approximate fashion; e.g. pruning in your classic chunked storage system with min/max column metadata stored per chunk
- Exactly, in which case the relevant filter plan nodes can be optimised out entirely

Some preemptive concerns from my side:
- Most of these concepts could probably have better names, open to suggestions here.
- I'm not sure whether expressions are the correct thing to be pushing down to the provider.
- I've had to update quite a few `scan` callsites with empty filter lists. Could this be handled in a better way?
- Currently, only table scans using TableSource::FromProvider are supported, because we need a reference to the provider at optimisation time. #8910 removes the provider/named-based reference distinction entirely so I can rebase this once that's merged and add an extra test using an ordinary sql statement, rather than just a `ctx.read_table(provider)` call.

I'd appreciate any thoughts or feedback!

Closes #8917 from returnString/table_provider_pushdown

Authored-by: Ruan Pearce-Authers <ruanpa@outlook.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
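To make the inexact/exact distinction in the commit message above concrete, here is a minimal, self-contained sketch of how an optimiser might split filters depending on what a provider reports. It is not the code from #8917: `FilterSupport`, `Filter`, `ChunkedTable`, `PushdownResult`, `push_down`, and this cut-down `TableProvider` trait are all hypothetical names invented for illustration, and the filter is reduced to a single `column >= min_value` form.

```rust
/// Hypothetical: how well a provider can apply a given filter itself.
#[derive(Debug, Clone, PartialEq)]
pub enum FilterSupport {
    /// The provider ignores the filter; the plan must keep it.
    Unsupported,
    /// The provider may prune some data, but can still return non-matching rows.
    Inexact,
    /// The provider guarantees that only matching rows are returned.
    Exact,
}

/// Hypothetical, heavily simplified filter: `column >= min_value`.
#[derive(Debug, Clone, PartialEq)]
pub struct Filter {
    pub column: String,
    pub min_value: i64,
}

/// Hypothetical cut-down stand-in for a table provider trait.
pub trait TableProvider {
    fn supports_filter(&self, filter: &Filter) -> FilterSupport;
}

/// Hypothetical chunked store: per-chunk min/max metadata on `ts` allows
/// approximate pruning, while `partition` is assumed to be filtered exactly.
pub struct ChunkedTable;

impl TableProvider for ChunkedTable {
    fn supports_filter(&self, filter: &Filter) -> FilterSupport {
        match filter.column.as_str() {
            "ts" => FilterSupport::Inexact,
            "partition" => FilterSupport::Exact,
            _ => FilterSupport::Unsupported,
        }
    }
}

/// Outcome of the pushdown: what the scan receives and what stays in the plan.
#[derive(Debug, Default, PartialEq)]
pub struct PushdownResult {
    pub pushed_to_scan: Vec<Filter>,
    pub remaining_in_plan: Vec<Filter>,
}

/// Splits `filters` according to the provider's answer for each one.
pub fn push_down(provider: &dyn TableProvider, filters: Vec<Filter>) -> PushdownResult {
    let mut result = PushdownResult::default();
    for filter in filters {
        match provider.supports_filter(&filter) {
            // Not handled by the provider: keep the filter node only.
            FilterSupport::Unsupported => result.remaining_in_plan.push(filter),
            // Approximate: push it down for pruning, but still re-check in the plan.
            FilterSupport::Inexact => {
                result.pushed_to_scan.push(filter.clone());
                result.remaining_in_plan.push(filter);
            }
            // Exact: the filter node can be optimised out entirely.
            FilterSupport::Exact => result.pushed_to_scan.push(filter),
        }
    }
    result
}
```

In this sketch an inexact filter appears in both lists, which mirrors the point that only exactly handled filters can be dropped from the plan.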
> Currently, the TableScan logical plan is quite complex. It can either reference a provider or a table name that is registered to the context.
>
> This issue is about linking the logical plan to the TableProvider directly upon parsing of the SQL. This allows to simplify greatly the code and also makes the TableScan plan easier to use by external query plan manipulations.

https://issues.apache.org/jira/browse/ARROW-10900

Closes apache#8910 from rdettai/ARROW-10900-eager-tablescan

Authored-by: rdettai <rdettai@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
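As a rough illustration of what "resolving the provider eagerly" means, the sketch below looks the table up in the context at plan-building time and stores the provider in the scan node, so later passes never need the name registry. It is not the DataFusion code: `Context`, `EmptyTable`, the fields of `TableScan`, and this one-method `TableProvider` trait are hypothetical simplifications.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, heavily simplified stand-in for a table provider trait.
pub trait TableProvider {
    /// Column names exposed by the table.
    fn column_names(&self) -> Vec<String>;
}

/// Hypothetical provider with a fixed list of columns and no rows.
pub struct EmptyTable {
    pub columns: Vec<String>,
}

impl TableProvider for EmptyTable {
    fn column_names(&self) -> Vec<String> {
        self.columns.clone()
    }
}

/// Eagerly resolved scan node: it carries the provider itself, and the
/// name is kept only for display or for referring to the table in SQL.
pub struct TableScan {
    pub table_name: String,
    pub source: Arc<dyn TableProvider>,
}

/// Hypothetical execution context holding the registered tables.
#[derive(Default)]
pub struct Context {
    tables: HashMap<String, Arc<dyn TableProvider>>,
}

impl Context {
    /// Registers `provider` under `name`, replacing any previous entry.
    pub fn register_table(&mut self, name: &str, provider: Arc<dyn TableProvider>) {
        self.tables.insert(name.to_string(), provider);
    }

    /// Builds a scan node, resolving the name to a provider immediately.
    /// An unknown table is reported here rather than at optimisation or
    /// execution time.
    pub fn scan(&self, name: &str) -> Result<TableScan, String> {
        let source = self
            .tables
            .get(name)
            .cloned()
            .ok_or_else(|| format!("no table named '{}'", name))?;
        Ok(TableScan {
            table_name: name.to_string(),
            source,
        })
    }
}
```

With this shape, anything that receives a `TableScan` can query `source` directly, which is the property the filter-pushdown work in #8917 says it needs at optimisation time.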
https://issues.apache.org/jira/browse/ARROW-10900