-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-34519: [C++][R] Fix dataset scans that project the same name as a field #34576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking at the C++ change it seems like it would technically allow something like make_struct(add(field_ref("x"), 2), field_ref("y")) which would end up loading x and y. However, I think we can classify that as "garbage in / garbage out".
I'm marking "approve" because the C++ change seems fine. However, I'm not sure I understand the root problem yet.
|
Ok, I put in some debugging and ran your test case. The way R is building scan options the value of Under the old model we would skip However, we would have not realized you wanted to read in any fields at all because we skip However, that puts the burden of calculating the materialized fields squarely on R's shoulders. Also, this is more or less the direction the scan node itself is going. In ScanV2 the "projection" will just be a list of "columns to load". Long term, this seems inevitable. For example, I think this approach may fail with a query like The solution (I think) is a proper "pushdown projection" pass. First, R creates a plan with a scan node that is configured to load everything. Then, in a push down projection pass, the actual scan projection is calculated. Ideally this would be in C++ but it kind of breaks the old philosophy of "zero-optimization" |
It would, the join code in R ensures that the join keys are present at the join step, then the select of left.a happens after. Fairly confident we have tests covering this, but easy enough to verify. If it helps our confidence in this change, this is essentially restoring the code we had in the R C++ bindings before the most recent change, which instead of copying the code from the R package, it adapted a function that already existed in scanner.cc (and apparently wasn't always right). |
|
I'm fine with this change so don't let this discussion block merging.
Right. But I'm more worried about What expression is assigned to the scan in that case? Is it the expression from the first projection (would be wrong), the expression from the second projection (would be wrong), or are you actually doing some kind of pushdown logic in R? Or maybe you have some way of preventing this case? |
|
Maybe I misunderstand your concerns, but here's how the R query building works. So in the case of the projection of |
|
Further digression: the ScanNode ToString method doesn't print anything useful, so we can't verify what filter and column predicates it has. I thought I filed an issue about that at some point but I can't find it now. |
This was the context I was missing (this is different than SQL for example). Given this context I believe your fix will work correctly (I'm going to merge it)
I've added #34625 to make sure we handle this in the new scan node. |
thisisnic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
Benchmark runs are scheduled for baseline = 1f8a335 and contender = f2c3928. f2c3928 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
['Python', 'R'] benchmarks have high level of regressions. |
Rationale for this change
Fixes #34519. #33770 introduced the bug; I had asked in the review why the C++ function wasn't using
FieldsInExpression. I swapped that in, and the test I added to reproduce the bug now passes.What changes are included in this PR?
Fix for the C++ function, test in R.
Are these changes tested?
Yes
Are there any user-facing changes?
The behavior observed in the report no longer happens.