GH-34519: [C++][R] Fix dataset scans that project the same name as a field #34576

nealrichardson · 2023-03-15T17:47:05Z

Rationale for this change

Fixes #34519. #33770 introduced the bug; I had asked in the review why the C++ function wasn't using FieldsInExpression. I swapped that in, and the test I added to reproduce the bug now passes.

What changes are included in this PR?

Fix for the C++ function, test in R.

Are these changes tested?

Yes

Are there any user-facing changes?

The behavior observed in the report no longer happens.

Closes: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values #34519

github-actions · 2023-03-15T17:47:34Z

Closes: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values #34519

github-actions · 2023-03-15T17:47:37Z

⚠️ GitHub issue #34519 has been automatically assigned in GitHub to PR creator.

westonpace

Just looking at the C++ change, it seems like it would technically allow something like make_struct(add(field_ref("x"), 2), field_ref("y")), which would end up loading x and y. However, I think we can classify that as "garbage in / garbage out".

I'm marking "approve" because the C++ change seems fine. However, I'm not sure I understand the root problem yet.

westonpace · 2023-03-15T19:53:01Z

Ok, I put in some debugging and ran your test case. The way R builds scan options, the value of projection is not "the output of the scan" but "the output of the scan + project". In other words, given SELECT x, y, x+y the built plan would be...

Scan {
  projection=make_struct(field_ref(x), field_ref(y), add(field_ref(y), field_ref(x)))
}
Project {
  exprs=[
    field_ref(0),
    field_ref(1),
    add(field_ref(1), field_ref(0))
  ]
}

Under the old model we would skip add(...) in ScanOptions::projection because it was not a field ref. This would have been fine for the above query. However, given the query SELECT cast(x, string), R would give us:

Scan {
  projection=make_struct(cast(field_ref(x), string))
}
Project {
  exprs=[
    cast(field_ref(0), string)
  ]
}

However, we would not have realized you wanted to read in any fields at all, because we skip cast(...). What I had expected (back when I updated the projection handling code) was something like...

Scan {
  projection=make_struct(field_ref(x))
}
Project {
  exprs=[
    cast(field_ref(0), string)
  ]
}

However, that puts the burden of calculating the materialized fields squarely on R's shoulders.
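To make the difference concrete, here is a minimal sketch of the two ways of deciding which fields a projection needs. It uses a hypothetical toy expression tree (Expr, FieldRef, Call and both collector functions are invented for illustration and are not Arrow's compute::Expression API), so treat it as a model of the behavior described above rather than the actual scanner code.

```cpp
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy expression tree (hypothetical, NOT Arrow's compute::Expression).
struct Expr {
  std::string name;                         // field name, or function name for calls
  std::vector<std::shared_ptr<Expr>> args;  // empty for field references
  bool is_field_ref = false;
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr FieldRef(const std::string& name) {
  auto e = std::make_shared<Expr>();
  e->name = name;
  e->is_field_ref = true;
  return e;
}

ExprPtr Call(const std::string& fn, std::vector<ExprPtr> args) {
  auto e = std::make_shared<Expr>();
  e->name = fn;
  e->args = std::move(args);
  return e;
}

// "Old model" as described in the thread: only direct field_ref children of
// the top-level make_struct count; cast(...), add(...), etc. are skipped.
std::set<std::string> TopLevelFieldRefsOnly(const ExprPtr& projection) {
  std::set<std::string> fields;
  for (const auto& arg : projection->args) {
    if (arg->is_field_ref) fields.insert(arg->name);
  }
  return fields;
}

// Recursive helper: record every field referenced anywhere in the tree.
void CollectFields(const ExprPtr& expr, std::set<std::string>* out) {
  if (expr->is_field_ref) {
    out->insert(expr->name);
    return;
  }
  for (const auto& arg : expr->args) CollectFields(arg, out);
}

// Recursive model: all fields used by the projection, however deeply nested.
std::set<std::string> AllReferencedFields(const ExprPtr& projection) {
  std::set<std::string> fields;
  CollectFields(projection, &fields);
  return fields;
}
```

For make_struct(cast(field_ref(x))), the top-level-only collector finds no fields at all, while the recursive one finds x. For make_struct(field_ref(x), field_ref(y), add(field_ref(y), field_ref(x))) both agree, which is consistent with the first query working under the old model.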

Also, this is more or less the direction the scan node itself is going. In ScanV2 the "projection" will just be a list of "columns to load".

Long term, this seems inevitable. For example, I think this approach may fail with a query like SELECT left.a FROM left INNER JOIN right ON left.id = right.id WHERE left.b > right.b. The fact that you have to read left.b is not immediately evident until after the join. So I don't think R is going to give us left.b in the projection for the scan options.

The solution (I think) is a proper "pushdown projection" pass. First, R creates a plan with a scan node that is configured to load everything. Then, in a pushdown projection pass, the actual scan projection is calculated. Ideally this would be in C++, but it kind of breaks the old philosophy of "zero-optimization".
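As a rough illustration of what such a pass could look like, the sketch below walks a toy linear pipeline from the last node back to the scan and works out which columns the scan has to load. Everything here (Kind, Node, PushDownProjection) is hypothetical and much simpler than a real plan: there are no joins, and each node just declares which columns it reads.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy plan nodes; none of these types exist in Arrow.
enum class Kind { kScan, kFilter, kProject };

struct Node {
  Kind kind;
  // kScan: columns to load (filled in by the pass).
  // kFilter: columns the predicate reads.
  std::set<std::string> columns;
  // kProject: output name -> input columns its expression reads.
  std::map<std::string, std::set<std::string>> outputs;
};

// Walk a linear pipeline (plan.front() is the scan) from the sink back to the
// source, tracking which columns are still required at each step. Stores the
// result on the scan node and returns it.
std::set<std::string> PushDownProjection(std::vector<Node>& plan,
                                         std::set<std::string> required) {
  for (auto it = plan.rbegin(); it != plan.rend(); ++it) {
    switch (it->kind) {
      case Kind::kProject: {
        // Replace each required output by the inputs needed to compute it.
        std::set<std::string> needed_inputs;
        for (const auto& name : required) {
          auto found = it->outputs.find(name);
          if (found == it->outputs.end()) continue;  // not produced here
          needed_inputs.insert(found->second.begin(), found->second.end());
        }
        required = std::move(needed_inputs);
        break;
      }
      case Kind::kFilter:
        // A filter passes columns through but also reads its own inputs.
        required.insert(it->columns.begin(), it->columns.end());
        break;
      case Kind::kScan:
        it->columns = required;
        break;
    }
  }
  return required;
}
```

With a plan of Scan, Filter on b, and Project producing a_str from a, asking for a_str yields a scan of {a, b}: the filter column is picked up even though it never appears in the final output, which is the kind of dependency the join example is worried about.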

nealrichardson · 2023-03-15T20:05:25Z

> So I don't think R is going to give us left.b in the projection for the scan options.

It would, the join code in R ensures that the join keys are present at the join step, then the select of left.a happens after. Fairly confident we have tests covering this, but easy enough to verify.

If it helps our confidence in this change, this is essentially restoring the code we had in the R C++ bindings before the most recent change. That change, instead of copying the code from the R package, adapted a function that already existed in scanner.cc (and apparently wasn't always right).

westonpace · 2023-03-15T22:09:59Z

I'm fine with this change so don't let this discussion block merging.

> It would, the join code in R ensures that the join keys are present at the join step, then the select of left.a happens after

Right. But I'm more worried about left.b. Maybe the case I came up with isn't good enough but I know there are some queries where you will inevitably end up with:

Scan -> Project -> Join -> Project

What expression is assigned to the scan in that case? Is it the expression from the first projection (would be wrong), the expression from the second projection (would be wrong), or are you actually doing some kind of pushdown logic in R? Or maybe you have some way of preventing this case?

nealrichardson · 2023-03-16T13:57:17Z

Maybe I misunderstand your concerns, but here's how the R query building works. select/mutate/filter build and modify a projection expression and filter expression. At scan time (collect), those get pushed down into the ScanNode when querying on a dataset. Aggregations and joins are different, in that they essentially wrap the preceding parts of the query in a black box, and nothing that happens after them modifies any previous steps.

So in the case of

Scan -> Project[0] -> Join -> Project[1]

the projection of Project[0] is pushed into Scan. Join can only specify join keys based on the columns in Project[0]: if b isn't in Project[0], you can't reference it in Join. We don't (currently) inspect Project[1] to see if further columns could be pruned from steps prior to Join and pushed down, i.e. Project[1] doesn't alter Project[0] and thus not Scan.
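A small model of that behavior, under stated assumptions: the Query type below is hypothetical (it is not the arrow R package's implementation), it tracks column names only, and it ignores the columns a join would bring in from the right-hand table. It only illustrates that steps before a join shape the scan, the join can reference only what the preceding projection kept, and steps after the join leave the scan alone.

```cpp
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical model of the query-building rules described above.
struct Query {
  std::set<std::string> scan_columns;  // what would be pushed into the scan
  std::set<std::string> visible;       // columns available to the next step
  bool sealed = false;                 // true once a join wraps earlier steps

  explicit Query(std::set<std::string> all_columns)
      : scan_columns(all_columns), visible(std::move(all_columns)) {}

  // select(): narrows the visible columns. Before a join this also narrows
  // the scan; after a join it only affects later steps.
  Query& Select(const std::set<std::string>& cols) {
    for (const auto& c : cols) {
      if (visible.count(c) == 0) {
        throw std::invalid_argument("unknown column: " + c);
      }
    }
    visible = cols;
    if (!sealed) scan_columns = cols;
    return *this;
  }

  // join(): the key must survive the preceding projection; afterwards the
  // earlier steps are treated as a black box.
  Query& JoinOn(const std::string& key) {
    if (visible.count(key) == 0) {
      throw std::invalid_argument("join key not in projection: " + key);
    }
    sealed = true;
    return *this;
  }
};
```

In this model, selecting {id, a}, joining on id, then selecting {a} leaves the scan loading {id, a}: the final select plays the role of Project[1] and does not prune anything upstream. Trying to join on b after it has been dropped by the first select throws, mirroring "if b isn't in Project[0], you can't reference it in Join".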

nealrichardson · 2023-03-16T15:46:49Z

Further digression: the ScanNode ToString method doesn't print anything useful, so we can't verify what filter and column predicates it has. I thought I filed an issue about that at some point but I can't find it now.

westonpace · 2023-03-17T21:43:30Z

> Join can only specify join keys based on the columns in Project[0]: if b isn't in Project[0], you can't reference it in Join.

This was the context I was missing (this is different from SQL, for example). Given this context, I believe your fix will work correctly (I'm going to merge it).

> Further digression: the ScanNode ToString method doesn't print anything useful, so we can't verify what filter and column predicates it has. I thought I filed an issue about that at some point but I can't find it now.

I've added #34625 to make sure we handle this in the new scan node.

thisisnic

Thanks!

ursabot · 2023-03-21T23:45:26Z

Benchmark runs are scheduled for baseline = 1f8a335 and contender = f2c3928. f2c3928 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.64% ⬆️0.03%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] f2c39287 ec2-t3-xlarge-us-east-2
[Finished] f2c39287 test-mac-arm
[Finished] f2c39287 ursa-i9-9960x
[Finished] f2c39287 ursa-thinkcentre-m75q
[Finished] 1f8a335d ec2-t3-xlarge-us-east-2
[Finished] 1f8a335d test-mac-arm
[Finished] 1f8a335d ursa-i9-9960x
[Finished] 1f8a335d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-03-21T23:47:01Z

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

GH-34519: [C++][R] Fix dataset scans that project the same name as a field (commit 6d09057)

nealrichardson requested review from paleolimbot, thisisnic and westonpace as code owners March 15, 2023 17:47

github-actions bot added the Component: C++, Component: R, and awaiting review labels Mar 15, 2023

westonpace approved these changes Mar 15, 2023

github-actions bot added the awaiting merge label and removed the awaiting review label Mar 15, 2023

thisisnic approved these changes Mar 21, 2023

westonpace merged commit f2c3928 into apache:main Mar 21, 2023

nealrichardson deleted the fix-34519 branch March 21, 2023 17:48
