Skip to content

Conversation

@nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Mar 15, 2023

Rationale for this change

Fixes #34519. #33770 introduced the bug; I had asked in the review why the C++ function wasn't using FieldsInExpression. I swapped that in, and the test I added to reproduce the bug now passes.

What changes are included in this PR?

Fix for the C++ function, test in R.

Are these changes tested?

Yes

Are there any user-facing changes?

The behavior observed in the report no longer happens.

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #34519 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking at the C++ change it seems like it would technically allow something like make_struct(add(field_ref("x"), 2), field_ref("y")) which would end up loading x and y. However, I think we can classify that as "garbage in / garbage out".

I'm marking "approve" because the C++ change seems fine. However, I'm not sure I understand the root problem yet.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Mar 15, 2023
@westonpace
Copy link
Member

westonpace commented Mar 15, 2023

Ok, I put in some debugging and ran your test case. The way R is building scan options the value of projection is not "the output of the scan" but "the output of the scan + project". So, in other words, given SELECT x, y, x+y the built plan would be...

Scan {
  projection=make_struct(field_ref(x), field_ref(y), add(field_ref(y), field_ref(x))
}
Project {
  exprs=[
    field_ref(0),
    field_ref(1),
    add(field_ref(1), field_ref(0))
  ]
}

Under the old model we would skip add(...) in ScanOptions::projection because it was not a field ref. This would have been fine for the above query. However, given the query SELECT cast(x, string) then R would give us:

Scan {
  projection=make_struct(cast(field_ref(x), string))
}
Project {
  exprs=[
    cast(field_ref(0), string)
  ]
}

However, we would have not realized you wanted to read in any fields at all because we skip cast(...). What I had expected (back when I updated the projection handling code) was something like...

Scan {
  projection=make_struct(field_ref(x))
}
Project {
  exprs=[
    cast(field_ref(0), string)
  ]
}

However, that puts the burden of calculating the materialized fields squarely on R's shoulders.

Also, this is more or less the direction the scan node itself is going. In ScanV2 the "projection" will just be a list of "columns to load".

Long term, this seems inevitable. For example, I think this approach may fail with a query like SELECT left.a from left INNER JOIN right on left.id = right.id WHERE left.b > right.b. The fact that you have to read left.b is not immediately evident until after the join. So I don't think R is going to give us left.b in the projection for the scan options.

The solution (I think) is a proper "pushdown projection" pass. First, R creates a plan with a scan node that is configured to load everything. Then, in a push down projection pass, the actual scan projection is calculated. Ideally this would be in C++ but it kind of breaks the old philosophy of "zero-optimization"

@nealrichardson
Copy link
Member Author

So I don't think R is going to give us left.b in the projection for the scan options.

It would, the join code in R ensures that the join keys are present at the join step, then the select of left.a happens after. Fairly confident we have tests covering this, but easy enough to verify.

If it helps our confidence in this change, this is essentially restoring the code we had in the R C++ bindings before the most recent change, which instead of copying the code from the R package, it adapted a function that already existed in scanner.cc (and apparently wasn't always right).

@westonpace
Copy link
Member

I'm fine with this change so don't let this discussion block merging.

It would, the join code in R ensures that the join keys are present at the join step, then the select of left.a happens after

Right. But I'm more worried about left.b. Maybe the case I came up with isn't good enough but I know there are some queries where you will inevitably end up with:

Scan -> Project -> Join -> Project

What expression is assigned to the scan in that case? Is it the expression from the first projection (would be wrong), the expression from the second projection (would be wrong), or are you actually doing some kind of pushdown logic in R? Or maybe you have some way of preventing this case?

@nealrichardson
Copy link
Member Author

Maybe I misunderstand your concerns, but here's how the R query building works. select/mutate/filter build and modify a projection expression and filter expression. At scan time (collect), those get pushed down into the ScanNode when querying on a dataset. Aggregations and joins are different, in that they essentially wrap the preceding parts of the query in a black box, and nothing that happens after them modifies any previous steps.

So in the case of

Scan -> Project[0] -> Join -> Project[1]

the projection of Project[0] is pushed into Scan. Join can only specify join keys based on the columns in Project[0]: if b isn't in Project[0], you can't reference it in Join. We don't (currently) inspect Project[1] to see if further columns could be pruned from steps prior to Join and pushed down, i.e. Project[1] doesn't alter Project[0] and thus not Scan.

@nealrichardson
Copy link
Member Author

Further digression: the ScanNode ToString method doesn't print anything useful, so we can't verify what filter and column predicates it has. I thought I filed an issue about that at some point but I can't find it now.

@westonpace
Copy link
Member

Join can only specify join keys based on the columns in Project[0]: if b isn't in Project[0], you can't reference it in Join.

This was the context I was missing (this is different than SQL for example). Given this context I believe your fix will work correctly (I'm going to merge it)

Further digression: the ScanNode ToString method doesn't print anything useful, so we can't verify what filter and column predicates it has. I thought I filed an issue about that at some point but I can't find it now.

I've added #34625 to make sure we handle this in the new scan node.

Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@westonpace westonpace merged commit f2c3928 into apache:main Mar 21, 2023
@nealrichardson nealrichardson deleted the fix-34519 branch March 21, 2023 17:48
@ursabot
Copy link

ursabot commented Mar 21, 2023

Benchmark runs are scheduled for baseline = 1f8a335 and contender = f2c3928. f2c3928 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.64% ⬆️0.03%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] f2c39287 ec2-t3-xlarge-us-east-2
[Finished] f2c39287 test-mac-arm
[Finished] f2c39287 ursa-i9-9960x
[Finished] f2c39287 ursa-thinkcentre-m75q
[Finished] 1f8a335d ec2-t3-xlarge-us-east-2
[Finished] 1f8a335d test-mac-arm
[Finished] 1f8a335d ursa-i9-9960x
[Finished] 1f8a335d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Mar 21, 2023

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[R] Casting columns using dplyr::mutate in arrow datasets results in NA values

4 participants