
ARROW-15726: [C++] If a projected_schema is not supplied but a bound projection expression is then we should use that to infer the projected_schema #12466

Conversation

westonpace
Member

New rules. This is a somewhat short-term fix until we address ARROW-12311.

  • If neither projection nor projected_schema is given, then fetch every field in the dataset schema.
  • If only projected_schema is specified, then the projection expression becomes a field_ref to every field name in projected_schema.
  • If only projection is specified, and it is bound and is a make_struct of simple field_refs, then we create a projected_schema from the names/types in the bound projection expression (see the sketch after this list).
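To make the third rule concrete, here is a minimal sketch of the inference, not the actual patch; the header paths and API usage (compute::Expression::call(), Call::function_name, MakeStructOptions::field_names, FieldRef::name(), Expression::type()) are my assumptions about the Arrow C++ expression API and may differ from the code in this PR:

// Sketch only: infer a projected_schema from a bound make_struct projection
// whose arguments are all simple (named) field_refs. Header paths below are
// assumptions about the Arrow C++ source tree layout.
#include "arrow/compute/api_scalar.h"       // MakeStructOptions
#include "arrow/compute/exec/expression.h"  // compute::Expression
#include "arrow/result.h"
#include "arrow/type.h"

arrow::Result<std::shared_ptr<arrow::Schema>> InferProjectedSchema(
    const arrow::compute::Expression& projection) {
  const auto* call = projection.call();
  if (call == nullptr || call->function_name != "make_struct") {
    return arrow::Status::Invalid("projection is not a make_struct call");
  }
  // Assumes field_names aligns one-to-one with the call's arguments.
  const auto& opts =
      static_cast<const arrow::compute::MakeStructOptions&>(*call->options);
  arrow::FieldVector fields;
  for (size_t i = 0; i < call->arguments.size(); ++i) {
    const auto& arg = call->arguments[i];
    const auto* ref = arg.field_ref();
    if (ref == nullptr || ref->name() == nullptr) {
      // Not a simple, named field_ref; the caller must supply projected_schema.
      return arrow::Status::Invalid("could not infer the projected schema");
    }
    // arg must be bound here so that its type is known.
    fields.push_back(arrow::field(opts.field_names[i], arg.type()));
  }
  return arrow::schema(std::move(fields));
}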

… missing projection or projected_schema fields
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@westonpace
Member Author

@ursabot please benchmark lang=R

@ursabot

ursabot commented Feb 18, 2022

Benchmark runs are scheduled for baseline = ade1027 and contender = 5f6b483. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jonkeane
Member

It's not super easy to find, but the logs from the benchmark (https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/159#6bba5997-49a2-4405-b8ae-6172b8a93a1a/6-3467) are showing the same error we were seeing before:

Traceback (most recent call last):
  File "/var/lib/buildkite-agent/builds/ursa-i9-9960x-1/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/benchmarks/benchmarks/_benchmark.py", line 201, in r_benchmark
    result, output = self._get_benchmark_result(command)
  File "/var/lib/buildkite-agent/builds/ursa-i9-9960x-1/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/benchmarks/benchmarks/_benchmark.py", line 228, in _get_benchmark_result
    raise Exception(error)
Exception: Error: Garbage collection 16 = 12+1+3 (level 0) ...
52.1 Mbytes of cons cells used (58%)
12.5 Mbytes of vectors used (19%)
Error in `handle_csv_read_error()`:
! NotImplemented: Unsupported cast from string to null using function cast_null
Backtrace:
▆
 1. ├─arrowbench::run_bm(...)
 2. │ └─arrowbench:::run_iteration(bm, ctx, profiling = profiling)
 3. │   ├─arrowbench::measure(eval(bm$run, envir = ctx), profiling = profiling)
 4. │   │ ├─arrowbench:::with_gc_info(...)
 5. │   │ │ ├─bench:::with_gcinfo(eval.parent(expr))
 6. │   │ │ │ └─base::force(expr)
 7. │   │ │ └─base::eval.parent(expr)
 8. │   │ │   └─base::eval(expr, p)
 9. │   │ ├─arrowbench:::with_profiling(...)
10. │   │ │ └─base::eval.parent(expr)
11. │   │ │   └─base::eval(expr, p)
12. │   │ ├─bench::bench_time(eval.parent(...))
13. │   │ │ ├─stats::setNames(...)
14. │   │ │ └─bench::
Execution halted

*** caught segfault ***

@westonpace
Member Author

@jonkeane I think you are looking at the wrong conbench run. The run requested for this PR completed successfully: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/160

Comment on lines +126 to +130
// Either the expression for this field is not a field_ref or it is not a
// simple field_ref. User must supply projected_schema
return Status::Invalid(
"No projected schema was supplied and we could not infer the projected "
"schema from the projection expression.");
Member


We could infer names/types by just stringifying the expression and using the expression's type in this case, though I guess as a temporary fix it may not be worth it.
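For illustration, that alternative would amount to something like the following inside the inference loop sketched earlier (again a sketch; the use of Expression::ToString() and Expression::type() is an assumption):

// Sketch: name the output field after the stringified expression instead of
// rejecting arguments that are not simple field_refs.
fields.push_back(arrow::field(arg.ToString(), arg.type()));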

Member Author


Agreed. There is more we could do here, but in general we can leave that up to the caller. If there is a need to use more complex expressions, the caller can supply the projected_schema alongside the projection and it should work just fine, as in the sketch below.
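A caller-side sketch of that fallback (the field name and type are purely illustrative, and the ScanOptions member names are assumptions):

// Hypothetical caller supplying both a complex projection and the
// corresponding projected_schema explicitly.
auto options = std::make_shared<arrow::dataset::ScanOptions>();
options->projection = bound_projection;  // e.g. a bound call to "add"
options->projected_schema =
    arrow::schema({arrow::field("a_plus_b", arrow::int64())});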

@lidavidm lidavidm changed the title [C++] If a projected_schema is not supplied but a bound projection expression is then we should use that to infer the projected_schema ARROW-15726: [C++] If a projected_schema is not supplied but a bound projection expression is then we should use that to infer the projected_schema Feb 23, 2022
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Member

Which benchmarks are we looking at here? I don't see any meaningful improvements.

@jonkeane
Member

> @jonkeane I think you are looking at the wrong conbench run. The run requested for this PR completed successfully: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/160

You're totally right, the correct build is 160 which does have dataset benchmarks running just fine.

But now I'm confused about why the comment up above marks the benchmarks as failed:

[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x

and why, if you follow that link to conbench, there aren't any dataset benchmarks reported. But both of those are likely conbench issues that have nothing to do with this PR. Sorry for the confusion!

@westonpace
Member Author

@lidavidm There aren't supposed to be any improvements. An earlier change broke some accidental inference of the projected schema, so conbench was failing. This fix gets conbench passing again with intentional inference.

@lidavidm
Member

Ah, I misunderstood. Thanks for the clarification.

@ursabot

ursabot commented Feb 23, 2022

Benchmark runs are scheduled for baseline = 6b63103 and contender = bdd6bcd. bdd6bcd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.25% ⬆️0.08%] test-mac-arm
[Failed ⬇️1.43% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
