
ARROW-15726: [C++] If a projected_schema is not supplied but a bound projection expression is then we should use that to infer the projected_schema #12466

Conversation

westonpace
Member

New rules. This is a somewhat short-term fix until we address ARROW-12311.

  • If neither projection nor projected_schema is given, then fetch every field in the dataset schema.
  • If only projected_schema is specified, then the projection expression becomes a field_ref to every field name in projected_schema.
  • If only projection is specified, and it is bound and is a make_struct of simple field_refs, then we create a projected_schema from the names/types in the bound projection expression (see the sketch after this list).
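To make the third rule concrete, here is a minimal sketch of the inference, not the actual patch; the header paths and API usage (compute::Expression::call(), Call::function_name, MakeStructOptions::field_names, FieldRef::name(), Expression::type()) are my assumptions about the Arrow C++ expression API and may differ from the code in this PR:

// Sketch only: infer a projected_schema from a bound make_struct projection
// whose arguments are all simple (named) field_refs. Header paths below are
// assumptions about the Arrow C++ source tree layout.
#include "arrow/compute/api_scalar.h"       // MakeStructOptions
#include "arrow/compute/exec/expression.h"  // compute::Expression
#include "arrow/result.h"
#include "arrow/type.h"

arrow::Result<std::shared_ptr<arrow::Schema>> InferProjectedSchema(
    const arrow::compute::Expression& projection) {
  const auto* call = projection.call();
  if (call == nullptr || call->function_name != "make_struct") {
    return arrow::Status::Invalid("projection is not a make_struct call");
  }
  // Assumes field_names aligns one-to-one with the call's arguments.
  const auto& opts =
      static_cast<const arrow::compute::MakeStructOptions&>(*call->options);
  arrow::FieldVector fields;
  for (size_t i = 0; i < call->arguments.size(); ++i) {
    const auto& arg = call->arguments[i];
    const auto* ref = arg.field_ref();
    if (ref == nullptr || ref->name() == nullptr) {
      // Not a simple, named field_ref; the caller must supply projected_schema.
      return arrow::Status::Invalid("could not infer the projected schema");
    }
    // arg must be bound here so that its type is known.
    fields.push_back(arrow::field(opts.field_names[i], arg.type()));
  }
  return arrow::schema(std::move(fields));
}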

… missing projection or projected_schema fields
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@westonpace
Member Author

@ursabot please benchmark lang=R

@ursabot

ursabot commented Feb 18, 2022

Benchmark runs are scheduled for baseline = ade1027 and contender = 5f6b483. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jonkeane
Member

It's not super easy to find, but the logs from the benchmark (https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/159#6bba5997-49a2-4405-b8ae-6172b8a93a1a/6-3467) are showing the same error we were seeing before:

Traceback (most recent call last):
  File "/var/lib/buildkite-agent/builds/ursa-i9-9960x-1/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/benchmarks/benchmarks/_benchmark.py", line 201, in r_benchmark
    result, output = self._get_benchmark_result(command)
  File "/var/lib/buildkite-agent/builds/ursa-i9-9960x-1/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/benchmarks/benchmarks/_benchmark.py", line 228, in _get_benchmark_result
    raise Exception(error)
Exception: Error: Garbage collection 16 = 12+1+3 (level 0) ...
52.1 Mbytes of cons cells used (58%)
12.5 Mbytes of vectors used (19%)
Error in `handle_csv_read_error()`:
! NotImplemented: Unsupported cast from string to null using function cast_null
Backtrace:
▆
 1. ├─arrowbench::run_bm(...)
 2. │ └─arrowbench:::run_iteration(bm, ctx, profiling = profiling)
 3. │   ├─arrowbench::measure(eval(bm$run, envir = ctx), profiling = profiling)
 4. │   │ ├─arrowbench:::with_gc_info(...)
 5. │   │ │ ├─bench:::with_gcinfo(eval.parent(expr))
 6. │   │ │ │ └─base::force(expr)
 7. │   │ │ └─base::eval.parent(expr)
 8. │   │ │   └─base::eval(expr, p)
 9. │   │ ├─arrowbench:::with_profiling(...)
10. │   │ │ └─base::eval.parent(expr)
11. │   │ │   └─base::eval(expr, p)
12. │   │ ├─bench::bench_time(eval.parent(...))
13. │   │ │ ├─stats::setNames(...)
14. │   │ │ └─bench::
Execution halted

*** caught segfault ***

@westonpace
Member Author

@jonkeane I think you are looking at the wrong conbench run. The run requested for this PR completed successfully: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/160

Comment on lines +126 to +130
// Either the expression for this field is not a field_ref or it is not a
// simple field_ref. User must supply projected_schema
return Status::Invalid(
"No projected schema was supplied and we could not infer the projected "
"schema from the projection expression.");
Member


We could infer names/types by just stringifying the expression and using the expression's type in this case, though I guess as a temporary fix it may not be worth it.
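For illustration, that alternative would amount to something like the following inside the inference loop sketched earlier (again a sketch; the use of Expression::ToString() and Expression::type() is an assumption):

// Sketch: name the output field after the stringified expression instead of
// rejecting arguments that are not simple field_refs.
fields.push_back(arrow::field(arg.ToString(), arg.type()));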

Member Author


Agreed. There is more we could do here, but in general we can leave that up to the caller. If there is a need to use more complex expressions, the caller can supply the projected_schema alongside the projection and it should work just fine, as in the sketch below.
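A caller-side sketch of that fallback (the field name and type are purely illustrative, and the ScanOptions member names are assumptions):

// Hypothetical caller supplying both a complex projection and the
// corresponding projected_schema explicitly.
auto options = std::make_shared<arrow::dataset::ScanOptions>();
options->projection = bound_projection;  // e.g. a bound call to "add"
options->projected_schema =
    arrow::schema({arrow::field("a_plus_b", arrow::int64())});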

@lidavidm lidavidm changed the title [C++] If a projected_schema is not supplied but a bound projection expression is then we should use that to infer the projected_schema ARROW-15726: [C++] If a projected_schema is not supplied but a bound projection expression is then we should use that to infer the projected_schema Feb 23, 2022
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Member

Which benchmarks are we looking at here? I don't see any meaningful improvements.

@jonkeane
Member

> @jonkeane I think you are looking at the wrong conbench run. The run requested for this PR completed successfully: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/160

You're totally right, the correct build is 160 which does have dataset benchmarks running just fine.

But now I'm confused about why the comment up above marks the benchmarks as failed:

[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x

and why, if you follow that link to conbench, there aren't any dataset benchmarks reported. But both of those are likely conbench issues that have nothing to do with this PR. Sorry for the confusion!

@westonpace
Member Author

@lidavidm There aren't supposed to be any improvements. An earlier change broke some accidental inference of the projected schema, so conbench was failing. This fix gets conbench passing again with intentional inference.

@lidavidm
Member

Ah, I misunderstood. Thanks for the clarification.

@ursabot

ursabot commented Feb 23, 2022

Benchmark runs are scheduled for baseline = 6b63103 and contender = bdd6bcd. bdd6bcd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.25% ⬆️0.08%] test-mac-arm
[Failed ⬇️1.43% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
