ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing #12839

thisisnic · 2022-04-08T15:52:51Z

As discussed on #12826

I'm not sure how (or whether) to write tests for this, but I ran it locally against the CSV directory set up in test-dataset-csv.R, with and without this change. Without it, we get, e.g.:

open_dataset(csv_dir)
# Error in `handle_parquet_io_error()` at r/R/dataset.R:221:6:
# ! Invalid: Error creating dataset. Could not read schema from '/tmp/RtmpuTyOD8/file5049dcf581a5/5/file1.csv': Could not open Parquet input source '/tmp/RtmpuTyOD8/file5049dcf581a5/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
# /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:323  GetReader(source, scan_options). Is this a 'parquet' file?
# /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40  InspectSchemas(std::move(options))
# /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  Inspect(options.inspect_options)
# ℹ Did you mean to specify a 'format' other than the default (parquet)?

and then with it:

open_dataset(csv_dir)
# Error in `open_dataset()`:
# ! Invalid: Error creating dataset. Could not read schema from '/tmp/RtmpLbqZs6/file4e4ca14fb5795/5/file1.csv': Could not open Parquet input source '/tmp/RtmpLbqZs6/file4e4ca14fb5795/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
# /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:323  GetReader(source, scan_options). Is this a 'parquet' file?
# /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40  InspectSchemas(std::move(options))
# /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  Inspect(options.inspect_options)
# ℹ Did you mean to specify a 'format' other than the default (parquet)?
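
For readers following along, here is a minimal sketch of why the header changes from the helper's name to `open_dataset()`. It assumes the helpers rethrow with `rlang::abort()`; `helper_before()`, `helper_after()`, `open_before()` and `open_after()` are hypothetical stand-ins, not arrow's real functions.

```r
library(rlang)

# Hypothetical helpers, not arrow's real ones: each rethrows the original error.
helper_before <- function(e) {
  # No `call` given, so abort() attributes the error to this helper's frame.
  abort(conditionMessage(e))
}
helper_after <- function(e, call) {
  # `call` is an environment; abort() reports the function that frame belongs to.
  abort(conditionMessage(e), call = call)
}

# Hypothetical user-facing functions standing in for open_dataset().
open_before <- function() {
  tryCatch(stop("not a parquet file"), error = function(e) helper_before(e))
}
open_after <- function() {
  env <- current_env() # the frame of open_after()
  tryCatch(
    stop("not a parquet file"),
    error = function(e) helper_after(e, call = env)
  )
}

err_before <- tryCatch(open_before(), error = function(e) e)
err_after <- tryCatch(open_after(), error = function(e) e)

# Compare which function each error is attributed to.
print(conditionCall(err_before))
print(conditionCall(err_after))
```

The only difference between the two paths is the `call` argument, which is what decides the function named in the "Error in" line.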

github-actions · 2022-04-08T15:53:12Z

https://issues.apache.org/jira/browse/ARROW-16154

github-actions · 2022-04-08T15:53:14Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

nealrichardson

Awesome, thanks for doing this!

r/R/util.R

nealrichardson · 2022-04-09T12:25:26Z

r/R/csv.R

@@ -200,8 +200,8 @@ read_delim_arrow <- function(file,

  tryCatch(
    tab <- reader$Read(),
-    error = function(e) {
-      handle_csv_read_error(e, schema)
+    error = function(e, call = caller_env(n = 4)) {


Is it always n = 4? Is there a more certain way to capture this? (Like, if you define call_env outside of tryCatch, is it just this env?)

thisisnic · Apr 11, 2022 (edited)

> Is it always n = 4?

It's always n = 4 here, though I deliberately chose to pass the call parameter into handle_csv_read_error() so that the function can be reused elsewhere in the code, where we may want to pass in a different environment.

> Is there a more certain way to capture this? (Like, if you define call_env outside of tryCatch, is it just this env?)

I could call rlang::current_env() above the tryCatch block. I went for calling caller_env() here as it felt "cleaner" to keep that code within this block.

I suppose that if the tryCatch block were changed to have more functions wrapped around it, the number would be wrong; however, if we call current_env() outside the block, we're unnecessarily calling it every time we call the function, even when there's no error.

Not sure what's better. What do you think?
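
To make the trade-off concrete, here is a rough sketch of the two options. The function names (`read_lazy()`, `read_nested()`, `read_eager()`) are hypothetical, and the depth of 4 relies on how base R's `tryCatch()` is currently implemented, which is an assumption rather than a documented guarantee.

```r
library(rlang)

# Lazy capture, as in the diff: the default argument is only evaluated if
# the handler actually runs. The depth of 4 assumes base R's current
# tryCatch() internals sit between the handler and the enclosing function.
read_lazy <- function() {
  own <- current_env() # kept only so the result can be checked
  tryCatch(
    stop("boom"),
    error = function(e, call = caller_env(n = 4)) identical(call, own)
  )
}

# Same handler, but tryCatch() now lives in a (hypothetical) inner helper,
# so the same depth resolves to the helper's frame instead.
read_nested <- function() {
  own <- current_env()
  helper <- function() {
    tryCatch(
      stop("boom"),
      error = function(e, call = caller_env(n = 4)) identical(call, own)
    )
  }
  helper()
}

# Eager capture: independent of nesting depth, at the cost of one
# current_env() call per invocation, whether or not an error occurs.
read_eager <- function() {
  call <- current_env()
  helper <- function() tryCatch(stop("boom"), error = function(e) call)
  identical(helper(), call)
}

print(c(lazy = read_lazy(), nested = read_nested(), eager = read_eager()))
```

Each function reports whether the environment seen by the handler is the frame of the outer function, which is what the error should be attributed to.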

nealrichardson · 2022-04-11T19:41:02Z

Feels brittle but it's probably fine. I'd just leave in some comments explaining why n = 4, and that you could have used current_env() but this way is lazy and only does it if there's an error. (Aside: it's just calling parent.frame(), which on my machine takes hundreds of nanoseconds to run, so the cost of calling it every time is not something I'm concerned about.)

We can revisit later if/when we want to chain together multiple error handlers. It also looks like rlang is growing some experimental tooling around here (https://rlang.r-lib.org/reference/try_fetch.html), so maybe that will mature and be ready whenever we revisit this.

In sum, seems like you've thought this through, so just leave a note explaining why this non-obvious thing is there and 👍!

nealrichardson

One request to add a version of what you responded as a code comment, but otherwise LGTM, nice work!
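
For context on the `try_fetch()` tooling linked in the review, the following is a small sketch of what chaining a handler with it might look like. It is not what this PR does; `read_with_fetch()` and the messages are made up for illustration, and it assumes an rlang version that exports `try_fetch()`.

```r
library(rlang)

# Hypothetical reader: the low-level failure is rethrown with a
# higher-level message, keeping the original error as its parent.
read_with_fetch <- function() {
  try_fetch(
    stop("low-level failure"),
    error = function(cnd) {
      abort("Could not read the file.", parent = cnd)
    }
  )
}

err <- tryCatch(read_with_fetch(), error = function(e) e)

# The original condition remains reachable through the `parent` field.
print(conditionMessage(err$parent))
```

The handler here needs no hard-coded frame depth, which is the part that felt brittle in the `caller_env(n = 4)` approach.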

ursabot · 2022-04-14T17:11:30Z

Benchmark runs are scheduled for baseline = 681ede6 and contender = 5d5cceb. 5d5cceb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.67% ⬆️0.08%] test-mac-arm
[Failed ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.98% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/504| 5d5ccebe ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/490| 5d5ccebe test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/488| 5d5ccebe ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/500| 5d5ccebe ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/503| 681ede6f ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/489| 681ede6f test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/490| 681ede6f ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/499| 681ede6f ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Make the error trace reflect the cause of the problem

93c487a

github-actions bot added the Component: R label Apr 8, 2022

thisisnic requested a review from nealrichardson April 8, 2022 15:53

thisisnic mentioned this pull request Apr 8, 2022

ARROW-15260: [R] open_dataset - add file_name as column #12826

Merged

nealrichardson reviewed Apr 9, 2022

Simplify routes through

7b44ac7

thisisnic requested a review from nealrichardson April 11, 2022 08:15

nealrichardson approved these changes Apr 11, 2022

Add comments

20fb006

thisisnic closed this in 5d5cceb Apr 13, 2022
