ARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience #14030

thisisnic · 2022-09-02T09:53:34Z

For context; these handle_* functions originally caught an error and if it contained a particular string raised an augmented error message with extra guidance for the user (and if not, raised the original error message).

This became problematic in a later PR where we wanted to test multiple conditions and only raise the original error if none of the conditions were met - the temporary approach was to move the responsibility for the raising of the original error to outside of the handle_* functions. The issue here is that this makes it easy for developers to forget to add in this line of code.

The proposed solution here implements a generic error augmentation function augment_io_error_msg(), which tests all conditions, raises an error with an augmented message if any conditions are met, or raises the original error if not.

github-actions · 2022-09-02T09:54:11Z

https://issues.apache.org/jira/browse/ARROW-17355

github-actions · 2022-09-02T09:54:13Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

thisisnic · 2022-09-02T10:01:16Z

r/R/util.R

+# handler function which checks for a number of different read errors
+augment_io_error_msg <- function(e, call, ...){
+  dots <- list2(...)
+  msg <- conditionMessage(e)
+
+  if (!is.null(dots[["schema"]])) {
+    handle_csv_read_error(msg, call, dots[["schema"]])
+  }
+  if (!is.null(dots[["format"]])) {
+    handle_parquet_io_error(msg, call, dots[["format"]])
+  }
+
+  handle_augmented_field_misuse(msg, call)
+  abort(msg, call = call)
+}


Not sure if this approach of calling functions conditionally based on which optional arguments are supplied is reasonable or a bit of an anti-pattern.

nealrichardson · 2022-09-06T19:40:36Z

Would something even more generic work? Something like

try_arrow <- function(expr, error_handlers = list()) {
  tryCatch(expr, error = function(e) {
    call <- caller_env(n = 5)
    for (fun in error_handlers) {
      fun(e)
    }
    abort(e, call = call)
  })
}

You'd pass a list of functions to check for those error conditions, like:

try_arrow(
  out <- as_arrow_table(x),
  error_handlers = list(
    handle_csv_read_error, 
    handle_augmented_field_misuse
  )
)

thisisnic · 2022-09-23T12:09:14Z

Would something even more generic work? Something like

try_arrow <- function(expr, error_handlers = list()) {
  tryCatch(expr, error = function(e) {
    call <- caller_env(n = 5)
    for (fun in error_handlers) {
      fun(e)
    }
    abort(e, call = call)
  })
}

You'd pass a list of functions to check for those error conditions, like:

try_arrow(
  out <- as_arrow_table(x),
  error_handlers = list(
    handle_csv_read_error, 
    handle_augmented_field_misuse
  )
)

I don't think that would work as those functions have different arguments which also need passing in, to provide more context for if the error should be triggered, and passing in both the functions and their arguments here might be a bit overcomplicated?

paleolimbot · 2022-09-26T12:32:20Z

Thank you for taking this on...I frequently find myself putting a breakpoint in the handle_csv_read_error() to figure this kind of stuff out.

I wonder if classed errors can help us here? We abort(..., class = "something") then tryCatch(..., something = function(e) { ...handle the special error with a nicer message but with a reference to the original }. rlang has some tooling to make this relatively straightforward:

rlang::local_interactive()

read_csv <- function() {
  rlang::abort("Some message", class = "special_csv_read_error")
}

some_other_fun <- function() {
  tryCatch({
    read_csv()
  }, special_csv_read_error = function(e) {
    rlang::abort("Some new message", parent = e)
  })
}

some_other_fun()
#> Error in `some_other_fun()`:
#> ! Some new message
#> Caused by error in `read_csv()`:
#> ! Some message

rlang::last_error()
#> Error: Can't show last error because no error was recorded yet

^{Created on 2022-09-26 by the reprex package (v2.0.1)}

paleolimbot · 2022-09-26T12:33:51Z

(Whoops, in my interactive session I get a nice traceback from rlang::last_error())

paleolimbot

I bumped into this in #14250 (comment) and this PR fixes that nicely! (The issue there was that handle_csv_read_error() was swallowing the error and not stop()ing, and the rest of the function was rather confused when tab had never been assigned).

Apologies for the indirection on the classed error comment...in this case those errors are coming directly from C++ and so we have no control over the error class (well, we could stick something in C++, but a battle for another day).

It's a matter of style, but functions that might abort() or stop() tend to be named stop_for_XXXX() in the places I've seen them in the tidyverse.

paleolimbot · 2022-09-28T12:48:04Z

r/R/util.R

@@ -251,3 +235,19 @@ handle_augmented_field_misuse <- function(e, call) {
 is_compressed <- function(compression) {
  !identical(compression, "uncompressed")
 }
+
+# handler function which checks for a number of different read errors
+augment_io_error_msg <- function(e, call, ...) {


Suggested change

augment_io_error_msg <- function(e, call, ...) {

augment_io_error_msg <- function(e, call, ..., schema = NULL, format = NULL) {

(then below it becomes !is.null(schema) and !is.null(format))

paleolimbot · 2022-09-28T13:00:46Z

r/R/util.R

-# TODO: Refactor as part of ARROW-17355 to prevent potential missed errors
-handle_parquet_io_error <- function(e, format, call) {
-  msg <- conditionMessage(e)
+handle_parquet_io_error <- function(msg, call, format) {


In the abort() below, we should probably be using abort(msg, call = call, parent = e). This will print out the original error, too, in addition to the pretty helpful message. For development it's nice because we generally are interested in the error coming from C++.

The original message is already printed out as we append the helpful message to it.

OK, I've tried it out both ways, and see what parent = e does, but I think it makes more sense the way it prints currently. What do you think?

The way it looks now:

Error in `open_dataset()`: ! Invalid: Error creating dataset. Could not read schema from 'file1.csv': Could not open Parquet input source '/tmp/RtmpLed3En/filebe14261f95ab2/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file. /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338 GetReader(source, scan_options). Is this a 'parquet' file? /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40 InspectSchemas(std::move(options)) /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262 Inspect(options.inspect_options) ℹ Did you mean to specify a 'format' other than the default (parquet)?

With parent = e:

Error in `open_dataset()`: ! ℹ Did you mean to specify a 'format' other than the default (parquet)? Caused by error: ! Invalid: Error creating dataset. Could not read schema from 'file1.csv': Could not open Parquet input source '/tmp/RtmpLed3En/filebe14261f95ab2/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file. /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338 GetReader(source, scan_options). Is this a 'parquet' file? /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40 InspectSchemas(std::move(options)) /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262 Inspect(options.inspect_options) Run `rlang::last_error()` to see where the error occurred.

I rather like the parent = e version? The error relevant to the user is front and centre with a possible cause suggested under. Normally Caused by error: contains the call from the original error, but we're messing with that. I think that rlang::last_error() and rlang::last_trace() are both improved with the parent = e version, too (I can play with it given a reprex, too).

My concern here is that this error can be triggered both in the case we test for - the user omits the correct format specification and so we use the default (Parquet) - but also in the case that they are trying to read from a corrupted Parquet file. If the latter is true, the additional message is not actually helpful and may be misleading. The additional message coming second helps frame it as a suggestion rather than as the definitive solution to the problem.

Both are a substantial improvement over what we have now...no need to have this PR hung up on that! I tend to be on the debugging end of these error handlers, where I really do want to know what the original stack trace was, and I'd hoped that attaching the original error as the parent might help with that.

paleolimbot · 2022-09-28T13:01:39Z

r/R/util.R

-handle_csv_read_error <- function(e, schema, call) {
-  msg <- conditionMessage(e)
-
+handle_csv_read_error <- function(msg, call, schema) {


Same as for parquet...I think abort(msg, call = call, parent = e) will give a pretty looking error message and include the original error.

paleolimbot · 2022-09-28T13:02:09Z

r/R/util.R

-# TODO: Refactor as part of ARROW-17355 to prevent potential missed errors
-handle_augmented_field_misuse <- function(e, call) {
-  msg <- conditionMessage(e)
+handle_augmented_field_misuse <- function(msg, call) {


Maybe use abort(..., parent = e) here, too.

paleolimbot · 2022-09-28T13:02:54Z

r/R/util.R

+  }
+
+  handle_augmented_field_misuse(msg, call)
+  abort(msg, call = call)


(no need for parent here since we're not modifying the error)

thisisnic · 2022-09-28T15:31:02Z

stop_for_XXXX

Do you have any examples of these? I did have a look but couldn't find anything.

paleolimbot

They're particularly prolific in vctrs (apparently without the "for" though!):

stringr::str_subset(names(asNamespace("vctrs")), "^stop_")
#>  [1] "stop_subscript_oob"                "stop_location_negative"           
#>  [3] "stop_non_list_type"                "stop_assert"                      
#>  [5] "stop_indicator_size"               "stop_sf"                          
#>  [7] "stop_incompatible_shape"           "stop_native_implementation"       
#>  [9] "stop_matches"                      "stop_names"                       
#> [11] "stop_unimplemented"                "stop_matches_incomplete"          
#> [13] "stop_incompatible_cast"            "stop_names_must_be_unique"        
#> [15] "stop_input_type"                   "stop_names_cannot_be_dot_dot"     
#> [17] "stop_assert_size"                  "stop_matches_nothing"             
#> [19] "stop_names_cannot_be_empty"        "stop_scalar_type"                 
#> [21] "stop_incompatible_type"            "stop_corrupt_ordered_levels"      
#> [23] "stop_matches_remaining"            "stop_subscript_missing"           
#> [25] "stop_unsupported"                  "stop_recycle_incompatible_size"   
#> [27] "stop_defunct"                      "stop_location_negative_positive"  
#> [29] "stop_corrupt_factor_levels"        "stop_incompatible_op"             
#> [31] "stop_lossy_cast"                   "stop_matches_multiple"            
#> [33] "stop_incompatible"                 "stop_incompatible_size"           
#> [35] "stop_location_negative_missing"    "stop_subscript"                   
#> [37] "stop_location_oob_non_consecutive" "stop_location_zero"               
#> [39] "stop_vctrs"

I don't want this PR, which really helps the dev experience, to get hung up on these suggestions though!

paleolimbot · 2022-09-29T12:47:34Z

r/R/util.R

-# TODO: Refactor as part of ARROW-17355 to prevent potential missed errors
-handle_parquet_io_error <- function(e, format, call) {
-  msg <- conditionMessage(e)
+handle_parquet_io_error <- function(msg, call, format) {


Both are a substantial improvement over what we have now...no need to have this PR hung up on that! I tend to be on the debugging end of these error handlers, where I really do want to know what the original stack trace was, and I'd hoped that attaching the original error as the parent might help with that.

ursabot · 2022-10-01T10:31:39Z

Benchmark runs are scheduled for baseline = d60d8c6 and contender = 4dfa617. 4dfa617 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.17% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.32% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 4dfa6176 ec2-t3-xlarge-us-east-2
[Failed] 4dfa6176 test-mac-arm
[Failed] 4dfa6176 ursa-i9-9960x
[Finished] 4dfa6176 ursa-thinkcentre-m75q
[Finished] d60d8c6d ec2-t3-xlarge-us-east-2
[Failed] d60d8c6d test-mac-arm
[Failed] d60d8c6d ursa-i9-9960x
[Finished] d60d8c6d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

… dev experience (apache#14030) For context; these `handle_*` functions originally caught an error and if it contained a particular string raised an augmented error message with extra guidance for the user (and if not, raised the original error message). This became problematic in a later PR where we wanted to test multiple conditions and only raise the original error if none of the conditions were met - the temporary approach was to move the responsibility for the raising of the original error to outside of the `handle_*` functions. The issue here is that this makes it easy for developers to forget to add in this line of code. The proposed solution here implements a generic error augmentation function `augment_io_error_msg()`, which tests all conditions, raises an error with an augmented message if any conditions are met, or raises the original error if not. Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>

Refactor functions for combined errors

d280d14

github-actions bot added the Component: R label Sep 2, 2022

Remove warning headers

8062b6f

thisisnic commented Sep 2, 2022

View reviewed changes

thisisnic marked this pull request as ready for review September 2, 2022 10:09

Appease the linter

8de2042

paleolimbot reviewed Sep 28, 2022

View reviewed changes

thisisnic added 2 commits September 28, 2022 16:32

Add params back in

574757f

Remove redundant expression

8d331d8

paleolimbot approved these changes Sep 29, 2022

View reviewed changes

thisisnic merged commit 4dfa617 into apache:master Oct 1, 2022

eitsupi mentioned this pull request Oct 1, 2022

ARROW-17429: [R] Error messages are not helpful of read_csv_arrow with col_types option #13922

Closed

asfimport mentioned this pull request Oct 1, 2022

[R] Refactor the handle_* utility functions for a better dev experience #32629

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience #14030

ARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience #14030

thisisnic commented Sep 2, 2022 •

edited

Loading

github-actions bot commented Sep 2, 2022

github-actions bot commented Sep 2, 2022

thisisnic Sep 2, 2022

nealrichardson commented Sep 6, 2022

thisisnic commented Sep 23, 2022 •

edited

Loading

paleolimbot commented Sep 26, 2022

paleolimbot commented Sep 26, 2022

paleolimbot left a comment

paleolimbot Sep 28, 2022

paleolimbot Sep 28, 2022

thisisnic Sep 28, 2022

thisisnic Sep 28, 2022

paleolimbot Sep 28, 2022

thisisnic Sep 29, 2022 •

edited

Loading

paleolimbot Sep 29, 2022

paleolimbot Sep 28, 2022

paleolimbot Sep 28, 2022

paleolimbot Sep 28, 2022

thisisnic commented Sep 28, 2022

paleolimbot left a comment

paleolimbot Sep 29, 2022

ursabot commented Oct 1, 2022

	augment_io_error_msg <- function(e, call, ...) {
	augment_io_error_msg <- function(e, call, ..., schema = NULL, format = NULL) {

ARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience #14030

ARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience #14030

Conversation

thisisnic commented Sep 2, 2022 • edited Loading

github-actions bot commented Sep 2, 2022

github-actions bot commented Sep 2, 2022

Choose a reason for hiding this comment

nealrichardson commented Sep 6, 2022

thisisnic commented Sep 23, 2022 • edited Loading

paleolimbot commented Sep 26, 2022

paleolimbot commented Sep 26, 2022

paleolimbot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thisisnic Sep 29, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thisisnic commented Sep 28, 2022

paleolimbot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ursabot commented Oct 1, 2022

thisisnic commented Sep 2, 2022 •

edited

Loading

thisisnic commented Sep 23, 2022 •

edited

Loading

thisisnic Sep 29, 2022 •

edited

Loading