Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience #14030

Merged
merged 5 commits into from
Oct 1, 2022

Conversation

thisisnic
Copy link
Member

@thisisnic thisisnic commented Sep 2, 2022

For context; these handle_* functions originally caught an error and if it contained a particular string raised an augmented error message with extra guidance for the user (and if not, raised the original error message).

This became problematic in a later PR where we wanted to test multiple conditions and only raise the original error if none of the conditions were met - the temporary approach was to move the responsibility for the raising of the original error to outside of the handle_* functions. The issue here is that this makes it easy for developers to forget to add in this line of code.

The proposed solution here implements a generic error augmentation function augment_io_error_msg(), which tests all conditions, raises an error with an augmented message if any conditions are met, or raises the original error if not.

@github-actions
Copy link

github-actions bot commented Sep 2, 2022

@github-actions
Copy link

github-actions bot commented Sep 2, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

r/R/util.R Outdated
Comment on lines 242 to 256
# handler function which checks for a number of different read errors
augment_io_error_msg <- function(e, call, ...){
dots <- list2(...)
msg <- conditionMessage(e)

if (!is.null(dots[["schema"]])) {
handle_csv_read_error(msg, call, dots[["schema"]])
}
if (!is.null(dots[["format"]])) {
handle_parquet_io_error(msg, call, dots[["format"]])
}

handle_augmented_field_misuse(msg, call)
abort(msg, call = call)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this approach of calling functions conditionally based on which optional arguments are supplied is reasonable or a bit of an anti-pattern.

@thisisnic thisisnic marked this pull request as ready for review September 2, 2022 10:09
@nealrichardson
Copy link
Member

Would something even more generic work? Something like

try_arrow <- function(expr, error_handlers = list()) {
  tryCatch(expr, error = function(e) {
    call <- caller_env(n = 5)
    for (fun in error_handlers) {
      fun(e)
    }
    abort(e, call = call)
  })
}

You'd pass a list of functions to check for those error conditions, like:

try_arrow(
  out <- as_arrow_table(x),
  error_handlers = list(
    handle_csv_read_error, 
    handle_augmented_field_misuse
  )
)

@thisisnic
Copy link
Member Author

thisisnic commented Sep 23, 2022

Would something even more generic work? Something like

try_arrow <- function(expr, error_handlers = list()) {
  tryCatch(expr, error = function(e) {
    call <- caller_env(n = 5)
    for (fun in error_handlers) {
      fun(e)
    }
    abort(e, call = call)
  })
}

You'd pass a list of functions to check for those error conditions, like:

try_arrow(
  out <- as_arrow_table(x),
  error_handlers = list(
    handle_csv_read_error, 
    handle_augmented_field_misuse
  )
)

I don't think that would work as those functions have different arguments which also need passing in, to provide more context for if the error should be triggered, and passing in both the functions and their arguments here might be a bit overcomplicated?

@paleolimbot
Copy link
Member

Thank you for taking this on...I frequently find myself putting a breakpoint in the handle_csv_read_error() to figure this kind of stuff out.

I wonder if classed errors can help us here? We abort(..., class = "something") then tryCatch(..., something = function(e) { ...handle the special error with a nicer message but with a reference to the original }. rlang has some tooling to make this relatively straightforward:

rlang::local_interactive()

read_csv <- function() {
  rlang::abort("Some message", class = "special_csv_read_error")
}

some_other_fun <- function() {
  tryCatch({
    read_csv()
  }, special_csv_read_error = function(e) {
    rlang::abort("Some new message", parent = e)
  })
}

some_other_fun()
#> Error in `some_other_fun()`:
#> ! Some new message
#> Caused by error in `read_csv()`:
#> ! Some message

rlang::last_error()
#> Error: Can't show last error because no error was recorded yet

Created on 2022-09-26 by the reprex package (v2.0.1)

@paleolimbot
Copy link
Member

(Whoops, in my interactive session I get a nice traceback from rlang::last_error())

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bumped into this in #14250 (comment) and this PR fixes that nicely! (The issue there was that handle_csv_read_error() was swallowing the error and not stop()ing, and the rest of the function was rather confused when tab had never been assigned).

Apologies for the indirection on the classed error comment...in this case those errors are coming directly from C++ and so we have no control over the error class (well, we could stick something in C++, but a battle for another day).

It's a matter of style, but functions that might abort() or stop() tend to be named stop_for_XXXX() in the places I've seen them in the tidyverse.

r/R/util.R Outdated
@@ -251,3 +235,19 @@ handle_augmented_field_misuse <- function(e, call) {
is_compressed <- function(compression) {
!identical(compression, "uncompressed")
}

# handler function which checks for a number of different read errors
augment_io_error_msg <- function(e, call, ...) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
augment_io_error_msg <- function(e, call, ...) {
augment_io_error_msg <- function(e, call, ..., schema = NULL, format = NULL) {

(then below it becomes !is.null(schema) and !is.null(format))

# TODO: Refactor as part of ARROW-17355 to prevent potential missed errors
handle_parquet_io_error <- function(e, format, call) {
msg <- conditionMessage(e)
handle_parquet_io_error <- function(msg, call, format) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the abort() below, we should probably be using abort(msg, call = call, parent = e). This will print out the original error, too, in addition to the pretty helpful message. For development it's nice because we generally are interested in the error coming from C++.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original message is already printed out as we append the helpful message to it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've tried it out both ways, and see what parent = e does, but I think it makes more sense the way it prints currently. What do you think?

The way it looks now:

Error in `open_dataset()`:
! Invalid: Error creating dataset. Could not read schema from 'file1.csv': Could not open Parquet input source '/tmp/RtmpLed3En/filebe14261f95ab2/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
/home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338  GetReader(source, scan_options). Is this a 'parquet' file?
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40  InspectSchemas(std::move(options))
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  Inspect(options.inspect_options)
ℹ Did you mean to specify a 'format' other than the default (parquet)?

With parent = e:

Error in `open_dataset()`:
! ℹ Did you mean to specify a 'format' other than the default (parquet)?
Caused by error:
! Invalid: Error creating dataset. Could not read schema from 'file1.csv': Could not open Parquet input source '/tmp/RtmpLed3En/filebe14261f95ab2/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
/home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338  GetReader(source, scan_options). Is this a 'parquet' file?
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40  InspectSchemas(std::move(options))
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  Inspect(options.inspect_options)
Run `rlang::last_error()` to see where the error occurred.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rather like the parent = e version? The error relevant to the user is front and centre with a possible cause suggested under. Normally Caused by error: contains the call from the original error, but we're messing with that. I think that rlang::last_error() and rlang::last_trace() are both improved with the parent = e version, too (I can play with it given a reprex, too).

Copy link
Member Author

@thisisnic thisisnic Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern here is that this error can be triggered both in the case we test for - the user omits the correct format specification and so we use the default (Parquet) - but also in the case that they are trying to read from a corrupted Parquet file. If the latter is true, the additional message is not actually helpful and may be misleading. The additional message coming second helps frame it as a suggestion rather than as the definitive solution to the problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are a substantial improvement over what we have now...no need to have this PR hung up on that! I tend to be on the debugging end of these error handlers, where I really do want to know what the original stack trace was, and I'd hoped that attaching the original error as the parent might help with that.

handle_csv_read_error <- function(e, schema, call) {
msg <- conditionMessage(e)

handle_csv_read_error <- function(msg, call, schema) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as for parquet...I think abort(msg, call = call, parent = e) will give a pretty looking error message and include the original error.

# TODO: Refactor as part of ARROW-17355 to prevent potential missed errors
handle_augmented_field_misuse <- function(e, call) {
msg <- conditionMessage(e)
handle_augmented_field_misuse <- function(msg, call) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use abort(..., parent = e) here, too.

}

handle_augmented_field_misuse(msg, call)
abort(msg, call = call)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(no need for parent here since we're not modifying the error)

@thisisnic
Copy link
Member Author

stop_for_XXXX

Do you have any examples of these? I did have a look but couldn't find anything.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're particularly prolific in vctrs (apparently without the "for" though!):

stringr::str_subset(names(asNamespace("vctrs")), "^stop_")
#>  [1] "stop_subscript_oob"                "stop_location_negative"           
#>  [3] "stop_non_list_type"                "stop_assert"                      
#>  [5] "stop_indicator_size"               "stop_sf"                          
#>  [7] "stop_incompatible_shape"           "stop_native_implementation"       
#>  [9] "stop_matches"                      "stop_names"                       
#> [11] "stop_unimplemented"                "stop_matches_incomplete"          
#> [13] "stop_incompatible_cast"            "stop_names_must_be_unique"        
#> [15] "stop_input_type"                   "stop_names_cannot_be_dot_dot"     
#> [17] "stop_assert_size"                  "stop_matches_nothing"             
#> [19] "stop_names_cannot_be_empty"        "stop_scalar_type"                 
#> [21] "stop_incompatible_type"            "stop_corrupt_ordered_levels"      
#> [23] "stop_matches_remaining"            "stop_subscript_missing"           
#> [25] "stop_unsupported"                  "stop_recycle_incompatible_size"   
#> [27] "stop_defunct"                      "stop_location_negative_positive"  
#> [29] "stop_corrupt_factor_levels"        "stop_incompatible_op"             
#> [31] "stop_lossy_cast"                   "stop_matches_multiple"            
#> [33] "stop_incompatible"                 "stop_incompatible_size"           
#> [35] "stop_location_negative_missing"    "stop_subscript"                   
#> [37] "stop_location_oob_non_consecutive" "stop_location_zero"               
#> [39] "stop_vctrs"

I don't want this PR, which really helps the dev experience, to get hung up on these suggestions though!

# TODO: Refactor as part of ARROW-17355 to prevent potential missed errors
handle_parquet_io_error <- function(e, format, call) {
msg <- conditionMessage(e)
handle_parquet_io_error <- function(msg, call, format) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are a substantial improvement over what we have now...no need to have this PR hung up on that! I tend to be on the debugging end of these error handlers, where I really do want to know what the original stack trace was, and I'd hoped that attaching the original error as the parent might help with that.

@ursabot
Copy link

ursabot commented Oct 1, 2022

Benchmark runs are scheduled for baseline = d60d8c6 and contender = 4dfa617. 4dfa617 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.17% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.32% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 4dfa6176 ec2-t3-xlarge-us-east-2
[Failed] 4dfa6176 test-mac-arm
[Failed] 4dfa6176 ursa-i9-9960x
[Finished] 4dfa6176 ursa-thinkcentre-m75q
[Finished] d60d8c6d ec2-t3-xlarge-us-east-2
[Failed] d60d8c6d test-mac-arm
[Failed] d60d8c6d ursa-i9-9960x
[Finished] d60d8c6d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
… dev experience (apache#14030)

For context; these `handle_*` functions originally caught an error and if it contained a particular string raised an augmented error message with extra guidance for the user (and if not, raised the original error message).

This became problematic in a later PR where we wanted to test multiple conditions and only raise the original error if none of the conditions were met - the temporary approach was to move the responsibility for the raising of the original error to outside of the `handle_*` functions.  The issue here is that this makes it easy for developers to forget to add in this line of code.

The proposed solution here implements a generic error augmentation function `augment_io_error_msg()`, which tests all conditions, raises an error with an augmented message if any conditions are met, or raises the original error if not.

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants