-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience #14030
ARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience #14030
Conversation
|
r/R/util.R
Outdated
# handler function which checks for a number of different read errors | ||
augment_io_error_msg <- function(e, call, ...){ | ||
dots <- list2(...) | ||
msg <- conditionMessage(e) | ||
|
||
if (!is.null(dots[["schema"]])) { | ||
handle_csv_read_error(msg, call, dots[["schema"]]) | ||
} | ||
if (!is.null(dots[["format"]])) { | ||
handle_parquet_io_error(msg, call, dots[["format"]]) | ||
} | ||
|
||
handle_augmented_field_misuse(msg, call) | ||
abort(msg, call = call) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this approach of calling functions conditionally based on which optional arguments are supplied is reasonable or a bit of an anti-pattern.
Would something even more generic work? Something like
You'd pass a list of functions to check for those error conditions, like:
|
I don't think that would work as those functions have different arguments which also need passing in, to provide more context for if the error should be triggered, and passing in both the functions and their arguments here might be a bit overcomplicated? |
Thank you for taking this on...I frequently find myself putting a breakpoint in the I wonder if classed errors can help us here? We rlang::local_interactive()
read_csv <- function() {
rlang::abort("Some message", class = "special_csv_read_error")
}
some_other_fun <- function() {
tryCatch({
read_csv()
}, special_csv_read_error = function(e) {
rlang::abort("Some new message", parent = e)
})
}
some_other_fun()
#> Error in `some_other_fun()`:
#> ! Some new message
#> Caused by error in `read_csv()`:
#> ! Some message
rlang::last_error()
#> Error: Can't show last error because no error was recorded yet Created on 2022-09-26 by the reprex package (v2.0.1) |
(Whoops, in my interactive session I get a nice traceback from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bumped into this in #14250 (comment) and this PR fixes that nicely! (The issue there was that handle_csv_read_error()
was swallowing the error and not stop()
ing, and the rest of the function was rather confused when tab
had never been assigned).
Apologies for the indirection on the classed error comment...in this case those errors are coming directly from C++ and so we have no control over the error class (well, we could stick something in C++, but a battle for another day).
It's a matter of style, but functions that might abort()
or stop()
tend to be named stop_for_XXXX()
in the places I've seen them in the tidyverse.
r/R/util.R
Outdated
@@ -251,3 +235,19 @@ handle_augmented_field_misuse <- function(e, call) { | |||
is_compressed <- function(compression) { | |||
!identical(compression, "uncompressed") | |||
} | |||
|
|||
# handler function which checks for a number of different read errors | |||
augment_io_error_msg <- function(e, call, ...) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
augment_io_error_msg <- function(e, call, ...) { | |
augment_io_error_msg <- function(e, call, ..., schema = NULL, format = NULL) { |
(then below it becomes !is.null(schema)
and !is.null(format)
)
# TODO: Refactor as part of ARROW-17355 to prevent potential missed errors | ||
handle_parquet_io_error <- function(e, format, call) { | ||
msg <- conditionMessage(e) | ||
handle_parquet_io_error <- function(msg, call, format) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the abort()
below, we should probably be using abort(msg, call = call, parent = e)
. This will print out the original error, too, in addition to the pretty helpful message. For development it's nice because we generally are interested in the error coming from C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original message is already printed out as we append the helpful message to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've tried it out both ways, and see what parent = e
does, but I think it makes more sense the way it prints currently. What do you think?
The way it looks now:
Error in `open_dataset()`:
! Invalid: Error creating dataset. Could not read schema from 'file1.csv': Could not open Parquet input source '/tmp/RtmpLed3En/filebe14261f95ab2/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
/home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338 GetReader(source, scan_options). Is this a 'parquet' file?
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40 InspectSchemas(std::move(options))
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262 Inspect(options.inspect_options)
ℹ Did you mean to specify a 'format' other than the default (parquet)?
With parent = e
:
Error in `open_dataset()`:
! ℹ Did you mean to specify a 'format' other than the default (parquet)?
Caused by error:
! Invalid: Error creating dataset. Could not read schema from 'file1.csv': Could not open Parquet input source '/tmp/RtmpLed3En/filebe14261f95ab2/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
/home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338 GetReader(source, scan_options). Is this a 'parquet' file?
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40 InspectSchemas(std::move(options))
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262 Inspect(options.inspect_options)
Run `rlang::last_error()` to see where the error occurred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rather like the parent = e
version? The error relevant to the user is front and centre with a possible cause suggested under. Normally Caused by error:
contains the call
from the original error, but we're messing with that. I think that rlang::last_error()
and rlang::last_trace()
are both improved with the parent = e
version, too (I can play with it given a reprex, too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern here is that this error can be triggered both in the case we test for - the user omits the correct format specification and so we use the default (Parquet) - but also in the case that they are trying to read from a corrupted Parquet file. If the latter is true, the additional message is not actually helpful and may be misleading. The additional message coming second helps frame it as a suggestion rather than as the definitive solution to the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both are a substantial improvement over what we have now...no need to have this PR hung up on that! I tend to be on the debugging end of these error handlers, where I really do want to know what the original stack trace was, and I'd hoped that attaching the original error as the parent
might help with that.
handle_csv_read_error <- function(e, schema, call) { | ||
msg <- conditionMessage(e) | ||
|
||
handle_csv_read_error <- function(msg, call, schema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as for parquet...I think abort(msg, call = call, parent = e)
will give a pretty looking error message and include the original error.
# TODO: Refactor as part of ARROW-17355 to prevent potential missed errors | ||
handle_augmented_field_misuse <- function(e, call) { | ||
msg <- conditionMessage(e) | ||
handle_augmented_field_misuse <- function(msg, call) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use abort(..., parent = e)
here, too.
} | ||
|
||
handle_augmented_field_misuse(msg, call) | ||
abort(msg, call = call) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(no need for parent here since we're not modifying the error)
Do you have any examples of these? I did have a look but couldn't find anything. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're particularly prolific in vctrs (apparently without the "for" though!):
stringr::str_subset(names(asNamespace("vctrs")), "^stop_")
#> [1] "stop_subscript_oob" "stop_location_negative"
#> [3] "stop_non_list_type" "stop_assert"
#> [5] "stop_indicator_size" "stop_sf"
#> [7] "stop_incompatible_shape" "stop_native_implementation"
#> [9] "stop_matches" "stop_names"
#> [11] "stop_unimplemented" "stop_matches_incomplete"
#> [13] "stop_incompatible_cast" "stop_names_must_be_unique"
#> [15] "stop_input_type" "stop_names_cannot_be_dot_dot"
#> [17] "stop_assert_size" "stop_matches_nothing"
#> [19] "stop_names_cannot_be_empty" "stop_scalar_type"
#> [21] "stop_incompatible_type" "stop_corrupt_ordered_levels"
#> [23] "stop_matches_remaining" "stop_subscript_missing"
#> [25] "stop_unsupported" "stop_recycle_incompatible_size"
#> [27] "stop_defunct" "stop_location_negative_positive"
#> [29] "stop_corrupt_factor_levels" "stop_incompatible_op"
#> [31] "stop_lossy_cast" "stop_matches_multiple"
#> [33] "stop_incompatible" "stop_incompatible_size"
#> [35] "stop_location_negative_missing" "stop_subscript"
#> [37] "stop_location_oob_non_consecutive" "stop_location_zero"
#> [39] "stop_vctrs"
I don't want this PR, which really helps the dev experience, to get hung up on these suggestions though!
# TODO: Refactor as part of ARROW-17355 to prevent potential missed errors | ||
handle_parquet_io_error <- function(e, format, call) { | ||
msg <- conditionMessage(e) | ||
handle_parquet_io_error <- function(msg, call, format) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both are a substantial improvement over what we have now...no need to have this PR hung up on that! I tend to be on the debugging end of these error handlers, where I really do want to know what the original stack trace was, and I'd hoped that attaching the original error as the parent
might help with that.
Benchmark runs are scheduled for baseline = d60d8c6 and contender = 4dfa617. 4dfa617 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
… dev experience (apache#14030) For context; these `handle_*` functions originally caught an error and if it contained a particular string raised an augmented error message with extra guidance for the user (and if not, raised the original error message). This became problematic in a later PR where we wanted to test multiple conditions and only raise the original error if none of the conditions were met - the temporary approach was to move the responsibility for the raising of the original error to outside of the `handle_*` functions. The issue here is that this makes it easy for developers to forget to add in this line of code. The proposed solution here implements a generic error augmentation function `augment_io_error_msg()`, which tests all conditions, raises an error with an augmented message if any conditions are met, or raises the original error if not. Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
For context; these
handle_*
functions originally caught an error and if it contained a particular string raised an augmented error message with extra guidance for the user (and if not, raised the original error message).This became problematic in a later PR where we wanted to test multiple conditions and only raise the original error if none of the conditions were met - the temporary approach was to move the responsibility for the raising of the original error to outside of the
handle_*
functions. The issue here is that this makes it easy for developers to forget to add in this line of code.The proposed solution here implements a generic error augmentation function
augment_io_error_msg()
, which tests all conditions, raises an error with an augmented message if any conditions are met, or raises the original error if not.