[R][C++] Allow cancelling long-running commands #27688
Comments
Dewey Dunnington / @paleolimbot: SafeCallIntoRVoid([]() { cpp11::check_user_interrupt(); }) anywhere in C++ and it will return a non-OK status if there's a pending interrupt. That will only work for tasks run with If we have a Future that we can cancel, we could rig something similar, maybe using our own event loop (currently we use Arrow's RunInSerialExecutor and I don't know how customizable that is). In addition to RMonitor, there's also the 'later' package ( https://github.com/r-lib/later ) which can also run event loops although I don't know how customizable they are. In the R package we have the |
Cancellation is supported in C++ but not via cancellable futures (and probably won't be). Instead, operations which support cancellation take in a stop token. A stop token is something that the C++ code can poll on a regular basis to see if cancellation has been requested (very similar to The stop token is connected to a stop source which the user holds onto. If the user marks the stop source as cancelled then the stop token will see the cancellation the next time it is polled and exit. It sounds like, for R, this stop source approach won't work (is there no way to register a callback that gets called on cancellation instead of requiring polling?) In that case maybe we want a custom stop token implementation for R. This stop token's poll method could check |
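To make the stop source / stop token relationship described above concrete, here is a minimal standalone sketch. It is not Arrow's implementation: `SketchStopSource`, `SketchStopToken` and `ProcessChunks` are hypothetical names used only for illustration, and the shared atomic flag is an assumption about one simple way to connect the two objects.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical token: the worker polls it to learn whether to stop.
class SketchStopToken {
 public:
  explicit SketchStopToken(std::shared_ptr<std::atomic<bool>> flag)
      : flag_(std::move(flag)) {}

  // True once the owning source has requested cancellation.
  bool IsStopRequested() const {
    return flag_->load(std::memory_order_acquire);
  }

 private:
  std::shared_ptr<std::atomic<bool>> flag_;
};

// Hypothetical source: the caller holds it and may request a stop.
class SketchStopSource {
 public:
  SketchStopSource() : flag_(std::make_shared<std::atomic<bool>>(false)) {}

  void RequestStop() { flag_->store(true, std::memory_order_release); }

  // Every token handed out observes the same flag.
  SketchStopToken token() const { return SketchStopToken(flag_); }

 private:
  std::shared_ptr<std::atomic<bool>> flag_;
};

// Stand-in for a long-running operation: polls the token before each
// chunk and returns how many chunks were processed before stopping.
std::size_t ProcessChunks(std::size_t n_chunks, const SketchStopToken& token,
                          const std::function<void(std::size_t)>& on_chunk) {
  std::size_t done = 0;
  for (std::size_t i = 0; i < n_chunks; ++i) {
    if (token.IsStopRequested()) break;  // cancellation seen at the next poll
    on_chunk(i);
    ++done;
  }
  return done;
}
```

The operation only notices cancellation at its next poll, which is why a chunk that is already in progress still completes.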
Dewey Dunnington / @paleolimbot: |
Dewey Dunnington / @paleolimbot: |
That's not supported today, but it should be possible, and it seems like a pretty good way to solve this problem. One challenge is that we want to make sure cleanup still happens (e.g. tasks that close file handles). However, there might be some middle ground here.
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: and the C++ APIs that it relies on: They allow a StopToken to be cancelled automatically when one of a set of signals is received. That should in turn interrupt whatever primitive is checking for StopToken cancellations (such as CSV reading).
Dewey Dunnington / @paleolimbot:
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
tf <- tempfile()
readr::write_csv(vctrs::vec_rep(mtcars, 5e5), tf)
# try to slow down CSV reading
set_cpu_count(1)
set_io_thread_count(2)
# compare timing of cancelled vs not cancelled (hard to tell the difference)
system.time(read_csv_arrow(tf))
#> user system elapsed
#> 2.852 0.637 5.365
system.time(open_dataset(tf, format = "csv") |> dplyr::collect())
#> user system elapsed
#> 2.920 0.219 3.049
# compare responsiveness of cancelling the read using other APIs
# (usually quite a difference)
system.time(readr::read_csv(tf))
#> Rows: 16000000 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> user system elapsed
#> 19.424 1.267 3.496
system.time(read.csv(tf))
#> user system elapsed
#> 20.858 0.718 21.864
It seems like we would need some sort of "run this bit of code in XX seconds" to implement this in the R bindings (or if there's an easier way that would be great!). It doesn't matter what thread it's on because |
Antoine Pitrou / @pitrou: |
Dewey Dunnington / @paleolimbot: |
Dewey Dunnington / @paleolimbot:
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
tf <- tempfile()
readr::write_csv(vctrs::vec_rep(mtcars, 5e5), tf)
# try to slow down CSV reading
set_cpu_count(1)
set_io_thread_count(2)
# you can cancel if you use an R connection because the error propagation
# will also propagate the interrupt condition
system.time(read_csv_arrow(file(tf)))
#> user system elapsed
#> 2.909 0.598 3.410
Antoine Pitrou / @pitrou: Instead there should be a public way to get that future without waiting for it. Then you can wait for the future yourself, but with a timeout to implement whatever polling you want. Of course, polling is suboptimal. Ideally there would be a way to temporarily override R's signal handlers, and reuse the same strategy as in Python. Perhaps that can be done using plain |
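A rough sketch of the "wait for the future yourself, but with a timeout" idea, using only the C++ standard library. Everything here is an assumption for illustration: `WaitWithPolling` and `CountUntilStopped` are hypothetical names, `interrupt_pending` stands in for whatever main-thread check the R bindings would use, and a plain `std::atomic<bool>` stands in for a stop source.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <future>
#include <thread>

// Waits for `result`, waking up every `interval` to ask `interrupt_pending`
// whether the user wants to stop. On the first interrupt it sets
// `stop_requested` and then keeps waiting, so the worker can finish its
// cleanup before the caller returns. Returns true if an interrupt was seen.
template <typename T>
bool WaitWithPolling(std::future<T>& result,
                     const std::function<bool()>& interrupt_pending,
                     std::atomic<bool>& stop_requested,
                     std::chrono::milliseconds interval) {
  bool interrupted = false;
  while (result.wait_for(interval) != std::future_status::ready) {
    if (!interrupted && interrupt_pending()) {
      interrupted = true;
      stop_requested.store(true);
    }
  }
  return interrupted;
}

// Stand-in for a long-running task: counts until told to stop or until
// `limit` is reached, and returns the count.
int CountUntilStopped(const std::atomic<bool>& stop_requested, int limit) {
  int n = 0;
  while (n < limit && !stop_requested.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    ++n;
  }
  return n;
}
```

The task would be started with `std::async(std::launch::async, ...)`; with a deferred launch, `wait_for` never reports `ready` and the loop would not terminate. The polling interval trades responsiveness against wake-ups on the main thread, which is the cost the comment above calls suboptimal.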
Antoine Pitrou / @pitrou: arrow/python/pyarrow/error.pxi Lines 223 to 242 in 2519230
... except that PyErr_SetInterrupt would become either Rf_onintr or Rf_onintrNoResume (not sure which one). And I don't think PyErr_CheckSignals needs an R equivalent at all.
As for setting and restoring signal handlers, you would do that in C++ because R doesn't seem to have any equivalent APIs (so you would use our own |
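For reference, temporarily overriding and then restoring a SIGINT handler from C++ could look roughly like the sketch below. It uses only `std::signal` from the standard library, not Arrow's own helpers and not any R API; `ScopedSigintHandler` and `SketchHandleSigint` are hypothetical names. The handler only records that the signal arrived, on the assumption that the main thread would later turn that into a stop request and forward the interrupt to R.

```cpp
#include <csignal>

// Written from the signal handler. volatile std::sig_atomic_t is the type
// the standard guarantees can be safely assigned inside a handler.
static volatile std::sig_atomic_t g_interrupt_received = 0;

// Hypothetical handler: does nothing except record the interrupt.
extern "C" void SketchHandleSigint(int) { g_interrupt_received = 1; }

// Installs the sketch handler for SIGINT on construction and restores
// whatever handler was there before on destruction.
class ScopedSigintHandler {
 public:
  ScopedSigintHandler() {
    g_interrupt_received = 0;  // reset before the handler can fire
    previous_ = std::signal(SIGINT, SketchHandleSigint);
  }

  ~ScopedSigintHandler() {
    if (previous_ != SIG_ERR) std::signal(SIGINT, previous_);
  }

  ScopedSigintHandler(const ScopedSigintHandler&) = delete;
  ScopedSigintHandler& operator=(const ScopedSigintHandler&) = delete;

  // True if SIGINT arrived while this guard was active.
  bool interrupted() const { return g_interrupt_received != 0; }

 private:
  void (*previous_)(int);
};
```

A guard like this would wrap the long-running call, so the previous handler (R's, in this context) is back in place as soon as the call returns or unwinds.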
Dewey Dunnington / @paleolimbot: |
When calling a long-running task (for example reading a CSV file) from the R prompt, users may want to interrupt with Ctrl-C.
Allowing this will require integrating R's user interruption facility with the cancellation API that's going to be exposed in C++ (see ARROW-8732).
Below is some information I've gathered on the topic:
There is some hairy discussion of how to interrupt C++ code from R at https://stackoverflow.com/questions/40563522/r-how-to-write-interruptible-c-function-and-recover-partial-results and https://stat.ethz.ch/pipermail/r-devel/2011-April/060714.html .
It seems it may involve polling cpp11::check_user_interrupt() and catching any cpp11::unwind_exception that may signal an interruption. A complication is that apparently R APIs should only be called from the main thread. There's also a small library which claims to make writing all this easier: https://github.com/tnagler/RcppThread/blob/master/inst/include/RcppThread/RMonitor.hpp
But since user interruptions will only be noticed by the R main thread, the solution may be to launch heavy computations (e.g. CSV reading) in a separate thread and have the main R thread periodically poll for interrupts while waiting for the separate thread. This is what this dedicated thread class does in its join method: https://github.com/tnagler/RcppThread/blob/master/inst/include/RcppThread/Thread.hpp#L79
Reporter: Antoine Pitrou / @pitrou
Assignee: Dewey Dunnington / @paleolimbot
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-11841. Please see the migration documentation for further details.