Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35649: [R] Always call RecordBatchReader::ReadNext() from DuckDB from the main R thread #36307

Merged
merged 2 commits into from Jun 29, 2023

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Jun 26, 2023

Rationale for this change

When passing a DuckDB result to Arrow via to_arrow() whose input was an Arrow dataset, calls to R code from other threads can occur in some DuckDB operations. This caused a crash or hang on Linux when attempting to combine pivot_longer() and write_dataset().

What changes are included in this PR?

  • Added a wrapper class around the RecordBatchReader that routes calls to ReadNext() through SafeCallIntoR().

Are these changes tested?

I can't find a new case that isn't covered by our existing tests, although I did remove a skip that was causing a similar problem at one point (#33033). Because it's difficult to predict/test where duckdb evaluates R code, it's hard to know exactly what to test here (I would have expected R code to be evaluated/a crash to occur with many of our existing tests, but even the pivot_longer() example does not crash on MacOS and Windows 馃し ).

I did verify on Ubuntu 22.04 that the reprex kindly provided by @PMassicotte errors before this PR and does not error after this PR:

library(tidyverse)
library(arrow)

one_level_tree <- tempfile()

mtcars |>
  to_duckdb() |>
  pivot_longer(everything()) |>
  to_arrow() |>
  # collect() |> # collecting make it work, otherwise, it hangs on write_dataset()
  write_dataset(one_level_tree, partitioning = "name")

list.files(one_level_tree, recursive = TRUE)

Are there any user-facing changes?

There are no user facing changes.

@github-actions
Copy link

鈿狅笍 GitHub issue #35649 has been automatically assigned in GitHub to PR creator.

@PMassicotte
Copy link

Thank you very much @paleolimbot for looking and fixing the issue! Really love the Arrow community!

@paleolimbot paleolimbot marked this pull request as ready for review June 27, 2023 14:54
@thisisnic
Copy link
Member

This all looks good, so will approve shortly, but @paleolimbot, mainly for the sake of my own understanding, would you mind expanding on why routing calls to ReadNext() through SafeCallIntoR() is needed? What is SafeCallIntoR() doing?

@paleolimbot
Copy link
Member Author

It's a good thing for all of our sanity!

SafeCallIntoR() either evaluates the expression it contains once on the main R thread OR errors it can't. The RunWithCapturedR() wrapper launches a thread (via the IO thread pools), sets up a "listener" on the main R thread that waits for requests to evaluate R code from SafeCallIntoR(). It's not a perfect demo, but there's some test code that demonstrates this:

auto result = RunWithCapturedR<std::string>([&thread, r_fun_that_returns_a_string]() {
auto fut = arrow::Future<std::string>::Make();
thread = std::thread([&fut, r_fun_that_returns_a_string]() {
auto result = SafeCallIntoR<std::string>(
[&] { return cpp11::as_cpp<std::string>(r_fun_that_returns_a_string()); });
fut.MarkFinished(result);
});
return fut;
});
if (thread.joinable()) {
thread.join();
}

The main difference with actual code is that the call to SafeCallIntoR() is usually buried deep in some utility code (e.g., read bytes from an R connection by calling the base R readBin() function) and the top-level call usually knows nothing about this (e.g., "Read this parquet file").

If SafeCallIntoR() is invoked from the R thread, it just evaluates the function without any of that dance. Basically, it is used everywhere we call anything from the R API (as in #include <R.h>) or cpp11 (because cpp11 wraps a lot of calls to R.h functions with the assumption that all of this is happening on the main thread).

Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these changes!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Jun 27, 2023
@thisisnic thisisnic merged commit a75299d into apache:main Jun 29, 2023
14 checks passed
@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit a75299d8.

There were 9 benchmark results indicating a performance regression:

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R][C++] write_dataset() hangs when pivot_() operation is performed before
3 participants