[R][C++] write_dataset() hangs when pivot_longer() operation is performed before #35649
Comments
I can confirm that this can be replicated in dev Arrow. I get the following error:
FWIW, I was only able to replicate this on Linux (macOS/dev arrow and Windows/release arrow successfully wrote the dataset). On Linux I got an additional note about C stack usage:
Interesting. Do you have an idea what the origin of the problem might be?
Normally when I see a stack usage warning like that it's because of excessive (or infinite) recursion. However, that doesn't seem to be the case here: it's a (refreshingly) readable stack trace.
There are one or two other active threads, and none of them seem to have a deep stack either.
I seem to remember that the input to DuckDB is something along the lines of "all the R objects needed to initialize a scan". It seems like what may be happening is that the call to initialize the scan is getting called from another thread... maybe the pivot operation requires more than one scan? When R code gets called from another thread (outside the Arrow R package, where we do this safely), the failure mode can be wildly variable.
Yes, I think this is an example of a non-R thread calling R code. This is an Arrow worker thread running the query. The Arrow worker is calling DuckDB and asking for the next batch. DuckDB is then calling into R and asking for the next batch. Perhaps DuckDB needs something like `SafeCallIntoR`. If Acero knew that the source was going to be calling into R, then Acero could use `SafeCallIntoR` to fetch the next batch, but from Acero's perspective it's just grabbing a batch from "some kind of C data producer". Another possibility would be to run the entire Acero plan synchronously (both I/O threads and CPU threads). Acero would never spawn a thread, so you'd always be on the R thread, but that would tank performance (especially running the I/O tasks synchronously).
There's unfortunately no universal solution to this at the moment, but something like that is definitely the long-term solution here. At the very least, duckdb could check whether it's on the main R thread and error otherwise (rather than hang or crash).
I think we might be able to make this one happen... this is basically a wrapper around a ... Given that literally all of those things are my fault, I'm happy to take a stab at fixing it!
Ah, yes, that makes sense. R is the one providing the C data stream source to Acero, so I suppose you have the power to wrap it. And it makes sense for you to assume that whatever it is will make R calls.
I think we can place a healthy amount of the blame both on myself and on R's ambitious threading model 😆
Or severe lack of ambition, as it may be.
… from the main R thread (#36307)

### Rationale for this change

When passing a DuckDB result to Arrow via `to_arrow()` whose input was an Arrow dataset, some DuckDB operations can call R code from other threads. This caused a crash or hang on Linux when attempting to combine `pivot_longer()` and `write_dataset()`.

### What changes are included in this PR?

- Added a wrapper class around the `RecordBatchReader` that routes calls to `ReadNext()` through `SafeCallIntoR()`.

### Are these changes tested?

I can't find a new case that isn't covered by our existing tests, although I did remove a skip that was causing a similar problem at one point (#33033). Because it's difficult to predict/test where duckdb evaluates R code, it's hard to know exactly what to test here (I would have expected R code to be evaluated/a crash to occur with many of our existing tests, but even the `pivot_longer()` example does not crash on macOS and Windows 🤷). I did verify on Ubuntu 22.04 that the reprex kindly provided by @PMassicotte errors before this PR and does not error after this PR:

```r
library(tidyverse)
library(arrow)

one_level_tree <- tempfile()

mtcars |>
  to_duckdb() |>
  pivot_longer(everything()) |>
  to_arrow() |>
  # collect() |> # collecting makes it work; otherwise it hangs on write_dataset()
  write_dataset(one_level_tree, partitioning = "name")

list.files(one_level_tree, recursive = TRUE)
```

### Are there any user-facing changes?

There are no user-facing changes.

* Closes: #35649

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Describe the bug, including details regarding any error messages, version, and platform.

I am trying to work on a dataset without having to pull it with `collect()`. I found out that if I use `pivot_longer()` in the chain of operations, `write_dataset()` hangs and nothing seems to happen.

Component(s)

R