[R] R hangs when read_csv_arrow after set_io_thread_count(1) #36121

Closed
tdhock opened this issue Jun 16, 2023 · 4 comments · Fixed by #36304

@tdhock
Contributor

tdhock commented Jun 16, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I set the number of IO threads to 1 and expected to still be able to read a CSV file. Instead, the R interpreter hangs, perhaps in an infinite loop, and cannot be interrupted, even with Control-C. I expected to at least be able to cancel the command with Control-C.

If 1 IO thread is not supported, I would have at least expected an error message after running arrow::set_io_thread_count(1) such as "one IO thread is not allowed, please use at least two IO threads."
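A guard of the kind described above could look like the following sketch. `set_io_threads_checked()` is a hypothetical wrapper, not an arrow function. The setter is passed in as an argument so the sketch runs without arrow installed, and the minimum of 2 is taken from the suggested error message rather than from any documented limit.

```r
# Hypothetical wrapper (not part of arrow): refuse IO thread counts
# below 2 instead of letting a later read hang.
set_io_threads_checked <- function(num_threads, setter) {
  stopifnot(length(num_threads) == 1)
  num_threads <- as.integer(num_threads)
  if (is.na(num_threads) || num_threads < 2L) {
    stop("one IO thread is not allowed, please use at least two IO threads.",
         call. = FALSE)
  }
  setter(num_threads)
  invisible(num_threads)
}

# With arrow installed, the real setter would be passed in:
# set_io_threads_checked(2, arrow::set_io_thread_count)
```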

I also expected the man page for read_csv_arrow to mention how to control the number of threads used for CSV reading, but it does not mention threads at all. Something like "use arrow::set_cpu_count(N_CPUS) to tell arrow to use N_CPUS for reading the CSV file" on that man page would be useful.
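For reference, a minimal sketch of the thread controls in question. It assumes the arrow package is installed and does nothing otherwise. The split between the two pools (CPU pool for compute work such as parsing, IO pool for file reads) is my reading of Arrow's threading model, not something the read_csv_arrow man page states.

```r
# Assumes the arrow package is installed; otherwise nothing is changed.
arrow_available <- requireNamespace("arrow", quietly = TRUE)

if (arrow_available) {
  old_cpu <- arrow::cpu_count()        # current CPU thread pool size
  old_io <- arrow::io_thread_count()   # current IO thread pool size

  arrow::set_cpu_count(2)              # threads for compute work (assumed to include parsing)
  arrow::set_io_thread_count(2)        # stay at 2 or more; 1 triggers the hang reported here

  # ... read files here, e.g. arrow::read_csv_arrow("iris.csv") ...

  # Restore the previous settings.
  arrow::set_cpu_count(old_cpu)
  arrow::set_io_thread_count(old_io)
}
```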

Related issues

Here is a minimal reproducible example R script:

write.csv(iris,"iris.csv")
arrow::io_thread_count()
sessionInfo()
head(arrow::read_csv_arrow("iris.csv"))
arrow::set_io_thread_count(2)
head(arrow::read_csv_arrow("iris.csv"))
arrow::set_io_thread_count(1)
head(arrow::read_csv_arrow("iris.csv"))

Output when running on a Linux laptop:

(base) tdhock@tdhock-MacBook:~/R$ R --vanilla < arrow-hang.R 

R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> write.csv(iris,"iris.csv")
> arrow::io_thread_count()
[1] 8
> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0 bit_4.0.5        compiler_4.3.0   magrittr_2.0.3  
 [5] assertthat_0.2.1 R6_2.5.1         cli_3.6.1        glue_1.6.2      
 [9] bit64_4.0.5      vctrs_0.6.3      lifecycle_1.0.3  arrow_12.0.0    
[13] rlang_1.1.1      purrr_1.0.1     
> head(arrow::read_csv_arrow("iris.csv"))
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1          5.1         3.5          1.4         0.2  setosa
2 2          4.9         3.0          1.4         0.2  setosa
3 3          4.7         3.2          1.3         0.2  setosa
4 4          4.6         3.1          1.5         0.2  setosa
5 5          5.0         3.6          1.4         0.2  setosa
6 6          5.4         3.9          1.7         0.4  setosa
> arrow::set_io_thread_count(2)
> head(arrow::read_csv_arrow("iris.csv"))
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1          5.1         3.5          1.4         0.2  setosa
2 2          4.9         3.0          1.4         0.2  setosa
3 3          4.7         3.2          1.3         0.2  setosa
4 4          4.6         3.1          1.5         0.2  setosa
5 5          5.0         3.6          1.4         0.2  setosa
6 6          5.4         3.9          1.7         0.4  setosa
> arrow::set_io_thread_count(1)
> head(arrow::read_csv_arrow("iris.csv"))

Output when running on a Linux server:

th798@cn36:~/R$ R --vanilla < arrow-hang.R

R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C" 
2: Setting LC_COLLATE failed, using "C" 
3: Setting LC_TIME failed, using "C" 
4: Setting LC_MESSAGES failed, using "C" 
5: Setting LC_MONETARY failed, using "C" 
6: Setting LC_PAPER failed, using "C" 
7: Setting LC_MEASUREMENT failed, using "C" 
> write.csv(iris,"iris.csv")
> arrow::io_thread_count()
[1] 8
> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.7 (Ootpa)

Matrix products: default
BLAS:   /projects/genomic-ml/lib64/R/lib/libRblas.so 
LAPACK: /projects/genomic-ml/lib64/R/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
[1] C

time zone: America/Phoenix
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0 bit_4.0.5        compiler_4.3.0   magrittr_2.0.3  
 [5] assertthat_0.2.1 R6_2.5.1         cli_3.6.1        glue_1.6.2      
 [9] bit64_4.0.5      vctrs_0.6.2      lifecycle_1.0.3  arrow_11.0.0.3  
[13] rlang_1.1.0      purrr_1.0.1     
> head(arrow::read_csv_arrow("iris.csv"))
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1          5.1         3.5          1.4         0.2  setosa
2 2          4.9         3.0          1.4         0.2  setosa
3 3          4.7         3.2          1.3         0.2  setosa
4 4          4.6         3.1          1.5         0.2  setosa
5 5          5.0         3.6          1.4         0.2  setosa
6 6          5.4         3.9          1.7         0.4  setosa
> arrow::set_io_thread_count(2)
> head(arrow::read_csv_arrow("iris.csv"))
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1          5.1         3.5          1.4         0.2  setosa
2 2          4.9         3.0          1.4         0.2  setosa
3 3          4.7         3.2          1.3         0.2  setosa
4 4          4.6         3.1          1.5         0.2  setosa
5 5          5.0         3.6          1.4         0.2  setosa
6 6          5.4         3.9          1.7         0.4  setosa
> arrow::set_io_thread_count(1)
> head(arrow::read_csv_arrow("iris.csv"))

On both computers the last command hangs (infinite loop?) and cannot be interrupted, even with Control-C.
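Because the hang cannot be interrupted from inside the session, one way to reproduce it without losing the interactive R session is to run the script in a child process with a time limit. The sketch below uses only base R. `run_r_with_timeout()` is a hypothetical helper name, and whether the timeout can actually kill a child stuck in this particular hang has not been verified here.

```r
# Run R code in a child Rscript process and give up after `seconds`.
# Returns the child's exit status; system2() reports 124 on timeout.
run_r_with_timeout <- function(code, seconds) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))
  writeLines(code, script)
  rscript <- file.path(R.home("bin"), "Rscript")
  # system2() warns on timeout or non-zero exit; the status is enough here.
  suppressWarnings(
    system2(rscript, c("--vanilla", shQuote(script)),
            stdout = FALSE, stderr = FALSE, timeout = seconds)
  )
}

# Example call for the script in this report (requires arrow):
# status <- run_r_with_timeout(c(
#   'arrow::set_io_thread_count(1)',
#   'head(arrow::read_csv_arrow("iris.csv"))'
# ), seconds = 30)
```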

Component(s)

R

@tdhock
Contributor Author

tdhock commented Jun 16, 2023

The Arrow documentation page on the threading model, https://arrow.apache.org/docs/cpp/threading.html, does not mention a minimum number of IO threads.
Could a link to that page be added to the R man pages for arrow::cpu_count and arrow::io_thread_count?

@thisisnic
Member

Can confirm that this can be reproduced on arrow 12.0.1 on Ubuntu 22.04, and agreed that we should warn & document better. Thanks for reporting this!

@thisisnic changed the title from "R hangs when read_csv_arrow after set_io_thread_count(1)" to "[R] R hangs when read_csv_arrow after set_io_thread_count(1)" on Jun 16, 2023
@paleolimbot
Member

This one is my fault 😬 ...we hijack the IO thread pool to make it possible to call into R (e.g., user-defined functions, R connections as input) while doing certain Arrow tasks ( https://github.com/apache/arrow/blob/main/r/src/safe-call-into-r.h#L315 ). I imagine that there is some Arrow code that makes the usually safe assumption that there is at least one available IO thread.
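As a rough illustration of why a pool of one could deadlock, here is a toy model in plain R. It is not Arrow code, and the exact mechanism is not confirmed in this thread. The model assumes the hijacked task occupies one IO worker for its whole duration and cannot finish until some inner IO task also gets a worker from the same pool.

```r
# Toy model (not Arrow code): a pool with `capacity` workers. The outer
# task holds one worker and needs an inner task to run on the same pool
# before it can finish.
outer_task_can_finish <- function(capacity) {
  stopifnot(capacity >= 1)
  free_workers <- capacity - 1  # the outer task occupies one worker
  # The inner task can only start if a worker is still free; otherwise
  # the outer task waits forever, which is the hang in this model.
  free_workers >= 1
}

outer_task_can_finish(1)  # FALSE: the single worker is already taken
outer_task_can_finish(2)  # TRUE: one worker is left for the inner task
```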

@westonpace
Member

> I imagine that there is some Arrow code that makes the usually safe assumption that there is at least one available IO thread.

Yes 😆

paleolimbot added a commit that referenced this issue Jun 27, 2023
…#36304)

### Rationale for this change

Setting the number of threads in the IO thread pool to 1 causes a hang or crash when using some functions (notably: any Acero exec plan).

### What changes are included in this PR?

`set_io_thread_count()` now warns for `num_threads == 1`:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

# Already errors from C++
set_io_thread_count(0)
#> Error: Invalid: ThreadPool capacity must be > 0
# New warning!
set_io_thread_count(1)
#> Warning: `arrow::set_io_thread_count()` with num_threads < 2 may
#> cause certain operations to hang or crash.
#> ℹ Use num_threads >= 2 to support all operations
# No warning!
set_io_thread_count(2)
```

<sup>Created on 2023-06-26 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes: some existing code may issue a warning that previously did not. Documentation was added.
* Closes: #36121

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
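
Since the fix is a warning rather than an error, calling code that prefers to fail fast could promote the warning itself. The sketch below shows one way to do that with base R condition handling. `set_threads_strict()` is a hypothetical helper, and the setter is passed in so the example runs without arrow.

```r
# Hypothetical helper: call `setter(num_threads)` and turn any warning it
# raises into an error, so a risky thread count stops the script.
set_threads_strict <- function(num_threads, setter) {
  tryCatch(
    setter(num_threads),
    warning = function(w) stop(conditionMessage(w), call. = FALSE)
  )
  invisible(num_threads)
}

# With arrow 13.0.0 or later installed, usage might look like:
# set_threads_strict(1, arrow::set_io_thread_count)  # would raise an error
```

Note that `tryCatch()` unwinds at the point of the warning, so whether the new count has already been applied at that moment depends on where the setter warns, which the commit message does not show.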
@paleolimbot paleolimbot added this to the 13.0.0 milestone Jun 27, 2023