[R] Casting columns using dplyr::mutate in arrow datasets results in NA values #34519
Comments
I can reproduce this on 5b2fbad. The same problem occurs with the csv format.
> mtcars %>% write_dataset('./mtcars/', format = "csv")
> ds <- open_dataset('./mtcars', format = "csv")
> ds %>% dplyr::mutate(mpg=as.numeric(mpg)) %>% dplyr::collect()
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
1 NA 6 160 110 3.9 2.62 16.5 0 1 4 4
2 NA 6 160 110 3.9 2.88 17.0 0 1 4 4
3 NA 4 108 93 3.85 2.32 18.6 1 1 4 1
4 NA 6 258 110 3.08 3.22 19.4 1 0 3 1
5 NA 8 360 175 3.15 3.44 17.0 0 0 3 2
6 NA 6 225 105 2.76 3.46 20.2 1 0 3 1
7 NA 8 360 245 3.21 3.57 15.8 0 0 3 4
8 NA 4 147. 62 3.69 3.19 20 1 0 4 2
9 NA 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 NA 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
Thanks for reporting this @egillax; I can confirm that this issue is not present in 11.0.0.3, but is present in the dev build. I'll investigate further.
A little more investigation of the specific circumstances in which it does and does not occur:
library(arrow)
library(dplyr)
# no problem when replacing column with self when there is just 1 column
df <- tibble::tibble(x = 1:10)
tf <- tempfile()
dir.create(tf)
write_dataset(df, tf)
open_dataset(tf) %>%
mutate(x = as.numeric(x)) %>%
collect()
#> # A tibble: 10 × 1
#> x
#> <dbl>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
# NA values when there are 2 columns
df <- tibble::tibble(x = 1:10, y = 1:10)
tf <- tempfile()
dir.create(tf)
write_dataset(df, tf)
open_dataset(tf) %>%
mutate(x = as.numeric(x)) %>%
collect()
#> # A tibble: 10 × 2
#> x y
#> <dbl> <int>
#> 1 NA 1
#> 2 NA 2
#> 3 NA 3
#> 4 NA 4
#> 5 NA 5
#> 6 NA 6
#> 7 NA 7
#> 8 NA 8
#> 9 NA 9
#> 10 NA 10
# works fine if we're creating a brand new column
open_dataset(tf) %>%
mutate(z = as.numeric(x)) %>%
collect()
#> # A tibble: 10 × 3
#> x y z
#> <int> <int> <dbl>
#> 1 1 1 1
#> 2 2 2 2
#> 3 3 3 3
#> 4 4 4 4
#> 5 5 5 5
#> 6 6 6 6
#> 7 7 7 7
#> 8 8 8 8
#> 9 9 9 9
#> 10 10 10 10
# works fine if we're replacing a different column
open_dataset(tf) %>%
mutate(y = as.numeric(x)) %>%
collect()
#> # A tibble: 10 × 2
#> x y
#> <int> <dbl>
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
#> 7 7 7
#> 8 8 8
#> 9 9 9
#> 10 10 10
# works fine with in-memory datasets when replacing existing columns
InMemoryDataset$create(df) %>%
mutate(x = as.numeric(x)) %>%
collect()
#> # A tibble: 10 × 2
#> x y
#> <dbl> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
#> 7 7 7
#> 8 8 8
#> 9 9 9
#> 10 10 10
Given it works with 11.0.0.3 and not the dev version of the R package, and there are very few R code changes since 11.0.0.3, I'm inclined to think that this could be something happening at the C++ level. I'll try to narrow it down to the PR which caused this change.
I've managed to narrow it down to #33770, which is where it first broke. CC @nealrichardson
I can take a look.
…field (#34576)
### Rationale for this change
Fixes #34519. #33770 introduced the bug; I had [asked](https://github.com/apache/arrow/pull/33770/files#r1081612013) in the review why the C++ function wasn't using `FieldsInExpression`. I swapped that in, and the test I added to reproduce the bug now passes.
### What changes are included in this PR?
Fix for the C++ function, test in R.
### Are these changes tested?
Yes
### Are there any user-facing changes?
The behavior observed in the report no longer happens.
* Closes: #34519
Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
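For anyone who wants to check whether their own build is affected, here is a minimal sketch of a regression check based on the two-column case reported above. It is not the test that was added in the fix; `check_cast_in_place()` is a hypothetical helper name, and the sketch assumes the `arrow` and `dplyr` packages are installed.

```r
library(arrow)
library(dplyr)

# Hypothetical helper (not part of arrow): writes a two-column dataset to a
# temporary directory, replaces column `x` with as.numeric(x) via mutate(),
# and returns TRUE only if the collected values survive the cast intact.
check_cast_in_place <- function() {
  df <- data.frame(x = 1:10, y = 1:10)

  tf <- tempfile()
  dir.create(tf)
  on.exit(unlink(tf, recursive = TRUE), add = TRUE)
  write_dataset(df, tf)

  result <- open_dataset(tf) %>%
    mutate(x = as.numeric(x)) %>%
    collect()

  # On an affected build, x comes back as all NA, so sort() drops every
  # value and the comparison below fails.
  is.double(result$x) &&
    !anyNA(result$x) &&
    isTRUE(all.equal(sort(result$x), as.numeric(1:10)))
}

check_cast_in_place()
```

Based on the reports in this thread, this should return `TRUE` on 11.0.0.3 and on builds that include the fix, and `FALSE` on dev builds between #33770 and the fix.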
Describe the bug, including details regarding any error messages, version, and platform.
I was testing the latest arrow develop version using this method to install from git.
And now it seems I cannot cast columns in a dataset; it results in NA values.
I tried using both parquet and arrow files. This does work using the latest version on CRAN (11.0.0.3) and using arrow tables instead of datasets.
Reprex:
Created on 2023-03-09 with reprex v2.0.2
Arrow Info
Arrow package version: 11.0.0.9000
Capabilities:
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 FALSE
gcs FALSE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip FALSE
brotli FALSE
zstd FALSE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 FALSE
jemalloc FALSE
mimalloc TRUE
To reinstall with more optional capabilities enabled, see
https://arrow.apache.org/docs/r/articles/install.html
Memory:
Allocator mimalloc
Current 13.31 Kb
Max 46.31 Mb
Runtime:
SIMD Level avx2
Detected SIMD Level avx2
Build:
C++ Library Version 12.0.0-SNAPSHOT
C++ Compiler GNU
C++ Compiler Version 12.2.0
Git ID b679a96
sessionInfo
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=nl_NL.UTF-8 LC_NAME=nl_NL.UTF-8
[9] LC_ADDRESS=nl_NL.UTF-8 LC_TELEPHONE=nl_NL.UTF-8 LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=nl_NL.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_11.0.0.9000 dplyr_1.0.10 PatientLevelPrediction_6.2.0.9000
loaded via a namespace (and not attached):
[1] pkgload_1.3.2 bit64_4.0.5 jsonlite_1.8.4 DatabaseConnector_6.0.0 R.utils_2.12.2
[6] shiny_1.7.4 assertthat_0.2.1 highr_0.10 blob_1.2.3 remotes_2.4.2
[11] yaml_2.3.6 sessioninfo_1.2.2 pillar_1.8.1 RSQLite_2.2.18 lattice_0.20-45
[16] glue_1.6.2 reticulate_1.26 digest_0.6.31 promises_1.2.0.1 htmltools_0.5.4
[21] httpuv_1.6.8 Matrix_1.5-1 R.oo_1.25.0 clipr_0.8.0 pkgconfig_2.0.3
[26] devtools_2.4.5 purrr_1.0.1 xtable_1.8-4 processx_3.8.0 later_1.3.0
[31] ParallelLogger_3.0.1 tibble_3.1.8 styler_1.9.0 generics_0.1.3 usethis_2.1.6
[36] ellipsis_0.3.2 cachem_1.0.6 withr_2.5.0 cli_3.6.0 magrittr_2.0.3
[41] crayon_1.5.2 mime_0.12 memoise_2.0.1 evaluate_0.20 ps_1.7.2
[46] R.methodsS3_1.8.2 Andromeda_1.0.0 fs_1.5.2 fansi_1.0.3 R.cache_0.16.0
[51] pkgbuild_1.4.0 SqlRender_1.12.0 profvis_0.3.7 tools_4.2.2 data.table_1.14.4
[56] prettyunits_1.1.1 lifecycle_1.0.3 stringr_1.5.0 reprex_2.0.2 callr_3.7.3
[61] compiler_4.2.2 rlang_1.0.6 grid_4.2.2 rstudioapi_0.14 htmlwidgets_1.6.1
[66] miniUI_0.1.1.1 rmarkdown_2.19 DBI_1.1.3 R6_2.5.1 knitr_1.41
[71] fastmap_1.1.0 bit_4.0.4 utf8_1.2.2 stringi_1.7.12 rJava_1.0-6
[76] parallel_4.2.2 Rcpp_1.0.9 vctrs_0.5.1 png_0.1-7 urlchecker_1.0.1
[81] tidyselect_1.2.0 FeatureExtraction_3.2.0 xfun_0.36
Component(s)
R