[R] Casting columns using dplyr::mutate in arrow datasets results in NA values #34519

Closed
egillax opened this issue Mar 9, 2023 · 5 comments · Fixed by #34576
Labels: Component: R, Priority: Blocker, Type: bug

@egillax
Contributor

egillax commented Mar 9, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I was testing the latest arrow development version, using this method to install from git.

Now it seems I cannot cast columns in a dataset; the cast results in NA values.

I tried both parquet and arrow files. The cast does work with the latest version on CRAN (11.0.0.3), and with arrow tables instead of datasets.

Reprex:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

mtcars %>% write_dataset('./mtcars/')
ds <- open_dataset('./mtcars')

ds %>% dplyr::collect()
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

ds %>% dplyr::mutate(mpg=as.numeric(mpg)) %>% dplyr::collect()
#>    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1   NA   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2   NA   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3   NA   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 4   NA   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 5   NA   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> 6   NA   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 7   NA   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> 8   NA   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 9   NA   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 10  NA   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 11  NA   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 12  NA   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> 13  NA   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> 14  NA   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> 15  NA   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> 16  NA   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> 17  NA   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> 18  NA   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 19  NA   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> 20  NA   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 21  NA   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 22  NA   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> 23  NA   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 24  NA   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 25  NA   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 26  NA   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 27  NA   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> 28  NA   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 29  NA   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 30  NA   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> 31  NA   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> 32  NA   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Created on 2023-03-09 with reprex v2.0.2
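
Editorial sketch (not part of the original report): the report notes that the same cast works on Arrow tables, so one possible stopgap on an affected development build is to perform the cast in memory. The snippet below is a minimal illustration under that assumption; the variable names `from_table`, `from_ds`, and `tmp` are made up for this example.

```r
library(dplyr)
library(arrow)

# Option 1: cast on an in-memory Arrow Table instead of a file-backed Dataset.
from_table <- arrow_table(mtcars) %>%
  mutate(mpg = as.numeric(mpg)) %>%
  collect()

# Option 2: collect the Dataset first, then cast with plain dplyr in R.
tmp <- tempfile()
dir.create(tmp)
write_dataset(mtcars, tmp)

from_ds <- open_dataset(tmp) %>%
  collect() %>%
  mutate(mpg = as.numeric(mpg))

# Neither result should contain NA in the cast column.
anyNA(from_table$mpg)
anyNA(from_ds$mpg)
```

Both options pull the full data into memory, so they only suit datasets that fit in RAM.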

Arrow Info

Arrow package version: 11.0.0.9000

Capabilities:

dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 FALSE
gcs FALSE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip FALSE
brotli FALSE
zstd FALSE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 FALSE
jemalloc FALSE
mimalloc TRUE

To reinstall with more optional capabilities enabled, see
https://arrow.apache.org/docs/r/articles/install.html

Memory:

Allocator mimalloc
Current 13.31 Kb
Max 46.31 Mb

Runtime:

SIMD Level avx2
Detected SIMD Level avx2

Build:

C++ Library Version 12.0.0-SNAPSHOT
C++ Compiler GNU
C++ Compiler Version 12.2.0
Git ID b679a96

sessionInfo

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.1

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=nl_NL.UTF-8 LC_NAME=nl_NL.UTF-8
[9] LC_ADDRESS=nl_NL.UTF-8 LC_TELEPHONE=nl_NL.UTF-8 LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=nl_NL.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_11.0.0.9000 dplyr_1.0.10 PatientLevelPrediction_6.2.0.9000

loaded via a namespace (and not attached):
[1] pkgload_1.3.2 bit64_4.0.5 jsonlite_1.8.4 DatabaseConnector_6.0.0 R.utils_2.12.2
[6] shiny_1.7.4 assertthat_0.2.1 highr_0.10 blob_1.2.3 remotes_2.4.2
[11] yaml_2.3.6 sessioninfo_1.2.2 pillar_1.8.1 RSQLite_2.2.18 lattice_0.20-45
[16] glue_1.6.2 reticulate_1.26 digest_0.6.31 promises_1.2.0.1 htmltools_0.5.4
[21] httpuv_1.6.8 Matrix_1.5-1 R.oo_1.25.0 clipr_0.8.0 pkgconfig_2.0.3
[26] devtools_2.4.5 purrr_1.0.1 xtable_1.8-4 processx_3.8.0 later_1.3.0
[31] ParallelLogger_3.0.1 tibble_3.1.8 styler_1.9.0 generics_0.1.3 usethis_2.1.6
[36] ellipsis_0.3.2 cachem_1.0.6 withr_2.5.0 cli_3.6.0 magrittr_2.0.3
[41] crayon_1.5.2 mime_0.12 memoise_2.0.1 evaluate_0.20 ps_1.7.2
[46] R.methodsS3_1.8.2 Andromeda_1.0.0 fs_1.5.2 fansi_1.0.3 R.cache_0.16.0
[51] pkgbuild_1.4.0 SqlRender_1.12.0 profvis_0.3.7 tools_4.2.2 data.table_1.14.4
[56] prettyunits_1.1.1 lifecycle_1.0.3 stringr_1.5.0 reprex_2.0.2 callr_3.7.3
[61] compiler_4.2.2 rlang_1.0.6 grid_4.2.2 rstudioapi_0.14 htmlwidgets_1.6.1
[66] miniUI_0.1.1.1 rmarkdown_2.19 DBI_1.1.3 R6_2.5.1 knitr_1.41
[71] fastmap_1.1.0 bit_4.0.4 utf8_1.2.2 stringi_1.7.12 rJava_1.0-6
[76] parallel_4.2.2 Rcpp_1.0.9 vctrs_0.5.1 png_0.1-7 urlchecker_1.0.1
[81] tidyselect_1.2.0 FeatureExtraction_3.2.0 xfun_0.36

Component(s)

R

@eitsupi
Contributor

eitsupi commented Mar 12, 2023

I can reproduce this on 5b2fbad.
(Install with libarrow nightly binary arrow-11.0.0.100000193 on Ubuntu 22.04)

The same problem occurs with the csv format.

> mtcars %>% write_dataset('./mtcars/', format = "csv")
> ds <- open_dataset('./mtcars', format = "csv")
> ds %>% dplyr::mutate(mpg=as.numeric(mpg)) %>% dplyr::collect()
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
 1    NA     6  160    110  3.9   2.62  16.5     0     1     4     4
 2    NA     6  160    110  3.9   2.88  17.0     0     1     4     4
 3    NA     4  108     93  3.85  2.32  18.6     1     1     4     1
 4    NA     6  258    110  3.08  3.22  19.4     1     0     3     1
 5    NA     8  360    175  3.15  3.44  17.0     0     0     3     2
 6    NA     6  225    105  2.76  3.46  20.2     1     0     3     1
 7    NA     8  360    245  3.21  3.57  15.8     0     0     3     4
 8    NA     4  147.    62  3.69  3.19  20       1     0     4     2
 9    NA     4  141.    95  3.92  3.15  22.9     1     0     4     2
10    NA     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows

@thisisnic
Member

Thanks for reporting this @egillax; I can confirm that this issue is not present in 11.0.0.3, but it is present on the dev build. I'll investigate further.

@thisisnic
Member

A little more investigation of the specific circumstances in which it does and does not occur:

library(arrow)
library(dplyr)

# no problem when replacing column with self when there is just 1 column
df <- tibble::tibble(x = 1:10) 
tf <- tempfile()
dir.create(tf)
write_dataset(df, tf)

open_dataset(tf) %>%
  mutate(x = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 1
#>        x
#>    <dbl>
#>  1     1
#>  2     2
#>  3     3
#>  4     4
#>  5     5
#>  6     6
#>  7     7
#>  8     8
#>  9     9
#> 10    10

# NA values when there are 2 columns
df <- tibble::tibble(x = 1:10, y = 1:10) 
tf <- tempfile()
dir.create(tf)

write_dataset(df, tf)

open_dataset(tf) %>%
  mutate(x = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 2
#>        x     y
#>    <dbl> <int>
#>  1    NA     1
#>  2    NA     2
#>  3    NA     3
#>  4    NA     4
#>  5    NA     5
#>  6    NA     6
#>  7    NA     7
#>  8    NA     8
#>  9    NA     9
#> 10    NA    10

# works fine if we're creating a brand new column
open_dataset(tf) %>%
  mutate(z = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 3
#>        x     y     z
#>    <int> <int> <dbl>
#>  1     1     1     1
#>  2     2     2     2
#>  3     3     3     3
#>  4     4     4     4
#>  5     5     5     5
#>  6     6     6     6
#>  7     7     7     7
#>  8     8     8     8
#>  9     9     9     9
#> 10    10    10    10

# works fine if we're replacing a different column
open_dataset(tf) %>%
  mutate(y = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 2
#>        x     y
#>    <int> <dbl>
#>  1     1     1
#>  2     2     2
#>  3     3     3
#>  4     4     4
#>  5     5     5
#>  6     6     6
#>  7     7     7
#>  8     8     8
#>  9     9     9
#> 10    10    10

# works fine with in-memory datasets when replacing existing columns
InMemoryDataset$create(df) %>%
  mutate(x = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 2
#>        x     y
#>    <dbl> <int>
#>  1     1     1
#>  2     2     2
#>  3     3     3
#>  4     4     4
#>  5     5     5
#>  6     6     6
#>  7     7     7
#>  8     8     8
#>  9     9     9
#> 10    10    10

Given it works with 11.0.0.3 and not the dev version of the R package, and there are very few R code changes since 11.0.0.3, I'm inclined to think that this could be something happening at the C++ level. I'll try to narrow it down to the PR which caused this change.
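
Editorial sketch (not part of the original comment): the cases above can be condensed into one script that reports which `mutate()` pattern yields NA values on a given build. The names `cases` and `has_na` are hypothetical. Per the observations in this thread, only `replace_self` would be expected to report NA values on an affected build; no output is shown here because it depends on the installed version.

```r
library(arrow)
library(dplyr)

# Two-column dataset written to a temporary directory, as in the cases above.
df <- tibble::tibble(x = 1:10, y = 1:10)
tf <- tempfile()
dir.create(tf)
write_dataset(df, tf)

# Each entry applies one of the mutate() patterns discussed in this thread.
cases <- list(
  replace_self  = function(d) mutate(d, x = as.numeric(x)),
  new_column    = function(d) mutate(d, z = as.numeric(x)),
  replace_other = function(d) mutate(d, y = as.numeric(x))
)

# TRUE means the collected result contains at least one NA.
has_na <- vapply(
  cases,
  function(f) any(is.na(collect(f(open_dataset(tf))))),
  logical(1)
)
print(has_na)
```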

thisisnic added the Priority: Blocker label Mar 15, 2023
thisisnic added this to the 12.0.0 milestone Mar 15, 2023
@thisisnic
Member

I've managed to narrow it down to #33770 which is where it first broke. CC @nealrichardson

@nealrichardson
Member

I can take a look.

nealrichardson added a commit to nealrichardson/arrow that referenced this issue Mar 15, 2023
westonpace pushed a commit that referenced this issue Mar 21, 2023
…field (#34576)

### Rationale for this change

Fixes #34519. #33770 introduced the bug; I had [asked](https://github.com/apache/arrow/pull/33770/files#r1081612013) in the review why the C++ function wasn't using `FieldsInExpression`. I swapped that in, and the test I added to reproduce the bug now passes.

### What changes are included in this PR?

Fix for the C++ function, test in R. 

### Are these changes tested?

Yes

### Are there any user-facing changes?

The behavior observed in the report no longer happens.
* Closes: #34519

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
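
Editorial sketch (not taken from #34576): the commit message mentions an R test that reproduces the bug. The actual test is not shown in this thread, so the following is only a guess at what such a regression test could look like, assuming testthat is available. The helper `cast_self_in_dataset()` is hypothetical.

```r
library(testthat)
library(arrow)
library(dplyr)

# Hypothetical helper: writes `df` as a dataset, then overwrites column `x`
# with a cast of itself, which is the pattern that triggered the bug.
cast_self_in_dataset <- function(df) {
  tf <- tempfile()
  dir.create(tf)
  write_dataset(df, tf)
  open_dataset(tf) %>%
    mutate(x = as.numeric(x)) %>%
    collect()
}

test_that("mutate() can overwrite a dataset column with a cast of itself", {
  df <- tibble::tibble(x = 1:10, y = 1:10)
  result <- cast_self_in_dataset(df)

  # The cast column must keep its values rather than turn into NA.
  expect_false(anyNA(result$x))
  # Sort before comparing so the check does not depend on row order.
  expect_equal(sort(result$x), as.numeric(df$x))
})
```

The two-column input matters: per the investigation above, the single-column case did not reproduce the problem.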
rtpsw pushed a commit to rtpsw/arrow that referenced this issue Mar 27, 2023
… as a field (apache#34576)
