
[R] Inconsistent behavior for arrow datasets vs working in memory #31564

Closed
asfimport opened this issue Apr 8, 2022 · 4 comments
When I generate a sparse matrix using indices from an Arrow dataset, I get inconsistent behavior: sometimes there are duplicated indices, resulting in a matrix with values greater than one in some places. When I load the dataset into memory first, everything works as expected and all the values are one.

Repro

library(Matrix)
library(dplyr)
library(arrow)

sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")

dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)

arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')

# Run the code below a few times; occasionally unique(newSparse@x) returns
# values greater than 1, indicating duplicate (i, j) index pairs
# (sparseMatrix() sums the values at duplicated positions)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i),
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug; @x is the slot for values


arrowInMemory <- arrowDataset %>% collect()

# After collecting into memory, the output is never more than 1, no matter
# how often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i),
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x)

Environment: Ubuntu 21.10
R 4.1.3
Arrow 7.0.0
Reporter: Egill Axfjord Fridgeirsson / @egillax
Assignee: Nicola Crane / @thisisnic

Note: This issue was originally created as ARROW-16157. Please see the migration documentation for further details.


Nicola Crane / @thisisnic:
Thanks for reporting this @egillax! I was able to replicate your error with the released version of Arrow you mention above, but when I run the dev version, I think this is fixed.

If you want to run the dev version, you can install it via install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com").

I'll close the ticket for now as I believe this is fixed, but if you're still having this issue, let me know and I'll reopen it and take another look.


Egill Axfjord Fridgeirsson / @egillax:
Hi @thisisnic ,

I updated to the dev version and unfortunately I still get the issue.

Here is my arrow::arrow_info() output, if that helps:

 > arrow::arrow_info()
Arrow package version: 7.0.0.20220412

Capabilities:
               
dataset    TRUE
engine    FALSE
parquet    TRUE
json       TRUE
s3        FALSE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip      FALSE
brotli    FALSE
zstd      FALSE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2       FALSE
jemalloc  FALSE
mimalloc  FALSE

To reinstall with more optional capabilities enabled, see
   https://arrow.apache.org/docs/r/articles/install.html

Memory:
                  
Allocator   system
Current   76.29 Mb
Max        76.3 Mb

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                                   
C++ Library Version  8.0.0-SNAPSHOT
C++ Compiler                    GNU
C++ Compiler Version         11.2.0


Nicola Crane / @thisisnic:
Figured it out, @egillax!

Two important things to consider here:

  1. Row order is not guaranteed when reading in datasets.
  2. Datasets are not read into memory until you call collect(), pull(), or similar.

In your second example, collect() reads the whole dataset in a single pass, so the values in each row stay properly matched. In your first example, however, each pull() triggers its own scan of the dataset (so the data is technically read in on 2 separate occasions, with a different column chosen each time), and on the occasions where you get values > 1 from unique(newSparse@x), the i and j columns have been read in in different row orders.

This isn't a bug, though it's obviously not ideal in this case.
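
A minimal sketch of the single-scan approach, assuming the arrowDataset object from the repro above: selecting both columns and collecting them in one call means the dataset is scanned only once, so each row's i and j stay paired.

library(Matrix)
library(dplyr)
library(arrow)

# Read both index columns in a single scan so their row order stays aligned
idx <- arrowDataset %>%
  select(i, j) %>%
  collect()

newSparse <- Matrix::sparseMatrix(i = idx$i, j = idx$j, x = 1)
unique(newSparse@x) # with a single read, every stored value is 1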

If you are using a single file to store the data instead of datasets, you could use write_feather() and read_feather(); in this case, order is guaranteed, and you won't have the same problem.
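
A short sketch of that single-file route, assuming the dF data frame from the repro (the file name indices.feather is illustrative):

library(Matrix)
library(arrow)

# Write the index data frame to a single Feather file
arrow::write_feather(dF, "indices.feather")

# read_feather() reads the file in one pass and preserves row order,
# so i and j stay paired
inMem <- arrow::read_feather("indices.feather")

newSparse <- Matrix::sparseMatrix(i = inMem$i, j = inMem$j, x = 1)
unique(newSparse@x)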


Egill Axfjord Fridgeirsson / @egillax:
Thanks @thisisnic ! That's good to know.
