
[R] Inconsistent behavior for arrow datasets vs working in memory #31564

Closed
asfimport opened this issue Apr 8, 2022 · 4 comments
When I generate a sparse matrix using indices from an Arrow dataset, I get inconsistent behavior: sometimes there are duplicated indices, resulting in a matrix with values greater than one in some places. When I load the dataset into memory first, everything works as expected and all the values are one.

Repro

library(Matrix)
library(dplyr)
library(arrow)

sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")

dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)

arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')

# Run the code below a few times; occasionally unique(newSparse@x) returns
# values greater than 1, indicating duplicate (i, j) index pairs
# (sparseMatrix() sums the values at duplicated positions)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i),
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug; @x is the slot for values


arrowInMemory <- arrowDataset %>% collect()

# After collecting into memory, the output is never more than 1, no matter
# how often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i),
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x)

Environment: Ubuntu 21.10
R 4.1.3
Arrow 7.0.0
Reporter: Egill Axfjord Fridgeirsson / @egillax
Assignee: Nicola Crane / @thisisnic

Note: This issue was originally created as ARROW-16157. Please see the migration documentation for further details.


Nicola Crane / @thisisnic:
Thanks for reporting this @egillax! I was able to replicate your error with the released version of Arrow you mention above, but when I run the dev version, I think this is fixed.

If you want to run the dev version, you can install it via install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com").

I'll close the ticket for now as I believe this is fixed, but if you're still having this issue, let me know and I'll reopen it and take another look.


Egill Axfjord Fridgeirsson / @egillax:
Hi @thisisnic ,

I updated to the dev version and unfortunately I still get the issue.

Here is my arrow::arrow_info() output, if that helps:

 > arrow::arrow_info()
Arrow package version: 7.0.0.20220412

Capabilities:
               
dataset    TRUE
engine    FALSE
parquet    TRUE
json       TRUE
s3        FALSE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip      FALSE
brotli    FALSE
zstd      FALSE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2       FALSE
jemalloc  FALSE
mimalloc  FALSE

To reinstall with more optional capabilities enabled, see
   https://arrow.apache.org/docs/r/articles/install.html

Memory:
                  
Allocator   system
Current   76.29 Mb
Max        76.3 Mb

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                                   
C++ Library Version  8.0.0-SNAPSHOT
C++ Compiler                    GNU
C++ Compiler Version         11.2.0


Nicola Crane / @thisisnic:
Figured it out, @egillax!

Two important things to consider here:

  1. Row order is not guaranteed when reading in datasets.
  2. Datasets are not read into memory until you call collect(), pull(), or similar.

In your second example, collect() reads the whole dataset in a single pass, so the values in each row stay properly matched. In your first example, however, each pull() triggers its own scan of the dataset (so the data is technically read in on 2 separate occasions, with a different column chosen each time), and on the occasions where you get values > 1 from unique(newSparse@x), the i and j columns have been read in in different row orders.

This isn't a bug, though it's obviously not ideal in this case.
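
A minimal sketch of the single-scan approach, assuming the arrowDataset object from the repro above: selecting both columns and collecting them in one call means the dataset is scanned only once, so each row's i and j stay paired.

library(Matrix)
library(dplyr)
library(arrow)

# Read both index columns in a single scan so their row order stays aligned
idx <- arrowDataset %>%
  select(i, j) %>%
  collect()

newSparse <- Matrix::sparseMatrix(i = idx$i, j = idx$j, x = 1)
unique(newSparse@x) # with a single read, every stored value is 1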

If you are using a single file to store the data instead of datasets, you could use write_feather() and read_feather(); in this case, order is guaranteed, and you won't have the same problem.
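
A short sketch of that single-file route, assuming the dF data frame from the repro (the file name indices.feather is illustrative):

library(Matrix)
library(arrow)

# Write the index data frame to a single Feather file
arrow::write_feather(dF, "indices.feather")

# read_feather() reads the file in one pass and preserves row order,
# so i and j stay paired
inMem <- arrow::read_feather("indices.feather")

newSparse <- Matrix::sparseMatrix(i = inMem$i, j = inMem$j, x = 1)
unique(newSparse@x)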


Egill Axfjord Fridgeirsson / @egillax:
Thanks @thisisnic ! That's good to know.
