[R] Inconsistent behavior for arrow datasets vs working in memory #31564
Comments
Nicola Crane / @thisisnic: If you want to run the dev version, you can install via I'll close the ticket for now, as I believe this is fixed, but if you're still having this issue, let me know and I'll reopen it and take another look.
Egill Axfjord Fridgeirsson / @egillax: I updated to the dev version and unfortunately I still get the issue. Here is my arrow::arrow_info() output, if that helps:
> arrow::arrow_info()
Arrow package version: 7.0.0.20220412
Capabilities:
dataset TRUE
engine FALSE
parquet TRUE
json TRUE
s3 FALSE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip FALSE
brotli FALSE
zstd FALSE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 FALSE
jemalloc FALSE
mimalloc FALSE
To reinstall with more optional capabilities enabled, see
https://arrow.apache.org/docs/r/articles/install.html
Memory:
Allocator system
Current 76.29 Mb
Max 76.3 Mb
Runtime:
SIMD Level avx2
Detected SIMD Level avx2
Build:
C++ Library Version 8.0.0-SNAPSHOT
C++ Compiler GNU
C++ Compiler Version 11.2.0
Nicola Crane / @thisisnic: Two important things to consider here:
In your first example, you read in both columns at once, so the values in each row are properly matched. In your second example, however, you read in the data one column at a time (technically two separate reads, each selecting a different column), and so on the occasions where you're getting values > 1 when you run This isn't a bug, though obviously not ideal in this case. If you are using a single file to store the data instead of datasets, you could use
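A minimal sketch of the point above, not the code from this thread: it assumes the two separate reads can return rows in different orders, and it uses hypothetical column names (`i`, `j`, `part`) and a temporary dataset created only for illustration. Collecting both index columns in a single read keeps each row's pair together before the sparse matrix is built.

```r
library(arrow)
library(dplyr)
library(Matrix)

# Hypothetical data: 1000 unique (i, j) index pairs, plus a partition key
df <- data.frame(
  i = rep(1:100, each = 10),
  j = rep(1:10, times = 100),
  part = rep(1:4, 250)
)

# Write a multi-file dataset to a temporary directory and open it lazily
path <- tempfile()
write_dataset(df, path, partitioning = "part")
ds <- open_dataset(path)

# Pull both index columns in ONE collect() so the pairs stay aligned,
# rather than collecting i and j in two separate reads
idx <- ds %>%
  select(i, j) %>%
  collect()

# sparseMatrix() sums duplicated (i, j) pairs, so values above 1 would
# signal mismatched or repeated indices
m <- sparseMatrix(i = idx$i, j = idx$j, x = 1)
max(m)
```

If `max(m)` is larger than 1 with this approach, the input itself contains repeated index pairs; the row-order explanation would not account for that.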
Egill Axfjord Fridgeirsson / @egillax: |
When I generate a sparse matrix using indices from an arrow dataset, I get inconsistent behavior: sometimes there are duplicated indices, resulting in a matrix with values greater than one in some places. When I load the dataset into memory first, everything works as expected and all the values are one.
Repro
Environment: Ubuntu 21.10
R 4.1.3.
Arrow 7.0.0
Reporter: Egill Axfjord Fridgeirsson / @egillax
Assignee: Nicola Crane / @thisisnic
Note: This issue was originally created as ARROW-16157. Please see the migration documentation for further details.