[R] errors when downloading parquet files from s3. #11934
Thanks for the report @JasperSch. Just to confirm: do you get any problems printing the retrieved data in the last step, or is it just the point at which you're running the download that errors?
I couldn't reproduce this using minio locally... is there anything I'm not understanding about your setup? If you can modify this example to reproduce your error, we will be better able to help fix it!

```r
library(arrow, warn.conflicts = FALSE)

dir <- tempfile()
dir.create(dir)
subdir <- file.path(dir, "some_subdir")
dir.create(subdir)
list.files(dir)
#> [1] "some_subdir"

minio_server <- processx::process$new("minio", args = c("server", dir), supervise = TRUE)
Sys.sleep(1)
stopifnot(minio_server$is_alive())
#> Error: minio_server$is_alive() is not TRUE

# make sure we can connect
s3_uri <- "s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000"
bucket <- s3_bucket(s3_uri)
bucket$ls("some_subdir")
#> [1] "some_subdir/test"

# write a dataset to minio
data <- data.frame(x = letters[1:5])
write_dataset(
  dataset = data,
  path = bucket$path("some_subdir/test")
)
bucket$ls("some_subdir/test")
#> [1] "some_subdir/test/part-0.parquet"

dplyr::collect(arrow::open_dataset(bucket$path("some_subdir/test")))
#>   x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e

minio_server$interrupt()
#> [1] FALSE
Sys.sleep(1)
stopifnot(!minio_server$is_alive())
```

Created on 2022-01-10 by the reprex package (v2.0.1)
Just a problem with
@paleolimbot Thank you for the example. I also noticed, by the way, that my MWE was not fully reproducible.
I ran into some issues installing minio, but eventually managed to set it up in a Docker container. The example below should be very close to what you proposed. So, in your example, I think you could try running:
Thanks for making this example easy for me to reproduce! You're right, this example fails for me in the same way it fails for you. Based on the stack trace of the error, it looks like this is coming from the minio.s3 library (and the aws.s3 library in your previous example). From examining the local file that was saved, it doesn't appear that the arrow package wrote an invalid file... rather, it looks like the minio.s3 and aws.s3 packages are interpreting the content of the file as a file path somewhere. Would it be reasonable to open an issue in either or both of those repositories to fix that code?
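The symptom described above ("interpreting the content of the file as a file path") can be illustrated with a hypothetical sketch. This is not the actual aws.s3/minio.s3 code; the function name and control flow are made up to show how an error message like `'PAR1...' does not exist in current working directory` can arise while the file is still written to disk, matching the original report.

```python
import os
import tempfile

def save_object_buggy(body, dest):
    # Hypothetical client logic (NOT the real aws.s3/minio.s3 source):
    # the object is written out correctly...
    with open(dest, "wb") as f:
        f.write(body)
    # ...but a misdirected fallback check then treats the leading bytes
    # of the *content* as if they were a local file *path*
    path_like = body[:12].decode("latin-1")
    if not os.path.exists(path_like):
        raise FileNotFoundError(
            "'%s' does not exist in current working directory" % path_like
        )
    return dest

# Parquet files begin with the 4-byte magic "PAR1", followed by binary data,
# which is why the reported error message starts with 'PAR1' plus mojibake.
parquet_like = b"PAR1" + b"\xef\xbf\xbd" * 2

dest = os.path.join(tempfile.mkdtemp(), "part-0.parquet")
try:
    save_object_buggy(parquet_like, dest)
except FileNotFoundError as e:
    print(e)

# Despite the error, the file was downloaded, as observed in the report.
print("file written anyway:", os.path.getsize(dest) > 0)
```

The point of the sketch is only that the two observations from the report (an error quoting the file's leading bytes as a path, and the file nonetheless arriving on disk) are consistent with a path-vs-content mix-up in the downloading library, not with arrow writing an invalid file.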
Yes, that would be reasonable. I decided to open it here in the first place since I have the feeling that the root cause of the issue lies in the way write_dataset writes to S3. See below an extended version of my example above. Also (not shown here), creating the files locally with write_dataset and uploading them manually does not trigger the errors. Thus, my assumption is that the problem lies on the writing side rather than the downloading side. So, to me it's still a question whether the fix belongs in arrow or in aws.s3/minio.s3.
Thank you for your response! I have a feeling the folks who maintain aws.s3 and/or minio.s3 will have a better handle on the mode of failure. I'd suggest opening an issue there and/or submitting your fix as a pull request... the maintainers there may have a suggestion as to whether or not we should be writing to S3 in a different way.
I poked around at this a bit. The error seems to be that write_dataset is creating files with the application/xml content type, and then the download fails on them. If I hardcode the C++ to set the content type to something else (application/parquet), then minio.s3::save_object works fine.
It appears that the AWS SDK forces a content type: if one isn't set, it will use application/xml (which is rather unfortunate). That being said, I don't understand why the content type should matter to the downloader at all. So I would argue it is a bug in both libraries. I opened https://issues.apache.org/jira/browse/ARROW-15306, which should be pretty straightforward to fix if everyone agrees it is a good thing to do.
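The interaction of the two behaviors described above can be sketched as follows. The function names and dict shapes are illustrative, not the real AWS SDK or aws.s3 APIs: a writer that defaults a missing Content-Type to application/xml, combined with a reader that uses Content-Type alone to decide whether a response is an S3 error document, misclassifies perfectly valid parquet data.

```python
def put_object(body, content_type=None):
    # Writer-side behavior: if no Content-Type is supplied, fall back to
    # application/xml (the unfortunate SDK default discussed above)
    return {"Body": body, "ContentType": content_type or "application/xml"}

def get_object_strict(obj):
    # Reader-side bug: deciding "this is an S3 XML error response" purely
    # from the Content-Type header misfires on the writer's default
    if obj["ContentType"] == "application/xml":
        raise RuntimeError("unexpected error document")  # false positive
    return obj["Body"]

# An arrow-style write with no explicit content type fails on download...
stored = put_object(b"PAR1...")
try:
    get_object_strict(stored)
except RuntimeError as e:
    print("download failed:", e)

# ...while the same bytes uploaded with an explicit content type succeed,
# matching the report that manual uploads do not trigger the errors.
fixed = put_object(b"PAR1...", "application/octet-stream")
print(get_object_strict(fixed)[:4])
```

Fixing either side alone makes the symptom disappear, which is why it is arguable that both libraries have a bug: the writer should not stamp parquet data as application/xml, and the reader should not trust the header to detect error payloads.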
I'm closing this, as it appears the Arrow ticket is resolved and a ticket has been opened against the AWS SDK. If this problem persists, feel free to reopen.
When writing a dataset to S3 as parquet files using write_dataset, I get download errors when retrieving the files afterwards:

Error: 'PAR1���2�2L���' does not exist in current working directory ('/tmp/Rtmpk1pQuU').

Despite the errors, the files do still get downloaded. The errors do not seem to occur when I use write_dataset locally and upload the files to S3 manually using aws.s3::put_object. They also stop occurring if I re-upload the downloaded files.
System info:
R version 3.6.3
arrow 6.0.1
aws.s3 0.3.21
MWE: