memDecompress error #59

Closed · achekroud opened this issue Jul 31, 2016 · 22 comments

@achekroud

Hi team

I may be working in a very strange use case; I'm not sure. Feel free to disregard this if so.

I'm working in RStudio Server hosted on an EC2 instance in AWS (Amazon Linux, R 3.2.2, RStudio Server 0.99.465). I'm trying to use the aws.s3 package to access an .rds file that is in an S3 bucket. The file is approximately 200 MB on disk as an RDS. The EC2 instance is an m4.2xlarge, so there should be around 32 GB of RAM available.

The bucket is called "chek1", and get_bucket("chek1") works fine.

However, when I do:
> s3readRDS(object = "SAMHDA/RAWdata/vcat.08-14.rds", bucket = "chek1")

I get the following cryptic error message:

Error in memDecompress(from = as.vector(r), type = "gzip") : 
  internal error -3 in memDecompress(2)

I'm not sure what's going on here. Does anyone have any ideas or workarounds? I really liked the look and feel of this package and was pretty surprised to get tripped up by this. Googling the error message only returns a random conversation from 2012 between Hadley Wickham and Brian Ripley.

Adam

@leeper leeper added the question label Aug 1, 2016
@leeper
Member

leeper commented Aug 1, 2016

This looks like it might be a bug. Are you able to get the object as a raw vector using get_object(object = "SAMHDA/RAWdata/vcat.08-14.rds", bucket = "chek1")?

@achekroud
Author

Yeah, the command executes. I wasn't sure what the output means, though.

@leeper
Member

leeper commented Sep 7, 2016

I am unable to reproduce this. Given that you can read the file using get_object(), this is probably an issue with the file rather than with this package. I'm closing for now. Feel free to open a new issue or follow up here if you continue to experience issues.

@leeper leeper closed this as completed Sep 7, 2016
@yasminlucero

yasminlucero commented Oct 6, 2016

I had this exact behavior as well. Notably, the RDS that failed was a large file (85 MB); s3readRDS worked fine on a small file (1 KB). Oh, and I verified that I can read the file via other means (an s3fs mount), so there is no reason to expect that the file is corrupt.

big.test <- s3readRDS(object = "bigtest.RDS", bucket = "grv-myexamplebucket")

Error in memDecompress(from = as.vector(r), type = "gzip") : 
  internal error -3 in memDecompress(2)

big.test.raw <- get_object(object = "bigtest.RDS", bucket = "grv-myexamplebucket")

  attr(big.test.raw, 'content-type')
[1] "application/octet-stream"
  attr(big.test.raw, 'content-length')
[1] "88697837"

I haven't figured out yet how to parse the raw object.

The error is raised around line 6045 of https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/main/connections.c
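
For reference, a sketch of one way to parse the raw vector by hand follows; the object and bucket names are the ones from the output above, and which variant works depends on how the file was originally written.

library(aws.s3)

# Fetch the object as a raw vector (this step works, per the output above)
big.test.raw <- get_object(object = "bigtest.RDS", bucket = "grv-myexamplebucket")

# If the file was written by base::saveRDS() (gzip-compressed by default),
# reading it through a gzip-aware connection should recover the object
con <- gzcon(rawConnection(big.test.raw))
big.test <- readRDS(con)
close(con)

# If instead it was written as serialize() output compressed with memCompress(),
# this variant may work
big.test <- unserialize(memDecompress(big.test.raw, type = "gzip"))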

@leeper leeper reopened this Oct 7, 2016
@vicmayrink

vicmayrink commented Jan 16, 2017

I'm experiencing exactly the same issue. Did you find any solution?

@mjpdenver

Likewise, I get the same response as yasminlucero when trying to read an RDS file.

Thanks

@fanghaolei

fanghaolei commented Apr 12, 2017

I'm experiencing the exact same issue. It appears to me that this memDecompress error only occurs when I first sync an .rds file to a bucket via the AWS CLI and then try to download it with s3readRDS().

Thanks!

@ieaves

ieaves commented Apr 18, 2017

I have no idea if this is related to the issue everyone else is seeing, but in my use case s3saveRDS requires headers = list("x-amz-server-side-encryption" = "AES256"), like so:

s3saveRDS(my_object, bucket=my_bucket, object=my_file_name, headers=list("x-amz-server-side-encryption" = "AES256"))

However, attempting to use s3readRDS with the same headers results in the cryptic memDecompress error.

Removing the headers from the read call, i.e. s3readRDS(bucket = my_bucket, object = my_file_name), allowed me to load from S3 successfully.

@leonawicz

I am experiencing the same issue with package version aws.s3_0.2.2.

First I tried to use s3readRDS on .rds files I had previously uploaded to an AWS S3 bucket using the S3 web GUI uploader. This gives the same memDecompress error noted above. I can always read the raw vector with get_object.

The second way I did this was to use put_object to upload .rds files to my bucket. Trying to load such a file with s3readRDS results in the same error.

The third way I tried was to upload .rds files to my bucket strictly using the s3saveRDS wrapper. Only if uploaded in this manner can I then subsequently load .rds files using s3readRDS.

I am not sure what is different about these files based on the method of upload. I was hopeful that at least the second approach, using put_object on local .rds files, would be a solution, because it is analogous to the approach I have to use for uploading .RData files: put_object directly instead of s3save (see issue #128).

For the time being, it seems that uploading strictly via s3saveRDS will avoid the read errors with s3readRDS. Not ideal, but it is working for me. And at least at a glance (I haven't fully tested) doing so fortunately does not appear to lead to the file size bloat seen in the above-referenced issue.
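
Concretely, the workaround amounts to something like the following sketch; the bucket and object names are placeholders:

library(aws.s3)

# Upload: let s3saveRDS() handle serialization and compression itself...
s3saveRDS(my_data, object = "my_data.rds", bucket = "my-bucket")

# ...and read it back with the matching wrapper
my_data2 <- s3readRDS(object = "my_data.rds", bucket = "my-bucket")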

Regards,
Matt

@leeper
Member

leeper commented Apr 23, 2017

@leonawicz Can you give this a try on the latest version from GitHub?

@leeper leeper closed this as completed in cfa7541 Apr 23, 2017
@leonawicz

leonawicz commented Apr 24, 2017

I can confirm that with the latest GitHub version (aws.s3_0.2.4) I can load an object into R via s3readRDS regardless of which of the three upload methods I'd previously used: uploading an R object directly with s3saveRDS, uploading a previously saved (via base saveRDS) local .rds file with put_object, or uploading a previously saved .rds file using the AWS web GUI uploader.

@Serenthia
Contributor

FYI, this change has meant that I can't read any binary files I previously saved to S3 with the old method, which is a breaking change as far as I'm concerned.

Re-uploading them with the new s3saveRDS method means they can then be read; however, I can't do this for thousands of past files...

@leeper
Member

leeper commented Apr 24, 2017

@Serenthia what error do you get when trying to read a previously uploaded RDS?

@leonawicz

@leeper I also noticed just now that I can no longer read .rds files uploaded with the previous package version. I had to delete them all from AWS and re-upload them before I could read them with the newer package version's s3readRDS. The error is:

Error in readRDS(tmp) : unknown input format

This occurs when trying to read older .rds files; newer ones are fine. It seems the file that gets created somehow depends on the aws.s3 package version. Hopefully it was a bug unique to the old version? I'm unsure why, when reading an .rds file with s3readRDS, it would matter how the file was created and uploaded to AWS, but for some reason the package version the file was made with seems to matter.

@leeper leeper reopened this Apr 24, 2017
@Serenthia
Contributor

Can confirm that that's the same behaviour and error message that I'm experiencing. Thanks for the reopen!

@leeper
Member

leeper commented Apr 25, 2017

Okay, I think I've tracked this down to being a decompression issue. Just to confirm that you're experiencing it the same way (@Serenthia, @leonawicz), if you do this for one of the older files:

o <- get_object("s3://yourbucket/yourobject")
unserialize(memDecompress(o, "gzip"))

Do you get back what you expect?

@Serenthia
Contributor

@leeper Yes: using that, I can successfully read a file that returns the unknown input format error under readRDS.

leeper added a commit that referenced this issue Apr 25, 2017
@leeper
Member

leeper commented Apr 25, 2017

Okay, I've tracked this down: the previous behavior was a bug (specifically, serialize() sets xdr = TRUE by default, i.e. writes big-endian, which is basically never what we want). The current behavior is correct and more consistent with using saveRDS() and readRDS() directly.

However, because it would be annoying to figure this out for a given file, s3readRDS() now tries to read the file and, if that fails, tries to unserialize it, so it should work on both older (incorrect) and newer files.
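
In rough pseudocode, the fallback looks something like this (a sketch of the approach, not the package's exact code; s3read_fallback is a made-up name):

s3read_fallback <- function(object, bucket, ...) {
  r <- get_object(object = object, bucket = bucket, ...)
  tryCatch(
    # Newer files: written the same way saveRDS() writes them
    readRDS(gzcon(rawConnection(r))),
    # Older files from the buggy behavior: gzip-compressed serialize() output
    error = function(e) unserialize(memDecompress(r, type = "gzip"))
  )
}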

Let me know if not and I'll continue to patch.

@Serenthia
Contributor

Thanks! 0.2.5 looks perfect 👍

@leeper leeper closed this as completed Apr 26, 2017
@drorata
Contributor

drorata commented Nov 4, 2019

What about non-RDS files? I'm failing to load a compressed JSON file from S3.

@drorata
Contributor

drorata commented Nov 4, 2019

I found the workaround to be something like:

library(aws.s3)
library(magrittr)

read_gzip_json_from_s3_to_df <- function(path) {
  #' Read a single gzipped JSON file from an S3 location into a data frame
  #'
  #' The compressed JSON should contain a single object per line,
  #' with no commas or array structure wrapping the objects.
  #'
  #' @param path S3 location of an object, e.g. s3://my-bucket/some/folders/file.json.gz
  path %>%
    get_object() %>%
    rawConnection() %>%
    gzcon() %>%
    jsonlite::stream_in() %>%
    jsonlite::flatten()
}
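
Called, for example, as:

df <- read_gzip_json_from_s3_to_df("s3://my-bucket/some/folders/file.json.gz")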

@dmaupin12

dmaupin12 commented Dec 2, 2021

I just had this happen on a fairly large dataset as well. The following code is how I upload to S3. Is there a better way to do this that would avoid the error in the future?

tmp <- tempfile()
saveRDS(full_data, tmp)
put_object(tmp, object = paste0(s3_path, "full_data.rds"), show_progress = TRUE, multipart = TRUE)
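
Based on the earlier comments in this thread, one alternative would be to skip the temp file and let s3saveRDS() do the serialization. This is only a sketch: "my-bucket" is a placeholder, and whether it helps with a multipart-sized object is untested here.

library(aws.s3)

# Serialize and upload in one step; s3_path is the same prefix used above,
# and the bucket argument may be redundant if s3_path already names the bucket
s3saveRDS(full_data, object = paste0(s3_path, "full_data.rds"), bucket = "my-bucket")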
