
s3save/save significant file size inconsistency #128

Closed

leonawicz opened this issue Apr 21, 2017 · 5 comments

@leonawicz

Hi,

This is a great package, easy to use, and seems to be the most promising existing package. I am curious if I am missing something or if this could be addressed to make the package better.

I noticed that when using s3save, it sends an essentially uncompressed raw data file to AWS. The generated .RData file behaves exactly the same when downloaded and loaded into R with the base function load, but it is much larger than a file produced by saving the same objects locally with save. The latter is much more compressed.

While this in no way affects function, and s3save appears on the surface to be an analog of save, it produces much larger files than save. This also defeats the purpose of rapid file retrieval over the internet from AWS (e.g., in R Shiny apps where a number of files are preferably stored externally and loaded on demand rather than hardcoded into the app).

An easy way around this is to use save to write a local .RData file and then use put_object to send the more compressed version of the R workspace file to AWS. This is what made the package usable for me: I could retrieve ~1.8 MB files in about one second with s3load, rather than ~15 MB files containing identical objects, which took about 12 seconds to retrieve with s3load (far too slow for serving apps, for example). It was the perfect use case to highlight why I had to take this latter approach.
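A minimal sketch of that workaround (obj1, obj2, and the bucket/object names are placeholders, not from my actual app):

```r
# Save a compressed .RData file locally, then upload the file itself.
tmp <- tempfile(fileext = ".RData")
save(obj1, obj2, file = tmp)  # gzip-compressed by default
aws.s3::put_object(file = tmp, object = "data/objects.RData", bucket = "my-bucket")
unlink(tmp)

# Retrieval on the other end is unchanged:
aws.s3::s3load(object = "data/objects.RData", bucket = "my-bucket")
```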

What I am wondering (assuming I'm not missing some existing option) is: wouldn't it be preferable to make s3save consistent with save in this regard? The method would not have to be a replacement, but perhaps an option (maybe the default?) would be for s3save to simply create a temporary local .RData file behind the scenes using save and then upload that file to AWS via put_object. That seems to reproduce the behavior of save most accurately while avoiding unnecessary file size inflation for remote storage/retrieval.

I am using aws.s3_0.2.2 from GitHub.

Regards,
Matt

@leeper
Member

leeper commented Apr 22, 2017

@leonawicz Thanks for this report. I don't know how to explain this, given that s3save() uses save() internally. The only difference is a reliance on rawConnection() rather than writing to disk.

I've created a new branch that replaces all the internal usage of rawConnection() with calls to tempfile(). I suspect it may solve this issue as well as the various other read/write problems that have been reported, including #59.
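One possible explanation (speculation on my part, from reading ?save): the compress argument applies only when saving to a named file and is ignored when file is a connection, so serializing through rawConnection() would be uncompressed. A rough sketch of the contrast (illustration only, not the package's actual internals):

```r
# Illustration only; not the package's internals.
x <- rep(1:10, 1e5)

# Connection-based path (old behavior): `compress` is ignored for
# connections, so the serialized payload is uncompressed.
con <- rawConnection(raw(0), "wb")
save(x, file = con)
length(rawConnectionValue(con))  # roughly the full serialized size
close(con)

# tempfile-based path (new behavior): gzip compression by default.
tmp <- tempfile(fileext = ".RData")
save(x, file = tmp)
file.size(tmp)  # much smaller for compressible data like this
unlink(tmp)
```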

Can you give it a try and let me know how it affects your use case?

@leonawicz
Author

I can confirm that with version aws.s3_0.2.4 the internal use of tempfile in s3save keeps the data compressed just as save would. Thanks!

If there is a way to achieve the same result using rawConnection, I could perhaps still see higher value in it. E.g., I imagine it may be more efficient to avoid creating files on disk when looping through repeated calls to s3save (or s3saveRDS) over different workspaces and/or objects they are generating; maybe that was part of the initial motivation for going in that direction. As it stands, of course, any such gain would generally be outweighed by the extra time it takes to send the larger files to AWS when using rawConnection. But if that issue were solvable, maybe it would become the best approach again? I think it is fine as is; I am just thinking out loud.

The only other thing I am thinking about regarding the current tempfile-based approach is the file extension. Many users use .RData, but I have seen people using .RData, .Rdata, and .rdata. I have tried to find an authoritative online source on the standard, since many systems are case sensitive; I always thought it was .RData, but I don't actually know. Even if one is the "right" one, the fact remains that people tend to stick to one of those flavors of extension case. While this seems irrelevant when using s3save, because the user never sees the temporary file, it would not be unusual for users, or their intended audience, colleagues, etc., to download R data files from AWS outside the aws.s3 workflow context, in which case they might be surprised by the extension case or might then want/need to batch-rename all their files. Perhaps it would be good to include a function argument with one case as the default, but that lets the user specify a specific extension, e.g., ext = c("RData", "Rdata", "rdata") or something like that?
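Something like this hypothetical wrapper, just to illustrate (not the actual s3save code; the real signature may differ):

```r
# Hypothetical sketch of an `ext` argument, defaulting to "RData".
s3save_ext <- function(..., object, bucket,
                       ext = c("RData", "Rdata", "rdata"),
                       envir = parent.frame()) {
  ext <- match.arg(ext)
  tmp <- tempfile(fileext = paste0(".", ext))
  on.exit(unlink(tmp))
  save(..., file = tmp, envir = envir)
  aws.s3::put_object(file = tmp, object = object, bucket = bucket)
}
```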

@leeper
Member

leeper commented Apr 24, 2017

You can specify an object key with whatever extension you want. The tempfile gets purged immediately, so it would only be seen during some kind of debugging.
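E.g. (bucket and key names here are just placeholders):

```r
# The object key, not the tempfile name, is what appears in S3:
s3save(mydata, object = "results/mydata.Rdata", bucket = "my-bucket")
```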

@leonawicz
Author

Oops. Right, sorry, I mixed up file and object when reading the s3save code. Was thinking the tempfile name was becoming the object name. Never mind. Thanks!

@leeper
Member

leeper commented Apr 25, 2017

And to respond substantively: I've been reading (no time for full benchmarks, unfortunately), and it seems that the I/O cost of writing to disk is probably minimal, as rawConnections can apparently be somewhat slow. They also lead to modified defaults for some functions, so my inclination is to stick with the new behavior.

Closing (for now).

@leeper closed this as completed Apr 25, 2017