
Improve memory usage and execution time of listing objects with file system backend #556

Merged

Conversation

@ironsmile (Contributor) commented Aug 13, 2021

I am using the GCS Fake Server to develop locally and it is mostly great. But I've noticed it is completely unable to list the objects in my file system bucket, even when I give it a prefix which ensures only one object will match. It consumes all of the machine's memory and never finishes, presumably because it spends all of its time swapping. Some information about my bucket: 53,017 files with an overall size of 20.3 GB. Sadly, the nature of my work is such that this is a relatively small data set.

So I went in and started poking around the code. It quickly became evident that two things are happening:

  • All the files in the bucket are loaded into memory, all at once, for every object list command.
  • When filters (such as "prefix") are used, files which do not match the filter are still parsed and loaded into the process memory.

This PR fixes those two issues in its two commits. Previously, the list objects command was taking all of my machine's 32 GB of RAM and was not finishing even after I'd waited on it for half an hour. Now such list commands use barely any memory (in the range of a few KB) and finish almost instantly.

While the above is great, I suspect there are many more places where the emulator will now be significantly faster; I just haven't clocked them. Off the top of my head, deleting a bucket will now require almost no memory, where before it had the same problem as listing objects.

Further Improvements

It would be great if it were possible to read only the metadata for blobs stored on the file system. Unfortunately, with the JSON encoding I don't see how that would be possible: as it stands, one has to load all of the file contents in order for the JSON parser to do its thing. This is a pity, considering that in many situations we only want the blob metadata.

I think the only way to achieve this cleanly would be to drop the JSON altogether and find another way of storing the metadata. Possible approaches are file headers, similar to the nginx file cache, or separate ".attrs" sidecar files like what gocloud.dev/blob does.
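To illustrate the second approach, here is a hypothetical sketch (not something this PR implements) of ".attrs" sidecar files, loosely in the spirit of gocloud.dev/blob's fileblob driver: the blob content lives in one file and its metadata in a small JSON file next to it, so listing only ever reads the sidecar. The package, function names, and metadata shape below are made up for the example.

```go
package sidecar

import (
	"encoding/json"
	"os"
)

// objectMetadata is a made-up metadata shape for the example.
type objectMetadata struct {
	Name        string `json:"name"`
	ContentType string `json:"contentType"`
	Size        int64  `json:"size"`
}

// writeObject stores the blob content and its metadata in separate files,
// with the metadata in a small "<path>.attrs" sidecar.
func writeObject(path string, content []byte, meta objectMetadata) error {
	if err := os.WriteFile(path, content, 0o644); err != nil {
		return err
	}
	encoded, err := json.Marshal(meta)
	if err != nil {
		return err
	}
	return os.WriteFile(path+".attrs", encoded, 0o644)
}

// readMetadata loads only the small sidecar file, never the blob itself.
func readMetadata(path string) (objectMetadata, error) {
	data, err := os.ReadFile(path + ".attrs")
	if err != nil {
		return objectMetadata{}, err
	}
	var meta objectMetadata
	err = json.Unmarshal(data, &meta)
	return meta, err
}
```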

Commit 1:

Previously all of the object blobs were read into memory on every list
command, even when the list command would've returned nothing. There are
many problems with this approach:

* It is extremely slow to read all files on every list.
* If the blobs in storage ever exceed the memory on the machine, the
process will crash with an OOM error.

This PR makes it so that for most operations the actual blob contents
are not kept in memory; instead, only a small struct (ObjectAttrs) is
used. Unfortunately, due to the nature of JSON encoding, all objects are
still read from disk in full at least once.
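To make that idea concrete, here is a minimal sketch in Go. The type names, JSON field names, and on-disk layout are assumptions for illustration and are not taken from the actual fake-gcs-server code; only the general approach (decode the JSON once, keep just the attributes, drop the content) reflects what the commit describes.

```go
package backend

import (
	"encoding/json"
	"os"
)

// ObjectAttrs holds only the object metadata (hypothetical field set).
type ObjectAttrs struct {
	BucketName  string `json:"bucketName"`
	Name        string `json:"name"`
	ContentType string `json:"contentType"`
	Size        int64  `json:"size"`
}

// storedObject mirrors an assumed on-disk JSON layout: attributes plus content.
type storedObject struct {
	ObjectAttrs
	Content []byte `json:"content"`
}

// readObjectAttrs decodes one object file but returns only its attributes,
// so the (potentially large) blob content is not kept resident in memory.
func readObjectAttrs(path string) (ObjectAttrs, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return ObjectAttrs{}, err
	}
	var obj storedObject
	if err := json.Unmarshal(data, &obj); err != nil {
		return ObjectAttrs{}, err
	}
	obj.Content = nil // discard the blob body; only the small attrs struct survives
	return obj.ObjectAttrs, nil
}
```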
Commit 2:

Previously, all files were read from a bucket before some of them were
dropped by the prefix test. This is extremely inefficient given that,
for the file system bucket, all files are actually read into memory. A
moderately large bucket will cause listing to take many minutes even
when only a few results are eventually returned.
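And a rough sketch of the second commit's idea, in the same assumed package as the sketch above (so it reuses the hypothetical ObjectAttrs and readObjectAttrs): the prefix test happens before the file is ever opened, so non-matching objects are never read or decoded. How the real backend maps object names to file names is an assumption here.

```go
package backend

import (
	"os"
	"path/filepath"
	"strings"
)

// listObjectAttrs returns the attributes of objects whose names start with
// prefix, without reading the contents of any non-matching object.
func listObjectAttrs(bucketDir, prefix string) ([]ObjectAttrs, error) {
	entries, err := os.ReadDir(bucketDir)
	if err != nil {
		return nil, err
	}
	var result []ObjectAttrs
	for _, entry := range entries {
		if entry.IsDir() {
			continue
		}
		// Assumption: the file name is the object name. The prefix check
		// runs here, before the file is opened or its JSON is decoded.
		if prefix != "" && !strings.HasPrefix(entry.Name(), prefix) {
			continue
		}
		attrs, err := readObjectAttrs(filepath.Join(bucketDir, entry.Name()))
		if err != nil {
			return nil, err
		}
		result = append(result, attrs)
	}
	return result, nil
}
```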
@fsouza fsouza enabled auto-merge (squash) August 13, 2021 17:08
@fsouza fsouza disabled auto-merge August 13, 2021 17:08
@fsouza (Owner) left a comment


Nice! Thank you very much! 🎉

@fsouza fsouza merged commit 200f86b into fsouza:main Aug 14, 2021