
Improve memory usage and execution time of listing objects with file system backend #556

Merged

Conversation

@ironsmile (Contributor) commented Aug 13, 2021

I am using the GCS Fake Server to develop locally and it is mostly great. But I've noticed it is completely unable to list the objects in my file system bucket, even when I give it a prefix which ensures only one object will match. It consumes all of the machine's memory and never finishes, presumably because it spends all of its time swapping. Some information about my bucket: 53,017 files with an overall size of 20.3 GB. Sadly, the nature of my work is such that this is a relatively small data set.

So I went in and started poking around the code. It quickly became evident that two things are happening:

  • All the files in the bucket are loaded into memory, all at once, for every object list command.
  • When filters (such as "prefix") are used, files which do not match the filter are still parsed and loaded into the process memory.

This PR fixes those two issues in its two commits. Previously, the list objects command was taking all of my machine's 32 GB of RAM and was not finishing even after I'd waited on it for half an hour. Now such list commands use barely any memory (in the range of a few KB) and finish almost instantly.

While the above is great, I suspect there are many more places where the emulator will now be significantly faster; I just haven't clocked them. Off the top of my head, deleting a bucket will now require almost no memory, where before it had the same problem as listing objects.

Further Improvements

It would be great if it were possible to read only the metadata for blobs stored on the file system. Unfortunately, with the JSON encoding I don't see how that would be possible: as it stands, one has to load all of the file contents in order for the JSON parser to do its thing. This is a pity, considering that in many situations we only want the blob metadata.

I think the only way to achieve this cleanly would be to drop the JSON altogether and find another way of storing the metadata. Possible approaches are file headers, similar to the nginx file cache, or separate ".attrs" sidecar files like what gocloud.dev/blob does.
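To illustrate the second approach, here is a hypothetical sketch (not something this PR implements) of ".attrs" sidecar files, loosely in the spirit of gocloud.dev/blob's fileblob driver: the blob content lives in one file and its metadata in a small JSON file next to it, so listing only ever reads the sidecar. The package, function names, and metadata shape below are made up for the example.

```go
package sidecar

import (
	"encoding/json"
	"os"
)

// objectMetadata is a made-up metadata shape for the example.
type objectMetadata struct {
	Name        string `json:"name"`
	ContentType string `json:"contentType"`
	Size        int64  `json:"size"`
}

// writeObject stores the blob content and its metadata in separate files,
// with the metadata in a small "<path>.attrs" sidecar.
func writeObject(path string, content []byte, meta objectMetadata) error {
	if err := os.WriteFile(path, content, 0o644); err != nil {
		return err
	}
	encoded, err := json.Marshal(meta)
	if err != nil {
		return err
	}
	return os.WriteFile(path+".attrs", encoded, 0o644)
}

// readMetadata loads only the small sidecar file, never the blob itself.
func readMetadata(path string) (objectMetadata, error) {
	data, err := os.ReadFile(path + ".attrs")
	if err != nil {
		return objectMetadata{}, err
	}
	var meta objectMetadata
	err = json.Unmarshal(data, &meta)
	return meta, err
}
```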

Commit 1:

Previously all of the object blobs were read into memory on every list
command, even when the list command would've returned nothing. There are
many problems with this approach:

* It is extremely slow to read all files on every list.
* If the blobs in storage ever exceed the memory on the machine, the
process will crash with an OOM error.

This PR makes it so that for most operations the actual blob contents
are not kept in memory; instead, only a small struct (ObjectAttrs) is
used. Unfortunately, due to the nature of JSON encoding, all objects are
still read from disk in full at least once.
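To make that idea concrete, here is a minimal sketch in Go. The type names, JSON field names, and on-disk layout are assumptions for illustration and are not taken from the actual fake-gcs-server code; only the general approach (decode the JSON once, keep just the attributes, drop the content) reflects what the commit describes.

```go
package backend

import (
	"encoding/json"
	"os"
)

// ObjectAttrs holds only the object metadata (hypothetical field set).
type ObjectAttrs struct {
	BucketName  string `json:"bucketName"`
	Name        string `json:"name"`
	ContentType string `json:"contentType"`
	Size        int64  `json:"size"`
}

// storedObject mirrors an assumed on-disk JSON layout: attributes plus content.
type storedObject struct {
	ObjectAttrs
	Content []byte `json:"content"`
}

// readObjectAttrs decodes one object file but returns only its attributes,
// so the (potentially large) blob content is not kept resident in memory.
func readObjectAttrs(path string) (ObjectAttrs, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return ObjectAttrs{}, err
	}
	var obj storedObject
	if err := json.Unmarshal(data, &obj); err != nil {
		return ObjectAttrs{}, err
	}
	obj.Content = nil // discard the blob body; only the small attrs struct survives
	return obj.ObjectAttrs, nil
}
```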
Commit 2:

Previously, all files were read from a bucket before some of them were
dropped by the prefix test. This is extremely inefficient given that,
for the file system bucket, all files are actually read into memory. A
moderately large bucket will cause listing to take many minutes even
when only a few results are eventually returned.
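And a rough sketch of the second commit's idea, in the same assumed package as the sketch above (so it reuses the hypothetical ObjectAttrs and readObjectAttrs): the prefix test happens before the file is ever opened, so non-matching objects are never read or decoded. How the real backend maps object names to file names is an assumption here.

```go
package backend

import (
	"os"
	"path/filepath"
	"strings"
)

// listObjectAttrs returns the attributes of objects whose names start with
// prefix, without reading the contents of any non-matching object.
func listObjectAttrs(bucketDir, prefix string) ([]ObjectAttrs, error) {
	entries, err := os.ReadDir(bucketDir)
	if err != nil {
		return nil, err
	}
	var result []ObjectAttrs
	for _, entry := range entries {
		if entry.IsDir() {
			continue
		}
		// Assumption: the file name is the object name. The prefix check
		// runs here, before the file is opened or its JSON is decoded.
		if prefix != "" && !strings.HasPrefix(entry.Name(), prefix) {
			continue
		}
		attrs, err := readObjectAttrs(filepath.Join(bucketDir, entry.Name()))
		if err != nil {
			return nil, err
		}
		result = append(result, attrs)
	}
	return result, nil
}
```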
@fsouza fsouza enabled auto-merge (squash) August 13, 2021 17:08
@fsouza fsouza disabled auto-merge August 13, 2021 17:08
@fsouza (Owner) left a comment


Nice! Thank you very much! 🎉

@fsouza fsouza merged commit 200f86b into fsouza:main Aug 14, 2021