a mmap'ed cache #177
Conversation
An interesting idea, as I said before. Would be nice to see what kind of performance you get here - presumably depends on the type of disc and OS caching policy (so complicated to measure); also, I suppose memory use does not necessarily stay low if large parts of a large file are accessed. Do you know what are the system requirements for sparse file support and mem-mapping? Am I right in thinking that you split the file into block-sized pieces and keep track, in memory, of which blocks have been seen? More comments to come on the code itself.
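For context, a minimal sketch of the sparse-file-plus-mmap pattern the question refers to, assuming POSIX-style sparse semantics (ext4, APFS, XFS allocate no disk blocks until a page is written; NTFS needs the file explicitly flagged as sparse, as far as I know). `fetch_range` here is a hypothetical stand-in for the PR's `_fetch_range`:

```python
import mmap
import tempfile

size = 100 * 2 ** 20           # pretend size of the remote object
blocksize = 5 * 2 ** 20
seen_blocks = set()            # block indices already downloaded

# Truncating up to `size` creates a sparse file on most POSIX
# filesystems: no disk space is consumed until a page is written.
fd = tempfile.TemporaryFile()
fd.truncate(size)
cache = mmap.mmap(fd.fileno(), size)

def ensure(start, end, fetch_range):
    """Populate the mapped region covering [start, end) if needed.
    `fetch_range(a, b)` is assumed to return exactly b - a bytes."""
    for block in range(start // blocksize, end // blocksize + 1):
        if block not in seen_blocks:
            b0 = block * blocksize
            b1 = min(b0 + blocksize, size)
            cache[b0:b1] = fetch_range(b0, b1)
            seen_blocks.add(block)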
@@ -1209,6 +1214,16 @@ def __init__(self, s3, path, mode='rb', block_size=5 * 2 ** 20, acl="",
            self.version_id = info.get('VersionId')
        except (ClientError, ParamValidationError) as e:
            raise_from(IOError("File not accessible", path), e)
            if self.file_backed_cache:
This indentation looks wrong, why would this block be within the `except` clause?
This block could be moved to a separate method, rather than extending `__init__` like this.
Yes, it was wrong - a copy-and-paste error.
f_no = fd.fileno()
self.start = 0
self.end = self.size
self.cache = mmap.mmap(f_no, self.size)
Just a suggestion: it feels like a class could encapsulate this caching behaviour, so that you only need to set `.cache` here to the instance. The instance would keep references to:
- the file that is mmapped
- the function partial to get a range of bytes (i.e., `_fetch_range` with all args filled in except the start/stop)
- its own close-on-del functionality

This way you wouldn't need the multiple `if self.file_backed_cache:` branchings. Also, you can make this class serialisable, for which you'd otherwise have to put extra logic into `__getstate__` of the file class - it would not be normal to serialise the S3File instances, but it's better practice to make sure it's possible.
yes
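To make the suggestion concrete, here is a rough sketch of what such an encapsulating class might look like. All names (`MMapCache`, `fetcher`) are illustrative, not from the PR, and the pickling support assumes `fetcher` itself is picklable:

```python
import mmap
import tempfile

class MMapCache:
    """Sparse-file-backed cache - a sketch, not the PR's implementation."""

    def __init__(self, size, fetcher, blocksize):
        self.size = size
        self.fetcher = fetcher        # callable: (start, end) -> bytes
        self.blocksize = blocksize
        self.blocks = set()           # block indices already downloaded
        self._fd = tempfile.TemporaryFile()
        self._fd.truncate(size)       # sparse on most POSIX filesystems
        self.cache = mmap.mmap(self._fd.fileno(), size)

    def __getitem__(self, item):
        start = item.start or 0
        stop = item.stop or self.size
        for i in range(start // self.blocksize, stop // self.blocksize + 1):
            if i not in self.blocks:
                b0 = i * self.blocksize
                b1 = min(b0 + self.blocksize, self.size)
                self.cache[b0:b1] = self.fetcher(b0, b1)
                self.blocks.add(i)
        return self.cache[start:stop]

    def __getstate__(self):
        # Drop the unpicklable mmap/file handles; keep only what is
        # needed to rebuild an empty cache on the other side.
        return {"size": self.size, "fetcher": self.fetcher,
                "blocksize": self.blocksize}

    def __setstate__(self, state):
        self.__init__(state["size"], state["fetcher"], state["blocksize"])

    def __del__(self):
        # Close-on-del, as suggested above.
        if getattr(self, "cache", None) is not None:
            self.cache.close()
```

With something like this, the file class would only do `self.cache = MMapCache(self.size, fetcher, self.blocksize)` and the rest of the code could index `self.cache[start:stop]` without any mode-specific branching.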
@@ -1296,6 +1311,8 @@ def readline(self, length=-1):

        If length is specified, at most size bytes will be read.
        """
        if self.file_backed_cache:
Well, this is not really true - text mode shouldn't be done as it is here at all, but with a TextIOWrapper, which is what happens in fsspec. Then, readline wouldn't depend on the specifics of byte fetching. I suspect this here should work with the current code and mmap implementation too, though.
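For illustration, the TextIOWrapper pattern mentioned here would look roughly like this (`s3.open` and `process` are placeholders for whatever produces the binary file and consumes the lines):

```python
import io

# Keep the S3-backed file strictly binary; let TextIOWrapper layer
# decoding and universal-newline line splitting on top of byte reads.
raw = s3.open('bucket/key', mode='rb')      # any binary file-like object
text = io.TextIOWrapper(raw, encoding='utf-8')
for line in text:
    process(line)
```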
@@ -1339,7 +1356,7 @@ def _fetch(self, start, end):
        if start < self.start:
            if not self.fill_cache and end + self.blocksize < self.start:
                self.start, self.end = None, None
                return self._fetch(start, end)
No need to change these - you made sure that `_fetch` does the same thing as before when not mmapping.
self.cache = None

if self.file_backed_cache:
    self.cache.close()
If you just do `self.cache = None`, so the reference drops, doesn't the file get cleaned up anyway upon garbage collection?
true.
f_no = fd.fileno()
self.start = 0
self.end = self.size
self.cache = mmap.mmap(f_no, self.size)
Note that the Windows version of mmap will automatically extend the file size here, rather than doing the sparse trick (but this might take up the full space of the file - to be tried). I suspect the sparse trick might not work on Windows; you'd get a "write past EOF" error or similar.
What is Windows? lol. Sorry, I only have Mac and Linux at hand.
end = self.end
start_block = start // self.blocksize
end_block = end // self.blocksize
for i in range(start_block, end_block + 1):
This loop may be wasteful if requesting contiguous blocks; often the overhead of establishing the connection is comparable to the download time. Instead of doing a fetch for each block, they should be combined where possible.
Totally agreed. I just wanted to show how it works; it really should check and merge small requests like this to be as large as possible.
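A sketch of what that merging could look like, assuming the block-tracking set from the diff above (names here are illustrative):

```python
def missing_ranges(start_block, end_block, seen, blocksize, size):
    """Yield (start, end) byte ranges covering runs of blocks that are
    not yet cached, merging adjacent missing blocks into one request."""
    run_start = None
    for i in range(start_block, end_block + 1):
        if i not in seen:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            yield run_start * blocksize, i * blocksize
            run_start = None
    if run_start is not None:
        yield run_start * blocksize, min((end_block + 1) * blocksize, size)

# One GET per contiguous run instead of one GET per block:
# for s, e in missing_ranges(start_block, end_block, seen, blocksize, size):
#     cache[s:e] = fetch_range(s, e)
```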
@@ -1296,6 +1311,8 @@ def readline(self, length=-1):

        If length is specified, at most size bytes will be read.
        """
        if self.file_backed_cache:
            raise ValueError('readline not available in file backed cache mode')
I notice that the mmap API supports readline directly https://docs.python.org/3/library/mmap.html#mmap.mmap.readline
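A quick demonstration of that API (pure stdlib; in the cache context the relevant blocks would of course need to be fetched into the mapping before readline returns meaningful data):

```python
import mmap
import tempfile

f = tempfile.TemporaryFile()
f.write(b'first line\nsecond line\n')
f.flush()

m = mmap.mmap(f.fileno(), 0)   # length 0 maps the whole file
print(m.readline())            # b'first line\n' -- mmap keeps a position
print(m.readline())            # b'second line\n'
```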
Thanks for the comments. Yes, it is just to show the idea of what an mmap-based cache could look like. Performance-wise, it highly depends on the workload. We have a pattern where it will read a portion of block X, then seek to read block Y, then later come back to block X again. The current cache with trim will not be able to deal with this, and that is why I proposed this mmap idea. I like the idea of extracting this into a separate caching manager so it could be extensible.
That is why I said up front that this code is not meant to be merged but to show the idea. I think the original s3fs needs some refactoring before adding any code like this.
allows for implementing mmap from fsspec/s3fs#177
Implemented in fsspec, no need to duplicate here
Please do not merge; this just shows the idea for discussion. This is a file-backed cache which memory-maps a sparse local file and fills it block by block as ranges are requested.

Pros (from the discussion above):
- cached blocks persist, so seeking back to an earlier block does not re-download it, unlike the current trim-based cache
- resident memory can stay low, since the OS pages the mapped file in and out

Cons:
- sparse-file and mmap behaviour is platform dependent (e.g. Windows may allocate the full file size)
- fetching block by block adds per-request connection overhead unless requests are merged