a mmap'ed cache #177
Conversation
An interesting idea, as I said before. Would be nice to see what kind of performance you get here - presumably depends on the type of disc and OS caching policy (so complicated to measure); also, I suppose memory use does not necessarily stay low if large parts of a large file are accessed. Do you know what are the system requirements for sparse file support and mem-mapping? Am I right in thinking that you split the file into block-sized pieces and keep track, in memory, of which blocks have been seen? More comments to come on the code itself.
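For context, a minimal sketch of the sparse-file-plus-mmap pattern the question refers to, assuming POSIX-style sparse semantics (ext4, APFS, XFS allocate no disk blocks until a page is written; NTFS needs the file explicitly flagged as sparse, as far as I know). `fetch_range` here is a hypothetical stand-in for the PR's `_fetch_range`:

```python
import mmap
import tempfile

size = 100 * 2 ** 20           # pretend size of the remote object
blocksize = 5 * 2 ** 20
seen_blocks = set()            # block indices already downloaded

# Truncating up to `size` creates a sparse file on most POSIX
# filesystems: no disk space is consumed until a page is written.
fd = tempfile.TemporaryFile()
fd.truncate(size)
cache = mmap.mmap(fd.fileno(), size)

def ensure(start, end, fetch_range):
    """Populate the mapped region covering [start, end) if needed.
    `fetch_range(a, b)` is assumed to return exactly b - a bytes."""
    for block in range(start // blocksize, end // blocksize + 1):
        if block not in seen_blocks:
            b0 = block * blocksize
            b1 = min(b0 + blocksize, size)
            cache[b0:b1] = fetch_range(b0, b1)
            seen_blocks.add(block)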
@@ -1209,6 +1214,16 @@ def __init__(self, s3, path, mode='rb', block_size=5 * 2 ** 20, acl="",
            self.version_id = info.get('VersionId')
        except (ClientError, ParamValidationError) as e:
            raise_from(IOError("File not accessible", path), e)
            if self.file_backed_cache:
This indentation looks wrong, why would this block be within the `except` clause?
This block could be moved to a separate method, rather than extending `__init__` like this.
Yes, it was wrong - a copy-and-paste error.
f_no = fd.fileno()
self.start = 0
self.end = self.size
self.cache = mmap.mmap(f_no, self.size)
Just a suggestion: it feels like a class could encapsulate this caching behaviour, so that you only need to set `.cache` here to the instance. The instance would keep references to:
- the file that is mmapped
- the function partial to get a range of bytes (i.e., `_fetch_range` with all args filled in except the start/stop)
- its own close-on-del functionality

This way you wouldn't need the multiple `if self.file_backed_cache:` branchings. Also, you can make this class serialisable, for which you'd otherwise have to put extra logic into `__getstate__` of the file class - it would not be normal to serialise the S3File instances, but it's better practice to make sure it's possible.
yes
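To make the suggestion concrete, here is a rough sketch of what such an encapsulating class might look like. All names (`MMapCache`, `fetcher`) are illustrative, not from the PR, and the pickling support assumes `fetcher` itself is picklable:

```python
import mmap
import tempfile

class MMapCache:
    """Sparse-file-backed cache - a sketch, not the PR's implementation."""

    def __init__(self, size, fetcher, blocksize):
        self.size = size
        self.fetcher = fetcher        # callable: (start, end) -> bytes
        self.blocksize = blocksize
        self.blocks = set()           # block indices already downloaded
        self._fd = tempfile.TemporaryFile()
        self._fd.truncate(size)       # sparse on most POSIX filesystems
        self.cache = mmap.mmap(self._fd.fileno(), size)

    def __getitem__(self, item):
        start = item.start or 0
        stop = item.stop or self.size
        for i in range(start // self.blocksize, stop // self.blocksize + 1):
            if i not in self.blocks:
                b0 = i * self.blocksize
                b1 = min(b0 + self.blocksize, self.size)
                self.cache[b0:b1] = self.fetcher(b0, b1)
                self.blocks.add(i)
        return self.cache[start:stop]

    def __getstate__(self):
        # Drop the unpicklable mmap/file handles; keep only what is
        # needed to rebuild an empty cache on the other side.
        return {"size": self.size, "fetcher": self.fetcher,
                "blocksize": self.blocksize}

    def __setstate__(self, state):
        self.__init__(state["size"], state["fetcher"], state["blocksize"])

    def __del__(self):
        # Close-on-del, as suggested above.
        if getattr(self, "cache", None) is not None:
            self.cache.close()
```

With something like this, the file class would only do `self.cache = MMapCache(self.size, fetcher, self.blocksize)` and the rest of the code could index `self.cache[start:stop]` without any mode-specific branching.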
@@ -1296,6 +1311,8 @@ def readline(self, length=-1):

        If length is specified, at most size bytes will be read.
        """
        if self.file_backed_cache:
Well, this is not really true - text mode shouldn't be done as it is here at all, but with a TextIOWrapper, which is what happens in fsspec. Then, readline wouldn't depend on the specifics of byte fetching. I suspect this here should work with the current code and mmap implementation too, though.
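For illustration, the TextIOWrapper pattern mentioned here would look roughly like this (`s3.open` and `process` are placeholders for whatever produces the binary file and consumes the lines):

```python
import io

# Keep the S3-backed file strictly binary; let TextIOWrapper layer
# decoding and universal-newline line splitting on top of byte reads.
raw = s3.open('bucket/key', mode='rb')      # any binary file-like object
text = io.TextIOWrapper(raw, encoding='utf-8')
for line in text:
    process(line)
```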
@@ -1339,7 +1356,7 @@ def _fetch(self, start, end):
        if start < self.start:
            if not self.fill_cache and end + self.blocksize < self.start:
                self.start, self.end = None, None
                return self._fetch(start, end)
No need to change these - you made sure that `_fetch` does the same thing as before when not mmapping.
self.cache = None

if self.file_backed_cache:
    self.cache.close()
If you just do `self.cache = None`, so the reference drops, doesn't the file get cleaned up anyway upon garbage collection?
true.
f_no = fd.fileno()
self.start = 0
self.end = self.size
self.cache = mmap.mmap(f_no, self.size)
Note that the Windows version of mmap will automatically extend the file size here, rather than doing the sparse trick (but this might take up the full space of the file - to be tried). I suspect the sparse trick might not work on Windows; you'd get a "write past EOF" error or similar.
What is Windows? lol. Sorry, I only have Mac and Linux at hand.
end = self.end
start_block = start // self.blocksize
end_block = end // self.blocksize
for i in range(start_block, end_block + 1):
This loop may be wasteful if requesting contiguous blocks; often the overhead of establishing the connection is comparable to the download time. Instead of doing a fetch for each block, they should be combined where possible.
Totally agreed. I just wanted to show how it works; it really should check and merge small requests like this to be as large as possible.
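A sketch of what that merging could look like, assuming the block-tracking set from the diff above (names here are illustrative):

```python
def missing_ranges(start_block, end_block, seen, blocksize, size):
    """Yield (start, end) byte ranges covering runs of blocks that are
    not yet cached, merging adjacent missing blocks into one request."""
    run_start = None
    for i in range(start_block, end_block + 1):
        if i not in seen:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            yield run_start * blocksize, i * blocksize
            run_start = None
    if run_start is not None:
        yield run_start * blocksize, min((end_block + 1) * blocksize, size)

# One GET per contiguous run instead of one GET per block:
# for s, e in missing_ranges(start_block, end_block, seen, blocksize, size):
#     cache[s:e] = fetch_range(s, e)
```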
@@ -1296,6 +1311,8 @@ def readline(self, length=-1):

        If length is specified, at most size bytes will be read.
        """
        if self.file_backed_cache:
            raise ValueError('readline not available in file backed cache mode')
I notice that the mmap API supports readline directly https://docs.python.org/3/library/mmap.html#mmap.mmap.readline
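A quick demonstration of that API (pure stdlib; in the cache context the relevant blocks would of course need to be fetched into the mapping before readline returns meaningful data):

```python
import mmap
import tempfile

f = tempfile.TemporaryFile()
f.write(b'first line\nsecond line\n')
f.flush()

m = mmap.mmap(f.fileno(), 0)   # length 0 maps the whole file
print(m.readline())            # b'first line\n' -- mmap keeps a position
print(m.readline())            # b'second line\n'
```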
Thanks for the comments. Yes, it is just to show the idea of what an mmap-based cache could look like. Performance-wise, it highly depends on the workload. We have a pattern where it will read a portion of block X, then seek to read block Y, then later come back to block X again. The current cache with trim will not be able to deal with this, and that is why I proposed this mmap idea. I like the idea of extracting this into a separate caching manager so it could be extensible.
That is why I said up front that this code is not meant to be merged but to show the idea. I think the original s3fs needs some refactoring before adding any code like this.
allows for implementing mmap from fsspec/s3fs#177
Implemented in fsspec, no need to duplicate here
Please do not merge; this just shows the idea for discussion. This is a file-backed cache which memory-maps a sparse local file and fills it block by block as ranges are requested.

Pros (from the discussion above):
- cached blocks persist, so seeking back to an earlier block does not re-download it, unlike the current trim-based cache
- resident memory can stay low, since the OS pages the mapped file in and out

Cons:
- sparse-file and mmap behaviour is platform dependent (e.g. Windows may allocate the full file size)
- fetching block by block adds per-request connection overhead unless requests are merged