
a mmap'ed cache #177

Closed · wants to merge 1 commit
Conversation

blackmagic02881 (Contributor)

Please do not merge; this just shows the idea for discussion. This is a file-backed cache which:

  • creates a tmp file whose lifecycle matches that of the object
  • mmaps the file as a byte stream; data is pulled from remote on the first cache miss

Pros:

  • larger cache, better performance on random access

Cons:

  • the cache is lost as soon as the object is released
  • accessing every part of a large object could run the tmp directory out of space
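The idea described above can be sketched roughly as follows. This is a hypothetical standalone illustration, not the PR's code: `MmapCache`, `fetch_range`, and the block-tracking set are all made-up names standing in for s3fs internals such as `_fetch_range`.

```python
import mmap
import tempfile

class MmapCache:
    """Sketch of a file-backed cache: a sparse temp file sized to the
    remote object, mmapped as the cache buffer, filled lazily per block
    on first miss."""

    def __init__(self, size, blocksize, fetch_range):
        self.size = size
        self.blocksize = blocksize
        self.fetch_range = fetch_range   # callable(start, end) -> bytes
        self.blocks = set()              # block indices already fetched
        fd = tempfile.TemporaryFile()    # lifecycle tied to this object
        fd.truncate(size)                # sparse on most filesystems
        self.fd = fd
        self.cache = mmap.mmap(fd.fileno(), size)

    def read(self, start, end):
        # Pull any missing blocks covering [start, end), then serve
        # the request straight from the mapped buffer.
        for i in range(start // self.blocksize, end // self.blocksize + 1):
            if i not in self.blocks:
                s = i * self.blocksize
                e = min(s + self.blocksize, self.size)
                self.cache[s:e] = self.fetch_range(s, e)
                self.blocks.add(i)
        return self.cache[start:end]
```

Repeated reads of the same region then touch only the mmap, which is the random-access win the description claims.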

@martindurant (Member)

An interesting idea, as I said before. It would be nice to see what kind of performance you get here; presumably it depends on the type of disc and the OS caching policy (so complicated to measure). Also, I suppose memory use does not necessarily stay low if large parts of a large file are accessed. Do you know what the system requirements are for sparse-file support and memory-mapping?

Am I right in thinking that you split the file into block-sized pieces and keep track, in memory, of which blocks have been seen?

More comments to come on the code itself.

@@ -1209,6 +1214,16 @@ def __init__(self, s3, path, mode='rb', block_size=5 * 2 ** 20, acl="",
            self.version_id = info.get('VersionId')
        except (ClientError, ParamValidationError) as e:
            raise_from(IOError("File not accessible", path), e)
            if self.file_backed_cache:
martindurant (Member):
This indentation looks wrong, why would this block be within the except clause?

martindurant (Member):

This block could be moved to a separate method, rather than extending __init__ like this

blackmagic02881 (Contributor, Author):

Yes, it was wrong; a copy-and-paste error.

f_no = fd.fileno()
self.start = 0
self.end = self.size
self.cache = mmap.mmap(f_no, self.size)
martindurant (Member):
Just a suggestion: it feels like a class could encapsulate this caching behaviour, so that you only need to set .cache here to the instance. The instance keeps references to:

  • the file that is mmapped
  • a partial of the function to get a range of bytes (i.e., _fetch_range with all args filled in except the start/stop)
  • its own close-on-del functionality

This way you wouldn't need the multiple if self.file_backed_cache: branchings.

Also, you could make this class serialisable; otherwise you'd have to put extra logic into __getstate__ of the file class. It would not be normal to serialise S3File instances, but it's better practice to make sure it's possible.

blackmagic02881 (Contributor, Author):

yes

@@ -1296,6 +1311,8 @@ def readline(self, length=-1):

        If length is specified, at most size bytes will be read.
        """
        if self.file_backed_cache:
martindurant (Member):

Well, this is not really true - text mode shouldn't be done as it is here at all, but with a TextIOWrapper, which is what happens in fsspec. Then, readline wouldn't depend on the specifics of byte fetching. I suspect this here should work with the current code and mmap implementation too, though.
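The TextIOWrapper approach mentioned above can be shown with stdlib pieces only (BytesIO stands in for the S3-backed binary file): text-mode behaviour, including readline, is layered on top of any binary stream, so the byte cache never needs to know about lines.

```python
import io

# Any binary file-like object can be wrapped; decoding, universal
# newlines, and readline all live in the wrapper, not the byte layer.
raw = io.BytesIO(b"first line\nsecond line\n")
text = io.TextIOWrapper(raw, encoding="utf-8")
line = text.readline()   # "first line\n"
```

This is why readline support need not depend on the specifics of byte fetching.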

@@ -1339,7 +1356,7 @@ def _fetch(self, start, end):
        if start < self.start:
            if not self.fill_cache and end + self.blocksize < self.start:
                self.start, self.end = None, None
                return self._fetch(start, end)
martindurant (Member):
No need to change these - you made sure that _fetch does the same thing as before when not mmapping.

self.cache = None

if self.file_backed_cache:
    self.cache.close()
martindurant (Member):
If you just do self.cache = None, so the reference drops, doesn't the file get cleaned up anyway upon garbage collection?

blackmagic02881 (Contributor, Author):

true.

f_no = fd.fileno()
self.start = 0
self.end = self.size
self.cache = mmap.mmap(f_no, self.size)
martindurant (Member):

Note that the Windows version of mmap will automatically extend the file size here, rather than doing the sparse trick (but this might take up the full space of the file; to be tried). I suspect the sparse trick might not work on Windows; you'd get a "write past EOF" error or similar.
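The "sparse trick" referred to above is just extending a file past its data with truncate (or a seek-past-EOF write): on sparse-capable filesystems (ext4, APFS, XFS, ...) the apparent size grows but no blocks are allocated until written. A quick way to check the behaviour on a given platform:

```python
import os
import tempfile

# Create a file with a 1 GiB apparent size without writing 1 GiB.
# On NTFS a file is only sparse if explicitly flagged, so the same
# call may reserve real space there (the concern raised above).
fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, 1 << 30)          # 1 GiB apparent size
    st = os.stat(path)
    apparent = st.st_size              # reported size: 1 GiB
    used = getattr(st, "st_blocks", None)  # POSIX-only: blocks actually allocated
finally:
    os.close(fd)
    os.unlink(path)
```

On a sparse-capable filesystem `used` stays near zero despite the 1 GiB apparent size.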

blackmagic02881 (Contributor, Author):

What is Windows? lol. Sorry, I only have Mac and Linux at hand.

end = self.end
start_block = start // self.blocksize
end_block = end // self.blocksize
for i in range(start_block, end_block + 1):
martindurant (Member):
This loop may be wasteful when requesting contiguous blocks; often the overhead of establishing the connection is comparable to the download time. Instead of doing a fetch for each block, they should be combined where possible.
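The coalescing suggested here could be sketched as follows (hypothetical helper; `have` plays the role of the set of already-cached block indices): runs of consecutive missing blocks collapse into single ranged fetches.

```python
def missing_ranges(start_block, end_block, have):
    """Group consecutive missing blocks in [start_block, end_block]
    into inclusive (first_block, last_block) pairs, so each pair can
    become one ranged request instead of many per-block fetches."""
    ranges = []
    run = None
    for i in range(start_block, end_block + 1):
        if i in have:
            if run:
                ranges.append(run)
                run = None
        else:
            run = (run[0], i) if run else (i, i)
    if run:
        ranges.append(run)
    return ranges
```

For example, with blocks 2 and 3 already cached, a request spanning blocks 0 through 5 needs only two fetches: blocks 0 to 1 and blocks 4 to 5.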

blackmagic02881 (Contributor, Author):

Totally agreed. I just wanted to show how it works; it really should check and merge small requests like this to be as large as possible.

@@ -1296,6 +1311,8 @@ def readline(self, length=-1):

        If length is specified, at most size bytes will be read.
        """
        if self.file_backed_cache:
            raise ValueError('readline not available in file backed cache mode')
martindurant (Member):
I notice that the mmap API supports readline directly https://docs.python.org/3/library/mmap.html#mmap.mmap.readline
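This is easy to confirm: mmap objects are file-like, so readline could come for free instead of raising. A minimal demonstration using an anonymous map in place of the temp-file cache:

```python
import mmap

# mmap exposes file-like read/seek/readline over the mapped bytes.
mm = mmap.mmap(-1, 32)        # anonymous 32-byte map
mm.write(b"alpha\nbeta\n")
mm.seek(0)
first = mm.readline()         # b"alpha\n"
```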

@blackmagic02881 (Contributor, Author)

Thanks for the comments. Yes, it is just to show the idea of what a mmap'ed cache could look like. Performance-wise, it highly depends on the workload. We have a pattern where a read touches a portion of block X, then seeks to read block Y, then later comes back to block X again. The current cache, with its trimming, cannot deal with this, and that is why I proposed this mmap idea.

I like the idea of extracting this into a separate caching manager so it could be extensible.

@blackmagic02881 (Contributor, Author)

That is why I said up front that this code is not to be merged but to show the idea. I think the original s3fs needs some refactoring before adding any code like this.

martindurant pushed a commit to fsspec/filesystem_spec that referenced this pull request Apr 23, 2019
@martindurant (Member)

Implemented in fsspec, no need to duplicate here
