Read invalid file content due to directory cache #424
Comments
Note that the directory listings cache supports timeouts, if requested, via a constructor argument.
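A minimal sketch of requesting such a timeout; the parameter name `listings_expiry_time` is an assumption based on recent fsspec-based s3fs releases and may differ in the version discussed here:

```python
import s3fs

# Assumption: `listings_expiry_time` (in seconds) is the constructor
# argument that bounds the lifetime of directory-listing cache entries.
fs = s3fs.S3FileSystem(listings_expiry_time=60)

# Listings older than 60 seconds are re-fetched on the next access.
fs.ls("my-bucket")  # "my-bucket" is a placeholder
```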
@martindurant Yes, but this does not solve the issue, it just makes it less likely. First of all, it is not transparent to the user in which cases the directory cache is filled.

Second, this is about an inconsistent state due to the directory cache. If the file content were cached together with the directory cache (which is of course not possible), the situation would be different, since then you would simply get the last cached file content. But here you get corrupted file content.

And finally, what is a good timeout in those cases? 1 day? 1 hour? 1 minute? What is the right timing that makes race conditions unlikely but still gives you good performance? Even for my own application I don't know.

But you can think about it another way: when it is worth reading the content of an object in S3, it is even more worth reading the header first to get an up-to-date object size and ETag.
On the other hand, making extra requests on every access is wasteful; otherwise we would have no directory cache at all. I think this situation is best explained in documentation. However, I totally agree with using the ETag information from the cache, when we already have it, as you proposed.
(PS: you can turn off the directory cache completely, if you wish)
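As an illustration of how cached ETag information could be used defensively, here is a sketch using boto3's conditional GET; this is not s3fs's actual implementation, and the bucket, key, and cached ETag value are placeholders:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
cached_etag = '"0123456789abcdef0123456789abcdef"'  # placeholder: would come from the dircache entry

try:
    # IfMatch makes S3 reject the read if the object no longer has the
    # ETag we cached, instead of returning mismatched bytes.
    resp = s3.get_object(Bucket="my-bucket", Key="data/test.txt", IfMatch=cached_etag)
    body = resp["Body"].read()
except ClientError as err:
    if err.response["Error"]["Code"] == "PreconditionFailed":
        # The object changed since the listing: refresh metadata and retry.
        pass
    else:
        raise
```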
@martindurant I think the directory cache is absolutely perfect and meaningful when working with directory content only.

Is there a possibility (official API) to invalidate the cache for a certain file key? In that case I could call the invalidate method before opening the file, to ensure that the file header is always up-to-date when opening it.

edit: I think I found it.
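For reference, a minimal sketch of per-key invalidation using the `invalidate_cache` method that fsspec filesystems expose; whether this is the method the commenter found, and how fine-grained the invalidation is, depends on the s3fs version:

```python
import s3fs

fs = s3fs.S3FileSystem()
path = "my-bucket/data/test.txt"  # placeholder

# Drop any cached listing entry covering this key, so the next
# info()/open() call fetches fresh metadata from S3.
fs.invalidate_cache(path)

with fs.open(path, "rb") as f:
    data = f.read()
```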
Is there still something to do here? |
I solved this issue by always invalidating the cache entry before opening the file. So, if you don't want to refresh the cache entry for the file that is going to be opened, there is nothing further to do from my side.

Best regards
What happened:
The following error is an extension of this error:
#422
In the original error we see s3fs reading invalid file content when the file has changed between opening and reading it. This is a rare race condition and must be caught (or prevented) by the higher-layer application. However, this error gets worse when using the directory cache of s3fs, where it can also happen without a race between two concurrent processes.
If we call `fs.ls(...)` on that file, the file information (including size) is stored inside the directory cache. Thus, when we change the file after the `ls` command and its size changes, we cannot read the file afterwards; even after hours, reading is still not possible. The reason is that in the constructor of `S3File` and its inherited base class, the size (detail dictionary) is read using `fs.info(...)`, but without the `refresh=True` argument. The size is therefore taken from the directory cache and is most probably outdated.

What you expected to happen:
I expect the size to be refreshed correctly when opening the file, so that a write to the same file from another process does not trigger this error.
Minimal Complete Verifiable Example:
In the following example, I simulate write access to the same file from another process with a direct boto3 call, so as not to affect internal state of the s3fs object, such as its caches:
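A hedged sketch of the described reproduction; bucket and key names are placeholders, and the exact exception raised by the final read may vary by version:

```python
import boto3
import s3fs

BUCKET = "my-test-bucket"  # placeholder
KEY = "data/test.txt"      # placeholder
PATH = f"{BUCKET}/{KEY}"

fs = s3fs.S3FileSystem()

# Write initial content through s3fs (10 bytes).
with fs.open(PATH, "wb") as f:
    f.write(b"0123456789")

# Populate the directory cache: the entry now records size=10.
fs.ls(BUCKET)

# Simulate another process: overwrite the object with a shorter body
# via boto3 directly, so s3fs's internal caches are not updated.
boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=b"abc")

# S3File still believes the file is 10 bytes and requests byte ranges
# past the new end of the object, so this read fails.
with fs.open(PATH, "rb") as f:
    print(f.read())
```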
Anything else we need to know?:
I suggest always refreshing the file details when creating the file object (i.e., when opening the file).
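Continuing from the reproduction sketch above, a sketch of the keyword the report refers to; `refresh=True` on `info` forces a fresh lookup instead of answering from the directory cache (the exact caching behavior of the subsequent `open` depends on the s3fs version):

```python
# Possibly served from the directory cache, hence potentially stale:
stale = fs.info(PATH)

# Forces a fresh lookup against S3, bypassing the cached listing:
fresh = fs.info(PATH, refresh=True)

# After the out-of-band boto3 write above, the two sizes differ.
print(stale["size"], fresh["size"])
```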
Environment:
Same as in #422