
Read invalid file content due to directory cache #424

Closed
anetbnd opened this issue Feb 17, 2021 · 7 comments

Comments

@anetbnd

anetbnd commented Feb 17, 2021


What happened:

This issue is an extension of #422.

In the original issue we see s3fs reading invalid file content when the file has changed between opening and reading it. That is a rare race condition and must be caught (or prevented) by the higher-layer application. However, the problem gets worse when the directory cache of s3fs is in use, because then it can also happen without a race between two concurrent processes.

If we call fs.ls(...) on that file, the file information (including the size) is stored in the directory cache. So when we change the file after the ls call and its size changes, we cannot read the file correctly afterwards - even hours later, reading still fails. The reason is that the constructor of S3File and its inherited base class read the size (the details dictionary) via fs.info(...), but without refresh=True. The size is therefore taken from the directory cache and is most probably outdated.
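
To illustrate, a minimal sketch of the divergence (the bucket and key names are hypothetical, and I assume fs.info supports the refresh keyword mentioned above):

import s3fs

fs = s3fs.S3FileSystem()  # credentials taken from the environment, for brevity

fs.ls('my-bucket')  # fills the directory cache with the listed file details

# ... 'my-bucket/test.file' is overwritten by another process, its size changes ...

cached = fs.info('my-bucket/test.file')               # served from the cache: stale size
fresh = fs.info('my-bucket/test.file', refresh=True)  # forces a fresh lookup against S3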

What you expected to happen:

I expect the size to be refreshed when opening the file, so that a write to the same file from another process does not trigger this error.

Minimal Complete Verifiable Example:

In the following example, I simulate a write to the same file from another process with a direct boto3 call, so that the internal state of the s3fs object (such as its caches) is not affected:

import s3fs
import boto3
import botocore.config


PROXY = None
AWS_KEY = '...'
AWS_SECRET = '...'
USE_SSL = False
BUCKET = '...'
FILE = 'test.file'
FILE_PATH = f's3://{BUCKET}/{FILE}'

def create_file(boto_client, content: bytes):
    # Write directly via boto3 so s3fs's internal caches are not touched.
    obj = boto_client.Object(BUCKET, FILE)
    obj.put(Body=content)


fs = s3fs.S3FileSystem(
    anon=False,
    key=AWS_KEY,
    secret=AWS_SECRET,
    config_kwargs={'proxies':{'http': PROXY, 'https': PROXY}},
    use_ssl=USE_SSL,
)

boto_session = boto3.session.Session(
    aws_access_key_id=AWS_KEY,
    aws_secret_access_key=AWS_SECRET,
)

boto_client = boto_session.resource(
    's3',
    use_ssl=USE_SSL,
    config=botocore.config.Config(proxies={'http': PROXY, 'https': PROXY})
)


# Initial write: the file is 3 bytes long.
create_file(boto_client, b'123')

with fs.open(FILE_PATH, 'rb') as f:
    content = f.read()
    assert content == b'123', f"content was {content}"

# fs.ls(FILE_PATH) <- uncomment this to trigger the error

# Overwrite from "another process": the size changes from 3 to 14 bytes.
create_file(boto_client, b'ABCDEFGHIJKLMN')

# With the ls above uncommented, this read fails, because S3File picks up
# the stale 3-byte size from the directory cache.
with fs.open(FILE_PATH, 'rb') as f:
    content = f.read()
    assert content == b'ABCDEFGHIJKLMN', f"content was {content}"

Anything else we need to know?:

I suggest always refreshing the file details when creating the file object (i.e. when opening the file).

Environment:

Same as in #422

@anetbnd changed the title from "Read invalid file content due to directory caching" to "Read invalid file content due to directory cache" on Feb 17, 2021
@martindurant
Member

Note that the directory listings cache supports timeouts, if requested, with the argument listings_expiry_time, or LRU behaviour with the argument max_paths. These are described in fsspec.dircache.DirCache.
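
A short sketch of how these could be set (assuming the keyword arguments are forwarded to fsspec.dircache.DirCache, as described above):

import s3fs

# Listing entries expire 30 seconds after they were cached.
fs = s3fs.S3FileSystem(listings_expiry_time=30)

# Or keep at most 100 listed paths, evicting the least recently used.
fs_lru = s3fs.S3FileSystem(max_paths=100)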

@anetbnd
Author

anetbnd commented Feb 17, 2021

@martindurant Yes, but this does not solve the issue, it just makes it less likely.

First of all, it is not transparent to the user in which cases the directory cache is filled. For example, info and isfile do not fill the directory cache, and isdir does so only when used on the parent directory. Other commands call these commands indirectly; you have to go deep into the s3fs code to understand it, and with each version of s3fs this could change.

Second, this is about an inconsistent state caused by the directory cache. If the file content were cached together with the directory listing (which is of course not possible), the situation would be different, since you would then simply get the last cached file content. But here you get corrupted file content, and just from calling the ls and open methods (there could be days between them) it is absolutely not clear to the user why the content is corrupted.

And finally, what is a good timeout in such cases? One day? One hour? One minute? What is the right setting to make race conditions unlikely while still giving good performance? Even for my own application I don't know. But you can look at it another way: when it is worth reading the content of an object in S3, it is even more worth reading the header first to get an up-to-date object size and ETag.

@martindurant
Member

On the other hand, making extra requests on every access is wasteful - otherwise we would have no directory cache at all. I think this situation is best explained in documentation. However, I totally agree with using the ETag information from the cache when we already have it, as you proposed.
The answer to "what's the best expiry" is situation-dependent, and supposes that the user knows something about the state of their data. A good deal of data on S3 is write once, read many (so the default works), or is sometimes the result of some Lambda, where the trigger notification ensures that the file has just appeared (so the default works here too).

@martindurant
Member

(PS: you can turn off the directory cache completely, if you wish)
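
A minimal sketch of that (again assuming the keyword is forwarded to fsspec.dircache.DirCache):

import s3fs

# No listings are cached: every ls/info goes to S3, trading requests for freshness.
fs = s3fs.S3FileSystem(use_listings_cache=False)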

@anetbnd
Author

anetbnd commented Feb 17, 2021

@martindurant I think the directory cache is absolutely sensible when working with directory content only, i.e. doing ls, info, isdir, isfile, checksum and so on.

Is there a possibility (an official API) to invalidate the cache for a certain file key? Then I could call the invalidate method before opening the file, to ensure that the file header is always up-to-date when opening it.

edit: I think I found it:
invalidate_cache(path=None)
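
A minimal sketch of this workaround, reusing the fs object from the example above (the key name is hypothetical):

# Drop the cached entry so size and ETag are re-fetched on open.
fs.invalidate_cache(path='my-bucket/test.file')
with fs.open('s3://my-bucket/test.file', 'rb') as f:
    content = f.read()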

@martindurant
Member

Is there still something to do here?

@anetbnd
Author

anetbnd commented Apr 14, 2021

Hi @martindurant

I solved this issue by always calling invalidate_cache(path=...) directly before opening a file.

So, if you don't want s3fs to refresh the cache entry for the file that is about to be opened, there is nothing further to do on the s3fs side. Thank you for your support here!

Best regards
