s3fs throws FileNotFoundError while reading an updated file through the fastparquet API #272
Comments
@martindurant Can you please look into this?
@jalpes196 if you have other systems changing the state of S3, then I'd recommend not using the caching. You can invalidate the cache with S3FileSystem.invalidate_cache.
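That suggestion can be sketched as a small helper (the function and its name are hypothetical; `fs` stands for any object exposing s3fs-style `invalidate_cache(path)` and `open(path, mode)` methods):

```python
def open_fresh(fs, path, mode="rb"):
    """Invalidate any cached listing for the path's parent directory
    before opening, so the filesystem re-checks the store rather than
    trusting a stale directory cache."""
    parent = path.rsplit("/", 1)[0] if "/" in path else path
    fs.invalidate_cache(parent)
    return fs.open(path, mode)
```

With a real `s3fs.S3FileSystem` this trades one extra listing request per open for never seeing a stale entry.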
Ah, I see the following now:
So what issue are you reporting? It seems like things are working as documented (though whether that documented behavior is ideal for your use case isn't clear).
The issue I wanted to report is that if other systems are changing the file I am trying to read (e.g., appending data), then s3fs itself should invalidate the cache. Why should the user need to check the state of the file and then invalidate the cache? Isn't that the expected behaviour of any filesystem-like library?
The point of the cache is to avoid checking with the server and re-downloading the details every time, which is reasonable behaviour in many situations. We are trying to give the user the option on cache behaviour, so that you can decide whether this is useful to you or not.
Yes, agreed. But shouldn't this be handled at least when the file state has changed? If, as a user, I choose not to invalidate the cache, then I get a FileNotFoundError, which is a bit confusing at first glance. Could the exception message be changed to tell the user about cache handling, or something else that makes this behaviour easier to detect up front? It was really difficult for me to debug at first, because S3 documents "eventually consistent" behaviour, which misled me when I first encountered the error (I thought the FileNotFoundError was caused by S3's eventual consistency).
How would s3fs know that the cache is stale? |
That is exactly my concern: how to make s3fs detect this. It could read the file's metadata (modified time) to avoid such scenarios, or else the exception handling itself could be changed when the cache is stale. Would that work?
Checking the current state of the remote listing means reloading the remote listing, so it would not be very useful as a cache. We could implement a second check for the specific case of trying to access a file we thought didn't exist.
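In the meantime, that second check can be approximated on the caller's side. A minimal sketch (the function name is hypothetical; `fs` stands for any object with s3fs-style `open` and `invalidate_cache` methods):

```python
def open_with_cache_retry(fs, path, mode="rb"):
    """If the (possibly stale) listing cache says the file doesn't
    exist, invalidate the cache once and retry before giving up."""
    try:
        return fs.open(path, mode)
    except FileNotFoundError:
        fs.invalidate_cache()
        return fs.open(path, mode)
```

This keeps the cache's speed benefit on the happy path and only pays the re-listing cost when a lookup fails.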
That would be really helpful, especially when an s3fs instance is cached to make read ops fast. In my case it is fastparquet, where I was fighting to bring the read operation down to ~100 ms. This feature would be very helpful.
You may be interested to try fsspec/filesystem_spec#216 |
Hi,
I am using s3fs to read/write S3 files with the fastparquet library. fastparquet provides a Python interface for reading/writing Parquet-format files. Please refer to the block below to understand how s3fs is used with the fastparquet library to read/write parquet files.
The function myopen provided to the constructor must be callable with f(path, mode) and produce an open file context.
If the file is updated/modified, then using the same s3fs connection object throws a FileNotFoundError.
During investigation of this issue, I found that the S3FileSystem class inherits from Python's AbstractFileSystem class. AbstractFileSystem (an abstract super-class for pythonic file-systems) uses _Cached (a metaclass for caching file system instances) as its metaclass.
It fails to detect S3 file modifications because it has already cached a file system instance for that file.
Please explain this issue: is it a design flaw, or are there extra params to be passed to enable the S3FileSystem class to detect file-level modifications?
To keep using the same s3fs connection, you can disable file system instance caching as below.