
s3fs throws FileNotFound Exception while reading updated file through fastparquet api #272

jalpes196 opened this issue Dec 10, 2019 · 11 comments

@jalpes196
Hi,
I am using s3fs to read/write S3 files with the fastparquet library, which provides a Python interface for reading and writing Parquet-format files. The block below shows how s3fs is used with fastparquet to read a parquet file.

import s3fs      
from fastparquet import ParquetFile
s3 = s3fs.S3FileSystem()
myopen = s3.open
pf = ParquetFile('/mybucket/data.parquet', open_with=myopen)
df = pf.to_pandas()

The function myopen provided to the constructor must be callable with f(path, mode) and produce an open file context.

If the file is then updated/modified, reusing the same s3fs connection object throws a FileNotFoundError:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/fastparquet/api.py", line 110, in __init__
    with open_with(fn2, 'rb') as f:
  File "/usr/local/lib/python3.6/dist-packages/fsspec/spec.py", line 689, in open
    autocommit=ac, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 314, in _open
    autocommit=autocommit)
  File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 939, in __init__
    cache_type=cache_type)
  File "/usr/local/lib/python3.6/dist-packages/fsspec/spec.py", line 884, in __init__
    self.details = fs.info(path)
  File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 498, in info
    return super().info(path)
  File "/usr/local/lib/python3.6/dist-packages/fsspec/spec.py", line 509, in info
    raise FileNotFoundError(path)
FileNotFoundError: mybucket/data.parquet/_metadata

While investigating this issue, I found that the S3FileSystem class inherits from fsspec's AbstractFileSystem class (an abstract super-class for pythonic file-systems), whose metaclass is _Cached (a metaclass for caching file system instances).
It fails to detect S3 file modifications because the file system instance, along with its cached listings for that file, has already been cached.
Please explain this issue: is it a design flaw, or are there extra params to be passed to enable the S3FileSystem class to detect file-level modifications?
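
To make the cached-instance behaviour concrete, here is a minimal demonstration of what fsspec's _Cached metaclass does (my own illustration, not code from the issue):

import s3fs

# _Cached keys instances by their constructor arguments, so a second
# "new" filesystem is really the same cached object, including any
# directory listings it has already memorised
a = s3fs.S3FileSystem()
b = s3fs.S3FileSystem()
print(a is b)  # True: same instance, same (possibly stale) listings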

To use the same s3fs connection, you can disable file instance caching as below.

import s3fs
from fastparquet import ParquetFile

# disable fsspec's instance caching so each S3FileSystem() is a fresh
# object with an empty listings cache
s3fs.S3FileSystem.cachable = False
s3 = s3fs.S3FileSystem()
myopen = s3.open
pf = ParquetFile('/mybucket/data.parquet', open_with=myopen)
df = pf.to_pandas()
@jalpes196 (Author)

@martindurant Can you please take a look at this?

@TomAugspurger (Contributor)

@jalpes196 if you have other systems changing the state of S3, then I'd recommend not using the caching. You can invalidate the cache with S3FileSystem.invalidate_cache.
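
For example, a minimal sketch of that suggestion, reusing the example path from this issue:

import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem()

# drop the cached listings for this path so the next info()/open()
# call goes back to S3 instead of the stale in-memory listing
s3.invalidate_cache('mybucket/data.parquet')

pf = ParquetFile('/mybucket/data.parquet', open_with=s3.open)
df = pf.to_pandas()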

@TomAugspurger (Contributor) commented Dec 16, 2019

Ah, I see the following now:

> To use the same s3fs connection, you can disable file instance caching as below.

So what issue are you reporting? It seems like things are working as documented (though whether that documented behavior is ideal for your use case isn't clear).

@jalpes196 (Author)

The issue I wanted to report is this: if other systems are changing the file I am trying to read (e.g. appending data), then s3fs itself should invalidate the cache. Why should the user have to check the state of the file and then invalidate the cache? Isn't that the expected behaviour of any filesystem-like library?

@martindurant (Member)

The point of the cache is to avoid checking with the server and re-downloading the details every time, which is reasonable behaviour in many situations. We are trying to give the user options for cache behaviour, so that you can decide whether this is useful to you or not.
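
For completeness, newer fsspec releases also expose per-instance knobs (assuming your installed fsspec version includes them) that give finer control than the class-level cachable switch above:

import s3fs

# skip_instance_cache: don't reuse a previously constructed instance,
# so this object starts out with empty caches
s3_fresh = s3fs.S3FileSystem(skip_instance_cache=True)

# use_listings_cache=False: keep instance reuse, but never cache
# directory listings, so every ls()/info() call goes to S3
s3_uncached = s3fs.S3FileSystem(use_listings_cache=False)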

@jalpes196 (Author) commented Dec 16, 2019

> The point of the cache is to avoid checking with the server and re-downloading the details every time, which is reasonable behaviour in many situations. We are trying to give the user options for cache behaviour, so that you can decide whether this is useful to you or not.

Yes, agreed. But shouldn't it at least be handled when the file state changes? If, as a user, I choose not to invalidate the cache, then I get a FileNotFoundError, which is a bit confusing at first glance. Shouldn't the exception message at least tell the user about cache handling, or otherwise help detect this behaviour up front?

It was really difficult to debug in the first place, because S3 is documented as "eventually consistent", which was misleading when I first hit the error (I assumed the FileNotFoundError was caused by S3's eventual consistency).

@TomAugspurger (Contributor)

> But shouldn't it at least be handled when the file state changes?

How would s3fs know that the cache is stale?

@jalpes196 (Author)

> But shouldn't it at least be handled when the file state changes?

> How would s3fs know that the cache is stale?

That is exactly my concern: how do we make s3fs understand this? It could read the file's metadata (modified time) to avoid such scenarios, or else raise a clearer exception when the cache is stale. Would that work?
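
For what it's worth, a hand-rolled version of the modified-time check being suggested might look like this (illustrative only; the 'LastModified' key in the info() output is an assumption that can vary across s3fs versions):

import s3fs

s3 = s3fs.S3FileSystem()
path = 'mybucket/data.parquet/_metadata'  # example path from the traceback

before = s3.info(path)['LastModified']  # possibly served from the cache
s3.invalidate_cache()                   # drop all cached listings
after = s3.info(path)['LastModified']   # forces a real request to S3
if before != after:
    print('file changed since it was cached')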

@martindurant (Member)

Checking the current state of the remote listing means reloading the remote listing, so it would not be very useful as a cache. We could implement a second check for the specific case of trying to access a file we thought didn't exist.
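
If such a second check were implemented, it would behave roughly like this wrapper (a sketch only; open_with_retry is a hypothetical helper, not part of s3fs):

def open_with_retry(fs, path, mode='rb'):
    # if the (possibly stale) cached listing says the file doesn't
    # exist, invalidate the parent's listing and retry once against S3
    try:
        return fs.open(path, mode)
    except FileNotFoundError:
        fs.invalidate_cache(path.rsplit('/', 1)[0])
        return fs.open(path, mode)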

@jalpes196 (Author)

> Checking the current state of the remote listing means reloading the remote listing, so it would not be very useful as a cache. We could implement a second check for the specific case of trying to access a file we thought didn't exist.

That would be really helpful, especially when s3fs's caching is being relied on to keep read ops fast. In my case it is fastparquet, where I was fighting to bring the read operation down to ~100 ms. This feature would be very helpful.

@martindurant (Member)

You may be interested to try fsspec/filesystem_spec#216
