
s3fs throws FileNotFound Exception while reading updated file through fastparquet api #272

jalpes196 opened this issue Dec 10, 2019 · 11 comments

@jalpes196
Hi,
I am using s3fs to read/write S3 files with the fastparquet library, which provides a Python interface for reading and writing Parquet-format files. The block below shows how s3fs is used with fastparquet to read a parquet file.

import s3fs      
from fastparquet import ParquetFile
s3 = s3fs.S3FileSystem()
myopen = s3.open
pf = ParquetFile('/mybucket/data.parquet', open_with=myopen)
df = pf.to_pandas()

The function myopen provided to the constructor must be callable with f(path, mode) and produce an open file context.

If the file is then updated/modified, reusing the same s3fs connection object throws a FileNotFoundError:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/fastparquet/api.py", line 110, in __init__
    with open_with(fn2, 'rb') as f:
  File "/usr/local/lib/python3.6/dist-packages/fsspec/spec.py", line 689, in open
    autocommit=ac, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 314, in _open
    autocommit=autocommit)
  File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 939, in __init__
    cache_type=cache_type)
  File "/usr/local/lib/python3.6/dist-packages/fsspec/spec.py", line 884, in __init__
    self.details = fs.info(path)
  File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 498, in info
    return super().info(path)
  File "/usr/local/lib/python3.6/dist-packages/fsspec/spec.py", line 509, in info
    raise FileNotFoundError(path)
FileNotFoundError: mybucket/data.parquet/_metadata

While investigating this issue, I found that the S3FileSystem class inherits from fsspec's AbstractFileSystem class (an abstract super-class for pythonic file-systems), whose metaclass is _Cached (a metaclass for caching file system instances).
It fails to detect S3 file modifications because the file system instance, along with its cached listings for that file, has already been cached.
Please explain this issue: is it a design flaw, or are there extra params to be passed to enable the S3FileSystem class to detect file-level modifications?
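
To make the cached-instance behaviour concrete, here is a minimal demonstration of what fsspec's _Cached metaclass does (my own illustration, not code from the issue):

import s3fs

# _Cached keys instances by their constructor arguments, so a second
# "new" filesystem is really the same cached object, including any
# directory listings it has already memorised
a = s3fs.S3FileSystem()
b = s3fs.S3FileSystem()
print(a is b)  # True: same instance, same (possibly stale) listings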

To use the same s3fs connection, you can disable file instance caching as below.

import s3fs
from fastparquet import ParquetFile

# disable fsspec's instance caching so each S3FileSystem() is a fresh
# object with an empty listings cache
s3fs.S3FileSystem.cachable = False
s3 = s3fs.S3FileSystem()
myopen = s3.open
pf = ParquetFile('/mybucket/data.parquet', open_with=myopen)
df = pf.to_pandas()
@jalpes196 (Author)

@martindurant Can you please take a look at this?

@TomAugspurger (Contributor)

@jalpes196 if you have other systems changing the state of S3, then I'd recommend not using the caching. You can invalidate the cache with S3FileSystem.invalidate_cache.
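
For example, a minimal sketch of that suggestion, reusing the example path from this issue:

import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem()

# drop the cached listings for this path so the next info()/open()
# call goes back to S3 instead of the stale in-memory listing
s3.invalidate_cache('mybucket/data.parquet')

pf = ParquetFile('/mybucket/data.parquet', open_with=s3.open)
df = pf.to_pandas()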

@TomAugspurger (Contributor) commented Dec 16, 2019

Ah, I see the following now:

> To use the same s3fs connection, you can disable file instance caching as below.

So what issue are you reporting? It seems like things are working as documented (though whether that documented behavior is ideal for your use case isn't clear).

@jalpes196 (Author)

The issue I wanted to report is this: if other systems are changing the file I am trying to read (e.g. appending data), then s3fs itself should invalidate the cache. Why should the user have to check the state of the file and then invalidate the cache? Isn't that the expected behaviour of any filesystem-like library?

@martindurant (Member)

The point of the cache is to avoid checking with the server and re-downloading the details every time, which is reasonable behaviour in many situations. We are trying to give the user options for cache behaviour, so that you can decide whether this is useful to you or not.
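
For completeness, newer fsspec releases also expose per-instance knobs (assuming your installed fsspec version includes them) that give finer control than the class-level cachable switch above:

import s3fs

# skip_instance_cache: don't reuse a previously constructed instance,
# so this object starts out with empty caches
s3_fresh = s3fs.S3FileSystem(skip_instance_cache=True)

# use_listings_cache=False: keep instance reuse, but never cache
# directory listings, so every ls()/info() call goes to S3
s3_uncached = s3fs.S3FileSystem(use_listings_cache=False)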

@jalpes196 (Author) commented Dec 16, 2019

> The point of the cache is to avoid checking with the server and re-downloading the details every time, which is reasonable behaviour in many situations. We are trying to give the user options for cache behaviour, so that you can decide whether this is useful to you or not.

Yes, agreed. But shouldn't it at least be handled when the file state changes? If, as a user, I choose not to invalidate the cache, then I get a FileNotFoundError, which is a bit confusing at first glance. Shouldn't the exception message at least tell the user about cache handling, or otherwise help detect this behaviour up front?

It was really difficult to debug in the first place, because S3 is documented as "eventually consistent", which was misleading when I first hit the error (I assumed the FileNotFoundError was caused by S3's eventual consistency).

@TomAugspurger (Contributor)

> But shouldn't it at least be handled when the file state changes?

How would s3fs know that the cache is stale?

@jalpes196 (Author)

> But shouldn't it at least be handled when the file state changes?

> How would s3fs know that the cache is stale?

That is exactly my concern: how do we make s3fs understand this? It could read the file's metadata (modified time) to avoid such scenarios, or else raise a clearer exception when the cache is stale. Would that work?
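
For what it's worth, a hand-rolled version of the modified-time check being suggested might look like this (illustrative only; the 'LastModified' key in the info() output is an assumption that can vary across s3fs versions):

import s3fs

s3 = s3fs.S3FileSystem()
path = 'mybucket/data.parquet/_metadata'  # example path from the traceback

before = s3.info(path)['LastModified']  # possibly served from the cache
s3.invalidate_cache()                   # drop all cached listings
after = s3.info(path)['LastModified']   # forces a real request to S3
if before != after:
    print('file changed since it was cached')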

@martindurant (Member)

Checking the current state of the remote listing means reloading the remote listing, so it would not be very useful as a cache. We could implement a second check for the specific case of trying to access a file we thought didn't exist.
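
If such a second check were implemented, it would behave roughly like this wrapper (a sketch only; open_with_retry is a hypothetical helper, not part of s3fs):

def open_with_retry(fs, path, mode='rb'):
    # if the (possibly stale) cached listing says the file doesn't
    # exist, invalidate the parent's listing and retry once against S3
    try:
        return fs.open(path, mode)
    except FileNotFoundError:
        fs.invalidate_cache(path.rsplit('/', 1)[0])
        return fs.open(path, mode)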

@jalpes196 (Author)

> Checking the current state of the remote listing means reloading the remote listing, so it would not be very useful as a cache. We could implement a second check for the specific case of trying to access a file we thought didn't exist.

That would be really helpful, especially when s3fs's caching is being relied on to keep read ops fast. In my case it is fastparquet, where I was fighting to bring the read operation down to ~100 ms. This feature would be very helpful.

@martindurant (Member)

You may be interested to try fsspec/filesystem_spec#216
