Error when using pd.read_parquet with threading on #213

Open
mtrbean opened this issue Aug 6, 2019 · 13 comments

@mtrbean (Contributor) commented Aug 6, 2019

Since updating to the latest version, I randomly get errors like this:

Traceback (most recent call last):
  File "print_parquet_metadata.py", line 28, in <module>
    df = pd.read_parquet(args.file, use_threads=True)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pyarrow/parquet.py", line 1216, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (26292) than expected (59582)

I'm pretty sure that the file is not corrupted because sometimes it can be read successfully, and it can also be read when the file is local.

It is also very hard to reproduce consistently, but reading the file in a loop reveals a pattern:

This succeeds 30 times in a row without a problem:

f = "s3://<bucket>/<parquet_file>"
for i in range(30):
    pd.read_parquet(f, use_threads=False)

This fails sometimes on the first iteration, sometimes on the second, and has never reached iteration 5:

f = "s3://<bucket>/<parquet_file>"
for i in range(30):
    pd.read_parquet(f, use_threads=True)

I suspect it has something to do with random access from multiple threads, with the cache not handling it correctly.

Finally, here is the debug log, which I hope will help:

DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): mybucket.s3.us-west-2.amazonaws.com
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /?list-type=2&prefix=&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /?list-type=2&prefix=temp%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 18606060 - 18671596
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 18606060-23914476
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 65536
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 18176581 - 18670182
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 18176581-18606060
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 429479
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 4 - 7237571
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 4-12480451
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 12480447
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 8274755 - 18176458
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 8274755-10485764
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 2211009
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 7237648 - 8274676
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 7237648-8274755
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 1037107
Traceback (most recent call last):
  File "print_parquet_metadata.py", line 28, in <module>
    df = pd.read_parquet(args.file, use_threads=True)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pyarrow/parquet.py", line 1216, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (26292) than expected (59582)
@martindurant (Member)

Please post/link this to pyarrow too. s3fs does not do anything particular with threads: the S3File objects are independent instances, but S3FileSystem is a singleton (for given parameters) with shared state.
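
For illustration, a minimal sketch of what "singleton with shared state" versus "independent file instances" means here, assuming fsspec's instance caching (behaviour may vary by version):

import s3fs

fs_a = s3fs.S3FileSystem()
fs_b = s3fs.S3FileSystem()
assert fs_a is fs_b  # same parameters -> same cached filesystem instance, shared across threads

f1 = fs_a.open("mybucket/temp/data.parquet", "rb")
f2 = fs_b.open("mybucket/temp/data.parquet", "rb")
assert f1 is not f2  # each open() returns an independent file object with its own read cache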

@mtrbean (Contributor, Author) commented Aug 6, 2019

I don't think it's a problem with pyarrow per se, as it's reading regions that do not overlap, and reading a local file always succeeds with threading turned on or off. Since I think there is only one S3File instance, I suspect the issue is the cache not being thread safe (apologies if I misunderstood anything):

If you read the lines starting with DEBUG:fsspec in the debug log above, the parquet file is read in 5 chunks. The first chunk is 65536 bytes of metadata at 18606060 - 18671596 (at the end of the file), and then the following blocks are read in parallel when threading is turned on:

  • 4 - 7237571
  • 7237648 - 8274676
  • 8274755 - 18176458
  • 18176581 - 18670182

However s3fs.core:Fetch is pulling

  • (metadata) 18606060 - 23914476 <- beyond end of file but harmless
  • 4 - 12480451
  • 7237648 - 8274755
  • 8274755 - 10485764 <- much less than the requested range
  • 18176581 - 18606060 <- also less than the requested range, but the missing range 18606060 - 18670182 is already covered by the first metadata fetch (18606060 - 18671596)

which seems to suggest that s3fs.core:Fetch is confused about what is in the cache and what is not. The range 10485765 - 18176458 was never downloaded.
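
To illustrate the suspicion above, here is a toy sketch (not s3fs's actual code) of how a read-ahead cache with shared, unlocked state can hand back the wrong bytes when two threads interleave between the bounds check and the slice:

class ToyReadAheadCache:
    """Toy read-ahead cache: start/end/cache are shared, unlocked state."""

    def __init__(self, fetcher, blocksize):
        self.fetcher = fetcher      # fetcher(start, end) -> bytes from the remote store
        self.blocksize = blocksize
        self.start = self.end = 0
        self.cache = b""

    def read(self, start, end):
        if self.start <= start and end <= self.end:
            # Another thread may refill the cache right here, replacing
            # self.cache and self.start, so the slice below can come from
            # a block covering a completely different part of the file.
            return self.cache[start - self.start:end - self.start]
        self.cache = self.fetcher(start, end + self.blocksize)
        self.start, self.end = start, start + len(self.cache)
        return self.cache[:end - start]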

@mtrbean (Contributor, Author) commented Aug 7, 2019

I currently work around it by turning off threading in pyarrow, but I think the better fix is to turn off caching. What is the best way to turn off fsspec’s caching? @martindurant

@martindurant (Member)

Yeah, I didn't at first realise that the same file instance was being shared across threads - that's not how I normally think about things from a Dask perspective.

What is the best way to turn off fsspec’s caching?

Yes, that's the conclusion I was coming to also. You can use open(.., cache_type='none') to make a file instance and pass that instead of the URL. There is no way to change the default (except by monkey-patching).
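
A rough sketch of that workaround, reusing the placeholder path from above (whether pd.read_parquet accepts an open file directly depends on the pandas/pyarrow versions):

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
with fs.open("s3://<bucket>/<parquet_file>", "rb", cache_type="none") as f:  # no read-ahead cache
    df = pd.read_parquet(f, use_threads=True)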

@mtrbean (Contributor, Author) commented Aug 7, 2019

Should we allow setting a default cache_type on the AbstractFileSystem (as an argument to __init__)? I'm happy to submit a PR to fsspec. I'm also happy to implement a thread-safe cache.
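
Hypothetical usage under that proposal (the keyword name here is just a sketch, not an existing argument):

import s3fs

fs = s3fs.S3FileSystem(default_cache_type="none")   # hypothetical constructor argument
f = fs.open("s3://<bucket>/<parquet_file>", "rb")   # files would inherit the default cache type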

@martindurant (Member)

Both of those sound reasonable.

Note on a thread-safe cache: I would still want the file instance to be pickleable, so presumably instances reconstituted by pickle would always have empty caches (unless they happen to be in the same thread?).

@mtrbean (Contributor, Author) commented Aug 8, 2019

I decided to implement default cache_type in s3fs (PR #214).

However, while working on the cache I found that MMapCache is currently not pickleable. I guess that needs to be fixed as well?

@martindurant (Member)

Good point.

fsspec/filesystem_spec#99 does this, but with cloudpickle only. It would be good not to have to require it; you might have a better idea.

@mtrbean (Contributor, Author) commented Aug 8, 2019

If you don’t mind losing the cache contents, you can try implementing __getstate__ and __setstate__.
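
A minimal sketch of that idea on a toy stand-in class (not fsspec's real file classes), where pickling simply drops the cached bytes:

import pickle

class CachedFile:
    """Toy stand-in for a file object that holds a block cache."""

    def __init__(self, path):
        self.path = path
        self.cache = bytearray(b"cached bytes")   # stands in for MMapCache contents

    def __getstate__(self):
        state = self.__dict__.copy()
        state["cache"] = None                     # drop cache contents when pickling
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)               # reconstituted instance starts with an empty cache

restored = pickle.loads(pickle.dumps(CachedFile("mybucket/temp/data.parquet")))
assert restored.cache is None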

@martindurant (Member)

I think I'll park it there and see what people think. If you have a nice implementation, I would be happy to see it, and likewise one for a thread-safe bytes cache. I don't know when I would get to this.

@martindurant (Member)

You know, I may even back away from saying that files should be pickleable, since we do have OpenFile after all, specifically for passing around something pickleable that can be turned into a real file object in a with block.
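
For example (a sketch; exact behaviour depends on the fsspec version), the OpenFile is the lightweight, picklable handle, and the real file object only exists inside the with block:

import fsspec
import pandas as pd

of = fsspec.open("s3://<bucket>/<parquet_file>", "rb")  # OpenFile: cheap to pickle and pass around
with of as f:                                           # entering the context creates the actual file object
    df = pd.read_parquet(f)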

@maximerihouey

Hello, do you have any updates on this issue?

@martindurant (Member)

Apparently boto3 sessions are not thread-safe, but we have not come across this before, so maybe previously they accidentally were. Certainly, s3fs instances get shared, but this is not new behaviour. I do not have a good idea of how to produce thread-local boto sessions.
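
One possible shape for that, just a sketch and not something s3fs does today: keep one boto3 Session per thread via threading.local, since Session objects are not safe to share across threads.

import threading

import boto3

_local = threading.local()

def get_s3_client():
    # Lazily create one boto3 Session (and client) per thread.
    if not hasattr(_local, "session"):
        _local.session = boto3.session.Session()
    return _local.session.client("s3")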
