
Exception in interaction between pyarrow.filesystem.S3FSWrapper and s3fs.core.S3FileSystem #366

Closed
dmcguire81 opened this issue Sep 11, 2020 · 13 comments

Comments


dmcguire81 commented Sep 11, 2020

What happened: Using s3fs with pyarrow within petastorm throws an exception (TypeError: 'coroutine' object is not iterable) in multiple places that use pyarrow.parquet.ParquetDataset (petastorm.reader.make_reader, petastorm.reader.make_batch_reader, etc.).

What you expected to happen: No exception

Minimal Complete Verifiable Example:

import pyarrow.parquet as pq
from s3fs import S3FileSystem
from pyarrow.filesystem import S3FSWrapper

fs = S3FSWrapper(S3FileSystem())
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

# This throws the relevant exception
dataset = pq.ParquetDataset(dataset_url, filesystem=fs)
./env/lib/python3.7/site-packages/pyarrow/filesystem.py:394: RuntimeWarning: coroutine 'S3FileSystem._ls' was never awaited
  for key in list(self.fs._ls(path, refresh=refresh)):
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 1170, in __init__
    open_file_func=partial(_open_dataset_file, self._metadata)
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 1348, in _make_manifest
    metadata_nthreads=metadata_nthreads)
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 927, in __init__
    self._visit_level(0, self.dirpath, [])
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 942, in _visit_level
    _, directories, files = next(fs.walk(base_path))
  File "./env/lib/python3.7/site-packages/pyarrow/filesystem.py", line 394, in walk
    for key in list(self.fs._ls(path, refresh=refresh)):
TypeError: 'coroutine' object is not iterable
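For context (my own reading of the traceback, not something either project has confirmed): s3fs 0.5 moved its listing implementation to asyncio, so S3FileSystem._ls is a coroutine function and the synchronous entry point is ls. pyarrow's wrapper calls the underscore method directly and gets an un-awaited coroutine back instead of a list. A minimal sketch of the difference, using a placeholder bucket/prefix:

from s3fs import S3FileSystem

fs = S3FileSystem()

# Synchronous wrapper: blocks on the event loop and returns a list of keys
keys = fs.ls("our-bucket/series/of/prefixes")

# Async implementation method: returns an un-awaited coroutine, which is
# what pyarrow.filesystem.S3FSWrapper.walk then tries to iterate over
coro = fs._ls("our-bucket/series/of/prefixes")
print(type(coro))  # <class 'coroutine'> on s3fs >= 0.5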

Anything else we need to know?: I'm having trouble navigating the bug-reporting process for Apache Arrow, so I'd appreciate it if you're able to pass this on to them.

Environment:

  • Dask version: 0.5.1
  • Python version: 3.7.8
  • Operating System: Mac OS 10.15.6
  • Install method (conda, pip, source): pip install s3fs==0.5.1
@dmcguire81 (Author)

On the off chance that the relation to #365 was applied to the wrong defect, I tried the fix from that issue, but it didn't seem fruitful. Notably, aiobotocore is already at the specified version in the above setup:

$ pip freeze | grep aiobotocore
aiobotocore==1.1.1

I also tried setting AWS_DEFAULT_REGION, to no avail:

$ python repro.py
...
TypeError: 'coroutine' object is not iterable
$ AWS_DEFAULT_REGION=us-east-1 python repro.py
...
TypeError: 'coroutine' object is not iterable

@dmcguire81 (Author)

@martindurant got anything else to try?

@martindurant (Member)

This one, I think, will be fixed by using S3FileSystem directly - the wrapper should be calling ls, not _ls. If so, it is indeed for pyarrow to fix, although I suspect they don't want to use this wrapper any more anyway - in which case petastorm shouldn't be calling it.

Ideally, it should actually call find for S3, which would (as of recently, maybe only on master) list all files below a prefix, rather than making many requests for all the sub-prefixes.
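Roughly the difference, as a sketch (the bucket/prefix is a placeholder, and find listing everything in one pass may require a recent s3fs/fsspec):

from s3fs import S3FileSystem

fs = S3FileSystem()
prefix = "our-bucket/series/of/prefixes"

# walk() descends prefix by prefix, so a deep partition tree
# turns into many separate LIST requests
for root, dirs, files in fs.walk(prefix):
    pass

# find() returns every key below the prefix, which S3 can serve
# from a single (paginated) listing without a delimiter
all_keys = fs.find(prefix)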

@dmcguire81 (Author)

Thanks, will check! I'm already in talks to pass the filesystem instance as a parameter to petastorm.reader.make_reader and petastorm.reader.make_batch_reader, so that might do it for our use case.


dmcguire81 commented Sep 11, 2020

@martindurant that did solve this issue, but, unfortunately, seems to have reduced it to #365, because this test case is not making any progress on s3fs==0.5.1, either:

import pyarrow.parquet as pq
from s3fs import S3FileSystem

fs = S3FileSystem()
# This hard-codes 2 out of the 4 total partitions in the path (leaving 2 to be crawled)
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

dataset = pq.ParquetDataset(dataset_url, filesystem=fs, metadata_nthreads=100)

I'll let it run a while and let you know.

@dmcguire81 (Author)

Almost 30 minutes with no progress, so I'm not convinced it's better:

$ time python repro.py
^C
...

real	26m58.446s
user	0m1.468s
sys	0m0.384s

@martindurant (Member)

Obviously I'll have to get back to this to see what's going on. For now, is it not reasonable for you to set validate_schema=False? It seems to me that checking the schema of every data file is very wasteful, even where you can parallelise it.

@dmcguire81 (Author)

I'll see if that makes any difference, thanks! For what it's worth, petastorm seems to already work this way, so if this fixes it, we might be unblocked by just upgrading our installed s3fs.

I was trying to err on the side of hiding petastorm details, because you had said you don't have it installed and can't help debug, but if you think of anything else that might make a difference, let me know. I'm obviously not as useful in debugging the integration as its maintainer, but I'm reasonably familiar with it at this point and have an ongoing conversation with @selitvin, so I'm a reasonable proxy for getting this integration working correctly.

@dmcguire81 (Author)

That didn't seem to help:

import pyarrow.parquet as pq
from s3fs import S3FileSystem

fs = S3FileSystem()
# This hard-codes 2 out of the 4 total partitions in the path (leaving 2 to be crawled)
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=100)
$ time python repro.py
^C

real	91m59.852s
user	0m1.520s
sys	0m0.344s

@martindurant (Member)

Can you tell what files it's loading? Without validation, it should only touch one file.

@dmcguire81 (Author)

I can't sniff the AWS requests because they're encrypted at the transport and application layers. However, as I mentioned in the other ticket (#365), it's done doing meaningful work (as measured by the number of connections to S3 in the ESTABLISHED state) almost immediately, after perhaps 10-15 seconds:

Launching:

$ time python repro.py &
[1] 75641

Measuring (note that the process ID above belongs to time, so I'm using the child PID):

$ lsof -p 75642 | grep s3 | wc -l
      11
$ lsof -p 75642 | grep s3 | wc -l
      11
$ lsof -p 75642 | grep s3 | wc -l
      11
$ lsof -p 75642 | grep s3 | wc -l
      10
$ lsof -p 75642 | grep s3 | wc -l
       1
$ lsof -p 75642 | grep s3 | wc -l
       0

I have to transcribe the PID by hand, but that can't take me longer than maybe 5s; then I'm measuring for maybe 10s before the connections all disappear. This has all the classic hallmarks of a deadlock, which is why I wrote up the other ticket that way.
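An alternative to sniffing the encrypted requests is to turn on the s3fs and botocore debug loggers and watch which keys get listed or fetched; a minimal sketch, assuming the standard library logging module:

import logging

# Both libraries log their operations (including request parameters)
# at DEBUG level, which shows the keys involved without decrypting TLS.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)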


martindurant commented Sep 14, 2020 via email

@dmcguire81 (Author)

I'll follow up with petastorm to have them remove the wrapper, if pyarrow isn't going to (or doesn't need to) maintain it.
