
Exception in interaction between pyarrow.filesystem.S3FSWrapper and s3fs.core.S3FileSystem #366

Closed
dmcguire81 opened this issue Sep 11, 2020 · 13 comments

Comments


dmcguire81 commented Sep 11, 2020

What happened: Using s3fs with pyarrow within petastorm throws an exception (TypeError: 'coroutine' object is not iterable) in multiple places that use pyarrow.parquet.ParquetDataset (petastorm.reader.make_reader, petastorm.reader.make_batch_reader, etc.).

What you expected to happen: No exception

Minimal Complete Verifiable Example:

import pyarrow.parquet as pq
from s3fs import S3FileSystem
from pyarrow.filesystem import S3FSWrapper

fs = S3FSWrapper(S3FileSystem())
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

# This throws the relevant exception
dataset = pq.ParquetDataset(dataset_url, filesystem=fs)
./env/lib/python3.7/site-packages/pyarrow/filesystem.py:394: RuntimeWarning: coroutine 'S3FileSystem._ls' was never awaited
  for key in list(self.fs._ls(path, refresh=refresh)):
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 1170, in __init__
    open_file_func=partial(_open_dataset_file, self._metadata)
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 1348, in _make_manifest
    metadata_nthreads=metadata_nthreads)
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 927, in __init__
    self._visit_level(0, self.dirpath, [])
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 942, in _visit_level
    _, directories, files = next(fs.walk(base_path))
  File "./env/lib/python3.7/site-packages/pyarrow/filesystem.py", line 394, in walk
    for key in list(self.fs._ls(path, refresh=refresh)):
TypeError: 'coroutine' object is not iterable
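For context (my own reading of the traceback, not something either project has confirmed): s3fs 0.5 moved its listing implementation to asyncio, so S3FileSystem._ls is a coroutine function and the synchronous entry point is ls. pyarrow's wrapper calls the underscore method directly and gets an un-awaited coroutine back instead of a list. A minimal sketch of the difference, using a placeholder bucket/prefix:

from s3fs import S3FileSystem

fs = S3FileSystem()

# Synchronous wrapper: blocks on the event loop and returns a list of keys
keys = fs.ls("our-bucket/series/of/prefixes")

# Async implementation method: returns an un-awaited coroutine, which is
# what pyarrow.filesystem.S3FSWrapper.walk then tries to iterate over
coro = fs._ls("our-bucket/series/of/prefixes")
print(type(coro))  # <class 'coroutine'> on s3fs >= 0.5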

Anything else we need to know?: I'm having trouble navigating the bug-reporting process for Apache Arrow, so I'd appreciate it if you're able to pass this on to them.

Environment:

  • Dask version: 0.5.1
  • Python version: 3.7.8
  • Operating System: Mac OS 10.15.6
  • Install method (conda, pip, source): pip install s3fs==0.5.1
@dmcguire81 (Author)

On the off chance that the relation to #365 was applied to the wrong defect, I tried the fix from that issue, but it didn't seem fruitful. Notably, aiobotocore is already at the specified version in the above setup:

$ pip freeze | grep aiobotocore
aiobotocore==1.1.1

I also tried setting AWS_DEFAULT_REGION, to no avail:

$ python repro.py
...
TypeError: 'coroutine' object is not iterable
$ AWS_DEFAULT_REGION=us-east-1 python repro.py
...
TypeError: 'coroutine' object is not iterable

@dmcguire81 (Author)

@martindurant got anything else to try?

@martindurant (Member)

This one, I think, will be fixed by using S3FileSystem directly - the wrapper should be calling ls, not _ls. If so, it is indeed for pyarrow to fix, although I suspect they don't want to use this wrapper any more anyway - in which case petastorm shouldn't be calling it.

Ideally, it should actually call find for S3, which would (as of recently, maybe only on master) list all files below a prefix, rather than making many requests for all the sub-prefixes.
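Roughly the difference, as a sketch (the bucket/prefix is a placeholder, and find listing everything in one pass may require a recent s3fs/fsspec):

from s3fs import S3FileSystem

fs = S3FileSystem()
prefix = "our-bucket/series/of/prefixes"

# walk() descends prefix by prefix, so a deep partition tree
# turns into many separate LIST requests
for root, dirs, files in fs.walk(prefix):
    pass

# find() returns every key below the prefix, which S3 can serve
# from a single (paginated) listing without a delimiter
all_keys = fs.find(prefix)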

@dmcguire81 (Author)

Thanks, will check! I'm already in talks to pass the filesystem instance as a parameter to petastorm.reader.make_reader and petastorm.reader.make_batch_reader, so that might do it for our use case.


dmcguire81 commented Sep 11, 2020

@martindurant that did solve this issue, but, unfortunately, seems to have reduced it to #365, because this test case is not making any progress on s3fs==0.5.1, either:

import pyarrow.parquet as pq
from s3fs import S3FileSystem

fs = S3FileSystem()
# This hard-codes 2 out of the 4 total partitions in the path (leaving 2 to be crawled)
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

dataset = pq.ParquetDataset(dataset_url, filesystem=fs, metadata_nthreads=100)

I'll let it run a while and let you know.

@dmcguire81 (Author)

Almost 30 minutes with no progress, so I'm not convinced it's better:

$ time python repro.py
^C
...

real	26m58.446s
user	0m1.468s
sys	0m0.384s

@martindurant (Member)

Obviously I'll have to get back to this to see what's going on. For now, is it not reasonable for you to set validate_schema=False? It seems to me that checking the schema of every data file is very wasteful, even where you can parallelise it.

@dmcguire81 (Author)

I'll see if that makes any difference, thanks! For what it's worth, petastorm seems to already work this way, so if this fixes it, we might be unblocked by just upgrading our installed s3fs.

I was trying to err on the side of hiding petastorm details, because you had said you don't have it installed and can't help debug, but if you think of anything else that might make a difference, let me know. I'm obviously not as useful in debugging the integration as its maintainer, but I'm reasonably familiar with it at this point and have an ongoing conversation with @selitvin, so I'm a reasonable proxy for getting this integration working correctly.

@dmcguire81 (Author)

That didn't seem to help:

import pyarrow.parquet as pq
from s3fs import S3FileSystem

fs = S3FileSystem()
# This hard-codes 2 out of the 4 total partitions in the path (leaving 2 to be crawled)
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=100)
$ time python repro.py
^C

real	91m59.852s
user	0m1.520s
sys	0m0.344s

@martindurant (Member)

Can you tell what files it's loading? Without validation, it should only touch one file.

@dmcguire81 (Author)

I can't sniff the AWS requests because they're encrypted at the transport and application layers. However, as I mentioned in the other ticket (#365), it's done doing meaningful work (as measured by the number of connections to S3 in the ESTABLISHED state) almost immediately, after perhaps 10-15 seconds:

Launching:

$ time python repro.py &
[1] 75641

Measuring (note that the process ID above belongs to time, so I'm using the child PID):

$ lsof -p 75642 | grep s3 | wc -l
      11
$ lsof -p 75642 | grep s3 | wc -l
      11
$ lsof -p 75642 | grep s3 | wc -l
      11
$ lsof -p 75642 | grep s3 | wc -l
      10
$ lsof -p 75642 | grep s3 | wc -l
       1
$ lsof -p 75642 | grep s3 | wc -l
       0

I have to transcribe the PID by hand, but that can't take me longer than maybe 5s; then I'm measuring for maybe 10s before the connections all disappear. This has all the classic hallmarks of a deadlock, which is why I wrote up the other ticket that way.
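An alternative to sniffing the encrypted requests is to turn on the s3fs and botocore debug loggers and watch which keys get listed or fetched; a minimal sketch, assuming the standard library logging module:

import logging

# Both libraries log their operations (including request parameters)
# at DEBUG level, which shows the keys involved without decrypting TLS.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)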


martindurant commented Sep 14, 2020 via email

@dmcguire81 (Author)

I'll follow up with petastorm to have them remove the wrapper, if pyarrow isn't going to (or doesn't need to) maintain it.
