Exception in interaction between `pyarrow.filesystem.S3FSWrapper` and `s3fs.core.S3FileSystem` #366
Comments
On the off chance that the relatedness was applied to the wrong defect, I tried the fix from #365, but it didn't seem fruitful. Notably:

```sh
$ pip freeze | grep aiobotocore
aiobotocore==1.1.1
```

I also tried setting `AWS_DEFAULT_REGION`:

```sh
$ python repro.py
...
TypeError: 'coroutine' object is not iterable
$ AWS_DEFAULT_REGION=us-east-1 python repro.py
...
TypeError: 'coroutine' object is not iterable
```
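For context on the error itself: `TypeError: 'coroutine' object is not iterable` is what CPython raises when synchronous code tries to iterate an un-awaited coroutine object, which is consistent with a sync wrapper calling into s3fs's async internals. A minimal, self-contained sketch (the `list_keys` coroutine is hypothetical, standing in for an async filesystem call):

```python
async def list_keys():
    # Hypothetical stand-in for an async s3fs listing call.
    return ["key-a", "key-b"]


def sync_caller():
    # Calling the coroutine function does not run it; it returns a
    # coroutine object, and iterating that object raises the same
    # TypeError seen in the tracebacks above.
    coro = list_keys()
    try:
        for _ in coro:
            pass
    except TypeError as exc:
        return str(exc)
    finally:
        coro.close()  # silence the "never awaited" RuntimeWarning


print(sync_caller())  # 'coroutine' object is not iterable
```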
@martindurant got anything else to try?
This one I think will be fixed by using `S3FileSystem` directly - the wrapper should not be calling […]. Ideally, it should actually call […].
Thanks, will check! I'm already in talks to pass the filesystem instance as a parameter to […].
@martindurant that did solve this issue, but, unfortunately, seems to have reduced it to #365, because this test case is not making any progress:

```python
import pyarrow.parquet as pq
from s3fs import S3FileSystem

fs = S3FileSystem()
# This hard-codes 2 out of the 4 total partitions in the path (leaving 2 to be crawled)
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, metadata_nthreads=100)
```

I'll let it run a while and let you know.
Almost 30 minutes with no progress, so I'm not convinced it's better:

```sh
$ time python repro.py
^C
...

real	26m58.446s
user	0m1.468s
sys	0m0.384s
```
Obviously I'll have to get back to this to see what's going on. For now, is it not reasonable to have […]?
I'll see if that makes any difference, thanks! For what it's worth, petastorm seems to already work this way, so if this fixes it, we might be unblocked by just upgrading our installed […]. I was trying to err on the side of hiding details about […].
That didn't seem to help:

```python
import pyarrow.parquet as pq
from s3fs import S3FileSystem

fs = S3FileSystem()
# This hard-codes 2 out of the 4 total partitions in the path (leaving 2 to be crawled)
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=100)
```

```sh
$ time python repro.py
^C

real	91m59.852s
user	0m1.520s
sys	0m0.344s
```
Can you tell what files it's loading? Without verify, it should only touch one file.
I can't sniff the AWS requests because they're encrypted at the transport and application layers. However, as I mentioned in the other ticket (#365), it's done doing meaningful work (as measured by the number of connections to S3 in `ESTABLISHED` state) almost immediately, after, perhaps, 10-15 seconds:

Launching:

```sh
$ time python repro.py &
[1] 75641
```

Measuring (note the Process ID above is for `time`, so using the child PID):

```sh
$ lsof -p 75642 | grep s3 | wc -l
11
$ lsof -p 75642 | grep s3 | wc -l
11
$ lsof -p 75642 | grep s3 | wc -l
11
$ lsof -p 75642 | grep s3 | wc -l
10
$ lsof -p 75642 | grep s3 | wc -l
1
$ lsof -p 75642 | grep s3 | wc -l
0
```

I have to transcribe the PID, but it couldn't take me longer than maybe 5s to do that, then I'm measuring for maybe 10s before the connections all disappear. This has all the classic hallmarks of deadlock, which is why I wrote up the other ticket that way.
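The PID transcription step can also be scripted away; a sketch, assuming the repro is launched directly (no `time` wrapper) so the shell's `$!` already holds the Python process's PID:

```shell
# Launch the repro directly; $! is then the Python PID, not time's.
python repro.py &
pid=$!

# Print the count of S3 connections every 2 seconds until the process exits.
while kill -0 "$pid" 2>/dev/null; do
    lsof -p "$pid" 2>/dev/null | grep -c s3
    sleep 2
done
```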
You can turn on logging in s3fs to see what the calls are
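For reference, a minimal way to do that with the standard `logging` module (assuming, as is typical, that s3fs emits its call-level records on a logger named `"s3fs"`):

```python
import logging

# Send all records to stderr with timestamps and logger names.
logging.basicConfig(
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Raise the s3fs logger to DEBUG so each S3 call it makes is printed.
logging.getLogger("s3fs").setLevel(logging.DEBUG)
```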
I'll follow up with […].
What happened: Using `s3fs` with `pyarrow` within `petastorm` throws an exception (`TypeError: 'coroutine' object is not iterable`) in multiple places that use `pyarrow.parquet.ParquetDataset` (`petastorm.reader.make_reader`, `petastorm.reader.make_batch_reader`, etc.).

What you expected to happen: No exception
Minimal Complete Verifiable Example:
Anything else we need to know?: I'm having trouble navigating the bug-reporting process for Apache Arrow; I'd appreciate it if you're able to pass this on to them.
Environment:
- s3fs version: 0.5.1
- Python version: 3.7.8
- Install method: pip install s3fs==0.5.1