Deadlock in the interaction between pyarrow.filesystem.S3FSWrapper and s3fs.core.S3FileSystem (#365)
This is the same as #350, and has been fixed upstream in aiobotocore. I believe upgrading to aiobotocore 1.1.1 should solve it - but you can also set the AWS_DEFAULT_REGION environment variable as a workaround.
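(A minimal sketch of that workaround; the region name below is a placeholder and has to match where the bucket actually lives. The variable needs to be set before the S3 client is created.)

```python
import os

from s3fs import S3FileSystem

# Placeholder region: use the region the bucket actually lives in.
# Must be set before the underlying S3 client is constructed.
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

fs = S3FileSystem()  # region is now resolved from the environment
```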
That didn't help. Did you try to reproduce this?
I do not have petastorm... Are you saying that it reproduces without it? If so, I would suggest asking petastorm not to use the pyarrow wrappers (or indeed try to parse URLs) but just use fsspec directly. Also, you might try with fsspec/s3fs master, which changed what happens to coroutines in some situations.
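(A minimal sketch of what "use fsspec directly" might look like - the bucket path is a placeholder; this is essentially what gets tried further down in the thread.)

```python
import pyarrow.parquet as pq
from s3fs import S3FileSystem

# Hand the fsspec filesystem straight to pyarrow; no pyarrow.filesystem.S3FSWrapper
fs = S3FileSystem()
dataset = pq.ParquetDataset("s3://some-bucket/some/prefix", filesystem=fs)
```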
Alright, I'll try to work up a repro case that doesn't include petastorm.
Here's what I got, based on the versions that were (likely) installed when the error was first encountered. These are from my notes when I patched it by disabling multithreaded metadata discovery in our fork of petastorm.

**Setup**

```
pip install pyarrow==0.17.1
pip install s3fs==0.4.2
```

**Test Case**

```python
import pyarrow.parquet as pq
from s3fs import S3FileSystem
from pyarrow.filesystem import S3FSWrapper

fs = S3FSWrapper(S3FileSystem())

dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"
filters = [[('partition3', '=', 'baz')]]

# Slow, but does actually make meaningful progress
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, filters=filters, metadata_nthreads=1)

# Seems to be deadlocked, given that connections to S3 are all in `CLOSE_WAIT`, indefinitely
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, filters=filters, metadata_nthreads=10)
```

**Analysis**

How do I know it's deadlocked? Well, as mentioned, the first time I reproduced it, I let it run overnight (8 hours) to be sure it wasn't just incredibly slow. Upon further investigation, it's done with all meaningful work (according to `lsof`):

```
$ lsof -p 88248 | grep s3
python3.7 88248 dmcguire 5u IPv4 0xa1009c7791331a23 0t0 TCP 192.168.1.114:58325->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
```
This is updated to use the latest pyarrow release.

**Setup**

```
pip install pyarrow==1.0.1
pip install s3fs==0.4.2
```

**Test Case**

```python
import pyarrow.parquet as pq
from s3fs import S3FileSystem
from pyarrow.filesystem import S3FSWrapper

fs = S3FSWrapper(S3FileSystem())

dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"
filters = [[('partition3', '=', 'baz')]]

# Just go straight to multithreaded metadata discovery, just to be sure
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, filters=filters, metadata_nthreads=10)
```

**Analysis**

Again, it's done with all meaningful work (according to `lsof`):

```
$ lsof -p 88827 | grep s3
python3.7 88827 dmcguire 5u IPv4 0xa1009c77bb142ee3 0t0 TCP 192.168.1.114:58543->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
python3.7 88827 dmcguire 6u IPv4 0xa1009c77be0fca23 0t0 TCP 192.168.1.114:58551->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
python3.7 88827 dmcguire 7u IPv4 0xa1009c77bebaba23 0t0 TCP 192.168.1.114:58552->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
python3.7 88827 dmcguire 8u IPv4 0xa1009c77bec84663 0t0 TCP 192.168.1.114:58545->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
python3.7 88827 dmcguire 9u IPv4 0xa1009c772ef692a3 0t0 TCP 192.168.1.114:58544->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
python3.7 88827 dmcguire 10u IPv4 0xa1009c772ef69c83 0t0 TCP 192.168.1.114:58546->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
python3.7 88827 dmcguire 11u IPv4 0xa1009c77bebab043 0t0 TCP 192.168.1.114:58548->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
python3.7 88827 dmcguire 13u IPv4 0xa1009c77bebaa663 0t0 TCP 192.168.1.114:58547->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
python3.7 88827 dmcguire 14u IPv4 0xa1009c77beba9c83 0t0 TCP 192.168.1.114:58549->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
python3.7 88827 dmcguire 15u IPv4 0xa1009c77beba7ee3 0t0 TCP 192.168.1.114:58550->s3-1-w.amazonaws.com:https (CLOSE_WAIT)
```
@martindurant

```
$ pip install pyarrow==0.17.1
$ pip install s3fs==0.4.2
$ pip freeze | grep aiobotocore  # no hits
```

and

```
$ pip install --upgrade pyarrow==1.0.1
$ pip freeze | grep aiobotocore  # *still* no hits
```

It might be related to my defect against the later versions, which does seem to install aiobotocore.
Alright, I was able to confirm that this has to do with crawling the partition hierarchy:

```python
import pyarrow.parquet as pq
from s3fs import S3FileSystem
from pyarrow.filesystem import S3FSWrapper

fs = S3FSWrapper(S3FileSystem())

dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar/partition3=baz"

# This completes as soon as all the `ESTABLISHED` connections turn to `CLOSE_WAIT`
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, metadata_nthreads=10)
```
Sorry about the aiobotocore red herring - that seemed likely to cause deadlocking, as it is async stuff. Quick questions:
I don't think that will work, because...

Yes, it's definitely affected by this. Everything single-threaded seems to work (if very, very slowly). I can watch the effective progress, as I mentioned, by looking for `ESTABLISHED` connections turning to `CLOSE_WAIT`. Also, the above...
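(For reference, a small sketch of watching those connection states programmatically, assuming `psutil` is installed; the PID is a placeholder for the hung interpreter's process ID. It tallies the same states the `lsof` output above shows.)

```python
import time
from collections import Counter

import psutil

PID = 88827  # placeholder: PID of the hung Python process

proc = psutil.Process(PID)
while True:
    # Count connection states (ESTABLISHED, CLOSE_WAIT, ...) for the process
    states = Counter(conn.status for conn in proc.connections(kind="inet"))
    print(dict(states))
    time.sleep(5)
```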
**Summary**

Crawling hierarchical partitions (depth >= 2) with `metadata_nthreads` > 1 appears to deadlock.

**Test Cases**

*Multithreaded*

```python
import pyarrow.parquet as pq
from s3fs import S3FileSystem
from pyarrow.filesystem import S3FSWrapper

fs = S3FSWrapper(S3FileSystem())

# This hard-codes 2 out of the 4 total partitions in the path (leaving 2 to be crawled)
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

dataset = pq.ParquetDataset(dataset_url, filesystem=fs, metadata_nthreads=100)
```

Ran for an hour trying to crawl ~41K objects, unsuccessfully.

*Single-threaded*

```python
import pyarrow.parquet as pq
from s3fs import S3FileSystem
from pyarrow.filesystem import S3FSWrapper

fs = S3FSWrapper(S3FileSystem())

# This hard-codes 2 out of the 4 total partitions in the path (leaving 2 to be crawled)
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

dataset = pq.ParquetDataset(dataset_url, filesystem=fs, metadata_nthreads=1)
```

Appears to still be making meaningful progress (will update the results after, or time out at an hour).

**Reference Stats**

Count of objects:

```
$ aws s3 ls --recursive s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar | wc -l
41004
```

Baseline time to crawl:

```
$ time aws s3 ls --recursive s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar
...
real 0m20.734s
user 0m7.292s
sys 0m0.325s
```

**Diagnostics**

When killing the hung repro test case, here's the stack trace of where it's waiting:

```
^CTraceback (most recent call last):
File "repro.py", line 9, in <module>
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, metadata_nthreads=100)
File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 1170, in __init__
open_file_func=partial(_open_dataset_file, self._metadata)
File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 1348, in _make_manifest
metadata_nthreads=metadata_nthreads)
File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 927, in __init__
self._visit_level(0, self.dirpath, [])
File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 968, in _visit_level
self._visit_directories(level, filtered_directories, part_keys)
File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 998, in _visit_directories
futures.wait(futures_list)
File "./.pyenv/versions/3.7.8/lib/python3.7/concurrent/futures/_base.py", line 301, in wait
waiter.event.wait(timeout)
File "./.pyenv/versions/3.7.8/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "./.pyenv/versions/3.7.8/lib/python3.7/threading.py", line 296, in wait
waiter.acquire()
```

That corresponds to this line in the 1.0.x maintenance branch of the pyarrow source.
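(For what it's worth, a stripped-down sketch of why that `futures.wait` can deadlock, independent of pyarrow and s3fs; the tree and pool size are made up. Each worker thread visits a directory and then blocks waiting on futures for its subdirectories, so once every pool thread is parked in `wait()` there is nobody left to run the queued child tasks.)

```python
from concurrent import futures

# Toy directory tree; non-leaf nodes: root, a, b
TREE = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"], "a1": [], "b1": []}

# Two workers is one short of what this tree needs; with max_workers=3 it completes.
pool = futures.ThreadPoolExecutor(max_workers=2)

def visit(node):
    # Submit one task per child directory, then block until they all finish --
    # the same submit-then-wait pattern the stack trace above shows in _visit_directories()
    futs = [pool.submit(visit, child) for child in TREE[node]]
    futures.wait(futs)

visit("root")  # main thread waits on "a" and "b"; both workers then wait on
               # "a1"/"b1", which can never be scheduled -> deadlock
print("done")  # never reached with max_workers=2
```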
You were right that they are compatible, but that configuration exhibits the same symptoms as the wrapped filesystem, above, where the number of connections in `CLOSE_WAIT` only grows, indefinitely.

```python
import pyarrow.parquet as pq
from s3fs import S3FileSystem

fs = S3FileSystem()

# This hard-codes 2 out of the 4 total partitions in the path (leaving 2 to be crawled)
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

dataset = pq.ParquetDataset(dataset_url, filesystem=fs, metadata_nthreads=100)
```

This exhibited the same behavior:

```
$ time python repro.py
^CTraceback (most recent call last):
...
real 57m6.142s
user 0m3.385s
sys 0m1.274s
```
I wonder if this is still a problem with the async/aiobotocore-based latest version. That ought to be friendly to multithreading. I think that for the original botocore version you are using, we may simply need to say that it isn't threadsafe, and using threads will kill it above some threshold. In Dask, we no longer crawl all the files in a dataset by default, partly because of this kind of problem (such crawling would not be done in parallel on workers). In the absence of a global...

By the way, tests with the new async version show up to 100x speedup for concurrent small reads from many (>1000) files on S3. Of course, pyarrow isn't plumbed to use that.
I have issues using the latest version. See #366.
@dmcguire81 see https://arrow.apache.org/community/ for several avenues to reach out / open issues. But, I think it is fine to first discuss further here, to try to diagnose the issue, before opening another issue on the Arrow side. Using the... (but indeed, petastorm should not use this S3FSWrapper directly).

So it hangs when discovering all the directories (to build up the partition structure) and when doing that with a threadpool. It seems that that code only calls the `walk` method of the filesystem...
This is probably the way to go, agreed. I don't think I have the time this week.
@jorisvandenbossche and @martindurant thanks for the feedback. I'll see if I can follow well enough to create the reproducer.
Pulling in @selitvin to help investigate overlap with petastorm.
@martindurant good news (for you): I have a repro test case that is 100% pyarrow. @jorisvandenbossche how should I follow up with this, based on the community page you linked?

```python
import pyarrow.parquet as pq
import pyarrow.filesystem as fs


class LoggingLocalFileSystem(fs.LocalFileSystem):
    def walk(self, path):
        print(path)
        return super().walk(path)


fs = LoggingLocalFileSystem()
dataset_url = "dataset"

# Viewing the File System *directories* as a tree, one thread is required for every non-leaf node,
# in order to avoid deadlock:
# 1) dataset
# 2) dataset/foo=1
# 3) dataset/foo=1/bar=2
# 4) dataset/foo=1/bar=2/baz=0
# 5) dataset/foo=1/bar=2/baz=1
# 6) dataset/foo=1/bar=2/baz=2
# *) dataset/foo=1/bar=2/baz=0/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=true
# *) dataset/foo=1/bar=2/baz=0/qux=true
# *) dataset/foo=1/bar=2/baz=2/qux=false
# *) dataset/foo=1/bar=2/baz=2/qux=true

# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))

# This hangs indefinitely
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))
```

```
$ python repro.py
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
dataset/foo=1/bar=2/baz=0/qux=false
dataset/foo=1/bar=2/baz=0/qux=true
dataset/foo=1/bar=2/baz=1/qux=false
dataset/foo=1/bar=2/baz=1/qux=true
dataset/foo=1/bar=2/baz=2/qux=false
dataset/foo=1/bar=2/baz=2/qux=true
6
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
^C
...
KeyboardInterrupt
^C
...
KeyboardInterrupt
```

NOTE: this also happens with the un-decorated `LocalFileSystem`.
Reported as ARROW-10029.
@martindurant thanks for the help and sorry for the confusion!
Cool, thanks for further looking into it and figuring it out @dmcguire81!
**What happened**: Some interaction between `s3fs`, `pyarrow`, and `petastorm` causes deadlock.

**What you expected to happen**: `s3fs` to be threadsafe, if `pyarrow` is using it that way.

**Minimal Complete Verifiable Example**:

**Anything else we need to know?**: If your code is not threadsafe, that would appear to be news to `pyarrow`. Also reported to Petastorm. Will be reported to PyArrow.

**Environment**:
- s3fs version: 0.4.2
- Python version: 3.7.8
- Install method: pip install s3fs==0.4.2