ARROW-16413: [Python] Certain dataset APIs hang with a python filesystem #13033
Conversation
Hmm, not sure what's going on with the CI failures - they seem unrelated.
(force-pushed from 4e3ea4c to a0e0ad8)
python/pyarrow/_dataset.pyx (outdated)

    CExpression c_partition_expression = partition_expression.unwrap()

    with nogil:
        c_result = self.format.MakeFragment(
I wasn't able to write a test that needs this change for `make_fragment`. But I also suppose it doesn't hurt to add the `with nogil`?
If `MakeFragment` is self-contained and doesn't call into arbitrary IO code, then it shouldn't be necessary.
Indeed, it seems that `MakeFragment` just takes the input path / source and filesystem and puts that in a FileFragment, and for example doesn't actually check the file for the schema (that is only done the first time the schema is accessed, in `ReadPhysicalSchema`, and that part in Cython correctly releases the GIL).
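For reference, that behavior is visible from the Python side as well (a sketch; the file path is hypothetical, and `make_fragment` / `physical_schema` are the public counterparts of the C++ calls discussed):

```python
import pyarrow.dataset as ds
from pyarrow.fs import LocalFileSystem

fmt = ds.ParquetFileFormat()

# make_fragment only records the path, filesystem, and partition
# expression; it performs no IO at this point.
fragment = fmt.make_fragment("data/part-0.parquet", LocalFileSystem())

# The file is only opened the first time the schema is requested
# (ReadPhysicalSchema, which releases the GIL):
schema = fragment.physical_schema
```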
Did you find out where the code was hanging previously?
python/pyarrow/tests/test_dataset.py (outdated)

    pq.write_table(table, out)

    # read using fsspec filesystem
    import s3fs
Does this test get skipped if `s3fs` is not installed?
Good catch, I need to add some skips for this.
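For reference, a common way to add such a skip (a sketch; the final patch may use a different mechanism) is `pytest.importorskip`:

```python
import pytest

# Skip at collection time when s3fs is not installed,
# rather than erroring out on the import inside the test.
s3fs = pytest.importorskip("s3fs")
```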
python/pyarrow/tests/test_dataset.py (outdated)

    def _create_parquet_dataset_simple(root_path, filesystem=None):
        """
        Creates a simple (flat files, no nested partitioning) Parquet dataset
        """
        metadata_collector = []
If `metadata_collector` is an empty list, then `metadata.append_row_groups` is never called above? Am I reading this wrong?
This list gets passed to (and populated by) `write_to_dataset` a few lines below. It's not the greatest API, but that is how it currently needs to be done.
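For context, the pattern looks roughly like this (a sketch; `table` and the path are placeholders for the test's fixtures):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"f1": list(range(10)), "f2": [1.0] * 10})
root_path = "/tmp/dataset_simple"  # hypothetical path

# write_to_dataset appends one FileMetaData object per written file
# to metadata_collector as a side effect.
metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# The collected per-file metadata is then combined into a
# dataset-level _metadata file.
pq.write_metadata(table.schema, root_path + "/_metadata",
                  metadata_collector=metadata_collector)
```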
python/pyarrow/tests/test_dataset.py (outdated)

    - metadata_path = str(root_path / '_metadata')
    + metadata_path = str(root_path) + '/_metadata'
This should be the same thing?
For S3, we are not using a pathlib.Path anymore, so the `/` operator version doesn't work. I should maybe use `os.path.join` to make it more robust, though, instead of hardcoding the `/`.
Ah, no, if it's an abstract path then it needs `/` indeed.
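To illustrate the distinction (a sketch; paths are hypothetical):

```python
import pathlib
import posixpath

# Local tests use a pathlib.Path, where / is the join operator:
local_root = pathlib.Path("/tmp/dataset")
print(str(local_root / "_metadata"))         # /tmp/dataset/_metadata

# S3 tests use a plain string path, so the separator is concatenated.
# posixpath.join is a robust alternative to hardcoding '/': abstract
# filesystem paths always use forward slashes, whereas os.path.join
# would insert '\\' on Windows.
s3_root = "mybucket/dataset"
print(posixpath.join(s3_root, "_metadata"))  # mybucket/dataset/_metadata
```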
python/pyarrow/tests/test_dataset.py
Outdated
metadata_path, table = _create_parquet_dataset_simple(root_path, fs) | ||
|
||
# read using fsspec filesystem | ||
import s3fs |
Same question re: skipping.
The issue was caused by the GIL being held?
Basically: the main thread in Python would call Inspect() without releasing the GIL; the Parquet Inspect() implementation would kick off a file read on a background thread and block, waiting for the future to complete. The background thread would try to call into Python to complete the read, but would get stuck acquiring the GIL since the main thread was still holding it. Short term, we can release the GIL; longer term, it might be nice if we could avoid a background thread for synchronous situations like this (without having to duplicate all code paths).
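The deadlock pattern can be illustrated in pure Python, with an ordinary lock standing in for the GIL (a minimal sketch, not Arrow code):

```python
import threading
from concurrent.futures import Future, TimeoutError

gil = threading.Lock()  # stand-in for the GIL

def inspect():
    # Main thread: holds the "GIL" while blocking on the future.
    with gil:
        fut = Future()

        def background_read():
            # Background thread: needs the "GIL" to call back into
            # Python (e.g. a Python filesystem's read method).
            with gil:
                fut.set_result(b"file contents")

        threading.Thread(target=background_read).start()
        return fut.result(timeout=5)  # blocks; deadlock until timeout

try:
    inspect()
except TimeoutError:
    print("deadlocked: background thread never got the lock")
```

Releasing the lock before waiting (the equivalent of `with nogil` around the blocking call) breaks the cycle.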
    @pytest.mark.parquet
    @pytest.mark.s3
    def test_file_format_inspect_fsspec(s3_filesystem):
        # https://issues.apache.org/jira/browse/ARROW-16413
Can we simply use an fsspec local filesystem to avoid the S3 scaffolding?
I was just trying exactly the same :)
So it does work (I checked that the test hangs with a local fs as well, before applying the fix), with the caveat that I need to manually construct the PyFileSystem with FSSpecHandler, because if you pass an actual fsspec local filesystem, we internally convert it into an Arrow native local filesystem:
Lines 109 to 113 in 7a0f00c:

    if isinstance(filesystem, fsspec.AbstractFileSystem):
        if type(filesystem).__name__ == 'LocalFileSystem':
            # In case its a simple LocalFileSystem, use native arrow one
            return LocalFileSystem(use_mmap=use_mmap)
        return PyFileSystem(FSSpecHandler(filesystem))
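To force the Python-filesystem code path with a local filesystem anyway, the handler can be constructed by hand (a sketch; assumes `fsspec` is installed, and the parquet file path is hypothetical):

```python
import fsspec
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

# Wrap the fsspec local filesystem manually, bypassing the automatic
# conversion to Arrow's native LocalFileSystem shown above.
fs = PyFileSystem(FSSpecHandler(fsspec.filesystem("file")))

fmt = ds.ParquetFileFormat()
# Before the fix, this would hang: inspect() read the file from a
# background thread while the calling thread still held the GIL.
# schema = fmt.inspect("/tmp/data.parquet", filesystem=fs)
```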
That's actually also the reason that our nightly integration tests with dask didn't catch this: we were only running the main parquet tests, which didn't use S3 but only a local filesystem.
Hmm, actually, we can perhaps even take `fsspec` out of the equation and instead use `PyFileSystem(ProxyHandler(LocalFileSystem()))`?
That's something I tried before, and apparently that isn't sufficient to trigger it (I was expecting that just needing to call into Python would be sufficient).
@github-actions crossbow submit test-conda-python-3.9-dask-latest test-conda-python-3.9-dask-master
Revision: e9c308e
Submitted crossbow builds: ursacomputing/crossbow @ actions-1997
I browsed through our dataset Cython code, and I think there are no other remaining cases where there is filesystem interaction without releasing the GIL. One candidate is python/pyarrow/_dataset.pyx, lines 2606 to 2616 in 3c3e68c. But my understanding is that this constructor itself doesn't yet read anything, and it's only when consuming the reader that this happens (…)
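The pattern in question, roughly (a sketch; the dataset path is hypothetical):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")
reader = dataset.scanner().to_reader()

# Constructing the reader does no IO yet; the filesystem is only
# touched while the batches are actually consumed:
for batch in reader:
    pass
```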
@github-actions crossbow submit test-conda-python-3.9-dask-latest test-conda-python-3.9-dask-master
Revision: cd24003
Submitted crossbow builds: ursacomputing/crossbow @ actions-2000
(force-pushed from f4133ec to c7f2368)
Hmm, the AppVeyor failure is actually not unrelated at the moment: …
To get this finalized: …
Although based on this comment (another …: python/pyarrow/tests/test_dataset.py, lines 3261 to 3264 in 3c3e68c) …
@jorisvandenbossche what's the status of this PR? Do we expect the crossbow builds to pass?
Yes, they should pass now (because I reverted the changes that caused the failures above; see the second bullet point in the summary at #13033 (comment)). But I will trigger them again to be sure.
@github-actions crossbow submit test-conda-python-3.9-dask-latest test-conda-python-3.9-dask-master
And if we are fine with skipping the test on Windows for now (see #13033 (comment)), I think this is ready to go.
Revision: 1bca56e
Submitted crossbow builds: ursacomputing/crossbow @ actions-2001
Should we file follow-up issues for the Windows problems? If there's already a TODO in the code, it would be good to link it to a JIRA for future investigation.
Yes, I will open follow-up JIRAs.
Thanks Joris!
Closes #13033 from jorisvandenbossche/ARROW-16413
Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
As follow-ups, I opened: …
Benchmark runs are scheduled for baseline = 26f2d87 and contender = d897716. d897716 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
The ['Python', 'R'] benchmarks show a high level of regressions.