
ARROW-16413: [Python] Certain dataset APIs hang with a python filesystem #13033

Closed

Conversation

jorisvandenbossche
Member

No description provided.

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm lidavidm left a comment (Member)

Hmm, not sure what's going on with the CI failures - they seem unrelated.

python/pyarrow/tests/test_dataset.py
CExpression c_partition_expression = partition_expression.unwrap()

with nogil:
    c_result = self.format.MakeFragment(
Member Author

I wasn't able to write a test that needs this change for make_fragment. But I suppose it also doesn't hurt to add the with nogil?

Member

If MakeFragment is self-contained and doesn't call into arbitrary IO code then it shouldn't be necessary.

Member Author

Indeed, it seems that MakeFragment just takes the input path/source and filesystem and puts those in a FileFragment; for example, it doesn't actually check the file for the schema (that is only done the first time the schema is accessed, in ReadPhysicalSchema, and that part in Cython correctly releases the GIL).

@pitrou
Member

pitrou commented May 3, 2022

Did you find out where the code was hanging previously?

pq.write_table(table, out)

# read using fsspec filesystem
import s3fs
Member

Does this test get skipped if s3fs is not installed?

Member Author

Good catch, I need to add some skips for this.
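A minimal sketch of what such a skip could look like (assuming the test only needs the optional packages at import time):

import pytest

# skip the whole test when the optional dependencies are not installed
s3fs = pytest.importorskip("s3fs")
fsspec = pytest.importorskip("fsspec")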

def _create_parquet_dataset_simple(root_path, filesystem=None):
    """
    Creates a simple (flat files, no nested partitioning) Parquet dataset
    """
    metadata_collector = []
Member

If metadata_collector is an empty list then metadata.append_row_groups is never called above? Am I reading this wrong?

Member Author

This list gets passed to (and populated by) write_to_dataset a few lines below. It's not the greatest API, but that is how it currently needs to be done.
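For context, a small sketch of that pattern (simplified; the table and paths are hypothetical): write_to_dataset appends one FileMetaData per written file to the list, which is then combined into the _metadata sidecar file.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"f1": list(range(10)), "part": ["a", "b"] * 5})

metadata_collector = []
# write_to_dataset appends the FileMetaData of each written file to the list
pq.write_to_dataset(table, "dataset_root",
                    metadata_collector=metadata_collector)
# combine the collected metadata into a single _metadata file
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=metadata_collector)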

Comment on lines 3074 to 3145
metadata_path = str(root_path / '_metadata')
metadata_path = str(root_path) + '/_metadata'
Member

This should be the same thing?

Member Author

For S3, we are not using a pathlib.Path anymore, so the / version doesn't work. Maybe I should use os.path.join instead of hardcoding the / to make it more robust, though.

Member

Ah, no, if it's an abstract path then it needs / indeed.
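A short illustration of the distinction discussed here (the paths are hypothetical):

import pathlib

# local filesystem: root_path is a pathlib.Path, so '/' builds a path object
root_path = pathlib.Path("/tmp/test_parquet_dataset")
metadata_path = str(root_path / "_metadata")

# S3 / abstract filesystems: root_path is a plain string, and abstract paths
# always use '/' as separator, so plain concatenation is the right choice
# (os.path.join would insert '\\' on Windows)
root_path = "mybucket/test_parquet_dataset"
metadata_path = root_path + "/_metadata"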

metadata_path, table = _create_parquet_dataset_simple(root_path, fs)

# read using fsspec filesystem
import s3fs
Member

Same question re: skipping.

@kszucs
Member

kszucs commented May 3, 2022

The issue was caused by the GIL being held?

@lidavidm
Member

lidavidm commented May 3, 2022

Basically: the main thread in Python would call Inspect() without releasing the GIL; the Parquet Inspect() implementation would kick off a file read in a background thread and block, waiting for the future to complete. The background thread would try to call into Python to complete the read but would get stuck acquiring the GIL since the main thread was still holding it.

Short term we can release the GIL, longer term it might be nice if we could avoid a background thread for synchronous situations like this. (Without having to duplicate all code paths.)
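To illustrate the shape of the deadlock, here is a small pure-Python sketch (not pyarrow code; the lock stands in for the GIL and all names are made up):

import threading
from concurrent.futures import Future, TimeoutError

gil = threading.Lock()

def start_background_read():
    fut = Future()

    def read():
        with gil:  # the IO thread needs the "GIL" to call into the Python filesystem
            fut.set_result(b"parquet footer bytes")

    threading.Thread(target=read, daemon=True).start()
    return fut

# Holding the lock while blocking on the future reproduces the hang:
with gil:
    try:
        start_background_read().result(timeout=1)
    except TimeoutError:
        print("deadlocked: the background thread can never acquire the lock")

# Releasing the lock before blocking (the effect of 'with nogil:') lets it finish:
print(start_background_read().result(timeout=1))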

@pytest.mark.parquet
@pytest.mark.s3
def test_file_format_inspect_fsspec(s3_filesystem):
    # https://issues.apache.org/jira/browse/ARROW-16413
Member

Can we simply use an fsspec local filesystem to avoid the S3 scaffolding?

Member Author

I was just trying exactly the same :)

Member Author

So it does work (I checked that the test hangs with a local fs as well, before applying the fix), with the caveat that I need to manually construct the PyFileSystem with FSSpecHandler, because if you pass an actual fsspec local filesystem, we internally convert it into an Arrow native local filesystem:

arrow/python/pyarrow/fs.py

Lines 109 to 113 in 7a0f00c

if isinstance(filesystem, fsspec.AbstractFileSystem):
    if type(filesystem).__name__ == 'LocalFileSystem':
        # In case its a simple LocalFileSystem, use native arrow one
        return LocalFileSystem(use_mmap=use_mmap)
    return PyFileSystem(FSSpecHandler(filesystem))
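A short sketch of that workaround (assuming fsspec is installed): wrap the fsspec local filesystem explicitly so pyarrow does not swap in its native LocalFileSystem.

import fsspec
from pyarrow import fs

fsspec_fs = fsspec.filesystem("file")
# bypass the automatic conversion by constructing the PyFileSystem ourselves
filesystem = fs.PyFileSystem(fs.FSSpecHandler(fsspec_fs))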

Member Author

That's actually also the reason that our nightly integration tests with dask didn't catch this: we were only running the main parquet tests, which don't use S3 but only a local filesystem.

Member

Hmm, actually, we can perhaps even take fsspec out of the equation and instead use PyFileSystem(ProxyHandler(LocalFileSystem()))?

Member Author

That's something I tried before, and apparently it isn't sufficient to trigger the hang (I was expecting that just needing to call into Python would be enough).

@jorisvandenbossche
Member Author

@github-actions crossbow submit test-conda-python-3.9-dask-latest test-conda-python-3.9-dask-master

@github-actions

github-actions bot commented May 3, 2022

Revision: e9c308e

Submitted crossbow builds: ursacomputing/crossbow @ actions-1997

Task                               Status
test-conda-python-3.9-dask-latest  Github Actions
test-conda-python-3.9-dask-master  Github Actions

@jorisvandenbossche
Member Author

jorisvandenbossche commented May 3, 2022

I browsed through our dataset Cython code, and I think there are no other remaining cases where there is filesystem interaction without releasing the GIL.
I was only slightly unsure about:

def to_reader(self):
    """Consume this scanner as a RecordBatchReader.

    Returns
    -------
    RecordBatchReader
    """
    cdef RecordBatchReader reader
    reader = RecordBatchReader.__new__(RecordBatchReader)
    reader.reader = GetResultValue(self.scanner.ToRecordBatchReader())
    return reader

But my understanding is that this constructor itself doesn't read anything yet; that only happens when consuming the reader (read_next_batch, which does release the GIL).

@jorisvandenbossche
Member Author

@github-actions crossbow submit test-conda-python-3.9-dask-latest test-conda-python-3.9-dask-master

@github-actions

github-actions bot commented May 3, 2022

Revision: cd24003

Submitted crossbow builds: ursacomputing/crossbow @ actions-2000

Task                               Status
test-conda-python-3.9-dask-latest  Github Actions
test-conda-python-3.9-dask-master  Github Actions

@jorisvandenbossche
Member Author

Hmm, the AppVeyor failure is actually not unrelated at the moment:

____________________ test_parquet_dataset_factory_fsspec _____________________
tempdir = WindowsPath('C:/Users/appveyor/AppData/Local/Temp/1/pytest-of-appveyor/pytest-0/test_parquet_dataset_factory_f0')
    @pytest.mark.parquet
    def test_parquet_dataset_factory_fsspec(tempdir):
        # https://issues.apache.org/jira/browse/ARROW-16413
        fsspec = pytest.importorskip("fsspec")
    
        # create dataset with pyarrow
        root_path = tempdir / "test_parquet_dataset"
        metadata_path, table = _create_parquet_dataset_simple(root_path)
    
        # read using fsspec filesystem
        fsspec_fs = fsspec.filesystem("file")
        # manually creating a PyFileSystem, because passing the local fsspec
        # filesystem would internally be converted to native LocalFileSystem
        filesystem = fs.PyFileSystem(fs.FSSpecHandler(fsspec_fs))
        dataset = ds.parquet_dataset(metadata_path, filesystem=filesystem)
        assert dataset.schema.equals(table.schema)
        assert len(dataset.files) == 4
>       result = dataset.to_table()
pyarrow\tests\test_dataset.py:3140: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow\_dataset.pyx:304: in pyarrow._dataset.Dataset.to_table
    return self.scanner(**kwargs).to_table()
pyarrow\_dataset.pyx:2549: in pyarrow._dataset.Scanner.to_table
    return pyarrow_wrap_table(GetResultValue(result))
pyarrow\error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
pyarrow\_fs.pyx:1190: in pyarrow._fs._cb_open_input_file
    stream = handler.open_input_file(frombytes(path))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pyarrow.fs.FSSpecHandler object at 0x000002CA4A397D70>
path = 'f07c005726c84dc69f13ce79116d3304-0.parquet'
    def open_input_file(self, path):
        from pyarrow import PythonFile
    
        if not self.fs.isfile(path):
>           raise FileNotFoundError(path)
E           FileNotFoundError: f07c005726c84dc69f13ce79116d3304-0.parquet
pyarrow\fs.py:400: FileNotFoundError

@jorisvandenbossche
Member Author

To get this finalized:

  • The test_parquet_dataset_factory_fsspec test is failing on Windows (AppVeyor). It seems that the file listing of the dataset uses file paths relative to the root of the dataset folder (where the _metadata file lives), so reading the files fails with a FileNotFoundError (those files live in some temporary directory, and relative paths don't resolve there).
    When I test this locally, I properly get absolute paths (also when using fsspec), and with S3 instead of a local filesystem (earlier on this PR) the Windows tests were passing.
    We are already normalizing the metadata path inside ds.parquet_dataset(..) (and inside FileFromRowGroup in C++, where we combine the paths stored in the _metadata file with the root path), so my assumption is that this is some issue with the fsspec filesystem on Windows.
    So maybe we can skip this test on Windows for now (see the sketch after this list) and open a follow-up JIRA to investigate it further (and potentially an upstream issue)? In practice users shouldn't run into this failure, as internally we translate a local fsspec filesystem to a native one.
  • The dask integration tests are still failing because of the test_s3.py tests I added (the moto server timed out). I ensured locally that those now pass, so this can probably also be handled as a follow-up.
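A minimal sketch of such a Windows skip (the exact marker placement and reason wording are assumptions):

import sys

import pytest


@pytest.mark.skipif(
    sys.platform == "win32",
    reason="fsspec filesystem yields relative dataset paths on Windows "
           "(see follow-up JIRA)")
@pytest.mark.parquet
def test_parquet_dataset_factory_fsspec(tempdir):
    ...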

@jorisvandenbossche
Member Author

Although, based on this comment (another parquet_dataset-related test, one that also uses a PyFileSystem, but wrapping our own LocalFileSystem to add some logging in the middle), it seems we have had similar issues in the past, and it might not necessarily be related to fsspec:

# FIXME(bkietz) on Windows this results in FileNotFoundErrors.
# but actually scanning does open files
# with assert_opens([f.path for f in fragments]):
# dataset.to_table()

@kszucs
Member

kszucs commented May 3, 2022

@jorisvandenbossche what is the status of this PR? Do we expect the crossbow builds to pass?

@jorisvandenbossche
Member Author

Yes, they should pass now (I reverted the changes that caused the failures above; see the second bullet point in the summary at #13033 (comment)). But I will trigger them again to be sure.

@jorisvandenbossche
Member Author

@github-actions crossbow submit test-conda-python-3.9-dask-latest test-conda-python-3.9-dask-master

@jorisvandenbossche
Member Author

And if we are fine with skipping the test on Windows for now (see #13033 (comment)), I think this is ready to go.

@github-actions

github-actions bot commented May 3, 2022

Revision: 1bca56e

Submitted crossbow builds: ursacomputing/crossbow @ actions-2001

Task                               Status
test-conda-python-3.9-dask-latest  Github Actions
test-conda-python-3.9-dask-master  Github Actions

@lidavidm
Member

lidavidm commented May 3, 2022

Should we file follow-up issues for the Windows issues? If there's already a TODO in the code, it would be good to link it to a JIRA for future investigation.

@jorisvandenbossche
Member Author

Yes, I will open follow-up JIRAs

@kszucs kszucs left a comment (Member)

Thanks Joris!

@kszucs kszucs closed this in d897716 May 3, 2022
kszucs pushed a commit that referenced this pull request May 3, 2022
Closes #13033 from jorisvandenbossche/ARROW-16413

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
@jorisvandenbossche jorisvandenbossche deleted the ARROW-16413 branch May 4, 2022 13:44
@jorisvandenbossche
Member Author

As follow-ups, I opened:

  • [ARROW-16458] [Python] Run S3 tests in the nightly dask integration build
  • [ARROW-16460] [Python] Some dataset tests using PyFileSystem are failing on Windows

@ursabot

ursabot commented May 7, 2022

Benchmark runs are scheduled for baseline = 26f2d87 and contender = d897716. d897716 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.27% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.2% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] d8977165 ec2-t3-xlarge-us-east-2
[Finished] d8977165 test-mac-arm
[Finished] d8977165 ursa-i9-9960x
[Finished] d8977165 ursa-thinkcentre-m75q
[Finished] 26f2d877 ec2-t3-xlarge-us-east-2
[Finished] 26f2d877 test-mac-arm
[Finished] 26f2d877 ursa-i9-9960x
[Finished] 26f2d877 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented May 7, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x
