Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unpin pyarrow in CI #6754

Closed
TomAugspurger opened this issue Oct 20, 2020 · 13 comments · Fixed by #6772
Closed

Unpin pyarrow in CI #6754

TomAugspurger opened this issue Oct 20, 2020 · 13 comments · Fixed by #6772

Comments

@TomAugspurger
Copy link
Member

TomAugspurger commented Oct 20, 2020

We've temporarily pinned pyarrow<2 in the CI environments as there were some changes that broke our tests. We should test against pyarrow 2.


https://github.com/dask/dask/runs/1281335438?check_suite_focus=true#step:6:211

================================== FAILURES ===================================
_________________________ test_get_pyarrow_filesystem _________________________

    def test_get_pyarrow_filesystem():
        from fsspec.implementations.local import LocalFileSystem
    
        pa = pytest.importorskip("pyarrow")
    
        fs = LocalFileSystem()
>       assert isinstance(fs, pa.filesystem.FileSystem)
E       AssertionError: assert False
E        +  where False = isinstance(<fsspec.implementations.local.LocalFileSystem object at 0x00000224F12C08C8>, <class 'pyarrow.filesystem.FileSystem'>)
E        +    where <class 'pyarrow.filesystem.FileSystem'> = <module 'pyarrow.filesystem' from 'C:\\Miniconda3\\envs\\test-environment\\lib\\site-packages\\pyarrow\\filesystem.py'>.FileSystem
E        +      where <module 'pyarrow.filesystem' from 'C:\\Miniconda3\\envs\\test-environment\\lib\\site-packages\\pyarrow\\filesystem.py'> = <module 'pyarrow' from 'C:\\Miniconda3\\envs\\test-environment\\lib\\site-packages\\pyarrow\\__init__.py'>.filesystem

dask\bytes\tests\test_local.py:366: AssertionError
___________________________ test_get_pyarrow_fs_s3 ____________________________

s3 = <s3fs.core.S3FileSystem object at 0x00000224805B9BC8>

    def test_get_pyarrow_fs_s3(s3):
        pa = pytest.importorskip("pyarrow")
        fs = DaskS3FileSystem(anon=True)
>       assert isinstance(fs, pa.filesystem.FileSystem)
E       AssertionError: assert False
E        +  where False = isinstance(<s3fs.core.S3FileSystem object at 0x0000022482EA3BC8>, <class 'pyarrow.filesystem.FileSystem'>)
E        +    where <class 'pyarrow.filesystem.FileSystem'> = <module 'pyarrow.filesystem' from 'C:\\Miniconda3\\envs\\test-environment\\lib\\site-packages\\pyarrow\\filesystem.py'>.FileSystem
E        +      where <module 'pyarrow.filesystem' from 'C:\\Miniconda3\\envs\\test-environment\\lib\\site-packages\\pyarrow\\filesystem.py'> = <module 'pyarrow' from 'C:\\Miniconda3\\envs\\test-environment\\lib\\site-packages\\pyarrow\\__init__.py'>.filesystem

Some additional failures like test_append_with_partition[pyarrow] further down I think.

This is perhaps related to pyarrow 2.0 being released yesterday?

@jrbourbeau
Copy link
Member

cc @dask/io for visibility

@rjzamora
Copy link
Member

I'll start looking into this - Seems that a new pyarrow package was released.

@TomAugspurger TomAugspurger changed the title Windows CI Failures Unpin pyarrow in CI Oct 27, 2020
@rjzamora
Copy link
Member

Just starting to look into this now - Sorry for the delay. It looks like pyarrow-2.0 deprecated the pyarrow.FileSystem API, so we can no longer check that the fsspec file-system object is an instance of pyarrow.FileSystem (which is done in several tests). Beyond that, it seems that pyarrow's new file-system API is causing issues (but haven't dug into the specific reasons yet) - Perhaps @martindurant and/or @jorisvandenbossche already know what the problem(s) may be.

@jorisvandenbossche
Copy link
Member

The two tests shown above can be updated. It's actually a change in fsspec (but triggered by pyarrow 2.0 release) that fsspec will no longer inherit from the indeed deprecated python pyarrow.filesystem objects (see fsspec/filesystem_spec#411). In pyarrow 2.0, all relevant code paths should now also accept plain fsspec filesystems, and thus subclassing from the pyarrow.filesystem is no longer needed.

@rjzamora
Copy link
Member

The two tests shown above can be updated. It's actually a change in fsspec (but triggered by pyarrow 2.0 release) that fsspec will no longer inherit from the indeed deprecated python pyarrow.filesystem objects (see fsspec/filesystem_spec#411)

Agreed - The failure in those particular tests makes perfect sense. Unfortunately, there are 22 tests failing, and some of them seem to go beyond type checking (from a high-level glance).

@jorisvandenbossche
Copy link
Member

The other failures in the parquet tests, I cannot reproduce locally (I have some others though, related to categorical dtype). We also run the dask tests nightly on arrow's CI, where we don't see any failures

@rjzamora
Copy link
Member

The other failures in the parquet tests, I cannot reproduce locally (I have some others though, related to categorical dtype). We also run the dask tests nightly on arrow's CI, where we don't see any failures

Thanks - That is good to know. I also cannot reproduce the failures locally (they may be Windows specific)

@rjzamora rjzamora mentioned this issue Oct 27, 2020
2 tasks
@jorisvandenbossche
Copy link
Member

Ah, I missed the fact that they're on windows. Yes, it's certainly likely that it is windows specific. We had regularly path issues on windows, it might be we missed some

@jorisvandenbossche
Copy link
Member

I could reproduce the test failures on Windows, and diagnose the issue: https://issues.apache.org/jira/browse/ARROW-10462

See the arrow issue for the longer story, but in short: because of switching fsspec's LocalFileSystem (passed by dask to ParquetDataset) with our own LocalFileSystem, we were generating ParquetDatasetPiece.path with a mixture of \\ and /. And dask is using that exact path as a key in a dictionary here:

full_path = (
fs.sep.join([data_path, filename])
if filename != ""
else data_path # This is a single file
)
pkeys = partition_keys.get(full_path, None)

When creating the full_path above, it was using fs.sep, which is / even on Windows when using fsspec.
But somewhere in pyarrow, we started to use \\ to when creating the piece's path. Example output from debugging session from one of the dask parquet tests:

(Pdb) dataset
<pyarrow.parquet.ParquetDataset object at 0x00000290D7506308>
(Pdb) dataset.paths
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0'
(Pdb) dataset.pieces[0].path
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet'

So you see the mixture of / and \\ here, while dask was expecting a path with all /.

We can fix that on the pyarrow side (and hopefully put that in a 2.0.1 release).

@martindurant
Copy link
Member

(fsspec does have a function in the LocalFileSystem module for making paths posixy)

@jorisvandenbossche
Copy link
Member

Yep, and that's what dask is using to pass a base path in posix style to pyarrow. That's also the reason the base style in the example output above is still posix style, but so we have a "bug" when joining the additional file path (the bug is already present in older versions of pyarrow, but only surfaced now in dask due to a change in how we handled fsspec filesystems)

@rjzamora
Copy link
Member

rjzamora commented Nov 2, 2020

We can fix that on the pyarrow side (and hopefully put that in a 2.0.1 release).

Thank you for looking into this @jorisvandenbossche ! Getting the fix into 2.0.1 would be great :)

@jorisvandenbossche
Copy link
Member

So the reason that the bug surfaced now is because the new fsspec filesystems no longer inherit from pyarrow.filesystem.FileSystem, and because of that we take a different code path where we are swapping fsspec's LocalFileSystem with our own LocalFileSystem (which has a different path separator on Windows). While before (when fsspec subclassed pyarrow), we didn't actually take this code path ..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants