Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] ParquetDatasetPiece's path broken when using fsspec fs on Windows #26438

Closed
asfimport opened this issue Nov 2, 2020 · 2 comments
Closed

Comments

@asfimport
Copy link

Dask reported some failures starting with the pyarrow 2.0 release, and specifically on Windows: dask/dask#6754

After some investigation, it seems that this is due to the ParquetDatasetPiece its path attribute now returning a path with a mixture of \\ and / in it.

It specifically happens when dask is passing a posix-style base path pointing to the dataset base directory (so using all /), and passing an fsspec-based (local) filesystem.
From a debugging output during one of the dask tests:

(Pdb) dataset
<pyarrow.parquet.ParquetDataset object at 0x00000290D7506308>
(Pdb) dataset.paths
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0'
(Pdb) dataset.pieces[0].path
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet'

So you can see that the result here has a mix of \\ and /. Using pyarrow 1.0, this was consistently using /.

The reason for the change is that in pyarrow 2.0 we started to replace fsspec LocalFileSystem with our own LocalFileSystem (assuming for a local filesystem that should be equivalent). But it seems that our own LocalFileSystem has a pathsep} property that equals to os.path.sep, which is \\ on Windows (

@property
def pathsep(self):
return os.path.sep
.

So note that while this started being broken in pyarrow 2.0 when using fsspec filesystem, this was already "broken" before when using our own local filesystem (or when not passing any filesystem). But, 1) dask always passes an fsspec filesystem, and 2) dask uses the piece's path as dictionary key and is thus especially sensitive to the change (using it as a file path to read something in, it will probably still work even with the mixture of path separators).

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-10462. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
The actual trigger that caused this bug to surface was not a change in pyarrow 2.0 itself, but is that fssec stopped subclassing pyarrow.filesystem.FileSystem when pyarrow >= 2.0 is installed (because we now better handle fsspec filesystems, and because pyarrow.filesystem is deprecated, so this was a good change).
But the code that swaps a fsspec LocalFileSystem with our own LocalFileSystem was only reached when the passed filesystems was not a pyarrow filesystem subclass. So with pyarrow 1.0 (where fsspec-based filesystems subclassed pyarrow), this code was never run in practice .. But now with pyarrow 2.0 and the latest fsspec, we started running this.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 8539
#8539

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants