[Python] The new Dataset API will not work with files on Azure Blob #25582
Comments
Joris Van den Bossche / @jorisvandenbossche: When using the Dataset API, Azure is not yet supported natively (see ARROW-2034, ARROW-9611), but in theory it should indeed be supported through the fsspec wrapper. However, it seems you ran into a bug (we don't test the fsspec integration with Azure, only some basic tests with local filesystems and S3). It might also be a bug in the fsspec implementation, though, because the fsspec docs indicate that the …
Martin Durant / @martindurant: …
Lance Dacey / @ldacey: …
Martin Durant / @martindurant: …
Lance Dacey / @ldacey:
read_table() on Azure Blob also worked on pyarrow 0.17.1, so when I check the differences between 0.17.1 and 1.0.0 I can see that this file was changed and now has some references to fsspec: https://github.com/apache/arrow/blob/apache-arrow-1.0.0/python/pyarrow/fs.py
I see a reference to detail=True within the get_file_info_selector method inside class FSSpecHandler(FileSystemHandler):

```python
selected_files = self.fs.find(
    selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True)
```
These classes do not exist within 0.17.1 at all: https://github.com/apache/arrow/blob/apache-arrow-0.17.1/python/pyarrow/fs.py
So it looks like the detail kwarg is popped in the fsspec find function that pyarrow references, but a detail=True is specified in the call to self.walk. Is this the issue, perhaps?

```python
def find(self, path, maxdepth=None, withdirs=False, **kwargs):
    ...
```
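To make the shape difference this comment is pointing at concrete, here is a small illustration (not the reporter's code) using fsspec's in-memory filesystem as a stand-in for Azure Blob: find() normally returns a flat list of paths, while detail=True makes it return a dict mapping each path to its info dict, which is the shape FSSpecHandler relies on.

```python
import fsspec

# In-memory fsspec filesystem standing in for Azure Blob.
fs = fsspec.filesystem("memory")
fs.pipe_file("/data/part-0.parquet", b"dummy")

# Default: a flat, sorted list of file paths.
paths = fs.find("/data")

# detail=True: a dict mapping each path to its info dict
# (name, type, size, ...), which is what FSSpecHandler expects.
infos = fs.find("/data", withdirs=True, detail=True)

print(type(paths), type(infos))
```

If an fsspec backend swallowed or mishandled the detail kwarg, pyarrow's wrapper would receive a list where it expects a dict, which matches the kind of failure described above.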
Martin Durant / @martindurant: …
Joris Van den Bossche / @jorisvandenbossche: …
Joris Van den Bossche / @jorisvandenbossche: …
Joris Van den Bossche / @jorisvandenbossche: …
Lance Dacey / @ldacey: …
Lance Dacey / @ldacey: …
Joris Van den Bossche / @jorisvandenbossche: …
I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False), and my connection to Azure Blob fails.
I know the documentation says only HDFS and S3 are implemented, but I have been using Azure Blob by passing an fsspec filesystem when reading and writing parquet files/datasets with pyarrow (with use_legacy_dataset=True). Dask also works with storage_options.
I am hoping that Azure Blob will be supported because I'd really like to try out the new row filtering and non-hive partitioning schemes.
This is what I use for the filesystem when using read_table() or write_to_dataset():
It seems like the class _ParquetDatasetV2 has a section that the original ParquetDataset does not have. Perhaps this is why the fsspec filesystem fails when I turn off the legacy dataset?
Line 1423 in arrow/python/pyarrow/parquet.py:

```python
if filesystem is not None:
    filesystem = pyarrow.fs._ensure_filesystem(
        filesystem, use_mmap=memory_map)
```
EDIT: I got this to work using fsspec on single files on Azure Blob:
When I try to use this on a partitioned dataset I made using write_to_dataset, however, I run into an error. I tried the same code as above, and also with the partitioning='hive' option.
Environment: Ubuntu 18.04
Reporter: Lance Dacey / @ldacey
Note: This issue was originally created as ARROW-9514. Please see the migration documentation for further details.