
[Python] The new Dataset API will not work with files on Azure Blob #25582

Closed
asfimport opened this issue Jul 17, 2020 · 12 comments
I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False), and my connection to Azure Blob fails.

I know the documentation says only HDFS and S3 are implemented, but I have been using Azure Blob by passing an fsspec filesystem when reading and writing Parquet files/datasets with pyarrow (with use_legacy_dataset=True). Dask also works with storage_options.

I am hoping that Azure Blob will be supported because I'd really like to try out the new row filtering and non-hive partitioning schemes.

This is what I use for the filesystem when using read_table() or write_to_dataset():

fs = fsspec.filesystem(protocol='abfs',
                       account_name=base.login,
                       account_key=base.password)

It seems like the class _ParquetDatasetV2 has a section that the original ParquetDataset does not have. Perhaps this is why the fsspec filesystem fails when I turn off the legacy dataset?

Line 1423 in arrow/python/pyarrow/parquet.py:

if filesystem is not None:
    filesystem = pyarrow.fs._ensure_filesystem(filesystem, use_mmap=memory_map) 
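
For reference, a minimal sketch (assuming pyarrow 1.0.0, where the FSSpecHandler class exists) of roughly what that coercion amounts to for an fsspec filesystem — it gets wrapped into a pyarrow-native FileSystem; login and password here are placeholders from this report:

import fsspec
from pyarrow.fs import PyFileSystem, FSSpecHandler

# the fsspec filesystem for Azure Blob
abfs = fsspec.filesystem('abfs', account_name=login, account_key=password)

# wrap it so the pyarrow Dataset machinery can call into it
wrapped = PyFileSystem(FSSpecHandler(abfs))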

EDIT -

I got this to work using fsspec on single files on Azure Blob:

import pyarrow.dataset as ds
import fsspec

fs = fsspec.filesystem(protocol='abfs', 
                       account_name=login, 
                       account_key=password)

dataset = ds.dataset("abfs://analytics/test/test..parquet", format="parquet", filesystem=fs)
dataset.to_table(columns=['ticket_id', 'event_value'], filter=ds.field('event_value') == 'closed').to_pandas().drop_duplicates('ticket_id')

When I try this on a partitioned dataset I made using write_to_dataset, however, I run into an error. I tried the same code as above, and also with the partitioning='hive' option added.

TypeError                                 Traceback (most recent call last)
<ipython-input-174-f44e707aa83e> in <module>
----> 1 dataset = ds.dataset("abfs://analytics/test/tickets-audits/", format="parquet", filesystem=fs, partitioning="hive", )

~/.local/lib/python3.7/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    665     # TODO(kszucs): support InMemoryDataset for a table input
    666     if _is_path_like(source):
--> 667         return _filesystem_dataset(source, **kwargs)
    668     elif isinstance(source, (tuple, list)):
    669         if all(_is_path_like(elem) for elem in source):

~/.local/lib/python3.7/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    430         selector_ignore_prefixes=selector_ignore_prefixes
    431     )
--> 432     factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
    433 
    434     return factory.finish(schema)

~/.local/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.FileSystemDatasetFactory.__init__()

~/.local/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.local/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_get_file_info_selector()

~/.local/lib/python3.7/site-packages/pyarrow/fs.py in get_file_info_selector(self, selector)
    159         infos = []
    160         selected_files = self.fs.find(
--> 161             selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True
    162         )
    163         for path, info in selected_files.items():

/opt/conda/lib/python3.7/site-packages/fsspec/spec.py in find(self, path, maxdepth, withdirs, **kwargs)
    369         # TODO: allow equivalent of -name parameter
    370         out = set()
--> 371         for path, dirs, files in self.walk(path, maxdepth, **kwargs):
    372             if withdirs:
    373                 files += dirs

/opt/conda/lib/python3.7/site-packages/fsspec/spec.py in walk(self, path, maxdepth, **kwargs)
    324 
    325         try:
--> 326             listing = self.ls(path, detail=True, **kwargs)
    327         except (FileNotFoundError, IOError):
    328             return [], [], []

TypeError: ls() got multiple values for keyword argument 'detail'

Environment: Ubuntu 18.04
Reporter: Lance Dacey / @ldacey

Note: This issue was originally created as ARROW-9514. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
@ldacey thanks for trying this out and for the issue report!

So when using the Dataset API, Azure is not yet supported natively (see ARROW-2034, ARROW-9611). But in theory it should indeed be supported through the fsspec wrapper. However, it seems you ran into a bug (we don't test the fsspec integration with Azure, only some basic tests with local and S3).

It might also be a bug in the fsspec implementation, though, because the fsspec docs indicate that the find method supports the kwargs of ls, which has a detail keyword. In practice it doesn't seem to be passed through correctly, resulting in a duplicate detail keyword (based on the error traceback). But maybe the docs are wrong instead.

cc @martindurant
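
For reference, a minimal sketch of the duplicate-keyword collision described above, using a hypothetical stand-in class rather than fsspec's actual code: if find() forwards **kwargs without popping "detail" while walk() also passes detail=True on to ls() explicitly, ls() receives the keyword twice.

class BrokenFS:
    def ls(self, path, detail=True, **kwargs):
        return []

    def walk(self, path, maxdepth=None, **kwargs):
        # kwargs may still contain "detail" here, so the explicit
        # detail=True collides with the forwarded one
        return self.ls(path, detail=True, **kwargs)

    def find(self, path, maxdepth=None, withdirs=False, **kwargs):
        # missing the fix present in later fsspec versions:
        # detail = kwargs.pop("detail", False)
        return self.walk(path, maxdepth, **kwargs)

BrokenFS().find("analytics/test", detail=True)
# TypeError: ls() got multiple values for keyword argument 'detail'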

Martin Durant / @martindurant:
Fix appreciated :)

Lance Dacey / @ldacey:
Does this seem like an issue with pyarrow or fsspec at this point?

Martin Durant / @martindurant:
Not sure, probably fsspec. The "detail" kwarg is explicitly removed here (https://github.com/intake/filesystem_spec/blob/master/fsspec/spec.py#L401), so not sure why it's still in the kwargs.

Lance Dacey / @ldacey:
I see that fsspec 0.7.4 (which is the version I use) has the same line of code to remove the 'detail' kwarg.

read_table() on Azure Blob also worked with pyarrow 0.17.1, so checking the differences between 0.17.1 and 1.0.0, I can see that this file was changed and now has some references to fsspec:

https://github.com/apache/arrow/blob/apache-arrow-1.0.0/python/pyarrow/fs.py


I see reference to detail=True within the get_file_info_selector method inside class FSSpecHandler(FileSystemHandler):

selected_files = self.fs.find(selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True)


These classes do not exist within 0.17.1 at all:

https://github.com/apache/arrow/blob/apache-arrow-0.17.1/python/pyarrow/fs.py


So it looks like the detail kwarg is popped in the fsspec find function that pyarrow references, but detail=True is then passed explicitly in the self.walk call. Is this the issue, perhaps?

def find(self, path, maxdepth=None, withdirs=False, **kwargs):
    detail = kwargs.pop("detail", False)  # "detail" is removed from kwargs here
    for path, dirs, files in self.walk(path, maxdepth, detail=True, **kwargs):
        ...

Martin Durant / @martindurant:
Both find and walk allow for the detail= kwarg via a pop() (added a number of months ago, before 0.7.4), so I can't see how it can survive to the call to ls. Perhaps stepping through with pdb would help. Certainly .find(..., detail=True) does the right thing (I don't have access to Azure).

Joris Van den Bossche / @jorisvandenbossche:
@ldacey BTW, if you want the old behaviour in pyarrow 1.0.0, you can specify use_legacy_dataset=True in read_table, which should normally give the same result as 0.17 (instead of reverting to 0.17).
The FSSpecHandler is indeed new in 1.0.0; it allows read_table to use the newer pyarrow.dataset functionality (which, e.g., gives the row-group filtering). In 0.17.0 the fsspec filesystem is used directly in a different way.
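
A hedged sketch of that fallback, reusing the connection details from the examples above (login and password are placeholders):

import fsspec
import pyarrow.parquet as pq

fs = fsspec.filesystem(protocol='abfs', account_name=login, account_key=password)

# use_legacy_dataset=True keeps the pre-1.0 code path that worked in 0.17.1
table = pq.read_table('analytics/test/tickets-audits/', filesystem=fs,
                      use_legacy_dataset=True)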

Joris Van den Bossche / @jorisvandenbossche:
Looking at the traceback, the error is indeed bizarre: there are clearly detail = kwargs.pop("detail", False) calls, but apparently the kwargs afterwards still include a "detail" key ...

Joris Van den Bossche / @jorisvandenbossche:
Actually, the traceback doesn't show the detail = kwargs.pop("detail", False) line where it should be present in the surrounding code. So @ldacey, it might be that you are using an older version of fsspec that doesn't yet handle this correctly. What fsspec version do you have installed? And does it work with the latest version?
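
A quick way to check both versions, assuming the packages are importable:

import fsspec
import pyarrow

print(fsspec.__version__)   # the detail pop was added before 0.7.4
print(pyarrow.__version__)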

Lance Dacey / @ldacey:
I am closing this issue. I assume there was a conflicting package, so I basically tore everything down and set things up again.

• I downgraded pyarrow to 0.17.1 and re-upgraded to 1.0.0 (this should have zero impact, but who knows?)
• I uninstalled adlfs completely and reinstalled fsspec 0.7.4, which downloaded adlfs 0.2.5
• I uninstalled a package called dask-azureblobfs
• I uninstalled a package called dask-adlfs

Here is a pip freeze with some of the packages which might impact the success of read_table and dataset:

adlfs==0.2.5
azure-common==1.1.23
azure-core==1.7.0
azure-storage-blob==2.1.0
azure-storage-common==2.1.0
dask==2.22.0
fsspec==0.7.4
pandas==1.1.0
pyarrow==1.0.0

I tested the following methods and they all work - I am quite pleased!

fs = fsspec.filesystem(protocol='abfs', account_name=login, account_key=password)

ds.dataset("abfs://analytics/test", format="parquet", filesystem=fs)

pq.ParquetDataset(path_or_paths='analytics/test', filesystem=fs, use_legacy_dataset=False)

pq.read_table(source='analytics/test/zendesk/tickets-audits/', filesystem=fs)

Lance Dacey / @ldacey:
This issue was most likely caused by additional unnecessary packages, which may have downgraded fsspec. fsspec 0.7.4 and pyarrow 1.0.0 are now working as intended, though.

Joris Van den Bossche / @jorisvandenbossche:
@ldacey thanks for checking, and happy it works now!

asfimport added this to the 1.0.0 milestone Jan 11, 2023