
[Python] The new Dataset API will not work with files on Azure Blob #25582

Closed
asfimport opened this issue Jul 17, 2020 · 12 comments
I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False), and my connection to Azure Blob fails.

I know the documentation says only HDFS and S3 are implemented, but I have been using Azure Blob by passing an fsspec filesystem when reading and writing Parquet files/datasets with pyarrow (with use_legacy_dataset=True). Dask also works with storage_options.

I am hoping that Azure Blob will be supported because I'd really like to try out the new row filtering and non-hive partitioning schemes.

This is what I use for the filesystem when using read_table() or write_to_dataset():

fs = fsspec.filesystem(protocol='abfs',
                       account_name=base.login,
                       account_key=base.password)

It seems like the class _ParquetDatasetV2 has a section that the original ParquetDataset does not have. Perhaps this is why the fsspec filesystem fails when I turn off the legacy dataset?

Line 1423 in arrow/python/pyarrow/parquet.py:

if filesystem is not None:
    filesystem = pyarrow.fs._ensure_filesystem(filesystem, use_mmap=memory_map) 
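
For reference, a minimal sketch (assuming pyarrow 1.0.0, where the FSSpecHandler class exists) of roughly what that coercion amounts to for an fsspec filesystem — it gets wrapped into a pyarrow-native FileSystem; login and password here are placeholders from this report:

import fsspec
from pyarrow.fs import PyFileSystem, FSSpecHandler

# the fsspec filesystem for Azure Blob
abfs = fsspec.filesystem('abfs', account_name=login, account_key=password)

# wrap it so the pyarrow Dataset machinery can call into it
wrapped = PyFileSystem(FSSpecHandler(abfs))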

EDIT -

I got this to work using fsspec on single files on Azure Blob:

import pyarrow.dataset as ds
import fsspec

fs = fsspec.filesystem(protocol='abfs', 
                       account_name=login, 
                       account_key=password)

dataset = ds.dataset("abfs://analytics/test/test..parquet", format="parquet", filesystem=fs)
dataset.to_table(columns=['ticket_id', 'event_value'], filter=ds.field('event_value') == 'closed').to_pandas().drop_duplicates('ticket_id')

When I try this on a partitioned dataset I made using write_to_dataset, however, I run into an error. I tried the same code as above, and also with the partitioning='hive' option added.

TypeError                                 Traceback (most recent call last)
<ipython-input-174-f44e707aa83e> in <module>
----> 1 dataset = ds.dataset("abfs://analytics/test/tickets-audits/", format="parquet", filesystem=fs, partitioning="hive", )

~/.local/lib/python3.7/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    665     # TODO(kszucs): support InMemoryDataset for a table input
    666     if _is_path_like(source):
--> 667         return _filesystem_dataset(source, **kwargs)
    668     elif isinstance(source, (tuple, list)):
    669         if all(_is_path_like(elem) for elem in source):

~/.local/lib/python3.7/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    430         selector_ignore_prefixes=selector_ignore_prefixes
    431     )
--> 432     factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
    433 
    434     return factory.finish(schema)

~/.local/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.FileSystemDatasetFactory.__init__()

~/.local/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.local/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_get_file_info_selector()

~/.local/lib/python3.7/site-packages/pyarrow/fs.py in get_file_info_selector(self, selector)
    159         infos = []
    160         selected_files = self.fs.find(
--> 161             selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True
    162         )
    163         for path, info in selected_files.items():

/opt/conda/lib/python3.7/site-packages/fsspec/spec.py in find(self, path, maxdepth, withdirs, **kwargs)
    369         # TODO: allow equivalent of -name parameter
    370         out = set()
--> 371         for path, dirs, files in self.walk(path, maxdepth, **kwargs):
    372             if withdirs:
    373                 files += dirs

/opt/conda/lib/python3.7/site-packages/fsspec/spec.py in walk(self, path, maxdepth, **kwargs)
    324 
    325         try:
--> 326             listing = self.ls(path, detail=True, **kwargs)
    327         except (FileNotFoundError, IOError):
    328             return [], [], []

TypeError: ls() got multiple values for keyword argument 'detail'

Environment: Ubuntu 18.04
Reporter: Lance Dacey / @ldacey

Note: This issue was originally created as ARROW-9514. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
@ldacey thanks for trying this out and for the issue report!

So when using the Dataset API, Azure is not yet supported natively (see ARROW-2034, ARROW-9611). But in theory it should indeed be supported through the fsspec wrapper. However, it seems you ran into a bug (we don't test the fsspec integration with Azure, only some basic tests with local and S3).

It might also be a bug in the fsspec implementation, though, because the fsspec docs indicate that the find method supports the kwargs of ls, which has a detail keyword. In practice it doesn't seem to be passed through correctly, resulting in a duplicate detail keyword (based on the error traceback). But maybe the docs are wrong instead.

cc @martindurant
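
For reference, a minimal sketch of the duplicate-keyword collision described above, using a hypothetical stand-in class rather than fsspec's actual code: if find() forwards **kwargs without popping "detail" while walk() also passes detail=True on to ls() explicitly, ls() receives the keyword twice.

class BrokenFS:
    def ls(self, path, detail=True, **kwargs):
        return []

    def walk(self, path, maxdepth=None, **kwargs):
        # kwargs may still contain "detail" here, so the explicit
        # detail=True collides with the forwarded one
        return self.ls(path, detail=True, **kwargs)

    def find(self, path, maxdepth=None, withdirs=False, **kwargs):
        # missing the fix present in later fsspec versions:
        # detail = kwargs.pop("detail", False)
        return self.walk(path, maxdepth, **kwargs)

BrokenFS().find("analytics/test", detail=True)
# TypeError: ls() got multiple values for keyword argument 'detail'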

Martin Durant / @martindurant:
Fix appreciated :)

Lance Dacey / @ldacey:
Does this seem like an issue with pyarrow or fsspec at this point?

Martin Durant / @martindurant:
Not sure, probably fsspec. The "detail" kwarg is explicitly removed here (https://github.com/intake/filesystem_spec/blob/master/fsspec/spec.py#L401), so not sure why it's still in the kwargs.

Lance Dacey / @ldacey:
I see that fsspec 0.7.4 (which is the version I use) has the same line of code to remove the 'detail' kwarg.

read_table() on Azure Blob also worked with pyarrow 0.17.1, so checking the differences between 0.17.1 and 1.0.0, I can see that this file was changed and now has some references to fsspec:

https://github.com/apache/arrow/blob/apache-arrow-1.0.0/python/pyarrow/fs.py


I see reference to detail=True within the get_file_info_selector method inside class FSSpecHandler(FileSystemHandler):

selected_files = self.fs.find(selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True)


These classes do not exist within 0.17.1 at all:

https://github.com/apache/arrow/blob/apache-arrow-0.17.1/python/pyarrow/fs.py


So it looks like the detail kwarg is popped in the fsspec find function that pyarrow references, but detail=True is then passed explicitly in the self.walk call. Is this the issue, perhaps?

def find(self, path, maxdepth=None, withdirs=False, **kwargs):
    detail = kwargs.pop("detail", False)  # "detail" is removed from kwargs here
    for path, dirs, files in self.walk(path, maxdepth, detail=True, **kwargs):
        ...

Martin Durant / @martindurant:
Both find and walk allow for the detail= kwarg via a pop() (added a number of months ago, before 0.7.4), so I can't see how it can survive to the call to ls. Perhaps stepping through with pdb would help. Certainly .find(..., detail=True) does the right thing (I don't have access to Azure).

Joris Van den Bossche / @jorisvandenbossche:
@ldacey BTW, if you want the old behaviour in pyarrow 1.0.0, you can specify use_legacy_dataset=True in read_table, which should normally give the same result as 0.17 (instead of reverting to 0.17).
The FSSpecHandler is indeed new in 1.0.0; it allows read_table to use the newer pyarrow.dataset functionality (which, e.g., gives the row-group filtering). In 0.17.0 the fsspec filesystem is used directly in a different way.
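
A hedged sketch of that fallback, reusing the connection details from the examples above (login and password are placeholders):

import fsspec
import pyarrow.parquet as pq

fs = fsspec.filesystem(protocol='abfs', account_name=login, account_key=password)

# use_legacy_dataset=True keeps the pre-1.0 code path that worked in 0.17.1
table = pq.read_table('analytics/test/tickets-audits/', filesystem=fs,
                      use_legacy_dataset=True)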

Joris Van den Bossche / @jorisvandenbossche:
Looking at the traceback, the error is indeed bizarre: there are clearly detail = kwargs.pop("detail", False) calls, but apparently the kwargs afterwards still include a "detail" key ...

Joris Van den Bossche / @jorisvandenbossche:
Actually, the traceback doesn't show the detail = kwargs.pop("detail", False) line where it should be present in the surrounding code. So @ldacey, it might be that you are using an older version of fsspec that doesn't yet handle this correctly. What fsspec version do you have installed? And does it work with the latest version?
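
A quick way to check both versions, assuming the packages are importable:

import fsspec
import pyarrow

print(fsspec.__version__)   # the detail pop was added before 0.7.4
print(pyarrow.__version__)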

Lance Dacey / @ldacey:
I am closing this issue. I assume there was a conflicting package, so I basically tore everything down and set things up again.

• I downgraded pyarrow to 0.17.1 and re-upgraded to 1.0.0 (this should have zero impact, but who knows?)
• I uninstalled adlfs completely and reinstalled fsspec 0.7.4, which downloaded adlfs 0.2.5
• I uninstalled a package called dask-azureblobfs
• I uninstalled a package called dask-adlfs

Here is a pip freeze with some of the packages which might impact the success of read_table and dataset:

adlfs==0.2.5
azure-common==1.1.23
azure-core==1.7.0
azure-storage-blob==2.1.0
azure-storage-common==2.1.0
dask==2.22.0
fsspec==0.7.4
pandas==1.1.0
pyarrow==1.0.0

I tested the following methods and they all work - I am quite pleased!

fs = fsspec.filesystem(protocol='abfs', account_name=login, account_key=password)

ds.dataset("abfs://analytics/test", format="parquet", filesystem=fs)

pq.ParquetDataset(path_or_paths='analytics/test', filesystem=fs, use_legacy_dataset=False)

pq.read_table(source='analytics/test/zendesk/tickets-audits/', filesystem=fs)

Lance Dacey / @ldacey:
This issue was most likely caused by additional unnecessary packages, which may have downgraded fsspec. fsspec 0.7.4 and pyarrow 1.0.0 are now working as intended, though.

Joris Van den Bossche / @jorisvandenbossche:
@ldacey thanks for checking, and happy it works now!

asfimport added this to the 1.0.0 milestone Jan 11, 2023