Errors with pyarrow ds.write_dataset() using 0.6.0 #171

Closed
ldacey opened this issue Feb 1, 2021 · 9 comments

ldacey commented Feb 1, 2021

What happened:
I run into "The specified blob already exists" errors when trying to save a pyarrow dataset while adlfs 0.6.0 is installed. Reverting to 0.5.9 fixes this issue.

What you expected to happen:
There should be no error about the container already existing when I write the dataset - the container is expected to exist beforehand.

Minimal Complete Verifiable Example:

ds.write_dataset("dev/example", filesystem=fs, partitioning=partitioning)
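A fuller version of that call, as a minimal sketch (the account name, credential, table contents, and partitioning schema below are placeholders, not the real job):

import pyarrow as pa
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", credential="...")  # placeholder credentials

table = pa.table({"month_id": [202101, 202102], "value": [1, 2]})
partitioning = ds.partitioning(pa.schema([("month_id", pa.int64())]), flavor="hive")

# With adlfs 0.6.0 installed this raises FileExistsError, because pyarrow's create_dir
# ends up calling fs.mkdir("dev/example", create_parents=True) on the existing "dev" container.
ds.write_dataset(table, "dev/example", filesystem=fs, format="parquet", partitioning=partitioning)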

Pinning adlfs at 0.5.9 works:

[2021-02-01 03:14:27,746] INFO -
adlfs v0.5.9
fsspec 0.8.5
azure.storage.blob 12.7.1
adal 1.2.6
pandas 1.2.1
pyarrow 3.0.0

[2021-02-01 03:14:27,750] {spec.py:696} INFO - Returning a list of containers in the azure blob storage account
[2021-02-01 03:14:27,753]  INFO - Saving 118825 rows and 74 columns

Same code fails when adlfs 0.6.0 is installed:

[2021-02-01 01:06:25,805] INFO -
adlfs v0.6.0
fsspec 0.8.5
azure.storage.blob 12.6.0
adal 1.2.6
pandas 1.2.1
pyarrow 3.0.0

[2021-02-01 01:06:25,809] {spec.py:696} INFO - Returning a list of containers in the azure blob storage account
[2021-02-01 01:06:25,865] ERROR - Failed to process 20210128T020210.json due to: Cannot overwrite existing Azure container -- dev already exists.                         with Azure error The specified blob already exists.
RequestId:94a2ad7a-901e-0082-3b36-f84fc8000000
Time:2021-02-01T01:06:25.8695968Z
ErrorCode:BlobAlreadyExists


  File "pyarrow/_dataset.pyx", line 2343, in pyarrow._dataset._filesystemdataset_write
  File "pyarrow/_fs.pyx", line 1032, in pyarrow._fs._cb_create_dir
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py", line 259, in create_dir
    self.fs.mkdir(path, create_parents=recursive)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/opt/conda/lib/python3.8/site-packages/adlfs/spec.py", line 1033, in _mkdir
    raise FileExistsError(
FileExistsError: Cannot overwrite existing Azure container -- dev already exists. 

Anything else we need to know?:
I narrowed this down when I downgraded pyarrow to 2.0 and noticed that adlfs 0.5.9 had been installed alongside it and there were no errors. I then installed pyarrow 3.0 with adlfs pinned at 0.5.9 and it still worked.

Environment:

  • Python version: 3.8
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): conda-forge
ldacey commented Feb 1, 2021

@jorisvandenbossche FYI I will post this as a pyarrow issue as well

@jorisvandenbossche

@ldacey I suspect this is an adlfs issue. From the traceback, it is doing a fs.mkdir(path, create_parents=True). Could you try doing that manually with an adlfs filesystem object to create a path in the dev container?
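Something like the following would be the manual check (account name and credential are placeholders):

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", credential="...")  # placeholders
fs.mkdir("dev/example", create_parents=True)  # on 0.6.0 this raises FileExistsError because the "dev" container exists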

hayesgb commented Feb 7, 2021

@ldacey and @jorisvandenbossche -- I think the issue here is related to an API change in 0.6.0 made to align the adlfs API with Python's os.mkdir(). See the change referenced here. Can anyone comment on whether s3fs works?
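For comparison, plain Python os.mkdir() raises FileExistsError when the target already exists, which is the behavior the 0.6.0 change mirrors:

import os

os.mkdir("dev")                      # creates the directory
os.mkdir("dev")                      # second call raises FileExistsError
os.makedirs("dev", exist_ok=True)    # the exist_ok form tolerates an existing directory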

ldacey commented Feb 14, 2021

Is this a change that would need to be made in pyarrow? I have no access to S3.

fs.mkdir("dev/0.6/test", create_parents=True)

This created the 0.6 folder within the "dev" container, but "test" is a 0-byte file:

After upgrading to 0.6.2, running the command a single time works, but it does not create a folder; "test3" is an empty file. If I run the command twice, I get an error:

pyarrow 3.0.0
fsspec 0.8.5
adlfs v0.6.2
logging 0.5.1.2
pandas 1.2.2
numpy 1.20.1
turbodbc 4.1.2
adal 1.2.6

fs.mkdir("dev/test3", create_parents=True)

FileExistsError                           Traceback (most recent call last)
<ipython-input-7-3f4fbc7b3e81> in <module>
----> 1 fs.mkdir("dev/test3", create_parents=True)

~/.local/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
    119     def wrapper(*args, **kwargs):
    120         self = obj or args[0]
--> 121         return maybe_sync(func, self, *args, **kwargs)
    122 
    123     return wrapper

~/.local/lib/python3.8/site-packages/fsspec/asyn.py in maybe_sync(func, self, *args, **kwargs)
     98         if inspect.iscoroutinefunction(func):
     99             # run the awaitable on the loop
--> 100             return sync(loop, func, *args, **kwargs)
    101         else:
    102             # just call the blocking function

~/.local/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, callback_timeout, *args, **kwargs)
     69     if error[0]:
     70         typ, exc, tb = error[0]
---> 71         raise exc.with_traceback(tb)
     72     else:
     73         return result[0]

~/.local/lib/python3.8/site-packages/fsspec/asyn.py in f()
     53             if callback_timeout is not None:
     54                 future = asyncio.wait_for(future, callback_timeout)
---> 55             result[0] = await future
     56         except Exception:
     57             error[0] = sys.exc_info()

/opt/conda/lib/python3.8/site-packages/adlfs/spec.py in _mkdir(self, path, create_parents, delimiter, **kwargs)
   1040                 pass
   1041             else:
-> 1042                 raise FileExistsError(
   1043                     f"Cannot overwrite existing Azure container -- {container_name} already exists. \
   1044                         with Azure error {e}"

FileExistsError: Cannot overwrite existing Azure container -- dev already exists.                         with Azure error The specified blob already exists.
RequestId:cb0de68e-c01e-0012-5819-03da84000000
Time:2021-02-14T21:38:47.3122004Z
ErrorCode:BlobAlreadyExists
Error:None

hayesgb commented Feb 15, 2021

Can you try this branch? It should fix the issue.

ldacey commented Feb 15, 2021

No errors running fs.mkdir("dev/test3", create_parents=True) over and over again using that branch, and I was able to use write_dataset() multiple times to the same location.

hayesgb commented Feb 16, 2021

Implemented in release v0.6.3

hayesgb closed this as completed Feb 16, 2021
hayesgb commented Mar 22, 2021

@ldacey -- Can you take a look at the mkdir_noop branch? It aligns with the fsspec discussion referenced here, but it should also address this issue and eliminates the trailing delimiter for directory identification.

hayesgb reopened this Mar 22, 2021
ldacey commented Mar 22, 2021

I was able to write a table with asynchronous=False and then read it back as a pyarrow table.

fs.info("dev/noop/month_id=202101")
{'name': 'dev/noop/month_id=202101', 'size': None, 'type': 'directory'}


fs.info("dev/noop/month_id=202101/date_id=20210103/partition-1-20210322083136-0.parquet")
{'metadata': {'is_directory': 'false'},
 'creation_time': datetime.datetime(2021, 3, 22, 8, 31, 37, tzinfo=datetime.timezone.utc),
 'deleted': None,
 'deleted_time': None,
 'last_modified': datetime.datetime(2021, 3, 22, 8, 31, 37, tzinfo=datetime.timezone.utc),
 'content_settings': {'content_type': 'application/octet-stream', 'content_encoding': None, 'content_language': None, 'content_md5': None, 'content_disposition': None, 'cache_control': None},
 'remaining_retention_days': None,
 'archive_status': None,
 'last_accessed_on': None,
 'etag': '0x8D8ED0CE8A5F9D4',
 'tags': None,
 'tag_count': None,
 'name': 'dev/noop/month_id=202101/date_id=20210103/partition-1-20210322083136-0.parquet',
 'size': 575,
 'type': 'file'}

fs.ls("dev/noop")
['dev/noop/month_id=202101/', 'dev/noop/month_id=202102/']

Reading a large dataset (23,000 fragments) did not have any issues, and fs.find() did not show any empty blobs.
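A minimal sketch of the round trip described in this comment (account name, credential, and paths are placeholders):

import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", credential="...", asynchronous=False)  # placeholders

# Read the previously written, hive-partitioned dataset back as a pyarrow table.
dataset = ds.dataset("dev/noop", filesystem=fs, format="parquet", partitioning="hive")
table = dataset.to_table()

# Confirm no 0-byte placeholder blobs were left behind by the directory handling.
empty_blobs = [f for f in fs.find("dev/noop") if fs.info(f)["size"] == 0]
assert not empty_blobs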

hayesgb closed this as completed Mar 23, 2021