Errors with pyarrow ds.write_dataset() using 0.6.0 #171

Closed
ldacey opened this issue Feb 1, 2021 · 9 comments

ldacey commented Feb 1, 2021

What happened:
I run into "The specified blob already exists" errors when trying to save a pyarrow dataset while adlfs 0.6.0 is installed. Reverting to 0.5.9 fixes this issue.

What you expected to happen:
There should be no error about the container already existing when I write the dataset - the container is expected to exist beforehand.

Minimal Complete Verifiable Example:

ds.write_dataset("dev/example", filesystem=fs, partitioning=partitioning)
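A fuller version of that call, as a minimal sketch (the account name, credential, table contents, and partitioning schema below are placeholders, not the real job):

import pyarrow as pa
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", credential="...")  # placeholder credentials

table = pa.table({"month_id": [202101, 202102], "value": [1, 2]})
partitioning = ds.partitioning(pa.schema([("month_id", pa.int64())]), flavor="hive")

# With adlfs 0.6.0 installed this raises FileExistsError, because pyarrow's create_dir
# ends up calling fs.mkdir("dev/example", create_parents=True) on the existing "dev" container.
ds.write_dataset(table, "dev/example", filesystem=fs, format="parquet", partitioning=partitioning)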

Pinning adlfs at 0.5.9 works:

[2021-02-01 03:14:27,746] INFO -
adlfs v0.5.9
fsspec 0.8.5
azure.storage.blob 12.7.1
adal 1.2.6
pandas 1.2.1
pyarrow 3.0.0

[2021-02-01 03:14:27,750] {spec.py:696} INFO - Returning a list of containers in the azure blob storage account
[2021-02-01 03:14:27,753]  INFO - Saving 118825 rows and 74 columns

Same code fails when adlfs 0.6.0 is installed:

[2021-02-01 01:06:25,805] INFO -
adlfs v0.6.0
fsspec 0.8.5
azure.storage.blob 12.6.0
adal 1.2.6
pandas 1.2.1
pyarrow 3.0.0

[2021-02-01 01:06:25,809] {spec.py:696} INFO - Returning a list of containers in the azure blob storage account
[2021-02-01 01:06:25,865] ERROR - Failed to process 20210128T020210.json due to: Cannot overwrite existing Azure container -- dev already exists.                         with Azure error The specified blob already exists.
RequestId:94a2ad7a-901e-0082-3b36-f84fc8000000
Time:2021-02-01T01:06:25.8695968Z
ErrorCode:BlobAlreadyExists


  File "pyarrow/_dataset.pyx", line 2343, in pyarrow._dataset._filesystemdataset_write
  File "pyarrow/_fs.pyx", line 1032, in pyarrow._fs._cb_create_dir
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py", line 259, in create_dir
    self.fs.mkdir(path, create_parents=recursive)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/opt/conda/lib/python3.8/site-packages/adlfs/spec.py", line 1033, in _mkdir
    raise FileExistsError(
FileExistsError: Cannot overwrite existing Azure container -- dev already exists. 

Anything else we need to know?:
I narrowed this down when I downgraded pyarrow to 2.0 and noticed that adlfs 0.5.9 had been installed alongside it and there were no errors. I then installed pyarrow 3.0 with adlfs pinned at 0.5.9 and it still worked.

Environment:

  • Python version: 3.8
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): conda-forge
ldacey commented Feb 1, 2021

@jorisvandenbossche FYI I will post this as a pyarrow issue as well

@jorisvandenbossche

@ldacey I suspect this is an adlfs issue. From the traceback, it is doing a fs.mkdir(path, create_parents=True). Could you try doing that manually with an adlfs filesystem object to create a path in the dev container?
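Something like the following would be the manual check (account name and credential are placeholders):

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", credential="...")  # placeholders
fs.mkdir("dev/example", create_parents=True)  # on 0.6.0 this raises FileExistsError because the "dev" container exists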

hayesgb commented Feb 7, 2021

@ldacey and @jorisvandenbossche -- I think the issue here is related to an API change in 0.6.0 made to align the adlfs API with Python's os.mkdir(). See the change referenced here. Can anyone comment on whether s3fs works?
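For comparison, plain Python os.mkdir() raises FileExistsError when the target already exists, which is the behavior the 0.6.0 change mirrors:

import os

os.mkdir("dev")                      # creates the directory
os.mkdir("dev")                      # second call raises FileExistsError
os.makedirs("dev", exist_ok=True)    # the exist_ok form tolerates an existing directory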

ldacey commented Feb 14, 2021

Is this a change that would need to be made in pyarrow? I have no access to S3.

fs.mkdir("dev/0.6/test", create_parents=True)

This created the 0.6 folder within the "dev" container, but "test" is a 0-byte file:

After upgrading to 0.6.2, running the command a single time works, but it does not create a folder; "test3" is an empty file. If I run the command twice, I get an error:

pyarrow 3.0.0
fsspec 0.8.5
adlfs v0.6.2
logging 0.5.1.2
pandas 1.2.2
numpy 1.20.1
turbodbc 4.1.2
adal 1.2.6

fs.mkdir("dev/test3", create_parents=True)

FileExistsError                           Traceback (most recent call last)
<ipython-input-7-3f4fbc7b3e81> in <module>
----> 1 fs.mkdir("dev/test3", create_parents=True)

~/.local/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
    119     def wrapper(*args, **kwargs):
    120         self = obj or args[0]
--> 121         return maybe_sync(func, self, *args, **kwargs)
    122 
    123     return wrapper

~/.local/lib/python3.8/site-packages/fsspec/asyn.py in maybe_sync(func, self, *args, **kwargs)
     98         if inspect.iscoroutinefunction(func):
     99             # run the awaitable on the loop
--> 100             return sync(loop, func, *args, **kwargs)
    101         else:
    102             # just call the blocking function

~/.local/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, callback_timeout, *args, **kwargs)
     69     if error[0]:
     70         typ, exc, tb = error[0]
---> 71         raise exc.with_traceback(tb)
     72     else:
     73         return result[0]

~/.local/lib/python3.8/site-packages/fsspec/asyn.py in f()
     53             if callback_timeout is not None:
     54                 future = asyncio.wait_for(future, callback_timeout)
---> 55             result[0] = await future
     56         except Exception:
     57             error[0] = sys.exc_info()

/opt/conda/lib/python3.8/site-packages/adlfs/spec.py in _mkdir(self, path, create_parents, delimiter, **kwargs)
   1040                 pass
   1041             else:
-> 1042                 raise FileExistsError(
   1043                     f"Cannot overwrite existing Azure container -- {container_name} already exists. \
   1044                         with Azure error {e}"

FileExistsError: Cannot overwrite existing Azure container -- dev already exists.                         with Azure error The specified blob already exists.
RequestId:cb0de68e-c01e-0012-5819-03da84000000
Time:2021-02-14T21:38:47.3122004Z
ErrorCode:BlobAlreadyExists
Error:None

hayesgb commented Feb 15, 2021

Can you try this branch? It should fix the issue.

ldacey commented Feb 15, 2021

No errors running fs.mkdir("dev/test3", create_parents=True) over and over again using that branch, and I was able to use write_dataset() multiple times to the same location.

hayesgb commented Feb 16, 2021

Implemented in release v0.6.3

hayesgb closed this as completed Feb 16, 2021
hayesgb commented Mar 22, 2021

@ldacey -- Can you take a look at the mkdir_noop branch? It aligns with the fsspec discussion referenced here, but it should also address this issue and eliminates the trailing delimiter for directory identification.

hayesgb reopened this Mar 22, 2021
ldacey commented Mar 22, 2021

I was able to write a table with asynchronous=False and then read it back as a pyarrow table.

fs.info("dev/noop/month_id=202101")
{'name': 'dev/noop/month_id=202101', 'size': None, 'type': 'directory'}


fs.info("dev/noop/month_id=202101/date_id=20210103/partition-1-20210322083136-0.parquet")
{'metadata': {'is_directory': 'false'},
 'creation_time': datetime.datetime(2021, 3, 22, 8, 31, 37, tzinfo=datetime.timezone.utc),
 'deleted': None,
 'deleted_time': None,
 'last_modified': datetime.datetime(2021, 3, 22, 8, 31, 37, tzinfo=datetime.timezone.utc),
 'content_settings': {'content_type': 'application/octet-stream', 'content_encoding': None, 'content_language': None, 'content_md5': None, 'content_disposition': None, 'cache_control': None},
 'remaining_retention_days': None,
 'archive_status': None,
 'last_accessed_on': None,
 'etag': '0x8D8ED0CE8A5F9D4',
 'tags': None,
 'tag_count': None,
 'name': 'dev/noop/month_id=202101/date_id=20210103/partition-1-20210322083136-0.parquet',
 'size': 575,
 'type': 'file'}

fs.ls("dev/noop")
['dev/noop/month_id=202101/', 'dev/noop/month_id=202102/']

Reading a large dataset (23,000 fragments) did not have any issues, and fs.find() did not show any empty blobs.
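A minimal sketch of the round trip described in this comment (account name, credential, and paths are placeholders):

import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", credential="...", asynchronous=False)  # placeholders

# Read the previously written, hive-partitioned dataset back as a pyarrow table.
dataset = ds.dataset("dev/noop", filesystem=fs, format="parquet", partitioning="hive")
table = dataset.to_table()

# Confirm no 0-byte placeholder blobs were left behind by the directory handling.
empty_blobs = [f for f in fs.find("dev/noop") if fs.info(f)["size"] == 0]
assert not empty_blobs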

hayesgb closed this as completed Mar 23, 2021