Issue with parallel uploads to the same blob #462

@M0dEx

There seems to be an issue when two instances of this file system, running in two different processes, write to the same blob in parallel. One of the uploads fails with:

Azure error
    File "/code/.venv/lib/python3.10/site-packages/our_package/connector/storage/blob.py", line 117, in _save
        with self._fs.open(
    File "/code/.venv/lib/python3.10/site-packages/fsspec/spec.py", line 1963, in __exit__
        self.close()
    File "/code/.venv/lib/python3.10/site-packages/adlfs/spec.py", line 1908, in close
        super().close()
    File "/code/.venv/lib/python3.10/site-packages/fsspec/spec.py", line 1930, in close
        self.flush(force=True)
    File "/code/.venv/lib/python3.10/site-packages/fsspec/spec.py", line 1801, in flush
        if self._upload_chunk(final=force) is not False:
    File "/code/.venv/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
        return sync(self.loop, func, *args, **kwargs)
    File "/code/.venv/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
        raise return_result
    File "/code/.venv/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
        result[0] = await coro
    File "/code/.venv/lib/python3.10/site-packages/adlfs/spec.py", line 2068, in _async_upload_chunk
        await bc.commit_block_list(
    File "/code/.venv/lib/python3.10/site-packages/azure/core/tracing/decorator_async.py", line 77, in wrapper_use_tracer
        return await func(*args, **kwargs)
    File "/code/.venv/lib/python3.10/site-packages/azure/storage/blob/aio/_blob_client_async.py", line 1861, in commit_block_list
        process_storage_error(error)
    File "/code/.venv/lib/python3.10/site-packages/azure/storage/blob/_shared/response_handlers.py", line 184, in process_storage_error
        exec("raise error from None")   # pylint: disable=exec-used # nosec
    File "<string>", line 1, in <module>
    
azure.core.exceptions.HttpResponseError: The specified block list is invalid.
RequestId:<request_id>
Time:2024-02-13T12:15:05.1957595Z
ErrorCode:InvalidBlockList
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidBlockList</Code><Message>The specified block list is invalid.

From our limited investigation, this is likely caused by the way AzureBlobFile calculates the IDs of the uploaded blocks:

adlfs/adlfs/spec.py

Lines 2102 to 2103 in 576fb7a

block_id = len(self._block_list)
block_id = f"{block_id:07d}"
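
To illustrate the suspected collision, here is a minimal sketch that reproduces only the ID scheme quoted above, without touching Azure. The helper names `counter_block_id` and `stage_all` are hypothetical and exist only for this illustration; they are not part of adlfs.

```python
def counter_block_id(block_list):
    """Mirror the quoted scheme: the ID is the number of blocks staged so far, zero-padded."""
    block_id = len(block_list)
    return f"{block_id:07d}"


def stage_all(chunks):
    """Hypothetical stand-in for one writer staging its chunks; returns the IDs it would use."""
    block_list = []
    for _ in chunks:
        block_list.append(counter_block_id(block_list))
    return block_list


# Two independent writers with different content for the same blob.
writer_a = stage_all([b"aaaa", b"bbbb"])
writer_b = stage_all([b"cccc", b"dddd"])

print(writer_a)              # ['0000000', '0000001']
print(writer_a == writer_b)  # True: both writers use identical block IDs
```

Because the ID depends only on the position of the block, every writer that targets the same blob produces the same sequence of IDs regardless of what it uploads.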

Could this be changed to a hash of the content, or something similar, so that each ID corresponds to the actual contents of the uploaded block?
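
A rough sketch of what a content-derived ID could look like, as an illustration of the suggestion rather than a proposed patch. The function name `content_block_id` is hypothetical, and the choices of SHA-256, a 32-character prefix, and mixing in the block index are assumptions made for this example. The fixed length is deliberate, on the understanding that Azure expects all block IDs within a blob to have the same length.

```python
import hashlib


def content_block_id(index: int, chunk: bytes) -> str:
    """Hypothetical helper: derive a fixed-length block ID from the block's position and bytes.

    The index is mixed in so that two identical chunks at different positions
    in the same file still get different IDs.
    """
    hasher = hashlib.sha256()
    hasher.update(f"{index:07d}".encode("ascii"))
    hasher.update(chunk)
    # Truncate to 32 hex characters so every ID has the same length.
    return hasher.hexdigest()[:32]


id_a = content_block_id(0, b"aaaa")
id_b = content_block_id(0, b"cccc")

print(len(id_a))     # 32
print(id_a != id_b)  # True: different content at the same position gives different IDs
```

With this scheme, two writers only share an ID when they upload the same bytes at the same position. Whether distinct IDs alone are enough to make concurrent uploads to one blob safe has not been verified here.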
