support `max_concurrency` in `upload_blob` and `download_blob` operations #420
Conversation
```diff
-    account_name=storage.account_name, connection_string=CONN_STR
+    account_name=storage.account_name,
+    connection_string=CONN_STR,
+    max_concurrency=1,
```
Concurrency cannot be used for this test; otherwise `storage.insert_time` would be the timestamp of the first completed chunk, while `creation_time` would be the timestamp of the finished operation (after all chunks are uploaded).

@hayesgb @TomAugspurger Hey folks, could you take a look, please? Just want to make sure you are fine with this.
Looks good. Just one question, but feel free to merge.

Btw, for the record, here are some comparisons showing how much this patch speeds things up for us: iterative/dvc-azure#54 (comment)
This PR adds:

- a `max_concurrency=None` kwarg in the async fs methods that use the azure SDK `upload_blob` and `download_blob` operations (which azure then uses to parallelize async chunk uploads/downloads)
- an `AzureBlobFileSystem.max_concurrency` attribute, which is used whenever the method-level `max_concurrency` is not set

As with `fsspec.asyn._get_batch_size()`, `max_concurrency` defaults to 1 for the individual file upload/download. `batch_size=...` is used to parallelize uploads/downloads at the file level (in something like `fs._get()` or `fs._put()`), and no additional parallelization is done for chunks within each file.

`batch_size` and `max_concurrency` can be combined, i.e. `fs.get(path, batch_size=4, max_concurrency=2)` would download up to 4 files at a time, and up to 2 chunks at a time within each file, giving an overall concurrency of up to 8 async download coroutines being run in the loop at a time.

Closes #268 (and supersedes the changes in the `concurrent_io` branch).

This PR is incompatible with #329 (but there was discussion in that PR regarding changing the name of the parameter to something other than `max_concurrency`, since it conflicts with the azure SDK parameter).
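For intuition, the two-level concurrency described above can be simulated with plain `asyncio` semaphores. This is a minimal sketch, not the adlfs implementation: `get`, `download_file`, and `download_chunk` are hypothetical stand-ins, with the outer semaphore playing the role of fsspec's `batch_size` and the inner one playing the role of the azure SDK's per-blob `max_concurrency`.

```python
import asyncio

async def download_chunk(file_id, chunk_id, counter):
    # Track how many chunk "downloads" are in flight at once.
    counter["now"] += 1
    counter["peak"] = max(counter["peak"], counter["now"])
    await asyncio.sleep(0.01)  # stand-in for the actual network I/O
    counter["now"] -= 1

async def download_file(file_id, max_concurrency, counter):
    # The azure SDK parallelizes chunks within one blob internally;
    # here an inner semaphore stands in for that behaviour.
    sem = asyncio.Semaphore(max_concurrency)
    async def limited(chunk_id):
        async with sem:
            await download_chunk(file_id, chunk_id, counter)
    await asyncio.gather(*(limited(c) for c in range(6)))

async def get(paths, batch_size, max_concurrency):
    # The outer semaphore limits file-level parallelism (fsspec batch_size).
    counter = {"now": 0, "peak": 0}
    sem = asyncio.Semaphore(batch_size)
    async def limited(p):
        async with sem:
            await download_file(p, max_concurrency, counter)
    await asyncio.gather(*(limited(p) for p in paths))
    return counter["peak"]

# 8 files, up to 4 files at a time, up to 2 chunks per file:
# peak concurrency is bounded by batch_size * max_concurrency = 8.
peak = asyncio.run(get(range(8), batch_size=4, max_concurrency=2))
print(peak)
```

The multiplication of the two limits is why the PR description's `fs.get(path, batch_size=4, max_concurrency=2)` example yields up to 8 download coroutines running in the loop at once.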