put_file: support concurrent multipart uploads with max_concurrency #848
Conversation
For reference, I live in Seoul and am physically close to the nearest AWS region. However, for any other region, increasing the concurrency makes a noticeable difference for me. Given this script and a bucket in a more distant region:

```python
import s3fs
from fsspec.callbacks import TqdmCallback

BUCKET = "my-bucket"  # placeholder; the original comment used an unspecified bucket

s3 = s3fs.S3FileSystem(cache_regions=True)
for i in (None, 32):
    s3.put_file(
        "ubuntu-22.04.2-live-server-arm64.iso",
        f"s3://{BUCKET}/s3fs-test/ubuntu-22.04.2-live-server-arm64.iso",
        callback=TqdmCallback(
            tqdm_kwargs={
                "desc": f"put file max_concurrency={i}",
                "unit": "B",
                "unit_scale": True,
            }
        ),
        max_concurrency=i,
    )
```

```
$ python test.py
put file max_concurrency=None: 100%|███████████████████████████| 1.94G/1.94G [01:53<00:00, 17.1MB/s]
put file max_concurrency=32: 100%|█████████████████████████████| 1.94G/1.94G [00:31<00:00, 62.1MB/s]
```

For reference, the AWS CLI on the same file:

```
$ time aws s3 cp ubuntu-22.04.2-live-server-arm64.iso s3://{BUCKET}/s3fs-test/ubuntu-22.04.2-live-server-arm64.iso
upload: ./ubuntu-22.04.2-live-server-arm64.iso to s3://{BUCKET}/s3fs-test/ubuntu-22.04.2-live-server-arm64.iso
aws s3 cp ubuntu-22.04.2-live-server-arm64.iso  7.53s user 4.03s system 26% cpu 43.334 total
```
(force-pushed from b2a8eb4 to 8f0f85f)
Any idea why this happens? The upload should in theory saturate the bandwidth whether on a single call for the whole massive file (after one count of latency) or on many concurrent calls (that all wait ~1 count of latency at the same time).
I'm not sure, and as I mentioned I do get the maximum theoretical bandwidth when I'm guaranteed to have good routing to the datacenter, whether or not it's a concurrent upload. But we have seen users report the same issue with other storage providers as well, so this isn't limited to AWS; see the related adlfs/azure issue. It's probably worth noting that boto3/s3transfer also do multipart uploads in concurrent threads rather than sequentially.
OK, let's assume there is some component of per-packet latency too, then. Perhaps it's an SSL thing. Thanks for digging.
I have a couple of thoughts on naming, and on when/if we can apply the same strategy to the other high-bandwidth operations.
The only substantial comment is about the batching strategy. It may not matter, and we can keep this approach for now, since it already produces an improvement.
s3fs/core.py (outdated)

```diff
@@ -1201,6 +1206,54 @@ async def _put_file(
             self.invalidate_cache(rpath)
             rpath = self._parent(rpath)
 
+    async def _upload_part_concurrent(
```
Can we please rename this to indicate that it uploads from a file, as opposed to bytes? Or can it be generalised to support `pipe()` too?
```python
while True:
    chunks = []
    for i in range(max_concurrency):
        chunk = f0.read(chunksize)
```
Somewhere we need a caveat that increasing the concurrency will lead to higher memory use.
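For context, a rough sketch of the batching pattern under review; the names (`upload_chunk`, `chunksize`) are illustrative, not the actual s3fs internals. It makes the memory implication visible: each batch holds up to `max_concurrency * chunksize` bytes in memory before the parts are uploaded concurrently.

```python
import asyncio


async def upload_file_in_parts(f0, chunksize, max_concurrency, upload_chunk):
    # upload_chunk(part_number, data) stands in for the actual UploadPart
    # call; this only illustrates the read-a-batch-then-upload pattern.
    part_number = 0
    while True:
        # Read up to max_concurrency chunks into memory...
        chunks = []
        for _ in range(max_concurrency):
            chunk = f0.read(chunksize)
            if not chunk:
                break
            chunks.append(chunk)
        if not chunks:
            break
        # ...then upload that batch concurrently. Peak memory use is
        # roughly max_concurrency * chunksize bytes per file.
        await asyncio.gather(
            *(
                upload_chunk(part_number + i + 1, data)
                for i, data in enumerate(chunks)
            )
        )
        part_number += len(chunks)
```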
s3fs/core.py (outdated)
```
If given, the maximum number of concurrent transfers to use for a
multipart upload. Defaults to 1 (multipart uploads will be done sequentially).
Note that when used in conjunction with ``S3FileSystem.put(batch_size=...)``
the result will be a maximum of ``max_concurrency * batch_size`` concurrent
transfers.
```
Suggested change:

```diff
-If given, the maximum number of concurrent transfers to use for a
-multipart upload. Defaults to 1 (multipart uploads will be done sequentially).
-Note that when used in conjunction with ``S3FileSystem.put(batch_size=...)``
-the result will be a maximum of ``max_concurrency * batch_size`` concurrent
-transfers.
+The maximum number of concurrent transfers to use per file for multipart
+upload (``put()``) operations. Defaults to 1 (sequential). When used in
+conjunction with ``S3FileSystem.put(batch_size=...)`` the maximum number of
+simultaneous connections is ``max_concurrency * batch_size``. We may extend
+this parameter to affect ``pipe()``, ``cat()`` and ``get()``.
```
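To make the multiplication concrete, a hypothetical call (the bucket and file names are placeholders, not from the PR): uploading up to four files at a time, each with eight concurrent part uploads, allows up to 32 part transfers in flight at once.

```python
import s3fs

fs = s3fs.S3FileSystem()

# Hypothetical example: up to 4 files in flight (batch_size=4), each split
# into parts uploaded 8 at a time (max_concurrency=8), for a maximum of
# 4 * 8 = 32 concurrent part transfers.
fs.put(
    ["a.bin", "b.bin", "c.bin", "d.bin", "e.bin"],  # placeholder local files
    "s3://my-bucket/uploads/",                      # placeholder destination
    batch_size=4,
    max_concurrency=8,
)
```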
s3fs/core.py (outdated)
```python
    ):
        max_concurrency = max_concurrency or self.max_concurrency
        if max_concurrency is None or max_concurrency < 1:
            max_concurrency = 1
```
Why not `max_concurrency=1` as the default in `__init__`?
fsspec/s3fs#848 added a `max_concurrency` kwarg, released in s3fs 2024.3, which seems to break something in `dvc pull`: https://github.com/hudcostreets/nj-crashes/actions/runs/8316430240/job/22755877957#step:11:31
- Adds a `max_concurrency` parameter which can be used to increase the concurrency of multipart uploads during `S3FileSystem._put_file()` (behaves the same as `max_concurrency` for uploads in adlfs).
- `max_concurrency` can be set at the fs instance level or as a parameter passed to `_put_file()`.
- This works in conjunction with `fs.put(batch_size=...)`, so you will end up with a maximum of `max_concurrency * batch_size` parts being transferred at once.
- The same approach could later be extended to `_get_file()`.
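A brief usage sketch of the two ways of setting it described above, assuming the public `put_file()` forwards the kwarg to `_put_file()`; the bucket and file names are placeholders.

```python
import s3fs

# Set a default for the whole filesystem instance...
fs = s3fs.S3FileSystem(max_concurrency=16)

# ...or override it for a single upload; up to 32 parts of this file
# may then be uploaded at the same time.
fs.put_file(
    "large-file.bin",                         # placeholder local path
    "s3://my-bucket/uploads/large-file.bin",  # placeholder remote path
    max_concurrency=32,
)
```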