ARROW-8365: [C++] Error when writing files to S3 larger than 5 GB
The part upload threshold could be bumped too frequently
(on any Write() call, even if it doesn't trigger a part upload).

Report and diagnosis by Juan Galvez (thank you!).

Closes #6864 from pitrou/ARROW-8365-fix-part-upload-threshold

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou committed Apr 7, 2020
1 parent 6fc67cf commit 197a3c2
Showing 1 changed file with 16 additions and 15 deletions.
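Why the placement of the bump matters: in the pre-fix code, the part_number_ % 100 check ran at the top of DoWrite(), i.e. on every Write() call, while part_number_ itself only advances when a part is actually uploaded. Once the counter sits at a multiple of 100, every small write therefore raises the threshold by another 5 MB, the buffered data can never catch up with it, and Close() eventually has to send everything as one oversized part (S3 caps individual parts at 5 GB, which matches the reported error). The standalone sketch below simulates that bookkeeping; the 4 MB chunk size, the loop bound, and the local variables are illustrative stand-ins, not the real ObjectOutputStream members:

    #include <cstdint>
    #include <iostream>

    int main() {
      const int64_t kMinimumPartUpload = 5 * 1024 * 1024;  // 5 MB, as in s3fs.cc
      int64_t threshold = kMinimumPartUpload;
      int64_t buffered = 0;
      int part_number = 100;  // pretend 99 parts have already been uploaded

      const int64_t chunk = 4 * 1024 * 1024;  // caller issues 4 MB Write() calls
      for (int i = 0; i < 1000; ++i) {
        // Pre-fix behavior: the bump ran on *every* Write() call.
        if (part_number % 100 == 0) {
          threshold += kMinimumPartUpload;
        }
        buffered += chunk;
        if (buffered >= threshold) {  // a part upload would trigger here
          buffered = 0;
          ++part_number;
        }
      }
      // The threshold grew by 5 MB per call while the buffer grew by only 4 MB,
      // so part 100 never flushed; everything sits in memory until Close().
      std::cout << "buffered=" << buffered << " threshold=" << threshold << "\n";
    }

Moving the bump next to ++part_number_, as this commit does, ties it to actual part uploads, so it fires exactly once every 100 parts regardless of how the caller sizes its writes.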
31 changes: 16 additions & 15 deletions cpp/src/arrow/filesystem/s3fs.cc
@@ -635,21 +635,6 @@ class ObjectOutputStream : public io::OutputStream {
       return Status::Invalid("Operation on closed stream");
     }
 
-    // With up to 10000 parts in an upload (S3 limit), a stream writing chunks
-    // of exactly 5MB would be limited to 50GB total. To avoid that, we bump
-    // the upload threshold every 100 parts. So the pattern is:
-    // - part 1 to 99: 5MB threshold
-    // - part 100 to 199: 10MB threshold
-    // - part 200 to 299: 15MB threshold
-    // ...
-    // - part 9900 to 9999: 500MB threshold
-    // So the total size limit is 2475000MB or ~2.4TB, while keeping manageable
-    // chunk sizes and avoiding too much buffering in the common case of a small-ish
-    // stream. If the limit's not enough, we can revisit.
-    if (part_number_ % 100 == 0) {
-      part_upload_threshold_ += kMinimumPartUpload;
-    }
-
     if (!current_part_ && nbytes >= part_upload_threshold_) {
       // No current part and data large enough, upload it directly
       // (without copying if the buffer is owned)
@@ -751,7 +736,23 @@ class ObjectOutputStream : public io::OutputStream {
       ++upload_state_->parts_in_progress;
       client_->UploadPartAsync(req, handler);
     }
+
     ++part_number_;
+    // With up to 10000 parts in an upload (S3 limit), a stream writing chunks
+    // of exactly 5MB would be limited to 50GB total. To avoid that, we bump
+    // the upload threshold every 100 parts. So the pattern is:
+    // - part 1 to 99: 5MB threshold
+    // - part 100 to 199: 10MB threshold
+    // - part 200 to 299: 15MB threshold
+    // ...
+    // - part 9900 to 9999: 500MB threshold
+    // So the total size limit is 2475000MB or ~2.4TB, while keeping manageable
+    // chunk sizes and avoiding too much buffering in the common case of a small-ish
+    // stream. If the limit's not enough, we can revisit.
+    if (part_number_ % 100 == 0) {
+      part_upload_threshold_ += kMinimumPartUpload;
+    }
+
     return Status::OK();
   }
 
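As a sanity check on the capacity arithmetic in the comment above, the standalone loop below sums the per-part thresholds under the post-fix schedule, assuming every part is exactly threshold-sized (kMaxParts and the MB-denominated locals are illustrative, not identifiers from s3fs.cc):

    #include <cstdint>
    #include <iostream>

    int main() {
      const int kMaxParts = 10000;  // S3's per-upload part limit
      int64_t threshold_mb = 5;     // kMinimumPartUpload, expressed in MB
      int64_t total_mb = 0;
      for (int part_number = 1; part_number <= kMaxParts; ++part_number) {
        total_mb += threshold_mb;   // each part is at least threshold-sized
        // Mirrors the post-fix bump: it runs once per uploaded part, right
        // after part_number_ is incremented.
        if ((part_number + 1) % 100 == 0) {
          threshold_mb += 5;
        }
      }
      std::cout << total_mb << " MB" << std::endl;  // prints 2525500
    }

That comes to 2525500 MB, about 2.4 TiB, consistent with the comment's ~2.4TB ballpark; the comment's 2475000MB figure presumably differs only in how the first and last 100-part blocks are counted.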
