[C++] Error when writing files to S3 larger than 5 GB #24550

asfimport · 2020-04-07T14:47:28Z

When using only the arrow-cpp library to write a large Arrow table to S3 (resulting in a file larger than 5 GB), I get the following error:

../src/arrow/io/interfaces.cc:219: Error ignored when destroying file of type N5arrow2fs12_GLOBAL__N_118ObjectOutputStreamE: IOError: When uploading part for key 'test01.parquet/part-00.parquet' in bucket 'test': AWS Error [code 100]: Unable to parse ExceptionName: EntityTooLarge Message: Your proposed upload exceeds the maximum allowed size with address : 52.219.100.32

I diagnosed the problem by reading and modifying the code in s3fs.cc. The code uses multipart upload, with 5 MB chunks for the first 100 parts. After it has submitted the first 100 parts, it is supposed to increase the chunk size to 10 MB (the part upload threshold, part_upload_threshold_). The issue is that the threshold is increased inside DoWrite, and DoWrite can be called multiple times before the current part is uploaded. As a result the threshold keeps increasing indefinitely, and the last part ends up exceeding the 5 GB per-part upload limit of AWS/S3.

I'm fairly sure the last part ends up much larger than it should be every time a multipart upload exceeds 100 parts, but the error is only thrown if the last part is larger than 5 GB. The bug is therefore only observed with very large uploads.
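The mechanism described above can be illustrated with a small simulation. This is a simplified sketch, not the actual code in s3fs.cc: only part_number_, part_upload_threshold_, kMinimumPartUpload and DoWrite are names taken from this report, while BuggyStreamSketch, SimulateBuggyUpload and the other members are hypothetical. It assumes every write is buffered and that writes are smaller than the minimum part size.

```cpp
#include <cstdint>
#include <vector>

constexpr int64_t kMiB = 1024 * 1024;
constexpr int64_t kMinimumPartUpload = 5 * kMiB;  // name from the report

// Hypothetical model of the stream; no S3 calls are made.
struct BuggyStreamSketch {
  int part_number_ = 1;
  int64_t part_upload_threshold_ = kMinimumPartUpload;
  int64_t current_part_size = 0;
  std::vector<int64_t> uploaded_part_sizes;  // one entry per "uploaded" part

  void CommitCurrentPart() {
    uploaded_part_sizes.push_back(current_part_size);
    current_part_size = 0;
    ++part_number_;
  }

  void DoWrite(int64_t nbytes) {
    // Runs on every call. While part_number_ stays at 100, each small
    // write raises the threshold again before the part can be committed.
    if (part_number_ % 100 == 0) {
      part_upload_threshold_ += kMinimumPartUpload;
    }
    current_part_size += nbytes;
    if (current_part_size >= part_upload_threshold_) {
      CommitCurrentPart();
    }
  }

  void Close() {
    // Whatever is still buffered becomes the last part.
    if (current_part_size > 0) CommitCurrentPart();
  }
};

// Feeds num_writes writes of write_size bytes each, then closes the stream.
BuggyStreamSketch SimulateBuggyUpload(int num_writes, int64_t write_size) {
  BuggyStreamSketch stream;
  for (int i = 0; i < num_writes; ++i) stream.DoWrite(write_size);
  stream.Close();
  return stream;
}
```

Under these assumptions, SimulateBuggyUpload(2000, kMiB) produces 99 parts of 5 MiB each, and the 100th part absorbs the remaining 1505 MiB because the threshold grows by 5 MiB on every 1 MiB write. With enough data, that last part would cross the 5 GB limit.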

I can confirm that the bug does not happen if I move this:

    if (part_number_ % 100 == 0) {
      part_upload_threshold_ += kMinimumPartUpload;
    }

and do it in a different method, right before the line that does: ++part_number_
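A minimal sketch of that placement, again as a simplified model rather than the actual s3fs.cc code or the change that was eventually merged: FixedStreamSketch, SimulateFixedUpload, LargestPart and the unsuffixed members are hypothetical names, and only part_number_, part_upload_threshold_, kMinimumPartUpload and DoWrite come from this report. The threshold check now runs once per committed part instead of once per write.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int64_t kMiB = 1024 * 1024;
constexpr int64_t kMinimumPartUpload = 5 * kMiB;  // name from the report

// Hypothetical model of the stream; no S3 calls are made.
struct FixedStreamSketch {
  int part_number_ = 1;
  int64_t part_upload_threshold_ = kMinimumPartUpload;
  int64_t current_part_size = 0;
  std::vector<int64_t> uploaded_part_sizes;  // one entry per "uploaded" part

  void CommitCurrentPart() {
    uploaded_part_sizes.push_back(current_part_size);
    current_part_size = 0;
    // Moved here: evaluated exactly once per part, right before the
    // part number is incremented.
    if (part_number_ % 100 == 0) {
      part_upload_threshold_ += kMinimumPartUpload;
    }
    ++part_number_;
  }

  void DoWrite(int64_t nbytes) {
    current_part_size += nbytes;
    if (current_part_size >= part_upload_threshold_) {
      CommitCurrentPart();
    }
  }

  void Close() {
    // Whatever is still buffered becomes the last part.
    if (current_part_size > 0) CommitCurrentPart();
  }
};

// Feeds num_writes writes of write_size bytes each, then closes the stream.
FixedStreamSketch SimulateFixedUpload(int num_writes, int64_t write_size) {
  FixedStreamSketch stream;
  for (int i = 0; i < num_writes; ++i) stream.DoWrite(write_size);
  stream.Close();
  return stream;
}

// Size in bytes of the largest uploaded part (0 if nothing was uploaded).
int64_t LargestPart(const FixedStreamSketch& stream) {
  if (stream.uploaded_part_sizes.empty()) return 0;
  return *std::max_element(stream.uploaded_part_sizes.begin(),
                           stream.uploaded_part_sizes.end());
}
```

In this model, SimulateFixedUpload(2000, kMiB) yields 234 parts, none larger than 15 MiB: the threshold steps up by 5 MiB only after parts 100 and 200 are committed.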

Reporter: Juan Galvez
Assignee: Antoine Pitrou / @pitrou

PRs and other links:

GitHub Pull Request #6864

Note: This issue was originally created as ARROW-8365. Please see the migration documentation for further details.


asfimport · 2020-04-07T14:58:02Z

Antoine Pitrou / @pitrou:
Thanks for the thorough report and diagnosis!

asfimport · 2020-04-07T15:13:04Z

Antoine Pitrou / @pitrou:
[~jjgalvez] I've submitted #6864. Can you check whether the diff looks good to you?

asfimport · 2020-04-07T15:21:52Z

Juan Galvez:
Done. Thanks!

asfimport · 2020-04-07T16:33:28Z

Antoine Pitrou / @pitrou:
Issue resolved by pull request 6864
#6864

asfimport closed this as completed Apr 7, 2020

asfimport assigned pitrou Jan 10, 2023

asfimport added this to the 0.17.0 milestone Jan 11, 2023
