Add S3FileSystem Write support #6021
Conversation
CC: @akashsha1, @paul-amonson, @tigrux for early feedback on the design.
```cpp
request.SetBucket(awsString(bucket_));
request.SetKey(awsString(key_));
auto objectMetadata = client_->HeadObject(request);
VELOX_CHECK(!objectMetadata.IsSuccess(), "File already exists");
```
Should this error message be "S3 object already exists"?
Will fix.
Force-pushed 7ca0135 to 1c9f36b.
Force-pushed 97f2109 to 5b14e7e.
Very clean code.
:)
Force-pushed 5b14e7e to 1417696.
```cpp
VELOX_CHECK(!closed_, "File is closed");
// 'flush' API should trigger uploadPart.
// But upload part if the maximum part size is reached.
if (currentPartSize_ + data.size() > kMaxPartSize) {
```
Should this be `>=`?
`>=` should work according to the documentation. Will do that.
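The buffering behavior discussed in this thread can be sketched as a small self-contained class: bytes accumulate in the current part, and a full part is handed off once the threshold is reached, using the `>=` comparison agreed on above. Names like `PartBuffer` and the `uploadPart` callback are hypothetical stand-ins for illustration, not the actual Velox implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Sketch of append-side part buffering for a multipart upload:
// accumulate bytes and invoke uploadPart for each full part.
class PartBuffer {
 public:
  PartBuffer(size_t partSize, std::function<void(const std::string&)> uploadPart)
      : partSize_(partSize), uploadPart_(std::move(uploadPart)) {}

  void append(const std::string& data) {
    currentPart_ += data;
    // Upload as many full parts as the buffer now holds.
    // Note '>=' rather than '>': a buffer exactly at the part
    // size is uploaded immediately.
    while (currentPart_.size() >= partSize_) {
      uploadPart_(currentPart_.substr(0, partSize_));
      currentPart_.erase(0, partSize_);
      ++partsUploaded_;
    }
  }

  void close() {
    // The last part is allowed to be smaller than partSize_.
    if (!currentPart_.empty()) {
      uploadPart_(currentPart_);
      currentPart_.clear();
      ++partsUploaded_;
    }
  }

  size_t partsUploaded() const { return partsUploaded_; }

 private:
  size_t partSize_;
  std::string currentPart_;
  size_t partsUploaded_ = 0;
  std::function<void(const std::string&)> uploadPart_;
};
```

With a part size of 4 bytes, appending `"abc"` then `"defgh"` uploads `"abcd"` and `"efgh"`, and `close()` flushes any short final part.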
Force-pushed 8978da9 to 7d8716e.
@tigrux @akashsha1 @paul-amonson I did some tests at scale and had to fix the semantics a bit.
Force-pushed 7d8716e to d654cb8.
With the new semantics, I verified that we can now write files larger than 5 GiB.
I also looked at the
Force-pushed e75551e to 29769d5.
The linux-build failure is unrelated.
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Conbench analyzed the 1 benchmark run. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Summary: S3WriteFile uses the Apache Arrow implementation as a reference. The AWS C++ SDK allows streaming writes via the multipart upload API. Multipart upload allows you to upload a single object as a set of parts, where each part is a contiguous portion of the object's data. While AWS and MinIO support different sizes for each part (only requiring a minimum of 5 MB), certain object stores require that every part be exactly equal (except for the last part). We set the part size to 10 MiB, so that in combination with the maximum part count of 10,000, this gives a file size limit of 100,000 MiB (about 98 GiB). Parts can be uploaded independently and in any order; after all parts of the object are uploaded, Amazon S3 assembles them and creates the object. S3WriteFile is not thread-safe. UploadPart is currently synchronous during append. Flush is a no-op, as append handles all the uploads.

Resolves: facebookincubator#4805
Pull Request resolved: facebookincubator#6021
Reviewed By: kgpai
Differential Revision: D49324662
Pulled By: pedroerp
fbshipit-source-id: f26479058f576a63f7d4fe4527b57bd0aa87ab30