rgw: Introduce S3Mirror capability #37212
Conversation
transparently mirrors a bucket of a remote S3 endpoint
@pkoutsov This looks interesting, and joins other efforts around caching in RGW (D3N and D3N++). Ultimately, we'll probably want to at least explore converging these. The current dispatch mechanism in this PR is an area we should discuss further--the feature as a whole looks a lot like what we've imagined an S3 zipper (object api) layer might be doing.
src/rgw/rgw_s3_mirror_sync_tasks.h
Outdated
std::shared_ptr<CustomStreamBuffer> customStreamBuf = nullptr;
std::basic_iostream<char, std::char_traits<char>> *ioStream;
// S3 object properties to dump the headers
ceph::time_detail::real_clock::time_point expires;
Suggested change:
- ceph::time_detail::real_clock::time_point expires;
+ real_clock::time_point expires;
Fixes a compilation error (Fedora 32 / clang 10):
FAILED: src/rgw/CMakeFiles/rgw_common.dir/rgw_op.cc.o
...
In file included from ../src/rgw/rgw_op.cc:77:
In file included from ../src/rgw/rgw_s3_mirror.h:46:
../src/rgw/rgw_s3_mirror_sync_tasks.h:108:3: error: no member named 'real_clock' in namespace 'ceph::time_detail'; did you mean simply 'real_clock'?
ceph::time_detail::real_clock::time_point expires;
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
real_clock
../src/common/ceph_time.h:72:7: note: 'real_clock' declared here
class real_clock {
^
I am not sure why my gcc on debian never complained about that, but thanks for letting me know
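For reference, a minimal sketch of the portable spelling of those members (the bare real_clock form from the suggestion also works if the surrounding code already lives in, or pulls in, namespace ceph); this is illustrative, not a diff from the PR:

```cpp
#include "common/ceph_time.h"  // declares real_clock (see the clang note above)

// Fully qualified form, widely used elsewhere in the Ceph tree.
ceph::real_clock::time_point expires;
ceph::real_clock::time_point last_modified;
```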
src/rgw/rgw_s3_mirror_sync_tasks.h
Outdated
std::basic_iostream<char, std::char_traits<char>> *ioStream;
// S3 object properties to dump the headers
ceph::time_detail::real_clock::time_point expires;
ceph::time_detail::real_clock::time_point last_modified;
Suggested change:
- ceph::time_detail::real_clock::time_point last_modified;
+ real_clock::time_point last_modified;
same as above
ditto
filling data into the cluster with hsbench [1]
Thanks! I will look into that and try to reproduce the segfault using hsbench. Just a quick question about the remote COS in your test: is it another instance of RGW, or an actual public (over-the-network) COS?
With my evaluation scheme (RGW+S3Mirror <-> backend MinIO, put_eval.zip) I couldn't reproduce this behavior. Any hint or tip on your evaluation scheme, so I can try to reproduce it?
Yes, the remote is another instance of RGW on another machine in the same lab.
Possibly it is load-related; in the reproducer environment the load is much higher (3600-3800 IO/s) than in the
Do the RGW+S3Mirror and the plain RGW point to different OSDs, or are both part of the same Ceph cluster?
OK, I tried to reach the same load levels on a much more powerful machine, but I only got as high as ~140 IO/s... I suspect this has to do with the fact that, for debugging purposes, I am setting up the Ceph cluster with vstart.sh. However, this is necessary because from the reported trace I can't tell exactly what went wrong. To this end, what are the specs of the reproducer environment? Also, would it be possible to get a gdb backtrace when the segfault occurs in your reproducer environment, so I can pinpoint the problem?
OK, that is alarming, at least. I would expect a momentary memory-footprint increase because, on a local object miss, S3Mirror calls the remote_download_object function. When the remote object has been downloaded by S3Mirror it stays in memory for as long as it is "hot" and gets written to the OSDs afterwards (when it turns cold). This is done so that any consecutive requests for the same object, for which we just had a local miss, are served as fast as possible (the object is in memory), because we noticed that concurrently writing and reading an OSD can result in performance degradation. That said, this can easily be changed. However, on the put-object path, where S3Mirror reflects the changes back to the remote S3 endpoint, there shouldn't be any memory issue. I will investigate the allocations and de-allocations and try to see this through. Thanks for the feedback.
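To make the described lifecycle concrete, here is a minimal, self-contained sketch of what a hot/cold miss path like the one above could look like. All identifiers (MirrorCache, MirroredObject, remote_download_object, write_to_osds, hot_ttl) are illustrative placeholders, not the actual symbols in this PR:

```cpp
#include <chrono>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Illustrative placeholder for an object fetched from the remote endpoint and
// held in memory after a local miss.
struct MirroredObject {
  std::vector<char> data;          // full object body from the remote S3 endpoint
  Clock::time_point last_access;   // refreshed on every read; decides hot vs. cold
};

class MirrorCache {
 public:
  explicit MirrorCache(std::chrono::seconds hot_ttl) : hot_ttl_(hot_ttl) {}

  // GET path: serve from memory if present; otherwise download the whole object
  // from the remote endpoint and keep it hot in memory.
  std::shared_ptr<MirroredObject> get(const std::string& key) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = objects_.find(key);
    if (it == objects_.end()) {
      auto obj = std::make_shared<MirroredObject>();
      obj->data = remote_download_object(key);   // stub below
      it = objects_.emplace(key, std::move(obj)).first;
    }
    it->second->last_access = Clock::now();
    return it->second;
  }

  // Maintenance pass: objects not read for hot_ttl_ turn cold, get written to
  // the OSDs, and are dropped from memory.
  void flush_cold() {
    std::lock_guard<std::mutex> l(lock_);
    const auto now = Clock::now();
    for (auto it = objects_.begin(); it != objects_.end();) {
      if (now - it->second->last_access > hot_ttl_) {
        write_to_osds(it->first, it->second->data);   // stub below
        it = objects_.erase(it);
      } else {
        ++it;
      }
    }
  }

 private:
  // Stubs standing in for the real remote fetch and RADOS write.
  std::vector<char> remote_download_object(const std::string&) { return {}; }
  void write_to_osds(const std::string&, const std::vector<char>&) {}

  std::chrono::seconds hot_ttl_;
  std::mutex lock_;
  std::unordered_map<std::string, std::shared_ptr<MirroredObject>> objects_;
};
```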
I repeated the stress test with valgrind attached and, except for one missing shutdown of the AWS SDK, which I implemented and pushed, I didn't come across anything alarming (valgrind-out.zip). Probably this has to do with the high load you are able to achieve with your stress test. Moreover, in my environment I am using AWS SDK v1.8.0 because I had some weird issues with later releases. Also, since you achieve ~3500 op/sec, this limit here might be what is causing these issues. Would it be possible to re-check with v1.8.0 and with this limit set somewhat above your op/sec rate (e.g. 4096)? Thanks.
Cannot test-build aws-sdk v1.8.0 - it fails its compilation tests.
It builds fine at a later aws-sdk version. Next I will try to tune the suggested limit (Line 89 in d0b63d7) and update with memory utilization results. (If it turns out to have an effect, I would suggest making it configurable via ceph.conf.)
Interestingly, I compiled aws-sdk successfully with the following command on Debian for x86 and ppc64le, but I will update to 1.8.67 and repeat the valgrind execution to check for any apparent memory leaks.
Thanks for the time you are investing in this; of course we can make this limit configurable through ceph.conf.
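As a rough sketch of what the ceph.conf route could look like: assuming a hypothetical option named rgw_s3mirror_max_outstanding_requests is first registered alongside the other rgw_* options, it could then be read wherever a CephContext is available. Both the option name and the default below are placeholders, not something this PR currently defines:

```cpp
#include "common/ceph_context.h"

// Hypothetical option; it would need to be registered (e.g. as TYPE_UINT with a
// sensible default such as 1024) next to the existing rgw_* options before use.
uint64_t get_s3mirror_request_limit(CephContext* cct) {
  // get_val<T>() returns the value currently set via ceph.conf or `ceph config set`.
  return cct->_conf.get_val<uint64_t>("rgw_s3mirror_max_outstanding_requests");
}
```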
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
un-stale, at least for now
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
This PR implements the S3Mirror capability for RadosGW. S3Mirror turns Ceph, through RadosGW, into an intermediate cache that sits between clients and a remote S3 COS and transparently serves the workload at hand. We found that such an intermediate cache can significantly improve data locality and, by extension, provide performance gains.
Some high-level key features of this PR are the following:
We validated the operation of this PR with TensorFlow2 (TF2) and Resnet:
Besides the significant gains we observe for epochs 2+, where our cache is hot and holds all objects locally, we also observe significant gains for epoch 1, where the cache is cold. This happens because this PR issues only a single request to fetch each object, whereas the TF2 built-in S3 access mechanism dissects and fetches objects in multiple small chunks, so the per-request HTTP overhead becomes measurable. By contrast, when the objects are particularly small (on the scale of KB), as is the case for MLPerf Object Detection, we observe an 8% overhead with a cold cache, but again see performance gains for the remaining training epochs, where the cache is hot.
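To illustrate the "single request vs. many small chunks" point, below is a sketch using the AWS SDK for C++ (the SDK this PR links against). The endpoint, bucket, and key are made-up values, and this is not the PR's actual download code:

```cpp
#include <aws/core/Aws.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <iostream>

int main() {
  Aws::SDKOptions options;
  Aws::InitAPI(options);
  {
    Aws::Client::ClientConfiguration cfg;
    cfg.endpointOverride = "remote-cos.example.com";  // placeholder remote S3/COS endpoint
    Aws::S3::S3Client client(cfg);

    // What S3Mirror does on a local miss: one full-object GET, so the fixed
    // per-request HTTP overhead is paid once per object.
    Aws::S3::Model::GetObjectRequest whole;
    whole.SetBucket("mirrored-bucket");
    whole.SetKey("dataset/train-00001.tfrecord");
    auto outcome = client.GetObject(whole);
    if (outcome.IsSuccess()) {
      std::cout << "fetched " << outcome.GetResult().GetContentLength() << " bytes\n";
    }

    // What a chunked reader (e.g. TF2's built-in S3 filesystem) effectively does:
    // many ranged GETs, paying that overhead once per chunk.
    Aws::S3::Model::GetObjectRequest chunk;
    chunk.SetBucket("mirrored-bucket");
    chunk.SetKey("dataset/train-00001.tfrecord");
    chunk.SetRange("bytes=0-1048575");  // first 1 MiB; a real reader repeats this per chunk
    client.GetObject(chunk);
  }
  Aws::ShutdownAPI(options);
  return 0;
}
```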
Moreover, this PR is cloud-ready, as it is fully compatible with and tested against Rook. This was done in the context of the Dataset Lifecycle Framework, as a cache plugin where Ceph, with this PR in place, is deployed on the fly and acts as the aforementioned intermediate cache.
Please have a look and provide feedback on this PR.
PS: this PR requires the AWS library to be installed system-wide. We tried to introduce it as a git submodule and include it through Ceph's current CMake procedure, but the project is not compatible as-is (more details here). However, we are willing to try and support this if this PR is found to be interesting.
Signed-off-by: Panagiotis Koutsovasilis <koutsovasilis.panagiotis1@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Acknowledgment
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825061, H2020 Evolve.