
rgw: Introduce S3Mirror capability #37212

Closed
wants to merge 3 commits

Conversation


@pkoutsov pkoutsov commented Sep 17, 2020

This PR implements the S3Mirror capability for RadosGW. S3Mirror turns Ceph, through RadosGW, into an intermediate cache that sits between clients and a remote S3 COS and transparently serves the workload at hand. We found that such an intermediate cache can significantly improve data locality and, by extension, provide performance gains.

Some high-level key features of this PR are the following:

  • Serve clients on the fly while downloading objects from the remote S3 COS (cache miss)
  • Take action only when RadosGW reports an error because the object is missing (cache miss)
  • Asynchronously reflect changes to the remote S3 COS
  • Reflect changes to the remote S3 COS even if the object is not locally present, e.g. deletion of an object
  • Introduce a simple hotness-accounting mechanism for downloaded objects before writing them locally. This way we can serve imminent short-term requests from memory and avoid thrashing our OSDs with increased concurrent writes and reads (a rough sketch of the cache-miss flow follows this list).
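To make the cache-miss flow concrete, below is a minimal, hypothetical sketch (not the PR's actual code) of how a missing object could be fetched from the remote COS with the AWS SDK for C++ and buffered in memory, so it can be served immediately and written to the OSDs later. Only the function name remote_download_object is taken from this discussion; everything else is illustrative.

// Hypothetical sketch: serve a local GET miss by downloading from the remote S3 COS.
// remote_download_object is mentioned later in this thread; the rest is illustrative.
#include <aws/core/Aws.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <sstream>
#include <string>

// Called only when RadosGW reports that the object is missing locally (cache miss).
// On success the object body is buffered in memory so it can be streamed to the
// client right away and flushed to the OSDs once it turns cold.
bool remote_download_object(Aws::S3::S3Client& remote_client,
                            const std::string& bucket,
                            const std::string& key,
                            std::string& out_body)
{
  Aws::S3::Model::GetObjectRequest req;
  req.SetBucket(bucket.c_str());
  req.SetKey(key.c_str());

  auto outcome = remote_client.GetObject(req);
  if (!outcome.IsSuccess()) {
    return false;  // the remote COS does not have it either; report the original error
  }

  std::ostringstream buf;
  buf << outcome.GetResult().GetBody().rdbuf();  // keep the object in memory while "hot"
  out_body = buf.str();
  return true;
}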

We validated the operation of this PR with TensorFlow2 (TF2) and Resnet:

1 epoch (cold cache): 3697s
2 epoch (hot cache): 1633s
3 epoch (hot cache): 1634s

1 epoch (no cache): 8738s
2 epoch (no cache): 8546s
3 epoch (no cache): 8774s

Besides the significant gains we observe for epochs 2+, where the cache is hot and contains all objects locally, we also observe significant gains for epoch 1, where the cache is cold. This happens because this PR issues only a single request to fetch each object, whereas the built-in (TF2) S3 access mechanism dissects objects and fetches them in multiple small chunks, so the overhead of the HTTP requests becomes measurable. On the contrary, when the objects are particularly small (on the scale of KB), as is the case for MLPerf Object Detection, we observe an 8% overhead for the cold cache, but for the remaining training epochs, where the cache is hot, we again observe performance gains.

Moreover, this PR is cloud-computing ready, as it is 100% compatible with and tested against Rook. This was done in the context of the Dataset Lifecycle Framework, where Ceph with this PR in place is deployed on the fly as a cache plugin and acts as the aforementioned intermediate cache.

Please have a look and provide feedback on this PR.

PS: this PR requires the AWS SDK to be installed system-wide. We tried to introduce it as a git submodule and include it through the current CMake procedure of Ceph, but the project is not compatible as-is (more details here). However, we are willing to try and support this if the PR is found to be interesting.

Signed-off-by: Panagiotis Koutsovasilis <koutsovasilis.panagiotis1@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>

Acknowledgment
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825061, H2020 Evolve.

transparently mirrors a bucket of a remote S3
endpoint
@mattbenjamin
Contributor

@pkoutsov This looks interesting, and joins other efforts around caching in RGW (D3N and D3N++). Ultimately, we'll probably want to at least explore converging these. The current dispatch mechanism in this PR is an area we should discuss further--the feature as a whole looks a lot like what we've imagined an S3 zipper (object api) layer might be doing.

std::shared_ptr<CustomStreamBuffer> customStreamBuf = nullptr;
std::basic_iostream<char, std::char_traits<char>> *ioStream;
// S3 object properties to dump the headers
ceph::time_detail::real_clock::time_point expires;
Contributor

@mkogan1 mkogan1 Oct 15, 2020

Suggested change
- ceph::time_detail::real_clock::time_point expires;
+ real_clock::time_point expires;

fix compilation error (fedora 32/clang 10):

FAILED: src/rgw/CMakeFiles/rgw_common.dir/rgw_op.cc.o
...
In file included from ../src/rgw/rgw_op.cc:77:
In file included from ../src/rgw/rgw_s3_mirror.h:46:
../src/rgw/rgw_s3_mirror_sync_tasks.h:108:3: error: no member named 'real_clock' in namespace 'ceph::time_detail'; did you mean simply 'real_clock'?
  ceph::time_detail::real_clock::time_point expires;
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  real_clock
../src/common/ceph_time.h:72:7: note: 'real_clock' declared here
class real_clock {
      ^

I am not sure why my gcc on debian never complained about that, but thanks for letting me know

std::basic_iostream<char, std::char_traits<char>> *ioStream;
// S3 object properties to dump the headers
ceph::time_detail::real_clock::time_point expires;
ceph::time_detail::real_clock::time_point last_modified;
Contributor

@mkogan1 mkogan1 Oct 15, 2020

Suggested change
- ceph::time_detail::real_clock::time_point last_modified;
+ real_clock::time_point last_modified;

same as above

ditto

@mkogan1
Contributor

mkogan1 commented Oct 15, 2020

filling data into the cluster with hsbench [1]
a segmentation fault occurred on the S3Mirror rgw (not the backend rgw)
during PUT of 100K objects of 16KB size using 8 threads into the cached bucket
~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 16K -d -1 -t 8 -b 1 -n 100000 -m cxip -bp cachebkt
at object # 34060

2020-10-15T16:48:59.379+0300 7f5b2bf50700  1 beast: 0x7f5b2974aa70: 127.0.0.1 - - [2020-10-15T16:48:59.379751+0300] "PUT /cachebkt000000000000/000000034060 HTTP/1.1" 200 16384 - "aws-sdk-go>
2020-10-15T16:48:59.379+0300 7f5b2a74d700  1 ====== starting new request req=0x7f5b2974aa70 =====
2020-10-15T16:48:59.380+0300 7f5b2ff58700  1 ====== starting new request req=0x7f5b296c9a70 =====
2020-10-15T16:48:59.383+0300 7f5b2af4e700 -1 *** Caught signal (Aborted) **
 in thread 7f5b2af4e700 thread_name:radosgw

 ceph version 16.0.0-5570-gd0b63d7a5e (d0b63d7a5e3f4e814c0ffd5b410a053a10a284d4) pacific (dev)
 1: (()+0x14a90) [0x7f5b5a2afa90]
 2: (gsignal()+0x145) [0x7f5b59db39e5]
 3: (abort()+0x127) [0x7f5b59d9c895]
 4: (()+0x9e961) [0x7f5b59ffc961]
 5: (()+0xaa44c) [0x7f5b5a00844c]
 6: (()+0xaa4b7) [0x7f5b5a0084b7]
 7: (()+0xaa43f) [0x7f5b5a00843f]
 8: (spawn::detail::continuation_context::resume()+0x79) [0x7f5b63e23119]
 9: (boost::asio::detail::executor_function<ceph::async::ForwardingHandler<ceph::async::CompletionHandler<spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::e>
 10: (boost::asio::detail::executor_op<boost::asio::executor::function, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_op>
 11: (boost::asio::detail::executor_op<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::executor_type const>, std::allocator<void>, boost::asio::detail::schedu>
 12: (boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const>
 13: (boost::asio::detail::scheduler::run(boost::system::error_code&)+0x124) [0x7f5b63e0fc14]
 14: (()+0x56f65f) [0x7f5b63e0c65f]
 15: (()+0xd8b84) [0x7f5b5a036b84]
 16: (()+0x9432) [0x7f5b5a2a4432]
 17: (clone()+0x43) [0x7f5b59e78913]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

[1] https://github.com/markhpc/hsbench

@pkoutsovasilis

Thanks! I will look into that and try to reproduce the segfault using hsbench. Just a quick question about the remote COS in your test: is it another instance of RGW, or an actual public (over-the-network) COS?

@pkoutsov
Author

filling data into the cluster with hsbench [1]
a segmentation fault occurred on the S3Mirror rgw (not the backend rgw)
during PUT of 100K objects of 16KB size using 8 threads into the cached bucket
~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 16K -d -1 -t 8 -b 1 -n 100000 -m cxip -bp cachebkt
at object # 34060

[…]

With my evaluation scheme, RGW+S3Mirror <-> backend MinIO (put_eval.zip), I couldn't reproduce this behavior. Any hint/tip on the evaluation scheme so I can reproduce it?

@mkogan1
Contributor

mkogan1 commented Oct 25, 2020

Thanks! I will look into that and try to reproduce the segfault using hsbench. Just a quick question about the remote COS in your test: is it another instance of RGW, or an actual public (over-the-network) COS?

yes, the remote is another instance of RGW on another machine in the same lab.

@mkogan1
Contributor

mkogan1 commented Oct 25, 2020

With my evaluation scheme, RGW+S3Mirror <-> backend MinIO (put_eval.zip), I couldn't reproduce this behavior. Any hint/tip on the evaluation scheme so I can reproduce it?

Possibly it is load-related: in the reproducer environment the load is much higher (3600-3800 IO/s) than in put_eval.zip (15-30 IO/s).

~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 16K -d -1 -t 8 -b 1 -n 100000 -m cxip -bp cachebkt
...
2020/10/15 16:48:53 Loop: 0, Int: 2, Dur(s): 1.0, Mode: PUT, Ops: 3965, MB/s: 61.95, IO/s: 3965, Lat(ms): [ min: 1.5, avg: 2.0,                                                               
2020/10/15 16:48:54 Loop: 0, Int: 3, Dur(s): 1.0, Mode: PUT, Ops: 3839, MB/s: 59.98, IO/s: 3839, Lat(ms): [ min: 1.5, avg: 2.1,                                                               
2020/10/15 16:48:55 Loop: 0, Int: 4, Dur(s): 1.0, Mode: PUT, Ops: 3868, MB/s: 60.44, IO/s: 3868, Lat(ms): [ min: 1.5, avg: 2.1,                                                               
2020/10/15 16:48:56 Loop: 0, Int: 5, Dur(s): 1.0, Mode: PUT, Ops: 3731, MB/s: 58.30, IO/s: 3731, Lat(ms): [ min: 1.5, avg: 2.1,                                                               
2020/10/15 16:48:57 Loop: 0, Int: 6, Dur(s): 1.0, Mode: PUT, Ops: 3662, MB/s: 57.22, IO/s: 3662, Lat(ms): [ min: 1.5, avg: 2.2,                                                               
2020/10/15 16:48:58 Loop: 0, Int: 7, Dur(s): 1.0, Mode: PUT, Ops: 3670, MB/s: 57.34, IO/s: 3670, Lat(ms): [ min: 1.5, avg: 2.2, 
less put_eval.out
...
2020/10/19 17:09:51 Loop: 0, Int: 15, Dur(s): 1.0, Mode: PUT, Ops: 27, MB/s: 0.42, IO/s: 27, Lat(ms): [ min: 175.9, avg: 454.6, 99%: 1209.4, max: 1209.4 ], Slowdowns: 0
2020/10/19 17:09:52 Loop: 0, Int: 16, Dur(s): 1.0, Mode: PUT, Ops: 19, MB/s: 0.30, IO/s: 19, Lat(ms): [ min: 185.1, avg: 399.2, 99%: 691.7, max: 691.7 ], Slowdowns: 0
2020/10/19 17:09:53 Loop: 0, Int: 17, Dur(s): 1.0, Mode: PUT, Ops: 19, MB/s: 0.30, IO/s: 19, Lat(ms): [ min: 183.8, avg: 471.9, 99%: 829.9, max: 829.9 ], Slowdowns: 0
2020/10/19 17:09:54 Loop: 0, Int: 18, Dur(s): 1.0, Mode: PUT, Ops: 18, MB/s: 0.28, IO/s: 18, Lat(ms): [ min: 217.6, avg: 371.4, 99%: 500.4, max: 500.4 ], Slowdowns: 0

@pkoutsov
Author

pkoutsov commented Oct 27, 2020

yes, the remote is another instance of RGW on another machine in the same lab.

Do the RGW+S3Mirror and the plain RGW point to different OSDs, or are both part of the same Ceph cluster?

@pkoutsov
Author

pkoutsov commented Oct 27, 2020

Possibly it is load-related: in the reproducer environment the load is much higher (3600-3800 IO/s) than in put_eval.zip (15-30 IO/s).

2020/10/27 09:30:29 Loop: 0, Int: 8, Dur(s): 1.0, Mode: PUT, Ops: 141, MB/s: 2.20, IO/s: 141, Lat(ms): [ min: 67.3, avg: 232.9, 99%: 1252.4, max: 1270.0 ], Slowdowns: 0
2020/10/27 09:30:30 Loop: 0, Int: 9, Dur(s): 1.0, Mode: PUT, Ops: 138, MB/s: 2.16, IO/s: 138, Lat(ms): [ min: 65.6, avg: 230.8, 99%: 1236.2, max: 1307.2 ], Slowdowns: 0
2020/10/27 09:30:31 Loop: 0, Int: 10, Dur(s): 1.0, Mode: PUT, Ops: 148, MB/s: 2.31, IO/s: 148, Lat(ms): [ min: 64.6, avg: 240.0, 99%: 1147.9, max: 1164.6 ], Slowdowns: 0

OK, I tried to reach the same load levels on a much more powerful machine, but I only got as high as ~140 IO/s... I suspect this has to do with the fact that, for debugging purposes, I am setting up the Ceph cluster with vstart.sh. However, this is necessary because I can't tell exactly what went wrong from the reported trace. To this end, what are the specs of the reproducer environment? Also, would it be possible to get a gdb backtrace when the segfault occurs in your reproducer environment, so I can pinpoint the problem?

@mkogan1
Contributor

mkogan1 commented Oct 27, 2020

To this end, what are the specs of the reproducer environment? Also, would it be possible to get a gdb backtrace when the segfault occurs in your reproducer environment, so I can pinpoint the problem?

While working on generating a core file, I noticed that its size was unusually large. This prompted checking the memory footprint, and I discovered that it is very large when S3Mirror is enabled:
with S3Mirror enabled and the aforementioned hsbench workload, radosgw takes 256GB of virtual memory (the server has 128GB of physical RAM).

Versus the memory utilization of the same radosgw with S3Mirror not enabled and the same hsbench workload: radosgw virtual memory usage is 1GB.

@pkoutsov
Author

pkoutsov commented Oct 27, 2020

with S3Mirror enabled and the aforementioned hsbench workload, radosgw takes 256GB of virtual memory (the server has 128GB of physical RAM).

OK, that is alarming, to say the least.

I would expect a momentary memory-footprint increase because, in case of a local object miss, S3Mirror calls the remote_download_object function. When the remote object is downloaded by S3Mirror, it stays in memory as long as it is "hot" and gets written to the OSDs afterwards (when it turns cold). This is a way to serve any consecutive requests for the same object, for which we just had a local miss, as fast as possible (the object is in memory), because we noticed that concurrent writing and reading of an OSD can result in performance degradation. That said, this can easily be changed.

However, on the put-object path, where S3Mirror reflects the changes back to the remote S3 endpoint, there shouldn't be any memory issue. I will investigate the allocations and de-allocations and try to see this through. Thanks for the feedback.
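For clarity, here is a minimal, hypothetical sketch of the hotness accounting described above; it is not the PR's actual bookkeeping. An object stays in memory while it is "hot" (recently requested) and is flushed to the OSDs once it turns cold, via a writer callback supplied by the caller.

#include <chrono>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

struct CachedObject {
  std::string data;                                   // in-memory object body
  std::chrono::steady_clock::time_point last_access;  // drives the hot/cold decision
  uint64_t hits = 0;
};

class HotnessCache {
public:
  explicit HotnessCache(std::chrono::seconds cold_after) : cold_after_(cold_after) {}

  // Record a hit; returns true if the object is still held in memory.
  bool touch(const std::string& key) {
    std::lock_guard<std::mutex> l(mtx_);
    auto it = objs_.find(key);
    if (it == objs_.end()) return false;
    it->second.last_access = std::chrono::steady_clock::now();
    ++it->second.hits;
    return true;
  }

  // Keep a freshly downloaded object in memory instead of writing it immediately.
  void insert(const std::string& key, std::string body) {
    std::lock_guard<std::mutex> l(mtx_);
    objs_.insert_or_assign(key, CachedObject{std::move(body),
                                             std::chrono::steady_clock::now(), 1});
  }

  // Called periodically: objects that have not been requested for cold_after_
  // are handed to write_to_osds and dropped from memory, so hot requests are
  // served from RAM and never contend with OSD writes.
  template <typename Writer>
  void flush_cold(Writer&& write_to_osds) {
    std::lock_guard<std::mutex> l(mtx_);
    const auto now = std::chrono::steady_clock::now();
    for (auto it = objs_.begin(); it != objs_.end();) {
      if (now - it->second.last_access > cold_after_) {
        write_to_osds(it->first, it->second.data);
        it = objs_.erase(it);
      } else {
        ++it;
      }
    }
  }

private:
  std::chrono::seconds cold_after_;
  std::mutex mtx_;
  std::unordered_map<std::string, CachedObject> objs_;
};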

@pkoutsov
Author

I repeated the stress test with valgrind attached and, except for one missing shutdown of the AWS SDK, which I implemented and pushed, I didn't come across anything alarming (valgrind-out.zip). Probably this has to do with the high load you are able to achieve with your stress test.

Moreover, in my environment I am using AWS SDK v1.8.0 because I had some weird issues with later releases. Also, since you achieve ~3500 ops/sec, this limit here might be causing these issues.

Would it be possible to re-check with v1.8.0 and with this limit set somewhat above your ops/sec (e.g. 4096)? Thanks.

@mkogan1
Contributor

mkogan1 commented Nov 2, 2020

Would it be possible to re-check with v1.8.0 and with this limit set somewhat above your ops/sec (e.g. 4096)?

Cannot test-build aws-sdk-cpp v1.8.0 - it fails its compilation tests:

git switch -d 1.8.0
mkdir ./build ; cd ./build
cmake ../ -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_ONLY="s3" -DBUILD_SHARED_LIBS=OFF
make -j 8
...
[----------] Global test environment tear-down 
[==========] 373 tests from 67 test cases ran. (6381 ms total) 
[  PASSED  ] 372 tests. 
[  FAILED  ] 1 test, listed below: 
[  FAILED  ] AES_GCM_TEST.AES_GCM_256_KAT_1 

 1 FAILED TEST 
...
make: *** [Makefile:150: all] Error 2

Builds fine at aws-sdk-cpp version 1.8.67 (git switch -d 1.8.67).

Next, I will try to tune the suggested limit at:

config.maxConnections = 1024;

and update with the memory-utilization results. (If it has an effect, I would suggest making it configurable via ceph.conf.)
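A minimal, hypothetical sketch of what that could look like, assuming a new option were added to Ceph's option table (the name rgw_s3mirror_max_connections is made up here; neither the option nor this helper exists in the PR as posted):

#include <aws/core/client/ClientConfiguration.h>
#include "common/ceph_context.h"

// Build the AWS client configuration for S3Mirror, reading the connection limit
// from ceph.conf instead of hard-coding 1024, so deployments that sustain several
// thousand ops/sec can raise it without recompiling.
Aws::Client::ClientConfiguration make_s3mirror_client_config(CephContext* cct)
{
  Aws::Client::ClientConfiguration config;
  config.maxConnections = static_cast<unsigned>(
      cct->_conf.get_val<uint64_t>("rgw_s3mirror_max_connections"));
  return config;
}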

@pkoutsov
Author

pkoutsov commented Nov 2, 2020

Interestingly, I compiled the AWS SDK successfully with the following commands on Debian for x86 and ppc64le, but I will update to 1.8.67 and repeat the valgrind run to check for any apparent memory leaks.

git clone https://github.com/aws/aws-sdk-cpp.git && \
    cd aws-sdk-cpp && \
    git fetch --all --tags && \
    git checkout 1.8.0 && \
    mkdir build && cd ./build && \
    cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3" \
             -DBUILD_SHARED_LIBS=OFF && \
    make -j`nproc`

Thanks for the time you are investing in this; of course we can make this limit configurable through ceph.conf.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@stale

stale bot commented Jul 21, 2021

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jul 21, 2021
@djgalloway djgalloway changed the base branch from master to main July 8, 2022 00:00
@github-actions github-actions bot removed the stale label Jul 15, 2022
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 13, 2022
@mattbenjamin
Contributor

un-stale, at least for now

@github-actions github-actions bot removed the stale label Sep 13, 2022
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Nov 12, 2022
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Dec 12, 2022