New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: fix not initialized vars which cause rgw crash with ec data pool #16177

Merged
merged 1 commit into from Jul 10, 2017

Conversation

Projects
None yet
5 participants
@agutikov
Contributor

agutikov commented Jul 6, 2017

Fix crash of rgw that uses erasure-coded buckets.data pool.

For ec pools alignment = osd_pool_erasure_code_stripe_unit * data_chunk_count.
(Not sure that is correct, but that is what I've observed)

So, for example for
rgw_obj_stripe_size=4M
rgw_max_chunk_size=4M
osd_pool_erasure_code_stripe_unit=4k
data_chunk_count=3 (k=3, m=2)
RGWRados::get_max_chunk_size(const rgw_pool& pool, uint64_t *max_chunk_size) returns max_chunk_size = 4M-4k

Then if uploading to S3 not multipart object with size=16M

While rgw_obj_stripe_size=4M
That leads (somehow) to entering while (pending_data_bl.length()) {
in int RGWPutObjProcessor_Atomic::complete_writing_data():2710

And there not initialized void *handle leads to invalid pointer librados::AioCompletion::pc
which leads to rgw crash

1: (()+0x1fcee2) [0x55f98c2e0ee2]
2: (()+0x11390) [0x7f43341c8390]
3: (Mutex::Lock(bool)+0xd) [0x7f432b5d1a2d]
4: (librados::AioCompletion::wait_for_safe()+0x15) [0x7f4334425995]
5: (RGWRados::aio_wait(void*)+0x11) [0x55f98c3efd81]
6: (RGWPutObjProcessor_Aio::wait_pending_front()+0x4c) [0x55f98c3fc51c]
7: (RGWPutObjProcessor_Aio::drain_pending()+0x20) [0x55f98c3fc5a0]
8: (RGWPutObjProcessor_Atomic::complete_writing_data()+0x30d) [0x55f98c40f6bd]
9: (RGWPutObjProcessor_Atomic::do_complete(unsigned long, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, ceph::buffer::list> > >&, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, char const, char const*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const*, std::set<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > >)+0x67) [0x55f98c440037]
10: (RGWPutObjProcessor::complete(unsigned long, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >
, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, ceph::buffer::list> > >&, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const*, std::set<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > >)+0x22) [0x55f98c3eea52]
11: (RGWPutObj::execute()+0x22a9) [0x55f98c3be659]
12: (rgw_process_authenticated(RGWHandler_REST
, RGWOp*&, RGWRequest*, req_state*, bool)+0x165) [0x55f98c3e8eb5]
13: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*)+0x1abc) [0x55f98c3ead5c]
14: (RGWCivetWebFrontend::process(mg_connection*)+0x371) [0x55f98c299711]
15: (()+0x1ee2a9) [0x55f98c2d22a9]
16: (()+0x1efc79) [0x55f98c2d3c79]
17: (()+0x76ba) [0x7f43341be6ba]
18: (clone()+0x6d) [0x7f4329c453dd]

Fixes: http://tracker.ceph.com/issues/20542
Signed-off-by: Aleksei Gutikov aleksey.gutikov@synesis.ru

@joscollin

LGTM

@mattbenjamin

This comment has been minimized.

Contributor

mattbenjamin commented Jul 7, 2017

@agutikov could you create a tracker ticket for this PR? this will help with backports and downstream coordination; (commit msg should include Fixes: ); tx!

@agutikov

This comment has been minimized.

Contributor

agutikov commented Jul 7, 2017

@mattbenjamin

This comment has been minimized.

Contributor

mattbenjamin commented Jul 7, 2017

@agutikov tx!!

@yehudasa yehudasa self-assigned this Jul 7, 2017

@yehudasa

This comment has been minimized.

Member

yehudasa commented Jul 7, 2017

@agutikov is this fixing the original hang that you describe in the tracker?

@agutikov

This comment has been minimized.

Contributor

agutikov commented Jul 7, 2017

@yehudasa No, this is different issue.

@yehudasa

@agutikov there are a few extra initializations here that I think are not needed. If you think there is an issue with librados not initializing these variables in certain cases then we need to fix librados.

@@ -2729,7 +2729,7 @@ librados::AioCompletion *librados::Rados::aio_create_completion(void *cb_arg,
callback_t cb_complete,
callback_t cb_safe)
{
AioCompletionImpl *c;
AioCompletionImpl *c = nullptr;

This comment has been minimized.

@yehudasa

yehudasa Jul 7, 2017

Member

@agutikov this should be initialized by the next call

@@ -2683,7 +2683,7 @@ int RGWPutObjProcessor_Atomic::complete_writing_data()
obj_len = (uint64_t)first_chunk.length();
}
while (pending_data_bl.length()) {
void *handle;
void *handle = nullptr;

This comment has been minimized.

@yehudasa

yehudasa Jul 7, 2017

Member

@agutikov can have this initialization here, but should modify the following write_data() to initialize it by default to avoid any other similar issues later?

This comment has been minimized.

@agutikov
@@ -3309,7 +3309,7 @@ int RGWRados::get_required_alignment(const rgw_pool& pool, uint64_t *alignment)
return r;
}
bool requires;
bool requires = false;

This comment has been minimized.

@yehudasa

yehudasa Jul 7, 2017

Member

@agutikov should be initialized by the following call, if it doesn't then we need to fix librados

@@ -3322,7 +3322,7 @@ int RGWRados::get_required_alignment(const rgw_pool& pool, uint64_t *alignment)
return 0;
}
uint64_t align;
uint64_t align = 0;

This comment has been minimized.

@yehudasa

yehudasa Jul 7, 2017

Member

@agutikov if it's not initialized by the next call then the fix should be in librados

@agutikov

This comment has been minimized.

Contributor

agutikov commented Jul 7, 2017

@yehudasa I see crashes of rgws on multinode vagrant environment with ceph from deb packages, and hanging in development environment (vstart.sh)

@yehudasa

This comment has been minimized.

Member

yehudasa commented Jul 7, 2017

@agutikov so it's two different symptoms of the same bug?

rgw: fix not initialized pointer which cause rgw crash with ec data pool
In RGWPutObjProcessor_Atomic::complete_writing_data()
with pending_data_bl.length() > 0 and next_part_ofs==data_ofs
not initialized void *handle leads to invalid pointer librados::AioCompletion::pc
which leads to rgw crash.

Fixes: http://tracker.ceph.com/issues/20542
Signed-off-by: Aleksei Gutikov <aleksey.gutikov@synesis.ru>
@agutikov

This comment has been minimized.

Contributor

agutikov commented Jul 7, 2017

@yehudasa Probably they are related, but not same, handle initialization fix crash, but not fix hang.

@yuriw

This comment has been minimized.

@yehudasa yehudasa merged commit c7e869c into ceph:master Jul 10, 2017

4 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

@agutikov agutikov deleted the Synesis-LLC:fix_rgw_crash_ec_data_pool branch Jul 11, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment