New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
common/buffer: fix stack corruption in rebuild_aligned_size_and_memory() #42112
Conversation
Hi @rzarzynski , in the process of fixing the RBD bug, I found there was a problem with the common buffer. I have written a test case to reproduce this problem. Can you have a look at it. |
CC @idryomov , new issue about pwl/ssd, this to fix https://tracker.ceph.com/issues/51419. |
CC @majianpeng |
@CongMinYin could you summarize the change you made in the commit in the title. see https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst#commit-title "fix SIGABRT" is the effect of this change, but it does not describe the change. |
Yes, please. This needs a detailed explanation of what is wrong with the Separately, I would appreciate if you describe the RBD bug that led you to discovering this. |
use our test case w/o this pr;
in func rebuild_aligned_size_and_memory, it will corrupt stack. This because |
src/common/buffer.cc
Outdated
@@ -1645,7 +1645,6 @@ static ceph::spinlock debug_lock; | |||
curbuf = _buffers.erase_after(curbuf_prev); | |||
_carriage->set_offset(_carriage->offset() + _carriage->length()); | |||
_carriage->set_length(0); | |||
_buffers.push_back(*_carriage); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if we really need to drop this line, the comment above should be reworked as well. Also, I'm worried the patch introduces a memory leak as erase_after()
doesn't dispose the node.
In the meantime, I'm joining the folks asking for more descriptive commit message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if we really need to drop this line, the comment above should be reworked as well. Also, I'm worried the patch introduces a memory leak as
erase_after()
doesn't dispose the node.In the meantime, I joining the folks asking for more descriptive commit message.
I'm not sure about the solution, but the problem does exist. see the issue describe by Jianpeng
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CongMinYin, @majianpeng: I'm not questioning that at all. :-) Just arguing about the way how we fix it.
a)bl has a ptr which length is zero.
This is still valid and can (rarely) happen because of how _carriage
and the non-pointer append()
variants of contiguous_appender
interact:
class contiguous_appender {
// ...
void append(const bufferlist& l) {
if (deep) {
// ...
} else {
flush_and_continue();
bl.append(l);
space = bl.obtain_contiguous_space(0);
out_of_band_offset += l.length();
}
}
My impression is we need to take a look on rebuild_aligned_size_and_memory()
. I can help with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, take a look on the BufferListIterator,iterate_with_empties
testcase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also look forward to discussing with you and seeking the most appropriate solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think the fix could look like this one:
diff --git a/src/common/buffer.cc b/src/common/buffer.cc
index 548abc458fd..9c76cde6eed 100644
--- a/src/common/buffer.cc
+++ b/src/common/buffer.cc
@@ -1258,8 +1258,12 @@ static ceph::spinlock debug_lock;
buffer::create_aligned(unaligned._len, align_memory)));
had_to_rebuild = true;
}
- _buffers.insert_after(p_prev, *ptr_node::create(unaligned._buffers.front()).release());
- _num += 1;
+ if (unaligned.get_num_buffers()) {
+ _buffers.insert_after(p_prev, *ptr_node::create(unaligned._buffers.front()).release());
+ _num += 1;
+ } else {
+ // a bufferlist containing only 0-length bptrs is rebuild as empty
+ }
++p_prev;
}
return had_to_rebuild;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rzarzynski Thank you for your solution. I have updated the code and commit information of PR. Please check.
@idryomov an example where this can be reproduced is when the below fio config is run
|
002bf0c
to
203143f
Compare
203143f
to
bc15029
Compare
jenkins test make check |
@rzarzynski could you take another look? |
bc15029
to
e043b5e
Compare
238b825
to
48c3bd2
Compare
@idryomov made changes as you suggested, please check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just a couple of really minor nits. I assume you have verified that the new test case still leads to a crash if the patch isn't applied?
There is such a bl, which needs to satisfy two conditions: 1)all ptrs' length sum except last ptr is aligned with 4K; 2)the length of last ptr is 0. This bl will cause stack corruption when calling bufferlist::rebuild_aligned_size_and_memory(). Deal with this special scenario in rebuild_aligned_size_and_memory() to solve the bug. And added a specialtest-case to reproduce this scenario. Fixes: https://tracker.ceph.com/issues/51419 Signed-off-by: Yin Congmin <congmin.yin@intel.com>
48c3bd2
to
05a49ae
Compare
Yes, I test the new test case. Without this change in had modified comments, please check. |
Merging based on Kefu's earlier runs since the fix itself hasn't changed. |
There is such a bl, which needs to satisfy two conditions:
This bl will cause stack corruption when calling
bufferlist::rebuild_aligned_size_and_memory().
Deal with this special scenario in rebuild_aligned_size_and_memory() to
solve the bug. And added a specialtest-case to reproduce this scenario.
Fixes: https://tracker.ceph.com/issues/51419
Signed-off-by: Yin Congmin congmin.yin@intel.com
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox