Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common/buffer: fix stack corruption in rebuild_aligned_size_and_memory() #42112

Merged
merged 1 commit into from Jul 9, 2021

Conversation

CongMinYin
Copy link
Contributor

@CongMinYin CongMinYin commented Jun 30, 2021

There is such a bl, which needs to satisfy two conditions:

  1. all ptr's length sum except last ptr is aligned with 4K;
  2. the length of last ptr is 0.

This bl will cause stack corruption when calling
bufferlist::rebuild_aligned_size_and_memory().

Deal with this special scenario in rebuild_aligned_size_and_memory() to
solve the bug. And added a specialtest-case to reproduce this scenario.

Fixes: https://tracker.ceph.com/issues/51419

Signed-off-by: Yin Congmin congmin.yin@intel.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@CongMinYin
Copy link
Contributor Author

CongMinYin commented Jun 30, 2021

Hi @rzarzynski , in the process of fixing the RBD bug, I found there was a problem with the common buffer. I have written a test case to reproduce this problem. Can you have a look at it.

@CongMinYin
Copy link
Contributor Author

CC @idryomov , new issue about pwl/ssd, this to fix https://tracker.ceph.com/issues/51419.

@CongMinYin
Copy link
Contributor Author

CC @majianpeng

@tchaikov
Copy link
Contributor

tchaikov commented Jun 30, 2021

@CongMinYin could you summarize the change you made in the commit in the title. see https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst#commit-title

"fix SIGABRT" is the effect of this change, but it does not describe the change.

@idryomov
Copy link
Contributor

Yes, please. This needs a detailed explanation of what is wrong with the bufferlist code.

Separately, I would appreciate if you describe the RBD bug that led you to discovering this.

src/test/bufferlist.cc Outdated Show resolved Hide resolved
@majianpeng
Copy link
Member

use our test case w/o this pr;

before splice blbuffer::list(len=20480,
        buffer::ptr(0~124 0x563a79612d00 in raw 0x563a79612d00 len 124 nref 1),
        buffer::ptr(0~3972 0x563a7966c010 in raw 0x563a7966c010 len 8096 nref 1),
        buffer::ptr(0~16384 0x563a79668000 in raw 0x563a79668000 len 16384 nref 2)
)
after splice blbuffer::list(len=12288,
        buffer::ptr(4096~12288 0x563a79669000 in raw 0x563a79668000 len 16384 nref 3),
        buffer::ptr(3972~0 0x563a7966cf94 in raw 0x563a7966c010 len 8096 nref 2)

in func rebuild_aligned_size_and_memory, it will corrupt stack. This because
a)bl has a ptr which length is zero.
b)the first ptr's size and memroy-address aligned w/ 4K.
So https://github.com/ceph/ceph/blob/master/src/common/buffer.cc#L1261 will corrupt stack and cause SIGABRT.

@@ -1645,7 +1645,6 @@ static ceph::spinlock debug_lock;
curbuf = _buffers.erase_after(curbuf_prev);
_carriage->set_offset(_carriage->offset() + _carriage->length());
_carriage->set_length(0);
_buffers.push_back(*_carriage);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if we really need to drop this line, the comment above should be reworked as well. Also, I'm worried the patch introduces a memory leak as erase_after() doesn't dispose the node.

In the meantime, I'm joining the folks asking for more descriptive commit message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if we really need to drop this line, the comment above should be reworked as well. Also, I'm worried the patch introduces a memory leak as erase_after() doesn't dispose the node.

In the meantime, I joining the folks asking for more descriptive commit message.

I'm not sure about the solution, but the problem does exist. see the issue describe by Jianpeng

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CongMinYin, @majianpeng: I'm not questioning that at all. :-) Just arguing about the way how we fix it.

a)bl has a ptr which length is zero.

This is still valid and can (rarely) happen because of how _carriage and the non-pointer append() variants of contiguous_appender interact:

    class contiguous_appender {
      // ...
      void append(const bufferlist& l) {
        if (deep) {
          // ...
        } else {
          flush_and_continue();
          bl.append(l);
          space = bl.obtain_contiguous_space(0);
          out_of_band_offset += l.length();
        }
      }

My impression is we need to take a look on rebuild_aligned_size_and_memory(). I can help with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, take a look on the BufferListIterator,iterate_with_empties testcase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also look forward to discussing with you and seeking the most appropriate solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think the fix could look like this one:

diff --git a/src/common/buffer.cc b/src/common/buffer.cc
index 548abc458fd..9c76cde6eed 100644
--- a/src/common/buffer.cc
+++ b/src/common/buffer.cc
@@ -1258,8 +1258,12 @@ static ceph::spinlock debug_lock;
             buffer::create_aligned(unaligned._len, align_memory)));
         had_to_rebuild = true;
       }
-      _buffers.insert_after(p_prev, *ptr_node::create(unaligned._buffers.front()).release());
-      _num += 1;
+      if (unaligned.get_num_buffers()) {
+        _buffers.insert_after(p_prev, *ptr_node::create(unaligned._buffers.front()).release());
+        _num += 1;
+      } else {
+        // a bufferlist containing only 0-length bptrs is rebuild as empty
+      }
       ++p_prev;
     }
     return had_to_rebuild;

Copy link
Contributor Author

@CongMinYin CongMinYin Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rzarzynski Thank you for your solution. I have updated the code and commit information of PR. Please check.

src/test/bufferlist.cc Outdated Show resolved Hide resolved
@MahatiC
Copy link
Contributor

MahatiC commented Jul 1, 2021

Separately, I would appreciate if you describe the RBD bug that led you to discovering this.

@idryomov an example where this can be reproduced is when the below fio config is run

[global]

ioengine=rbd
clientname=admin
rw=randwrite
#bs=1m
bs=16k
time_based=1
runtime=20s
iodepth=16
pool=rbd
group_reporting

[volumes]
rbdname=fio-test

@CongMinYin CongMinYin force-pushed the fix-common-buffer-SIGABRT branch 2 times, most recently from 002bf0c to 203143f Compare July 1, 2021 03:46
src/test/bufferlist.cc Outdated Show resolved Hide resolved
@idryomov
Copy link
Contributor

idryomov commented Jul 2, 2021

jenkins test make check

@tchaikov
Copy link
Contributor

tchaikov commented Jul 2, 2021

@rzarzynski could you take another look?

src/test/bufferlist.cc Outdated Show resolved Hide resolved
src/test/bufferlist.cc Outdated Show resolved Hide resolved
@CongMinYin CongMinYin force-pushed the fix-common-buffer-SIGABRT branch 2 times, most recently from 238b825 to 48c3bd2 Compare July 6, 2021 02:55
@CongMinYin
Copy link
Contributor Author

@idryomov made changes as you suggested, please check.

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just a couple of really minor nits. I assume you have verified that the new test case still leads to a crash if the patch isn't applied?

src/common/buffer.cc Outdated Show resolved Hide resolved
src/test/bufferlist.cc Outdated Show resolved Hide resolved
src/test/bufferlist.cc Outdated Show resolved Hide resolved
There is such a bl, which needs to satisfy two conditions:
1)all ptrs' length sum except last ptr is aligned with 4K;
2)the length of last ptr is 0.
This bl will cause stack corruption when calling
bufferlist::rebuild_aligned_size_and_memory().

Deal with this special scenario in rebuild_aligned_size_and_memory() to
solve the bug. And added a specialtest-case to reproduce this scenario.

Fixes: https://tracker.ceph.com/issues/51419

Signed-off-by: Yin Congmin <congmin.yin@intel.com>
@CongMinYin
Copy link
Contributor Author

Looks good to me, just a couple of really minor nits. I assume you have verified that the new test case still leads to a crash if the patch isn't applied?

Yes, I test the new test case. Without this change in rebuild_aligned_size_and_memory, new unit_test failed., after appling this change, new uint_test passed.

had modified comments, please check.

@idryomov idryomov changed the title common/buffer: fix SIGABRT in rebuild_aligned_size_and_memory. common/buffer: fix stack corruption in rebuild_aligned_size_and_memory() Jul 9, 2021
@idryomov
Copy link
Contributor

idryomov commented Jul 9, 2021

Merging based on Kefu's earlier runs since the fix itself hasn't changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants