[branch-1.2](scanner) Fix memory out of bound in scanner scheduler#25311

Merged
dataroaring merged 2 commits into apache:branch-1.2-lts from xy720:fix-memory-out-of-bound-1.2
Oct 11, 2023

Conversation

@xy720 (Member) commented Oct 11, 2023

Proposed changes

ref #24840

In version 1.2.7, we found that a BE core dump occurs about once a month. The core dump is:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fb96ac13d00 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x000055a571705bf6 in __gthread_mutex_lock (__mutex=0x725f74756f656dd1)
    at /var/local/ldb-toolchain/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
#2  std::mutex::lock (this=0x725f74756f656dd1) at /var/local/ldb-toolchain/include/c++/11/bits/std_mutex.h:100
#3  std::unique_lock<std::mutex>::lock (this=0x7fb87c7efc00) at /var/local/ldb-toolchain/include/c++/11/bits/unique_lock.h:139
#4  std::unique_lock<std::mutex>::unique_lock (__m=..., this=0x7fb87c7efc00)
    at /var/local/ldb-toolchain/include/c++/11/bits/unique_lock.h:69
#5  doris::BlockingQueue<doris::vectorized::ScannerContext*>::blocking_put (val=<synthetic pointer>: <optimized out>, 
    this=0x725f74756f656d69) at /data/doris-1.x/be/src/util/blocking_queue.hpp:103
#6  doris::vectorized::ScannerScheduler::submit (this=<optimized out>, ctx=0x7fb470295e00)
    at /data/doris-1.x/be/src/vec/exec/scan/scanner_scheduler.cpp:100
#7  0x000055a5716c9360 in doris::vectorized::VScanNode::_start_scanners (this=this@entry=0x7fb389e17000, scanners=...)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:1290
#8  0x000055a5716d9e53 in doris::vectorized::VScanNode::open (this=0x7fb389e17000, state=<optimized out>)
    at /data/doris-1.x/be/src/vec/exec/scan/vscan_node.cpp:102
#9  0x000055a56c777eef in doris::PlanFragmentExecutor::open_vectorized_internal (this=this@entry=0x7fb3e9c42680)
    at /var/local/ldb-toolchain/include/c++/11/bits/unique_ptr.h:421
#10 0x000055a56c7795ae in doris::PlanFragmentExecutor::open (this=this@entry=0x7fb3e9c42680)
    at /data/doris-1.x/be/src/runtime/plan_fragment_executor.cpp:261
#11 0x000055a56c751c79 in doris::FragmentExecState::execute (this=0x7fb3e9c42600)
    at /data/doris-1.x/be/src/runtime/fragment_mgr.cpp:260
#12 0x000055a56c7552ce in doris::FragmentMgr::_exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>) (this=this@entry=0x7fb8d0e25000, exec_state=..., cb=...)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:1290
#13 0x000055a56c755802 in operator() (__closure=<optimized out>) at /var/local/ldb-toolchain/include/c++/11/ext/atomicity.h:109
#14 std::__invoke_impl<void, doris::FragmentMgr::exec_plan_fragment(const doris::TExecPlanFragmentParams&, doris::FragmentMgr::FinishCallback)::<lambda()>&> (__f=...) at /var/local/ldb-toolchain/include/c++/11/bits/invoke.h:61
#15 std::__invoke_r<void, doris::FragmentMgr::exec_plan_fragment(const doris::TExecPlanFragmentParams&, doris::FragmentMgr::FinishCallback)::<lambda()>&> (__fn=...) at /var/local/ldb-toolchain/include/c++/11/bits/invoke.h:111
#16 std::_Function_handler<void(), doris::FragmentMgr::exec_plan_fragment(const doris::TExecPlanFragmentParams&, doris::FragmentMgr::FinishCallback)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /var/local/ldb-toolchain/include/c++/11/bits/std_function.h:291
#17 0x000055a56ca0d475 in std::function<void ()>::operator()() const (this=<optimized out>)
    at /var/local/ldb-toolchain/include/c++/11/bits/std_function.h:556
#18 doris::FunctionRunnable::run (this=<optimized out>) at /data/doris-1.x/be/src/util/threadpool.cpp:46
#19 doris::ThreadPool::dispatch_thread (this=0x7fb933ceea80) at /data/doris-1.x/be/src/util/threadpool.cpp:535
#20 0x000055a56ca028cf in std::function<void ()>::operator()() const (this=0x7faf3395f078)
    at /var/local/ldb-toolchain/include/c++/11/bits/std_function.h:556
#21 doris::Thread::supervise_thread (arg=0x7faf3395f060) at /data/doris-1.x/be/src/util/thread.cpp:454
#22 0x00007fb96ac11ea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fb96af249fd in clone () from /lib64/libc.so.6

The program is segmentation fault at memory address 0x725f74756f656dd1.

(gdb) x 0x725f74756f656dd1
0x725f74756f656dd1:     Cannot access memory at address 0x725f74756f656dd1

This address is not actually a mutex address, but a garbage address caused by an out-of-bound memory access.

Here is the proof:

(gdb) f 6                                                                
#6  doris::vectorized::ScannerScheduler::submit (this=<optimized out>, ctx=0x7fb470295e00)
    at /data/doris-1.x/be/src/vec/exec/scan/scanner_scheduler.cpp:100
100         if (!_pending_queues[ctx->queue_idx]->blocking_put(ctx)) {
(gdb) p ctx->queue_idx                                                   
$2 = -3
(gdb) f 7                                                                
#7  0x000055a5716c9360 in doris::vectorized::VScanNode::_start_scanners (this=this@entry=0x7fb389e17000, scanners=...)
    at /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h:1290
1290    /var/local/ldb-toolchain/include/c++/11/bits/shared_ptr_base.h: No such file or directory.
(gdb) p _state->_exec_env->_scanner_scheduler->_pending_queues[-3]->_lock
Cannot access memory at address 0x725f74756f656dd1

The program accessed address 0x725f74756f656dd1 because queue_idx was set to -3, so _pending_queues[-3] read past the front of the array and treated garbage memory as a BlockingQueue.

The root cause is that the atomic_int _queue_idx in ScannerScheduler is incremented on every submit, and after enough increments it overflows past INT_MAX and wraps to a negative value.

So we should use atomic_uint instead: unsigned overflow wraps around to zero, so the index always stays non-negative.

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@lide-reed (Contributor) left a comment:

LGTM

@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on Oct 11, 2023
@github-actions bot commented:

PR approved by at least one committer and no changes requested.

@github-actions bot commented:

PR approved by anyone and no changes requested.


Labels

approved (Indicates a PR has been approved by one committer.), area/vectorization, reviewed
