Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible race condition between Batcher and BatchTranslator #41

Closed
jerinphilip opened this issue Feb 25, 2021 · 5 comments
Closed

Possible race condition between Batcher and BatchTranslator #41

jerinphilip opened this issue Feb 25, 2021 · 5 comments
Assignees
Labels
bug Something isn't working mod: marian Changes affecting marian-dev component

Comments

@jerinphilip
Copy link
Contributor

SIGSEGV: gdb backtrace

Thread 18 "marian-decoder-" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffeec26700 (LWP 56005)]
0x000055555566ac27 in std::__shared_ptr<marian::History, (__gnu_cxx::_Lock_policy)2>::operator= (this=0x8d2df0) at /usr/include/c++/8/bits/shared_ptr_base.h:1078
1078        class __shared_ptr
(gdb) backtrace
#0  0x000055555566ac27 in std::__shared_ptr<marian::History, (__gnu_cxx::_Lock_policy)2>::operator= (this=0x8d2df0) at /usr/include/c++/8/bits/shared_ptr_base.h:1078
#1  0x000055555566ac6f in std::shared_ptr<marian::History>::operator= (this=0x8d2df0) at /usr/include/c++/8/bits/shared_ptr.h:103
#2  0x0000555555669428 in marian::bergamot::Request::processHistory (this=0x55568e2b4840, index=578271, history=std::shared_ptr<marian::History> (use count 3, weak count 0) = {...})
    at /home/jphilip/code/bergamot-translator@integration/src/translator/request.cpp:39
#3  0x0000555555669678 in marian::bergamot::RequestSentence::completeSentence (this=0x55572e949200, history=std::shared_ptr<marian::History> (use count 3, weak count 0) = {...})
    at /home/jphilip/code/bergamot-translator@integration/src/translator/request.cpp:77
#4  0x00005555556395ef in marian::bergamot::Batch::completeBatch (this=0x7fffeec248d0, histories=std::vector of length 256, capacity 256 = {...})
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch.cpp:24
#5  0x0000555555646f27 in marian::bergamot::BatchTranslator::translate (this=0x555558bdd600, batch=...)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch_translator.cpp:96
#6  0x00005555556471b2 in marian::bergamot::BatchTranslator::consumeFrom (this=0x555558bdd600, pcqueue=...)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch_translator.cpp:109
#7  0x0000555555625ac6 in marian::bergamot::Service::<lambda()>::operator()(void) const (__closure=0x55555a053fe8)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/service.cpp:41
#8  0x0000555555627436 in std::__invoke_impl<void, marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> >(std::__invoke_other, marian::bergamot::Service::<lambda()> &&)
    (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#9  0x000055555562727f in std::__invoke<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> >(marian::bergamot::Service::<lambda()> &&) (__fn=...)
    at /usr/include/c++/8/bits/invoke.h:95
#10 0x00005555556287f8 in std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (
    this=0x55555a053fe8) at /usr/include/c++/8/thread:244
#11 0x00005555556287ce in std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > >::operator()(void) (this=0x55555a053fe8)
    at /usr/include/c++/8/thread:253
#12 0x00005555556287b2 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > > >::_M_run(void) (
    this=0x55555a053fe0) at /usr/include/c++/8/thread:196
#13 0x00007ffff78cdd80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007ffff702f6db in start_thread (arg=0x7fffeec26700) at pthread_create.c:463
#15 0x00007ffff6d58a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

histories_[index] = history;

request_->processHistory(index_, history);

sentences_[i].completeSentence(histories[i]);

batch.completeBatch(histories);

Something amiss with a shared_ptr and copy/move. Unsure what is happening, opening issue to keep track.

@jerinphilip jerinphilip added the bug Something isn't working label Feb 25, 2021
@jerinphilip jerinphilip self-assigned this Feb 25, 2021
@jerinphilip
Copy link
Contributor Author

Fresh SIGSEGV

Thread 2 "marian-decoder-" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6c36700 (LWP 79086)]
0x0000555555659d70 in std::__shared_ptr<marian::data::SubBatch, (__gnu_cxx::_Lock_policy)2>::get (this=0x0)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1308
1308          { return _M_ptr; }
(gdb) backtrace
#0  0x0000555555659d70 in std::__shared_ptr<marian::data::SubBatch, (__gnu_cxx::_Lock_policy)2>::get (this=0x0)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1308
#1  0x000055555565508e in std::__shared_ptr_access<marian::data::SubBatch, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get (this=0x0)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1019
#2  0x00005555556502fa in std::__shared_ptr_access<marian::data::SubBatch, (__gnu_cxx::_Lock_policy)2, false, false>::operator-> (this=0x0)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1013
#3  0x0000555555649de9 in marian::data::CorpusBatch::size (this=0x555565a6d100)
    at /home/jphilip/code/bergamot-translator@integration/3rd_party/marian-dev/src/data/corpus_base.h:271
#4  0x0000555555824166 in marian::BeamSearch::search (this=0x55555bf734a0,
    graph=std::shared_ptr<marian::ExpressionGraph> (use count 11, weak count 378) = {...},
    batch=std::shared_ptr<marian::data::CorpusBatch> (use count 2, weak count 0) = {...})
    at /home/jphilip/code/bergamot-translator@integration/3rd_party/marian-dev/src/translator/beam_search.cpp:259
#5  0x0000555555646ec3 in marian::bergamot::BatchTranslator::translate (this=0x55555a3ecf60, batch=...)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch_translator.cpp:95
#6  0x000055555564717c in marian::bergamot::BatchTranslator::consumeFrom (this=0x55555a3ecf60, pcqueue=...)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch_translator.cpp:109
#7  0x0000555555625ac6 in marian::bergamot::Service::<lambda()>::operator()(void) const (__closure=0x55555a053de8)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/service.cpp:41
#8  0x0000555555627436 in std::__invoke_impl<void, marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> >(std::__invoke_other,
marian::bergamot::Service::<lambda()> &&) (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#9  0x000055555562727f in std::__invoke<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> >(marian::bergamot::Service::<lambda()> &&) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#10 0x00005555556287f8 in std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x55555a053de8) at /usr/include/c++/8/thread:244
#11 0x00005555556287ce in std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > >::operator()(void) (this=0x55555a053de8) at /usr/include/c++/8/thread:253
#12 0x00005555556287b2 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > > >::_M_run(void) (this=0x55555a053de0) at /usr/include/c++/8/thread:196
#13 0x00007ffff78cdd80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007ffff702f6db in start_thread (arg=0x7ffff6c36700) at pthread_create.c:463
#15 0x00007ffff6d58a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

jerinphilip pushed a commit that referenced this issue Feb 25, 2021
Through inheritance, a non-threaded and multithreaded Service are
created, both derived of the same ServiceBase class which holds the
common elements.

In preparation to solve SIGSEGV in #41. First inspections gave aborts in
thread part, and repeated SIGSEGV's in lock-policy's of shared_pointers
even in non-threaded paths.

Solving this first, to avoid ifdef or tricky paths. The non-threaded
implementation is not included in WASM builds at all, by separating out
the single-threaded logic. DRY is achieved through inheritance and
operator overloading.
@jerinphilip
Copy link
Contributor Author

Unable to reproduce since thread adjustments, marking as closed.

@jerinphilip jerinphilip mentioned this issue Mar 23, 2021
5 tasks
@jerinphilip
Copy link
Contributor Author

jerinphilip commented Mar 26, 2021

[2021-03-26 21:05:24] Received (3, 44) = 0.053427

Thread 48 "marian-decoder-" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdf77e700 (LWP 80447)]
0x00005555556700fc in marian::bergamot::Request::processHistory(unsigned long, std::shared_ptr<marian::History>) ()
(gdb)
(gdb)
(gdb) backtrace
#0  0x00005555556700fc in marian::bergamot::Request::processHistory(unsigned long, std::shared_ptr<marian::History>) ()
#1  0x00005555556701aa in marian::bergamot::RequestSentence::completeSentence(std::shared_ptr<marian::History>) ()
#2  0x0000555555679974 in marian::bergamot::Batch::completeBatch(std::vector<std::shared_ptr<marian::History>, std::allocator<std::shared_ptr<marian::History> > > const&) ()
#3  0x0000555555660451 in marian::bergamot::BatchTranslator::translate(marian::bergamot::Batch&) ()
#4  0x0000555555653a71 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::bergamot::Service::initialize_async_translators()::{lambda()#1}> > >::_M_run() ()
#5  0x00007ffff7224d80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7bbd6db in start_thread (arg=0x7fffdf77e700) at pthread_create.c:463
#7  0x00007ffff68cea3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

Sus on pcqueue and swap routine and Batch&.

@jerinphilip jerinphilip reopened this Mar 26, 2021
@jerinphilip jerinphilip added the mod: marian Changes affecting marian-dev component label Mar 27, 2021
@jerinphilip
Copy link
Contributor Author

The presently existing cleaveBatch is as follows:

bool Batcher::cleaveBatch(Batch &batch) {
// For now simply iterates on buckets and converts batches greedily. This
// has to be enhanced with optimizing over priority. The baseline
// implementation should at least be as fast as marian's maxi-batch with full
// corpus size as maxi-batch size.
batch.clear();
size_t paddedBatchSize = 0;
for (size_t length = 0; length < bucket_.size(); length++) {
auto p = bucket_[length].begin();
while (p != bucket_[length].end()) {
paddedBatchSize = (batch.size() + 1) * length;
if (paddedBatchSize <= miniBatchWords) {
auto q = p++;
batch.add(*q);
bucket_[length].erase(q);
} else {
// Check if elements exist
assert(batch.size() > 0);
return true;
}
}
}
bool isValidBatch = batch.size() > 0;
return isValidBatch;
}

The following code in jp/translate-perftest has been running for about 3 hours now without segfault.

It takes batch address from outside and tries to clear(), then populate the batch variable with the current batch. Absent batchExt and a local batch, during the run, there ends up cases where there are more than the intended elements (ABORTs fail). This was discovered when attempting to generate random batches at an even higher volume than WNGT20.

bool Batcher::nextRandomBatch(Batch &batchExt) {
std::pair<size_t, size_t> batchInfo;
while (exhaustConfig_.next(batchInfo)) {
// Generate batch
size_t B = batchInfo.first;
size_t T = batchInfo.second;
size_t available = bucket_[T].size();
if (available > 0) {
Batch batch;
batch.clear();
ABORT_IF(batch.size() != 0, "Batch size doesn't seem to be 0");
std::uniform_int_distribution<size_t> dist(0, available - 1);
static std::random_device rd;
static std::mt19937 gen(rd());
std::vector<size_t> idxs;
for (size_t b = 0; b < B; b++) {
idxs.push_back(dist(gen));
}
// ABORT_IF(idxs.size() * T >= miniBatchWords,
// "Too large a batch detected.");
std::sort(idxs.begin(), idxs.end());
size_t sentenceIdx = 0;
auto p = bucket_[T].begin();
for (auto &idx : idxs) {
while (sentenceIdx < idx) {
++sentenceIdx;
++p;
ABORT_IF(
p == bucket_[T].end(),
"Somehow we have reached the end of container, this is illegal.");
}
ABORT_IF(idx >= bucket_[T].size(),
"idx out of bounds. Something's wrong!");
ABORT_IF(idx != sentenceIdx, "idx != sentenceIdx. Something's wrong!");
ABORT_IF(p->numTokens() > T, "p->numTokens() > T. Something's wrong!");
batch.add(*p);
}
// LOG(info, "(idxs, T) = ({}, {}), (eB, eT) = ({}, {})", idxs.size(), T,
// batch.size(), batch.maxLength());
ABORT_IF(batch.size() != idxs.size(),
"Something's off, check above block!");
ABORT_IF(batch.maxLength() != T, "Something's off, check above block!");
batch.log();
batchExt = batch;
return true;
}
}
return false;
}

If such unexpected batches are created, this is what is possibly leading segfaults in certain WNGT20 runs as well. batch& is also connected to pcqueue with an std::swap through ProduceSwap.

@jerinphilip jerinphilip changed the title SIGSEGV: bergamot-translator Possible Race Condition between Batcher and BatchTranslator Mar 30, 2021
@jerinphilip jerinphilip changed the title Possible Race Condition between Batcher and BatchTranslator Possible race condition between Batcher and BatchTranslator Mar 30, 2021
@jerinphilip
Copy link
Contributor Author

We no longer have pcqueue, so this shouldn't be happening again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working mod: marian Changes affecting marian-dev component
Projects
None yet
Development

No branches or pull requests

1 participant