
Fix corruption with intra-L0 on ingested files #5958

Closed
wants to merge 19 commits into from

Conversation

Little-Wallace
Contributor

@Little-Wallace Little-Wallace commented Oct 23, 2019

Problem Description

Our process aborted when it called CheckConsistency, and the information in stderr showed "L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421". Here are the causes of the failure that I investigated.

  • RocksDB calls CheckConsistency whenever the MANIFEST file is updated. It checks the sequence number interval of every file, except files that were ingested.
  • When a file is ingested into RocksDB, it is assigned a global sequence number, and the smallest and largest seqno of this file are equal: both are set to the global sequence number.
  • CheckConsistency determines whether a file was ingested by checking whether the smallest and largest seqno of the sstable file are equal.
  • If IntraL0Compaction picks an sst that was just ingested and compacts it into another sst, the smallest_seqno of the new file will be smaller than its largest_seqno.
    • If more than one file was ingested before the memtable was scheduled to flush, and they are all compacted into one new sstable file by IntraL0Compaction, the sequence interval of the new file will be contained in the interval of the memtable, so CheckConsistency will return a Corruption.
    • If an sstable is ingested after the memtable is scheduled to flush, it is assigned a larger seqno than the memtable's. That file may then be compacted with other L0 files (all flushed before the memtable) into one file, with the compaction starting before the memtable's flush job starts but completing after the flush job finishes. The new file produced by the compaction (call it s1) then has a wider sequence number interval than the file produced by the flush (call it s2). But some data in s1 was written into RocksDB before s2, so it is possible that some data in s2 is covered by older data in s1. This also causes a Corruption because of the overlap of seqno intervals. Here is the relationship of the files:

    s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno

So in FindIntraL0Compaction I skip sst files that were ingested.

Reason

Here is my bug report: #5913

There are two situations that can cause the check to fail.

First situation:

  • First we ingest five external ssts into RocksDB, and they happen to be placed in L0. There was already some data in the memtable, so the smallest sequence number of the memtable is less than that of the ssts we ingested.

  • If there has already been a compaction job that compacted ssts from L0 to L1, LevelCompactionPicker may trigger an IntraL0Compaction that compacts these five ssts from L0 to L0. We call the result sst A; it was merged from the five ingested ssts.

  • Then some data is put into the memtable, and the memtable is flushed to L0. We call this sst B.

  • RocksDB checks consistency, finds that the smallest_seqno of B is less than that of A, and crashes. Because A was merged from five ssts, its smallest sequence number is less than its largest sequence number, so RocksDB cannot tell that A was produced by ingestion.

Second situation:

  • First, we have already flushed several ssts into L0; we call them [s1, s2, s3].

  • There is an immutable memtable waiting to be flushed, but because the flush threads are busy it has not yet been picked; we call it m1. At this moment, an sst is ingested into L0; we call it s4. Because s4 is ingested after m1 became an immutable memtable, it has a larger log sequence number than m1.

  • m1 is flushed to L0. Because it is small, this flush job finishes quickly; we call the result s5.

  • [s1, s2, s3, s4] are compacted into one L0 sst by IntraL0Compaction; we call it s6.

    • compacted 4@0 files to L0
  • When s6 is added into the MANIFEST, the corruption happens: the largest sequence number of s6 equals that of s4, and both are larger than that of s5. But because s1 is older than m1, the smallest sequence number of s6 is smaller than that of s5:

    • s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
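The interval nesting above follows directly from how a compaction output's seqno interval is the union of its inputs' intervals. A toy illustration (hypothetical names, not RocksDB code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct FileMeta {
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

// The seqno interval of a compaction output covers all of its inputs:
// the min of the smallest seqnos and the max of the largest seqnos.
FileMeta MergeSeqnoInterval(const std::vector<FileMeta>& inputs) {
  FileMeta out{UINT64_MAX, 0};
  for (const FileMeta& f : inputs) {
    out.smallest_seqno = std::min(out.smallest_seqno, f.smallest_seqno);
    out.largest_seqno = std::max(out.largest_seqno, f.largest_seqno);
  }
  return out;
}
```

With illustrative seqnos s1 = [1, 3], s2 = [4, 6], s3 = [7, 9], s4 = [12, 12] and the flushed s5 = [10, 11], the merged s6 = [1, 12] strictly encloses s5, reproducing the inequality above.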

Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Contributor

@ajkr ajkr left a comment


While this does prevent L0 files from being created that would fail the L0 consistency check, it is still possible that an existing DB already contains files that fail the check. Is that a problem?

Also, we benefit quite a lot from ingested files going through intra-L0 compaction. Ingestion is really the only way our L0 file count grows quickly so it's helpful to have intra-L0 to alleviate the pressure.

The way I solved this in Cockroach is simply turn off that part of the assertion: cockroachdb#67. I don't think it's a particularly great approach as maybe that assertion is useful for catching other bugs, but it's one to consider.

edit: I did not fully understand the secondary situation when writing this reply. The approach I mentioned about turning off assertion is clearly insufficient for the corruption problem.

@siying
Contributor

siying commented Oct 23, 2019

Assume we have L0 files
[3 8] and [5 5]
Is it possible that every time we pick [3 8], [5 5] is always picked up too, and vice versa? Is that going to make the intra-L0 compaction code too complicated?

@siying
Contributor

siying commented Oct 23, 2019

Or say, expand Intra-L0 compaction in both directions until a clean cut.

@ajkr
Contributor

ajkr commented Oct 23, 2019

Hm, what if an intra-L0 is picked including [5 5] (and maybe another file like [4 4]) before [3 8] is flushed? I think then when [3 8] is flushed it'll trigger the same consistency failure.

@siying
Contributor

siying commented Oct 23, 2019

@ajkr that is a very good point...

@siying
Contributor

siying commented Oct 23, 2019

Then I think removing the assertion feels like a logical solution. One thing that worries me is that the stress test doesn't seem to run under circumstances where intra-L0 compaction happens, so we might not be able to uncover bugs like this in the future.

@ajkr
Contributor

ajkr commented Oct 23, 2019

I think after we have @sumeerbhola's fix for #5848 (if it isn't already fixed, I haven't checked recently), then we can set -ingest_external_file_one_in to an aggressively low number and it should trigger frequent intra-L0s.

@ajkr
Contributor

ajkr commented Oct 23, 2019

High file ingestion rate could be randomly turned on in db_crashtest.py so only once in a while there'd be high ingestion rate. Then we wouldn't need to add new test setups, and it wouldn't take much coverage away from other features.

@Little-Wallace
Contributor Author

Hm, what if an intra-L0 is picked including [5 5] (and maybe another file like [4 4]) before [3 8] is flushed? I think then when [3 8] is flushed it'll trigger the same consistency failure.

Yes, this is what I described in the first situation.

@siying I think intra-L0 should only pick files whose largest seqno is smaller than the smallest seqno in the imm or memtable. That is:

  • max(largest_seqno(picked file)) < smallest_seqno(imm, mem)

This solution solves the problem without disabling intra-L0 compaction for ingested files.

  • If the largest seqno of a picked file is larger than the smallest_seqno of the imm, the file only needs to wait to be picked next time, after the imm has been flushed into L0.

@riversand963
Contributor

riversand963 commented Oct 24, 2019

Hm, what if an intra-L0 is picked including [5 5] (and maybe another file like [4 4]) before [3 8] is flushed? I think then when [3 8] is flushed it'll trigger the same consistency failure.

Yes, this is what I described in the first situation.

@siying I think intra-L0 should only pick files whose largest seqno is smaller than the smallest seqno in the imm or memtable. That is:

  • max(largest_seqno(picked file)) < smallest_seqno(imm, mem)

This solution solves the problem without disabling intra-L0 compaction for ingested files.

  • If the largest seqno of a picked file is larger than the smallest_seqno of the imm, the file only needs to wait to be picked next time, after the imm has been flushed into L0.

Thanks @Little-Wallace for the proposal. I chatted with @siying and I have a question. Let me use an example.

Mem                 [4, 5]
L0         [1, 3]                 [6, 6]    [7, 7]
L1

Originally we have L0 [1, 3] and memtable [4, 5], in which 1, 3, 4, 5 are sequence numbers. Then we ingest [6, 6] and [7, 7]. The external SSTs do not have overlapping keys with [4, 5], but they overlap with [1, 3]. Suppose no further flush is triggered. Is it possible for [6, 6] and [7, 7] to be compacted by intra-L0 compaction and then moved to L1? Would this be a problem if we search for a key that exists in both [1, 3] and [6, 6]? Just asking because I am currently not fully familiar with all the compaction boundary-picking logic.

@Little-Wallace
Contributor Author

@riversand963 IntraL0Compaction only moves files within L0, not to L1. And a lookup of a key only sees its latest version, regardless of how many ssts contain it.

Contributor

@petermattis petermattis left a comment


The approach looks sane to me. My only major comment is about adjustments to the logic in FindIntraL0Compaction. Glad to see the added test, though I didn't get a chance to look at that in detail.

@@ -391,6 +391,11 @@ void SuperVersion::Init(MemTable* new_mem, MemTableListVersion* new_imm,
refs.store(1, std::memory_order_relaxed);
}

SequenceNumber SuperVersion::GetSmallestMemSeqno() const {
Contributor

Nit: I think this method should be named GetEarliestMemTableSequenceNumber for consistency with the naming of similar methods elsewhere. Also, this looks very similar to DBImpl::GetEarliestMemTableSequenceNumber.

PS I wouldn't mind if all of the uses of earliest were switched to smallest for consistency with FileMetadata::{smallest,largest}_seqno, but I'll leave it to the FB RocksDB folks to decide if they want to do that.

@@ -929,8 +934,9 @@ bool ColumnFamilyData::NeedsCompaction() const {

Compaction* ColumnFamilyData::PickCompaction(
const MutableCFOptions& mutable_options, LogBuffer* log_buffer) {
SequenceNumber oldest_mem_seqno = super_version_->GetSmallestMemSeqno();
Contributor

Nit: now we have 3 terms in play: earliest, smallest, and oldest. For consistency, I think this should be earliest_mem_seqno.

Contributor Author

Ok, I'll fix it.

@@ -52,7 +59,9 @@ bool FindIntraL0Compaction(const std::vector<FileMetaData*>& level_files,
compact_bytes += static_cast<size_t>(level_files[span_len]->fd.file_size);
compensated_compact_bytes += level_files[span_len]->compensated_file_size;
new_compact_bytes_per_del_file = compact_bytes / span_len;
if (level_files[span_len]->being_compacted ||
if ((level_files[0]->fd.smallest_seqno == level_files[0]->fd.largest_seqno &&
level_files[span_len]->fd.largest_seqno >= oldest_mem_seqno) ||
Contributor

Is this check necessary? My understanding is that level_files is sorted in descending largest_seqno order. So if the check at the start of this method succeeded, the check here must also be true. Am I missing something?

I wasn't aware that intra-L0 compactions always started with the newest sstable in L0. Rather than the approach taken here, could we instead skip the prefix of level_files that contains sstables with largest_seqno >= oldest_mem_seqno and then build a span from there? Cc @ajkr.

Contributor Author

Sorry, I made a mistake. My intention is to check every file, rather than only level_files[0].

SequenceNumber oldest_mem_seqno) {

// Do not pick an ingested file when there is at least one unflushed memtable whose seqno range overlaps with the sst.
if (level_files[0]->fd.smallest_seqno == level_files[0]->fd.largest_seqno &&
Contributor

Would it be better to remove this conditional? Is there ever a case where we want to pick an sstable for intra-L0 compaction that has a largest seqno that is larger than the smallest unflushed seqno?

Contributor Author

Yes. I explained it in my description. If the sstable was produced by ingestion and its keys did not overlap with the memtable, it uses the global seqno, which can fall right inside the [smallest seqno, largest seqno] interval of the memtable.

@@ -39,7 +39,14 @@ bool FindIntraL0Compaction(const std::vector<FileMetaData*>& level_files,
size_t min_files_to_compact,
uint64_t max_compact_bytes_per_del_file,
uint64_t max_compaction_bytes,
CompactionInputFiles* comp_inputs) {
CompactionInputFiles* comp_inputs,
SequenceNumber oldest_mem_seqno) {
Contributor

Another naming nit: smallest_unflushed_seqno would make it clearer that it is the fact that this seqno is unflushed which is important.

CompactionInputFiles* comp_inputs,
SequenceNumber oldest_mem_seqno) {

// Do not pick an ingested file when there is at least one unflushed memtable whose seqno range overlaps with the sst.
Contributor

Do you think it would work to pick the span of files for intra-L0 starting at the smallest index i for which level_files[i]->fd.largest_seqno < oldest_mem_seqno?

Contributor

That's a nicer phrasing of a suggestion I had in a comment. I don't see anything wrong with this right now, but I'd like to think harder about it.

Contributor

Sorry I loaded the page before you reviewed and didn't notice the repetition. I haven't thought of any problems with it. I also plan to experiment with relaxing the restriction more generally on intra-L0 picking from the newest file as it might be related to our imports getting stuck.

Contributor Author

I would try it.

Contributor

This approach looks good to me.

Contributor

Wrap the line to follow the 80-char rule. If possible, please try to run "make format".

Contributor

@yiwu-arbug yiwu-arbug Oct 29, 2019

@ajkr It seems to me it could still order the output to be newer than L0_(i-1) if L0_i is an ingested file and L0_i->largest_seqno < oldest_mem_seqno but L0_i->largest_seqno > L0_(i-1)->largest_seqno.

@facebook-github-bot
Contributor

This pull request has been merged in ec3e3c3.

Little-Wallace added a commit to Little-Wallace/rocksdb that referenced this pull request Nov 20, 2019
…5958)

@siying
Contributor

siying commented Nov 20, 2019

btw, I saw this failure once:

Note: Google Test filter = DBCompactionTestWithParam/DBCompactionTestWithParam.FlushAfterL0IntraCompactionCheckConsistencyFail/0
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTestWithParam/DBCompactionTestWithParam
[ RUN      ] DBCompactionTestWithParam/DBCompactionTestWithParam.FlushAfterL0IntraCompactionCheckConsistencyFail/0
db/db_compaction_test.cc:4921: Failure
Expected equality of these values:
  i + 1
    Which is: 10
  NumTableFilesAtLevel(0)
    Which is: 5
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/db_compaction_test.cc:4921: Failure
Expected equality of these values:
  i + 1
    Which is: 10
  NumTableFilesAtLevel(0)
    Which is: 5

It doesn't reproduce, though. I wonder whether you have any thoughts on what the reason could be.

@siying
Contributor

siying commented Nov 20, 2019

And another run is hanging:

(gdb) thread apply all bt

Thread 7 (Thread 0x7fca00ff9700 (LWP 15225)):
#0  0x00007fca155e7f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fca00ff8a30, expected=0,
    futex_word=0x5580d3ee3f98) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  __pthread_cond_wait_common (abstime=0x7fca00ff8a30, mutex=0x5580d3ee3f28, cond=0x5580d3ee3f70) at pthread_cond_wait.c:539
#2  __pthread_cond_timedwait (cond=cond@entry=0x5580d3ee3f70, mutex=0x5580d3ee3f28, abstime=abstime@entry=0x7fca00ff8a30)
    at pthread_cond_wait.c:667
#3  0x00005580d2d191e9 in rocksdb::port::CondVar::TimedWait (this=this@entry=0x5580d3ee3f70, abs_time_us=<optimized out>)
    at port/port_posix.cc:120
#4  0x00005580d2ce5154 in rocksdb::InstrumentedCondVar::TimedWaitInternal (this=0x5580d3ee3f70, abs_time_us=<optimized out>)
    at monitoring/instrumented_mutex.cc:66
#5  0x00005580d2ce522d in rocksdb::InstrumentedCondVar::TimedWait (this=this@entry=0x5580d3ee3f70,
    abs_time_us=abs_time_us@entry=1574277484269966) at monitoring/instrumented_mutex.cc:55
#6  0x00005580d2b6d97b in rocksdb::RepeatableThread::wait (delay=<optimized out>, this=0x5580d3ee3ed0) at ./util/repeatable_thread.h:90
#7  rocksdb::RepeatableThread::thread (this=0x5580d3ee3ed0) at ./util/repeatable_thread.h:126
#8  rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}::operator()() const (__closure=<optimized out>)
    at ./util/repeatable_thread.h:38
#9  std::__invoke_impl<void, rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}>(std::__invoke_other, rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}&&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#10 std::__invoke<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}>(std::__invoke_result&&, (rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#11 std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=<optimized out>) at /usr/include/c++/9/thread:244
#12 std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> >::operator()() (
    this=<optimized out>) at /usr/include/c++/9/thread:251
#13 std::thread::_State_impl<std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> > >::_M_run() (this=<optimized out>) at /usr/include/c++/9/thread:195
#14 0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#15 0x00007fca155e16db in start_thread (arg=0x7fca00ff9700) at pthread_create.c:463
#16 0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7fc9f67fc700 (LWP 15224)):
#0  0x00007fca155e7f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fc9f67fba30, expected=0,
    futex_word=0x5580d3ee1238) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  __pthread_cond_wait_common (abstime=0x7fc9f67fba30, mutex=0x5580d3ee11c8, cond=0x5580d3ee1210) at pthread_cond_wait.c:539
#2  __pthread_cond_timedwait (cond=cond@entry=0x5580d3ee1210, mutex=0x5580d3ee11c8, abstime=abstime@entry=0x7fc9f67fba30)
    at pthread_cond_wait.c:667
#3  0x00005580d2d191e9 in rocksdb::port::CondVar::TimedWait (this=this@entry=0x5580d3ee1210, abs_time_us=<optimized out>)
    at port/port_posix.cc:120
#4  0x00005580d2ce5154 in rocksdb::InstrumentedCondVar::TimedWaitInternal (this=0x5580d3ee1210, abs_time_us=<optimized out>)
    at monitoring/instrumented_mutex.cc:66
#5  0x00005580d2ce522d in rocksdb::InstrumentedCondVar::TimedWait (this=this@entry=0x5580d3ee1210,
    abs_time_us=abs_time_us@entry=1574277484296472) at monitoring/instrumented_mutex.cc:55
#6  0x00005580d2b6d97b in rocksdb::RepeatableThread::wait (delay=<optimized out>, this=0x5580d3ee1170) at ./util/repeatable_thread.h:90
#7  rocksdb::RepeatableThread::thread (this=0x5580d3ee1170) at ./util/repeatable_thread.h:126
#8  rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}::operator()() const (__closure=<optimized out>)
    at ./util/repeatable_thread.h:38
#9  std::__invoke_impl<void, rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}>(std::__invoke_other, rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}&&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#10 std::__invoke<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}>(std::__invoke_result&&, (rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#11 std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=<optimized out>) at /usr/include/c++/9/thread:244
#12 std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> >::operator()() (
    this=<optimized out>) at /usr/include/c++/9/thread:251
#13 std::thread::_State_impl<std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> > >::_M_run() (this=<optimized out>) at /usr/include/c++/9/thread:195
#14 0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#15 0x00007fca155e16db in start_thread (arg=0x7fc9f67fc700) at pthread_create.c:463
#16 0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fc9f77fe700 (LWP 15142)):
#0  0x00007fca155e79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5580d3ecd7c8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580d3ecd6c0, cond=0x5580d3ecd7a0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x5580d3ecd7a0, mutex=0x5580d3ecd6c0) at pthread_cond_wait.c:655
#3  0x00005580d2d19169 in rocksdb::port::CondVar::Wait (this=this@entry=0x5580d3ecd7a0) at port/port_posix.cc:106
#4  0x00005580d2ce4f4f in rocksdb::InstrumentedCondVar::WaitInternal (this=0x5580d3ecd7a0) at monitoring/instrumented_mutex.cc:48
#5  rocksdb::InstrumentedCondVar::Wait (this=this@entry=0x5580d3ecd7a0) at monitoring/instrumented_mutex.cc:41
#6  0x00005580d2cc3f40 in rocksdb::DeleteScheduler::BackgroundEmptyTrash (this=0x5580d3ecd6a8) at file/delete_scheduler.cc:194
#7  0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fca155e16db in start_thread (arg=0x7fc9f77fe700) at pthread_create.c:463
#9  0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fca13224700 (LWP 15141)):
#0  0x00007fca155e79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5580d3ec9930)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580d3ec98e0, cond=0x5580d3ec9908) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5580d3ec9908, mutex=0x5580d3ec98e0) at pthread_cond_wait.c:655
#3  0x00007fca14c98f0c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00005580d2dccf25 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=0x5580d3ec9870, thread_id=1) at util/threadpool_imp.cc:196
#5  0x00005580d2dcd25e in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x5580d3ee0a80) at util/threadpool_imp.cc:306
#6  0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fca155e16db in start_thread (arg=0x7fca13224700) at pthread_create.c:463
#8  0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fca13a25700 (LWP 15078)):
#0  0x00007fca155e79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fff3cefd388)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fff3cefd330, cond=0x7fff3cefd360) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x7fff3cefd360, mutex=0x7fff3cefd330) at pthread_cond_wait.c:655
#3  0x00005580d2d19169 in rocksdb::port::CondVar::Wait (this=0x7fff3cefd360) at port/port_posix.cc:106
#4  0x00005580d2a97b18 in rocksdb::test::SleepingBackgroundTask::DoSleep (this=0x7fff3cefd330) at ./test_util/testutil.h:411
#5  rocksdb::test::SleepingBackgroundTask::DoSleepTask (arg=0x7fff3cefd330) at ./test_util/testutil.h:446
#6  0x00005580d2dcd0cb in std::function<void ()>::operator()() const (this=0x7fca13a24b00) at /usr/include/c++/9/bits/std_function.h:683
#7  rocksdb::ThreadPoolImpl::Impl::BGThread (this=0x5580d3ec9b60, thread_id=0) at util/threadpool_imp.cc:265
#8  0x00005580d2dcd25e in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x5580d3ecc200) at util/threadpool_imp.cc:306
#9  0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007fca155e16db in start_thread (arg=0x7fca13a25700) at pthread_create.c:463
#11 0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fca14226700 (LWP 15077)):
#0  0x00007fca155e79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5580d3ec9930)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580d3ec98e0, cond=0x5580d3ec9908) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5580d3ec9908, mutex=0x5580d3ec98e0) at pthread_cond_wait.c:655
#3  0x00007fca14c98f0c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00005580d2dccf25 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=0x5580d3ec9870, thread_id=0) at util/threadpool_imp.cc:196
#5  0x00005580d2dcd25e in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x5580d3ecc060) at util/threadpool_imp.cc:306
#6  0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fca155e16db in start_thread (arg=0x7fca14226700) at pthread_create.c:463
#8  0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fca15c12a80 (LWP 15073)):
#0  0x00007fca155e7449 in futex_wait (private=<optimized out>, expected=12, futex_word=0x7fff3cefd384)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=12, futex_word=0x7fff3cefd384) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_cond_destroy (cond=0x7fff3cefd360) at pthread_cond_destroy.c:54
#3  0x00005580d2d19129 in rocksdb::port::CondVar::~CondVar (this=<optimized out>, __in_chrg=<optimized out>) at port/port_posix.cc:100
#4  0x00005580d29b7c14 in rocksdb::test::SleepingBackgroundTask::~SleepingBackgroundTask (this=0x7fff3cefd330, __in_chrg=<optimized out>)
    at ./test_util/testutil.h:394
#5  rocksdb::DBCompactionTestWithParam_IntraL0CompactionAfterFlushCheckConsistencyFail_Test::TestBody (this=<optimized out>)
    at db/db_compaction_test.cc:4981
#6  0x00005580d2f01730 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (
    location=0x5580d2f7f4d5 "the test body", method=<optimized out>, object=0x5580d3ec8c60)
    at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3880
#7  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x5580d3ec8c60, method=<optimized out>,
    location=location@entry=0x5580d2f7f4d5 "the test body") at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
#8  0x00005580d2ef5b36 in testing::Test::Run (this=this@entry=0x5580d3ec8c60) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3973
#9  0x00005580d2ef5de5 in testing::Test::Run (this=0x5580d3ec8c60) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149
#10 testing::TestInfo::Run (this=0x5580d3ec67b0) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149
#11 0x00005580d2ef5f9c in testing::TestInfo::Run (this=<optimized out>) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4124
#12 testing::TestCase::Run (this=0x5580d3eb8a90) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4267
#13 0x00005580d2ef666b in testing::TestCase::Run (this=<optimized out>) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4253
#14 testing::internal::UnitTestImpl::RunAllTests (this=0x5580d3e7c1f0) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6633
#15 0x00005580d2f01cb0 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
    location=0x5580d2f806f0 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x5580d3e7c1f0)
    at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3880
#16 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x5580d3e7c1f0,
    method=<optimized out>, location=location@entry=0x5580d2f806f0 "auxiliary test code (environments or event listeners)")
    at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
#17 0x00005580d2ef681f in testing::UnitTest::Run (this=0x5580d3261500 <testing::UnitTest::GetInstance()::instance>)
    at ./third-party/gtest-1.8.1/fused-src/gtest/gtest.h:21103
#18 0x00005580d2a0569e in RUN_ALL_TESTS () at ./third-party/gtest-1.8.1/fused-src/gtest/gtest.h:22104
#19 main (argc=<optimized out>, argv=0x7fff3cefdc98) at db/db_compaction_test.cc:5023

Can you take a look?

@Little-Wallace
Contributor Author

@siying OK, it may conflict with some change on master.

@Little-Wallace
Contributor Author

Does this failure appear every time? I have not been able to reproduce it.

@Little-Wallace
Contributor Author

Emmmm, I have not been able to reproduce this first failure, but I guess it was caused by the IntraL0 compaction finishing so quickly that it completed before I had ingested all the SSTs. I set the SYNC_POINT in another way to solve this problem. @siying

@Little-Wallace
Contributor Author

The second failure was caused by the first one. If SleepingBackgroundTask is destructed before WakeUp() is called, the flush thread will never be woken up. So if the test fails before WakeUp(), the process hangs.

facebook-github-bot pushed a commit that referenced this pull request Nov 21, 2019
Summary:
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>

This PR fixes the unstable unit test added by #5958.
I previously set the SYNC_POINT in PickCompaction. If IntraL0Compaction was triggered, the compaction job that compacts SSTs to the base level would start instantly. If the compaction thread ran faster than the unit test's main thread, we might observe the number of files in L0 decrease.
Pull Request resolved: #6061

Differential Revision: D18642301

fbshipit-source-id: 3e4da2ee963532b6e142336951ea3f47d46df148
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Nov 26, 2019
Summary:

Our process aborted when it called `CheckConsistency`, and the information in `stderr` showed "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421`". Here are the causes of the accident I investigated.

* RocksDB will call `CheckConsistency` whenever the `MANIFEST` file is updated. It checks the sequence number interval of every file, except files that were ingested.
* When a file is ingested into RocksDB, it is assigned a global sequence number, and the minimum and maximum seqnos of the file are equal, both being the global sequence number.
* `CheckConsistency` determines whether a file was ingested by checking whether the smallest and largest seqnos of the SST file are equal.
* If IntraL0Compaction picks an SST that was just ingested and compacts it into another SST, the `smallest_seqno` of the new file will be smaller than its `largest_seqno`.
    * If more than one file was ingested before the memtable was scheduled to flush, and they are all compacted into one new SST file by `IntraL0Compaction`, the sequence interval of the new file will be included in the interval of the memtable, so `CheckConsistency` will return a `Corruption`.
    * If an SST was ingested after the memtable was scheduled to flush, it would be assigned a larger seqno than the memtable. The file was then compacted with other files (all of which were flushed before the memtable) in L0 into one file. This compaction started before the flush job of the memtable but completed after the flush job finished, so the new file produced by the compaction (call it s1) has a larger sequence number interval than the file produced by the flush (call it s2). **But some data in s1 was written into RocksDB before s2, so it is possible that some data in s2 is covered by older data in s1.** Of course, this also produces a `Corruption` because of the seqno overlap. The relationship of the files is:
    > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno  < s1.largest_seqno

So I skip picking SST files that were ingested in the function `FindIntraL0Compaction`.
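The idea of the fix can be sketched as follows. This is a minimal, hypothetical sketch: `L0File` and `PickIntraL0Inputs` are illustrative names, not RocksDB's real `FileMetaData` or `FindIntraL0Compaction` signatures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of an L0 file's seqno interval
// (illustrative only; the real RocksDB FileMetaData differs).
struct L0File {
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

// Sketch of the fix idea: when picking inputs for an intra-L0
// compaction, leave out any file whose largest seqno is not older than
// the earliest seqno still sitting in an unflushed memtable. Such a
// file (e.g. an SST ingested after the memtable became immutable) would
// give the compaction output a seqno interval overlapping the interval
// of the future flush output.
std::vector<L0File> PickIntraL0Inputs(const std::vector<L0File>& level0,
                                      uint64_t earliest_mem_seqno) {
  std::vector<L0File> inputs;
  for (const L0File& f : level0) {
    if (f.largest_seqno >= earliest_mem_seqno) {
      continue;  // newer than the oldest unflushed data: skip it
    }
    inputs.push_back(f);
  }
  return inputs;
}
```

With this filter, an SST ingested after the memtable sealed can never end up inside the intra-L0 output, so the output's interval stays disjoint from the flush output's interval.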

Here is my bug report: facebook#5913

There are two situations that can cause the check to fail.

- First we ingest five external SSTs into RocksDB, and they happen to land in L0. There is already some data in the memtable, which makes the smallest sequence number of the memtable less than that of the SSTs we ingest.

- If there has been a compaction job that compacted SSTs from L0 to L1, `LevelCompactionPicker` will trigger an `IntraL0Compaction` that compacts these five SSTs from L0 to L0. We call the result SST A, which was merged from the five ingested SSTs.

- Then some data is put into the memtable, and the memtable is flushed to L0. We call this SST B.
- RocksDB checks consistency, finds that the `smallest_seqno` of B is less than that of A, and crashes. Because A was merged from five SSTs, its smallest sequence number is less than its own largest sequence number, so RocksDB cannot tell that A was produced from ingested files.

- First we have flushed many SSTs in L0; we call them [s1, s2, s3].

- There is an immutable memtable waiting to be flushed, but because the flush threads are busy, it has not been picked yet; we call it m1. At this moment, one SST is ingested into L0; we call it s4. Because s4 was ingested after m1 became an immutable memtable, it has a larger sequence number than m1.

- m1 is flushed to L0. Because it is small, this flush job finishes quickly; we call the result s5.

- [s1, s2, s3, s4] are compacted into one SST in L0 by IntraL0Compaction. We call it s6.
  - compacted 4@0 files to L0
- When s6 is added into the MANIFEST, the corruption happens: the largest sequence number of s6 is equal to that of s4, and both are larger than that of s5. But because s1 is older than m1, the smallest sequence number of s6 is smaller than that of s5.
   - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
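The check that trips on this relationship can be illustrated with a small sketch. The struct and function names here are made up, and the logic is simplified; the real check lives in RocksDB's version-set consistency code.

```cpp
#include <cassert>
#include <cstdint>

// Simplified seqno interval of an L0 file (illustrative names).
struct L0File {
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

// L0 files are ordered newest first. For an adjacent pair, the newer
// file's seqno interval must lie strictly above the older file's,
// unless one of them looks ingested (smallest == largest, both set to
// the global seqno assigned at ingestion time).
bool SeqnoIntervalsCorrupt(const L0File& newer, const L0File& older) {
  bool newer_ingested = newer.smallest_seqno == newer.largest_seqno;
  bool older_ingested = older.smallest_seqno == older.largest_seqno;
  if (newer_ingested || older_ingested) {
    return false;  // ingestion may legally interleave seqno intervals
  }
  return newer.smallest_seqno <= older.largest_seqno;
}
```

With s6 = [1, 10] and s5 = [3, 7] as in the scenario above, s6 sorts first (larger largest seqno) and the pair is flagged, since s6.smallest_seqno <= s5.largest_seqno.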
Pull Request resolved: facebook#5958

Differential Revision: D18601316

fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Nov 26, 2019
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Nov 27, 2019
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Nov 27, 2019
wolfkdy pushed a commit to wolfkdy/rocksdb that referenced this pull request Dec 22, 2019
wolfkdy pushed a commit to wolfkdy/rocksdb that referenced this pull request Dec 22, 2019
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Jan 16, 2020
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Jan 16, 2020
facebook-github-bot pushed a commit that referenced this pull request Oct 25, 2022
…nos between ingested files and memtable's (#10777)

Summary:
**Context:**
Same as #5958 (comment), but applying the fix to the FIFO compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress

./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 
--max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35

put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file #479 with seqno 23711 29070 vs. file #482 with seqno 27138 29049
```

**Summary:**
FIFO only does intra-L0 compaction in the following four cases. In other cases, FIFO drops data instead of compacting it, which is irrelevant to the overlapping-seqno issue we are solving.
-  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
   - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
   - This path was not stress-tested at all, so we covered `fifo.allow_compaction` in the stress test to surface the overlapping-seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
  - For this path, we simply replicate the idea in #5958 (comment) and skip files whose largest seqno is greater than `earliest_mem_seqno`
  - This path was not stress-tested at all. However, covering the `age_for_warm` option warrants a separate PR to deal with db_stress compatibility, so we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
    - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles`, we simply replicate the idea in #5958 (comment) in `SanitizeCompactionInputFiles()`. To simplify the implementation, we return `Status::Aborted()` on encountering a seqno-overlapping file when compacting to L0, instead of skipping the file and proceeding with the compaction.
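The guard described in the last bullet can be sketched roughly as follows. The names are assumed, and a plain string stands in for RocksDB's `Status`; this is not the actual `SanitizeCompactionInputFiles` code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative stand-in for a compaction input file's metadata.
struct InputFile {
  uint64_t largest_seqno;
};

// For a manually requested compaction whose output level is L0: abort
// the whole request if any input file's data is not older than the
// earliest unflushed memtable entry, rather than silently dropping that
// file from the input set.
std::string SanitizeL0OutputInputs(const std::vector<InputFile>& inputs,
                                   uint64_t earliest_mem_seqno) {
  for (const InputFile& f : inputs) {
    if (f.largest_seqno >= earliest_mem_seqno) {
      return "Aborted: input overlaps unflushed memtable seqnos";
    }
  }
  return "OK";
}
```

Aborting the whole request keeps the semantics of `CompactFiles` simple: the caller asked for an exact input set, so shrinking it silently would be surprising.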

Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required

Pull Request resolved: #10777

Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value #10761  and on FIFO compaction only

Reviewed By: ajkr

Differential Revision: D40090485

Pulled By: hx235

fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
facebook-github-bot pushed a commit that referenced this pull request Nov 29, 2022
…to overla…" (#10999)

Summary:
**Context/Summary:**

This reverts commit fc74abb and related HISTORY record.

The issue with PR 10777, or the general approach of using earliest_mem_seqno as in #5958 (comment), is that the earliest seqno of each CF's memtable does not get persisted and always starts at 0 upon Recover(). Later, when creating a new memtable in a certain CF, we use the last seqno of the whole DB (not that of the CF from the previous DB session) for this CF. This leads to false-positive overlapping seqnos, and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004

Luckily, a more elegant and complete solution to the overlapping-seqno problem these PRs aim to solve does not have the above problem; see #10922. It is already being pursued and under review, so we can just revert this PR and focus on getting PR 10922 to land.

Pull Request resolved: #10999

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D41572604

Pulled By: hx235

fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
hx235 added a commit to hx235/rocksdb that referenced this pull request Nov 29, 2022
facebook-github-bot pushed a commit that referenced this pull request Dec 13, 2022
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconveniences:
-  File ingestion, and compaction involving ingested files, can create files whose seqno ranges overlap with those of existing files. `force_consistency_check=true` will catch such overlapping seqno ranges, even harmless ones.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. `force_consistency_check=true` will then think s4 and s3 have a file-reordering corruption that might cause returning an old value of k1
    - However, such a caught corruption is a false positive, since s1 and s2 cannot have keys overlapping with k1 or with anything else inserted into m1 before ingesting file s1, by the requirement of file ingestion (otherwise m1 would have been flushed before any of the file ingestions completed). Therefore there is in fact no file-reordering corruption.
- A single delete can decrease a file's largest seqno, so ordering by `largest_seqno` can introduce a wrong ordering and hence a real file-reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - After the last step, file ordering by largest seqno (">" means "newer") gives s5 > s6, while s6 contains a newer version of k1's value (i.e., k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent it from happening when ordering by `largest_seqno`

Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to ajkr, we now introduce `epoch_num`, which describes the order in which a file was flushed or ingested/imported (a compaction output file gets the minimum `epoch_num` among its input files'). This avoids the above inconveniences in the following ways:
- In the first case above, there will no longer be an overlapping-seqno-range check under `force_consistency_check=true`, but an `epoch_number` ordering check instead. This results in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this results in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), and s5 < s6 (post-compacting s4), which is a correct file ordering causing no corruption.

**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (and similarly for `WriteLevel0TableForRecovery`) and for file ingestion (except for `allow_ingest_behind=true` ingestion, which always gets assigned `kReservedEpochNumberForFileIngestedBehind`)
  - A compaction output file is assigned the minimum `epoch_number` among its input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse the file's epoch_number if it exists; if not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse the file's epoch_number if it exists; if not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning a new epoch_number to a file and adding this file to the LSM tree should be atomic. This is guaranteed by assigning the epoch_number right upon `VersionEdit::AddFile()`, where this version edit is applied to the LSM tree shape right afterwards, either while holding the db mutex (e.g., flush, file ingestion, import column family) or because there is only one ongoing edit per CF (e.g., `WriteLevel0TableForRecovery`, `Repair`).
  - Assigning the minimum input epoch number to a compaction output file won't misorder L0 files (even through a later `Refit(target_level=0)`). This is because, for every key "k" in the input range, a legitimate compaction covers a contiguous epoch-number range of that key. As long as we assign key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction, hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to the backward compatibility issue, we might encounter files with missing epoch numbers at the beginning of db recovery. We will still use the old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them until we infer their epoch numbers. See usages of `EpochNumberRequirement`.
- Remove fix #5958 (comment) and its outdated file-reordering-corruption tests, because that fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update the #5958 (comment) tests to verify the corruption is fixed using `epoch_number`, and to cover universal/FIFO compaction, `CompactRange`, and `CompactFile` cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: #10922

Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under #5958 (comment)
- [Ongoing] Compatibility test: manually run ajkr@36a5686 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value #10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9