
Fix corruption with intra-L0 on ingested files #5958

Closed
wants to merge 19 commits into from

Conversation

Little-Wallace
Contributor

@Little-Wallace Little-Wallace commented Oct 23, 2019

Problem Description

Our process aborted when it called CheckConsistency, and the information in stderr showed "L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421". Here are the causes of the failure that I investigated.

  • RocksDB calls CheckConsistency whenever the MANIFEST file is updated. It checks the sequence number interval of every file, except files that were ingested.
  • When a file is ingested into RocksDB, it is assigned a global sequence number, and the smallest and largest seqno of this file are equal: both are set to the global sequence number.
  • CheckConsistency determines whether a file was ingested by checking whether the smallest and largest seqno of the sstable file are equal.
  • If IntraL0Compaction picks an sst that was just ingested and compacts it into another sst, the smallest_seqno of the new file will be smaller than its largest_seqno.
    • If more than one file was ingested before the memtable was scheduled to flush, and they are all compacted into one new sstable file by IntraL0Compaction, the sequence interval of the new file will be contained in the interval of the memtable, so CheckConsistency will return a Corruption.
    • If an sstable is ingested after the memtable is scheduled to flush, it is assigned a larger seqno than the memtable's. That file may then be compacted with other L0 files (all flushed before the memtable) into one file, with the compaction starting before the memtable's flush job starts but completing after the flush job finishes. The new file produced by the compaction (call it s1) then has a wider sequence number interval than the file produced by the flush (call it s2). But some data in s1 was written into RocksDB before s2, so it is possible that some data in s2 is covered by older data in s1. This also causes a Corruption because of the overlap of seqno intervals. Here is the relationship of the files:

    s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno

So in FindIntraL0Compaction I skip sst files that were ingested.

Reason

Here is my bug report: #5913

There are two situations that can cause the check to fail.

First situation:

  • First we ingest five external ssts into RocksDB, and they happen to be placed in L0. There was already some data in the memtable, so the smallest sequence number of the memtable is less than that of the ssts we ingested.

  • If there has already been a compaction job that compacted ssts from L0 to L1, LevelCompactionPicker may trigger an IntraL0Compaction that compacts these five ssts from L0 to L0. We call the result sst A; it was merged from the five ingested ssts.

  • Then some data is put into the memtable, and the memtable is flushed to L0. We call this sst B.

  • RocksDB checks consistency, finds that the smallest_seqno of B is less than that of A, and crashes. Because A was merged from five ssts, its smallest sequence number is less than its largest sequence number, so RocksDB cannot tell that A was produced by ingestion.

Second situation:

  • First, we have already flushed several ssts into L0; we call them [s1, s2, s3].

  • There is an immutable memtable waiting to be flushed, but because the flush threads are busy it has not yet been picked; we call it m1. At this moment, an sst is ingested into L0; we call it s4. Because s4 is ingested after m1 became an immutable memtable, it has a larger log sequence number than m1.

  • m1 is flushed to L0. Because it is small, this flush job finishes quickly; we call the result s5.

  • [s1, s2, s3, s4] are compacted into one L0 sst by IntraL0Compaction; we call it s6.

    • compacted 4@0 files to L0
  • When s6 is added into the MANIFEST, the corruption happens: the largest sequence number of s6 equals that of s4, and both are larger than that of s5. But because s1 is older than m1, the smallest sequence number of s6 is smaller than that of s5:

    • s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
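The interval nesting above follows directly from how a compaction output's seqno interval is the union of its inputs' intervals. A toy illustration (hypothetical names, not RocksDB code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct FileMeta {
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

// The seqno interval of a compaction output covers all of its inputs:
// the min of the smallest seqnos and the max of the largest seqnos.
FileMeta MergeSeqnoInterval(const std::vector<FileMeta>& inputs) {
  FileMeta out{UINT64_MAX, 0};
  for (const FileMeta& f : inputs) {
    out.smallest_seqno = std::min(out.smallest_seqno, f.smallest_seqno);
    out.largest_seqno = std::max(out.largest_seqno, f.largest_seqno);
  }
  return out;
}
```

With illustrative seqnos s1 = [1, 3], s2 = [4, 6], s3 = [7, 9], s4 = [12, 12] and the flushed s5 = [10, 11], the merged s6 = [1, 12] strictly encloses s5, reproducing the inequality above.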

Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Contributor

@ajkr ajkr left a comment


While this does prevent L0 files from being created that would fail the L0 consistency check, it is still possible that an existing DB already contains files that fail the check. Is that a problem?

Also, we benefit quite a lot from ingested files going through intra-L0 compaction. Ingestion is really the only way our L0 file count grows quickly so it's helpful to have intra-L0 to alleviate the pressure.

The way I solved this in Cockroach is simply turn off that part of the assertion: cockroachdb#67. I don't think it's a particularly great approach as maybe that assertion is useful for catching other bugs, but it's one to consider.

edit: I did not fully understand the secondary situation when writing this reply. The approach I mentioned about turning off assertion is clearly insufficient for the corruption problem.

@siying
Contributor

siying commented Oct 23, 2019

Assume we have L0 files
[3 8] and [5 5]
Is it possible that every time we pick [3 8], [5 5] is always picked up too, and vice versa? Is that going to make the intra-L0 compaction code too complicated?

@siying
Contributor

siying commented Oct 23, 2019

Or say, expand Intra-L0 compaction in both directions until a clean cut.

@ajkr
Contributor

ajkr commented Oct 23, 2019

Hm, what if an intra-L0 is picked including [5 5] (and maybe another file like [4 4]) before [3 8] is flushed? I think then when [3 8] is flushed it'll trigger the same consistency failure.

@siying
Contributor

siying commented Oct 23, 2019

@ajkr that is a very good point...

@siying
Contributor

siying commented Oct 23, 2019

Then I think removing the assertion feels like a logical solution. One thing that worries me is that the stress test doesn't seem to run under circumstances where intra-L0 compaction happens, so we might not be able to uncover bugs like this in the future.

@ajkr
Contributor

ajkr commented Oct 23, 2019

I think after we have @sumeerbhola's fix for #5848 (if it isn't already fixed, I haven't checked recently), then we can set -ingest_external_file_one_in to an aggressively low number and it should trigger frequent intra-L0s.

@ajkr
Contributor

ajkr commented Oct 23, 2019

High file ingestion rate could be randomly turned on in db_crashtest.py so only once in a while there'd be high ingestion rate. Then we wouldn't need to add new test setups, and it wouldn't take much coverage away from other features.

@Little-Wallace
Contributor Author

Hm, what if an intra-L0 is picked including [5 5] (and maybe another file like [4 4]) before [3 8] is flushed? I think then when [3 8] is flushed it'll trigger the same consistency failure.

Yes, this is what I described in the first situation.

@siying I think intra-L0 should only pick files whose largest seqno is smaller than the smallest seqno in the imm or memtable. That is:

  • max(largest_seqno(picked file)) < smallest_seqno(imm, mem)

This solution solves the problem without disabling intra-L0 compaction for ingested files.

  • If the largest seqno of a picked file is larger than the smallest_seqno of the imm, the file only needs to wait to be picked next time, after the imm has been flushed into L0.

@riversand963
Contributor

riversand963 commented Oct 24, 2019

Hm, what if an intra-L0 is picked including [5 5] (and maybe another file like [4 4]) before [3 8] is flushed? I think then when [3 8] is flushed it'll trigger the same consistency failure.

Yes, this is what I described in the first situation.

@siying I think intra-L0 should only pick files whose largest seqno is smaller than the smallest seqno in the imm or memtable. That is:

  • max(largest_seqno(picked file)) < smallest_seqno(imm, mem)

This solution solves the problem without disabling intra-L0 compaction for ingested files.

  • If the largest seqno of a picked file is larger than the smallest_seqno of the imm, the file only needs to wait to be picked next time, after the imm has been flushed into L0.

Thanks @Little-Wallace for the proposal. I chatted with @siying and I have a question. Let me use an example.

Mem                 [4, 5]
L0         [1, 3]                 [6, 6]    [7, 7]
L1

Originally we have L0 [1, 3] and memtable [4, 5], in which 1, 3, 4, 5 are sequence numbers. Then we ingest [6, 6] and [7, 7]. The external SSTs do not have overlapping keys with [4, 5], but they overlap with [1, 3]. Suppose no further flush is triggered. Is it possible for [6, 6] and [7, 7] to be compacted by intra-L0 compaction and then moved to L1? Would this be a problem if we search for a key that exists in both [1, 3] and [6, 6]? Just asking because I am currently not fully familiar with all the compaction boundary-picking logic.

@Little-Wallace
Contributor Author

@riversand963 IntraL0Compaction only moves files within L0, not to L1. And a lookup of a key only sees its latest version, regardless of how many ssts contain it.

Contributor

@petermattis petermattis left a comment


The approach looks sane to me. My only major comment is about adjustments to the logic in FindIntraL0Compaction. Glad to see the added test, though I didn't get a chance to look at that in detail.

@@ -391,6 +391,11 @@ void SuperVersion::Init(MemTable* new_mem, MemTableListVersion* new_imm,
refs.store(1, std::memory_order_relaxed);
}

SequenceNumber SuperVersion::GetSmallestMemSeqno() const {
Contributor

Nit: I think this method should be named GetEarliestMemTableSequenceNumber for consistency with the naming of similar methods elsewhere. Also, this looks very similar to DBImpl::GetEarliestMemTableSequenceNumber.

PS I wouldn't mind if all of the uses of earliest were switched to smallest for consistency with FileMetadata::{smallest,largest}_seqno, but I'll leave it to the FB RocksDB folks to decide if they want to do that.

@@ -929,8 +934,9 @@ bool ColumnFamilyData::NeedsCompaction() const {

Compaction* ColumnFamilyData::PickCompaction(
const MutableCFOptions& mutable_options, LogBuffer* log_buffer) {
SequenceNumber oldest_mem_seqno = super_version_->GetSmallestMemSeqno();
Contributor

Nit: now we have 3 terms in play: earliest, smallest, and oldest. For consistency, I think this should be earliest_mem_seqno.

Contributor Author

Ok, I'll fix it.

@@ -52,7 +59,9 @@ bool FindIntraL0Compaction(const std::vector<FileMetaData*>& level_files,
compact_bytes += static_cast<size_t>(level_files[span_len]->fd.file_size);
compensated_compact_bytes += level_files[span_len]->compensated_file_size;
new_compact_bytes_per_del_file = compact_bytes / span_len;
if (level_files[span_len]->being_compacted ||
if ((level_files[0]->fd.smallest_seqno == level_files[0]->fd.largest_seqno &&
level_files[span_len]->fd.largest_seqno >= oldest_mem_seqno) ||
Contributor

Is this check necessary? My understanding is that level_files is sorted in descending largest_seqno order. So if the check at the start of this method succeeded, the check here must also be true. Am I missing something?

I wasn't aware that intra-L0 compactions always started with the newest sstable in L0. Rather than the approach taken here, could we instead skip the prefix of level_files that contains sstables with largest_seqno >= oldest_mem_seqno and then build a span from there? Cc @ajkr.

Contributor Author

Sorry, I made a mistake. My intention is to check every file, rather than only level_files[0].

SequenceNumber oldest_mem_seqno) {

// Do not pick an ingested file when there is at least one unflushed memtable whose seqno range overlaps with the sst.
if (level_files[0]->fd.smallest_seqno == level_files[0]->fd.largest_seqno &&
Contributor

Would it be better to remove this conditional? Is there ever a case where we want to pick an sstable for intra-L0 compaction that has a largest seqno that is larger than the smallest unflushed seqno?

Contributor Author

Yes. I explained it in my description. If the sstable was produced by ingestion and its keys did not overlap with the memtable, it uses the global seqno, which can fall right inside the [smallest seqno, largest seqno] interval of the memtable.

@@ -39,7 +39,14 @@ bool FindIntraL0Compaction(const std::vector<FileMetaData*>& level_files,
size_t min_files_to_compact,
uint64_t max_compact_bytes_per_del_file,
uint64_t max_compaction_bytes,
CompactionInputFiles* comp_inputs) {
CompactionInputFiles* comp_inputs,
SequenceNumber oldest_mem_seqno) {
Contributor

Another naming nit: smallest_unflushed_seqno would make it clearer that it is the fact that this seqno is unflushed which is important.

CompactionInputFiles* comp_inputs,
SequenceNumber oldest_mem_seqno) {

// Do not pick an ingested file when there is at least one unflushed memtable whose seqno range overlaps with the sst.
Contributor

Do you think it would work to pick the span of files for intra-L0 starting at the smallest index i for which level_files[i]->fd.largest_seqno < oldest_mem_seqno?

Contributor

That's a nicer phrasing of a suggestion I had in a comment. I don't see anything wrong with this right now, but I'd like to think harder about it.

Contributor

Sorry I loaded the page before you reviewed and didn't notice the repetition. I haven't thought of any problems with it. I also plan to experiment with relaxing the restriction more generally on intra-L0 picking from the newest file as it might be related to our imports getting stuck.

Contributor Author

I would try it.

Contributor

This approach looks good to me.

Contributor

Wrap the line to follow the 80-char rule. If possible, please try to run "make format".

Contributor

@yiwu-arbug yiwu-arbug Oct 29, 2019

@ajkr It seems to me it could still order the output to be newer than L0_(i-1) if L0_i is an ingested file and L0_i->largest_seqno < oldest_mem_seqno but L0_i->largest_seqno > L0_(i-1)->largest_seqno.

@facebook-github-bot
Contributor

This pull request has been merged in ec3e3c3.

Little-Wallace added a commit to Little-Wallace/rocksdb that referenced this pull request Nov 20, 2019
…5958)

@siying
Contributor

siying commented Nov 20, 2019

btw, I saw this failure once:

Note: Google Test filter = DBCompactionTestWithParam/DBCompactionTestWithParam.FlushAfterL0IntraCompactionCheckConsistencyFail/0
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTestWithParam/DBCompactionTestWithParam
[ RUN      ] DBCompactionTestWithParam/DBCompactionTestWithParam.FlushAfterL0IntraCompactionCheckConsistencyFail/0
db/db_compaction_test.cc:4921: Failure
Expected equality of these values:
  i + 1
    Which is: 10
  NumTableFilesAtLevel(0)
    Which is: 5
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/db_compaction_test.cc:4921: Failure
Expected equality of these values:
  i + 1
    Which is: 10
  NumTableFilesAtLevel(0)
    Which is: 5

It doesn't reproduce, though. I wonder whether you have any thoughts on what the reason could be.

@siying
Contributor

siying commented Nov 20, 2019

And another run is hanging:

(gdb) thread apply all bt

Thread 7 (Thread 0x7fca00ff9700 (LWP 15225)):
#0  0x00007fca155e7f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fca00ff8a30, expected=0,
    futex_word=0x5580d3ee3f98) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  __pthread_cond_wait_common (abstime=0x7fca00ff8a30, mutex=0x5580d3ee3f28, cond=0x5580d3ee3f70) at pthread_cond_wait.c:539
#2  __pthread_cond_timedwait (cond=cond@entry=0x5580d3ee3f70, mutex=0x5580d3ee3f28, abstime=abstime@entry=0x7fca00ff8a30)
    at pthread_cond_wait.c:667
#3  0x00005580d2d191e9 in rocksdb::port::CondVar::TimedWait (this=this@entry=0x5580d3ee3f70, abs_time_us=<optimized out>)
    at port/port_posix.cc:120
#4  0x00005580d2ce5154 in rocksdb::InstrumentedCondVar::TimedWaitInternal (this=0x5580d3ee3f70, abs_time_us=<optimized out>)
    at monitoring/instrumented_mutex.cc:66
#5  0x00005580d2ce522d in rocksdb::InstrumentedCondVar::TimedWait (this=this@entry=0x5580d3ee3f70,
    abs_time_us=abs_time_us@entry=1574277484269966) at monitoring/instrumented_mutex.cc:55
#6  0x00005580d2b6d97b in rocksdb::RepeatableThread::wait (delay=<optimized out>, this=0x5580d3ee3ed0) at ./util/repeatable_thread.h:90
#7  rocksdb::RepeatableThread::thread (this=0x5580d3ee3ed0) at ./util/repeatable_thread.h:126
#8  rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}::operator()() const (__closure=<optimized out>)
    at ./util/repeatable_thread.h:38
#9  std::__invoke_impl<void, rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}>(std::__invoke_other, rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}&&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#10 std::__invoke<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}>(std::__invoke_result&&, (rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#11 std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=<optimized out>) at /usr/include/c++/9/thread:244
#12 std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> >::operator()() (
    this=<optimized out>) at /usr/include/c++/9/thread:251
#13 std::thread::_State_impl<std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> > >::_M_run() (this=<optimized out>) at /usr/include/c++/9/thread:195
#14 0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#15 0x00007fca155e16db in start_thread (arg=0x7fca00ff9700) at pthread_create.c:463
#16 0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7fc9f67fc700 (LWP 15224)):
#0  0x00007fca155e7f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fc9f67fba30, expected=0,
    futex_word=0x5580d3ee1238) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  __pthread_cond_wait_common (abstime=0x7fc9f67fba30, mutex=0x5580d3ee11c8, cond=0x5580d3ee1210) at pthread_cond_wait.c:539
#2  __pthread_cond_timedwait (cond=cond@entry=0x5580d3ee1210, mutex=0x5580d3ee11c8, abstime=abstime@entry=0x7fc9f67fba30)
    at pthread_cond_wait.c:667
#3  0x00005580d2d191e9 in rocksdb::port::CondVar::TimedWait (this=this@entry=0x5580d3ee1210, abs_time_us=<optimized out>)
    at port/port_posix.cc:120
#4  0x00005580d2ce5154 in rocksdb::InstrumentedCondVar::TimedWaitInternal (this=0x5580d3ee1210, abs_time_us=<optimized out>)
    at monitoring/instrumented_mutex.cc:66
#5  0x00005580d2ce522d in rocksdb::InstrumentedCondVar::TimedWait (this=this@entry=0x5580d3ee1210,
    abs_time_us=abs_time_us@entry=1574277484296472) at monitoring/instrumented_mutex.cc:55
#6  0x00005580d2b6d97b in rocksdb::RepeatableThread::wait (delay=<optimized out>, this=0x5580d3ee1170) at ./util/repeatable_thread.h:90
#7  rocksdb::RepeatableThread::thread (this=0x5580d3ee1170) at ./util/repeatable_thread.h:126
#8  rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}::operator()() const (__closure=<optimized out>)
    at ./util/repeatable_thread.h:38
#9  std::__invoke_impl<void, rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}>(std::__invoke_other, rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}&&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#10 std::__invoke<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}>(std::__invoke_result&&, (rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}&&)...) (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#11 std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=<optimized out>) at /usr/include/c++/9/thread:244
#12 std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> >::operator()() (
    this=<optimized out>) at /usr/include/c++/9/thread:251
#13 std::thread::_State_impl<std::thread::_Invoker<std::tuple<rocksdb::RepeatableThread::RepeatableThread(std::function<void ()>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, unsigned long, unsigned long)::{lambda()#1}> > >::_M_run() (this=<optimized out>) at /usr/include/c++/9/thread:195
#14 0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#15 0x00007fca155e16db in start_thread (arg=0x7fc9f67fc700) at pthread_create.c:463
#16 0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fc9f77fe700 (LWP 15142)):
#0  0x00007fca155e79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5580d3ecd7c8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580d3ecd6c0, cond=0x5580d3ecd7a0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x5580d3ecd7a0, mutex=0x5580d3ecd6c0) at pthread_cond_wait.c:655
#3  0x00005580d2d19169 in rocksdb::port::CondVar::Wait (this=this@entry=0x5580d3ecd7a0) at port/port_posix.cc:106
#4  0x00005580d2ce4f4f in rocksdb::InstrumentedCondVar::WaitInternal (this=0x5580d3ecd7a0) at monitoring/instrumented_mutex.cc:48
#5  rocksdb::InstrumentedCondVar::Wait (this=this@entry=0x5580d3ecd7a0) at monitoring/instrumented_mutex.cc:41
#6  0x00005580d2cc3f40 in rocksdb::DeleteScheduler::BackgroundEmptyTrash (this=0x5580d3ecd6a8) at file/delete_scheduler.cc:194
#7  0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fca155e16db in start_thread (arg=0x7fc9f77fe700) at pthread_create.c:463
#9  0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fca13224700 (LWP 15141)):
#0  0x00007fca155e79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5580d3ec9930)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580d3ec98e0, cond=0x5580d3ec9908) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5580d3ec9908, mutex=0x5580d3ec98e0) at pthread_cond_wait.c:655
#3  0x00007fca14c98f0c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00005580d2dccf25 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=0x5580d3ec9870, thread_id=1) at util/threadpool_imp.cc:196
#5  0x00005580d2dcd25e in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x5580d3ee0a80) at util/threadpool_imp.cc:306
#6  0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fca155e16db in start_thread (arg=0x7fca13224700) at pthread_create.c:463
#8  0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fca13a25700 (LWP 15078)):
#0  0x00007fca155e79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fff3cefd388)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fff3cefd330, cond=0x7fff3cefd360) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x7fff3cefd360, mutex=0x7fff3cefd330) at pthread_cond_wait.c:655
#3  0x00005580d2d19169 in rocksdb::port::CondVar::Wait (this=0x7fff3cefd360) at port/port_posix.cc:106
#4  0x00005580d2a97b18 in rocksdb::test::SleepingBackgroundTask::DoSleep (this=0x7fff3cefd330) at ./test_util/testutil.h:411
#5  rocksdb::test::SleepingBackgroundTask::DoSleepTask (arg=0x7fff3cefd330) at ./test_util/testutil.h:446
#6  0x00005580d2dcd0cb in std::function<void ()>::operator()() const (this=0x7fca13a24b00) at /usr/include/c++/9/bits/std_function.h:683
#7  rocksdb::ThreadPoolImpl::Impl::BGThread (this=0x5580d3ec9b60, thread_id=0) at util/threadpool_imp.cc:265
#8  0x00005580d2dcd25e in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x5580d3ecc200) at util/threadpool_imp.cc:306
#9  0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007fca155e16db in start_thread (arg=0x7fca13a25700) at pthread_create.c:463
#11 0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fca14226700 (LWP 15077)):
#0  0x00007fca155e79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5580d3ec9930)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580d3ec98e0, cond=0x5580d3ec9908) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5580d3ec9908, mutex=0x5580d3ec98e0) at pthread_cond_wait.c:655
#3  0x00007fca14c98f0c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00005580d2dccf25 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=0x5580d3ec9870, thread_id=0) at util/threadpool_imp.cc:196
#5  0x00005580d2dcd25e in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x5580d3ecc060) at util/threadpool_imp.cc:306
#6  0x00007fca14c9e870 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fca155e16db in start_thread (arg=0x7fca14226700) at pthread_create.c:463
#8  0x00007fca1434888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fca15c12a80 (LWP 15073)):
#0  0x00007fca155e7449 in futex_wait (private=<optimized out>, expected=12, futex_word=0x7fff3cefd384)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=12, futex_word=0x7fff3cefd384) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_cond_destroy (cond=0x7fff3cefd360) at pthread_cond_destroy.c:54
#3  0x00005580d2d19129 in rocksdb::port::CondVar::~CondVar (this=<optimized out>, __in_chrg=<optimized out>) at port/port_posix.cc:100
#4  0x00005580d29b7c14 in rocksdb::test::SleepingBackgroundTask::~SleepingBackgroundTask (this=0x7fff3cefd330, __in_chrg=<optimized out>)
    at ./test_util/testutil.h:394
#5  rocksdb::DBCompactionTestWithParam_IntraL0CompactionAfterFlushCheckConsistencyFail_Test::TestBody (this=<optimized out>)
    at db/db_compaction_test.cc:4981
#6  0x00005580d2f01730 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (
    location=0x5580d2f7f4d5 "the test body", method=<optimized out>, object=0x5580d3ec8c60)
    at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3880
#7  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x5580d3ec8c60, method=<optimized out>,
    location=location@entry=0x5580d2f7f4d5 "the test body") at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
#8  0x00005580d2ef5b36 in testing::Test::Run (this=this@entry=0x5580d3ec8c60) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3973
#9  0x00005580d2ef5de5 in testing::Test::Run (this=0x5580d3ec8c60) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149
#10 testing::TestInfo::Run (this=0x5580d3ec67b0) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149
#11 0x00005580d2ef5f9c in testing::TestInfo::Run (this=<optimized out>) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4124
#12 testing::TestCase::Run (this=0x5580d3eb8a90) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4267
#13 0x00005580d2ef666b in testing::TestCase::Run (this=<optimized out>) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4253
#14 testing::internal::UnitTestImpl::RunAllTests (this=0x5580d3e7c1f0) at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6633
#15 0x00005580d2f01cb0 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
    location=0x5580d2f806f0 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x5580d3e7c1f0)
    at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3880
#16 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x5580d3e7c1f0,
    method=<optimized out>, location=location@entry=0x5580d2f806f0 "auxiliary test code (environments or event listeners)")
    at third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
#17 0x00005580d2ef681f in testing::UnitTest::Run (this=0x5580d3261500 <testing::UnitTest::GetInstance()::instance>)
    at ./third-party/gtest-1.8.1/fused-src/gtest/gtest.h:21103
#18 0x00005580d2a0569e in RUN_ALL_TESTS () at ./third-party/gtest-1.8.1/fused-src/gtest/gtest.h:22104
#19 main (argc=<optimized out>, argv=0x7fff3cefdc98) at db/db_compaction_test.cc:5023

Can you take a look?

@Little-Wallace
Contributor Author

@siying OK, it may conflict with some change on master.

@Little-Wallace
Contributor Author

Does this failure appear every time? I have not been able to reproduce it.

@Little-Wallace
Contributor Author

Emmmm, I have not been able to reproduce this first failure, but I guess it was caused by the IntraL0 compaction finishing so quickly that it completed before I had ingested all the SSTs. I set the SYNC_POINT in another way to solve this problem. @siying

@Little-Wallace
Contributor Author

The second failure was caused by the first one. If SleepingBackgroundTask is destructed before WakeUp() is called, the flush thread will never be woken up. So if the test fails before WakeUp(), the process hangs.

facebook-github-bot pushed a commit that referenced this pull request Nov 21, 2019
Summary:
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>

This PR fixes the unstable unit test added by #5958.
I previously set the SYNC_POINT in PickCompaction. If IntraL0Compaction was triggered, the compaction job that compacts SSTs to the base level would start instantly. If the compaction thread ran faster than the unit test's main thread, we might observe the number of files in L0 decrease.
Pull Request resolved: #6061

Differential Revision: D18642301

fbshipit-source-id: 3e4da2ee963532b6e142336951ea3f47d46df148
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Nov 26, 2019
Summary:

Our process aborted when it called `CheckConsistency`, and the information in `stderr` showed "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421`". Here are the causes of the accident I investigated.

* RocksDB will call `CheckConsistency` whenever the `MANIFEST` file is updated. It checks the sequence number interval of every file, except files that were ingested.
* When a file is ingested into RocksDB, it is assigned a global sequence number, and the minimum and maximum seqnos of the file are equal, both being the global sequence number.
* `CheckConsistency` determines whether a file was ingested by checking whether the smallest and largest seqnos of the SST file are equal.
* If IntraL0Compaction picks an SST that was just ingested and compacts it into another SST, the `smallest_seqno` of the new file will be smaller than its `largest_seqno`.
    * If more than one file was ingested before the memtable was scheduled to flush, and they are all compacted into one new SST file by `IntraL0Compaction`, the sequence interval of the new file will be included in the interval of the memtable, so `CheckConsistency` will return a `Corruption`.
    * If an SST was ingested after the memtable was scheduled to flush, it would be assigned a larger seqno than the memtable. The file was then compacted with other files (all of which were flushed before the memtable) in L0 into one file. This compaction started before the flush job of the memtable but completed after the flush job finished, so the new file produced by the compaction (call it s1) has a larger sequence number interval than the file produced by the flush (call it s2). **But some data in s1 was written into RocksDB before s2, so it is possible that some data in s2 is covered by older data in s1.** Of course, this also produces a `Corruption` because of the seqno overlap. The relationship of the files is:
    > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno  < s1.largest_seqno

So I skip picking SST files that were ingested in the function `FindIntraL0Compaction`.
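The idea of the fix can be sketched as follows. This is a minimal, hypothetical sketch: `L0File` and `PickIntraL0Inputs` are illustrative names, not RocksDB's real `FileMetaData` or `FindIntraL0Compaction` signatures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of an L0 file's seqno interval
// (illustrative only; the real RocksDB FileMetaData differs).
struct L0File {
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

// Sketch of the fix idea: when picking inputs for an intra-L0
// compaction, leave out any file whose largest seqno is not older than
// the earliest seqno still sitting in an unflushed memtable. Such a
// file (e.g. an SST ingested after the memtable became immutable) would
// give the compaction output a seqno interval overlapping the interval
// of the future flush output.
std::vector<L0File> PickIntraL0Inputs(const std::vector<L0File>& level0,
                                      uint64_t earliest_mem_seqno) {
  std::vector<L0File> inputs;
  for (const L0File& f : level0) {
    if (f.largest_seqno >= earliest_mem_seqno) {
      continue;  // newer than the oldest unflushed data: skip it
    }
    inputs.push_back(f);
  }
  return inputs;
}
```

With this filter, an SST ingested after the memtable sealed can never end up inside the intra-L0 output, so the output's interval stays disjoint from the flush output's interval.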

Here is my bug report: facebook#5913

There are two situations that can cause the check to fail.

- First we ingest five external SSTs into RocksDB, and they happen to land in L0. There is already some data in the memtable, which makes the smallest sequence number of the memtable less than that of the SSTs we ingest.

- If there has been a compaction job that compacted SSTs from L0 to L1, `LevelCompactionPicker` will trigger an `IntraL0Compaction` that compacts these five SSTs from L0 to L0. We call the result SST A, which was merged from the five ingested SSTs.

- Then some data is put into the memtable, and the memtable is flushed to L0. We call this SST B.
- RocksDB checks consistency, finds that the `smallest_seqno` of B is less than that of A, and crashes. Because A was merged from five SSTs, its smallest sequence number is less than its own largest sequence number, so RocksDB cannot tell that A was produced from ingested files.

- First we have flushed many SSTs in L0; we call them [s1, s2, s3].

- There is an immutable memtable waiting to be flushed, but because the flush threads are busy, it has not been picked yet; we call it m1. At this moment, one SST is ingested into L0; we call it s4. Because s4 was ingested after m1 became an immutable memtable, it has a larger sequence number than m1.

- m1 is flushed to L0. Because it is small, this flush job finishes quickly; we call the result s5.

- [s1, s2, s3, s4] are compacted into one SST in L0 by IntraL0Compaction. We call it s6.
  - compacted 4@0 files to L0
- When s6 is added into the MANIFEST, the corruption happens: the largest sequence number of s6 is equal to that of s4, and both are larger than that of s5. But because s1 is older than m1, the smallest sequence number of s6 is smaller than that of s5.
   - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
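The check that trips on this relationship can be illustrated with a small sketch. The struct and function names here are made up, and the logic is simplified; the real check lives in RocksDB's version-set consistency code.

```cpp
#include <cassert>
#include <cstdint>

// Simplified seqno interval of an L0 file (illustrative names).
struct L0File {
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

// L0 files are ordered newest first. For an adjacent pair, the newer
// file's seqno interval must lie strictly above the older file's,
// unless one of them looks ingested (smallest == largest, both set to
// the global seqno assigned at ingestion time).
bool SeqnoIntervalsCorrupt(const L0File& newer, const L0File& older) {
  bool newer_ingested = newer.smallest_seqno == newer.largest_seqno;
  bool older_ingested = older.smallest_seqno == older.largest_seqno;
  if (newer_ingested || older_ingested) {
    return false;  // ingestion may legally interleave seqno intervals
  }
  return newer.smallest_seqno <= older.largest_seqno;
}
```

With s6 = [1, 10] and s5 = [3, 7] as in the scenario above, s6 sorts first (larger largest seqno) and the pair is flagged, since s6.smallest_seqno <= s5.largest_seqno.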
Pull Request resolved: facebook#5958

Differential Revision: D18601316

fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Nov 26, 2019
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Nov 27, 2019
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Nov 27, 2019
wolfkdy pushed a commit to wolfkdy/rocksdb that referenced this pull request Dec 22, 2019
wolfkdy pushed a commit to wolfkdy/rocksdb that referenced this pull request Dec 22, 2019
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Jan 16, 2020
yiwu-arbug pushed a commit to tikv/rocksdb that referenced this pull request Jan 16, 2020
facebook-github-bot pushed a commit that referenced this pull request Oct 25, 2022
…nos between ingested files and memtable's (#10777)

Summary:
**Context:**
Same as #5958 (comment), but applying the fix to the FIFO compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress

./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 
--max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35

put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file #479 with seqno 23711 29070 vs. file #482 with seqno 27138 29049
```

**Summary:**
FIFO only does intra-L0 compaction in the following four cases. In other cases, FIFO drops data instead of compacting it, which is irrelevant to the overlapping-seqno issue we are solving.
-  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
   - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
   - This path was not stress-tested at all, so we covered `fifo.allow_compaction` in the stress test to surface the overlapping-seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
  - For this path, we simply replicate the idea in #5958 (comment) and skip files whose largest seqno is greater than `earliest_mem_seqno`
  - This path was not stress-tested at all. However, covering the `age_for_warm` option warrants a separate PR to deal with db_stress compatibility, so we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
    - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles`, we simply replicate the idea in #5958 (comment) in `SanitizeCompactionInputFiles()`. To simplify the implementation, we return `Status::Aborted()` on encountering a seqno-overlapping file when compacting to L0, instead of skipping the file and proceeding with the compaction.
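The guard described in the last bullet can be sketched roughly as follows. The names are assumed, and a plain string stands in for RocksDB's `Status`; this is not the actual `SanitizeCompactionInputFiles` code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative stand-in for a compaction input file's metadata.
struct InputFile {
  uint64_t largest_seqno;
};

// For a manually requested compaction whose output level is L0: abort
// the whole request if any input file's data is not older than the
// earliest unflushed memtable entry, rather than silently dropping that
// file from the input set.
std::string SanitizeL0OutputInputs(const std::vector<InputFile>& inputs,
                                   uint64_t earliest_mem_seqno) {
  for (const InputFile& f : inputs) {
    if (f.largest_seqno >= earliest_mem_seqno) {
      return "Aborted: input overlaps unflushed memtable seqnos";
    }
  }
  return "OK";
}
```

Aborting the whole request keeps the semantics of `CompactFiles` simple: the caller asked for an exact input set, so shrinking it silently would be surprising.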

Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required

Pull Request resolved: #10777

Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value #10761  and on FIFO compaction only

Reviewed By: ajkr

Differential Revision: D40090485

Pulled By: hx235

fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
facebook-github-bot pushed a commit that referenced this pull request Nov 29, 2022
…to overla…" (#10999)

Summary:
**Context/Summary:**

This reverts commit fc74abb and related HISTORY record.

The issue with PR 10777, or the general approach of using earliest_mem_seqno as in #5958 (comment), is that the earliest seqno of each CF's memtable does not get persisted and always starts at 0 upon Recover(). Later, when creating a new memtable in a certain CF, we use the last seqno of the whole DB (not that of the CF from the previous DB session) for this CF. This leads to false-positive overlapping seqnos, and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004

Luckily, a more elegant and complete solution to the overlapping-seqno problem these PRs aim to solve does not have the above problem; see #10922. It is already being pursued and under review, so we can just revert this PR and focus on getting PR 10922 to land.

Pull Request resolved: #10999

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D41572604

Pulled By: hx235

fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
hx235 added a commit to hx235/rocksdb that referenced this pull request Nov 29, 2022
facebook-github-bot pushed a commit that referenced this pull request Dec 13, 2022
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconveniences:
-  File ingestion, and compaction involving ingested files, can create files whose seqno ranges overlap with those of existing files. `force_consistency_check=true` will catch such overlapping seqno ranges, even harmless ones.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. `force_consistency_check=true` will then think s4 and s3 have a file-reordering corruption that might cause returning an old value of k1
    - However, such a caught corruption is a false positive, since s1 and s2 cannot have keys overlapping with k1 or with anything else inserted into m1 before ingesting file s1, by the requirement of file ingestion (otherwise m1 would have been flushed before any of the file ingestions completed). Therefore there is in fact no file-reordering corruption.
- A single delete can decrease a file's largest seqno, so ordering by `largest_seqno` can introduce a wrong ordering and hence a real file-reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - After the last step, file ordering by largest seqno (">" means "newer") gives s5 > s6, while s6 contains a newer version of k1's value (i.e., k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent it from happening when ordering by `largest_seqno`

Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to ajkr, we now introduce `epoch_num`, which describes the order in which a file was flushed or ingested/imported (a compaction output file gets the minimum `epoch_num` among its input files'). This avoids the above inconveniences in the following ways:
- In the first case above, there will no longer be an overlapping-seqno-range check under `force_consistency_check=true`, but an `epoch_number` ordering check instead. This results in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this results in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), and s5 < s6 (post-compacting s4), which is a correct file ordering causing no corruption.

**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (and similarly for `WriteLevel0TableForRecovery`) and for file ingestion (except for `allow_ingest_behind=true` ingestion, which always gets assigned `kReservedEpochNumberForFileIngestedBehind`)
  - A compaction output file is assigned the minimum `epoch_number` among its input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse the file's epoch_number if it exists; if not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse the file's epoch_number if it exists; if not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning a new epoch_number to a file and adding this file to the LSM tree should be atomic. This is guaranteed by assigning the epoch_number right upon `VersionEdit::AddFile()`, where this version edit is applied to the LSM tree shape right afterwards, either while holding the db mutex (e.g., flush, file ingestion, import column family) or because there is only one ongoing edit per CF (e.g., `WriteLevel0TableForRecovery`, `Repair`).
  - Assigning the minimum input epoch number to a compaction output file won't misorder L0 files (even through a later `Refit(target_level=0)`). This is because, for every key "k" in the input range, a legitimate compaction covers a contiguous epoch-number range of that key. As long as we assign key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction, hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to the backward compatibility issue, we might encounter files with missing epoch numbers at the beginning of db recovery. We will still use the old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them until we infer their epoch numbers. See usages of `EpochNumberRequirement`.
- Remove fix #5958 (comment) and its outdated file-reordering-corruption tests, because that fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update the #5958 (comment) tests to verify the corruption is fixed using `epoch_number`, and to cover universal/FIFO compaction, `CompactRange`, and `CompactFile` cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: #10922

Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under #5958 (comment)
- [Ongoing] Compatibility test: manually run ajkr@36a5686 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value #10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9