
bug with DeleteRange #2752

Closed
zhangjinpeng87 opened this issue Aug 17, 2017 · 34 comments

Comments

@zhangjinpeng87
Contributor

When I use DeleteRange, after compaction some keys that had been deleted by Delete reappear, and some keys not covered by any DeleteRange disappear.
For the same data we have 3 replicas; 1 replica is wrong and the others are correct.
I can't find out where the problem is. @ajkr

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 19, 2017

In our case, each region (key range) has 3 replicas distributed on 3 different nodes.

The following is how we encountered this problem:

11:46:06 calculate checksum for region 3387, all 3 replicas are equal
write nothing for region 3387
12:11:00 calculate checksum for region 3387 again, the checksum for the replica at node 2 is different

We then iterated over all the keys of all 3 replicas and found that the replica at node 2 is missing 200 thousand kv pairs, while 6 thousand kv pairs that should have been deleted reappeared.

The following are the key events on node 2:

from 11:45:35 to 11:45:53 we wrote 142 delete_ranges (not including region 3387)
11:45:38 Level-0 flush table #55117
11:45:41 Compacting 6@0 + 130@1 files to L1 (include 55117.sst)
11:47:51 Compacted 6@0 + 130@1 files to L1

11:46:20 Level-0 flush table #55204
11:46:27 Compacting 4@0 files to L0 (include 55204.sst)
11:46:42 Compacted 4@0 files to L0 (Generated table #55222)
11:53:39 Compacting 36@0 + 128@1 files to L1 (include 55222.sst)
11:56:47 Compacted 36@0 + 128@1 files to L1 

@zhangjinpeng87
Contributor Author

SST files in levels greater than 0 may also overlap at a border user key.
I.e., given a delete range [A, F): when compaction outputs the first file 1.sst and its largest key is D, we add the delete range [A, F) into this SST and reset its largest key to E (the next SST's smallest key) with seqno = kMaxSequenceNumber, type = kTypeRangeDeletion.
When generating 2.sst we also add the delete range [A, F) into it, and reset its smallest key to E with seqno = 0, type = kTypeRangeDeletion.
I don't know whether this situation can cause reading a wrong value for this border key.
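To make the boundary bookkeeping above concrete, here is a minimal sketch of the resulting file metadata (hypothetical types; the constants match RocksDB's 56-bit seqno space and type tags, but this is not the actual FileMetaData code):

#include <cstdint>
#include <string>

constexpr uint64_t kMaxSequenceNumber = 72057594037927935ULL;  // (1 << 56) - 1
enum ValueType : uint8_t { kTypeValue = 0x1, kTypeRangeDeletion = 0xF };

struct InternalKey {
  std::string user_key;
  uint64_t seqno;
  ValueType type;
};

struct FileBoundaries {
  InternalKey smallest, largest;
};

int main() {
  // A tombstone [A, F) straddles two output files split at user key E.
  // 1.sst's largest key (originally D) is extended to
  // (E, kMaxSequenceNumber, kTypeRangeDeletion) ...
  FileBoundaries sst1{{"A", 100, kTypeRangeDeletion},
                      {"E", kMaxSequenceNumber, kTypeRangeDeletion}};
  // ... while 2.sst's smallest key is set to (E, 0, kTypeRangeDeletion);
  // within a user key, internal keys sort by descending seqno, so the two
  // files touch at E without being out of order.
  FileBoundaries sst2{{"E", 0, kTypeRangeDeletion},
                      {"F", 100, kTypeRangeDeletion}};
  (void)sst1;
  (void)sst2;
}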

@siddontang
Contributor

/cc @ajkr

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 21, 2017

I found the bug: a delete range can wrongly delete keys.

Firstly, when added to an SST, a DeleteRange is never split, so one DeleteRange may be added to multiple SST files.
Secondly, although the DeleteRange is added to an SST, the range of the DeleteRange may be broader than the SST's key range.
Thirdly, when the output level is the bottommost and the oldest snapshot is larger than a key's sequence number, the key's sequence number is set to zero for better compression.

  1. Write a delete range [a, d).
  2. After compaction, this delete range may be added to multiple SST files: a.sst, b.sst, c.sst, whose ranges are [a, b), [b, c), [c, d).
  3. When a.sst and b.sst reach the bottommost level, the delete range in these two SST files is dropped.
  4. Write new keys in the range [a, c).
  5. When the newly written keys in [a, c) reach the bottommost level, their sequence numbers are set to zero.
  6. When c.sst, which still carries the [a, d) tombstone, is compacted with these zero-seqno keys in [a, c), the newly written keys are wrongly dropped.
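For illustration, the user-visible half of this scenario is roughly the following sequence of calls (a minimal sketch; steps 2-6 happen inside compaction and depend on timing, so this does not deterministically reproduce the bug):

#include <assert.h>

#include <rocksdb/db.h>

using namespace rocksdb;

int main() {
  Options opts;
  opts.create_if_missing = true;
  DB* db = nullptr;
  Status s = DB::Open(opts, "rocksdb", &db);
  assert(s.ok());
  auto cf = db->DefaultColumnFamily();

  // Step 1: write a range tombstone covering [a, d).
  s = db->DeleteRange(WriteOptions(), cf, Slice("a"), Slice("d"));
  assert(s.ok());

  // Steps 2-3 happen inside compaction: the tombstone is copied into
  // a.sst, b.sst and c.sst, then dropped from a.sst and b.sst once
  // they reach the bottommost level.

  // Step 4: write new keys inside [a, c).
  s = db->Put(WriteOptions(), cf, Slice("a1"), Slice("v1"));
  assert(s.ok());
  s = db->Put(WriteOptions(), cf, Slice("b1"), Slice("v2"));
  assert(s.ok());

  // Steps 5-6: once these keys reach the bottommost level their seqnos
  // are zeroed; if c.sst still carries the stale [a, d) tombstone, a
  // later compaction with c.sst can wrongly drop them.
  delete db;
}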

@siying @ajkr

@zhangjinpeng87
Contributor Author

The solution is to split the delete range when adding it to an SST.

@zhangjinpeng87 zhangjinpeng87 changed the title potential bug with DeleteRange bug with DeleteRange Aug 21, 2017
@ajkr
Contributor

ajkr commented Aug 21, 2017

Thanks a lot for investigating this. Can you describe your compaction config? Universal or level, and what kinds of manual compactions, if any?

One of this feature's built-in assumptions is that, when a range deletion gets split between files, any compaction will include either all or none of those files. ExpandInputsToCleanCut intends to guarantee this. So in your example, we should not be able to compact a.sst and b.sst to the bottom level without including c.sst in the same compaction.

I'll try to find where the assumption is broken.
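To illustrate the invariant, here is a conceptual sketch of the "clean cut" idea (not the actual ExpandInputsToCleanCut implementation, which operates on internal keys and FileMetaData):

#include <string>
#include <vector>

// Files within a level, sorted by smallest key; boundaries here are user keys.
struct FileRange {
  std::string smallest_user_key;
  std::string largest_user_key;
};

// Expand the candidate inputs [begin, end) until no excluded neighbor
// shares a border user key with an included file. Files split from one
// range tombstone touch at their border keys, so they get pulled in
// all-or-nothing.
void ExpandToCleanCut(const std::vector<FileRange>& level,
                      size_t& begin, size_t& end) {
  bool changed = true;
  while (changed) {
    changed = false;
    if (begin > 0 &&
        level[begin - 1].largest_user_key == level[begin].smallest_user_key) {
      --begin;  // left neighbor shares the border key: include it
      changed = true;
    }
    if (end < level.size() &&
        level[end - 1].largest_user_key == level[end].smallest_user_key) {
      ++end;  // right neighbor shares the border key: include it
      changed = true;
    }
  }
}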

@zhangjinpeng87
Contributor Author

@ajkr we use level compaction

@ajkr
Contributor

ajkr commented Aug 22, 2017

Do you call CompactRange or CompactFiles?

@zhangjinpeng87
Contributor Author

We have 4 column families. We don't trigger any manual compaction for the column family that loses data; for another column family we do call CompactRange to compact the whole range.

@zhangjinpeng87
Contributor Author

Do you need the LOG?

@ajkr
Contributor

ajkr commented Aug 22, 2017

Sure, LOG would be helpful.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 22, 2017

This is the LOG, and these are all the ranges on which we have called delete range (for each column family).
Some regions' write column family lost data, and at the same time some deleted keys reappeared in the write column family.
The following are the regions that have problems.

region 3387
start_key: "t\200\000\000\000\000\000\000\377#_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377Y\221\375\000\003\200\000\000\377\000\000\27063\000\000\000\374" 
end_key: "t\200\000\000\000\000\000\000\377#_i\200\000\000\000\000\377\000\000\002\0011718\377\000\000\000\000\373\003\200\000\377\000\000Y\221\270\205\003\200\377\000\000\000\000\006qc\000\376"

region 909
start_key: "t\200\000\000\000\000\000\000\377#_i\200\000\000\000\000\377\000\000\002\0011718\377\000\000\000\000\373\003\200\000\377\000\000Y\221\270\205\003\200\377\000\000\000\000\006qc\000\376" 
end_key: "t\200\000\000\000\000\000\000\377#_i\200\000\000\000\000\377\000\000\002\0014203\377\000\000\000\000\373\003\200\000\377\000\000Y\221\271\216\003\200\377\000\000\000\000\r\356\230\000\376"

region 1003
start_key: "t\200\000\000\000\000\000\000\377%_i\200\000\000\000\000\377\000\000\002\0012280\377\000\000\000\000\373\003\200\000\377\000\000Y\221\270\010\003\200\377\000\000\000\000\002\374\204\000\376" 
end_key: "t\200\000\000\000\000\000\000\377%_i\200\000\000\000\000\377\000\000\002\0015906\377\000\000\000\000\373\003\200\000\377\000\000Y\221\270\365\003\200\377\000\000\000\000\t\216\324\000\376"

@zhangjinpeng87
Contributor Author

@ajkr Any findings?

@ajkr
Contributor

ajkr commented Aug 23, 2017

Unfortunately, no, the compactions look pretty normal in your LOG file. I've looked at the code and written some tests for your theory about a tombstone getting split between files, but couldn't find a way for it to happen so far. Is it reproducible?

One way to verify your split-tombstone theory: you can check the manifest around the time of the compactions, verifying that the min/max keys of the input files don't overlap any files that aren't included in the compaction. Note you'd need to capture the manifest fairly soon after the corruption, as it's compacted periodically and when the DB reopens.

@zhangjinpeng87
Contributor Author

Unfortunately, the DB has been reopened.

@zhangjinpeng87
Contributor Author

BTW, it is hard to reproduce; we have met it only once until now.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 23, 2017

Is there any other possibility that could cause data loss and deletions being dropped incorrectly?

@ajkr
Contributor

ajkr commented Aug 23, 2017

Currently we don't know of any other reasons.

I noticed that your range deletions are probably never dropped. In your LOG there are a few files in L6, and compactions generally output to at most L3. Since range deletions are only dropped when compacted to the bottommost level, yours shouldn't be dropped. You could use sst_dump to see whether the tombstones you expect to cover the reappearing keys are still there.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 23, 2017

If the delete range has no overlap with L6, L3/L2 can also be the bottommost level. I will use sst_dump to check whether there are any other clues.

@zhangjinpeng87
Contributor Author

Unfortunately, I didn't find any useful clues with sst_dump.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 23, 2017

I found a suspicious place in the LOG: 56279.sst is an input of compaction JOB 5850, but is deleted by JOB 5822. I can't understand how this could happen.

2017/08/15-11:57:36.473090 7f56dcbfe700 EVENT_LOG_v1 {"time_micros": 1502769456473064, "job": 5850, "event": "compaction_started", "files_L1": [55939], "files_L2": [56228, 56230, 56232, 56234, 56236, 56239, 56241, 56243, 56245, 56247, 56249, 56251, 56253, 56254, 56256, 56257, 56259, 56262, 56266, 56271, 56273, 56279, 56289, 56300, 56308, 56315, 56319, 56322, 56325, 56327, 56330, 56337, 56344, 56354, 56363, 56370, 56377, 56380, 56383, 56386, 56395, 56404, 56411], "score": 4.57294, "input_data_size": 1523537191}

2017/08/15-12:05:53.521205 7f56de5fe700 EVENT_LOG_v1 {"time_micros": 1502769953521201, "job": 5822, "event": "table_file_deletion", "file_number": 56279}

BTW, is it possible that SST files are wrongly deleted after compaction in some extreme situations, causing lost put and delete entries? We checked the data in the other replicas (we store the timestamp in the value) and found that the lost keys were written around 08-15 5am. In other words, they are very close in time, so I wonder whether they were contained in the same SST file.

@ajkr
Contributor

ajkr commented Aug 24, 2017

Did 5822 finish after 5850 finished? Maybe there was a user iterator open when 5850 finished, in which case its input SST files are preserved. 5822 might then be the first caller of PurgeObsoleteFiles after the SST files become unreferenced and thus eligible for deletion.
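The lifetime rule being described is reference counting of table files across versions; roughly (a hypothetical sketch, not RocksDB's actual VersionSet code):

#include <cstdint>
#include <map>

// Each live Version (e.g. one pinned by an open iterator) references its
// table files; a file becomes obsolete only when its count drops to zero,
// and whichever job runs PurgeObsoleteFiles next actually unlinks it,
// possibly a job other than the compaction that replaced the file.
struct TableFileRefs {
  std::map<uint64_t, int> refs;  // file number -> reference count

  void Ref(uint64_t file_number) { ++refs[file_number]; }

  // Returns true if the file just became obsolete (safe to unlink).
  bool Unref(uint64_t file_number) {
    if (--refs[file_number] > 0) return false;
    refs.erase(file_number);
    return true;
  }
};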

@zhangjinpeng87
Contributor Author

Yes, JOB 5822 finished after JOB 5850.
Another question: under what circumstances would it try to delete a non-existing file?

2017/08/15-14:18:40.625328 7f0cc951cb00 [db/db_impl_files.cc:314] [JOB 2] Tried to delete a non-existing file /data1/kv2/deploy/data/db//064826.sst type=2 #64826 -- IO error: while unlink() file: /data1/kv2/deploy/data/db//064826.sst: No such file or directory

@zhangjinpeng87
Contributor Author

@ajkr We encountered the same problem once again; I will keep this issue updated.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 28, 2017

I found a very suspicious place in the MANIFEST file: one SST file's smallest key is larger than its largest key.

{"Level": 1, "FileNumber": 185830, "FileSize": 34742680, "SmallestIKey": "'7A7480000000
000000FF2B5F728000000000FF0E33560000000000FAFA87C93C4F53FFE2' seq:1903186226, type:1", "LargestIKey": "'7A7480000000000000FF215F728000000000FF198E0B0000000000FAFA87C91F7243FFFA' se
q:1897963609, type:1"}

We also found that there must have been a delete range in the associated compaction, because kMaxSequenceNumber and kTypeRangeDeletion occur:

seq:72057594037927935, type:15

At the same time, we found that our lost data is covered by this range:

7A7480000000000000FF275F698000000000FF0000010380000000FF599F0CEA03800000FF000148C86F000000FCFA87AAD033DBFFFE
...
7A7480000000000000FF275F698000000000FF0000010380000000FF599F0EE703800000FF00014AD41D000000FCFA87AAB271E3FFFD

BTW, this is the LOG.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 28, 2017

After investigating the MANIFEST and LOG, we found that the JOB 10643 L0->L1 compaction didn't take all the overlapping SST files in L1, which resulted in overlap within L1 after the job finished. This causes the iterator of a following compaction to be non-monotonic, which is the exact reason why the smallest key is larger than the largest key.
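One way to spot this kind of damage in a MANIFEST dump is to verify that the files within each level above L0 are sorted and non-overlapping; a minimal helper sketch (hypothetical FileRange type filled in from the dump; not a RocksDB API):

#include <string>
#include <vector>

struct FileRange {  // hypothetical: one entry per file in a level's dump
  std::string smallest_user_key;
  std::string largest_user_key;
};

// Returns true if a level's files (L1+) are internally consistent and
// sorted without overlap. After the bad JOB 10643 described above, this
// would return false for L1.
bool LevelIsCleanlySorted(const std::vector<FileRange>& files) {
  for (size_t i = 0; i < files.size(); ++i) {
    if (files[i].smallest_user_key > files[i].largest_user_key) {
      return false;  // corrupt boundaries, like file 185830 above
    }
    if (i > 0 && files[i - 1].largest_user_key > files[i].smallest_user_key) {
      return false;  // genuine overlap (equal border user keys can be
                     // legitimate when a range tombstone straddles files)
    }
  }
  return true;
}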

@huachaohuang
Contributor

It seems this snippet produces an SST file with a wrong range:

#include <stdio.h>
#include <assert.h>

#include <rocksdb/db.h>

using namespace rocksdb;

int main(int argc, char** argv) {
  Status s;
  Options opts;
  opts.create_if_missing = true;
  WriteOptions wopts;
  FlushOptions fopts;

  DB* db = nullptr;
  s = DB::Open(opts, "rocksdb", &db);
  assert(s.ok());

  auto cf = db->DefaultColumnFamily();

  s = db->DeleteRange(wopts, cf, Slice("b"), Slice("c"));
  assert(s.ok());

  // Take a snapshot so the two tombstones fall into different snapshot stripes.
  auto sn = db->GetSnapshot();

  s = db->DeleteRange(wopts, cf, Slice("a"), Slice("b"));
  assert(s.ok());

  s = db->Flush(fopts, cf);
  assert(s.ok());

  db->ReleaseSnapshot(sn);
  delete db;
}

MANIFEST:

--------------- Column family "default"  (ID 0) --------------
log number: 6
comparator: leveldb.BytewiseComparator
--- level 0 --- version# 1 ---
 7:949['b' seq:1, type:15 .. 'c' seq:1, type:15]
--- level 1 --- version# 1 ---
--- level 2 --- version# 1 ---
--- level 3 --- version# 1 ---

@ajkr
Contributor

ajkr commented Aug 28, 2017

@huachaohuang The ["a", "b") deletion is newer than the newest snapshot. It should be dropped when compacted to bottommost level (L0 in your case) as it's not needed for correctness of any snapshot. So ["b", "c"] looks like the right range to me. Anything I missed?

@ajkr
Contributor

ajkr commented Aug 28, 2017

@zhangjinpeng1987 Thanks, understood.

Is it possible the ingested SST files contain range deletions?

@zhangjinpeng87
Contributor Author

@ajkr Never; the ingested SST files only contain Puts.

@huachaohuang
Contributor

@ajkr Let's look at another example and see what happens:

#include <stdio.h>
#include <assert.h>

#include <rocksdb/db.h>

using namespace rocksdb;

int main(int argc, char** argv) {
  Status s;
  Options opts;
  opts.create_if_missing = true;
  ReadOptions ropts;
  WriteOptions wopts;
  FlushOptions fopts;
  CompactRangeOptions copts;

  DB* db = nullptr;
  s = DB::Open(opts, "rocksdb", &db);
  assert(s.ok());

  auto cf = db->DefaultColumnFamily();

  s = db->Put(wopts, cf, Slice("a"), Slice("a"));
  assert(s.ok());

  s = db->Flush(fopts, cf);
  assert(s.ok());

  // This will move 000007.sst to level 1 with entry "a".
  s = db->CompactRange(copts, cf, nullptr, nullptr);
  assert(s.ok());

  s = db->DeleteRange(wopts, cf, Slice("b"), Slice("c"));
  assert(s.ok());

  auto sn = db->GetSnapshot();

  s = db->DeleteRange(wopts, cf, Slice("a"), Slice("b"));
  assert(s.ok());

  // This will build 000010.sst to level 0 with range [b, c).
  // But 000010.sst actually contains delete range [a, b) and [b, c).
  s = db->Flush(fopts, cf);
  assert(s.ok());

  std::string v;
  s = db->Get(ropts, Slice("a"), &v);
  // This will fail because the delete range [a, b) is missed.
  assert(s.IsNotFound());
}

MANIFEST:

--------------- Column family "default"  (ID 0) --------------
log number: 9
comparator: leveldb.BytewiseComparator
--- level 0 --- version# 1 ---
 10:949['b' seq:2, type:15 .. 'c' seq:2, type:15]
--- level 1 --- version# 1 ---
 7:919['a' seq:1, type:1 .. 'a' seq:1, type:1]
--- level 2 --- version# 1 ---

000007_dump.txt

Footer Details:
--------------------------------------
  checksum: 1
  metaindex handle: A20620
  index handle: C70616
  footer version: 2
  table_magic_number: 9863518390377041911
  
Metaindex Details:
--------------------------------------
  Properties block handle: 1A8306

Table Properties:
--------------------------------------
  # data blocks: 1
  # entries: 1
  raw key size: 9
  raw average key size: 9.000000
  raw value size: 1
  raw average value size: 1.000000
  data block size: 26
  index block size: 27
  filter block size: 0
  (estimated) table size: 53
  filter policy name: N/A
  column family ID: 0
  column family name: default
  comparator name: leveldb.BytewiseComparator
  merge operator name: nullptr
  property collectors names: []
  SST file compression algo: Snappy
  creation time: 1503963804
  
Index Details:
--------------------------------------
  Block key hex dump: Data block handle
  Block key ascii

  HEX    62: 0015
  ASCII  b 
  ------

Data Block # 1 @ 0015
--------------------------------------
  HEX    61: 61
  ASCII  a : a 
  ------

Data Block Summary:
--------------------------------------
  # data blocks: 1
  min data block size: 21
  max data block size: 21
  avg data block size: 21.000000

000010_dump.txt

Footer Details:
--------------------------------------
  checksum: 1
  metaindex handle: B3063B
  index handle: F30608
  footer version: 2
  table_magic_number: 9863518390377041911
  
Metaindex Details:
--------------------------------------
  Properties block handle: 008306
  Range deletion block handle: 880626

Table Properties:
--------------------------------------
  # data blocks: 0
  # entries: 2
  raw key size: 18
  raw average key size: 9.000000
  raw value size: 2
  raw average value size: 1.000000
  data block size: 0
  index block size: 13
  filter block size: 0
  (estimated) table size: 13
  filter policy name: N/A
  column family ID: 0
  column family name: default
  comparator name: leveldb.BytewiseComparator
  merge operator name: nullptr
  property collectors names: []
  SST file compression algo: Snappy
  creation time: 1503963804
  
Index Details:
--------------------------------------
  Block key hex dump: Data block handle
  Block key ascii


Range deletions:
--------------------------------------
  HEX    62: 63
  ASCII  b : c 
  ------
  HEX    61: 62
  ASCII  a : b 
  ------

I think the problem is in RangeDelAggregator::AddToBuilder: it only considers the smallest key of the first added tombstone. But in this case, the first added tombstone is [b, c), because of the snapshot between [b, c) and [a, b).
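The shape of the fix (a minimal sketch of the idea; the actual change landed in RangeDelAggregator::AddToBuilder, see the commits below): because tombstones are grouped by snapshot stripe and are not added in global key order, the file's smallest key must be the minimum over all added tombstones rather than the first one.

#include <string>
#include <vector>

struct Tombstone {  // hypothetical stand-in for RangeTombstone
  std::string start_key;
  std::string end_key;
};

// Buggy: assumes the first tombstone added has the smallest start key.
std::string SmallestBuggy(const std::vector<Tombstone>& added) {
  return added.front().start_key;
}

// Fixed: track the minimum start key across every tombstone added,
// regardless of the order in which the snapshot stripes are iterated.
std::string SmallestFixed(const std::vector<Tombstone>& added) {
  std::string smallest = added.front().start_key;
  for (const auto& t : added) {
    if (t.start_key < smallest) smallest = t.start_key;
  }
  return smallest;
}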

facebook-github-bot pushed a commit that referenced this issue Aug 30, 2017
Summary:
Since tombstones are not stored in order, we may get a wrong smallest key if we only consider the first added tombstone.
Check #2752 for more details.
Closes #2799

Differential Revision: D5728217

Pulled By: ajkr

fbshipit-source-id: 4a53edb0ca80d2a9fcf10749e52d47d57d6417d3

ajkr pushed a commit that referenced this issue Aug 30, 2017
Summary:
Since tombstones are not stored in order, we may get a wrong smallest key if we only consider the first added tombstone.
Check #2752 for more details.
Closes #2799

Differential Revision: D5728217

Pulled By: ajkr

fbshipit-source-id: 4a53edb0ca80d2a9fcf10749e52d47d57d6417d3

petermattis added a commit to cockroachdb/rocksdb that referenced this issue Sep 8, 2017
Summary:
Since tombstones are not stored in order, we may get a wrong smallest
key if we only consider the first added tombstone.  Check
facebook#2752 for more details.
Closes facebook#2799

Differential Revision: D5728217

Pulled By: ajkr

fbshipit-source-id: 4a53edb0ca80d2a9fcf10749e52d47d57d6417d3

petermattis added a commit to cockroachdb/rocksdb that referenced this issue Sep 11, 2017
Summary:
Since tombstones are not stored in order, we may get a wrong smallest
key if we only consider the first added tombstone.
Check facebook#2752 for more details.
Closes facebook#2799

Differential Revision: D5728217

Pulled By: ajkr

@sihuazhou

Hi @ajkr, is DeleteRange() still an experimental feature, or is it stable now? I noticed that it's still marked experimental in the C++ comments, but the feature was introduced a long time ago.

@ajkr
Contributor

ajkr commented May 21, 2018

@sihuazhou I'd still classify it as an experimental feature. Despite it existing for a while, getting some adoption, and the known bugs being fixed, we still haven't optimized its performance, so it can cause serious performance regressions in certain common cases. Once that's addressed, we can recommend it for general-purpose use and remove the warnings. But it'll be a while longer.

@sihuazhou

@ajkr Thanks for your reply, I got it!
