
bug with DeleteRange #2752

Closed
zhangjinpeng87 opened this issue Aug 17, 2017 · 34 comments

Comments

@zhangjinpeng87
Contributor

When I use DeleteRange, after compaction some keys that had been deleted by Delete reappear, and some keys not covered by any DeleteRange disappear.
For the same data we have 3 replicas; 1 replica is wrong and the others are correct.
I can't find out where the problem is. @ajkr

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 19, 2017

In our case, each region (key range) has 3 replicas distributed on 3 different nodes.

The following is how we encountered this problem:

11:46:06 calculate checksum for region 3387, all 3 replicas are equal
write nothing for region 3387
12:11:00 calculate checksum for region 3387 again, the checksum for the replica at node 2 is different

We then iterated over all the keys of all 3 replicas and found that the replica at node 2 is missing 200 thousand kv pairs, while 6 thousand kv pairs that should have been deleted reappeared.

The following are the key events on node 2:

from 11:45:35 to 11:45:53 we wrote 142 delete_ranges (not including region 3387)
11:45:38 Level-0 flush table #55117
11:45:41 Compacting 6@0 + 130@1 files to L1 (include 55117.sst)
11:47:51 Compacted 6@0 + 130@1 files to L1

11:46:20 Level-0 flush table #55204
11:46:27 Compacting 4@0 files to L0 (include 55204.sst)
11:46:42 Compacted 4@0 files to L0 (Generated table #55222)
11:53:39 Compacting 36@0 + 128@1 files to L1 (include 55222.sst)
11:56:47 Compacted 36@0 + 128@1 files to L1 

@zhangjinpeng87
Contributor Author

SST files in levels greater than 0 may also overlap at a border user key.
I.e., given a delete range [A, F): when compaction outputs the first file 1.sst and its largest key is D, we add the delete range [A, F) into this SST and reset its largest key to E (the next SST's smallest key) with seqno = kMaxSequenceNumber, type = kTypeRangeDeletion.
When generating 2.sst we also add the delete range [A, F) into it, and reset its smallest key to E with seqno = 0, type = kTypeRangeDeletion.
I don't know whether this situation can cause reading a wrong value for this border key.
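To make the boundary bookkeeping above concrete, here is a minimal sketch of the resulting file metadata (hypothetical types; the constants match RocksDB's 56-bit seqno space and type tags, but this is not the actual FileMetaData code):

#include <cstdint>
#include <string>

constexpr uint64_t kMaxSequenceNumber = 72057594037927935ULL;  // (1 << 56) - 1
enum ValueType : uint8_t { kTypeValue = 0x1, kTypeRangeDeletion = 0xF };

struct InternalKey {
  std::string user_key;
  uint64_t seqno;
  ValueType type;
};

struct FileBoundaries {
  InternalKey smallest, largest;
};

int main() {
  // A tombstone [A, F) straddles two output files split at user key E.
  // 1.sst's largest key (originally D) is extended to
  // (E, kMaxSequenceNumber, kTypeRangeDeletion) ...
  FileBoundaries sst1{{"A", 100, kTypeRangeDeletion},
                      {"E", kMaxSequenceNumber, kTypeRangeDeletion}};
  // ... while 2.sst's smallest key is set to (E, 0, kTypeRangeDeletion);
  // within a user key, internal keys sort by descending seqno, so the two
  // files touch at E without being out of order.
  FileBoundaries sst2{{"E", 0, kTypeRangeDeletion},
                      {"F", 100, kTypeRangeDeletion}};
  (void)sst1;
  (void)sst2;
}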

@siddontang
Contributor

/cc @ajkr

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 21, 2017

I found the bug: a delete range can wrongly delete keys.

Firstly, when added to an SST, a DeleteRange is never split, so one DeleteRange may be added to multiple SST files.
Secondly, although the DeleteRange is added to an SST, the range of the DeleteRange may be broader than the SST's key range.
Thirdly, when the output level is the bottommost and the oldest snapshot is larger than a key's sequence number, the key's sequence number is set to zero for better compression.

  1. Write a delete range [a, d).
  2. After compaction, this delete range may be added to multiple SST files: a.sst, b.sst, c.sst, whose ranges are [a, b), [b, c), [c, d).
  3. When a.sst and b.sst reach the bottommost level, the delete range in these two SST files is dropped.
  4. Write new keys in the range [a, c).
  5. When the newly written keys in [a, c) reach the bottommost level, their sequence numbers are set to zero.
  6. When c.sst, which still carries the [a, d) tombstone, is compacted with these zero-seqno keys in [a, c), the newly written keys are wrongly dropped.
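For illustration, the user-visible half of this scenario is roughly the following sequence of calls (a minimal sketch; steps 2-6 happen inside compaction and depend on timing, so this does not deterministically reproduce the bug):

#include <assert.h>

#include <rocksdb/db.h>

using namespace rocksdb;

int main() {
  Options opts;
  opts.create_if_missing = true;
  DB* db = nullptr;
  Status s = DB::Open(opts, "rocksdb", &db);
  assert(s.ok());
  auto cf = db->DefaultColumnFamily();

  // Step 1: write a range tombstone covering [a, d).
  s = db->DeleteRange(WriteOptions(), cf, Slice("a"), Slice("d"));
  assert(s.ok());

  // Steps 2-3 happen inside compaction: the tombstone is copied into
  // a.sst, b.sst and c.sst, then dropped from a.sst and b.sst once
  // they reach the bottommost level.

  // Step 4: write new keys inside [a, c).
  s = db->Put(WriteOptions(), cf, Slice("a1"), Slice("v1"));
  assert(s.ok());
  s = db->Put(WriteOptions(), cf, Slice("b1"), Slice("v2"));
  assert(s.ok());

  // Steps 5-6: once these keys reach the bottommost level their seqnos
  // are zeroed; if c.sst still carries the stale [a, d) tombstone, a
  // later compaction with c.sst can wrongly drop them.
  delete db;
}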

@siying @ajkr

@zhangjinpeng87
Contributor Author

The solution is to split the delete range when adding it to an SST.

@zhangjinpeng87 zhangjinpeng87 changed the title potential bug with DeleteRange bug with DeleteRange Aug 21, 2017
@ajkr
Contributor

ajkr commented Aug 21, 2017

Thanks a lot for investigating this. Can you describe your compaction config? Universal or level, and what kinds of manual compactions, if any?

One of this feature's built-in assumptions is that, when a range deletion gets split between files, any compaction will include either all or none of those files. ExpandInputsToCleanCut intends to guarantee this. So in your example, we should not be able to compact a.sst and b.sst to the bottom level without including c.sst in the same compaction.

I'll try to find where the assumption is broken.
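To illustrate the invariant, here is a conceptual sketch of the "clean cut" idea (not the actual ExpandInputsToCleanCut implementation, which operates on internal keys and FileMetaData):

#include <string>
#include <vector>

// Files within a level, sorted by smallest key; boundaries here are user keys.
struct FileRange {
  std::string smallest_user_key;
  std::string largest_user_key;
};

// Expand the candidate inputs [begin, end) until no excluded neighbor
// shares a border user key with an included file. Files split from one
// range tombstone touch at their border keys, so they get pulled in
// all-or-nothing.
void ExpandToCleanCut(const std::vector<FileRange>& level,
                      size_t& begin, size_t& end) {
  bool changed = true;
  while (changed) {
    changed = false;
    if (begin > 0 &&
        level[begin - 1].largest_user_key == level[begin].smallest_user_key) {
      --begin;  // left neighbor shares the border key: include it
      changed = true;
    }
    if (end < level.size() &&
        level[end - 1].largest_user_key == level[end].smallest_user_key) {
      ++end;  // right neighbor shares the border key: include it
      changed = true;
    }
  }
}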

@zhangjinpeng87
Contributor Author

@ajkr we use level compaction

@ajkr
Contributor

ajkr commented Aug 22, 2017

Do you call CompactRange or CompactFiles?

@zhangjinpeng87
Contributor Author

We have 4 column families. We don't trigger any manual compaction for the column family that loses data; for another column family we do call CompactRange to compact the whole range.

@zhangjinpeng87
Contributor Author

Do you need the LOG?

@ajkr
Contributor

ajkr commented Aug 22, 2017

Sure, LOG would be helpful.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 22, 2017

This is the LOG, and these are all the ranges on which we have called delete range (for each column family).
Some regions' write column family lost data, and at the same time some deleted keys reappeared in the write column family.
The following are the regions that have problems.

region 3387
start_key: "t\200\000\000\000\000\000\000\377#_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377Y\221\375\000\003\200\000\000\377\000\000\27063\000\000\000\374" 
end_key: "t\200\000\000\000\000\000\000\377#_i\200\000\000\000\000\377\000\000\002\0011718\377\000\000\000\000\373\003\200\000\377\000\000Y\221\270\205\003\200\377\000\000\000\000\006qc\000\376"

region 909
start_key: "t\200\000\000\000\000\000\000\377#_i\200\000\000\000\000\377\000\000\002\0011718\377\000\000\000\000\373\003\200\000\377\000\000Y\221\270\205\003\200\377\000\000\000\000\006qc\000\376" 
end_key: "t\200\000\000\000\000\000\000\377#_i\200\000\000\000\000\377\000\000\002\0014203\377\000\000\000\000\373\003\200\000\377\000\000Y\221\271\216\003\200\377\000\000\000\000\r\356\230\000\376"

region 1003
start_key: "t\200\000\000\000\000\000\000\377%_i\200\000\000\000\000\377\000\000\002\0012280\377\000\000\000\000\373\003\200\000\377\000\000Y\221\270\010\003\200\377\000\000\000\000\002\374\204\000\376" 
end_key: "t\200\000\000\000\000\000\000\377%_i\200\000\000\000\000\377\000\000\002\0015906\377\000\000\000\000\373\003\200\000\377\000\000Y\221\270\365\003\200\377\000\000\000\000\t\216\324\000\376"

@zhangjinpeng87
Contributor Author

@ajkr Any findings?

@ajkr
Contributor

ajkr commented Aug 23, 2017

Unfortunately, no, the compactions look pretty normal in your LOG file. I've looked at the code and written some tests for your theory about a tombstone getting split between files, but couldn't find a way for it to happen so far. Is it reproducible?

One way to verify your split-tombstone theory: you can check the manifest around the time of the compactions, verifying that the min/max keys of the input files don't overlap any files that aren't included in the compaction. Note you'd need to capture the manifest fairly soon after the corruption, as it's compacted periodically and when the DB reopens.

@zhangjinpeng87
Contributor Author

Unfortunately, the DB has been reopened.

@zhangjinpeng87
Contributor Author

BTW, it is hard to reproduce; we have met it only once until now.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 23, 2017

Is there any other possibility that could cause data loss and deletions being dropped incorrectly?

@ajkr
Contributor

ajkr commented Aug 23, 2017

Currently we don't know of any other reasons.

I noticed that your range deletions are probably never dropped. In your LOG there are a few files in L6, and compactions generally output to at most L3. Since range deletions are only dropped when compacted to the bottommost level, yours shouldn't be dropped. You could use sst_dump to see whether the tombstones you expect to cover the reappearing keys are still there.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 23, 2017

If the delete range has no overlap with L6, L3/L2 can also be the bottommost level. I will use sst_dump to check whether there are any other clues.

@zhangjinpeng87
Contributor Author

Unfortunately, I didn't find any useful clues with sst_dump.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 23, 2017

I found a suspicious place in the LOG: 56279.sst is an input of compaction JOB 5850, but is deleted by JOB 5822. I can't understand how this could happen.

2017/08/15-11:57:36.473090 7f56dcbfe700 EVENT_LOG_v1 {"time_micros": 1502769456473064, "job": 5850, "event": "compaction_started", "files_L1": [55939], "files_L2": [56228, 56230, 56232, 56234, 56236, 56239, 56241, 56243, 56245, 56247, 56249, 56251, 56253, 56254, 56256, 56257, 56259, 56262, 56266, 56271, 56273, 56279, 56289, 56300, 56308, 56315, 56319, 56322, 56325, 56327, 56330, 56337, 56344, 56354, 56363, 56370, 56377, 56380, 56383, 56386, 56395, 56404, 56411], "score": 4.57294, "input_data_size": 1523537191}

2017/08/15-12:05:53.521205 7f56de5fe700 EVENT_LOG_v1 {"time_micros": 1502769953521201, "job": 5822, "event": "table_file_deletion", "file_number": 56279}

BTW, is it possible that SST files are wrongly deleted after compaction in some extreme situations, causing lost put and delete entries? We checked the data in the other replicas (we store the timestamp in the value) and found that the lost keys were written around 08-15 5am. In other words, they are very close in time, so I wonder whether they were contained in the same SST file.

@ajkr
Contributor

ajkr commented Aug 24, 2017

Did 5822 finish after 5850 finished? Maybe there was a user iterator open when 5850 finished, in which case its input SST files are preserved. 5822 might then be the first caller of PurgeObsoleteFiles after the SST files become unreferenced and thus eligible for deletion.
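The lifetime rule being described is reference counting of table files across versions; roughly (a hypothetical sketch, not RocksDB's actual VersionSet code):

#include <cstdint>
#include <map>

// Each live Version (e.g. one pinned by an open iterator) references its
// table files; a file becomes obsolete only when its count drops to zero,
// and whichever job runs PurgeObsoleteFiles next actually unlinks it,
// possibly a job other than the compaction that replaced the file.
struct TableFileRefs {
  std::map<uint64_t, int> refs;  // file number -> reference count

  void Ref(uint64_t file_number) { ++refs[file_number]; }

  // Returns true if the file just became obsolete (safe to unlink).
  bool Unref(uint64_t file_number) {
    if (--refs[file_number] > 0) return false;
    refs.erase(file_number);
    return true;
  }
};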

@zhangjinpeng87
Contributor Author

Yes, JOB 5822 finished after JOB 5850.
Another question: under what circumstances would it try to delete a non-existing file?

2017/08/15-14:18:40.625328 7f0cc951cb00 [db/db_impl_files.cc:314] [JOB 2] Tried to delete a non-existing file /data1/kv2/deploy/data/db//064826.sst type=2 #64826 -- IO error: while unlink() file: /data1/kv2/deploy/data/db//064826.sst: No such file or directory

@zhangjinpeng87
Contributor Author

@ajkr We encountered the same problem once again; I will keep this issue updated.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 28, 2017

I found a very suspicious place in the MANIFEST file: one SST file's smallest key is larger than its largest key.

{"Level": 1, "FileNumber": 185830, "FileSize": 34742680, "SmallestIKey": "'7A7480000000
000000FF2B5F728000000000FF0E33560000000000FAFA87C93C4F53FFE2' seq:1903186226, type:1", "LargestIKey": "'7A7480000000000000FF215F728000000000FF198E0B0000000000FAFA87C91F7243FFFA' se
q:1897963609, type:1"}

We also found that there must have been a delete range in the associated compaction, because kMaxSequenceNumber and kTypeRangeDeletion occur:

seq:72057594037927935, type:15

At the same time, we found that our lost data is covered by this range:

7A7480000000000000FF275F698000000000FF0000010380000000FF599F0CEA03800000FF000148C86F000000FCFA87AAD033DBFFFE
...
7A7480000000000000FF275F698000000000FF0000010380000000FF599F0EE703800000FF00014AD41D000000FCFA87AAB271E3FFFD

BTW, this is the LOG.

@zhangjinpeng87
Contributor Author

zhangjinpeng87 commented Aug 28, 2017

After investigating the MANIFEST and LOG, we found that the JOB 10643 L0->L1 compaction didn't take all the overlapping SST files in L1, which resulted in overlap within L1 after the job finished. This causes the iterator of a following compaction to be non-monotonic, which is the exact reason why the smallest key is larger than the largest key.
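One way to spot this kind of damage in a MANIFEST dump is to verify that the files within each level above L0 are sorted and non-overlapping; a minimal helper sketch (hypothetical FileRange type filled in from the dump; not a RocksDB API):

#include <string>
#include <vector>

struct FileRange {  // hypothetical: one entry per file in a level's dump
  std::string smallest_user_key;
  std::string largest_user_key;
};

// Returns true if a level's files (L1+) are internally consistent and
// sorted without overlap. After the bad JOB 10643 described above, this
// would return false for L1.
bool LevelIsCleanlySorted(const std::vector<FileRange>& files) {
  for (size_t i = 0; i < files.size(); ++i) {
    if (files[i].smallest_user_key > files[i].largest_user_key) {
      return false;  // corrupt boundaries, like file 185830 above
    }
    if (i > 0 && files[i - 1].largest_user_key > files[i].smallest_user_key) {
      return false;  // genuine overlap (equal border user keys can be
                     // legitimate when a range tombstone straddles files)
    }
  }
  return true;
}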

@huachaohuang
Contributor

It seems this snippet produces an SST file with a wrong range:

#include <stdio.h>
#include <assert.h>

#include <rocksdb/db.h>

using namespace rocksdb;

int main(int argc, char** argv) {
  Status s;
  Options opts;
  opts.create_if_missing = true;
  WriteOptions wopts;
  FlushOptions fopts;

  DB* db = nullptr;
  s = DB::Open(opts, "rocksdb", &db);
  assert(s.ok());

  auto cf = db->DefaultColumnFamily();

  s = db->DeleteRange(wopts, cf, Slice("b"), Slice("c"));
  assert(s.ok());

  // Take a snapshot so the two tombstones fall into different snapshot stripes.
  auto sn = db->GetSnapshot();

  s = db->DeleteRange(wopts, cf, Slice("a"), Slice("b"));
  assert(s.ok());

  s = db->Flush(fopts, cf);
  assert(s.ok());

  db->ReleaseSnapshot(sn);
  delete db;
}

MANIFEST:

--------------- Column family "default"  (ID 0) --------------
log number: 6
comparator: leveldb.BytewiseComparator
--- level 0 --- version# 1 ---
 7:949['b' seq:1, type:15 .. 'c' seq:1, type:15]
--- level 1 --- version# 1 ---
--- level 2 --- version# 1 ---
--- level 3 --- version# 1 ---

@ajkr
Contributor

ajkr commented Aug 28, 2017

@huachaohuang The ["a", "b") deletion is newer than the newest snapshot. It should be dropped when compacted to bottommost level (L0 in your case) as it's not needed for correctness of any snapshot. So ["b", "c"] looks like the right range to me. Anything I missed?

@ajkr
Contributor

ajkr commented Aug 28, 2017

@zhangjinpeng1987 Thanks, understood.

Is it possible the ingested SST files contain range deletions?

@zhangjinpeng87
Contributor Author

@ajkr Never; the ingested SST files only contain Puts.

@huachaohuang
Contributor

@ajkr Let's look at another example and see what happens:

#include <stdio.h>
#include <assert.h>

#include <rocksdb/db.h>

using namespace rocksdb;

int main(int argc, char** argv) {
  Status s;
  Options opts;
  opts.create_if_missing = true;
  ReadOptions ropts;
  WriteOptions wopts;
  FlushOptions fopts;
  CompactRangeOptions copts;

  DB* db = nullptr;
  s = DB::Open(opts, "rocksdb", &db);
  assert(s.ok());

  auto cf = db->DefaultColumnFamily();

  s = db->Put(wopts, cf, Slice("a"), Slice("a"));
  assert(s.ok());

  s = db->Flush(fopts, cf);
  assert(s.ok());

  // This will move 000007.sst to level 1 with entry "a".
  s = db->CompactRange(copts, cf, nullptr, nullptr);
  assert(s.ok());

  s = db->DeleteRange(wopts, cf, Slice("b"), Slice("c"));
  assert(s.ok());

  auto sn = db->GetSnapshot();

  s = db->DeleteRange(wopts, cf, Slice("a"), Slice("b"));
  assert(s.ok());

  // This will build 000010.sst to level 0 with range [b, c).
  // But 000010.sst actually contains delete range [a, b) and [b, c).
  s = db->Flush(fopts, cf);
  assert(s.ok());

  std::string v;
  s = db->Get(ropts, Slice("a"), &v);
  // This will fail because the delete range [a, b) is missed.
  assert(s.IsNotFound());
}

MANIFEST:

--------------- Column family "default"  (ID 0) --------------
log number: 9
comparator: leveldb.BytewiseComparator
--- level 0 --- version# 1 ---
 10:949['b' seq:2, type:15 .. 'c' seq:2, type:15]
--- level 1 --- version# 1 ---
 7:919['a' seq:1, type:1 .. 'a' seq:1, type:1]
--- level 2 --- version# 1 ---

000007_dump.txt

Footer Details:
--------------------------------------
  checksum: 1
  metaindex handle: A20620
  index handle: C70616
  footer version: 2
  table_magic_number: 9863518390377041911
  
Metaindex Details:
--------------------------------------
  Properties block handle: 1A8306

Table Properties:
--------------------------------------
  # data blocks: 1
  # entries: 1
  raw key size: 9
  raw average key size: 9.000000
  raw value size: 1
  raw average value size: 1.000000
  data block size: 26
  index block size: 27
  filter block size: 0
  (estimated) table size: 53
  filter policy name: N/A
  column family ID: 0
  column family name: default
  comparator name: leveldb.BytewiseComparator
  merge operator name: nullptr
  property collectors names: []
  SST file compression algo: Snappy
  creation time: 1503963804
  
Index Details:
--------------------------------------
  Block key hex dump: Data block handle
  Block key ascii

  HEX    62: 0015
  ASCII  b 
  ------

Data Block # 1 @ 0015
--------------------------------------
  HEX    61: 61
  ASCII  a : a 
  ------

Data Block Summary:
--------------------------------------
  # data blocks: 1
  min data block size: 21
  max data block size: 21
  avg data block size: 21.000000

000010_dump.txt

Footer Details:
--------------------------------------
  checksum: 1
  metaindex handle: B3063B
  index handle: F30608
  footer version: 2
  table_magic_number: 9863518390377041911
  
Metaindex Details:
--------------------------------------
  Properties block handle: 008306
  Range deletion block handle: 880626

Table Properties:
--------------------------------------
  # data blocks: 0
  # entries: 2
  raw key size: 18
  raw average key size: 9.000000
  raw value size: 2
  raw average value size: 1.000000
  data block size: 0
  index block size: 13
  filter block size: 0
  (estimated) table size: 13
  filter policy name: N/A
  column family ID: 0
  column family name: default
  comparator name: leveldb.BytewiseComparator
  merge operator name: nullptr
  property collectors names: []
  SST file compression algo: Snappy
  creation time: 1503963804
  
Index Details:
--------------------------------------
  Block key hex dump: Data block handle
  Block key ascii


Range deletions:
--------------------------------------
  HEX    62: 63
  ASCII  b : c 
  ------
  HEX    61: 62
  ASCII  a : b 
  ------

I think the problem is in RangeDelAggregator::AddToBuilder: it only considers the smallest key of the first added tombstone. But in this case, the first added tombstone is [b, c), because of the snapshot between [b, c) and [a, b).
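The shape of the fix (a minimal sketch of the idea; the actual change landed in RangeDelAggregator::AddToBuilder, see the commits below): because tombstones are grouped by snapshot stripe and are not added in global key order, the file's smallest key must be the minimum over all added tombstones rather than the first one.

#include <string>
#include <vector>

struct Tombstone {  // hypothetical stand-in for RangeTombstone
  std::string start_key;
  std::string end_key;
};

// Buggy: assumes the first tombstone added has the smallest start key.
std::string SmallestBuggy(const std::vector<Tombstone>& added) {
  return added.front().start_key;
}

// Fixed: track the minimum start key across every tombstone added,
// regardless of the order in which the snapshot stripes are iterated.
std::string SmallestFixed(const std::vector<Tombstone>& added) {
  std::string smallest = added.front().start_key;
  for (const auto& t : added) {
    if (t.start_key < smallest) smallest = t.start_key;
  }
  return smallest;
}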

facebook-github-bot pushed a commit that referenced this issue Aug 30, 2017
Summary:
Since tombstones are not stored in order, we may get a wrong smallest key if we only consider the first added tombstone.
Check #2752 for more details.
Closes #2799

Differential Revision: D5728217

Pulled By: ajkr

fbshipit-source-id: 4a53edb0ca80d2a9fcf10749e52d47d57d6417d3

ajkr pushed a commit that referenced this issue Aug 30, 2017
Summary:
Since tombstones are not stored in order, we may get a wrong smallest key if we only consider the first added tombstone.
Check #2752 for more details.
Closes #2799

Differential Revision: D5728217

Pulled By: ajkr

fbshipit-source-id: 4a53edb0ca80d2a9fcf10749e52d47d57d6417d3

petermattis added a commit to cockroachdb/rocksdb that referenced this issue Sep 8, 2017
Summary:
Since tombstones are not stored in order, we may get a wrong smallest
key if we only consider the first added tombstone.  Check
facebook#2752 for more details.
Closes facebook#2799

Differential Revision: D5728217

Pulled By: ajkr

fbshipit-source-id: 4a53edb0ca80d2a9fcf10749e52d47d57d6417d3

petermattis added a commit to cockroachdb/rocksdb that referenced this issue Sep 11, 2017
Summary:
Since tombstones are not stored in order, we may get a wrong smallest
key if we only consider the first added tombstone.
Check facebook#2752 for more details.
Closes facebook#2799

Differential Revision: D5728217

Pulled By: ajkr

@sihuazhou

Hi @ajkr, is DeleteRange() still an experimental feature, or is it stable now? I noticed that it's still marked experimental in the C++ comments, but the feature was introduced a long time ago.

@ajkr
Contributor

ajkr commented May 21, 2018

@sihuazhou I'd still classify it as an experimental feature. Despite it existing for a while, getting some adoption, and the known bugs being fixed, we still haven't optimized its performance, so it can cause serious performance regressions in certain common cases. Once that's addressed, we can recommend it for general-purpose use and remove the warnings. But it'll be a while longer.

@sihuazhou

@ajkr Thanks for your reply, I got it!
