bug with DeleteRange #2752
In our case, each region (key range) has 3 replicas distributed across 3 different nodes. The following is how we encountered this problem.
Then we iterated over all the keys in all 3 replicas and found that the replica on node 2 was missing 200 thousand kv pairs, and 6 thousand kv pairs that should have been deleted had reappeared. The following are the key events on node 2.
SST files in levels greater than 0 may also overlap at the boundary user key.
/cc @ajkr
I found the bug that makes delete range wrongly delete keys. First, when added to an SST file, a DeleteRange is never split, so one DeleteRange may be added to multiple SST files.
The solution is to split the delete range when adding it to an SST file.
Thanks a lot for investigating this. Can you describe your compaction config? Universal or level, and what kinds of manual compactions, if any? One of this feature's built-in assumptions is that, when a range deletion gets split between files, any compaction will include either all or none of those files. I'll try to find where the assumption is broken.
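The all-or-none assumption can be pictured with a small standalone model (hypothetical types, not RocksDB internals): a tombstone split into per-file fragments stays correct only while every fragment travels through compactions together.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model, not RocksDB code: fragments of one logical range
// tombstone, each covering keys in [begin, end).
struct Tombstone {
  std::string begin, end;
};

// A key is considered deleted if some fragment in the set still covers it.
bool Covers(const std::vector<Tombstone>& fragments, const std::string& key) {
  for (const auto& t : fragments) {
    if (t.begin <= key && key < t.end) return true;
  }
  return false;
}
```

If a deletion of ["a", "c") is split into fragments ["a", "b") and ["b", "c") in two files, a compaction that carries only the second fragment down leaves keys like "apple" uncovered at the lower level, which matches the resurrected-keys symptom described in this issue.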
@ajkr we use level compaction
Do you call |
We have 4 column families. We don't trigger any manual compaction for the column family that lost data; for another column family we do call CompactRange to compact the whole range.
Do you need the LOG? |
Sure, LOG would be helpful.
This is the LOG, and these are all the ranges on which we have called delete range (for each column family).
@ajkr Any discovery?
Unfortunately, no, the compactions look pretty normal in your LOG file. I've looked at the code and written some tests for your theory about tombstones getting split between levels, but couldn't find a way for it to happen so far. Is it reproducible? One way to verify your split-level theory: you can check the manifest around the time of the compactions, and confirm the min/max keys of the input files don't overlap other files that aren't included in the compaction. Note you'd need to capture the manifest fairly soon after the corruption, as it's compacted periodically and when the DB reopens.
Unfortunately, the DB has been reopened.
BTW, it is hard to reproduce; we have only encountered it once so far.
Is there any other possibility that could cause data loss and deletions being dropped incorrectly?
Currently we don't know of any other causes. I noticed that your range deletions are probably never dropped: in your LOG there are a few files in L6, and compactions generally output to at most L3. Since range deletions are only dropped when compacted to the bottommost level, they shouldn't be dropped. You could use |
If the delete range has no overlap with L6, L3/L2 can also be the bottommost level. I will use sst_dump to check whether there are any other clues.
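The point that the bottommost level is range-dependent can be sketched as follows (a simplified model with my own naming, not RocksDB's actual code): for a given key range, the deepest level holding overlapping files is where a tombstone over that range may finally be dropped, since below that there is no older data left for it to hide.

```cpp
#include <cassert>
#include <vector>

// Simplified sketch: level_overlaps[i] says whether level i contains any
// file overlapping the tombstone's key range. The effective bottommost
// level for that range is the deepest overlapping level.
int BottommostLevelForRange(const std::vector<bool>& level_overlaps) {
  for (int lvl = static_cast<int>(level_overlaps.size()) - 1; lvl >= 0; --lvl) {
    if (level_overlaps[lvl]) return lvl;
  }
  return 0;  // no overlapping data anywhere; the range is trivially bottommost
}
```

With 7 levels where only L0 and L3 overlap the deleted range, the effective bottommost level is L3 even though L6 holds data for other ranges.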
Unfortunately, I didn't find any useful clue with sst_dump.
I found a suspicious place in the LOG:
BTW, is it possible that SST files are wrongly deleted after compaction in some extreme situations, causing loss of both put and delete entries? We checked the data in the other replicas (we store the timestamp in the value) and found that the lost keys were written around 08-15 5am; in other words, they are very close in time, so I guess they were contained in the same SST file.
Did 5822 finish after 5850 finished? Maybe there was a user iterator when 5850 finished, in which case the SST files are preserved. 5822 might be the first caller of |
Yes, JOB 5822 was finished after JOB 5850.
@ajkr We encountered the same problem once again; I will keep this issue updated.
I found a very suspicious place in the MANIFEST file: one SST file's smallest key is larger than its largest key.
We also found that there must be a delete range in the associated compaction, because kMaxSequenceNumber and kTypeRangeDeletion occurred.
At the same time, we found that our lost data is covered by this range.
BTW, this is the LOG.
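The MANIFEST anomaly above (smallest key greater than largest key) is easy to scan for with a small invariant check. The `FileMeta` struct here is hypothetical; real entries would come from decoding the MANIFEST's version-edit records.

```cpp
#include <cassert>
#include <string>

// Hypothetical decoded form of one file entry from the MANIFEST.
struct FileMeta {
  std::string smallest_user_key;
  std::string largest_user_key;
};

// Every SST file's key range must satisfy smallest <= largest; the
// entry observed in this issue violates that invariant.
bool RangeIsValid(const FileMeta& f) {
  return f.smallest_user_key <= f.largest_user_key;
}
```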
After investigating the MANIFEST and LOG, we found |
Seems this snippet will produce an SST file with a wrong range:

```cpp
#include <stdio.h>
#include <assert.h>
#include <rocksdb/db.h>

using namespace rocksdb;

int main(int argc, char** argv) {
  Status s;
  Options opts;
  opts.create_if_missing = true;
  WriteOptions wopts;
  FlushOptions fopts;
  DB* db = nullptr;
  s = DB::Open(opts, "rocksdb", &db);
  assert(s.ok());
  auto cf = db->DefaultColumnFamily();
  s = db->DeleteRange(wopts, cf, Slice("b"), Slice("c"));
  assert(s.ok());
  auto sn = db->GetSnapshot();
  s = db->DeleteRange(wopts, cf, Slice("a"), Slice("b"));
  assert(s.ok());
  s = db->Flush(fopts, cf);
  assert(s.ok());
}
```

MANIFEST:
@huachaohuang The ["a", "b") deletion is newer than the newest snapshot. It should be dropped when compacted to the bottommost level (L0 in your case), as it's not needed for the correctness of any snapshot. So ["b", "c") looks like the right range to me. Anything I missed?
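The snapshot reasoning here rests on sequence-number visibility: a snapshot sees exactly the writes with sequence numbers at or below its own. A minimal sketch of that rule (my own naming, not RocksDB internals):

```cpp
#include <cassert>
#include <cstdint>

// A write (including a range tombstone) is visible to a snapshot iff its
// sequence number is at or below the snapshot's sequence number.
bool VisibleToSnapshot(uint64_t write_seq, uint64_t snapshot_seq) {
  return write_seq <= snapshot_seq;
}
```

In the repro, DeleteRange("b", "c") has a sequence number at or below the snapshot's, so the snapshot depends on it; DeleteRange("a", "b") comes after the snapshot, is invisible to it, and since it covers no older keys it can be dropped at the bottommost level.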
@zhangjinpeng1987 Thanks, understood. Is it possible the ingested SST files contain range deletions?
@ajkr Never; the ingested SST files only contain Put entries.
@ajkr Let's see another example and check what happens:

```cpp
#include <stdio.h>
#include <assert.h>
#include <rocksdb/db.h>

using namespace rocksdb;

int main(int argc, char** argv) {
  Status s;
  Options opts;
  opts.create_if_missing = true;
  ReadOptions ropts;
  WriteOptions wopts;
  FlushOptions fopts;
  CompactRangeOptions copts;
  DB* db = nullptr;
  s = DB::Open(opts, "rocksdb", &db);
  assert(s.ok());
  auto cf = db->DefaultColumnFamily();
  s = db->Put(wopts, cf, Slice("a"), Slice("a"));
  assert(s.ok());
  s = db->Flush(fopts, cf);
  assert(s.ok());
  // This will move 000007.sst to level 1 with entry "a".
  s = db->CompactRange(copts, cf, nullptr, nullptr);
  assert(s.ok());
  s = db->DeleteRange(wopts, cf, Slice("b"), Slice("c"));
  assert(s.ok());
  auto sn = db->GetSnapshot();
  s = db->DeleteRange(wopts, cf, Slice("a"), Slice("b"));
  assert(s.ok());
  // This will build 000010.sst at level 0 with range [b, c).
  // But 000010.sst actually contains delete ranges [a, b) and [b, c).
  s = db->Flush(fopts, cf);
  assert(s.ok());
  std::string v;
  s = db->Get(ropts, Slice("a"), &v);
  // This will fail because the delete range [a, b) is missed.
  assert(s.IsNotFound());
}
```

MANIFEST:
000007_dump.txt
000010_dump.txt
I think the problem is in |
Summary: Since tombstones are not stored in order, we may get a wrong smallest key if we only consider the first added tombstone. Check facebook#2752 for more details. Closes facebook#2799 Differential Revision: D5728217 Pulled By: ajkr fbshipit-source-id: 4a53edb0ca80d2a9fcf10749e52d47d57d6417d3
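The fix described in the commit summary can be sketched like this (a simplified stand-in for the table-builder logic, not the actual patch): because tombstones are buffered in insertion order rather than key order, the file's smallest key must be the minimum over all tombstone start keys, not the start key of the first tombstone added.

```cpp
#include <cassert>
#include <algorithm>
#include <string>
#include <vector>

struct RangeTombstone {
  std::string begin, end;  // deletes keys in [begin, end)
};

// Correct version: scan every buffered tombstone. The buggy behavior was
// equivalent to returning tombstones[0].begin unconditionally.
std::string SmallestTombstoneKey(const std::vector<RangeTombstone>& tombstones) {
  assert(!tombstones.empty());
  std::string smallest = tombstones[0].begin;
  for (const auto& t : tombstones) {
    smallest = std::min(smallest, t.begin);
  }
  return smallest;
}
```

With tombstones added in the repro's order, {["b", "c"), ["a", "b")}, the first-added start key is "b" while the true smallest key is "a", exactly the wrong-range symptom seen in the MANIFEST.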
Hi @ajkr , is the |
@sihuazhou I'd still classify it as an experimental feature. Despite it existing for a while, getting some adoption, and fixing the known bugs, we still haven't optimized its performance, so it can cause serious performance regressions in certain common cases. Once that's addressed we can recommend it for general-purpose use and remove the warnings. But it'll be a while longer.
@ajkr Thanks for your reply, I got it!
When I use `DeleteRange`, after compaction, some keys that had been deleted by `Delete` reappear, and some keys not covered by `DeleteRange` disappear. For the same data, we have 3 replicas; 1 replica is wrong, and the others are correct. I can't find out where the problem is. @ajkr