
SST files overlap at a level and keys are not ordered in a SST file #7405

Closed
kevinliuca20 opened this issue Sep 17, 2020 · 13 comments

@kevinliuca20

We are upgrading our RocksDB storage from 5.7.2 to 6.4.6. After upgrading to 6.4.6, we noticed that some shards returned unexpected keys during iteration. We then dumped the manifest and SST files and found that the (smallest key, largest key) ranges of some SST files at the bottommost level overlap, and that some SST files contain multiple sorted key ranges which overlap each other.

For example, one SST file contains 4 segments of sorted keys. These are the ranges (showing the first 4 bytes of the keys in hex; the common prefix is omitted for brevity) and the key counts:

[65 66 39 61 - 66 32 39 33] : 6816 keys
[65 63 36 63 - 66 31 36 36] : 4864 keys
[65 66 32 65 - 65 66 37 61] : 1401 keys
[65 61 30 39 - 65 63 65 65] : 740 keys

We then deployed a detector for misordered keys in compaction and iteration, and found 4 shards with misordered keys among 400k shards.

We also checked the shards still running v5.7.2 and found one shard with a similar key-misordering issue, so this problem already existed in the old version. However, iteration in v5.7.2 did not report the issue the way v6.4.6 does, which is why we only noticed it after upgrading to v6.4.6.

We would like to ask:

  1. Do some versions of RocksDB have known issues that could cause key misordering in SST files?
  2. Is there a change in iteration behavior from v5.7.2 to v6.4.6?
  3. What do you suggest to avoid this issue in the future?

Thanks. We are still collecting more data from our storage; please let us know if you need anything additional.

@ajkr
Contributor

ajkr commented Sep 18, 2020

One known possibility is a block cache key conflict that causes an old block in the cache to be returned instead of a new block. That is consistent with seeing an ordered sequence of keys injected into the middle of another ordered sequence of keys. Another known possibility is faulty hardware.

More details on the key-conflict possibility: block cache keys are composed of (inode number, inode generation (4 bytes), file offset). In older filesystems (even older versions of ext4) the inode generation is randomly generated. So if an inode number is recycled enough times, which filesystems tend to do, we can end up reading a stale block from the cache.
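To make the failure mode concrete, here is a simplified sketch of a cache key built from those three components (an illustration of the idea only, not RocksDB's exact encoding):

```cpp
#include <cstdint>
#include <tuple>

// Simplified stand-in for a block cache key: (inode number, inode
// generation, block offset). Not RocksDB's exact on-disk/in-cache encoding.
struct BlockCacheKey {
  uint64_t inode_number;
  uint32_t inode_generation;  // 4 bytes; randomly chosen on some filesystems
  uint64_t block_offset;

  bool operator==(const BlockCacheKey& other) const {
    return std::tie(inode_number, inode_generation, block_offset) ==
           std::tie(other.inode_number, other.inode_generation,
                    other.block_offset);
  }
};

// If a new SST file reuses an old file's inode number and happens to draw
// the same random generation, then any block offset present in both files
// produces equal keys, and a lookup can return the old file's stale block.
```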

In terms of the probability of this happening, imagine there are 2^32 days in a year. The birthday paradox math says a class would need ~70K students for a birthday conflict to be 50% likely. Similarly, we need to recycle an inode number roughly ~70K times to see an (inode number, inode generation) conflict with 50% probability.
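For reference, the standard birthday-bound approximation behind that estimate (a back-of-the-envelope check, not RocksDB code):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Birthday bound: with N equally likely values, a collision reaches
  // 50% probability after roughly sqrt(2 * N * ln 2) samples.
  const double N = std::pow(2.0, 32);  // 4-byte random inode generation
  const double n50 = std::sqrt(2.0 * N * std::log(2.0));
  std::printf("~%.0f recycles for a 50%% collision chance\n", n50);  // ~77K
  return 0;
}
```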

But why would RocksDB cycle through files so fast (note the stale block also needs to not be evicted from the block cache for the problem to occur)? One reason is bugs. We previously had a bug that caused the same file to be recompacted endlessly: #4575. An easy way to guess here is to look at the SST file numbers (e.g., the "6" in "000006.sst"). If they are very large (into the millions), that makes the existence of this bug more plausible.

Also, how can you detect whether the filesystem picks inode generations sequentially or randomly? I don't know the canonical way, but I have a diagnostic script that cycles through files, printing <number>,<generation> for each one; just check whether the generation increases sequentially or looks randomized: https://gist.github.com/ajkr/2eac6fe4d918d0c8819e9656ec4eab41. It can be run like this: g++ -o ./tmp ./tmp.c && ./tmp /path/to/rocksdb/dir/tempfile 100. Really, if that ever returned random generation numbers for any platform the DB has ever run on, the problem is plausible.
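Roughly, the script does something like the following for each file it creates (a minimal sketch using the standard Linux ioctl; error handling trimmed, and the exact gist may differ):

```cpp
#include <cstdio>
#include <fcntl.h>
#include <linux/fs.h>   // FS_IOC_GETVERSION
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

// Print <inode number>,<inode generation> for one file, which is roughly
// what the linked diagnostic does on each iteration. Returns 0 on success.
int PrintInodeAndGeneration(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;

  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return -1; }

  int generation = 0;  // ext4 keeps a 32-bit i_generation per inode
  if (ioctl(fd, FS_IOC_GETVERSION, &generation) != 0) { close(fd); return -1; }

  std::printf("%llu,%d\n", (unsigned long long)st.st_ino, generation);
  close(fd);
  return 0;
}
```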

Besides the above possibility (it does sound crazy but I've repro'd it, a bit artificially), the cases we see tend to involve faulty hardware. In recent memory, machine checks have reported CPU / memory problems on the hosts where we knew corruption was first written. The challenge though is how to prevent corruption happening silently and then spreading. Here are some prevention suggestions:

  • backport 27735de
  • backport cockroachdb@148dc8a (note this is not from a FB repo)
  • some users proactively check key ordering whenever they iterate. It seems like a good practice (see the sketch after this list).
  • force_consistency_checks=true is today a good option, but I have not investigated whether it has deficiencies in the version you're using
  • We are working on per-KV integrity. While it won't help cases like this where it looks like maybe a comparator returned the wrong result, there are other corruption cases we've seen where garbage/zero data gets introduced and could be prevented by per-KV checksums
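For the ordering-check suggestion above, here is a minimal sketch of what such a check can look like with the public iterator API (assumes whole-DB iteration with the comparator taken from the DB's options):

```cpp
#include <memory>
#include <string>

#include "rocksdb/comparator.h"
#include "rocksdb/db.h"

// Scan the whole DB and verify that keys come back in strictly increasing
// comparator order. Returns false on the first violation.
bool CheckKeyOrder(rocksdb::DB* db) {
  const rocksdb::Comparator* cmp = db->GetOptions().comparator;
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));

  std::string prev;
  bool has_prev = false;
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    if (has_prev && cmp->Compare(prev, it->key()) >= 0) {
      return false;  // misordered (or duplicate) key observed
    }
    prev.assign(it->key().data(), it->key().size());
    has_prev = true;
  }
  return it->status().ok();
}
```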

@kevinliuca20
Author

Thank you very much. Your reply contains lots of details that are very helpful.

We are using ext4, and your tool prints random generation numbers. The SST file numbers are currently around half a million. Our comparator handles internal keys correctly and returns the same result as it does for user keys, so #4575 should not affect us, right? Both the block cache conflict and faulty hardware are possible in our storage, though we do not yet have enough evidence to confirm which one happened. Does the RocksDB team have a plan to avoid the block cache conflict completely in the future?

We have added the ordering check to our iteration, which we run frequently, and are evaluating the other options you suggested. Thank you.

@yiwu-arbug
Contributor

@ajkr Thank you for the detailed explanation of the issue. We hit the cache key issue last week and were about to file an issue ourselves. Do you know which kernel versions are affected by the random inode generation issue? Thanks.

@yiwu-arbug
Contributor

Also, is it possible to make the cache key take into account the file checksum as well?

@pdillinger
Contributor

pdillinger commented Sep 24, 2020

> Also, is it possible to make the cache key take into account the file checksum as well?

Our relatively new "unique with extremely high probability" identifier for sst files is saved DB session id + file number. The DB session id is currently only available as a 20 byte human readable string, with about 100 bits of entropy. File numbers have 10-20+ bits of entropy. To generate efficient but effective cache key prefixes from this, I would generate two 64-bit hashes from the DB session id, and add (with addition) to one of those the file number. Then you get a 128 bit identifier approaching full entropy as your file numbers approach ~1 billion.
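A sketch of that construction (the hash here is only a placeholder, not the one RocksDB actually uses, and the function names are made up for illustration):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Placeholder seeded 64-bit hash for illustration only; a real
// implementation would use a proper seeded hash function.
uint64_t Hash64(const std::string& s, uint64_t seed) {
  uint64_t h = std::hash<std::string>{}(s) ^ (seed * 0x9e3779b97f4a7c15ULL);
  h ^= h >> 33; h *= 0xff51afd7ed558ccdULL; h ^= h >> 33;
  return h;
}

// Build a 128-bit cache key prefix from (DB session id, file number) along
// the lines described above: two independent 64-bit hashes of the session
// id, with the file number folded into one of them by plain addition so
// that distinct files within one session are guaranteed distinct.
std::pair<uint64_t, uint64_t> MakeCacheKeyPrefix(
    const std::string& db_session_id, uint64_t file_number) {
  uint64_t hi = Hash64(db_session_id, /*seed=*/1);
  uint64_t lo = Hash64(db_session_id, /*seed=*/2) + file_number;  // mod 2^64
  return {hi, lo};
}
```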

There is also a trick in this design (which I considered in the updated backup names): within the same db session id, distinct files are guaranteed unique by unique file numbers. This doesn't actually reduce the overall probability that two files from different databases will collide, but it does "cluster" the collisions such that a single collision between files of two DBs means much higher likelihood of multiple collisions. Why is that good? Keeping the overall probability of files from two DBs colliding the same, it means the probability of any file collisions between any two DBs is much lower--lower than if the file number were hashed. (I believe that in most applications, any corruption in a DB is almost or equally as bad as widespread corruption.)

@yiwu-arbug
Contributor

yiwu-arbug commented Sep 24, 2020

> > Also, is it possible to make the cache key take into account the file checksum as well?
>
> Our relatively new "unique with extremely high probability" identifier for sst files is saved DB session id + file number. The DB session id is currently only available as a 20 byte human readable string, with about 100 bits of entropy. File numbers have 10-20+ bits of entropy. To generate efficient but effective cache key prefixes from this, I would generate two 64-bit hashes from the DB session id, and add (with addition) to one of those the file number. Then you get a 128 bit identifier approaching full entropy as your file numbers approach ~1 billion.
>
> There is also a trick in this design (which I considered in the updated backup names): within the same db session id, distinct files are guaranteed unique by unique file numbers. This doesn't actually reduce the overall probability that two files from different databases will collide, but it does "cluster" the collisions such that a single collision between files of two DBs means much higher likelihood of multiple collisions. Why is that good? Keeping the overall probability of files from two DBs colliding the same, it means the probability of any file collisions between any two DBs is much lower--lower than if the file number were hashed. (I believe that in most applications, any corruption in a DB is almost or equally as bad as widespread corruption.)

That sounds awesome. With the file number being part of the hash, at least we will not get conflicts within the same DB instance. I'm interested in how the DB session id is generated. Any reference?

Also, I'm thinking about how to detect whether a cache conflict does happen. For compaction we can do a key order check, but it looks like there's nothing we can do for point-get requests. Thoughts?

@pdillinger
Contributor

> That sounds awesome.

It does cost some extra cache key bytes, at least when you want your cache optimized for the case of multiple DB instances accessing the same SST files. (NB: existing cache key requires shared storage to be considered same SST file; my proposal would generalize to copied files.)

> With the file number being part of the hash, at least we will not get conflicts within the same DB instance. I'm interested in how the DB session id is generated. Any reference?

https://github.com/facebook/rocksdb/blob/master/db/db_impl/db_impl.cc#L3704

> Also, I'm thinking about how to detect whether a cache conflict does happen. For compaction we can do a key order check, but it looks like there's nothing we can do for point-get requests. Thoughts?

Compare data block range to upper/lower bound from index binary search?

@yiwu-arbug
Contributor

> https://github.com/facebook/rocksdb/blob/master/db/db_impl/db_impl.cc#L3704

Thanks. So it takes a UUID and then hashes it to generate a 128-bit random number. I searched a bit, and it looks like the uuid file is based on /dev/urandom. Maybe it could read 128 bits directly from /dev/urandom.
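For what it's worth, reading 128 bits straight from /dev/urandom is only a few lines (a sketch of the suggestion, not a claim about how RocksDB should implement it given the Env abstraction discussed below):

```cpp
#include <cstdint>
#include <cstdio>

// Read 16 bytes (128 bits) of randomness directly from /dev/urandom.
// Returns true on success.
bool Read128BitsOfEntropy(uint8_t out[16]) {
  std::FILE* f = std::fopen("/dev/urandom", "rb");
  if (f == nullptr) return false;
  const size_t n = std::fread(out, 1, 16, f);
  std::fclose(f);
  return n == 16;
}
```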

@pdillinger
Contributor

> So it takes a UUID and then hashes it to generate a 128-bit random number.

This way was kind of a matter of "if it ain't broke, don't fix it" and kind of a matter of only relying on what's already available in Env.

> Maybe it could read 128 bits directly from /dev/urandom.

Improvement welcome, but e.g. adding more code to Env to do basically the same thing for this one case might not be an improvement ;)

@yiwu-arbug
Contributor

> > So it takes a UUID and then hashes it to generate a 128-bit random number.
>
> This way was kind of a matter of "if it ain't broke, don't fix it" and kind of a matter of only relying on what's already available in Env.
>
> > Maybe it could read 128 bits directly from /dev/urandom.
>
> Improvement welcome, but e.g. adding more code to Env to do basically the same thing for this one case might not be an improvement ;)

lol, totally agree. My comment was just a random thought.

@ajkr
Contributor

ajkr commented Sep 25, 2020

> ... In older filesystems (even older versions of ext4) ...

Correction: it's actually newer versions of ext4 that would be affected: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=232530680290ba94ca37852ab10d9556ea28badf

jay-zhuang added a commit to jay-zhuang/rocksdb that referenced this issue Oct 6, 2020
It's a known issue, which is tracked in facebook#7405, facebook#7470. Disable it for
now.
facebook-github-bot pushed a commit that referenced this issue Oct 8, 2020
Summary:
It's a known issue, which is tracked in #7405, #7470. Disable it for
now.

Pull Request resolved: #7511

Reviewed By: zhichao-cao

Differential Revision: D24145075

Pulled By: jay-zhuang

fbshipit-source-id: 1858497972f2baba617867aaeac30d93b8305c80
@yiwu-arbug
Contributor

Hi. Is there work on getting a fix for the cache key issue? Thanks @ajkr @pdillinger @zhichao-cao

pdillinger added a commit to pdillinger/rocksdb that referenced this issue Aug 13, 2021
Summary: Use DB session ids in SST table properties to make cache keys
stable across DB re-open and copy / move / restore / etc.

These new cache keys are currently only enabled when FileSystem does not
provide GetUniqueId. For now, they are typically larger, so slightly
less efficient.

Relevant to facebook#7405

Test Plan: new unit test added, which fails when disabling new
functionality
facebook-github-bot pushed a commit that referenced this issue Aug 17, 2021
Summary:
Use DB session ids in SST table properties to make cache keys
stable across DB re-open and copy / move / restore / etc.

These new cache keys are currently only enabled when FileSystem does not
provide GetUniqueId. For now, they are typically larger, so slightly
less efficient.

Relevant to #7405

This change has a minor regression in PersistentCache functionality:
metaindex blocks are no longer cached in PersistentCache. Table properties
blocks already were not but ideally should be. I didn't spent effort to
fix & test these issues because we don't believe PersistentCache is used much
if at all and expect SecondaryCache to replace it. (Though PRs are welcome.)

FIXME: there is more to be fixed for stable cache keys on external SST files

Pull Request resolved: #8659

Test Plan:
new unit test added, which fails when disabling new
functionality

Reviewed By: zhichao-cao

Differential Revision: D30297705

Pulled By: pdillinger

fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a
@pdillinger pdillinger self-assigned this Oct 26, 2021
@pdillinger
Contributor

> Hi. Is there work on getting a fix for the cache key issue? Thanks @ajkr @pdillinger @zhichao-cao

Basic support was added in #8659 and revised for the same release in #8686; both are in RocksDB 6.24.0. The stable cache keys are only enabled if the FileSystem does not provide file unique ids, and the keys are not very compact, but they work well. I have an enhanced version in progress.

Also, if the SST file is older than RocksDB 6.24, then it will use a temporary cache key based on the current DB session id and file number rather than the originals for that file. (This is at least stable across table file re-open, but changes across different DB Opens.)
