
SST files overlap at a level and keys are not ordered in a SST file #7405

Closed
kevinliuca20 opened this issue Sep 17, 2020 · 13 comments

@kevinliuca20

We are upgrading our RocksDB storage from 5.7.2 to 6.4.6. After upgrading to 6.4.6, we noticed that some shards returned unexpected keys during iteration. We then dumped the manifest and SST files and found that the (smallest key, largest key) ranges of some SST files at the bottommost level overlap, and that some SST files contain multiple sorted key ranges which overlap each other.

For example, one SST file contains 4 segments of sorted keys. These are the ranges (showing the first 4 bytes of the keys in hex; the common prefix is omitted for brevity) and the key counts:

[65 66 39 61 - 66 32 39 33] : 6816 keys
[65 63 36 63 - 66 31 36 36] : 4864 keys
[65 66 32 65 - 65 66 37 61] : 1401 keys
[65 61 30 39 - 65 63 65 65] : 740 keys

We then deployed a detector for misordered keys in compaction and iteration, and found 4 shards with misordered keys among 400k shards.

We also checked the shards still running v5.7.2 and found one shard with a similar key-misordering issue, so this problem already existed in the old version. However, iteration in v5.7.2 did not report the issue the way v6.4.6 does, which is why we only noticed it after upgrading to v6.4.6.

We would like to ask:

  1. Do some versions of RocksDB have known issues that could cause key misordering in SST files?
  2. Is there a change in iteration behavior from v5.7.2 to v6.4.6?
  3. What do you suggest to avoid this issue in the future?

Thanks. We are still collecting more data from our storage; please let us know if you need anything additional.

@ajkr
Contributor

ajkr commented Sep 18, 2020

One known possibility is a block cache key conflict that causes an old block in the cache to be returned instead of a new block. That is consistent with seeing an ordered sequence of keys injected into the middle of another ordered sequence of keys. Another known possibility is faulty hardware.

More details on the key-conflict possibility: block cache keys are composed of (inode number, inode generation (4 bytes), file offset). In older filesystems (even older versions of ext4) the inode generation is randomly generated. So if an inode number is recycled enough times, which filesystems tend to do, we can end up reading a stale block from the cache.
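To make the failure mode concrete, here is a simplified sketch of a cache key built from those three components (an illustration of the idea only, not RocksDB's exact encoding):

```cpp
#include <cstdint>
#include <tuple>

// Simplified stand-in for a block cache key: (inode number, inode
// generation, block offset). Not RocksDB's exact on-disk/in-cache encoding.
struct BlockCacheKey {
  uint64_t inode_number;
  uint32_t inode_generation;  // 4 bytes; randomly chosen on some filesystems
  uint64_t block_offset;

  bool operator==(const BlockCacheKey& other) const {
    return std::tie(inode_number, inode_generation, block_offset) ==
           std::tie(other.inode_number, other.inode_generation,
                    other.block_offset);
  }
};

// If a new SST file reuses an old file's inode number and happens to draw
// the same random generation, then any block offset present in both files
// produces equal keys, and a lookup can return the old file's stale block.
```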

In terms of the probability of this happening, imagine there are 2^32 days in a year. The birthday paradox math says a class would need ~70K students for a birthday conflict to be 50% likely. Similarly, we need to recycle an inode number roughly ~70K times to see an (inode number, inode generation) conflict with 50% probability.
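For reference, the standard birthday-bound approximation behind that estimate (a back-of-the-envelope check, not RocksDB code):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Birthday bound: with N equally likely values, a collision reaches
  // 50% probability after roughly sqrt(2 * N * ln 2) samples.
  const double N = std::pow(2.0, 32);  // 4-byte random inode generation
  const double n50 = std::sqrt(2.0 * N * std::log(2.0));
  std::printf("~%.0f recycles for a 50%% collision chance\n", n50);  // ~77K
  return 0;
}
```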

But why would RocksDB cycle through files so fast (note the stale block also needs to not be evicted from the block cache for the problem to occur)? One reason is bugs. We previously had a bug that caused the same file to be recompacted endlessly: #4575. An easy way to guess here is to look at the SST file numbers (e.g., the "6" in "000006.sst"). If they are very large (into the millions), that makes the existence of this bug more plausible.

Also, how can you detect whether the filesystem picks inode generations sequentially or randomly? I don't know the canonical way, but I have a diagnostic script that cycles through files, printing <number>,<generation> for each one; just check whether the generation increases sequentially or looks randomized: https://gist.github.com/ajkr/2eac6fe4d918d0c8819e9656ec4eab41. It can be run like this: g++ -o ./tmp ./tmp.c && ./tmp /path/to/rocksdb/dir/tempfile 100. Really, if that ever returned random generation numbers for any platform the DB has ever run on, the problem is plausible.
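Roughly, the script does something like the following for each file it creates (a minimal sketch using the standard Linux ioctl; error handling trimmed, and the exact gist may differ):

```cpp
#include <cstdio>
#include <fcntl.h>
#include <linux/fs.h>   // FS_IOC_GETVERSION
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

// Print <inode number>,<inode generation> for one file, which is roughly
// what the linked diagnostic does on each iteration. Returns 0 on success.
int PrintInodeAndGeneration(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;

  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return -1; }

  int generation = 0;  // ext4 keeps a 32-bit i_generation per inode
  if (ioctl(fd, FS_IOC_GETVERSION, &generation) != 0) { close(fd); return -1; }

  std::printf("%llu,%d\n", (unsigned long long)st.st_ino, generation);
  close(fd);
  return 0;
}
```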

Besides the above possibility (it does sound crazy but I've repro'd it, a bit artificially), the cases we see tend to involve faulty hardware. In recent memory, machine checks have reported CPU / memory problems on the hosts where we knew corruption was first written. The challenge though is how to prevent corruption happening silently and then spreading. Here are some prevention suggestions:

  • backport 27735de
  • backport cockroachdb@148dc8a (note this is not from a FB repo)
  • some users proactively check key ordering whenever they iterate. It seems like a good practice (see the sketch after this list).
  • force_consistency_checks=true is today a good option, but I have not investigated whether it has deficiencies in the version you're using
  • We are working on per-KV integrity. While it won't help cases like this where it looks like maybe a comparator returned the wrong result, there are other corruption cases we've seen where garbage/zero data gets introduced and could be prevented by per-KV checksums
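For the ordering-check suggestion above, here is a minimal sketch of what such a check can look like with the public iterator API (assumes whole-DB iteration with the comparator taken from the DB's options):

```cpp
#include <memory>
#include <string>

#include "rocksdb/comparator.h"
#include "rocksdb/db.h"

// Scan the whole DB and verify that keys come back in strictly increasing
// comparator order. Returns false on the first violation.
bool CheckKeyOrder(rocksdb::DB* db) {
  const rocksdb::Comparator* cmp = db->GetOptions().comparator;
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));

  std::string prev;
  bool has_prev = false;
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    if (has_prev && cmp->Compare(prev, it->key()) >= 0) {
      return false;  // misordered (or duplicate) key observed
    }
    prev.assign(it->key().data(), it->key().size());
    has_prev = true;
  }
  return it->status().ok();
}
```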

@kevinliuca20
Author

Thank you very much. Your reply contains lots of details that are very helpful.

We are using ext4, and your tool prints random generation numbers. The SST file numbers are currently around half a million. Our comparator handles internal keys correctly and returns the same result as it does for user keys, so #4575 should not affect us, right? Both the block cache conflict and faulty hardware are possible in our storage, though we do not yet have enough evidence to confirm which one happened. Does the RocksDB team have a plan to avoid the block cache conflict completely in the future?

We have added the ordering check to our iteration, which we run frequently, and are evaluating the other options you suggested. Thank you.

@yiwu-arbug
Contributor

@ajkr Thank you for the detailed explanation of the issue. We hit the cache key issue last week and were about to file an issue ourselves. Do you know which kernel versions are affected by the random inode generation issue? Thanks.

@yiwu-arbug
Contributor

Also, is it possible to make the cache key take into account the file checksum as well?

@pdillinger
Contributor

pdillinger commented Sep 24, 2020

> Also, is it possible to make the cache key take into account the file checksum as well?

Our relatively new "unique with extremely high probability" identifier for sst files is saved DB session id + file number. The DB session id is currently only available as a 20 byte human readable string, with about 100 bits of entropy. File numbers have 10-20+ bits of entropy. To generate efficient but effective cache key prefixes from this, I would generate two 64-bit hashes from the DB session id, and add (with addition) to one of those the file number. Then you get a 128 bit identifier approaching full entropy as your file numbers approach ~1 billion.
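A sketch of that construction (the hash here is only a placeholder, not the one RocksDB actually uses, and the function names are made up for illustration):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Placeholder seeded 64-bit hash for illustration only; a real
// implementation would use a proper seeded hash function.
uint64_t Hash64(const std::string& s, uint64_t seed) {
  uint64_t h = std::hash<std::string>{}(s) ^ (seed * 0x9e3779b97f4a7c15ULL);
  h ^= h >> 33; h *= 0xff51afd7ed558ccdULL; h ^= h >> 33;
  return h;
}

// Build a 128-bit cache key prefix from (DB session id, file number) along
// the lines described above: two independent 64-bit hashes of the session
// id, with the file number folded into one of them by plain addition so
// that distinct files within one session are guaranteed distinct.
std::pair<uint64_t, uint64_t> MakeCacheKeyPrefix(
    const std::string& db_session_id, uint64_t file_number) {
  uint64_t hi = Hash64(db_session_id, /*seed=*/1);
  uint64_t lo = Hash64(db_session_id, /*seed=*/2) + file_number;  // mod 2^64
  return {hi, lo};
}
```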

There is also a trick in this design (which I considered in the updated backup names): within the same db session id, distinct files are guaranteed unique by unique file numbers. This doesn't actually reduce the overall probability that two files from different databases will collide, but it does "cluster" the collisions such that a single collision between files of two DBs means much higher likelihood of multiple collisions. Why is that good? Keeping the overall probability of files from two DBs colliding the same, it means the probability of any file collisions between any two DBs is much lower--lower than if the file number were hashed. (I believe that in most applications, any corruption in a DB is almost or equally as bad as widespread corruption.)

@yiwu-arbug
Contributor

yiwu-arbug commented Sep 24, 2020

> > Also, is it possible to make the cache key take into account the file checksum as well?
>
> Our relatively new "unique with extremely high probability" identifier for sst files is saved DB session id + file number. The DB session id is currently only available as a 20 byte human readable string, with about 100 bits of entropy. File numbers have 10-20+ bits of entropy. To generate efficient but effective cache key prefixes from this, I would generate two 64-bit hashes from the DB session id, and add (with addition) to one of those the file number. Then you get a 128 bit identifier approaching full entropy as your file numbers approach ~1 billion.
>
> There is also a trick in this design (which I considered in the updated backup names): within the same db session id, distinct files are guaranteed unique by unique file numbers. This doesn't actually reduce the overall probability that two files from different databases will collide, but it does "cluster" the collisions such that a single collision between files of two DBs means much higher likelihood of multiple collisions. Why is that good? Keeping the overall probability of files from two DBs colliding the same, it means the probability of any file collisions between any two DBs is much lower--lower than if the file number were hashed. (I believe that in most applications, any corruption in a DB is almost or equally as bad as widespread corruption.)

That sounds awesome. With the file number being part of the hash, at least we will not get conflicts within the same DB instance. I'm interested in how the DB session id is generated. Any reference?

Also, I'm thinking about how to detect whether a cache conflict does happen. For compaction we can do a key order check, but it looks like there's nothing we can do for point-get requests. Thoughts?

@pdillinger
Contributor

> That sounds awesome.

It does cost some extra cache key bytes, at least when you want your cache optimized for the case of multiple DB instances accessing the same SST files. (NB: existing cache key requires shared storage to be considered same SST file; my proposal would generalize to copied files.)

> With the file number being part of the hash, at least we will not get conflicts within the same DB instance. I'm interested in how the DB session id is generated. Any reference?

https://github.com/facebook/rocksdb/blob/master/db/db_impl/db_impl.cc#L3704

> Also, I'm thinking about how to detect whether a cache conflict does happen. For compaction we can do a key order check, but it looks like there's nothing we can do for point-get requests. Thoughts?

Compare data block range to upper/lower bound from index binary search?

@yiwu-arbug
Contributor

> https://github.com/facebook/rocksdb/blob/master/db/db_impl/db_impl.cc#L3704

Thanks. So it takes a UUID and then hashes it to generate a 128-bit random number. I searched a bit, and it looks like the uuid file is based on /dev/urandom. Maybe it could read 128 bits directly from /dev/urandom.
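For what it's worth, reading 128 bits straight from /dev/urandom is only a few lines (a sketch of the suggestion, not a claim about how RocksDB should implement it given the Env abstraction discussed below):

```cpp
#include <cstdint>
#include <cstdio>

// Read 16 bytes (128 bits) of randomness directly from /dev/urandom.
// Returns true on success.
bool Read128BitsOfEntropy(uint8_t out[16]) {
  std::FILE* f = std::fopen("/dev/urandom", "rb");
  if (f == nullptr) return false;
  const size_t n = std::fread(out, 1, 16, f);
  std::fclose(f);
  return n == 16;
}
```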

@pdillinger
Contributor

> So it takes a UUID and then hashes it to generate a 128-bit random number.

This way was kind of a matter of "if it ain't broke, don't fix it" and kind of a matter of only relying on what's already available in Env.

> Maybe it could read 128 bits directly from /dev/urandom.

Improvement welcome, but e.g. adding more code to Env to do basically the same thing for this one case might not be an improvement ;)

@yiwu-arbug
Contributor

> > So it takes a UUID and then hashes it to generate a 128-bit random number.
>
> This way was kind of a matter of "if it ain't broke, don't fix it" and kind of a matter of only relying on what's already available in Env.
>
> > Maybe it could read 128 bits directly from /dev/urandom.
>
> Improvement welcome, but e.g. adding more code to Env to do basically the same thing for this one case might not be an improvement ;)

lol, totally agree. My comment was just a random thought.

@ajkr
Contributor

ajkr commented Sep 25, 2020

> ... In older filesystems (even older versions of ext4) ...

Correction: it's actually newer versions of ext4 that would be affected: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=232530680290ba94ca37852ab10d9556ea28badf

jay-zhuang added a commit to jay-zhuang/rocksdb that referenced this issue Oct 6, 2020
It's a known issue, which is tracked in facebook#7405, facebook#7470. Disable it for
now.
facebook-github-bot pushed a commit that referenced this issue Oct 8, 2020
Summary:
It's a known issue, which is tracked in #7405, #7470. Disable it for
now.

Pull Request resolved: #7511

Reviewed By: zhichao-cao

Differential Revision: D24145075

Pulled By: jay-zhuang

fbshipit-source-id: 1858497972f2baba617867aaeac30d93b8305c80
@yiwu-arbug
Contributor

Hi. Is there work on getting a fix for the cache key issue? Thanks @ajkr @pdillinger @zhichao-cao

pdillinger added a commit to pdillinger/rocksdb that referenced this issue Aug 13, 2021
Summary: Use DB session ids in SST table properties to make cache keys
stable across DB re-open and copy / move / restore / etc.

These new cache keys are currently only enabled when FileSystem does not
provide GetUniqueId. For now, they are typically larger, so slightly
less efficient.

Relevant to facebook#7405

Test Plan: new unit test added, which fails when disabling new
functionality
facebook-github-bot pushed a commit that referenced this issue Aug 17, 2021
Summary:
Use DB session ids in SST table properties to make cache keys
stable across DB re-open and copy / move / restore / etc.

These new cache keys are currently only enabled when FileSystem does not
provide GetUniqueId. For now, they are typically larger, so slightly
less efficient.

Relevant to #7405

This change has a minor regression in PersistentCache functionality:
metaindex blocks are no longer cached in PersistentCache. Table properties
blocks already were not but ideally should be. I didn't spent effort to
fix & test these issues because we don't believe PersistentCache is used much
if at all and expect SecondaryCache to replace it. (Though PRs are welcome.)

FIXME: there is more to be fixed for stable cache keys on external SST files

Pull Request resolved: #8659

Test Plan:
new unit test added, which fails when disabling new
functionality

Reviewed By: zhichao-cao

Differential Revision: D30297705

Pulled By: pdillinger

fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a
@pdillinger pdillinger self-assigned this Oct 26, 2021
@pdillinger
Contributor

> Hi. Is there work on getting a fix for the cache key issue? Thanks @ajkr @pdillinger @zhichao-cao

Basic support was added in #8659 and revised for the same release in #8686; both are in RocksDB 6.24.0. The stable cache keys are only enabled if the FileSystem does not provide file unique ids, and the keys are not very compact, but they work well. I have an enhanced version in progress.

Also, if the SST file is older than RocksDB 6.24, then it will use a temporary cache key based on the current DB session id and file number rather than the originals for that file. (This is at least stable across table file re-open, but changes across different DB Opens.)
