Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Immediate put version should be kept before full_history_ts_low to guarantee atomicity #9106

Closed
BusyJay opened this issue Nov 1, 2021 · 5 comments
Assignees
Labels
bug Confirmed RocksDB bugs

Comments

@BusyJay
Copy link
Contributor

BusyJay commented Nov 1, 2021

Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

Expected behavior

One usage of user timestamp is to provide transaction functionality. Same transactions usually are traced with the same user timestamp. Supposing four timestamps t1 < t2 < t3 < t4, and three transactions have been committed as follow:

  1. (put k1, v1 t1) (put k2, v2, t1) (put k3, v3, t1)
  2. (del k3, t2)
  3. (put k1, v5, t4)

And full_history_ts_low is set to t3. To meet the requirement of Atomicity, a transaction should either all be read or none be read. So following query results should be guaranteed:

  1. When t3 is used to read k1, k2 and k3, v1, v2 should be returned for k1, k2 respectively, k3 should return None.
  2. When t4 is used to read k1 and k2, v5 and v2 should be returned for k1, k2 respectively.

Actual behavior

Because rocksdb will delete all keys before full_history_ts_low unless it's the largest version, so the query 1 will only return v2 and None for both k1 and k3. Query 2 result is meet.

Whether the behavior is correct depends how we define full_history_ts_low. If full_history_ts_low is defined as the minimal available readable version, then current behavior is correct. But that will make full_history_ts_low useless for correct transactions usage. And application needs to develop its own gc algorithm. If it's defined as the minimal timestamp that guarantee ACID, then current behavior is wrong. The immediate put version should be kept before full_history_ts_low. And user timestamp will be a drop in design for common transaction timestamp.

@riversand963
Copy link
Contributor

But that will make full_history_ts_low useless for correct transactions usage

Could you elaborate on this?

Currently, the purpose of full_history_ts_low is for purging old history. It's the minimal timestamp that can be used to serve read, and applications can increase it overtime. It should not be used by RocksDB transaction layer for conflict detection/concurrency control.
Maybe we can use a separate watermark to first stop reads under the timestamp threshold and wait for existing transactions to finish, and then we can actually start GC based on full_history_ts_low.

@riversand963 riversand963 self-assigned this Nov 1, 2021
@BusyJay
Copy link
Contributor Author

BusyJay commented Nov 2, 2021

To clarify, transaction here means building transaction on top of RocksDB instead of RocksDB's own transaction layer.

To reserve spaces, applications use MVCC will usually only keep partial history of a key. That's the exact purpose of full_history_ts_low. But if an intermediate put version is not reserved, application can't use full_history_ts_low directly as explained above. Instead, full_history_ts_low needs to be set to 0 to avoid accidentally deletion and application needs to develop its own history gc algorithm, for example, use a background job to delete the key explicitly or using a customized compaction filter to clean stale versions.

wait for existing transactions to finish, and then we can actually start GC based on full_history_ts_low

It doesn't work. The example given in the issue doesn't involve any on-going transactions, all queries are performed after updates are finished and compaction is done.

@riversand963
Copy link
Contributor

Thanks @BusyJay for catching this. I don't think the current behavior is intended. It is a bug triggered when there are two different versions of the same user key, one is larger than or equal to full_history_ts_low, while the other is smaller than full_history_ts_low. The correct behavior should be that we still keep the largest version below full_history_ts_low. Its timestamp may be zeroed out, though.
I can send out a fix.

@riversand963 riversand963 linked a pull request Nov 2, 2021 that will close this issue
@riversand963
Copy link
Contributor

I am going to do the fix in #9116.

@riversand963 riversand963 added the bug Confirmed RocksDB bugs label Nov 4, 2021
@riversand963
Copy link
Contributor

Closing since the fix is in master and will be available in next release.
Feel free to reopen if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed RocksDB bugs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants