
Support user-defined timestamps in write-committed txns #9629

Closed
wants to merge 1 commit

Conversation

riversand963
Contributor

Summary:
Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are
locked upon the first operation that writes the key or declares the intention of writing. For example,
`PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`, and
`PessimisticTransaction::SingleDelete()` will write to or delete a key, while
`PessimisticTransaction::GetForUpdate()` is used by the application to indicate
to RocksDB that the transaction intends to perform a write operation later
in the same transaction.
Pessimistic transactions support two-phase commit (2PC). A transaction can be
`Prepare()`'d and then `Commit()`'ed. The prepare phase is similar to a promise: once
`Prepare()` succeeds, the transaction has acquired the necessary resources to commit.
The resources include locks, persistence of the WAL, etc.
Write-committed transactions are the default pessimistic transaction implementation. In
a RocksDB write-committed transaction, `Prepare()` writes the data to the WAL as a prepare
section. `Commit()` writes a commit marker to the WAL and then writes the data to the
memtables. While writing to the memtables, different keys in the transaction's write batch
are assigned different sequence numbers in ascending order.
Until commit/rollback, the transaction holds locks on the keys so that no other transaction
can write to the same keys. Furthermore, the keys' sequence numbers represent the order
in which they are committed and should be made visible. This makes it convenient to
implement support for user-defined timestamps.
Since column families with and without timestamps can co-exist in the same database,
a transaction may or may not involve timestamps. Based on this observation, we add two
optional members to each `PessimisticTransaction`: `read_timestamp_` and
`commit_timestamp_`. If no key in the transaction's write batch has a timestamp, then
setting these two variables has no effect. For the rest of this commit, we discuss
only the cases in which these two variables are meaningful.

`read_timestamp_` is used mainly for validation, and should be set before the first call to
`GetForUpdate()`; otherwise, the latter will return a non-OK status. `GetForUpdate()` calls
`TryLock()`, which can verify whether another transaction has written the same key between
`read_timestamp_` and this call to `GetForUpdate()`. If another transaction has indeed
written the same key, then validation fails, and RocksDB allows this transaction to
refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()`
with a different timestamp to read, but the result of the read should not be used to
determine data that will be written later.
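The validation described above can be modeled with a toy sketch. All names here (`ValidateKey`, `LastWriteTimestamps`) are illustrative stand-ins, not RocksDB APIs: a key passes validation only if no write newer than `read_timestamp_` has been committed to it; on failure, the transaction may bump `read_timestamp_` and retry.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model of read-timestamp validation; not RocksDB code.
using TxnTimestamp = uint64_t;
// Maps each key to the timestamp of the latest committed write to it.
using LastWriteTimestamps = std::map<std::string, TxnTimestamp>;

// Validation fails if another transaction committed a write to `key` with a
// timestamp newer than the reader's read timestamp.
inline bool ValidateKey(const LastWriteTimestamps& last_writes,
                        const std::string& key, TxnTimestamp read_timestamp) {
  auto it = last_writes.find(key);
  if (it == last_writes.end()) {
    return true;  // key never written, nothing to conflict with
  }
  return it->second <= read_timestamp;  // ok only if the write is not newer
}
```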

`commit_timestamp_` must be set after all writes are finished and before the transaction commits.
This applies to both the 2PC and non-2PC cases. In the 2PC case, it is usually set after
the prepare phase succeeds.

We currently require that the commit timestamp be chosen after all keys are locked. This
means we disallow the `TransactionDB`-level APIs if user-defined timestamps are used
by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`,
`PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`,
etc. will return a non-OK status, because they specify timestamps before locking the keys.
Users are also prompted to use the `Transaction` APIs when they receive the non-OK status.

Differential Revision: D31822445

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31822445


@ltamasi ltamasi left a comment

Thanks a lot for the PR @riversand963 ! LGTM in general, just some questions/comments

Comment on lines 137 to 138
// Otherwise, it means we are just writing to the WAL, and we allow
// timestamps unset for the keys in the write batch.
Contributor

Could you clarify why it is OK to not have the timestamps set if we're only writing to the WAL?

Contributor Author

Sure, will clarify, and I will explain in this comment as well:
in the write-committed prepare phase, we do not need the timestamps in the WAL because these keys have not been inserted into the memtables yet, and the prepare section will be used only for recovery. During recovery, should these keys be inserted into the memtables, we will obtain the timestamp from the commit marker, which is also in the WAL.
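The recovery path described here can be sketched as a toy model (`PreparedBatch`, `CommitMarker`, and `Replay` are illustrative names, not RocksDB types): prepare-section entries carry keys without timestamps, and recovery stamps each key with the timestamp taken from the commit marker when inserting into the memtable.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model (not RocksDB code) of why prepare-section WAL entries need no
// timestamps: the commit marker carries the commit timestamp, and recovery
// stamps keys only when inserting them into the memtable.
using TxnTimestamp = uint64_t;

struct PreparedBatch {
  std::vector<std::pair<std::string, std::string>> puts;  // keys, no timestamps
};

struct CommitMarker {
  TxnTimestamp commit_ts;  // timestamp recorded once, at commit
};

// "Recovery": every key gets its timestamp from the commit marker, not from
// the prepared batch itself.
inline std::map<std::string, std::pair<std::string, TxnTimestamp>> Replay(
    const PreparedBatch& batch, const CommitMarker& marker) {
  std::map<std::string, std::pair<std::string, TxnTimestamp>> memtable;
  for (const auto& kv : batch.puts) {
    memtable[kv.first] = {kv.second, marker.commit_ts};
  }
  return memtable;
}
```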

}
}

if (kMaxTxnTimestamp == read_timestamp_) {
Contributor

Would it make sense to check do_validate here (and below)?

Contributor Author

I may be missing something, but we do perform a check for do_validate. Actually, we require do_validate to be true; otherwise, it's an invalid argument.

Contributor

I'm talking about the case when read_timestamp_ is not set (kMaxTxnTimestamp). The current logic considers that an error even if do_validate is false, right? I'm wondering if that's what we want, or if we would want to make sure read_timestamp_ is set if and only if do_validate is set.

Contributor Author

If the control flow reaches here, it means timestamps are enabled and GetForUpdate() is called. I think in this case, we should just require do_validate to be true AND read_timestamp_ to be set.

Contributor

Sorry, I don't follow... At the point where we check kMaxTxnTimestamp == read_timestamp_, is do_validate guaranteed to be set? If it is, the else if (!do_validate) branch can't be hit, right?

In other words, I was wondering if we would want to do some version of this:

if (do_validate) {
   if (kMaxTxnTimestamp == read_timestamp_) {
      // error
   }
} else {
   if (kMaxTxnTimestamp != read_timestamp_) {
      // error
   }
}

Contributor Author

I see the source of the confusion. I think the logic can be

if (!do_validate) {
  // error, because GetForUpdate() with timestamp enabled must perform validation
} else if (kMaxTxnTimestamp == read_timestamp_) {
  // error
}

Contributor

Thanks! Checking do_validate first definitely makes the intent clearer to me
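The agreed-upon check order can be written as a self-contained sketch (the function and enum names here are illustrative, not the actual RocksDB code): reject `GetForUpdate()` on a timestamp-enabled column family unless validation is requested AND a read timestamp has been set.

```cpp
#include <cstdint>
#include <limits>

// Sketch of the argument check discussed above; names are illustrative.
using TxnTimestamp = uint64_t;
// Sentinel meaning "read timestamp not set", mirroring kMaxTxnTimestamp.
constexpr TxnTimestamp kMaxTxnTimestamp =
    std::numeric_limits<TxnTimestamp>::max();

enum class CheckResult { kOk, kInvalidArgument };

inline CheckResult CheckGetForUpdateArgs(bool do_validate,
                                         TxnTimestamp read_timestamp) {
  if (!do_validate) {
    // GetForUpdate() with timestamps enabled must perform validation.
    return CheckResult::kInvalidArgument;
  } else if (kMaxTxnTimestamp == read_timestamp) {
    // Validation requested, but no read timestamp was set.
    return CheckResult::kInvalidArgument;
  }
  return CheckResult::kOk;
}
```

Checking `do_validate` first, as suggested, keeps the error cases in the order a reader would reason about them.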

value, exclusive, do_validate);
}

Status WriteCommittedTxn::GetForUpdate(const ReadOptions& read_options,
Contributor

Is the type of the output value the only difference between this GetForUpdate and the previous one? If yes, we could eliminate the code duplication by introducing a private helper template

Contributor Author

Will do
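The deduplication pattern suggested here can be illustrated with a self-contained sketch (the types `Lookup` and `PinnableString` are stand-ins, not RocksDB's): two public overloads that differ only in the output value type forward to one private helper template.

```cpp
#include <string>

// Illustrative sketch of the "private helper template" pattern; not RocksDB
// code. PinnableString stands in for a second output type like PinnableSlice.
struct PinnableString {
  std::string data;
};

class Lookup {
 public:
  // Two public overloads differing only in the output type...
  bool Get(const std::string& key, std::string* value) {
    return GetImpl(key, value);
  }
  bool Get(const std::string& key, PinnableString* value) {
    return GetImpl(key, value);
  }

 private:
  // ...forward to one private template holding the shared logic.
  template <typename TValue>
  bool GetImpl(const std::string& key, TValue* value) {
    Assign("value_for_" + key, value);  // only the assignment differs per type
    return true;
  }
  static void Assign(const std::string& src, std::string* dst) { *dst = src; }
  static void Assign(const std::string& src, PinnableString* dst) {
    dst->data = src;
  }
};
```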

Comment on lines 250 to 264
column_family =
    column_family ? column_family : db_impl_->DefaultColumnFamily();
assert(column_family);
const Comparator* const ucmp = column_family->GetComparator();
assert(ucmp);
size_t ts_sz = ucmp->timestamp_size();
if (0 == ts_sz) {
  s = GetBatchForWrite()->Put(column_family, key, value);
} else {
  assert(ts_sz == sizeof(TxnTimestamp));
  if (!IndexingEnabled()) {
    cfs_with_ts_.insert(column_family->GetID());
  }
  s = GetBatchForWrite()->Put(column_family, key, value);
}
Contributor

I'm wondering if we could turn these repeated chunks of code into private helpers as well.

Contributor Author

Will do

Comment on lines +271 to +272
Status SingleDeleteUntracked(ColumnFamilyHandle* column_family,
const Slice& key) override;
Contributor

Would it make sense to have a SingleDeleteUntracked overload taking SliceParts for the sake of completeness?

Contributor Author

The original motivation for this PR is to add timestamp support to existing transaction APIs. I may have violated this myself, and I need to double-check.

Contributor Author

Double-checked that we are not adding new APIs, so I will skip this in this PR.

// indexing_enabled_ is false. If a key is written when indexing_enabled_ is
// true, then the corresponding column family is not added to cfs_with_ts
// even if it enables timestamp.
std::unordered_set<uint32_t> cfs_with_ts_;
Contributor

We could consider renaming this to something like cfs_with_ts_indexing_disabled_ or similar to make its semantics clearer in the code.

Contributor Author

Good point. Will do

void ReinitializeTransaction(
Transaction* txn, const WriteOptions& write_options,
const TransactionOptions& txn_options = TransactionOptions());

virtual Status VerifyCFOptions(const ColumnFamilyOptions& cf_options);

Status FailIfCfEnablesTs(const ColumnFamilyHandle* column_family) const;
Contributor

Could this also be a static helper?

Contributor Author

We could make it static only if we remove the following, since we need to call DefaultColumnFamily(), which is non-static.

Status FailIfCfEnablesTs(const ColumnFamilyHandle* column_family) const {
  column_family = column_family ? column_family : DefaultColumnFamily();
  ...
}

ASSERT_OK(
    pessimistic_txn_db->WriteWithConcurrencyControl(WriteOptions(), &wb2));

auto* txn = db->BeginTransaction(WriteOptions(), TransactionOptions(),
Contributor

Minor but we could immediately pass these transaction pointers to a unique_ptr, which would eliminate the need for manual deletes

Contributor Author

I think RAII is a good idea and should be used in most cases if possible.
For RocksDB transactions, we unlock the keys in the destructor. Therefore, we more often want to control explicitly when a transaction is deleted, especially when there are multiple transactions in the test.
That being said, I am OK with applying the suggested change to tests where we create only one transaction, and adding a comment saying that the locks will be released only when the txn variable goes out of scope.

Contributor Author

I think we can still use smart pointers and explicitly call reset().
There is a lot of legacy code in the transaction tests, but I will enforce this in the new test code added by me.

Contributor

Sure. Just wanted to mention that if needed, we can still explicitly destroy a transaction held by a unique_ptr using reset().
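The point discussed here can be shown in a toy example (`MockTxn` is a hypothetical type, not a RocksDB one): a destructor that releases locks still runs at a precise, explicit moment when the owner calls `unique_ptr::reset()`.

```cpp
#include <memory>

// Hypothetical transaction whose destructor releases locks, modeling
// RocksDB's behavior of unlocking keys when a Transaction is destroyed.
struct MockTxn {
  bool* locks_released;
  explicit MockTxn(bool* flag) : locks_released(flag) {}
  ~MockTxn() { *locks_released = true; }  // locks released on destruction
};

inline bool DemoExplicitRelease() {
  bool released = false;
  std::unique_ptr<MockTxn> txn(new MockTxn(&released));
  // ... use txn ...
  txn.reset();  // explicitly destroy: locks released exactly here
  return released;
}
```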

Contributor

data race? :)


@riversand963
Contributor Author

Thanks @ltamasi for the review!


@ltamasi ltamasi left a comment

Thanks again @riversand963 , this looks awesome!

};

template <typename TValue>
Contributor

Very minor but since these are private templates only used in WriteCommittedTxn, their implementation could be moved to the .cc file. (Also, the same goes for the methods that use these templates like Put / Delete etc.)


namespace ROCKSDB_NAMESPACE {

// TODO: add test cases for enable_indexing=true.
Contributor

I think you can remove these two TODOs now, right?

Contributor Author

Good catch


Summary:
Pull Request resolved: facebook#9629


Reviewed By: ltamasi

Differential Revision: D31822445

fbshipit-source-id: 878ac045279f5896260f41fa49fb9c001a9937bd

@riversand963 riversand963 deleted the export-D31822445 branch March 9, 2022 00:42
facebook-github-bot pushed a commit that referenced this pull request Mar 7, 2024
…mmittedTxn::GetForUpdate (#12369)

Summary:
When PR #9629 introduced user-defined timestamp support for `WriteCommittedTxn`, it added this usage mandate for the API `GetForUpdate` when UDT is enabled: the `do_validate` flag has to be true, and the user should have already called `Transaction::SetReadTimestampForValidation` to set a read timestamp for validation. The rationale behind this mandate is this:
1) With do_validate = true, `GetForUpdate` can verify this relationship: let's denote the user-defined timestamp in the db for the key as `Ts_db` and the read timestamp the user set via `Transaction::SetReadTimestampForValidation` as `Ts_read`. UDT-based validation will only pass if `Ts_db <= Ts_read`.
https://github.com/facebook/rocksdb/blob/5950907a823b99a6ae126ab075995c602d815d7a/utilities/transactions/transaction_util.cc#L141

2) Let's denote the commit timestamp set via `Transaction::SetCommitTimestamp` as `Ts_cmt`. Later, `WriteCommittedTxn::Commit` will only pass if this condition is met: `Ts_read < Ts_cmt`. https://github.com/facebook/rocksdb/blob/5950907a823b99a6ae126ab075995c602d815d7a/utilities/transactions/pessimistic_transaction.cc#L431

Together these two checks ensure `Ts_db < Ts_cmt`, meeting the user-defined timestamp invariant that a newer timestamp should have a newer sequence number.
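The two timestamp checks can be sketched as predicates (the function names are illustrative, not RocksDB APIs). If both pass, then `Ts_db <= Ts_read` and `Ts_read < Ts_cmt`, hence `Ts_db < Ts_cmt`.

```cpp
#include <cstdint>

// Illustrative sketch of the two checks described above; not RocksDB code.
using TxnTimestamp = uint64_t;

// UDT-based validation at GetForUpdate(): the db's timestamp for the key must
// not exceed the read timestamp.
inline bool PassesUdtValidation(TxnTimestamp ts_db, TxnTimestamp ts_read) {
  return ts_db <= ts_read;
}

// Commit-time check: the commit timestamp must be strictly newer than the
// read timestamp.
inline bool CommitAllowed(TxnTimestamp ts_read, TxnTimestamp ts_cmt) {
  return ts_read < ts_cmt;
}
```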

The `do_validate` flag was originally intended to make snapshot-based validation optional. If it's true, `GetForUpdate` checks that no entry is written after the snapshot. If it's false, it skips this snapshot-based validation. In this PR, we are making the UDT-based validation configurable too, based on this flag, instead of mandating it, for the reasons below:
1) In some cases the users themselves can enforce the aforementioned invariant on their side independently, without RocksDB's help. For example, if they are managing a monotonically increasing timestamp, and their transactions are only committed in a single thread, they don't need this UDT-based validation and want to skip it.
2) It also could be expensive or impractical for users to come up with a read timestamp that falls exactly between their commit timestamp and the db's timestamp. For example, in the aforementioned case where a monotonically increasing timestamp is managed, the users would need to access this timestamp both for setting the read timestamp and for setting the commit timestamp. So it's preferable to be able to skip this check too.

Pull Request resolved: #12369

Test Plan: added unit tests

Reviewed By: ltamasi

Differential Revision: D54268920

Pulled By: jowlyzhang

fbshipit-source-id: ca7693796f9bb11f376a2059d91841e51c89435a