
Support user-defined timestamps in write-committed txns #9629

Closed
wants to merge 1 commit

Conversation

riversand963
Contributor

Summary:
Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are
locked upon the first operation that writes the key or declares the intention of writing. For example,
`PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`, and
`PessimisticTransaction::SingleDelete()` will write to or delete a key, while
`PessimisticTransaction::GetForUpdate()` is used by the application to indicate
to RocksDB that the transaction intends to perform a write operation later
in the same transaction.
Pessimistic transactions support two-phase commit (2PC). A transaction can be
`Prepare()`'d and then `Commit()`'ed. The prepare phase is similar to a promise: once
`Prepare()` succeeds, the transaction has acquired the necessary resources to commit.
The resources include locks, persistence of the WAL, etc.
Write-committed transactions are the default pessimistic transaction implementation. In
a RocksDB write-committed transaction, `Prepare()` writes the data to the WAL as a prepare
section. `Commit()` writes a commit marker to the WAL and then writes the data to the
memtables. While writing to the memtables, different keys in the transaction's write batch
are assigned different sequence numbers in ascending order.
Until commit/rollback, the transaction holds locks on the keys so that no other transaction
can write to the same keys. Furthermore, the keys' sequence numbers represent the order
in which they are committed and should be made visible. This makes it convenient to
implement support for user-defined timestamps.
Since column families with and without timestamps can co-exist in the same database,
a transaction may or may not involve timestamps. Based on this observation, we add two
optional members to each `PessimisticTransaction`: `read_timestamp_` and
`commit_timestamp_`. If no key in the transaction's write batch has a timestamp, then
setting these two variables has no effect. For the rest of this commit, we discuss
only the cases in which these two variables are meaningful.

`read_timestamp_` is used mainly for validation, and should be set before the first call to
`GetForUpdate()`; otherwise, the latter will return a non-OK status. `GetForUpdate()` calls
`TryLock()`, which can verify whether another transaction has written the same key between
`read_timestamp_` and this call to `GetForUpdate()`. If another transaction has indeed
written the same key, then validation fails, and RocksDB allows this transaction to
refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()`
with a different timestamp to read, but the result of the read should not be used to
determine data that will be written later.
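The validation described above can be modeled with a toy sketch. All names here (`ValidateKey`, `LastWriteTimestamps`) are illustrative stand-ins, not RocksDB APIs: a key passes validation only if no write newer than `read_timestamp_` has been committed to it; on failure, the transaction may bump `read_timestamp_` and retry.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model of read-timestamp validation; not RocksDB code.
using TxnTimestamp = uint64_t;
// Maps each key to the timestamp of the latest committed write to it.
using LastWriteTimestamps = std::map<std::string, TxnTimestamp>;

// Validation fails if another transaction committed a write to `key` with a
// timestamp newer than the reader's read timestamp.
inline bool ValidateKey(const LastWriteTimestamps& last_writes,
                        const std::string& key, TxnTimestamp read_timestamp) {
  auto it = last_writes.find(key);
  if (it == last_writes.end()) {
    return true;  // key never written, nothing to conflict with
  }
  return it->second <= read_timestamp;  // ok only if the write is not newer
}
```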

`commit_timestamp_` must be set after all writes are finished and before the transaction commits.
This applies to both the 2PC and non-2PC cases. In the 2PC case, it is usually set after
the prepare phase succeeds.

We currently require that the commit timestamp be chosen after all keys are locked. This
means we disallow the `TransactionDB`-level APIs if user-defined timestamps are used
by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`,
`PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`,
etc. will return a non-OK status, because they specify timestamps before locking the keys.
Users are also prompted to use the `Transaction` APIs when they receive the non-OK status.

Differential Revision: D31822445

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31822445


@ltamasi ltamasi left a comment

Thanks a lot for the PR @riversand963 ! LGTM in general, just some questions/comments

Comment on lines 137 to 138
// Otherwise, it means we are just writing to the WAL, and we allow
// timestamps unset for the keys in the write batch.
Contributor

Could you clarify why it is OK to not have the timestamps set if we're only writing to the WAL?

Contributor Author

Sure, will clarify, and I will explain in this comment as well:
in the write-committed prepare phase, we do not need the timestamps in the WAL because these keys have not been inserted into the memtables yet, and the prepare section will be used only for recovery. During recovery, should these keys be inserted into the memtables, we will obtain the timestamp from the commit marker, which is also in the WAL.
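The recovery path described here can be sketched as a toy model (`PreparedBatch`, `CommitMarker`, and `Replay` are illustrative names, not RocksDB types): prepare-section entries carry keys without timestamps, and recovery stamps each key with the timestamp taken from the commit marker when inserting into the memtable.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model (not RocksDB code) of why prepare-section WAL entries need no
// timestamps: the commit marker carries the commit timestamp, and recovery
// stamps keys only when inserting them into the memtable.
using TxnTimestamp = uint64_t;

struct PreparedBatch {
  std::vector<std::pair<std::string, std::string>> puts;  // keys, no timestamps
};

struct CommitMarker {
  TxnTimestamp commit_ts;  // timestamp recorded once, at commit
};

// "Recovery": every key gets its timestamp from the commit marker, not from
// the prepared batch itself.
inline std::map<std::string, std::pair<std::string, TxnTimestamp>> Replay(
    const PreparedBatch& batch, const CommitMarker& marker) {
  std::map<std::string, std::pair<std::string, TxnTimestamp>> memtable;
  for (const auto& kv : batch.puts) {
    memtable[kv.first] = {kv.second, marker.commit_ts};
  }
  return memtable;
}
```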

}
}

if (kMaxTxnTimestamp == read_timestamp_) {
Contributor

Would it make sense to check do_validate here (and below)?

Contributor Author

I may be missing something, but we do perform a check for do_validate. Actually, we require do_validate to be true; otherwise, it's an invalid argument.

Contributor

I'm talking about the case when read_timestamp_ is not set (kMaxTxnTimestamp). The current logic considers that an error even if do_validate is false, right? I'm wondering if that's what we want, or if we would want to make sure read_timestamp_ is set if and only if do_validate is set.

Contributor Author

If the control flow reaches here, it means timestamps are enabled and GetForUpdate() is called. I think in this case, we should just require do_validate to be true AND read_timestamp_ to be set.

Contributor

Sorry, I don't follow... At the point where we check kMaxTxnTimestamp == read_timestamp_, is do_validate guaranteed to be set? If it is, the else if (!do_validate) branch can't be hit, right?

In other words, I was wondering if we would want to do some version of this:

if (do_validate) {
   if (kMaxTxnTimestamp == read_timestamp_) {
      // error
   }
} else {
   if (kMaxTxnTimestamp != read_timestamp_) {
      // error
   }
}

Contributor Author

I see the source of the confusion. I think the logic can be

if (!do_validate) {
  // error, because GetForUpdate() with timestamp enabled must perform validation
} else if (kMaxTxnTimestamp == read_timestamp_) {
  // error
}

Contributor

Thanks! Checking do_validate first definitely makes the intent clearer to me
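The agreed-upon check order can be written as a self-contained sketch (the function and enum names here are illustrative, not the actual RocksDB code): reject `GetForUpdate()` on a timestamp-enabled column family unless validation is requested AND a read timestamp has been set.

```cpp
#include <cstdint>
#include <limits>

// Sketch of the argument check discussed above; names are illustrative.
using TxnTimestamp = uint64_t;
// Sentinel meaning "read timestamp not set", mirroring kMaxTxnTimestamp.
constexpr TxnTimestamp kMaxTxnTimestamp =
    std::numeric_limits<TxnTimestamp>::max();

enum class CheckResult { kOk, kInvalidArgument };

inline CheckResult CheckGetForUpdateArgs(bool do_validate,
                                         TxnTimestamp read_timestamp) {
  if (!do_validate) {
    // GetForUpdate() with timestamps enabled must perform validation.
    return CheckResult::kInvalidArgument;
  } else if (kMaxTxnTimestamp == read_timestamp) {
    // Validation requested, but no read timestamp was set.
    return CheckResult::kInvalidArgument;
  }
  return CheckResult::kOk;
}
```

Checking `do_validate` first, as suggested, keeps the error cases in the order a reader would reason about them.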

value, exclusive, do_validate);
}

Status WriteCommittedTxn::GetForUpdate(const ReadOptions& read_options,
Contributor

Is the type of the output value the only difference between this GetForUpdate and the previous one? If yes, we could eliminate the code duplication by introducing a private helper template

Contributor Author

Will do
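The deduplication pattern suggested here can be illustrated with a self-contained sketch (the types `Lookup` and `PinnableString` are stand-ins, not RocksDB's): two public overloads that differ only in the output value type forward to one private helper template.

```cpp
#include <string>

// Illustrative sketch of the "private helper template" pattern; not RocksDB
// code. PinnableString stands in for a second output type like PinnableSlice.
struct PinnableString {
  std::string data;
};

class Lookup {
 public:
  // Two public overloads differing only in the output type...
  bool Get(const std::string& key, std::string* value) {
    return GetImpl(key, value);
  }
  bool Get(const std::string& key, PinnableString* value) {
    return GetImpl(key, value);
  }

 private:
  // ...forward to one private template holding the shared logic.
  template <typename TValue>
  bool GetImpl(const std::string& key, TValue* value) {
    Assign("value_for_" + key, value);  // only the assignment differs per type
    return true;
  }
  static void Assign(const std::string& src, std::string* dst) { *dst = src; }
  static void Assign(const std::string& src, PinnableString* dst) {
    dst->data = src;
  }
};
```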

Comment on lines 250 to 264
column_family =
    column_family ? column_family : db_impl_->DefaultColumnFamily();
assert(column_family);
const Comparator* const ucmp = column_family->GetComparator();
assert(ucmp);
size_t ts_sz = ucmp->timestamp_size();
if (0 == ts_sz) {
  s = GetBatchForWrite()->Put(column_family, key, value);
} else {
  assert(ts_sz == sizeof(TxnTimestamp));
  if (!IndexingEnabled()) {
    cfs_with_ts_.insert(column_family->GetID());
  }
  s = GetBatchForWrite()->Put(column_family, key, value);
}
Contributor

I'm wondering if we could turn these repeated chunks of code into private helpers as well.

Contributor Author

Will do

Comment on lines +271 to +272
Status SingleDeleteUntracked(ColumnFamilyHandle* column_family,
const Slice& key) override;
Contributor

Would it make sense to have a SingleDeleteUntracked overload taking SliceParts for the sake of completeness?

Contributor Author

The original motivation for this PR is to add timestamp support to existing transaction APIs. I may have violated this myself, and I need to double-check.

Contributor Author

Double-checked that we are not adding new APIs, so I will skip this in this PR.

// indexing_enabled_ is false. If a key is written when indexing_enabled_ is
// true, then the corresponding column family is not added to cfs_with_ts
// even if it enables timestamp.
std::unordered_set<uint32_t> cfs_with_ts_;
Contributor

We could consider renaming this to something like cfs_with_ts_indexing_disabled_ or similar to make its semantics clearer in the code.

Contributor Author

Good point. Will do

void ReinitializeTransaction(
Transaction* txn, const WriteOptions& write_options,
const TransactionOptions& txn_options = TransactionOptions());

virtual Status VerifyCFOptions(const ColumnFamilyOptions& cf_options);

Status FailIfCfEnablesTs(const ColumnFamilyHandle* column_family) const;
Contributor

Could this also be a static helper?

Contributor Author

We could make it static only if we remove the following, since we need to call DefaultColumnFamily(), which is non-static.

Status FailIfCfEnablesTs(const ColumnFamilyHandle* column_family) const {
  column_family = column_family ? column_family : DefaultColumnFamily();
  ...
}

ASSERT_OK(
    pessimistic_txn_db->WriteWithConcurrencyControl(WriteOptions(), &wb2));

auto* txn = db->BeginTransaction(WriteOptions(), TransactionOptions(),
Contributor

Minor but we could immediately pass these transaction pointers to a unique_ptr, which would eliminate the need for manual deletes

Contributor Author

I think RAII is a good idea and should be used in most cases if possible.
For RocksDB transactions, we unlock the keys in the destructor. Therefore, we more often want to control explicitly when a transaction is deleted, especially when there are multiple transactions in the test.
That being said, I am OK with applying the suggested change to tests where we create only one transaction, and adding a comment saying that the locks will be released only when the txn variable goes out of scope.

Contributor Author

I think we can still use smart pointers and explicitly call reset().
There is a lot of legacy code in the transaction tests, but I will enforce this in the new test code added by me.

Contributor

Sure. Just wanted to mention that if needed, we can still explicitly destroy a transaction held by a unique_ptr using reset().
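The point discussed here can be shown in a toy example (`MockTxn` is a hypothetical type, not a RocksDB one): a destructor that releases locks still runs at a precise, explicit moment when the owner calls `unique_ptr::reset()`.

```cpp
#include <memory>

// Hypothetical transaction whose destructor releases locks, modeling
// RocksDB's behavior of unlocking keys when a Transaction is destroyed.
struct MockTxn {
  bool* locks_released;
  explicit MockTxn(bool* flag) : locks_released(flag) {}
  ~MockTxn() { *locks_released = true; }  // locks released on destruction
};

inline bool DemoExplicitRelease() {
  bool released = false;
  std::unique_ptr<MockTxn> txn(new MockTxn(&released));
  // ... use txn ...
  txn.reset();  // explicitly destroy: locks released exactly here
  return released;
}
```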

Contributor

data race? :)


@riversand963
Contributor Author

Thanks @ltamasi for the review!


@ltamasi ltamasi left a comment

Thanks again @riversand963 , this looks awesome!

};

template <typename TValue>
Contributor

Very minor but since these are private templates only used in WriteCommittedTxn, their implementation could be moved to the .cc file. (Also, the same goes for the methods that use these templates like Put / Delete etc.)


namespace ROCKSDB_NAMESPACE {

// TODO: add test cases for enable_indexing=true.
Contributor

I think you can remove these two TODOs now, right?

Contributor Author

Good catch


Summary:
Pull Request resolved: facebook#9629


Reviewed By: ltamasi

Differential Revision: D31822445

fbshipit-source-id: 878ac045279f5896260f41fa49fb9c001a9937bd

@riversand963 riversand963 deleted the export-D31822445 branch March 9, 2022 00:42
facebook-github-bot pushed a commit that referenced this pull request Mar 7, 2024
…mmittedTxn::GetForUpdate (#12369)

Summary:
When PR #9629 introduced user-defined timestamp support for `WriteCommittedTxn`, it added this usage mandate for the API `GetForUpdate` when UDT is enabled: the `do_validate` flag has to be true, and the user should have already called `Transaction::SetReadTimestampForValidation` to set a read timestamp for validation. The rationale behind this mandate is this:
1) With do_validate = true, `GetForUpdate` can verify this relationship: let's denote the user-defined timestamp in the db for the key as `Ts_db` and the read timestamp the user set via `Transaction::SetReadTimestampForValidation` as `Ts_read`. UDT-based validation will only pass if `Ts_db <= Ts_read`.
https://github.com/facebook/rocksdb/blob/5950907a823b99a6ae126ab075995c602d815d7a/utilities/transactions/transaction_util.cc#L141

2) Let's denote the commit timestamp set via `Transaction::SetCommitTimestamp` as `Ts_cmt`. Later, `WriteCommittedTxn::Commit` will only pass if this condition is met: `Ts_read < Ts_cmt`. https://github.com/facebook/rocksdb/blob/5950907a823b99a6ae126ab075995c602d815d7a/utilities/transactions/pessimistic_transaction.cc#L431

Together these two checks ensure `Ts_db < Ts_cmt`, meeting the user-defined timestamp invariant that a newer timestamp should have a newer sequence number.
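The two timestamp checks can be sketched as predicates (the function names are illustrative, not RocksDB APIs). If both pass, then `Ts_db <= Ts_read` and `Ts_read < Ts_cmt`, hence `Ts_db < Ts_cmt`.

```cpp
#include <cstdint>

// Illustrative sketch of the two checks described above; not RocksDB code.
using TxnTimestamp = uint64_t;

// UDT-based validation at GetForUpdate(): the db's timestamp for the key must
// not exceed the read timestamp.
inline bool PassesUdtValidation(TxnTimestamp ts_db, TxnTimestamp ts_read) {
  return ts_db <= ts_read;
}

// Commit-time check: the commit timestamp must be strictly newer than the
// read timestamp.
inline bool CommitAllowed(TxnTimestamp ts_read, TxnTimestamp ts_cmt) {
  return ts_read < ts_cmt;
}
```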

The `do_validate` flag was originally intended to make snapshot-based validation optional. If it's true, `GetForUpdate` checks that no entry is written after the snapshot. If it's false, it skips this snapshot-based validation. In this PR, we are making the UDT-based validation configurable too, based on this flag, instead of mandating it, for the reasons below:
1) In some cases the users themselves can enforce the aforementioned invariant on their side independently, without RocksDB's help. For example, if they are managing a monotonically increasing timestamp, and their transactions are only committed in a single thread, they don't need this UDT-based validation and want to skip it.
2) It also could be expensive or impractical for users to come up with a read timestamp that falls exactly between their commit timestamp and the db's timestamp. For example, in the aforementioned case where a monotonically increasing timestamp is managed, the users would need to access this timestamp both for setting the read timestamp and for setting the commit timestamp. So it's preferable to be able to skip this check too.

Pull Request resolved: #12369

Test Plan: added unit tests

Reviewed By: ltamasi

Differential Revision: D54268920

Pulled By: jowlyzhang

fbshipit-source-id: ca7693796f9bb11f376a2059d91841e51c89435a