Support user-defined timestamps in write-committed txns #9629
Conversation
This pull request was exported from Phabricator. Differential Revision: D31822445
Force-pushed from a25c60a to f98448b
Force-pushed from f98448b to a3ab195
Thanks a lot for the PR @riversand963 ! LGTM in general, just some questions/comments
db/db_impl/db_impl_write.cc (outdated):

```cpp
// Otherwise, it means we are just writing to the WAL, and we allow
// timestamps unset for the keys in the write batch.
```
Could you clarify why it is OK to not have the timestamps set if we're only writing to the WAL?
Sure, will clarify, and I will explain in this comment as well:
In the write-committed prepare phase, we do not need timestamps in the WAL because these keys have not been inserted into the memtables yet; they will be used only for recovery. During recovery, should they need to be inserted into the memtables, we will obtain the timestamp from the commit marker, which is also in the WAL.
```cpp
  }
}

if (kMaxTxnTimestamp == read_timestamp_) {
```
Would it make sense to check `do_validate` here (and below)?
I may be missing something, but we do perform a check for `do_validate`. Actually, we require `do_validate` to be true; otherwise, it's an invalid argument.
I'm talking about the case when `read_timestamp_` is not set (`kMaxTxnTimestamp`). The current logic considers that an error even if `do_validate` is false, right? I'm wondering if that's what we want, or if we would want to make sure `read_timestamp_` is set if and only if `do_validate` is set.
If the control flow reaches here, it means timestamps are enabled and GetForUpdate() was called. I think in this case, we should just require `do_validate` to be true AND `read_timestamp_` to be set.
Sorry, I don't follow... At the point where we check `kMaxTxnTimestamp == read_timestamp_`, is `do_validate` guaranteed to be set? If it is, the `else if (!do_validate)` branch can't be hit, right?
In other words, I was wondering if we would want to do some version of this:

```cpp
if (do_validate) {
  if (kMaxTxnTimestamp == read_timestamp_) {
    // error
  }
} else {
  if (kMaxTxnTimestamp != read_timestamp_) {
    // error
  }
}
```
I think I see the confusion. I think the logic can be:

```cpp
if (!do_validate) {
  // error, because GetForUpdate() with timestamp enabled must perform validation
} else if (kMaxTxnTimestamp == read_timestamp_) {
  // error
}
```
Thanks! Checking `do_validate` first definitely makes the intent clearer to me.
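For reference, the ordering agreed on here can be sketched as a standalone precondition check. The names below (`CheckGetForUpdatePrecondition`, the string return value) are simplified stand-ins for illustration, not RocksDB's actual code:

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical stand-ins for RocksDB's types, for illustration only.
using TxnTimestamp = uint64_t;
constexpr TxnTimestamp kMaxTxnTimestamp =
    std::numeric_limits<TxnTimestamp>::max();

// Reject do_validate == false first, since GetForUpdate() with timestamps
// enabled must perform validation; then reject an unset read timestamp.
std::string CheckGetForUpdatePrecondition(bool do_validate,
                                          TxnTimestamp read_timestamp) {
  if (!do_validate) {
    return "InvalidArgument: do_validate must be true";
  } else if (kMaxTxnTimestamp == read_timestamp) {
    return "InvalidArgument: read_timestamp not set";
  }
  return "OK";
}
```

Checking `do_validate` first makes the intent explicit: validation is mandatory here, and a missing read timestamp is only diagnosed once validation is known to be requested.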
```cpp
                       value, exclusive, do_validate);
}

Status WriteCommittedTxn::GetForUpdate(const ReadOptions& read_options,
```
Is the type of the output value the only difference between this `GetForUpdate` and the previous one? If yes, we could eliminate the code duplication by introducing a private helper template.
Will do
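The suggested de-duplication could look roughly like the sketch below. The class and helpers are toy stand-ins (the real code would forward to the shared locking/validation logic and use `rocksdb::PinnableSlice`); the point is only the pattern of two public overloads delegating to one private template:

```cpp
#include <string>

// Toy stand-in for rocksdb::PinnableSlice, for illustration only.
struct PinnableSlice {
  std::string buf;
  void PinSelf(const std::string& s) { buf = s; }
};

class TxnSketch {
 public:
  // Both public overloads forward to one private template parameterized
  // on the output type.
  bool GetForUpdate(const std::string& key, std::string* value) {
    return GetForUpdateImpl(key, value);
  }
  bool GetForUpdate(const std::string& key, PinnableSlice* pinnable) {
    return GetForUpdateImpl(key, pinnable);
  }

 private:
  template <typename TValue>
  bool GetForUpdateImpl(const std::string& key, TValue* value) {
    // The shared validation / locking logic would live here; only the
    // final assignment differs per output type.
    Assign(value, "value_for_" + key);
    return true;
  }
  static void Assign(std::string* out, const std::string& v) { *out = v; }
  static void Assign(PinnableSlice* out, const std::string& v) {
    out->PinSelf(v);
  }
};
```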
```cpp
column_family =
    column_family ? column_family : db_impl_->DefaultColumnFamily();
assert(column_family);
const Comparator* const ucmp = column_family->GetComparator();
assert(ucmp);
size_t ts_sz = ucmp->timestamp_size();
if (0 == ts_sz) {
  s = GetBatchForWrite()->Put(column_family, key, value);
} else {
  assert(ts_sz == sizeof(TxnTimestamp));
  if (!IndexingEnabled()) {
    cfs_with_ts_.insert(column_family->GetID());
  }
  s = GetBatchForWrite()->Put(column_family, key, value);
}
```
I'm wondering if we could turn these repeated chunks of code into private helpers as well.
Will do
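One possible shape for such a helper, as a toy sketch with hypothetical names (`WriteWithTsCheck`, `CfSketch`), is to take the actual write as a callable and keep the timestamp-size check and `cfs_with_ts_` bookkeeping in one place:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Toy stand-in for a column family handle, for illustration only.
struct CfSketch {
  uint32_t id;
  size_t ts_sz;  // comparator's timestamp size
};

// Runs the shared preamble (timestamp-size check, recording the column
// family when indexing is disabled), then performs the actual write.
template <typename Operation>
bool WriteWithTsCheck(const CfSketch& cf, bool indexing_enabled,
                      std::unordered_set<uint32_t>* cfs_with_ts,
                      Operation&& op) {
  if (cf.ts_sz != 0) {
    if (cf.ts_sz != sizeof(uint64_t)) {
      return false;  // unsupported timestamp size
    }
    if (!indexing_enabled) {
      cfs_with_ts->insert(cf.id);
    }
  }
  return op();  // the actual Put/Delete/SingleDelete on the write batch
}
```

Each of `Put`, `Delete`, `SingleDelete`, etc. would then differ only in the lambda passed in.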
```cpp
Status SingleDeleteUntracked(ColumnFamilyHandle* column_family,
                             const Slice& key) override;
```
Would it make sense to have a `SingleDeleteUntracked` overload taking `SliceParts` for the sake of completeness?
The original motivation for this PR is to add timestamp support to the existing transaction APIs. I may have violated this myself, and I need to double-check.
Double-checked that we are not adding new APIs, so I will skip this in this PR.
```cpp
// indexing_enabled_ is false. If a key is written when indexing_enabled_ is
// true, then the corresponding column family is not added to cfs_with_ts
// even if it enables timestamp.
std::unordered_set<uint32_t> cfs_with_ts_;
```
We could consider renaming this to something like `cfs_with_ts_indexing_disabled_` or similar to make its semantics clearer in the code.
Good point. Will do
```cpp
void ReinitializeTransaction(
    Transaction* txn, const WriteOptions& write_options,
    const TransactionOptions& txn_options = TransactionOptions());

virtual Status VerifyCFOptions(const ColumnFamilyOptions& cf_options);

Status FailIfCfEnablesTs(const ColumnFamilyHandle* column_family) const;
```
Could this also be a `static` helper?
We could, but only if we removed the following, since we need to call `DefaultColumnFamily()`, which is non-static:

```cpp
Status FailIfCfEnablesTs(const ColumnFamilyHandle* column_family) const {
  column_family = column_family ? column_family : DefaultColumnFamily();
  ...
}
```
```cpp
ASSERT_OK(
    pessimistic_txn_db->WriteWithConcurrencyControl(WriteOptions(), &wb2));

auto* txn = db->BeginTransaction(WriteOptions(), TransactionOptions(),
```
Minor, but we could immediately pass these transaction pointers to a `unique_ptr`, which would eliminate the need for manual `delete`s.
I think RAII is a good idea and should be used in most cases if possible.
For RocksDB transactions, we unlock the keys in the destructor. Therefore, we often want to control explicitly when a transaction is deleted, especially when there are multiple transactions in the test.
That being said, I am OK with applying the suggested change to tests where we create only one transaction, and adding a comment saying that the locks will be released only when the txn variable goes out of scope.
I think we can still use smart pointers and explicitly call `reset()`.
There is a lot of legacy code in the transaction tests, but I will enforce this in the new test code added by me.
Sure. Just wanted to mention that, if needed, we can still explicitly destroy a transaction held by a `unique_ptr` using `reset()`.
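A minimal illustration of the `reset()` idea, using a toy transaction type whose destructor stands in for lock release:

```cpp
#include <memory>

// Toy transaction whose destructor releases its "locks", mirroring the
// fact that RocksDB pessimistic transactions unlock keys on destruction.
struct ToyTxn {
  bool* released;
  explicit ToyTxn(bool* r) : released(r) {}
  ~ToyTxn() { *released = true; }  // locks released here
};

// With unique_ptr, the point of unlocking can still be controlled
// explicitly via reset(), without any manual delete.
inline bool DemoExplicitReset() {
  bool released = false;
  std::unique_ptr<ToyTxn> txn(new ToyTxn(&released));
  // ... use txn ...
  txn.reset();  // destroys the transaction (and releases locks) right here
  return released;
}
```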
data race? :)
Force-pushed from a3ab195 to de5abce
Thanks @ltamasi for the review!
Force-pushed from de5abce to 15f761c
Thanks again @riversand963 , this looks awesome!
```cpp
};

template <typename TValue>
```
Very minor, but since these are private templates only used in `WriteCommittedTxn`, their implementation could be moved to the .cc file. (The same goes for the methods that use these templates, like `Put`/`Delete` etc.)
```cpp
namespace ROCKSDB_NAMESPACE {

// TODO: add test cases for enable_indexing=true.
```
I think you can remove these two TODOs now, right?
Good catch
Force-pushed from 15f761c to 47baa4d
Summary:
Pull Request resolved: facebook#9629

Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are locked upon the first operation that writes the key or has the intention of writing. For example, `PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`, and `PessimisticTransaction::SingleDelete()` will write to or delete a key, while `PessimisticTransaction::GetForUpdate()` is used by the application to indicate to RocksDB that the transaction has the intention of performing a write operation later in the same transaction.

Pessimistic transactions support two-phase commit (2PC). A transaction can be `Prepare()`'d and then `Commit()`'ed. The prepare phase is similar to a promise: once `Prepare()` succeeds, the transaction has acquired the necessary resources to commit. The resources include locks, persistence of the WAL, etc.

Write-committed transactions are the default pessimistic transaction implementation. In a RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare section. `Commit()` will write a commit marker to the WAL and then write data to the memtables. While writing to the memtables, different keys in the transaction's write batch will be assigned different sequence numbers in ascending order. Until commit/rollback, the transaction holds locks on the keys so that no other transaction can write to the same keys. Furthermore, the keys' sequence numbers represent the order in which they are committed and should be made visible. This makes it convenient to implement support for user-defined timestamps.

Since column families with and without timestamps can co-exist in the same database, a transaction may or may not involve timestamps. Based on this observation, we add two optional members to each `PessimisticTransaction`: `read_timestamp_` and `commit_timestamp_`. If no key in the transaction's write batch has a timestamp, then setting these two variables does not have any effect. For the rest of this commit, we discuss only the cases when these two variables are meaningful.

`read_timestamp_` is used mainly for validation, and should be set before the first call to `GetForUpdate()`. Otherwise, the latter will return a non-ok status. `GetForUpdate()` calls `TryLock()`, which can verify whether another transaction has written the same key between `read_timestamp_` and this call to `GetForUpdate()`. If another transaction has indeed written the same key, then validation fails, and RocksDB allows this transaction to refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()` with a different timestamp to read, but the result of the read should not be used to determine data that will be written later.

`commit_timestamp_` must be set after finishing writing and before transaction commit. This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after the prepare phase succeeds.

We currently require that the commit timestamp be chosen after all keys are locked. This means we disallow the `TransactionDB`-level APIs if a user-defined timestamp is used by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`, `PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`, etc. will return a non-ok status because they specify timestamps before locking the keys. Users are also prompted to use the `Transaction` APIs when they receive the non-ok status.

Reviewed By: ltamasi

Differential Revision: D31822445

fbshipit-source-id: 878ac045279f5896260f41fa49fb9c001a9937bd
Force-pushed from 47baa4d to c26a662
…mmittedTxn::GetForUpdate (#12369)

Summary:
When PR #9629 introduced user-defined timestamp support for `WriteCommittedTxn`, it added this usage mandate for the `GetForUpdate` API when UDT is enabled: the `do_validate` flag has to be true, and the user should have already called `Transaction::SetReadTimestampForValidation` to set a read timestamp for validation. The rationale behind this mandate is as follows:

1) With do_validate = true, `GetForUpdate` can verify this relationship: let's denote the user-defined timestamp in the db for the key as `Ts_db` and the read timestamp the user set via `Transaction::SetReadTimestampForValidation` as `Ts_read`. UDT-based validation will only pass if `Ts_db <= Ts_read`. https://github.com/facebook/rocksdb/blob/5950907a823b99a6ae126ab075995c602d815d7a/utilities/transactions/transaction_util.cc#L141

2) Let's denote the commit timestamp set via `Transaction::SetCommitTimestamp` as `Ts_cmt`. Later, `WriteCommittedTxn::Commit` will only pass if this condition is met: `Ts_read < Ts_cmt`. https://github.com/facebook/rocksdb/blob/5950907a823b99a6ae126ab075995c602d815d7a/utilities/transactions/pessimistic_transaction.cc#L431

Together, these two checks ensure `Ts_db < Ts_cmt`, meeting the user-defined timestamp invariant that a newer timestamp should have a newer sequence number.

The `do_validate` flag was originally intended to make snapshot-based validation optional. If it's true, `GetForUpdate` checks that no entry was written after the snapshot. If it's false, it skips this snapshot-based validation. In this PR, we are making the UDT-based validation configurable based on this flag too, instead of mandating it, for the reasons below:

1) In some cases, users can enforce the aforementioned invariant on their side independently, without RocksDB's help; for example, if they are managing a monotonically increasing timestamp, and their transactions are only committed in a single thread. So they don't need this UDT-based validation and want to skip it.

2) It also could be expensive or impractical for users to come up with a read timestamp that is exactly in between their commit timestamp and the db's timestamp. For example, in the aforementioned case where a monotonically increasing timestamp is managed, users would need to access this timestamp both for setting the read timestamp and for setting the commit timestamp. So it's preferable to skip this check too.

Pull Request resolved: #12369

Test Plan: added unit tests

Reviewed By: ltamasi

Differential Revision: D54268920

Pulled By: jowlyzhang

fbshipit-source-id: ca7693796f9bb11f376a2059d91841e51c89435a
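The two checks described above combine into the invariant `Ts_db <= Ts_read < Ts_cmt`. A minimal sketch of the comparisons (plain standalone functions, not RocksDB's actual code):

```cpp
#include <cstdint>

using TxnTimestamp = uint64_t;

// UDT-based validation in GetForUpdate passes only if the key's timestamp
// in the db does not exceed the transaction's read timestamp.
inline bool UdtValidationPasses(TxnTimestamp ts_db, TxnTimestamp ts_read) {
  return ts_db <= ts_read;
}

// The commit-time check passes only if the read timestamp is strictly
// below the commit timestamp.
inline bool CommitCheckPasses(TxnTimestamp ts_read, TxnTimestamp ts_cmt) {
  return ts_read < ts_cmt;
}
```

When both return true, `ts_db <= ts_read < ts_cmt` holds, so the committed write carries a strictly newer timestamp than any version it validated against.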