
Update WriteBatch::AssignTimestamp API #9205

Closed
riversand963 wants to merge 1 commit

Conversation

@riversand963 (Contributor)

Summary:
Update the WriteBatch::AssignTimestamp() APIs so that they take an additional argument: a function object called `checker` that encapsulates the user-specified logic for checking timestamp sizes.

WriteBatch is a building block used by multiple other RocksDB components, each of which may track timestamp information in different data structures. For example, a transaction can either write to a `WriteBatchWithIndex` (a `WriteBatch` with an index) or, if `Transaction::DisableIndexing()` is called, write directly to a raw `WriteBatch`. `WriteBatchWithIndex` keeps a mapping from column family ID to comparator, and the transaction needs to keep similar information for the raw `WriteBatch` if the user calls `Transaction::DisableIndexing()` (which can happen dynamically), so that the size of each timestamp is known later. The bookkeeping info maintained by `WriteBatchWithIndex` and `Transaction` should not overlap. When we later call `WriteBatch::AssignTimestamp()`, we need to use these data structures to guarantee that we do not accidentally assign timestamps to keys from column families that have timestamps disabled.

Differential Revision: D31735186
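For illustration, a rough sketch of how the extended API might be called, per the description above. The lambda, its cf-to-size policy, and the exact method signature are assumptions for this sketch (the patch uses a template parameter with a default checker type), not the verbatim API:

```cpp
#include <cstddef>
#include <cstdint>

#include "rocksdb/slice.h"
#include "rocksdb/status.h"
#include "rocksdb/write_batch.h"

using ROCKSDB_NAMESPACE::Slice;
using ROCKSDB_NAMESPACE::Status;
using ROCKSDB_NAMESPACE::WriteBatch;

// Hypothetical call site: assign `ts` to all keys in `wb`, using a
// caller-supplied checker to validate timestamp sizes per column family.
Status AssignWithCheck(WriteBatch* wb, const Slice& ts) {
  auto checker = [](uint32_t cf_id, size_t& ts_sz) -> Status {
    // Assumed policy for this sketch: cf 0 uses 8-byte timestamps,
    // every other column family has timestamps disabled.
    const size_t expected = (cf_id == 0) ? 8 : 0;
    if (expected == 0) {
      ts_sz = 0;  // timestamps disabled: skip keys of this column family
      return Status::OK();
    }
    return ts_sz == expected
               ? Status::OK()
               : Status::InvalidArgument("timestamp size mismatch");
  };
  return wb->AssignTimestamp(ts, checker);
}
```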

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D31735186

@riversand963 changed the title from "Update WriteBatch::AssignTimestamp() and Add …" to "Update WriteBatch::AssignTimestamp API" on Nov 23, 2021

@ltamasi (Contributor) left a comment

Thanks for the patch, @riversand963!

[Inline review threads on db/write_batch.cc, db/write_batch_internal.h, and db/write_batch_test.cc; all resolved]

@riversand963 (Contributor, Author)

Thanks @ltamasi for the review!

@ltamasi (Contributor) left a comment

LGTM, I just have some minor comments. Thanks, @riversand963!

[Inline review threads on db/write_batch_internal.h, db/write_batch_test.cc, and include/rocksdb/write_batch.h; all resolved]

Summary (squashed commit message; the body duplicates the PR description above):

Pull Request resolved: facebook#9205

Reviewed By: ltamasi

Differential Revision: D31735186

fbshipit-source-id: 7e8dd4af5dcb85e1368e98b10b222a241c20f91b

@riversand963 (Contributor, Author)

The Linter error is a warning about memcpy.

@riversand963 deleted the export-D31735186 branch on December 1, 2021 at 06:33
@pdillinger (Contributor) left a comment

This seems like a very awkward and ad-hoc addition. Why not have the column families know their own timestamp size to do their own checking, rather than requiring every user to awkwardly opt in to checking? Then any read or write op can use the same info to validate timestamp sizes.


// Experimental.
// Assign timestamp to write batch.
Contributor:

Do you mean overwrite the timestamp of all existing entries in the write batch?

@riversand963 (Author):

Not all entries. If a key belongs to a column family that disables timestamps, that key is skipped. Since all keys in the write batch that have timestamps enabled share the same timestamp, I think it's OK to say "assign a timestamp to the write batch".

Contributor:

> Since all keys in the write batch that have timestamps enabled share the same timestamp

Where does this come from? The API documentation for WriteBatch::Put suggests you can assign a timestamp per key.

Regarding that: the API doc for Status Put(ColumnFamilyHandle* column_family, const Slice& key, const Slice& value) suggests it supports timestamps, but I can find no test for that.

From the API:

> as long as key points to a contiguous buffer with timestamp appended after user key.

The way this is phrased suggests to me that space for the timestamp has to be part of the buffer pointed to by the Slice, but beyond the bounds (size) of the Slice. (To match DB::Put, the key Slice size does not include the timestamp.)

Also:

> as long as the timestamp is the last Slice in (SliceParts) key

I don't see such a limitation in the implementation, and it seems to break the intended flexibility / encapsulation provided by SliceParts: the user of SliceParts should treat it logically as one contiguous data buffer ("A set of Slices that are virtually concatenated together.").
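A minimal sketch of the buffer layout under this reading; the key contents and the 8-byte timestamp size are illustrative assumptions, not anything prescribed by the API:

```cpp
#include <string>

#include "rocksdb/slice.h"

using ROCKSDB_NAMESPACE::Slice;

int main() {
  const std::string user_key = "foo";
  const std::string ts(8, '\0');  // assume an 8-byte timestamp placeholder

  // One contiguous buffer: user key followed by the timestamp bytes.
  std::string buf = user_key + ts;

  // The key Slice covers only the user key; the timestamp lies in the same
  // buffer but beyond the Slice's bounds, matching the DB::Put convention
  // described above.
  Slice key(buf.data(), user_key.size());

  // key.size() == 3, while key.data()[3..10] hold the timestamp bytes.
  return key.size() == user_key.size() ? 0 : 1;
}
```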

@riversand963 (Author):

We are talking about AssignTimestamp(const Slice& ts), not Put. The result of this API is that all keys with timestamps enabled will have the same timestamp.

Contributor:

> The result of this API is that all keys with timestamps enabled will have the same timestamp.

What about keys added after AssignTimestamp? Your description doesn't make it clear. If the data model were clearer to the API user, this might be obvious, but the data model with timestamps is not clear from the API comments. The name "AssignTimestamp" alone (not to mention "Assign timestamp to write batch") suggests a data model in which a timestamp is tracked separately from individual entries and applied to them at Write time. I had to look at the implementations to understand the contract for AssignTimestamp(s) (and Put).

@riversand963 (Author):

Keys that do not exist in the WB when the AssignTimestamp API is called will obviously not have their timestamp set to the value. We can make that clear in the API comments.
Since WriteBatch::Handler is public, users can always define their own handler, call Iterate(), and then insert new keys after that. Therefore, your concern applies to the entire WriteBatch::Handler, and I do not have a good way of preventing that.
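To illustrate the point about the public Handler, a sketch of a user-defined handler driven by Iterate(). The class name and counting logic are made up for this example; only a subset of Handler's overridable methods is shown:

```cpp
#include <cstdint>

#include "rocksdb/slice.h"
#include "rocksdb/status.h"
#include "rocksdb/write_batch.h"

using ROCKSDB_NAMESPACE::Slice;
using ROCKSDB_NAMESPACE::Status;
using ROCKSDB_NAMESPACE::WriteBatch;

// A user-defined handler: Iterate() calls back into it once per entry in
// the batch. Nothing stops the caller from adding new keys to the batch
// after iterating, which is the concern discussed above.
class CountingHandler : public WriteBatch::Handler {
 public:
  Status PutCF(uint32_t /*cf_id*/, const Slice& /*key*/,
               const Slice& /*value*/) override {
    ++num_puts_;  // illustrative: just count Put entries
    return Status::OK();
  }
  int num_puts() const { return num_puts_; }

 private:
  int num_puts_ = 0;
};

// Usage sketch: CountingHandler handler; wb.Iterate(&handler);
```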

#include <atomic>
#include <functional>
Contributor:

Is this needed if not using std::function? (Would std::function be better?)

@riversand963 (Author):

I can remove #include <functional>.

// This requires that all keys, if enable timestamp, (possibly from multiple
// column families) in the write batch have timestamps of the same format.
// checker: callable object to check the timestamp sizes of column families.
// User can call checker(uint32_t cf, size_t& ts_sz) which does the
Contributor:

> User can call

This phrasing suggests the user is the one calling the function (of course someone can call their own function!), when what it really means is that the implementation expects this behavior from the provided function.

@riversand963 (Author):

Yes, it's expected from the user if they opt in.

// 2. if cf's timestamp size is 0, then set ts_sz to 0 and return OK.
// 3. otherwise, compare ts_sz with cf's timestamp size and return
// Status::InvalidArgument() if different.
template <typename Checker = TimestampChecker>
Contributor:

Also, the Checker aspect of this API is effectively for internal use only, because users would have to provide their own implementation for a non-default template instantiation.
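For reference, a standalone checker following the two numbered cases visible in the excerpt might look like the sketch below. The map type and the handling of unknown column families (the excerpt's case 1 is cut off above) are assumptions of this sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

#include "rocksdb/status.h"

using ROCKSDB_NAMESPACE::Status;

// Hypothetical per-column-family timestamp sizes, maintained by the caller
// (e.g. WriteBatchWithIndex or Transaction in the discussion above).
using TsSizeMap = std::unordered_map<uint32_t, size_t>;

Status CheckTimestampSize(const TsSizeMap& sizes, uint32_t cf, size_t& ts_sz) {
  auto it = sizes.find(cf);
  if (it == sizes.end()) {
    // Unknown column family: rejecting here is an assumption of this sketch.
    return Status::InvalidArgument("unknown column family");
  }
  if (it->second == 0) {
    // Case 2: this column family has timestamps disabled.
    ts_sz = 0;
    return Status::OK();
  }
  // Case 3: compare against the column family's configured timestamp size.
  return ts_sz == it->second
             ? Status::OK()
             : Status::InvalidArgument("timestamp size mismatch");
}
```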

@riversand963 (Contributor, Author)

@pdillinger

> Why not have the column families know their own timestamp size

WriteBatch itself is a simple data structure that holds pointers to neither the DB nor any ColumnFamily. To be able to perform the desired check, you either need to store extra information in each WriteBatch as members, or pass extra arguments via the API.

If you choose to store extra information in each WriteBatch, then you will waste space whenever AssignTimestamp() is called from within WriteBatchWithIndex or Transaction: WriteBatchWithIndex already has a map from cf ID to comparators that can readily be used for this check. One more tricky thing: indexing can be enabled/disabled dynamically for each WriteBatchWithIndex.

I have thought about storing just the db pointer in each write batch, but obtaining a column family from the db requires the db mutex, and there is no guarantee the column family still exists. Therefore, more checking would be needed.

@riversand963 (Contributor, Author)

Also keep in mind that we cannot assume the number of column families is small or that the space overhead is minimal.

@pdillinger (Contributor)

Why not either:

  • Pass in to AssignTimestamp a set of ColumnFamilyHandles you want to apply the timestamp to. This is a simpler and more flexible interface. For checking purposes (optional), you can record in the WriteBatch the implied timestamp size for each involved column family, for checking at Write time. Trusting the user to provide authoritative information for input checking, without even checking it against authoritative information, is a bad design IMHO.
  • Make WriteBatch track a timestamp per column family, rather than per entry. Associate entries with timestamps at Write time (or applicable WBWI read time). Is "all keys in the write batch, if enable timestamp, share the same timestamp" the data model you wanted, or is that just the intended functionality of AssignTimestamp?

By the way, I had to look at the implementation of AssignTimestamps to understand the significance of the vector index. I believe it corresponds to the order of operations added to the WriteBatch, which is an awkward way of assigning timestamps. (If you wanted specific timestamps for each entry, why not just assign them as they are added?)

@pdillinger (Contributor)

> the implied timestamp size for each involved column family, for checking at Write time

And/or checking on subsequent AssignTimestamp.

@riversand963 (Contributor, Author)

> Pass in to AssignTimestamp a set of ColumnFamilyHandles you want to apply the timestamp to

This is suboptimal. Some users, e.g. WriteBatchWithIndex, already track a cf->comparator mapping. Why do we need to consume extra space and CPU to build another set of ColumnFamilyHandles? Consider another case: a transaction tracks the cf->comparator mapping for certain writes, but writes to the raw WriteBatch for others.

> you can record in the WriteBatch the implied timestamp size for each involved column family, for checking at Write time

Same. If many column families share the same timestamp size, this can waste space. I do not need a map from cf->timestamp size; I just need an (unordered) set.

There is a plan to improve the WriteBatch::Put(), Delete(), etc. APIs as part of #8946 (TBD). After that, or as part of it, we can improve the documentation.

@pdillinger (Contributor)

Transaction can use internal APIs with fewer safeguards against user error. It looks like this functionality could have been added to WriteBatchInternal.

> Some users, e.g. WriteBatchWithIndex, already track a cf->comparator mapping.

Why? NewIterator and other functions take a ColumnFamilyHandle, so you have access to the comparator whenever needed without saving it in the WBWI, right? Even with all entries in one skip list, you do not need the comparator for column family n + 1 to know that an entry in column family n comes before it.

But to avoid getting distracted by design critiques and performance concerns, let's re-focus on a key point. From HISTORY.md:

> Public API change
>
>   • Extend WriteBatch::AssignTimestamp and AssignTimestamps API so that both functions can accept an optional checker argument that performs additional checking on timestamp sizes.

This is false. The feature only works when including the internal header db/write_batch_internal.h.

@riversand963 (Contributor, Author)

> Why? NewIterator and other functions take a ColumnFamilyHandle, so you have access to the comparator whenever needed without saving it in the WBWI, right?

Not really.

  • First, and unrelated to this PR, WBWI already tracks the mapping from cf_id to comparator; please refer to WriteBatchWithIndex::Rep::comparator, which is of type WriteBatchEntryComparator defined in write_batch_with_index_internal.h.
  • Second, AssignTimestamp() does not take a ColumnFamilyHandle as an argument. When we call AssignTimestamp() to assign timestamps to the keys in a write batch, we must pass in timestamp sizes or comparators for multiple column families. It's unnecessarily restrictive and inefficient to require the user to construct a specific type of collection, since they (Transaction, other external users of RocksDB) may already track data structures that can be used for the checking. I will challenge proposals that require the user to create object(s) containing specific collection type(s) in order to call the AssignTimestamp() API, and I will also challenge proposals that use specific data types to store such information within WriteBatch, even if it's optional and controlled by a flag. I will also challenge proposals that advocate adding overloaded versions of AssignTimestamp() for different argument types.

> This is false. The feature only works when including the internal header db/write_batch_internal.h.

This is, and should be, a backward-compatible public API change. We should allow users to update/set timestamps of keys in a write batch, which is already a public API. In many cases, you do NOT know the timestamp until you decide to commit the write batch. Given the current implementation, I believe the checking should be user-defined too.

If we are accurate and explicit about the behavior of Checker, e.g. input/output and thread-safety, it will be an improvement over the existing AssignTimestamp() API, which is a must for the reasons mentioned above.

@riversand963 (Contributor, Author)

Thinking about this more, to bridge the gap between the current code and @pdillinger's review comments: one solution may be to remove the AssignTimestamp() APIs completely. This would mean users can only assign timestamps via an overloaded version of the DB::Write(WriteBatch*, ts) API, or via WriteBatch::Put(), Delete(), etc., which may not be possible in cases where the timestamp is known only after the data has been written to the write batch. Consequently, we would perform the check in the write thread or while holding the db mutex. If the number of keys to assign timestamps to in the write batch is small, this should not be a problem. It may become one with large transactions, because we would spend more time in single-threaded execution. Furthermore, users would have access to the updated WriteBatch only after DB::Write() returns. Not sure whether this is OK, but probably yes.

@riversand963 (Contributor, Author)

@pdillinger

> By the way, I had to look at the implementation of AssignTimestamps to understand the significance of the vector index. I believe it corresponds to the order of operations added to the WriteBatch, which is an awkward way of assigning timestamps.

I am also not a big fan of that, but it's also used to index into WriteBatch::prot_info_. If you think it's confusing, how do you feel about removing the WriteBatch::AssignTimestamps() API? I added this API after a conversation with a potential customer, but I do not have a confirmed use case for it yet. This way, we would only have the AssignTimestamp() API. Since its argument is not a collection type, the vector index should not confuse API users.

@pdillinger (Contributor)

> I am also not a big fan of that, but it's also used to index into WriteBatch::prot_info_.

I don't think that helps the public API user. prot_info_ is private, and ProtectionInfo, where the indexing occurs, is not defined in the public API.

> If you think it's confusing, how do you feel about removing the WriteBatch::AssignTimestamps() API?

I think that would be better.

@pdillinger (Contributor) commented Dec 13, 2021

> If we are accurate and explicit about the behavior of Checker, e.g. input/output and thread-safety, it will be an improvement over the existing AssignTimestamp() API, which is a must for the reasons mentioned above.

Even with #9278 it's still an awkward API, with unclear/inconsistent documentation. Why does it say "user can call checker" rather than "AssignTimestamp calls checker"? What does it mean to "assign timestamp to write batch" if some entries are in CFs without timestamps? And (repeating myself) does the timestamp apply to future entries added to the write batch? And if it's called "assign" instead of "overwrite", does that mean the entries did not, or should not, have timestamps before the call?

"ret: OK if assignment succeeds" and "if cf's timestamp size is 0, then set ts_sz to 0 and return OK" seem inconsistent. If there is no timestamp for an entry's CF, there is no timestamp assignment for that entry.

If you are intending that checker could be rather general/stable, for use across many WriteBatches and potentially other APIs to be added to WriteBatch, why is it tied to the particular semantics of AssignTimestamp/AssignTimestamps (which, by the way, happen to differ from each other in their handling of the no-timestamp case)? Why not just have the user provide a function that takes cf_id and returns the expected timestamp size (or SIZE_MAX for "I don't know; fail")? What's wrong with that simpler callback? And should we have the user provide this function at WriteBatch construction time rather than at AssignTimestamp time?

If you are intending the user to exercise more flexibility in the implementation of their checker (e.g., I don't know, skipping some CFs with timestamps?), why is there no flexibility in the specified behavior of checker?
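To make the proposed alternative concrete, here is a minimal sketch of that simpler callback. The type alias, the helper around it, and the SIZE_MAX convention are hypothetical illustrations of the suggestion above, not an existing RocksDB API:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// The proposed contract: map a column family id to its expected timestamp
// size, returning SIZE_MAX for "I don't know; fail".
using TsSizeFunc = std::function<size_t(uint32_t /* cf_id */)>;

// Hypothetical validation step inside AssignTimestamp: the user-supplied
// function answers one question, and the assignment logic owns the
// semantics (skip, assign, or fail).
bool TimestampSizeOk(const TsSizeFunc& ts_size_of, uint32_t cf_id,
                     size_t actual_ts_sz) {
  const size_t expected = ts_size_of(cf_id);
  if (expected == SIZE_MAX) {
    return false;  // unknown column family: fail, per the proposal
  }
  // expected == 0 means timestamps are disabled for this column family,
  // so there is nothing to check against.
  return expected == 0 || expected == actual_ts_sz;
}
```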

@riversand963 (Contributor, Author)

@pdillinger
Happening in #8946:
I removed the WriteBatch::AssignTimestamps() API and renamed the remaining API UpdateTimestamp(). Doing so simplifies the logic of timestamp checking, and as a result I was able to simplify the contract of checker as suggested.
I also improved the code comments to address the issues raised earlier.
I am thinking of also doing the timestamp update in PreprocessWrite().
