
Update WriteBatch::AssignTimestamp API #9205

Closed
riversand963 wants to merge 1 commit

Conversation

@riversand963 (Contributor)

Summary:
Update the WriteBatch::AssignTimestamp() APIs so that they take an additional argument: a function object called `checker` that encapsulates the user-specified logic for checking timestamp sizes.

WriteBatch is a building block used by multiple other RocksDB components, each of which may track timestamp information in different data structures. For example, a transaction can either write to a `WriteBatchWithIndex` (a `WriteBatch` with an index) or, if `Transaction::DisableIndexing()` is called, write directly to a raw `WriteBatch`. `WriteBatchWithIndex` keeps a mapping from column family ID to comparator, and the transaction needs to keep similar information for the raw `WriteBatch` if the user calls `Transaction::DisableIndexing()` (which can happen dynamically), so that the size of each timestamp is known later. The bookkeeping info maintained by `WriteBatchWithIndex` and `Transaction` should not overlap. When we later call `WriteBatch::AssignTimestamp()`, we need to use these data structures to guarantee that we do not accidentally assign timestamps to keys from column families that have timestamps disabled.

Differential Revision: D31735186
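For illustration, a rough sketch of how the extended API might be called, per the description above. The lambda, its cf-to-size policy, and the exact method signature are assumptions for this sketch (the patch uses a template parameter with a default checker type), not the verbatim API:

```cpp
#include <cstddef>
#include <cstdint>

#include "rocksdb/slice.h"
#include "rocksdb/status.h"
#include "rocksdb/write_batch.h"

using ROCKSDB_NAMESPACE::Slice;
using ROCKSDB_NAMESPACE::Status;
using ROCKSDB_NAMESPACE::WriteBatch;

// Hypothetical call site: assign `ts` to all keys in `wb`, using a
// caller-supplied checker to validate timestamp sizes per column family.
Status AssignWithCheck(WriteBatch* wb, const Slice& ts) {
  auto checker = [](uint32_t cf_id, size_t& ts_sz) -> Status {
    // Assumed policy for this sketch: cf 0 uses 8-byte timestamps,
    // every other column family has timestamps disabled.
    const size_t expected = (cf_id == 0) ? 8 : 0;
    if (expected == 0) {
      ts_sz = 0;  // timestamps disabled: skip keys of this column family
      return Status::OK();
    }
    return ts_sz == expected
               ? Status::OK()
               : Status::InvalidArgument("timestamp size mismatch");
  };
  return wb->AssignTimestamp(ts, checker);
}
```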

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D31735186

@riversand963 changed the title from "Update WriteBatch::AssignTimestamp() and Add …" to "Update WriteBatch::AssignTimestamp API" on Nov 23, 2021

@ltamasi (Contributor) left a comment

Thanks for the patch, @riversand963!

[Inline review threads on db/write_batch.cc, db/write_batch_internal.h, and db/write_batch_test.cc; all resolved]

@riversand963 (Contributor, Author)

Thanks @ltamasi for the review!

@ltamasi (Contributor) left a comment

LGTM, I just have some minor comments. Thanks, @riversand963!

[Inline review threads on db/write_batch_internal.h, db/write_batch_test.cc, and include/rocksdb/write_batch.h; all resolved]

Summary (squashed commit message; the body duplicates the PR description above):

Pull Request resolved: facebook#9205

Reviewed By: ltamasi

Differential Revision: D31735186

fbshipit-source-id: 7e8dd4af5dcb85e1368e98b10b222a241c20f91b

@riversand963 (Contributor, Author)

The Linter error is a warning about memcpy.

@riversand963 deleted the export-D31735186 branch on December 1, 2021 at 06:33
@pdillinger (Contributor) left a comment

This seems like a very awkward and ad-hoc addition. Why not have the column families know their own timestamp size to do their own checking, rather than requiring every user to awkwardly opt in to checking? Then any read or write op can use the same info to validate timestamp sizes.


// Experimental.
// Assign timestamp to write batch.
Contributor:

Do you mean overwrite the timestamp of all existing entries in the write batch?

@riversand963 (Author):

Not all entries. If a key belongs to a column family that disables timestamps, that key is skipped. Since all keys in the write batch that have timestamps enabled share the same timestamp, I think it's OK to say "assign a timestamp to the write batch".

Contributor:

> Since all keys in the write batch that have timestamps enabled share the same timestamp

Where does this come from? The API documentation for WriteBatch::Put suggests you can assign a timestamp per key.

Regarding that: the API doc for Status Put(ColumnFamilyHandle* column_family, const Slice& key, const Slice& value) suggests it supports timestamps, but I can find no test for that.

From the API:

> as long as key points to a contiguous buffer with timestamp appended after user key.

The way this is phrased suggests to me that space for the timestamp has to be part of the buffer pointed to by the Slice, but beyond the bounds (size) of the Slice. (To match DB::Put, the key Slice size does not include the timestamp.)

Also:

> as long as the timestamp is the last Slice in (SliceParts) key

I don't see such a limitation in the implementation, and it seems to break the intended flexibility / encapsulation provided by SliceParts: the user of SliceParts should treat it logically as one contiguous data buffer ("A set of Slices that are virtually concatenated together.").
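A minimal sketch of the buffer layout under this reading; the key contents and the 8-byte timestamp size are illustrative assumptions, not anything prescribed by the API:

```cpp
#include <string>

#include "rocksdb/slice.h"

using ROCKSDB_NAMESPACE::Slice;

int main() {
  const std::string user_key = "foo";
  const std::string ts(8, '\0');  // assume an 8-byte timestamp placeholder

  // One contiguous buffer: user key followed by the timestamp bytes.
  std::string buf = user_key + ts;

  // The key Slice covers only the user key; the timestamp lies in the same
  // buffer but beyond the Slice's bounds, matching the DB::Put convention
  // described above.
  Slice key(buf.data(), user_key.size());

  // key.size() == 3, while key.data()[3..10] hold the timestamp bytes.
  return key.size() == user_key.size() ? 0 : 1;
}
```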

@riversand963 (Author):

We are talking about AssignTimestamp(const Slice& ts), not Put. The result of this API is that all keys with timestamps enabled will have the same timestamp.

Contributor:

> The result of this API is that all keys with timestamps enabled will have the same timestamp.

What about keys added after AssignTimestamp? Your description doesn't make it clear. If the data model were clearer to the API user, this might be obvious, but the data model with timestamps is not clear from the API comments. The name "AssignTimestamp" alone (not to mention "Assign timestamp to write batch") suggests a data model in which a timestamp is tracked separately from individual entries and applied to them at Write time. I had to look at the implementations to understand the contract for AssignTimestamp(s) (and Put).

@riversand963 (Author):

Keys that do not exist in the WB when the AssignTimestamp API is called will obviously not have their timestamp set to the value. We can make that clear in the API comments.
Since WriteBatch::Handler is public, users can always define their own handler, call Iterate(), and then insert new keys after that. Therefore, your concern applies to the entire WriteBatch::Handler, and I do not have a good way of preventing that.
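To illustrate the point about the public Handler, a sketch of a user-defined handler driven by Iterate(). The class name and counting logic are made up for this example; only a subset of Handler's overridable methods is shown:

```cpp
#include <cstdint>

#include "rocksdb/slice.h"
#include "rocksdb/status.h"
#include "rocksdb/write_batch.h"

using ROCKSDB_NAMESPACE::Slice;
using ROCKSDB_NAMESPACE::Status;
using ROCKSDB_NAMESPACE::WriteBatch;

// A user-defined handler: Iterate() calls back into it once per entry in
// the batch. Nothing stops the caller from adding new keys to the batch
// after iterating, which is the concern discussed above.
class CountingHandler : public WriteBatch::Handler {
 public:
  Status PutCF(uint32_t /*cf_id*/, const Slice& /*key*/,
               const Slice& /*value*/) override {
    ++num_puts_;  // illustrative: just count Put entries
    return Status::OK();
  }
  int num_puts() const { return num_puts_; }

 private:
  int num_puts_ = 0;
};

// Usage sketch: CountingHandler handler; wb.Iterate(&handler);
```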

#include <atomic>
#include <functional>
Contributor:

Is this needed if not using std::function? (Would std::function be better?)

@riversand963 (Author):

I can remove #include <functional>.

// This requires that all keys, if enable timestamp, (possibly from multiple
// column families) in the write batch have timestamps of the same format.
// checker: callable object to check the timestamp sizes of column families.
// User can call checker(uint32_t cf, size_t& ts_sz) which does the
Contributor:

> User can call

This phrasing suggests the user is the one calling the function (of course someone can call their own function!), when what it really means is that the implementation expects this behavior from the provided function.

@riversand963 (Author):

Yes, it's expected from the user if they opt in.

// 2. if cf's timestamp size is 0, then set ts_sz to 0 and return OK.
// 3. otherwise, compare ts_sz with cf's timestamp size and return
// Status::InvalidArgument() if different.
template <typename Checker = TimestampChecker>
Contributor:

Also, the Checker aspect of this API is effectively for internal use only, because users would have to provide their own implementation for a non-default template instantiation.
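For reference, a standalone checker following the two numbered cases visible in the excerpt might look like the sketch below. The map type and the handling of unknown column families (the excerpt's case 1 is cut off above) are assumptions of this sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

#include "rocksdb/status.h"

using ROCKSDB_NAMESPACE::Status;

// Hypothetical per-column-family timestamp sizes, maintained by the caller
// (e.g. WriteBatchWithIndex or Transaction in the discussion above).
using TsSizeMap = std::unordered_map<uint32_t, size_t>;

Status CheckTimestampSize(const TsSizeMap& sizes, uint32_t cf, size_t& ts_sz) {
  auto it = sizes.find(cf);
  if (it == sizes.end()) {
    // Unknown column family: rejecting here is an assumption of this sketch.
    return Status::InvalidArgument("unknown column family");
  }
  if (it->second == 0) {
    // Case 2: this column family has timestamps disabled.
    ts_sz = 0;
    return Status::OK();
  }
  // Case 3: compare against the column family's configured timestamp size.
  return ts_sz == it->second
             ? Status::OK()
             : Status::InvalidArgument("timestamp size mismatch");
}
```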

@riversand963 (Contributor, Author)

@pdillinger

> Why not have the column families know their own timestamp size

WriteBatch itself is a simple data structure that holds pointers to neither the DB nor any ColumnFamily. To be able to perform the desired check, you either need to store extra information in each WriteBatch as members, or pass extra arguments via the API.

If you choose to store extra information in each WriteBatch, then you will waste space whenever AssignTimestamp() is called from within WriteBatchWithIndex or Transaction: WriteBatchWithIndex already has a map from cf ID to comparators that can readily be used for this check. One more tricky thing: indexing can be enabled/disabled dynamically for each WriteBatchWithIndex.

I have thought about storing just the db pointer in each write batch, but obtaining a column family from the db requires the db mutex, and there is no guarantee the column family still exists. Therefore, more checking would be needed.

@riversand963 (Contributor, Author)

Also keep in mind that we cannot assume the number of column families is small or that the space overhead is minimal.

@pdillinger (Contributor)

Why not either:

  • Pass in to AssignTimestamp a set of ColumnFamilyHandles you want to apply the timestamp to. This is a simpler and more flexible interface. For checking purposes (optional), you can record in the WriteBatch the implied timestamp size for each involved column family, for checking at Write time. Trusting the user to provide authoritative information for input checking, without even checking it against authoritative information, is a bad design IMHO.
  • Make WriteBatch track a timestamp per column family, rather than per entry. Associate entries with timestamps at Write time (or applicable WBWI read time). Is "all keys in the write batch, if enable timestamp, share the same timestamp" the data model you wanted, or is that just the intended functionality of AssignTimestamp?

By the way, I had to look at the implementation of AssignTimestamps to understand the significance of the vector index. I believe it corresponds to the order of operations added to the WriteBatch, which is an awkward way of assigning timestamps. (If you wanted specific timestamps for each entry, why not just assign them as they are added?)

@pdillinger (Contributor)

> the implied timestamp size for each involved column family, for checking at Write time

And/or checking on subsequent AssignTimestamp.

@riversand963 (Contributor, Author)

> Pass in to AssignTimestamp a set of ColumnFamilyHandles you want to apply the timestamp to

This is suboptimal. Some users, e.g. WriteBatchWithIndex, already track a cf->comparator mapping. Why do we need to consume extra space and CPU to build another set of ColumnFamilyHandles? Consider another case: a transaction tracks the cf->comparator mapping for certain writes, but writes to the raw WriteBatch for others.

> you can record in the WriteBatch the implied timestamp size for each involved column family, for checking at Write time

Same. If many column families share the same timestamp size, this can waste space. I do not need a map from cf->timestamp size; I just need an (unordered) set.

There is a plan to improve the WriteBatch::Put(), Delete(), etc. APIs as part of #8946 (TBD). After that, or as part of it, we can improve the documentation.

@pdillinger (Contributor)

Transaction can use internal APIs with fewer safeguards against user error. It looks like this functionality could have been added to WriteBatchInternal.

> Some users, e.g. WriteBatchWithIndex, already track a cf->comparator mapping.

Why? NewIterator and other functions take a ColumnFamilyHandle, so you have access to the comparator whenever needed without saving it in the WBWI, right? Even with all entries in one skip list, you do not need the comparator for column family n + 1 to know that an entry in column family n comes before it.

But to avoid getting distracted by design critiques and performance concerns, let's re-focus on a key point. From HISTORY.md:

> Public API change
>
>   • Extend WriteBatch::AssignTimestamp and AssignTimestamps API so that both functions can accept an optional checker argument that performs additional checking on timestamp sizes.

This is false. The feature only works when including the internal header db/write_batch_internal.h.

@riversand963 (Contributor, Author)

> Why? NewIterator and other functions take a ColumnFamilyHandle, so you have access to the comparator whenever needed without saving it in the WBWI, right?

Not really.

  • First, and unrelated to this PR, WBWI already tracks the mapping from cf_id to comparator; please refer to WriteBatchWithIndex::Rep::comparator, which is of type WriteBatchEntryComparator defined in write_batch_with_index_internal.h.
  • Second, AssignTimestamp() does not take a ColumnFamilyHandle as an argument. When we call AssignTimestamp() to assign timestamps to the keys in a write batch, we must pass in timestamp sizes or comparators for multiple column families. It's unnecessarily restrictive and inefficient to require the user to construct a specific type of collection, since they (Transaction, other external users of RocksDB) may already track data structures that can be used for the checking. I will challenge proposals that require the user to create object(s) containing specific collection type(s) in order to call the AssignTimestamp() API, and I will also challenge proposals that use specific data types to store such information within WriteBatch, even if it's optional and controlled by a flag. I will also challenge proposals that advocate adding overloaded versions of AssignTimestamp() for different argument types.

> This is false. The feature only works when including the internal header db/write_batch_internal.h.

This is, and should be, a backward-compatible public API change. We should allow users to update/set timestamps of keys in a write batch, which is already a public API. In many cases, you do NOT know the timestamp until you decide to commit the write batch. Given the current implementation, I believe the checking should be user-defined too.

If we are accurate and explicit about the behavior of Checker, e.g. input/output and thread-safety, it will be an improvement over the existing AssignTimestamp() API, which is a must for the reasons mentioned above.

@riversand963 (Contributor, Author)

Thinking about this more, to bridge the gap between the current code and @pdillinger's review comments: one solution may be to remove the AssignTimestamp() APIs completely. This would mean users can only assign timestamps via an overloaded version of the DB::Write(WriteBatch*, ts) API, or via WriteBatch::Put(), Delete(), etc., which may not be possible in cases where the timestamp is known only after the data has been written to the write batch. Consequently, we would perform the check in the write thread or while holding the db mutex. If the number of keys to assign timestamps to in the write batch is small, this should not be a problem. It may become one with large transactions, because we would spend more time in single-threaded execution. Furthermore, users would have access to the updated WriteBatch only after DB::Write() returns. Not sure whether this is OK, but probably yes.

@riversand963 (Contributor, Author)

@pdillinger

> By the way, I had to look at the implementation of AssignTimestamps to understand the significance of the vector index. I believe it corresponds to the order of operations added to the WriteBatch, which is an awkward way of assigning timestamps.

I am also not a big fan of that, but it's also used to index into WriteBatch::prot_info_. If you think it's confusing, how do you feel about removing the WriteBatch::AssignTimestamps() API? I added this API after a conversation with a potential customer, but I do not have a confirmed use case for it yet. This way, we would only have the AssignTimestamp() API. Since its argument is not a collection type, the vector index should not confuse API users.

@pdillinger (Contributor)

> I am also not a big fan of that, but it's also used to index into WriteBatch::prot_info_.

I don't think that helps the public API user. prot_info_ is private, and ProtectionInfo, where the indexing occurs, is not defined in the public API.

> If you think it's confusing, how do you feel about removing the WriteBatch::AssignTimestamps() API?

I think that would be better.

@pdillinger (Contributor) commented Dec 13, 2021

> If we are accurate and explicit about the behavior of Checker, e.g. input/output and thread-safety, it will be an improvement over the existing AssignTimestamp() API, which is a must for the reasons mentioned above.

Even with #9278 it's still an awkward API, with unclear/inconsistent documentation. Why does it say "user can call checker" rather than "AssignTimestamp calls checker"? What does it mean to "assign timestamp to write batch" if some entries are in CFs without timestamps? And (repeating myself) does the timestamp apply to future entries added to the write batch? And if it's called "assign" instead of "overwrite", does that mean the entries did not, or should not, have timestamps before the call?

"ret: OK if assignment succeeds" and "if cf's timestamp size is 0, then set ts_sz to 0 and return OK" seem inconsistent. If there is no timestamp for an entry's CF, there is no timestamp assignment for that entry.

If you are intending that checker could be rather general/stable, for use across many WriteBatches and potentially other APIs to be added to WriteBatch, why is it tied to the particular semantics of AssignTimestamp/AssignTimestamps (which, by the way, happen to differ from each other in their handling of the no-timestamp case)? Why not just have the user provide a function that takes cf_id and returns the expected timestamp size (or SIZE_MAX for "I don't know; fail")? What's wrong with that simpler callback? And should we have the user provide this function at WriteBatch construction time rather than at AssignTimestamp time?

If you are intending the user to exercise more flexibility in the implementation of their checker (e.g., I don't know, skipping some CFs with timestamps?), why is there no flexibility in the specified behavior of checker?
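To make the proposed alternative concrete, here is a minimal sketch of that simpler callback. The type alias, the helper around it, and the SIZE_MAX convention are hypothetical illustrations of the suggestion above, not an existing RocksDB API:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// The proposed contract: map a column family id to its expected timestamp
// size, returning SIZE_MAX for "I don't know; fail".
using TsSizeFunc = std::function<size_t(uint32_t /* cf_id */)>;

// Hypothetical validation step inside AssignTimestamp: the user-supplied
// function answers one question, and the assignment logic owns the
// semantics (skip, assign, or fail).
bool TimestampSizeOk(const TsSizeFunc& ts_size_of, uint32_t cf_id,
                     size_t actual_ts_sz) {
  const size_t expected = ts_size_of(cf_id);
  if (expected == SIZE_MAX) {
    return false;  // unknown column family: fail, per the proposal
  }
  // expected == 0 means timestamps are disabled for this column family,
  // so there is nothing to check against.
  return expected == 0 || expected == actual_ts_sz;
}
```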

@riversand963 (Contributor, Author)

@pdillinger
Happening in #8946:
I removed the WriteBatch::AssignTimestamps() API and renamed the remaining API UpdateTimestamp(). Doing so simplifies the logic of timestamp checking, and as a result I was able to simplify the contract of checker as suggested.
I also improved the code comments to address the issues raised earlier.
I am thinking of also doing the timestamp update in PreprocessWrite().
