Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise APIs related to user-defined timestamp #8946

Closed
wants to merge 35 commits into from

Conversation

riversand963
Copy link
Contributor

@riversand963 riversand963 commented Sep 22, 2021

@ajkr reminded me that we have a rule of not including per-kv related data in WriteOptions.
Namely, WriteOptions should not include information about "what-to-write", but should just
include information about "how-to-write".

According to this rule, WriteOptions::timestamp (experimental) is clearly a violation. Therefore,
this PR removes WriteOptions::timestamp for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions Put(write_opts, key, value, ts), Delete(write_opts, key, ts), and
SingleDelete(write_opts, key, ts). Planned to add Write(write_opts, batch, ts), but its complexity
made me reconsider doing it in another PR (maybe).

For better checking and returning error early, we also add a new set of APIs to WriteBatch that take
extra timestamp information when writing to WriteBatches.
These set of APIs in WriteBatchWithIndex are currently not supported, and are on our TODO list.

Removed WriteBatch::AssignTimestamps() and renamed WriteBatch::AssignTimestamp() to
WriteBatch::UpdateTimestamps() since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.

The constructor of WriteBatch now takes a fourth argument default_cf_ts_sz which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.

Also, updated DB::Get(), DB::MultiGet(), DB::NewIterator(), DB::NewIterators() methods, replacing
some assertions about timestamp to returning Status code.

Test plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0

Make sure there is no perf regression by running the following

./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom

Before this PR

DB path: [/dev/shm/rocksdb]
fillrandom   :       1.831 micros/op 546235 ops/sec;   60.4 MB/s

After this PR

DB path: [/dev/shm/rocksdb]
fillrandom   :       1.820 micros/op 549404 ops/sec;   60.8 MB/s

@mrambacher
Copy link
Contributor

Personally, I prefer the WriteOptions::timestamp and think we should stick with the original implementation/proposal:

  • WriteOptions::timestamp matches ReadOptions::timestamp. Why is it the right model for read but not wrte?
  • I believe a timestamp is the type of "per-operation thing" that these options were intended for.
  • It may be possible to do error checking in the DB for the WriteOptions (to guarantee or add a timestamp of the same size) with the WriteOption that is much harder than without
  • I believe the wrapper-style APIs cause problems in a "has-a" wrapper implementation (StackableDB). These new APIs appear not to break compatibility/interoperability but can actually cause problems and code to be bypassed.

With all of these points in mind, I believe the WriteOptions::timestamp is a better approach.

@pdillinger
Copy link
Contributor

This PR proposes a set
of overloaded functions Put(write_opts, key, value, ts) and Delete(write_opts, key, ts).

SstFileWriter uses Put(key, ts, value).

What bugs me about all these options is that non-timestamp users have to deal with all the timestamp-specific overloads when writing their code, and vice-versa. If someone forgets a timestamp (or less likely, accidentally provides one), they only find out at run time. If we had DbWithTimestamp and DB, only containing applicable functions, that would reduce the mixing of disjoint functionality in the API. Obviously we would want a common base type like DbBase for functions and declarations in common.

In addition or instead, we could have wrappers that allow putting more useful types than Slice on keys, timestamps (when applicable), and values. Perhaps call them TypedDb and TypedDbWithTimestamp.

@riversand963
Copy link
Contributor Author

@pdillinger

non-timestamp users have to deal with all the timestamp-specific overloads when writing their code, and vice-versa

I doubt. For non-timestamp users, they do not have to worry about timestamp-specific overloads. If they accidentally call these overloads, we can return InvalidArgument. To be able to use timestamps, they need to open db with column family options setting comparator to a timestamp-aware comparator. If they fail to do so, calling timestamp overloads can return error. I currently added assertions, but can add error returning. For timestamp users, calling non-timestamp overloads can also perform similar check and return error (TBD).

I am concerned about differentiating between <user_key> and <user_key, ts> internally, but this cannot be solved by wrapping DBImpl or something.

@riversand963 riversand963 marked this pull request as draft September 22, 2021 17:33
@riversand963
Copy link
Contributor Author

@mrambacher

I believe a timestamp is the type of "per-operation thing" that these options were intended for.

For DB::Put(), DB::Delete(), etc, yes. For DB::Write(), it's not necessarily the case. keys in a write batch can have different timestamps, actually the timestamps do not come are WriteOptions. They come from application directly.

@pdillinger
Copy link
Contributor

I doubt. For non-timestamp users, they do not have to worry about timestamp-specific overloads.

Their mere presence complicates the interface.
Screen Shot 2021-09-22 at 12 50 21 PM

@ajkr
Copy link
Contributor

ajkr commented Sep 22, 2021

I doubt. For non-timestamp users, they do not have to worry about timestamp-specific overloads.

Their mere presence complicates the interface.
Screen Shot 2021-09-22 at 12 50 21 PM

Would a DB (without timestamp) expose a similar requirement: it must be configured to use a Comparator that is not timestamp-aware?

Also I think comparator is a ColumnFamilyOption

@ltamasi
Copy link
Contributor

ltamasi commented Sep 22, 2021

The way I see it, there are two types of information passed to the write APIs: the key-value data that tells RocksDB what to write, and the write options (e.g. sync mode, disable WAL etc.) that tell RocksDB how to write the said key-values. In this classification, timestamps clearly belong in the what bucket. Also, I'm not concerned about adding new parameters/overloads for data (like timestamps) because our data model gets extended once in a blue moon.

@riversand963
Copy link
Contributor Author

riversand963 commented Sep 22, 2021

I doubt. For non-timestamp users, they do not have to worry about timestamp-specific overloads.

Their mere presence complicates the interface.
Screen Shot 2021-09-22 at 12 50 21 PM

Would a DB (without timestamp) expose a similar requirement: it must be configured to use a Comparator that is not timestamp-aware?

Also I think comparator is a ColumnFamilyOption

Yes, it is a column family option because we allow different column families to have different comparators. That said, it is allowed for some column families to enable timestamp, while others do not. Note that we can write to all these column families in ONE write batch.
If application does not use timestamp or own comparator, nothing needs to be done because the default BytewiseComparator() will be used. If an application wishes to use its own comparator, then the application needs to make sure the comparator makes the correct assumption about timestamp presence.

@riversand963
Copy link
Contributor Author

Furthermore, if an application just wants to implement own Comparator without concerning about timestamp can simply do so because the base class Comparator::timestamp_size() is 0, indicating it's disabled by default.

@pdillinger
Copy link
Contributor

That said, it is allowed for some column families to enable timestamp, while others do not

Oh right, so we need both in DB, which to my taste is a low-level interface. (It should be possible to create a TypedColumnFamily object that provides some conveniences I describe.)

because our data model gets extended once in a blue moon.

Let's hope :)

@@ -353,10 +353,17 @@ class DB {
virtual Status Put(const WriteOptions& options,
ColumnFamilyHandle* column_family, const Slice& key,
const Slice& value) = 0;
virtual Status Put(const WriteOptions& options,
ColumnFamilyHandle* column_family, const Slice& key,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SstFileWriter uses the order Put(key, ts, value)

(re-iterating to make sure this isn't lost in the discussion)

Copy link
Contributor Author

@riversand963 riversand963 Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address these comments once we agree on the overall direction. :)

Currently, this is not a blocker for us, since we are working on adding timestamp support to WriteCommittedTxn transactions. I hope we can get this resolved before

  • release of RocksDB 7.0
  • people start adding timestamp support for Merge, DeleteRange, etc.
  • declaring (subset of) feature complete for customers

whichever comes first.

@riversand963 riversand963 force-pushed the write-opts-no-ts branch 2 times, most recently from 4b5dd0a to 5a43ee6 Compare September 27, 2021 22:44
@riversand963 riversand963 force-pushed the write-opts-no-ts branch 2 times, most recently from 3fa6d4e to 114277e Compare October 6, 2021 22:39
@riversand963 riversand963 marked this pull request as ready for review November 2, 2021 22:53
@riversand963 riversand963 marked this pull request as draft November 22, 2021 06:18
@riversand963 riversand963 changed the title Remove WriteOptions::timestamp [WIP]Remove WriteOptions::timestamp Dec 20, 2021
@riversand963 riversand963 force-pushed the write-opts-no-ts branch 2 times, most recently from 543ba27 to a8698c7 Compare December 21, 2021 05:53
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963 riversand963 deleted the write-opts-no-ts branch February 2, 2022 06:26
v01dstar pushed a commit to v01dstar/rocksdb that referenced this pull request Jun 14, 2022
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".

According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).

For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.

Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.

The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.

Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.

Pull Request resolved: facebook#8946

Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0

Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```

Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.831 micros/op 546235 ops/sec;   60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.820 micros/op 549404 ops/sec;   60.8 MB/s
```

Reviewed By: ltamasi

Differential Revision: D33721359

Pulled By: riversand963

fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
tabokie pushed a commit to tikv/rocksdb that referenced this pull request Jun 14, 2022
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".

According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).

For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.

Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.

The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.

Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.

Pull Request resolved: facebook#8946

Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0

Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```

Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.831 micros/op 546235 ops/sec;   60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.820 micros/op 549404 ops/sec;   60.8 MB/s
```

Reviewed By: ltamasi

Differential Revision: D33721359

Pulled By: riversand963

fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed rocksdb-7.0 PRs with breaking API changes that need to land in the next major release, 7.0.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants