Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for timestamp in Get/Put #5079

Closed
wants to merge 21 commits into from

Conversation

riversand963
Copy link
Contributor

@riversand963 riversand963 commented Mar 18, 2019

It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of #4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in ReadOptions and WriteOptions. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).

Test plan (on devserver):

$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check

All tests must pass.

We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.

$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000

Repeat for 6 times for both versions.

Results are as follows:

|        | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |

Copy link
Contributor

@sagar0 sagar0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! This would be a great feature to have in RocksDB.

include/rocksdb/comparator.h Outdated Show resolved Hide resolved
Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! We need to document in some comments that it is just a feature in-progress, what is supported, not supported, and mention that format compatibility is not guaranteed and API is applicable to change. And I think it's pretty close to be merged!

db/db_impl.cc Outdated
if (read_options.timestamp != nullptr) {
Slice akey;
std::string buf;
Status s = AppendTimestamp(key, *(read_options.timestamp), &akey, &buf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a TODO: in GetImpl() we already copy the key once to LookupKey lkey. We should do the copying together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I have TODOs for both Write/Put and Get. Will be addressed in another PR. This one focuses more on API and tries to demonstrate how the internal can work with timestamp.

include/rocksdb/comparator.h Outdated Show resolved Hide resolved
table/block_based_table_reader.cc Outdated Show resolved Hide resolved
table/get_context.cc Outdated Show resolved Hide resolved
@@ -182,7 +182,8 @@ bool GetContext::SaveValue(const ParsedInternalKey& parsed_key,
assert(matched);
assert((state_ != kMerge && parsed_key.type != kTypeMerge) ||
merge_context_ != nullptr);
if (ucmp_->Equal(parsed_key.user_key, user_key_)) {
if (ucmp_->Equal(parsed_key.user_key, user_key_) ||
ucmp_->CompareWithoutTimestamp(parsed_key.user_key, user_key_) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it is correct to check only key without timestamp for hit, I think merge should only happen if both of user key and timestamp are the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, merge should also look at timestamp. To RocksDB, two equal keys with different user-timestamps are just two different keys. Without further instruction from the user, e.g. merge policy, RocksDB needs to treat them as separate keys.
I didn't implement merge in this PR though.

db/db_impl.cc Outdated
@@ -1352,6 +1352,15 @@ ColumnFamilyHandle* DBImpl::DefaultColumnFamily() const {
Status DBImpl::Get(const ReadOptions& read_options,
ColumnFamilyHandle* column_family, const Slice& key,
PinnableSlice* value) {
if (read_options.timestamp != nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not in this PR, but we should think about returning which timestamp for the entry returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we return the timestamp to user? Say if user specifies 55 as timestamp, and the db has a record with ts 54, we need to return 54 as well as the key-value pair?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We have been talking about a hidden column or something like that in SQL tables which is timestamp.

include/rocksdb/comparator.h Outdated Show resolved Hide resolved
table/block_based_table_reader.cc Outdated Show resolved Hide resolved
include/rocksdb/comparator.h Show resolved Hide resolved
db/memtable.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are very close to get it committed!

include/rocksdb/options.h Show resolved Hide resolved
include/rocksdb/options.h Show resolved Hide resolved
include/rocksdb/comparator.h Show resolved Hide resolved
include/rocksdb/comparator.h Outdated Show resolved Hide resolved
include/rocksdb/comparator.h Outdated Show resolved Hide resolved
Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's land the first commit!!! Please mention it in HISTORY.md

@siying
Copy link
Contributor

siying commented Apr 29, 2019

Failed tests are legitimate though.

@riversand963
Copy link
Contributor Author

Failed tests are legitimate though.

Taking a look now.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@riversand963
Copy link
Contributor Author

riversand963 commented May 1, 2019

@siying, I think we should also persist timestamp size info to the MANIFEST using the safe ignore mask.

@riversand963
Copy link
Contributor Author

Discussed with @siying offline. For now we can still rely on comparator name to identify comparators without persisting timestamp size in the MANIFEST. It may be a good idea to postpone MANIFEST format change (although can be forward-compatible) until we are ready to officially release timestamp support.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

@riversand963
Copy link
Contributor Author

@siying. Since we may likely allow users to construct their own WriteBatch with augmented keys (key + timestamp), I think we probably need to perform some argument sanity checks in DBImpl::Write anyway. Therefore, we should not perform the check in DB::Put in order to reduce duplicate code. Then we may need to add some state information to WriteBatch to indicate whether timestamp(s) has been assigned to the records in the write batch.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Summary:
Explicitly override ComparaWithoutTimestamp in BytewiseComparator and
ReverseBytewiseComparator to avoid an extra virtual function call incurred by
Comparator::CompareWithoutTimestamp.
Summary:
We need to make timestamp_size_ non const because we implicitly use assignment
operator. Also make it private to disable subclass modifying its value.
Finally, add an explicit overloaded assignment.
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

@riversand963 riversand963 removed the WIP Work in progress label Jun 5, 2019
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 merged this pull request in 340ed4f.

vagogte pushed a commit to vagogte/rocksdb that referenced this pull request Jun 18, 2019
Summary:
It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).

Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
All tests must pass.

We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
```
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
```
Repeat for 6 times for both versions.

Results are as follows:
```
|        | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |
```
Pull Request resolved: facebook#5079

Differential Revision: D15132946

Pulled By: riversand963

fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
facebook-github-bot pushed a commit that referenced this pull request Jul 25, 2019
Summary:
In previous #5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`.
We address these issues in this PR by doing the following.
1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc.
2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`.
3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`.

Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
If the API extension looks good, I will add more unit tests.

Some simple benchmark using db_bench.
```
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true
```
Master is at a78503b.
```
|        | readrandom | fillrandom |
| master | 15.53 MB/s | 25.97 MB/s |
| PR5502 | 16.70 MB/s | 25.80 MB/s |
```
Pull Request resolved: #5502

Differential Revision: D16340894

Pulled By: riversand963

fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
merryChris pushed a commit to merryChris/rocksdb that referenced this pull request Nov 18, 2019
…k#5502)

Summary:
In previous facebook#5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`.
We address these issues in this PR by doing the following.
1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc.
2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`.
3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`.

Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
If the API extension looks good, I will add more unit tests.

Some simple benchmark using db_bench.
```
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true
```
Master is at a78503b.
```
|        | readrandom | fillrandom |
| master | 15.53 MB/s | 25.97 MB/s |
| PR5502 | 16.70 MB/s | 25.80 MB/s |
```
Pull Request resolved: facebook#5502

Differential Revision: D16340894

Pulled By: riversand963

fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
hicder added a commit to rockset/rocksdb-cloud that referenced this pull request Mar 3, 2020
Summary:
Fixes facebook#5079. So what happens was
1. Durable uploads CLOUDMANIFEST
2. Non-durable downloads CLOUDMANIFEST
3. Non-durable tried to open MANIFEST, couldn't find it locally nor in the cloud, assuming it's a new DB.
4. Durable uploads new MANIFEST
5. Non-durable wrote new epoch to its local CLOUDMANIFEST
6. Non-durable failed to open MANIFEST, restarts. At restart, it resuses its local CLOUDMANIFEST without compatible MANIFEST, goes into crash loop.

If 3 and 4 reverses order, non-durable will be able to open the DB. This diff ensures that if this scenario occurs, non-durable will reinitializes the entire directory.

Test Plan: added a unittest to describe such scenario.

Reviewers: dhruba, igor

Reviewed By: igor

Differential Revision: https://rockset.phacility.com/D5571
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants