-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snapshots with user-specified timestamps #9879
Conversation
Will add unit tests later. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
3a9af5a
to
954a6e6
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
954a6e6
to
9e6858f
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
9e6858f
to
7b8b977
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
7b8b977
to
c4548d6
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
c4548d6
to
993ae87
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
993ae87
to
5a87086
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
5a87086
to
f3ca491
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
f3ca491
to
ad977d6
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
ad977d6
to
90f90d9
Compare
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@riversand963 has updated the pull request. You must reimport the pull request before landing. |
@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with user-specified timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a
rocksdb::Snapshot
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the
GetSnapshot()
is called,an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.
This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of
TransactionDB
. Application can tell RocksDB to create a snapshot when a transactioncommits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API
TransactionDB::CreateTimestampedSnapshot()
to createa snapshot with associated timestamp.
Code example
The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
We also provide two additional APIs for stats collection and reporting purposes.
To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
Before shutdown, RocksDB will release all timestamped snapshots.
Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.
Timestamped snapshots can also be used with user-defined timestamp.
Test plan: