
Improve consistency and isolation semantics by adding Context parameter to DB API #2310

Open
2 tasks done
Tracked by #2331
PokIsemaine opened this issue May 14, 2024 · 19 comments
Labels
enhancement

Comments

@PokIsemaine
Contributor

Search before asking

  • I had searched in the issues and found no similar issues.

Motivation

A single DB API operation may issue multiple read operations or make nested calls. These reads do not use a fixed snapshot, so data from different snapshots may be read during one operation, causing inconsistency.
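
A minimal sketch of the problem and the intended fix, with an illustrative function name and key names (not kvrocks's actual API or key encoding): without a fixed snapshot, each Get implicitly uses the latest snapshot, so a concurrent write between two reads can make them observe different versions; pinning one snapshot in ReadOptions makes both reads consistent.

#include <string>
#include <rocksdb/db.h>

void ReadPairWithFixedSnapshot(rocksdb::DB *db, std::string *metadata, std::string *subkey) {
  const rocksdb::Snapshot *snapshot = db->GetSnapshot();
  rocksdb::ReadOptions read_options;
  read_options.snapshot = snapshot;  // both reads now observe the same version of the DB
  db->Get(read_options, "illustrative_metadata_key", metadata);
  db->Get(read_options, "illustrative_sub_key", subkey);
  db->ReleaseSnapshot(snapshot);
}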

Solution

Referring to the existing LatestSnapshot and GetOptions, we can add a Context parameter to each DB API through which a fixed snapshot can be passed.

After a few initial attempts, I found that the change has a ripple effect: several modules need to change their APIs at the same time, which results in a huge PR that is hard to break into smaller ones. I plan to open a draft PR soon to sketch the overall approach, and then gradually refine the changes to each module.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@PokIsemaine added the enhancement label May 14, 2024
@PokIsemaine
Contributor Author

PokIsemaine commented May 16, 2024

Need some help:

struct Context {
  engine::Storage *storage_ = nullptr;
  // Snapshot taken when the Context is created; every read in the operation uses it.
  const rocksdb::Snapshot *snapshot_ = nullptr;
  // Buffers the operation's own writes so they can be read back before the final Write.
  rocksdb::WriteBatchWithIndex *batch_ = nullptr;

  Context() = default;
  explicit Context(engine::Storage *storage)
      : storage_(storage), snapshot_(storage->GetDB()->GetSnapshot()) {}

  // Returns ReadOptions whose snapshot field is set to snapshot_.
  rocksdb::ReadOptions GetReadOptions();
  const rocksdb::Snapshot *GetSnapShot();
};

This is the general idea: pass a Context into the Database API and use a fixed snapshot for the call, so the snapshot does not change during the entire calling process. When an operation needs to read data it has written itself, we use WriteBatchWithIndex + GetFromBatchAndDB, reading from the batch first and then from the DB at snapshot = ctx.snapshot_.
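
Building on the struct above, a minimal sketch of how a read inside the DB API could go through the Context (the function name and signature are illustrative, not the actual kvrocks API): the command's own buffered writes are consulted first, then the DB is read at the Context's fixed snapshot.

rocksdb::Status GetThroughContext(Context &ctx, const rocksdb::Slice &ns_key, std::string *value) {
  rocksdb::ReadOptions read_options = ctx.GetReadOptions();  // read_options.snapshot == ctx.snapshot_
  if (ctx.batch_ != nullptr) {
    // GetFromBatchAndDB sees the operations buffered in batch_ plus the snapshot view of the DB.
    return ctx.batch_->GetFromBatchAndDB(ctx.storage_->GetDB(), read_options, ns_key, value);
  }
  return ctx.storage_->GetDB()->Get(read_options, ns_key, value);
}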

Most operations currently use WriteBatch. Is there a way to copy the operation sequence from a WriteBatch into ctx's WriteBatchWithIndex? That way I would only need to modify the final Write step in Storage, instead of updating ctx's WriteBatchWithIndex every time the DB API modifies the WriteBatch.

@mapleFU
Member

mapleFU commented May 16, 2024

Most operations currently use WriteBatch. Is there a way to copy the operation sequence from a WriteBatch into ctx's WriteBatchWithIndex? That way I would only need to modify the final Write step in Storage, instead of updating ctx's WriteBatchWithIndex every time the DB API modifies the WriteBatch.

RocksDB has the following class relationships:

WriteBatchBase
WriteBatchWithIndex : WriteBatchBase
WriteBatch : WriteBatchBase

Should we switch to WriteBatchBase in some places? Besides, WriteBatchWithIndex has a WriteBatch* GetWriteBatch() override; interface here.
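
For illustration, a minimal sketch of that relationship (a standalone example, not kvrocks code): code written against WriteBatchBase accepts either a WriteBatch or a WriteBatchWithIndex, and GetWriteBatch() exposes the underlying WriteBatch of an indexed batch, e.g. for the final db->Write().

#include <rocksdb/write_batch.h>
#include <rocksdb/utilities/write_batch_with_index.h>

rocksdb::Status AppendPut(rocksdb::WriteBatchBase *batch, const rocksdb::Slice &key,
                          const rocksdb::Slice &value) {
  return batch->Put(key, value);  // same call whether batch is a WriteBatch or a WriteBatchWithIndex
}

void Example() {
  rocksdb::WriteBatchWithIndex indexed_batch;
  AppendPut(&indexed_batch, "key", "value");
  rocksdb::WriteBatch *raw = indexed_batch.GetWriteBatch();  // underlying WriteBatch for db->Write()
  (void)raw;
}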

@PokIsemaine
Contributor Author

Most operations currently use WriteBatch. Is there a way to copy the operation sequence from a WriteBatch into ctx's WriteBatchWithIndex? That way I would only need to modify the final Write step in Storage, instead of updating ctx's WriteBatchWithIndex every time the DB API modifies the WriteBatch.

Correction: because there may be multiple Writes, the operations should be appended to ctx's WriteBatchWithIndex rather than simply copied.

@mapleFU
Member

mapleFU commented May 16, 2024

@PokIsemaine I've checked that most call sites use GetWriteBatch with a WriteBatchBase; would that be OK for the scenario here?

@PokIsemaine
Contributor Author

I think there are currently two options:

  1. Keep the current WriteBatch, then call WriteBatch::Iterate(&handler) at write time and, following batch_debugger.h, write a WriteBatch::Handler that appends the WriteBatch operations to a WriteBatchWithIndex one by one (sketched below).
  2. Return a WriteBatchWithIndex from GetWriteBatchBase, but I found that WriteBatchWithIndex does not support all WriteBatch operations, such as DeleteRange. Even if we call DeleteRange on the WriteBatch obtained via WriteBatchWithIndex::GetWriteBatch, GetFromBatchAndDB cannot see the effect of that DeleteRange in the batch.

For DeleteRange, perhaps we need to switch to a loop of single Delete calls, but I don't know whether that would have a big performance impact. With the first option, what happens if we only append to the batch but never perform the Write?
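
A minimal sketch of option 1 (the column-family handle map and class name are assumptions for illustration, not the actual kvrocks code): a WriteBatch::Handler that replays each recorded operation into a WriteBatchWithIndex, driven by WriteBatch::Iterate(&handler). Other operation types (Merge, SingleDelete, LogData, ...) would need to be handled the same way.

#include <cstdint>
#include <map>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <rocksdb/utilities/write_batch_with_index.h>

class CopyToIndexedBatchHandler : public rocksdb::WriteBatch::Handler {
 public:
  CopyToIndexedBatchHandler(rocksdb::WriteBatchWithIndex *dest,
                            std::map<uint32_t, rocksdb::ColumnFamilyHandle *> cf_handles)
      : dest_(dest), cf_handles_(std::move(cf_handles)) {}

  rocksdb::Status PutCF(uint32_t cf_id, const rocksdb::Slice &key,
                        const rocksdb::Slice &value) override {
    return dest_->Put(cf_handles_.at(cf_id), key, value);
  }
  rocksdb::Status DeleteCF(uint32_t cf_id, const rocksdb::Slice &key) override {
    return dest_->Delete(cf_handles_.at(cf_id), key);
  }
  rocksdb::Status DeleteRangeCF(uint32_t cf_id, const rocksdb::Slice &begin,
                                const rocksdb::Slice &end) override {
    // WriteBatchWithIndex cannot index a range tombstone, so DeleteRange has to be
    // handled separately (rejected, or expanded into single Deletes).
    return rocksdb::Status::NotSupported("DeleteRange is not indexable");
  }

 private:
  rocksdb::WriteBatchWithIndex *dest_;
  std::map<uint32_t, rocksdb::ColumnFamilyHandle *> cf_handles_;
};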


@PokIsemaine
Contributor Author

PokIsemaine commented May 18, 2024

Some other questions:

  1. What isolation level can we expect if kvrocks requests are processed by multiple threads without using transactions? Serializable, snapshot isolation, or something else?
  2. What resources are specifically protected by LockGuard for write operations?


@mapleFU
Member

mapleFU commented May 18, 2024

(2) LockGuard protects the "keys" of an operation. During a write, it first collects the keys it intends to write and then locks all of them.

(1) is an interesting problem. I think we're aiming for SI (snapshot isolation). That cannot avoid concurrent read-modify-write sequences on the same key, but it would make a single write, or multiple write operations, see the same snapshot.
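
For (2), a schematic sketch of the "collect the keys, then lock them all" pattern (purely illustrative; this is not the actual kvrocks lock manager): keys are hashed onto a fixed pool of mutexes, and a multi-key guard locks the deduplicated, ordered set of buckets so concurrent writers always lock in the same order and cannot deadlock.

#include <functional>
#include <mutex>
#include <set>
#include <string>
#include <vector>

class KeyLockPool {
 public:
  explicit KeyLockPool(size_t n = 1024) : mutexes_(n) {}
  std::mutex &For(const std::string &key) {
    return mutexes_[std::hash<std::string>{}(key) % mutexes_.size()];
  }

 private:
  std::vector<std::mutex> mutexes_;
};

class MultiKeyGuard {
 public:
  MultiKeyGuard(KeyLockPool *pool, const std::vector<std::string> &keys) {
    std::set<std::mutex *> buckets;  // dedupe and order buckets for a consistent locking order
    for (const auto &key : keys) buckets.insert(&pool->For(key));
    for (auto *m : buckets) {
      m->lock();
      locked_.push_back(m);
    }
  }
  ~MultiKeyGuard() {
    for (auto it = locked_.rbegin(); it != locked_.rend(); ++it) (*it)->unlock();
  }

 private:
  std::vector<std::mutex *> locked_;
};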

@git-hulk
Member

git-hulk commented May 23, 2024

For DeleteRange, perhaps we need to switch to a loop of single Delete calls, but I don't know whether that would have a big performance impact. With the first option, what happens if we only append to the batch but never perform the Write?

This should be unacceptable for performance reasons. For example, it might cause Kvrocks to stop serving if it runs the FLUSH[DB|ALL] command with a large number of keys in the DB. To mitigate this, it would be better to disallow the DeleteRange operation in transactions if necessary.

@mapleFU
Member

mapleFU commented May 23, 2024

I also think we should check how we use DeleteRange. It is essentially a performance hack, since its implementation is itself a "hack".

@PokIsemaine
Contributor Author

This should be unacceptable for performance reasons. For example, it might cause Kvrocks to stop serving if it runs the FLUSH[DB|ALL] command with a large number of keys in the DB. To mitigate this, it would be better to disallow the DeleteRange operation in transactions if necessary.

Does this impact occur when db_->Write is executed? What if I just use the WriteBatchWithIndex to record the operations performed but never do a Write?

@mapleFU
Member

mapleFU commented May 23, 2024

https://github.com/facebook/rocksdb/wiki/DeleteRange-Implementation

In short, DeleteRange writes an explicit "range" tombstone, and this affects scans, reads, compaction, and so on.
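
For reference, a minimal sketch of the "for + Delete" alternative mentioned earlier (a standalone example, not kvrocks code): instead of writing one range tombstone, scan the range and record one Delete per key in the indexed batch. This is indexable by GetFromBatchAndDB but touches every key, which is the performance concern for large ranges such as FLUSH[DB|ALL].

#include <memory>
#include <rocksdb/db.h>
#include <rocksdb/utilities/write_batch_with_index.h>

rocksdb::Status DeleteRangeByKeys(rocksdb::DB *db, rocksdb::ColumnFamilyHandle *cf,
                                  const rocksdb::Slice &begin, const rocksdb::Slice &end,
                                  rocksdb::WriteBatchWithIndex *batch) {
  rocksdb::ReadOptions read_options;
  std::unique_ptr<rocksdb::Iterator> iter(db->NewIterator(read_options, cf));
  for (iter->Seek(begin); iter->Valid() && iter->key().compare(end) < 0; iter->Next()) {
    auto s = batch->Delete(cf, iter->key());  // one indexable Delete per existing key
    if (!s.ok()) return s;
  }
  return iter->status();
}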

@PokIsemaine
Contributor Author

I also think we should check how we use DeleteRange. It is essentially a performance hack, since its implementation is itself a "hack".

Currently it mainly affects kvrocks2redis, rdb and some test cases.

@PokIsemaine
Contributor Author

PokIsemaine commented May 23, 2024

Progress update:

  1. Keep the current WriteBatch, then call WriteBatch::Iterate(&handler) at write time and, following batch_debugger.h, write a WriteBatch::Handler that appends the WriteBatch operations to a WriteBatchWithIndex one by one.

I'm trying the first option, but the golang test times out on a specific GitHub Actions version; I'm currently troubleshooting why.
unstable...PokIsemaine:kvrocks:unstable


@git-hulk
Member

@PokIsemaine I'm sorry, I didn't quite understand why we need to care about the WriteBatch. From the issue description, what we want to solve is that nested read operations may use an undetermined snapshot? If so, passing the snapshot via the context is a good solution.

For write operations and transaction mode, the per-key lock should guarantee that the corresponding key won't be changed, so it should be fine to leave that as it is.

Please correct me if I missed anything.

@mapleFU
Member

mapleFU commented May 23, 2024

I'm sorry, I didn't quite understand why we need to care about the WriteBatch. From the issue description, what we want to solve is that nested read operations may use an undetermined snapshot? If so, passing the snapshot via the context is a good solution.

The main issue is framework-level read-after-write, e.g. a command that issues a write and then reads after it. It's hard to ensure the command reads its "own workspace" after it may have written some data.

@git-hulk
Member

I see, thank you. But I can't remember any commands that write and then read, except Lua and transactions.

@mapleFU
Member

mapleFU commented May 23, 2024

@git-hulk @PokIsemaine I think for now we can first reject "DeleteRange" and check that the other commands work?

IMO DeleteRange is mostly used in the background. Maybe I can recheck the logic that uses "DeleteRange" in commands.

@PragmaTwice
Member

PragmaTwice commented May 23, 2024

Yeah the current conversation goes beyond the issue title. I think it's related to this project: https://summer-ospp.ac.cn/org/prodetail/249430512?lang=en&list=pro
We can have separate issues for it.

@PokIsemaine changed the title from "Read the determined rocksdb snapshot by passing the Context parameter" to "Improve consistency and isolation semantics by adding Context parameter to DB API" on May 25, 2024
@PokIsemaine
Contributor Author

Yeah the current conversation goes beyond the issue title. I think it's related to this project: https://summer-ospp.ac.cn/org/prodetail/249430512?lang=en&list=pro We can have separate issues for it.

Yes, we are discussing this project. I changed the title of this issue and grouped it under a new OSPP tracking issue.
