Change locking system into using RocksDB's pessimistic transactions #46
Comments
Comment by agiardullo
Comment by spetrunia
In order to have SQL semantics, one has to be able to roll back a failed statement. The statement may be part of a transaction; in that case the statement must be rolled back, but the transaction must remain open (i.e. neither committed nor rolled back). If we only use the Pessimistic Transaction API for locking, we can achieve correct behavior by simply not releasing the failed statement's locks until the transaction has either committed or rolled back. This will limit concurrency but is correct behavior. If we use the Pessimistic Transaction API to also hold the transaction's changes, then we need to be able to roll back: if a statement inside a transaction changes a few rows and then fails, these changes must not be visible to the rest of the transaction.
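The statement-level rollback requirement above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyTransaction`, `BeginStatement`, and `RollbackStatement` are hypothetical names invented for illustration and are not the RocksDB API. Locking is not modeled at all, only the visibility of buffered changes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: a hypothetical transaction that buffers its own
// uncommitted changes. Not the RocksDB API; locks are not modeled.
class ToyTransaction {
 public:
  // Marks the start of a SQL statement; the returned token is a savepoint.
  std::size_t BeginStatement() const { return pending_.size(); }

  // Buffers a write. Nothing reaches the committed store before Commit().
  void Put(const std::string& key, const std::string& value) {
    pending_.push_back(std::make_pair(key, value));
  }

  // Discards every write made since the savepoint. The transaction stays
  // open and keeps the changes of earlier, successful statements.
  void RollbackStatement(std::size_t savepoint) {
    assert(savepoint <= pending_.size());
    pending_.erase(pending_.begin() + static_cast<std::ptrdiff_t>(savepoint),
                   pending_.end());
  }

  // Reads through the transaction: its own pending writes (newest first)
  // take precedence over committed data.
  bool Get(const std::map<std::string, std::string>& committed,
           const std::string& key, std::string* value) const {
    for (auto it = pending_.rbegin(); it != pending_.rend(); ++it) {
      if (it->first == key) {
        *value = it->second;
        return true;
      }
    }
    auto found = committed.find(key);
    if (found == committed.end()) return false;
    *value = found->second;
    return true;
  }

  // Applies the buffered writes in order and ends the transaction.
  void Commit(std::map<std::string, std::string>* committed) {
    for (const auto& change : pending_) {
      (*committed)[change.first] = change.second;
    }
    pending_.clear();
  }

 private:
  std::vector<std::pair<std::string, std::string> > pending_;
};
```

In this model, a failed statement calls `RollbackStatement` with the token it obtained from `BeginStatement`; later reads in the same transaction then no longer see that statement's writes, while writes from earlier statements remain pending.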
Comment by spetrunia I've read the new Pessimistic Transactions patch (posted at https://reviews.facebook.net/D40869). It looks like the patch provides everything that MyRocks needs.
Comment by spetrunia Getting close to getting something to work, found one thing that I forgot about. In the current system, row lock waits are integrated into MySQL.
and one can use With my new code that uses pessimistic RocksDB trx API, we have state=Updating and the thread is not KILLable:
In order to achieve state="Waiting for row lock" and KILLability, MyRocks does the following in
That is, we use MySQL's wrappers over pthread_cond_t and pthread_mutex_t, mysql's wait function, and also we inform MySQL that we start/finish waiting. Possible ways out:
Comment by spetrunia Consider two cases: Case 1.
The last call times out and returns:
I have made MyRocks return ER_LOCK_WAIT_TIMEOUT to the SQL layer in this case. Case 2:
The returned value is the same as in Case 1:
But the situation is different. In Case 1, it makes sense to wait longer. In Case 2, increasing the wait timeout or retrying won't help. The return value is the same, though, so MyRocks returns ER_LOCK_WAIT_TIMEOUT in Case 2, too. For the SQL user, the error message is misleading. I am not sure how big of a problem this is.
Comment by agiardullo This is good feedback. About distinguishing error cases, I can change the API to return a different status in each case. I believe there are 4 interesting cases: Case 1 above: Transaction timed out waiting to acquire a lock. In this case, it is possible that a longer timeout would have succeeded (but we don't know since we never got the lock). Case 2 above: Detected a write conflict. Case 3: Transaction has an expiration time set and has expired. Case 4: We don't have enough memtable history to determine whether there are any conflicts (the user could then choose to tune max_write_buffer_number_to_maintain). Should we have a different Status for each of these 4 cases? How about:
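As an illustration only, and not the statuses proposed in this comment, here is a minimal sketch of how a storage engine could keep the four cases apart when reporting to the SQL layer. Every name here (`LockFailure`, `SqlLayerError`, `Describe`) and every message string is hypothetical; none is taken from RocksDB or MySQL. The `longer_wait_may_help` flag encodes the distinction discussed above: only Case 1 might succeed with a longer timeout.

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; not RocksDB statuses or MySQL error codes.
enum class LockFailure {
  kLockWaitTimeout,     // Case 1: timed out waiting to acquire a lock
  kWriteConflict,       // Case 2: a write conflict was detected
  kTransactionExpired,  // Case 3: the transaction's expiration time passed
  kInsufficientHistory  // Case 4: not enough memtable history to check
};

struct SqlLayerError {
  bool longer_wait_may_help;  // true only when waiting longer could succeed
  std::string message;        // illustrative text, not a real server message
};

// Maps each failure case to a distinct report for the SQL user.
SqlLayerError Describe(LockFailure failure) {
  switch (failure) {
    case LockFailure::kLockWaitTimeout:
      return SqlLayerError{true, "timed out waiting for a row lock"};
    case LockFailure::kWriteConflict:
      return SqlLayerError{false, "write conflict with another transaction"};
    case LockFailure::kTransactionExpired:
      return SqlLayerError{false, "transaction expired"};
    case LockFailure::kInsufficientHistory:
      return SqlLayerError{false,
                           "not enough memtable history to check conflicts"};
  }
  assert(false && "unhandled LockFailure value");
  return SqlLayerError{false, "unknown locking failure"};
}
```

With a mapping like this, the misleading message from Case 2 above goes away because a write conflict is no longer reported as a lock wait timeout.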
Comment by agiardullo Re MySQL locking: I think we could come up with an API that lets you override which mutex/condvar is used for locking. But I will need to look into MySQL a bit and think about this some more.
Comment by agiardullo Sergei, how about we chat about the MyRocks locking API? I just sent you a fb msg.
Comment by spetrunia Got an interesting problem while rebasing the patch over the current tree. rocksdb.rocksdb test started to fail with an error like this:
while the expected error was:
The new error is caused by this scenario:
The problem is not observable on the current repository, because old-style The problem was not observable before the rebase, because the write (NEW-WRITE) didn't |
Comment by spetrunia
Comment by yoshinorim I recently committed https://reviews.facebook.net/D45963, which increments index_id and commits it into the data dictionary at Sequence_generator::get_and_update_next_number(). The internal begin->commit happens once per index creation. I discussed this with Herman in the diff. I think there are two ways: one is using the Transaction API and doing begin -> select for update -> update -> commit; the other is what Herman suggested, begin->update->commit at get_next_number(). I think the latter is easier and fine. Performance is not a concern since get_next_number() is called at index creation (DDL) only. Maybe it's better to switch to select for update -> update at Sequence_generator::get_and_update_next_number(), within a transaction created at ha_rocksdb::create_key_defs(), and then commit everything together?
Comment by hermanlee The select for update on the sequence_number would block other create table requests until the transaction is committed? If we're performing a restore of multiple databases where tables are created in parallel, could one of the table creates fail due to timing out on the select for update? |
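The concern can be made concrete with a toy model of the sequence row. This is a sketch under assumptions: `ToySequenceRow` and its methods are hypothetical names, not MyRocks code, and the model returns `false` immediately where a real engine would block until the holder commits or the lock wait times out.

```cpp
#include <cassert>

// Toy model only: a hypothetical sequence row protected by an exclusive
// row lock, as taken by SELECT ... FOR UPDATE. Not MyRocks code.
class ToySequenceRow {
 public:
  // Models SELECT ... FOR UPDATE. Returns false when another transaction
  // holds the row; a real engine would wait here and possibly time out.
  bool TryLockForUpdate(int trx_id) {
    if (holder_ != kNoHolder && holder_ != trx_id) return false;
    holder_ = trx_id;
    return true;
  }

  // Models the UPDATE step: hands out the current number and advances it.
  // The caller must already hold the row lock.
  unsigned UpdateNext(int trx_id) {
    assert(holder_ == trx_id);
    return next_++;
  }

  // Models COMMIT: releases the row lock held by this transaction.
  void Commit(int trx_id) {
    if (holder_ == trx_id) holder_ = kNoHolder;
  }

 private:
  enum { kNoHolder = -1 };
  int holder_ = kNoHolder;
  unsigned next_ = 1;
};
```

In this model, a second table creation cannot obtain a number until the first transaction commits, so the longer the enclosing transaction stays open, the longer parallel creates are blocked.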
…verse CF Summary: Make ha_rocksdb::index_read_map() correctly handle find_flag=HA_READ_BEFORE_KEY. Explanation how it should be handled is provided in storage/rocksdb/rocksdb-range-access.txt Test Plan: mtr t/rocksdb_range.test, used gcov to check the new code is covered Reviewers: maykov, hermanlee4, jtolmer, yoshinorim Differential Revision: https://reviews.facebook.net/D35331
Issue by spetrunia
Wednesday Jun 17, 2015 at 21:02 GMT
Originally opened as MySQLOnRocksDB#86
RocksDB's pessimistic transaction system handles locking and also takes care of storing not-yet-committed changes made by the transaction. That is, it has two counterparts in MyRocks:
If we just replace #.1, there will be data duplication (Row_table will have the same data as WriteBatchWithIndex).
Using the API to get SQL semantics
At start, we call
transaction->SetSnapshot()
this gives us:
then, reading, modifying and writing a key can be done with simple
Misc notes
Open issues
SELECT ... LOCK IN SHARE MODE. There seems to be no way to achieve shared read locks in the API.