Performance drops over time for UPDATE to same row #1240

Closed
jeffreims opened this issue Apr 23, 2019 · 5 comments
Labels: area/docdb (YugabyteDB core features), community/request (Issues created by external users), kind/enhancement (This is an enhancement of an existing feature)

Comments

@jeffreims

I noticed a performance drop when updating a single row many times in a table.

The drop seems to get worse the more the row is updated, and it affects both UPDATEs and SELECTs on that row.

This is the performance of a SELECT on a row that has been updated many times (>20k times):

postgres=# \timing on
Timing is on.
postgres=# select * from latest;
                                id                                |               version                | updated_at | scheduled_at | engaged_at | engaged_by | status 
------------------------------------------------------------------+--------------------------------------+------------+--------------+------------+------------+--------
 632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c556 | 37160dd8-6f37-4eb9-a8ff-f613abfcbfeb |          0 |            0 |            |            | 
(1 row)

Time: 245.634 ms

postgres=# select version from latest WHERE id = '632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c556';
               version                
--------------------------------------
 37160dd8-6f37-4eb9-a8ff-f613abfcbfeb
(1 row)

Time: 352.805 ms
postgres=# select version from latest WHERE id = '632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c556';
               version                
--------------------------------------
 37160dd8-6f37-4eb9-a8ff-f613abfcbfeb
(1 row)

Time: 171.439 ms

Now inserting a new row in the same table:

INSERT 0 1
Time: 14.729 ms
postgres=# select * from latest where id = '632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c558';
                                id                                |               version                | updated_at | scheduled_at | engaged_at | engaged_by | status 
------------------------------------------------------------------+--------------------------------------+------------+--------------+------------+------------+--------
 632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c558 | 96643506-f838-43fb-8d01-2306f07db396 |          0 |            0 |            |            | 
(1 row)

Time: 2.291 ms
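
For reference, a minimal sketch of the kind of hot-row update loop that produces this pattern (hypothetical; the actual test case is not shown here, and the table/column types are guessed from the output above):

-- Hypothetical hot-row reproduction sketch; assumes latest(id, version uuid, ...).
-- Each UPDATE commits on its own (autocommit).
UPDATE latest
   SET version = md5(clock_timestamp()::text)::uuid
 WHERE id = '632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c556';
-- In psql, \watch 0.05 (or a client-side loop) can repeat this tens of thousands of times.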


@kmuthukk
Collaborator

kmuthukk commented Apr 23, 2019

thx @jeffreims for reporting this issue, and for providing the test case as well.

We were able to reproduce the issue, and the problem seems specific to a row being updated many times, where we have intent records (from previous mutations to the row) that aren't being garbage collected aggressively enough. We are working on an optimization to speed up this case, and will keep you posted.

Meanwhile, one question: will it be a common workload pattern in your use case for the same row to undergo thousands of updates within a short span of time, or was that more symptomatic of the way the test program was written? If, in practice, the updates are spread over many rows and updates to the same row are spread out over time, then you should not really see this issue, because garbage collection of older intent records will keep happening periodically.
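
For illustration only (this sketch is not from the original report; the bind parameter is hypothetical), the spread-out pattern described above looks roughly like this:

-- Illustrative sketch: the same update volume spread across many distinct rows,
-- so no single key accumulates thousands of un-collected intent records.
-- Here $1 stands for a different existing id on each iteration of a client-side loop.
UPDATE latest
   SET version = md5(clock_timestamp()::text)::uuid
 WHERE id = $1;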

kmuthukk added the area/docdb (YugabyteDB core features) label Apr 23, 2019
@jeffreims
Author

Meanwhile, one question: will it be a common workload pattern in your use case for the same row to undergo thousands of updates within a short span of time, or was that more symptomatic of the way the test program was written? If, in practice, the updates are spread over many rows and updates to the same row are spread out over time, then you should not really see this issue, because garbage collection of older intent records will keep happening periodically.

We normally will not be making that many updates within a short time on a single row.

jeffreims reopened this Apr 23, 2019
@kmuthukk
Collaborator

thx for the additional data point @jeffreims. Glad to hear this won't be a common pattern for you in the near term. Nevertheless, we'll work on optimizing this case.

yugabyte-ci pushed a commit that referenced this issue Jun 13, 2019
Summary:
RocksDB contains 2 implementations of skip list.
`InlineSkipList`, which we've been using prior to this revision, allows concurrent writes and stores keys of variable length in Node.
`SkipList` allows concurrent reads but only one writer at a time, so it should really be called `SingleWriterSkipList`.

Since we write to RocksDB only from a single thread, we could benefit from using `SkipList`.
It is also easier to implement erase in such a list.

This diff adapts the interface of `SkipList` (the single-writer skip list) so it can be used by SkipListRep.
It also adds the ability to store keys of variable length in Node.

Test Plan: Jenkins

Reviewers: timur, mikhail

Reviewed By: mikhail

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D6595
yugabyte-ci added the community/request (Issues created by external users) label Jul 17, 2019
ttyusupov added the kind/enhancement (This is an enhancement of an existing feature) label Jul 18, 2019
@ddorian
Contributor

ddorian commented Jul 21, 2019

This could also be fixed by changing how the cache works:
keep a per-row cache (instead of a per-block cache), append the update on disk, and update the row in memory.
The code change would be very big, but it should provide better performance. A per-key-value cache in RocksDB would be another option.

@kmuthukk
Collaborator

hi @ddorian -- There is a fix for this in the works, currently under code review. At a high level, it has to do with more aggressively garbage collecting provisional/intent records for distributed transactions.

Will keep you posted. We also plan to move our code reviews (Phabricator tool) into the open soon, so in the future we will be able to reference/point to work-in-progress changes easily.

spolitov added a commit that referenced this issue Jul 24, 2019
Summary:
When a YugaByte DB transaction writes data, it does so in two phases:
1) Intents are written to the intents RocksDB, showing that the transaction is planning to write the corresponding values when it commits.
2) After the transaction is committed, those intents are converted to values in the regular RocksDB and deleted from the intents RocksDB of the tablet.

But RocksDB does not simply delete the original record from memory. Instead, it adds a "delete mark" for this record.
So both the original record and the delete mark remain in the memtable.

When a flush happens, RocksDB can drop the original record but has to keep the delete mark, because it does not "know" that this record is absent from earlier SSTables. By the default RocksDB logic, such delete marks can only be removed during a major compaction.
While such delete marks are not visible to the user, they are still fetched while looking up the corresponding key.
That could significantly decrease performance, especially in a "hot row" use case where the same row keeps getting overwritten.

This diff adds a new option to RocksDB, `in_memory_erase`, which is used for the intents RocksDB.
When it is set to true, RocksDB tries to erase the key directly from the memtable during deletion.
If that succeeds, the delete mark is not written at all.

There are the following restrictions:
1) All keys should be unique, because otherwise the previous value of such a key could be present in an SSTable file and would not be deleted.
   This already holds for our RocksDBs because we always use key suffixes based on hybrid time.
2) With in-memory erase enabled, a RocksDB iterator won't be able to read a point-in-time snapshot of data anymore (which is normally based on RocksDB sequence numbers), because key/value pairs belonging to such a snapshot could be removed in parallel.
   For the intents DB it is manageable because we only delete intents for already applied or aborted transactions, so these intents can't participate in read results anyway.
   However, to make this work, we have to delay the deletion of intents until the transaction is removed from the transaction participant, which is another change in this diff.

The following fixes and improvements were needed to make the tests pass:
1) Store recently removed transactions; this helps to avoid:
  - Attempting to load their metadata from RocksDB.
  - Adding intents for a recently removed transaction, since they would only be cleaned up much later.
2) Correctly fill metadata_state in the client part of the transaction when read operations are being prepared.
3) Ignore serial_no when notifying a status waiter about an aborted transaction.

Test Plan: ybd --cxx-test snapshot-txn-test --gtest_filter SnapshotTxnTest.HotRow

Reviewers: timur, alex, mikhail

Reviewed By: mikhail

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D6860