Performance drops over time for UPDATE to same row #1240

Closed
jeffreims opened this issue Apr 23, 2019 · 5 comments
Labels: area/docdb (YugabyteDB core features), community/request (Issues created by external users), kind/enhancement (This is an enhancement of an existing feature)

Comments

@jeffreims

I noticed a performance drop when updating a single row many times in a table.

The drop seems to get worse the more the row is updated, and it affects both UPDATEs and SELECTs on that row.

This is the performance of a SELECT on a row that has been updated many times (>20k times):

postgres=# \timing on
Timing is on.
postgres=# select * from latest;
                                id                                |               version                | updated_at | scheduled_at | engaged_at | engaged_by | status 
------------------------------------------------------------------+--------------------------------------+------------+--------------+------------+------------+--------
 632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c556 | 37160dd8-6f37-4eb9-a8ff-f613abfcbfeb |          0 |            0 |            |            | 
(1 row)

Time: 245.634 ms

postgres=# select version from latest WHERE id = '632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c556';
               version                
--------------------------------------
 37160dd8-6f37-4eb9-a8ff-f613abfcbfeb
(1 row)

Time: 352.805 ms
postgres=# select version from latest WHERE id = '632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c556';
               version                
--------------------------------------
 37160dd8-6f37-4eb9-a8ff-f613abfcbfeb
(1 row)

Time: 171.439 ms

Now inserting a new row in the same table:

INSERT 0 1
Time: 14.729 ms
postgres=# select * from latest where id = '632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c558';
                                id                                |               version                | updated_at | scheduled_at | engaged_at | engaged_by | status 
------------------------------------------------------------------+--------------------------------------+------------+--------------+------------+------------+--------
 632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c558 | 96643506-f838-43fb-8d01-2306f07db396 |          0 |            0 |            |            | 
(1 row)

Time: 2.291 ms
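
For reference, a minimal sketch of the kind of hot-row update loop that produces this pattern (hypothetical; the actual test case is not shown here, and the table/column types are guessed from the output above):

-- Hypothetical hot-row reproduction sketch; assumes latest(id, version uuid, ...).
-- Each UPDATE commits on its own (autocommit).
UPDATE latest
   SET version = md5(clock_timestamp()::text)::uuid
 WHERE id = '632a13aa10c12ba91f8806febb1ba998cb154d9d02454e56bf221b2aff27c556';
-- In psql, \watch 0.05 (or a client-side loop) can repeat this tens of thousands of times.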


@kmuthukk
Collaborator

kmuthukk commented Apr 23, 2019

thx @jeffreims for reporting this issue, and for providing the test case as well.

We were able to reproduce the issue, and the problem seems specific to a row being updated many times, where we have intent records (from previous mutations to the row) that aren't being garbage collected aggressively enough. We are working on an optimization to speed up this case, and will keep you posted.

Meanwhile, one question: will it be a common workload pattern in your use case for the same row to undergo thousands of updates within a short span of time, or was that more symptomatic of the way the test program was written? If, in practice, the updates are spread over many rows and updates to the same row are spread out over time, then you should not really see this issue, because garbage collection of older intent records will keep happening periodically.
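
For illustration only (this sketch is not from the original report; the bind parameter is hypothetical), the spread-out pattern described above looks roughly like this:

-- Illustrative sketch: the same update volume spread across many distinct rows,
-- so no single key accumulates thousands of un-collected intent records.
-- Here $1 stands for a different existing id on each iteration of a client-side loop.
UPDATE latest
   SET version = md5(clock_timestamp()::text)::uuid
 WHERE id = $1;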

kmuthukk added the area/docdb (YugabyteDB core features) label Apr 23, 2019
@jeffreims
Author

Meanwhile, one question: will it be a common workload pattern in your use case for the same row to undergo thousands of updates within a short span of time, or was that more symptomatic of the way the test program was written? If, in practice, the updates are spread over many rows and updates to the same row are spread out over time, then you should not really see this issue, because garbage collection of older intent records will keep happening periodically.

We normally will not be making that many updates within a short time on a single row.

jeffreims reopened this Apr 23, 2019
@kmuthukk
Collaborator

thx for the additional data point @jeffreims. Glad to hear this won't be a common pattern for you in the near term. Nevertheless, we'll work on optimizing this case.

yugabyte-ci pushed a commit that referenced this issue Jun 13, 2019
Summary:
RocksDB contains 2 implementations of skip list.
`InlineSkipList`, which we've been using prior to this revision, allows concurrent writes and stores keys of variable length in Node.
`SkipList` allows concurrent reads but only one writer at a time, so it should really be called `SingleWriterSkipList`.

Since we write to RocksDB only from a single thread, we could benefit from using `SkipList`.
It is also easier to implement erase in such a list.

This diff adapts the interface of `SkipList` (the single-writer skip list) so it can be used by SkipListRep.
It also adds the ability to store keys of variable length in Node.

Test Plan: Jenkins

Reviewers: timur, mikhail

Reviewed By: mikhail

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D6595
yugabyte-ci added the community/request (Issues created by external users) label Jul 17, 2019
ttyusupov added the kind/enhancement (This is an enhancement of an existing feature) label Jul 18, 2019
@ddorian
Contributor

ddorian commented Jul 21, 2019

This could also be fixed by changing how the cache works:
keep a per-row cache (instead of a per-block cache), append the update on disk, and update the row in memory.
The code change would be very big, but it should provide better performance. A per-key-value cache in RocksDB would be another option.

@kmuthukk
Collaborator

hi @ddorian -- There is a fix for this in the works, currently under code review. At a high level, it has to do with more aggressively garbage collecting provisional/intent records for distributed transactions.

Will keep you posted. We also plan to move our code reviews (Phabricator tool) into the open soon, so in the future we will be able to reference/point to work-in-progress changes easily.

spolitov added a commit that referenced this issue Jul 24, 2019
Summary:
When a YugaByte DB transaction writes data, it does so in two phases:
1) Intents are written to the intents RocksDB, showing that the transaction is planning to write the corresponding values when it commits.
2) After the transaction is committed, those intents are converted to values in the regular RocksDB and deleted from the intents RocksDB of the tablet.

But RocksDB does not simply delete the original record from memory. Instead, it adds a "delete mark" for this record.
So both the original record and the delete mark remain in the memtable.

When a flush happens, RocksDB can drop the original record but has to keep the delete mark, because it does not "know" that this record is absent from earlier SSTables. By the default RocksDB logic, such delete marks can only be removed during a major compaction.
While such delete marks are not visible to the user, they are still fetched while looking up the corresponding key.
That could significantly decrease performance, especially in a "hot row" use case where the same row keeps getting overwritten.

This diff adds a new option to RocksDB, `in_memory_erase`, which is used for the intents RocksDB.
When it is set to true, RocksDB tries to erase the key directly from the memtable during deletion.
If that succeeds, the delete mark is not written at all.

There are the following restrictions:
1) All keys should be unique, because otherwise the previous value of such a key could be present in an SSTable file and would not be deleted.
   This already holds for our RocksDBs because we always use key suffixes based on hybrid time.
2) With in-memory erase enabled, a RocksDB iterator won't be able to read a point-in-time snapshot of data anymore (which is normally based on RocksDB sequence numbers), because key/value pairs belonging to such a snapshot could be removed in parallel.
   For the intents DB it is manageable because we only delete intents for already applied or aborted transactions, so these intents can't participate in read results anyway.
   However, to make this work, we have to delay the deletion of intents until the transaction is removed from the transaction participant, which is another change in this diff.

The following fixes and improvements were needed to make the tests pass:
1) Store recently removed transactions; this helps to avoid:
  - Attempting to load their metadata from RocksDB.
  - Adding intents for a recently removed transaction, since they would only be cleaned up much later.
2) Correctly fill metadata_state in the client part of the transaction when read operations are being prepared.
3) Ignore serial_no when notifying a status waiter about an aborted transaction.

Test Plan: ybd --cxx-test snapshot-txn-test --gtest_filter SnapshotTxnTest.HotRow

Reviewers: timur, alex, mikhail

Reviewed By: mikhail

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D6860