Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long lived Iterator pinning flushed MemTable and compacted SSTs #10536

Open
rockeet opened this issue Aug 17, 2022 · 3 comments
Open

Long lived Iterator pinning flushed MemTable and compacted SSTs #10536

rockeet opened this issue Aug 17, 2022 · 3 comments

Comments

@rockeet
Copy link
Contributor

rockeet commented Aug 17, 2022

Expected behavior

Flushed MemTables and compacted SSTs should be deleted.

Actual behavior

Long lived Iterators referring Version objects which referring MemTables and compacted SSTs which lead to memory over consumption while writing to DB.

Steps to reproduce the behavior

Create an iterator and using it while writing to DB, newly created MemTables can not be freed.


We noticed this issue when we running sysbench on myrocks, we also have filed an issue and PR for MyRocks, we worked around this issue by creating a new Iterator and delete old Iterator periodically. It is better to resolve this issue in RocksDB.

@mdcallag
Copy link
Contributor

I agree it would be great to solve this in RocksDB. While it isn't for me to decide, I am not sure sure that will happen given the complexity.

For all of the MVCC engines that I understand a long-lived transaction isn't free. It blocks vacuum in Postgres, leading to much space-amp. It blocks purge in InnoDB, leading to much space-amp. And here you describe the cost of it for RocksDB.

The workaround (or solution) for Postgres and InnoDB is to kill long open transactions when they become a problem. Since this issue was inspired by a MyRocks problem, then the solution works there as well. But that is only a solution for long open transactions created by users, and in your case the problem is a long-open iterator used during index creation.

So my solution implies you can only avoid the problem in MyRocks if you don't create secondary indexes on existing large tables, or use OSC to do that. Unfortunately that isn't a great workaround.

I appreciate that you filed this and provided a PR for MyRocks to avoid this during create index. But again, I am not sure that RocksDB will fix this.

@cbi42
Copy link
Member

cbi42 commented Aug 17, 2022

Is this related to #10487? It seems doing an iterator refresh with snapshot could free the pinned resources.

@rockeet
Copy link
Contributor Author

rockeet commented Aug 18, 2022

Is this related to #10487? It seems doing an iterator refresh with snapshot could free the pinned resources.

Yes, this issue has some relation to #10487, but #10487 needs to change user API.

This issue should be solved without user API change, BTW: if #10487 is resolved, this issue can be solved based on it.

facebook-github-bot pushed a commit that referenced this issue Sep 15, 2023
Summary:
This PR resolves #10487 & #10536, user code needs to call Refresh() periodically.

The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is `db_iter_->set_sequence(read_seq)  -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno()`.

This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator.

Pull Request resolved: #10594

Test Plan:
* New unit tests.
* Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case.

TODO in a following PR:
* Stress test Iterator::Refresh() to different snapshots or no snapshot.

Reviewed By: ajkr

Differential Revision: D48456896

Pulled By: cbi42

fbshipit-source-id: 2e642c04e91235cc9542ef4cd37b3c20823bd779
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants