Eraser/Reaper/Repl-Src - Unbounded queues and crashing with large loop states #1807

Closed
martinsumner opened this issue Dec 3, 2021 · 3 comments

@martinsumner
Contributor

An attempt was made on a cluster to run a very large aae_fold erase_keys query. The fold ran ok in count mode, indicating 360M keys were available to be erased. However, when run with a change_mode of local, multiple nodes crashed.

The queue within the loop state of the riak_kv_eraser process is unbounded. It is expected that it might have to grow to a large value, as erase_keys folds that push to the queue may be fast, but the deletion process that consumes from the queue is slow. The references on the queue are small - but in this case 60M references were enough to cause memory allocation problems.

The issue is made worse because there is no format_status/2 function to restrict the logging of loop state when the process crashes, so any attempt to record the crash would itself have caused significant memory issues.

The riak_kv_reaper process has a similar issue - both an unbounded queue and a missing format_status/2 function.

The riak_kv_replrtq_src process has a bounded queue, but no format_status/2 function.
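
For illustration, a minimal sketch of the kind of format_status/2 callback that is missing. This is not the riak_kv code: the module name, the record, and its fields are hypothetical, and the only point is that the queue is replaced by a small summary before the state is logged or reported. Note that newer OTP releases prefer format_status/1; format_status/2 is used here because it is the callback named above.

```erlang
-module(queue_status_sketch).
-behaviour(gen_server).

-export([start_link/0, push/2]).
-export([init/1, handle_call/3, handle_cast/2, format_status/2]).

%% Hypothetical loop state; the real riak_kv_eraser state differs.
-record(state, {queue = queue:new(), queue_length = 0}).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

push(Pid, Ref) ->
    gen_server:cast(Pid, {push, Ref}).

init([]) ->
    {ok, #state{}}.

handle_call(_Msg, _From, State) ->
    {reply, ok, State}.

handle_cast({push, Ref}, State = #state{queue = Q, queue_length = L}) ->
    {noreply, State#state{queue = queue:in(Ref, Q), queue_length = L + 1}}.

%% normal:    used by sys:get_status/1
%% terminate: used when the process terminates abnormally and logs its state
%% In both cases only a summary is returned, never the queue itself.
format_status(normal, [_PDict, State]) ->
    [{data, [{"State", summarise(State)}]}];
format_status(terminate, [_PDict, State]) ->
    summarise(State).

summarise(#state{queue_length = L}) ->
    [{queue, elided}, {queue_length, L}].
```

With a callback like this, a crash report carries the queue length rather than tens of millions of references.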

@martinsumner
Contributor Author

Perhaps for eraser/reaper, rather than simply having a limit and discarding at the limit, disk_log could be used to persist references once the limit has been reached. Then, should the in-memory queue ever become empty, a batch of logged erases could be read back from the disk_log.

This allows very large jobs to be worked through slowly, without running into memory risks. The disk_log folder should be cleaned at startup (rather than potentially re-reading very old logged erases). The disk_log folder is intended strictly for the purpose of saving memory, not for surviving process restarts.
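
A rough sketch of the overflow idea, under stated assumptions: the module name, record, and function names are hypothetical, a single halt log in internal format is used, and ordering is only approximately FIFO (items go to disk once the in-memory limit is hit, and are read back only when the in-memory queue is empty). It is not the implementation in the linked PR.

```erlang
-module(overflow_queue_sketch).
-export([new/2, in/2, out/1, close/1]).

%% Hypothetical state: an in-memory queue plus a disk_log for overflow.
-record(oq, {mem = queue:new(), mem_len = 0, limit,
             log, cont = start, on_disk = 0}).

%% File is a filename string. Any old log is deleted at startup, as the
%% log exists to save memory and not to survive restarts.
new(File, Limit) when is_integer(Limit), Limit > 0 ->
    _ = file:delete(File),
    {ok, Log} = disk_log:open([{name, {?MODULE, File}}, {file, File},
                               {type, halt}, {format, internal}]),
    #oq{limit = Limit, log = Log}.

%% Below the limit, keep the item in memory; otherwise append it to disk.
in(Item, OQ = #oq{mem = Q, mem_len = L, limit = Limit}) when L < Limit ->
    OQ#oq{mem = queue:in(Item, Q), mem_len = L + 1};
in(Item, OQ = #oq{log = Log, on_disk = D}) ->
    ok = disk_log:log(Log, Item),
    OQ#oq{on_disk = D + 1}.

out(OQ = #oq{mem_len = 0, on_disk = 0}) ->
    {empty, OQ};
out(OQ = #oq{mem_len = 0}) ->
    %% Memory is drained but items remain on disk: read a batch back.
    out(refill(OQ));
out(OQ = #oq{mem = Q, mem_len = L}) ->
    {{value, Item}, Q2} = queue:out(Q),
    {{value, Item}, OQ#oq{mem = Q2, mem_len = L - 1}}.

close(#oq{log = Log}) ->
    disk_log:close(Log).

%% Read up to Limit terms from the log, continuing from the last position.
refill(OQ = #oq{log = Log, cont = Cont, limit = Limit, on_disk = D}) ->
    case disk_log:chunk(Log, Cont, Limit) of
        eof ->
            reset_log(OQ);
        {error, Reason} ->
            error({disk_log_chunk_failed, Reason});
        Chunk ->
            %% Either {Cont2, Terms} or {Cont2, Terms, BadBytes}
            Cont2 = element(1, Chunk),
            Terms = element(2, Chunk),
            N = length(Terms),
            OQ2 = OQ#oq{mem = queue:from_list(Terms), mem_len = N,
                        cont = Cont2, on_disk = D - N},
            case OQ2#oq.on_disk of
                0 -> reset_log(OQ2);
                _ -> OQ2
            end
    end.

%% Everything has been read back: truncate so the file does not grow forever.
reset_log(OQ = #oq{log = Log}) ->
    ok = disk_log:truncate(Log),
    OQ#oq{cont = start, on_disk = 0}.
```

In this sketch memory use is bounded by roughly Limit references (plus one batch read back from disk), at the cost of disk I/O for very large jobs.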

@martinsumner
Contributor Author

#1808

@martinsumner
Contributor Author

Related PR in kv_index_tictactree - martinsumner/kv_index_tictactree#103

@martinsumner martinsumner self-assigned this Mar 9, 2022