GC may stall due to riak_cs_delete_fsm deadlock #949
Conversation
This commit addresses an issue where the use of delete concurrency for removal of the blocks of a single object can cause a riak_cs_delete_fsm worker process to deadlock and prevent the entire Riak CS garbage collection from making progress. This bug only manifests for objects that were uploaded using multipart upload.

The problem rests on the fact that the delete_blocks_remaining field of the object manifest is treated as an ordered set except when that set is initially determined for a manifest that contains metadata indicating it was a multipart upload. The reason this is an issue and that riak_cs_delete_fsm processes can persist indefinitely is that the ordsets:del_element call is used to remove elements of the delete_blocks_remaining set as the delete workers (whose count is controlled by delete_concurrency) respond that their assigned blocks have been deleted. If a call to ordsets:del_element is made with an element that does not exist in the set, or, more importantly in this case, is made on an unordered set where the element for deletion sorts less than a key ahead of it in the set, the return value is identical to the input set.

This can come into play when delete_concurrency is greater than 1 and a block delete worker, call it worker A, assigned to delete a block identified by {UUID, BlockNumber}, returns a response to the riak_cs_delete_fsm process before another delete worker, worker B, whose assigned {UUID, BlockNumber} pair sorts greater than that of worker A but appears before the pair of worker A in the set. More concisely: if the block assigned to worker A sorts less than that assigned to worker B, but appears after the block assigned to worker B in the delete_blocks_remaining set in the manifest, and worker A responds before worker B, then the entry for worker A will not be removed from the delete_blocks_remaining set. Since the delete fsm relies on delete_blocks_remaining being empty as a termination condition, the process will hang forever and GC stalls out.

This commit resolves the issue by ensuring that riak_cs_lfs_utils:block_sequences_for_manifest/1 only returns an ordered set regardless of the input manifest. Additionally, this commit adds some type specs to other key functions in the riak_cs_lfs_utils module and refactors a code block in riak_cs_delete_fsm:blocks_to_delete_from_manifest so that it does not export variables from within a case statement.
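For reference, here is a minimal shell session illustrating the `ordsets:del_element/2` behaviour described above (the UUIDs and block numbers are made up for illustration, not taken from actual manifests): on a list that is not actually ordered, deleting an element that sorts less than an earlier entry returns the input unchanged.

```erlang
%% Hypothetical data: two {UUID, BlockNumber} entries stored out of order.
1> Unordered = [{<<"uuid-b">>, 0}, {<<"uuid-a">>, 0}].
[{<<"uuid-b">>,0},{<<"uuid-a">>,0}]

%% del_element/2 walks the list only while the element to delete is
%% greater than the head; {<<"uuid-a">>,0} sorts before the first entry,
%% so the traversal stops and the input list is returned unchanged.
2> ordsets:del_element({<<"uuid-a">>, 0}, Unordered).
[{<<"uuid-b">>,0},{<<"uuid-a">>,0}]

%% Once the set really is ordered, the element is removed as expected.
3> ordsets:del_element({<<"uuid-a">>, 0}, ordsets:from_list(Unordered)).
[{<<"uuid-b">>,0}]
```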
Thanks to @bowrocker for providing an eqc property to verify the changes.
PartBlocks = initial_blocks(PartManifest?PART_MANIFEST.content_length,
                            SafeBlockSize,
                            PartManifest?PART_MANIFEST.part_id),
lists:usort(Parts ++ PartBlocks)
A very minor nitpick: `Parts` is usually a longer list than `PartBlocks`, so `PartBlocks ++ Parts` would be slightly cheaper. And maybe the list doesn't need to be sorted here, because it is sorted inside `ordsets:from_list`.
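As a quick illustration of the two points above (illustrative values only, and assuming the result is later passed through `ordsets:from_list/1` as the comment suggests): `++` copies its left-hand operand, so putting the usually shorter list on the left does slightly less work, and `ordsets:from_list/1` sorts and deduplicates its input on its own.

```erlang
%% ordsets:from_list/1 already returns a sorted, duplicate-free list,
%% so a preceding lists:usort/1 is not needed for correctness.
1> ordsets:from_list([{<<"b">>, 1}, {<<"a">>, 0}, {<<"a">>, 0}]).
[{<<"a">>,0},{<<"b">>,1}]

%% A ++ B copies every element of A, so appending the shorter list on
%% the left (PartBlocks ++ Parts) copies fewer elements.
```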
I am glad you mentioned the sorting here. I did it this way very intentionally. Here is some test data to show my motivation for doing so: https://gist.github.com/kellymclaughlin/3768652d9af269c45897.
I will take your suggestion and push a commit to address the order of the list append operation. Good idea.
Recently `repl_test` has been failing non-deterministically, but that's another issue.
+1 d09ed41. Regardless of my nitpick comment, either fixing or not fixing it would be fine.
Note: I've been trying to reproduce the problem with my local env for almost an hour without success, although I succeeded in reproducing the problem with release/1.5 in minutes.
GC may stall due to riak_cs_delete_fsm deadlock Reviewed-by: kuenishi
@borshop merge
GC may stall due to riak_cs_delete_fsm deadlock Reviewed-by: bowrocker
GC can stall when a `riak_cs_delete_fsm` worker process encounters a deadlock condition. It is related to the `delete_concurrency` setting, but only has the potential to affect files that were uploaded using multipart upload.

The problem rests on the fact that we treat the `delete_blocks_remaining` field of the object manifest as an ordered set except when that set is initially determined for a manifest that contains metadata indicating it was a multipart upload. The initial assignment for `delete_blocks_remaining` can be seen here. The code that fails to properly order the list of upload parts is here.

The reason this is an issue and that `riak_cs_delete_fsm` processes hang around indefinitely is that the `ordsets:del_element` call is used to remove elements of the `delete_blocks_remaining` set as the delete workers (whose count is controlled by `delete_concurrency`) respond that their assigned blocks have been deleted. If a call to `ordsets:del_element` is made with an element that does not exist in the set, or, more importantly in this case, is made on an unordered set where the element for deletion sorts less than a key ahead of it in the set, the return value is identical to the input set.

This can come into play when `delete_concurrency` is greater than 1 and a block delete worker, call it worker A, assigned to delete a block identified by `{UUID, BlockNumber}`, returns a response to the `riak_cs_delete_fsm` process before another delete worker, worker B, whose assigned `{UUID, BlockNumber}` pair sorts greater than that of worker A but appears before the pair of worker A in the set. More concisely: if the block assigned to worker A sorts less than that assigned to worker B, but appears after the block assigned to worker B in the `delete_blocks_remaining` set in the manifest, and worker A responds before worker B, then the entry for worker A will not be removed from the `delete_blocks_remaining` set. Since the delete fsm relies on `delete_blocks_remaining` being empty as a termination condition, the process will hang forever and GC stalls out.

Here is a snippet of the return value from a call to `riak_cs_lfs_utils:block_sequences_for_manifest` for a multipart-uploaded file: