fix race condition leading to seg faults #741
Closed
anshalshukla wants to merge 2 commits into
Collaborator
Author
Closing this as the other approach suggested by @ch4r10t33r seems better.
6 tasks
Member
What is currently blocking scheduling the network callback in the main event loop (#326)? A lot of mutexes have been added recently, which could cause livelock/deadlock issues or a perf regression. It should be easy to avoid the race by reducing the logic in the network thread. @anshalshukla @ch4r10t33r @g11tech
Collaborator
Author
Included in #765.
The fetched_blocks map in the network layer is accessed by both the main thread and the libp2p thread. The main thread was using a shallow copy of the blocks, which caused segmentation faults during long chain syncs. This happened because blocks could be pruned after finalization while still being referenced by the main thread.

This issue doesn’t appear when the chain remains fully synced, since pruning and access don’t overlap in the same way.

The issue happens because the hashmap resizes when new blocks are inserted on the network thread. This causes rehashing, which moves keys to different positions. Meanwhile, the main thread still relies on the old positions based on the previous size, so those references become invalid and lead to a panic.
To fix this, don’t share internal hashmap positions across threads. Either lock the hashmap while it’s being accessed or pass a full copy of the data instead of references.
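The two options can be combined: take a lock for every access and hand the caller a full copy rather than a reference into the map. Below is a minimal, language-agnostic sketch of that idea in Python. It is not the project's code: the `FetchedBlocks` wrapper, the `Block` fields, and the method names are all hypothetical and only illustrate the pattern.

```python
import copy
import threading
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Block:
    # Hypothetical block shape, for illustration only.
    slot: int
    body: bytearray = field(default_factory=bytearray)


class FetchedBlocks:
    """Hypothetical wrapper: every access is locked, and reads return deep copies."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._blocks: Dict[bytes, Block] = {}

    def insert(self, root: bytes, block: Block) -> None:
        # Called from the network thread; any resize happens under the lock.
        with self._lock:
            self._blocks[root] = block

    def get_copy(self, root: bytes) -> Optional[Block]:
        # Called from the main thread; the copy is made while the lock is held,
        # so the caller never holds a reference into the shared map.
        with self._lock:
            block = self._blocks.get(root)
            return copy.deepcopy(block) if block is not None else None

    def prune_up_to(self, finalized_slot: int) -> int:
        # Remove blocks at or below the finalized slot; returns how many were removed.
        with self._lock:
            stale = [r for r, b in self._blocks.items() if b.slot <= finalized_slot]
            for r in stale:
                del self._blocks[r]
            return len(stale)
```

With this shape, a copy obtained through `get_copy` stays valid even if the entry is later pruned or the map grows, at the cost of one copy per read and contention on a single lock.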