Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: ensure existing p2p connection is removed before reconnecting #4045

Merged
merged 5 commits into from Sep 26, 2023

Conversation

msgmaxim
Copy link
Contributor

Pull Request

Closes: PRO-857

Checklist

Please conduct a thorough self-review before opening the PR.

  • I am confident that the code works.
  • I have updated documentation where appropriate.

Summary

  • handle an edge case where peer info gets updated while there is already scheduled reconnection for that peer (most likely reason for the panic observed on sisyphos) by removing the old connection (it is easier than trying to ensure that the existing connection is healty without introducing more race conditions)
  • log instead of panicking if connection already exist when trying to connect (it is unexpected, but not critical by any means)
  • refactor: when receiving our own peer info for the first time, we now use peer_infos instead of maintaining a separate list of peers that we delay connecting to; peer_infos should be more up-to-date and doesn't store duplicates (which could also cause the above panic). Added a unit test to check that this functionality still works.

`peer_infos` has the most up-to-date info on current nodes,
and can't contain duplicates, so we use that instead of maintaining
a separate list of peers to connect to.
@linear
Copy link

linear bot commented Sep 25, 2023

PRO-857 P2P panic

We have seen this assert panic on sisyphos: "assert!(self.active_connections.insert(account_id, connected_socket).is_none());"

https://grafana.monitoring/explore?orgId=1&left={"datasource":"5qp7Hx-nz","queries":[{"refId":"A","datasource":{"type":"loki","uid":"5qp7Hx-nz"},"editorMode":"code","expr":"{network%3D\"sisyphos\", unit%3D~\"chainflip-.*.service\"} |%3D \"panic\"","queryType":"range"}],"range":{"from":"1695383348000","to":"1695386948000"}}

This occurred when TomJ started a duplicate engine with the same node and keys as an existing one (i.e. two engines with the same p2p key etc).

This panic occured on all the nodes of the network.

engine/src/p2p/core/tests.rs Outdated Show resolved Hide resolved
engine/src/p2p/core/tests.rs Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Sep 26, 2023

Codecov Report

Merging #4045 (9bf950e) into main (3703cf2) will decrease coverage by 0%.
Report is 5 commits behind head on main.
The diff coverage is 94%.

@@          Coverage Diff           @@
##            main   #4045    +/-   ##
======================================
- Coverage     72%     72%    -0%     
======================================
  Files        369     369            
  Lines      58782   59322   +540     
  Branches   58782   59322   +540     
======================================
+ Hits       42398   42672   +274     
- Misses     14284   14530   +246     
- Partials    2100    2120    +20     
Files Coverage Δ
engine/src/p2p/core/tests.rs 100% <100%> (ø)
engine/src/p2p/core.rs 83% <67%> (-1%) ⬇️

... and 26 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@msgmaxim msgmaxim merged commit d69ebb6 into main Sep 26, 2023
44 checks passed
@msgmaxim msgmaxim deleted the fix/p2p-panic branch September 26, 2023 06:09
dandanlen pushed a commit that referenced this pull request Sep 28, 2023
…4045)

* refactor: simplify initial connection to peers

`peer_infos` has the most up-to-date info on current nodes,
and can't contain duplicates, so we use that instead of maintaining
a separate list of peers to connect to.

* fix: allow overwriting existing p2p connection

* chore: more accurate comment

* refactor: send_and_receive helper in p2p tests

* chore: use different ports in p2p tests
dandanlen pushed a commit that referenced this pull request Sep 29, 2023
…4045)

* refactor: simplify initial connection to peers

`peer_infos` has the most up-to-date info on current nodes,
and can't contain duplicates, so we use that instead of maintaining
a separate list of peers to connect to.

* fix: allow overwriting existing p2p connection

* chore: more accurate comment

* refactor: send_and_receive helper in p2p tests

* chore: use different ports in p2p tests
dandanlen pushed a commit that referenced this pull request Oct 9, 2023
…4045)

* refactor: simplify initial connection to peers

`peer_infos` has the most up-to-date info on current nodes,
and can't contain duplicates, so we use that instead of maintaining
a separate list of peers to connect to.

* fix: allow overwriting existing p2p connection

* chore: more accurate comment

* refactor: send_and_receive helper in p2p tests

* chore: use different ports in p2p tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants