
Random Sync Failure Reported by Multiple Users - Continues the 10788 Discussion. #10906

Closed
5 of 11 tasks
Tracked by #8
TippyFlitsUK opened this issue May 22, 2023 · 13 comments
Labels
area/chain Area: Chain kind/bug Kind: Bug need/analysis Hint: Needs Analysis need/team-input Hint: Needs Team Input P1 P1: Must be resolved

Comments

@TippyFlitsUK
Contributor

TippyFlitsUK commented May 22, 2023

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the Latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving (WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Various - Please see discussion links in the description

Repro Steps

  1. Run lotus daemon
  2. Lotus loses and regains sync in a cyclical manner

Describe the Bug

This new issue aims to collate and organise multiple instances of SP feedback from the existing #10788 issue and across other messaging platforms such as Slack.

A new discussion thread has been created for each contributor and existing feedback has been pre-filled with the details that have already been provided.

The Lotus Team would be very grateful if returning contributors could add additional feedback and logs to their own dedicated discussion thread as and when it becomes available.

If you are experiencing this issue for the first time, please feel free to add your feedback and logs below this thread and the team will distribute it accordingly.

Many thanks all!! 🙏

Logging Information

Various - Please see discussion links in the description
@TippyFlitsUK TippyFlitsUK added P1 P1: Must be resolved kind/bug Kind: Bug need/team-input Hint: Needs Team Input need/analysis Hint: Needs Analysis area/chain Area: Chain labels May 22, 2023
@TippyFlitsUK TippyFlitsUK changed the title Random Sync Failure Reported by Multiple Users - Continues the [10788](https://github.com/filecoin-project/lotus/issues/10788) Discussion. Random Sync Failure Reported by Multiple Users - Continues the 10788 Discussion. May 22, 2023
@arajasek
Contributor

arajasek commented Jun 12, 2023

This continues to be investigated by multiple parties. Here's an update on one particular hypothesis we've been testing:

Theory: A bug in Lotus causes us to incorrectly drop pubsub scores for peers when they propagate "local" messages to us. This causes the local message propagator to stop receiving blocks from peers, and thus to fall out of sync.
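To make the suspected mechanism concrete, here is a minimal sketch in terms of the go-libp2p-pubsub validator API; it is not lotus's actual validation code, and the msgAlreadyKnown helper is hypothetical. The point is simply that a validator verdict of ValidationReject counts as an invalid delivery against the forwarding peer's score, while ValidationIgnore drops the message without penalizing anyone.

```go
// Minimal sketch, assuming the go-libp2p-pubsub validator API; this is NOT
// lotus's actual validation code.
package example

import (
	"context"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

// msgAlreadyKnown is a hypothetical stand-in for whatever check decides a
// message is a duplicate / not worth forwarding.
func msgAlreadyKnown(msg *pubsub.Message) bool { return false }

func registerValidator(ps *pubsub.PubSub, topic string) error {
	return ps.RegisterTopicValidator(topic,
		func(ctx context.Context, from peer.ID, msg *pubsub.Message) pubsub.ValidationResult {
			if msgAlreadyKnown(msg) {
				// Reject counts as an invalid delivery and lowers the
				// forwarding peer's score; Ignore would drop the message
				// without penalizing anyone.
				return pubsub.ValidationReject
			}
			return pubsub.ValidationAccept
		})
}
```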

To determine whether this is the case, @magik6k and @TippyFlitsUK have been running nodes with extra logs that should provide some information on when messages coming from peers are either Ignored or Rejected. They have also been trying to stay connected to each other while reproducing the issue that causes them to fall out of sync.

Next steps: We need to look at these logs, and try to confirm the theory. We are interested in all logs introduced by the commit linked above in general, but especially those pertaining to @magik6k and @TippyFlitsUK's peer IDs, and ESPECIALLY those pertaining to those peer IDs when the node in question is out of sync. We should be able to piece all of this information together based on the info they have shared.

If we do see penalties being applied to their peers, we need to assess whether the penalties are valid (the messages being sent are, in fact, "wrong" in some way worthy of a score-cut), or whether they are invalid (the messages are "good", and shouldn't be penalized). Once we know that, we can discuss a fix.

Note that this is just one of MANY theories we're testing; it is NOT the definitive next path for the issue at hand.

@arajasek
Contributor

Having investigated a bit more, there is one funky thing we appear to do: We penalize peers who send us a message we've already added to our mpool. I'm not sure that this is the cause, but it does just seem wrong -- I've opened #10973 to discuss / fix this.
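For context, here is a sketch of the idea being discussed, expressed against go-libp2p-pubsub's ValidationResult values rather than the actual #10973 diff; errExistingNonce below is a hypothetical placeholder for lotus's "nonce already in mpool" error, not the real identifier.

```go
// Sketch of the idea behind the fix, assuming go-libp2p-pubsub's
// ValidationResult values; errExistingNonce is a placeholder, not lotus's
// real error value, and this is not the actual #10973 change.
package example

import (
	"errors"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

var errExistingNonce = errors.New("message with this nonce already in mpool")

func classifyAddError(err error) pubsub.ValidationResult {
	switch {
	case err == nil:
		return pubsub.ValidationAccept
	case errors.Is(err, errExistingNonce):
		// Before the fix this case fell through to Reject, cutting the
		// score of a peer that (re)sent a perfectly good message.
		return pubsub.ValidationIgnore
	default:
		return pubsub.ValidationReject
	}
}
```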

It will unfortunately be a little tricky to test whether this helps with the issue. To confirm it, we'll have to reproduce the issue on one of our nodes ("easy" to do) while connected to at least 2 other nodes -- one running the fix in #10973 and one without it. Ideally, we'll see that the node with the patch doesn't penalize us and continues to send us blocks.

We'll only really know for sure when we have a large number of users running the patch on mainnet.

@arajasek
Contributor

Based on logs shared by @TippyFlitsUK here, the most common reason for rejecting pubsub messages (and thus lowering peer scores) is the ErrExistingNonce error that was addressed in #10973. This also matches what I am seeing on my node.

This does not necessarily mean #10973 will solve the issue in the OP, but it gives us some confidence that it might. Next steps towards this theory could be:

  • confirming that if enough PubsubRejects happen, a Lotus node will stop sharing blocks with the peer in question (see the scoring sketch below)
  • setting up connections between a node experiencing the issue in the OP and nodes running fixes / debug logging, and seeing what the peers report when the node falls out of sync

We should still be open to other theories, though. Trying to identify the exact triggers that would cause peers to stop sharing blocks (there are likely only 1-2 such triggers), and then identifying what in Lotus might lead to those triggers, could point us to the bug.
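As background for the first bullet, here is an illustrative-only sketch of the gossipsub scoring knobs involved; the weights and thresholds are invented for the sketch and are not lotus's production parameters. Each ValidationReject increments a per-peer invalid-deliveries counter for the topic, that counter is squared and multiplied by a negative weight, and once the resulting score falls below the configured thresholds the scoring node stops gossiping (including new blocks) to the peer and eventually graylists it.

```go
// Illustrative-only scoring configuration; the weights and thresholds are
// invented for this sketch and are NOT lotus's production values.
package example

import (
	"time"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

func scoringOption(blocksTopic string) pubsub.Option {
	params := &pubsub.PeerScoreParams{
		AppSpecificScore: func(peer.ID) float64 { return 0 },
		DecayInterval:    time.Second,
		DecayToZero:      0.01,
		RetainScore:      6 * time.Hour, // remember misbehaving peers after they disconnect
		Topics: map[string]*pubsub.TopicScoreParams{
			blocksTopic: {
				TopicWeight:       1,
				TimeInMeshQuantum: time.Second,
				// Each ValidationReject bumps a per-peer counter; the
				// penalty is counter^2 * this (negative) weight.
				InvalidMessageDeliveriesWeight: -10,
				InvalidMessageDeliveriesDecay:  0.9,
			},
		},
	}
	thresholds := &pubsub.PeerScoreThresholds{
		GossipThreshold:   -100, // below this, gossip to/from the peer stops
		PublishThreshold:  -200, // below this, our own publishes skip the peer
		GraylistThreshold: -300, // below this, the peer's RPCs are ignored entirely
	}
	return pubsub.WithPeerScore(params, thresholds)
}
```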

@marco-storswift
Contributor

I don't agree that #10973 fixes this sync failure case. I merged this commit and hit this case again:
Chain: [sync behind! (6m9s behind)]
lotus net peers | wc -l
29

@arajasek
Contributor

@marco-storswift Thanks for testing this out! I'm also not 100% confident that #10973 will fix the issue. However, for it to help, your peers actually need to be running the patch. Unfortunately, merging the commit yourself won't help your own node 😞

Is it possible for you to merge the patch onto a few more nodes, and keep them connected to each other? That might give us some more insight (though we really won't know until the majority of the nodes have upgraded).

@TippyFlitsUK
Contributor Author

I am also running the patch, Marco. Please feel free to connect to me at 12D3KooWNVymjK1Q1UDLDFFK5iFYJ1v5NsrXDPTHaXpuk4abByYs

@marco-storswift
Contributor

@arajasek @TippyFlitsUK Good news: when I updated github.com/libp2p/go-libp2p to v0.27.6, everything was OK.

@arajasek
Contributor

arajasek commented Jun 15, 2023

@marco-storswift That's GREAT news! The Lotus team will try to confirm this as well, but it would be awesome if more users could try it too.

@shrenujbansal Can you throw up a tag (not an RC) that bumps go-libp2p to 0.27.6 on top of the latest v1.23.1 RC and point some folks who were experiencing the issue at it? Fingers crossed we confirm the good news ❤️

@shrenujbansal
Contributor

Here's the tag with libp2p 0.27.6 on top of v1.23.1-rc4: https://github.com/filecoin-project/lotus/releases/tag/v1.23.1-libp2p-0.27.6

@marco-storswift Did you also have #10973 in your source code where you saw the issue fixed?

@marco-storswift
Contributor

Yes, I had #10973 and bumped go-libp2p to 0.27.6. Sync is OK; I've been running the node for over 48 hours.

@arajasek
Contributor

@shrenujbansal Can we please get an update on this (ideally daily updates, even if they're "no progress")?

@shrenujbansal
Contributor

Below is a summary of the debugging done by myself and @MarcoPolo.

As @arajasek pointed out above #10906 (comment), the ErrExistingNonce error was being treated as a reject and was penalizing peers incorrectly. This bug means that peers who republish their pending messages (automatically every ~5 min) will be penalized by their gossipsub peers. Eventually their peers will remove them from their topics and they won't learn about new blocks over gossipsub. Instead, their only sync mechanism will be the lotus hello protocol when they discover new peers via Kademlia.

This specific bug has been fixed with #10973, which is on master and will ship with the next release. However, several of your peers will need to have adopted this fix before you see an improvement.

High level points:

  • Lotus' gossipsub validation logic was wrongly rejecting duplicate messages.
  • Lotus automatically republishes messages in its mpool every ~5 min
  • Peers will prune you from their topics for misbehaving, and they will remember you and prune you again in the future.
  • This prevents you from learning about blocks over gossipsub, unless you find a new peer that doesn't already know about you (which works only until you republish to them).
  • This only happens if you have pending messages that stick around in your mpool for a long time, which leads to republishes at ~5 min intervals (a rough sketch of this loop follows below). We have noticed that this issue happens more often (in testing and in the public) with messages like PSDs, which tend to stay in the mpool for longer periods. This matches our observations noted above.
  • The bug is fixed in master, but you will only see the effects when the network adopts the fix
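To tie the points above together, here is a rough sketch (not lotus's actual mpool code) of the republish behaviour described: as long as a pending message sits in the mpool, it gets re-published roughly every five minutes, and peers running the pre-#10973 validation count each republish as an invalid delivery and keep pruning the publisher. The pendingMessages accessor is hypothetical.

```go
// Rough sketch of the described republish loop; NOT lotus's actual mpool
// implementation. pendingMessages is a hypothetical accessor returning
// whatever is still in the local mpool, already serialized for the wire.
package example

import (
	"context"
	"time"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func pendingMessages() [][]byte { return nil }

func republishLoop(ctx context.Context, topic *pubsub.Topic) {
	ticker := time.NewTicker(5 * time.Minute) // interval from the description above
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, raw := range pendingMessages() {
				// Each republish is a duplicate from the receivers' point of
				// view; pre-fix validation rejected it and penalized us.
				_ = topic.Publish(ctx, raw)
			}
		}
	}
}
```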

@arajasek
Contributor

I think we're comfortable saying this is fixed by #10973.
