network: discard unrequested or stale block messages #5431
Check the outstandingTopicRequests counter here?
At this point netB has already disconnected from netA, so we can't check that: the counter lived on the peer. Trying to monitor it flipping negative while the peer is in the process of disconnecting would cause a data race, I believe.
You can't cause a race on an atomic counter, though?
The relevant counter here is on netB's peer struct representing netA. The race wouldn't be on the atomic counter itself but on the fact that the peer I'm trying to check the counter on is in the process of being destroyed. Either way, I removed the offending log check for this one but kept it for the next case in netC so far.
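For reference, a minimal sketch of the counter pattern under discussion, with illustrative names (the real field lives on the wsPeer struct):

```go
package network

import "sync/atomic"

// Illustrative sketch only; the real counter lives on the wsPeer struct.
type peer struct {
	// Counts topic (e.g. block) requests sent to this peer that have not
	// yet been answered. Accessed atomically.
	outstandingTopicRequests int64
}

func (p *peer) requestSent() { atomic.AddInt64(&p.outstandingTopicRequests, 1) }

// responseReceived returns the new count; a negative value means we received
// a response we never asked for, or one we had already given up on. The
// atomic ops themselves are race-free; the race discussed above is on the
// lifetime of the peer object being read while it is torn down.
func (p *peer) responseReceived() int64 {
	return atomic.AddInt64(&p.outstandingTopicRequests, -1)
}
```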
Checking log content seems brittle if the message changes. Is the concern here to confirm that it didn't disconnect for any other reason? If so, that might justify having counters for disconnect reasons, which you could extend with this new reason.
Yeah, it is somewhat brittle, but it's also an easy fix if the log message changes. I'm open to adding counters, although we do use this pattern elsewhere in the code already.
You are already bumping networkConnectionsDroppedTotal.Inc(map[string]string{"reason": "protocol"}), so you could check that?
But isn't it good enough to assert that the disconnect happens (with the new reason code, for example)? Asserting actual behavior rather than log output seems better.
Agreed that it's good enough, and I've changed the reason code.
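A sketch of what asserting on the metric instead of the log might look like; countersSnapshot and sendStaleBlockMessage are hypothetical stand-ins for the test harness, and only the "protocol" label is taken from the call quoted above:

```go
package network

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// countersSnapshot and sendStaleBlockMessage are hypothetical helpers; the
// "protocol" label matches the networkConnectionsDroppedTotal bump above.
func TestStaleBlockMessageDisconnects(t *testing.T) {
	before := countersSnapshot()["protocol"]

	sendStaleBlockMessage() // provoke the unrequested/stale block response

	// Wait for the disconnect to be recorded rather than grepping the log.
	require.Eventually(t, func() bool {
		return countersSnapshot()["protocol"] > before
	}, 5*time.Second, 50*time.Millisecond)
}
```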
netA and netB are Stopped via defer; if the test fails, will this netC be left open?
There is a defer for netC as well, on line 4009.
The reason I'm also stopping manually (and doing the same for netB) is that otherwise reading the log would be a data race.
Maybe we should not read the log then?
I removed one of the log-reading checks from netB, since I agree that the disconnect reason is a good enough check. But this case doesn't warrant a disconnect: we did make the request (or more than one); we are just no longer interested in the response.
Do you want me to introduce a new counter here and check that instead of the log?
Rather than asserting log behavior, why not assert actual behavior, e.g. that the message was discarded?
I don't think anything else happens as a side effect here that I could check. We aren't disconnecting or bumping any counters. I did want to distinguish it from the fall-through case of going through the unmarshalling process in the switch statement below, though.
The behavior is that we're dropping the message, so the handler is not called. There are some similar tests that register handlers for certain tags and count the number of calls.
Oh, this tag doesn't use a handler; it's handled inline. So weird.
Indeed. Should we make a TS handler to make it more consistent as part of this?
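For context, a sketch of the handler-counting test pattern mentioned above. It follows the network package's RegisterHandlers flow, but note it assumes the TS tag were routed through normal dispatch, which, as just noted, it currently is not:

```go
package network

import (
	"sync/atomic"
	"testing"

	"github.com/stretchr/testify/require"

	"github.com/algorand/go-algorand/protocol"
)

// countingHandler counts how many times a tag's handler fires. Sketch only:
// TS is handled inline today, so this presumes it went through dispatch.
type countingHandler struct{ calls int64 }

func (h *countingHandler) Handle(msg IncomingMessage) OutgoingMessage {
	atomic.AddInt64(&h.calls, 1)
	return OutgoingMessage{}
}

func assertStaleMessageDropped(t *testing.T, netC GossipNode) {
	counter := &countingHandler{}
	netC.RegisterHandlers([]TaggedMessageHandler{
		{Tag: protocol.TopicMsgRespTag, MessageHandler: counter},
	})
	// ... deliver the stale TS message to netC here ...
	require.Zero(t, atomic.LoadInt64(&counter.calls),
		"a stale TS message should be discarded before dispatch")
}
```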
What is the rationale for punishing slow nodes by disconnecting from them? They are not malicious, just slow, so why is it a protocol violation to send you what you asked for, just a little late?
It seems like this would lead to a lot of unnecessary disconnections if the network got into some kind of poorly performing state: peers requesting blocks from each other while bogged down exhausting resources for other reasons would, on top of that, start disconnecting from each other.
That's a valid point; we can increase this or drop it.
Alternatively, we could make it a protocol validation by checking our receive time and not sending anything out if we know that the requesting peer is guaranteed to no longer be expecting it.
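A hypothetical sketch of that alternative (function and parameter names are invented, not the PR's code):

```go
package network

import "time"

// Hypothetical sketch: skip serving a topic request once the requester's
// response window (the 4-second handler timeout mentioned below) has
// certainly elapsed.
func shouldServe(receivedAt time.Time, responseTimeout time.Duration) bool {
	// The requester deregisters its response handler after responseTimeout,
	// so any later reply is guaranteed to be discarded on arrival.
	return time.Since(receivedAt) <= responseTimeout
}
```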
The code below will fail to serve the request anyway, since the handler to write the response to is removed after 4 seconds.
Maybe 4 seconds is too short for full blocks.
Yeah, for a slow connection 4 seconds seems kind of aggressive for a full block + cert.
If my node is on a slow connection or a slow box, or my node ran out of CPU/memory/disk IO for a little while, it could be entirely my fault that I'm receiving messages slowly, not the other end's. It doesn't seem to me like you can reliably bring the timing of messages into protocol rules...
Agreed; I removed the timestamp logic in favor of counting the number of topic requests we've sent to the peer.
Should we change the timeout as part of this PR?
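A minimal sketch of that counting approach, assuming the check sits in the wsPeer read loop (the method name is illustrative):

```go
package network

import "sync/atomic"

// Sketch of the counting approach described above: a TS message is
// unrequested or stale exactly when decrementing the in-flight counter
// would take it negative.
func (wp *wsPeer) shouldDiscardTopicResponse() bool {
	if atomic.AddInt64(&wp.outstandingTopicRequests, -1) < 0 {
		// We never asked, or we already gave up waiting: restore the
		// counter and drop the message without unmarshalling it.
		atomic.AddInt64(&wp.outstandingTopicRequests, 1)
		return true
	}
	return false
}
```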
Similarly, maybe this is not important enough to be Warnf level? io.Discard.Write() can't fail, but I guess reader.Read() could return an err...
Oh I see, we seem to have wp.reportReadErr(err) just for that, and it has its own special handling of when and how to log read errors from peers.
Also, shouldn't you disconnect here? That's what happens for other reader Read errors.
Agreed, made the change
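A rough sketch of the resulting drain-and-disconnect handling, factored out of the read loop for illustration (the real code is inline and may differ):

```go
package network

import "io"

// Sketch: the unwanted message body is copied to io.Discard, and a failed
// read is treated like any other reader error, including disconnecting.
func (wp *wsPeer) drainStaleMessage(reader io.Reader) {
	if _, err := io.Copy(io.Discard, reader); err != nil {
		wp.reportReadErr(err) // the read-error logging path quoted above
		wp.internalClose(disconnectReadError)
	}
}
```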
Warnf goes to telemetry by default, but this doesn't seem very important. Could we make this Infof?
Sure, but it's nice to have telemetry on how often this happens. The current behavior actually logs this case to telemetry as well, just not until after it unmarshalls the message on line 581, so this doesn't increase the number of telemetry messages we expect to receive. Even so, I'm happy to downgrade if others agree, and to do the same for the other place where we log this.
Since you're doing all the work of taking the lock, could you instead return the len directly, and let the caller decide to compare it with 0?
I think I have a slight preference for the way it is currently, but I'm happy to change it if there's a +1.
I just don't think we will be checking the length of this outside of this use case, and to me this parses slightly more easily in the conditional.
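For illustration, the two shapes being discussed, with invented type and field names (the real structure is whatever the PR guards with its lock):

```go
package network

import "sync"

// requestTracker is a hypothetical stand-in for the locked structure in the
// PR. Both accessors take the same lock, so the only difference is whether
// the comparison with 0 lives here or at the call site.
type requestTracker struct {
	mu       sync.Mutex
	requests map[string]struct{}
}

// lenRequests is the shape suggested above: return the length directly.
func (rt *requestTracker) lenRequests() int {
	rt.mu.Lock()
	defer rt.mu.Unlock()
	return len(rt.requests)
}

// hasRequests is the shape currently in the PR: the len comparison stays
// behind the accessor, which reads slightly more easily in a conditional.
func (rt *requestTracker) hasRequests() bool {
	return rt.lenRequests() > 0
}
```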