
network: discard unrequested or stale block messages #5431

Merged
merged 12 commits into algorand:master, Jun 20, 2023

Conversation

iansuvak (Contributor)

Summary

This PR adds special handling, immediately after reading the incoming message Tag, for TS (protocol.TopicMsgRespTag) messages, the tag used to send catchup blocks.

It handles the following two cases:

1. We never requested a block from this peer in the first place -- disconnect in this case.
2. We requested a block, but the context managing the handler (governed by config.CatchupGossipBlockFetchTimeoutSec) has timed out, so there is no point in trying to process the message. Discard it and move on (sketched below).
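
A simplified sketch of the two cases, using hypothetical names (the real logic lives in the readLoop in network/wsPeer.go; this is not the PR's actual code):

package main

import (
	"fmt"
	"io"
	"strings"
)

type action int

const (
	actionProcess    action = iota // requested and still wanted: hand the payload to the waiting handler
	actionDisconnect               // case 1: we never asked this peer for a block
	actionDiscard                  // case 2: we asked, but the waiting handler has already timed out
)

// classifyTopicResponse decides what to do with an incoming TS (TopicMsgResp) payload.
func classifyTopicResponse(requested, handlerStillWaiting bool, payload io.Reader) (action, error) {
	if !requested {
		return actionDisconnect, nil
	}
	if !handlerStillWaiting {
		// Drain the payload so the read stream stays in sync, then drop it.
		if _, err := io.Copy(io.Discard, payload); err != nil {
			return actionDisconnect, fmt.Errorf("could not discard stale TS message: %w", err)
		}
		return actionDiscard, nil
	}
	return actionProcess, nil
}

func main() {
	a, _ := classifyTopicResponse(false, true, strings.NewReader("block"))
	fmt.Println(a == actionDisconnect) // unsolicited response: disconnect
	a, _ = classifyTopicResponse(true, false, strings.NewReader("block"))
	fmt.Println(a == actionDiscard) // requested but stale: discard the bytes and continue
}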

Test Plan

Added a new test to wsNetwork_test to confirm that we hit the disconnect case when receiving unsolicited blocks.

I'm open to suggestions for cleaner testing of this, and of the second case where we did request a block but the handler has since timed out.

@iansuvak changed the title from "Discard unrequested or stale block messages" to "network: discard unrequested or stale block messages" May 30, 2023
@iansuvak self-assigned this May 30, 2023
// Skip the message if it's a response to a request we didn't make or that has timed out
if msg.Tag == protocol.TopicMsgRespTag && !wp.hasOutstandingRequests() {
	// We never requested anything from this peer, so sending a response is a protocol breach -- disconnect
	if wp.getPeerData(lastSentRequestTime) == nil {
Contributor:

This is a weak condition, since lastSentRequestTime is never removed from the peer data map.
I also traced the context all the way up and do not see it being cancelled by a timeout -- could you point it out? The only cancellation I found in catchup.innerFetch is the case <-ledgerWaitCh, when the ledger received the block by other means.

It appears the existence of a hash in responseChannels is a pretty good indication that "a request was sent", and an empty responseChannels is the opposite and a good opportunity to drop.

It would be more complex but more error-proof to have a complement data structure to responseChannels -- e.g. which topic request tags have been sent -- but that is complicated to manage, since there are lots of block requests during catchup.

Edit: agreed on the importance of lastSentRequestTime, but it needs to be cleared after some period of time.

Contributor Author:

Instead of clearing it out, I added a synchronous check that disconnects if the response is more than a minute late.

@iansuvak marked this pull request as ready for review June 5, 2023 16:03
codecov bot commented Jun 5, 2023

Codecov Report

Merging #5431 (215a19c) into master (96c9845) will decrease coverage by 6.81%.
The diff coverage is 71.05%.

@@            Coverage Diff             @@
##           master    #5431      +/-   ##
==========================================
- Coverage   55.61%   48.80%   -6.81%     
==========================================
  Files         447      447              
  Lines       63410    79843   +16433     
==========================================
+ Hits        35265    38969    +3704     
- Misses      25757    38498   +12741     
+ Partials     2388     2376      -12     
Impacted Files Coverage Δ
network/wsPeer.go 63.24% <67.85%> (-5.49%) ⬇️
util/metrics/counter.go 87.01% <80.00%> (-3.17%) ⬇️

... and 416 files with indirect coverage changes



// If the peer sends us a block response after this threshold
// we should disconnect from it
const blockResponseDisconnectThreshold = 60 * time.Second
Contributor:

What is the rationale for punishing slow nodes by disconnecting from them? They are not malicious, just slow, so why is it a protocol violation to send you what you asked for, just a little late?

@cce (Contributor), Jun 5, 2023:

It seems like this would lead to a lot of unnecessary disconnections if the network got into some kind of poorly-performing state: peers requesting blocks from each other while bogged down and exhausting resources for other reasons would, on top of that, start to disconnect from each other.

Contributor Author:

That's a valid point, we can increase this or drop it.

Alternatively we could potentially make it a protocol validation by checking our receive time and not sending anything out if we know that the requesting peer is guaranteed to not be expecting it any longer.

Contributor:

the code below will fail to serve the request anyway, since the handler to write the response to is removed after 4 seconds

childCtx, cancelFunc := context.WithTimeout(ctx, time.Duration(w.config.CatchupGossipBlockFetchTimeoutSec)*time.Second)

maybe 4 sec is too short for full blocks
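
For context, a minimal self-contained illustration of that timeout pattern, with stand-in names rather than the actual catchup code: once the child context derived from CatchupGossipBlockFetchTimeoutSec expires, the waiter gives up, so a response arriving later has no handler left to receive it.

package main

import (
	"context"
	"fmt"
	"time"
)

// catchupGossipBlockFetchTimeout stands in for config.CatchupGossipBlockFetchTimeoutSec (4s by default).
const catchupGossipBlockFetchTimeout = 4 * time.Second

// waitForBlockResponse waits on a response channel until the block arrives or the child context expires.
// Once it returns with an error, the caller deregisters the channel, so a late TS response is orphaned.
func waitForBlockResponse(ctx context.Context, resp <-chan []byte) ([]byte, error) {
	childCtx, cancelFunc := context.WithTimeout(ctx, catchupGossipBlockFetchTimeout)
	defer cancelFunc()

	select {
	case blk := <-resp:
		return blk, nil
	case <-childCtx.Done():
		return nil, childCtx.Err()
	}
}

func main() {
	resp := make(chan []byte)
	// Nothing ever sends on resp, so this gives up after roughly 4 seconds.
	if _, err := waitForBlockResponse(context.Background(), resp); err != nil {
		fmt.Println("block request timed out:", err)
	}
}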

Contributor:

Yeah for a slow connection, 4 seconds seems kind of aggressive for a full block + cert

Contributor:

Alternatively we could potentially make it a protocol validation by checking our receive time and not sending anything out if we know that the requesting peer is guaranteed to not be expecting it any longer.

If my node is on a slow connection or a slow box, or my node ran out of CPU/memory/diskIO for a little while, it could completely be my fault that I'm receiving messages slowly, not the other end — it doesn't seem to me like you can reliably bring the timing of messages into protocol rules...

Contributor Author:

Agreed, and I removed any timestamp logic in favor of counting the number of topic requests we've sent to the peer.

Should we change the timeout as part of this PR?
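
A rough sketch of that counter-based bookkeeping (simplified; outstandingTopicRequests here is a stand-in for the field the PR adds to wsPeer, and the helper names are made up):

package main

import (
	"fmt"
	"sync/atomic"
)

// peer is a cut-down stand-in for wsPeer: it only tracks how many topic (TS)
// requests we have sent that have not yet been answered.
type peer struct {
	outstandingTopicRequests int64
}

// sendTopicRequest is called whenever we send a block request to the peer.
func (p *peer) sendTopicRequest() {
	atomic.AddInt64(&p.outstandingTopicRequests, 1)
}

// onTopicResponse is called when a TS response arrives. It returns false if the
// response was unsolicited, i.e. the counter would go negative.
func (p *peer) onTopicResponse() bool {
	if atomic.AddInt64(&p.outstandingTopicRequests, -1) < 0 {
		atomic.AddInt64(&p.outstandingTopicRequests, 1) // undo the decrement
		return false
	}
	return true
}

func main() {
	p := &peer{}
	p.sendTopicRequest()
	fmt.Println(p.onTopicResponse()) // true: we asked for this block
	fmt.Println(p.onTopicResponse()) // false: unsolicited, a candidate for disconnect
}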



// Stop and confirm that we hit the case of disconnecting a peer for sending an unrequested block response
netB.Stop()
lg := logBuffer.String()
require.Contains(t, lg, "sent TS response without a request")
Contributor:

check outstandingTopicRequests counter here?

Contributor Author:

at this point netB has already disconnected from netA, so we can't check that, since it would have been on the peer. Trying to monitor it flipping to negative while the peer is in the process of disconnecting would cause a data race, I believe.

Contributor:

you can't cause a race on an atomic counter though?

Contributor Author:

The relevant counter here is on netB's peer struct representing netA. The race wouldn't be on the atomic counter but on the fact that the peer I'm trying to check the counter on is in the process of being destroyed. Either way, I removed the offending log check for this one but kept it for the next case in netC so far.

algorandskiy previously approved these changes Jun 6, 2023
@algorandskiy requested a review from cce June 6, 2023 15:12

iansuvak commented Jun 8, 2023

The last update was just to merge in master and resolve the conflict in wsNetwork_test.go

algorandskiy previously approved these changes Jun 8, 2023
@AlgoAxel (Contributor) left a comment:

The logic for holding the counter and disconnecting looks fine. Just some comments on testing and a suggestion for changing hasOutstandingRequests.

Comment on lines 4003 to 4006
// Stop and confirm that we hit the case of disconnecting a peer for sending an unrequested block response
netB.Stop()
lg := logBuffer.String()
require.Contains(t, lg, "sent TS response without a request")
Contributor:

Checking log content seems brittle if the message changes. Is the concern here to confirm that it didn't disconnect for any other reason? If so, it might justify having counters for disconnect reasons, which you could extend with this reason.

Contributor Author:

Yeah, it is somewhat brittle, but it's also an easy fix if the log message changes. I'm open to adding counters anyhow, but we do use this pattern elsewhere in the code already.

Contributor:

You are already bumping networkConnectionsDroppedTotal.Inc(map[string]string{"reason": "protocol"}) so you could check that?

@cce (Contributor), Jun 9, 2023:

But isn't it good enough to assert that the disconnect happens (with the new reason code, for example)? Asserting actual behavior rather than logging seems better.

Contributor Author:

Agreed that it's good enough and I've changed the reason code
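
One way to assert that behavior directly (a sketch with a hypothetical numberOfPeers accessor, not the actual wsNetwork_test code) is to poll with require.Eventually instead of reading the log:

package example

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// waitForDisconnect asserts that the peer count drops to zero, rather than grepping the log.
// numberOfPeers is a hypothetical accessor for however the test counts connected peers.
func waitForDisconnect(t *testing.T, numberOfPeers func() int) {
	require.Eventually(t, func() bool { return numberOfPeers() == 0 },
		5*time.Second, 50*time.Millisecond,
		"peer was not disconnected for the unsolicited TS response")
}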

Comment on lines +4064 to +4065
// Stop and confirm that we hit the case of disconnecting a peer for sending a stale block response
netC.Stop()
Contributor:

netA and netB are stopped via defer; if the test fails, will this netC be left open?

Contributor Author:

there is a defer for netC as well, on line 4009

Contributor Author:

The reason why I'm also stopping manually (doing the same for netB as well) is that otherwise reading the log would be a data race.

Contributor:

Maybe we should not read the log then?

Contributor Author:

I removed one of the log-reading checks from netB since I agree that the disconnect reason is a good enough check, but this case doesn't warrant a disconnect: we did make the request (or more than one), we are just no longer interested in the response.

Do you want me to introduce a new counter here and check that instead of the log?

Comment on lines 1009 to 1014
func (wp *wsPeer) hasOutstandingRequests() bool {
	wp.responseChannelsMutex.Lock()
	defer wp.responseChannelsMutex.Unlock()
	return len(wp.responseChannels) > 0
}

Contributor:

Since you're doing all the work of taking the lock, could you instead return the length directly and let the caller decide to compare it with 0?

Contributor Author:

I think I have a slight preference for the way it is currently, but I'm happy to change it if there's a +1.

I just don't think we will be checking the length of this outside of this use case, and to me this parses slightly more easily in the conditional.
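
For comparison, the suggested alternative would look roughly like this (a sketch reusing the fields shown above, not part of the PR):

// outstandingRequestCount returns the number of in-flight topic requests;
// callers compare it with 0 themselves.
func (wp *wsPeer) outstandingRequestCount() int {
	wp.responseChannelsMutex.Lock()
	defer wp.responseChannelsMutex.Unlock()
	return len(wp.responseChannels)
}

// Caller side, equivalent to !wp.hasOutstandingRequests():
// if msg.Tag == protocol.TopicMsgRespTag && wp.outstandingRequestCount() == 0 { ... }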

algorandskiy previously approved these changes Jun 8, 2023
// Stop and confirm that we hit the case of disconnecting a peer for sending a stale block response
netC.Stop()
lg = logBuffer.String()
require.Contains(t, lg, "wsPeer readLoop: received a TS response for a stale request ")
Contributor:

rather than asserting log behavior, why not assert actual behavior, like that the message was discarded?

Contributor Author:

I don't think anything else happens as a side effect here that I could check. We aren't disconnecting or bumping any counters. I did want to distinguish it from the fall-through case of going through the unmarshalling process in the switch statement below, though.

Contributor:

the behavior is that we're dropping the message so the handler is not called.. there are some similar tests that register handlers for certain tags and count the number of calls

Contributor:

oh this tag doesn't use a handler.. it's handled inline. so weird

Contributor Author:

Indeed. Should we make a TS handler to make it more consistent as part of this?

wp.net.log.Warnf("wsPeer readloop: could not discard timed-out TS message from %s : %s", wp.conn.RemoteAddr().String(), err)
continue
}
wp.net.log.Warnf("wsPeer readLoop: received a TS response for a stale request from %s. %d bytes discarded", wp.conn.RemoteAddr().String(), n)
@cce (Contributor), Jun 9, 2023:

Warnf goes to telemetry by default, but this doesn't seem very important. Could we make this Infof?

Contributor Author:

Sure, but it's nice to have telemetry on how often this happens. The current behavior actually logs this case to telemetry, but not until after it unmarshalls the message on line 581. This doesn't increase the number of telemetry messages we expect to receive, but even so, I'm happy to downgrade it if others agree, and to do the same for the other place where we log this.

// Peer sent us a response to a request we made but we've already timed out -- discard
n, err = io.Copy(io.Discard, reader)
if err != nil {
wp.net.log.Warnf("wsPeer readloop: could not discard timed-out TS message from %s : %s", wp.conn.RemoteAddr().String(), err)
@cce (Contributor), Jun 9, 2023:

similarly, maybe this is not important enough to be Warnf level? io.Discard.Write() can't fail, but I guess reader.Read() could return an err...

oh I see, we seem to have wp.reportReadErr(err) just for that, and it has its own special handling of when and how to log read errors from peers

Contributor:

also, shouldn't you disconnect here? that's what happens for other reader Read errors?

Contributor Author:

Agreed, made the change
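
The agreed change would look roughly like the following (a sketch only; the merged code may differ): a failed drain of the stale payload is treated like any other read error.

// Sketch: if draining the stale TS payload fails, report it through the existing
// read-error path and leave the read loop, which tears down the connection to the peer.
n, err = io.Copy(io.Discard, reader)
if err != nil {
	wp.reportReadErr(err)
	return
}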

@AlgoAxel (Contributor) left a comment:

The comments I had were all addressed; I have reviewed the changes to the tests regarding Eventually and log checking.

@algorandskiy merged commit f83a656 into algorand:master Jun 20, 2023
24 checks passed