
Increase tested transfer count and data size #45

Closed
wants to merge 11 commits

Conversation

njgheorghita
Contributor

Ok, well I think I've been able to track down at least one of the degenerative states we've been experiencing w/ utp inside trin. But, first a little context...

  • During some debugging with @mrferris yesterday, we were observing that not a single fin packet was being sent from the sender -> receiver. This behavior persisted until, for an unrelated reason, I restarted my terminal, after which we started to observe fin packets being sent. I have no explanation for this 🤷
  • However, after this, we still observe degenerative behavior when running a bridge node against a single trin node. I'd estimate that ~30% of accept'ed content actually completes the offer/accept flow. If we gossip only headers (eg relatively small data chunks) then we see 100% successful completions. Once we start gossiping bodies / receipts (eg relatively large data chunks) then we start to see the degenerative state. This implies that there is some bug when running many concurrent utp transfers.

I've updated the socket test to initiate 50 offer/accept flows for a relatively large piece of data. The error messages here are familiar from the ones we've been seeing while debugging the live testnet. I'm happy to continue debugging this test failure and find the cause, but first some input from @jacobkaufmann would be nice, to confirm that the test is correctly formed and that you can reproduce the failure.
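For reference, the updated test has roughly the following shape (a minimal sketch modeled on this library's existing socket test; the module paths, the `ConnectionId` fields, and method names like `accept_with_cid`, `connect_with_cid`, `read_to_eof`, and `close` are assumptions and may differ slightly from the actual API):

```rust
// Sketch: N concurrent offer/accept flows over one pair of loopback uTP
// sockets, each transferring a large payload. Identifiers below follow the
// existing socket test in this repo, but treat them as assumptions.
use std::net::SocketAddr;
use std::sync::Arc;

use utp_rs::cid::ConnectionId;
use utp_rs::conn::ConnectionConfig;
use utp_rs::socket::UtpSocket;

const NUM_TRANSFERS: u16 = 50;
const PAYLOAD_SIZE: usize = 1_000_000;

#[tokio::test]
async fn many_concurrent_transfers() {
    let recv_addr: SocketAddr = "127.0.0.1:3400".parse().unwrap();
    let send_addr: SocketAddr = "127.0.0.1:3401".parse().unwrap();
    let recv = Arc::new(UtpSocket::bind(recv_addr).await.unwrap());
    let send = Arc::new(UtpSocket::bind(send_addr).await.unwrap());

    let mut handles = Vec::new();
    // Step by 2 so consecutive (send, recv) CID pairs never share an id.
    for id in (100u16..100 + 2 * NUM_TRANSFERS).step_by(2) {
        let data = vec![0xa5u8; PAYLOAD_SIZE];
        let expected = data.clone();

        let recv_cid = ConnectionId { send: id, recv: id + 1, peer: send_addr };
        let send_cid = ConnectionId { send: id + 1, recv: id, peer: recv_addr };

        // Receiver side: accept the connection and read the whole payload.
        let recv_socket = Arc::clone(&recv);
        handles.push(tokio::spawn(async move {
            let mut stream = recv_socket
                .accept_with_cid(recv_cid, ConnectionConfig::default())
                .await
                .unwrap();
            let mut buf = vec![];
            stream.read_to_eof(&mut buf).await.unwrap();
            assert_eq!(buf, expected);
        }));

        // Sender side: connect, write the payload, then close (send FIN).
        let send_socket = Arc::clone(&send);
        handles.push(tokio::spawn(async move {
            let mut stream = send_socket
                .connect_with_cid(send_cid, ConnectionConfig::default())
                .await
                .unwrap();
            stream.write(&data).await.unwrap();
            stream.close().await.unwrap();
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}
```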

@njgheorghita
Contributor Author

@jacobkaufmann Have you had the chance to look at this? Not necessarily looking for a fix, but more a confirmation that the test is correctly formed and exposes a bug that needs fixing.

@jacobkaufmann
Collaborator

apologies for the delay. I ran the test based on your code here, which appears correctly formed, but I don't see any ERROR logs. can you specify the error message(s) that you refer to above? how large (in bytes) are the payloads for bodies and receipts?

@njgheorghita
Copy link
Contributor Author

By error I guess I actually meant WARN logs wrt the utp library (the error logs are generated within trin):

2023-05-08T13:54:22.659212Z  WARN uTP{send=172 recv=173}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.659465Z  WARN uTP{send=102 recv=103}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.659783Z  WARN uTP{send=100 recv=101}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.660256Z  WARN uTP{send=196 recv=197}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.659212Z  WARN uTP{send=170 recv=171}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.685668Z  WARN uTP{send=104 recv=105}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.889021Z  WARN uTP{send=168 recv=169}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.891754Z  WARN uTP{send=166 recv=167}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.891886Z  WARN uTP{send=128 recv=129}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.892672Z  WARN uTP{send=160 recv=161}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.893107Z  WARN uTP{send=176 recv=177}: utp_rs::conn: idle timeout expired, closing... unacked=[]

re bodies / receipts payload size: running a quick check on recent content values, the largest payload I'm seeing is approx 250_000 bytes out of a sample size of around 25 blocks. Maybe you're asking so we can confirm whether or not this test exercises realistic bounds of the library as it's used in the wild? If so, fair enough: 1_000_000 bytes (as used in these tests) is probably excessive. However, I'm not sure whether the cause of the problem here is payload size or the # of concurrent transfers, since a larger payload means each transfer takes longer and therefore more transfers are active at once. Adjusting the test parameters for # of concurrent transfers (50/100/1000/10000) and payload size in bytes (10_000, 100_000, 250_000, 500_000), I got the following pass/fail results:

| payload (bytes) | 50 txs | 100 txs | 1000 txs | 10000 txs |
| --- | --- | --- | --- | --- |
| 10_000 | 🟢 | 🟢 | 🔴 | 🔴 |
| 100_000 | 🟢 | 🔴 | - | - |
| 250_000 | 🟢 | 🔴 | - | - |
| 500_000 | 🔴 | - | - | - |
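(For context, the matrix above comes from re-running the test with each parameter combination; conceptually it's a sweep like the hypothetical sketch below, where `run_transfers` merely stands in for the offer/accept test body in this PR and is not a real helper in the repo.)

```rust
// Hypothetical parameter sweep over the test body; `run_transfers` is a
// stand-in name, not an actual function in this repository.
for count in [50usize, 100, 1_000, 10_000] {
    for payload_size in [10_000usize, 100_000, 250_000, 500_000] {
        run_transfers(count, payload_size).await;
    }
}
```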

I'm having trouble collecting good data on how many utp transfers we can realistically expect to be running at the same time in trin... it's certainly something I can put more effort towards if you think it's useful. But my intuition is that, at the very least while running trin in bridge mode set to backfill, we can reasonably expect a client to have 100 - 1000 concurrent utp transfers.

@jacobkaufmann
Collaborator

so that WARN could occur if the local peer sends a FIN, the remote peer ACKs that FIN, but the remote peer does not transmit a FIN of its own. however, that should not occur if both peers use this uTP implementation. do you see that WARN when you execute the test that you wrote here, or does that WARN only appear in trin?

I was curious about the payload size because the send buffer has a preconfigured finite size, but that does not appear to affect you, since that buffer is larger than one million bytes.

@njgheorghita
Contributor Author

That WARN happens when I execute the test in this pr, in the 🔴 cases from the previous comment

@jacobkaufmann
Collaborator

huh, well like I said I don't see that when I execute the tests locally. I can try to make the tests more intense and see if the WARN comes up

@njgheorghita
Contributor Author

Weird, so the test in this pr passes for you locally? It very consistently fails for me. So if it's passing for you, I guess it's time to get some outside confirmation. @ogenev @carver @mrferris can one of you pull down this branch and see if this test passes for you locally?

@njgheorghita
Contributor Author

@jacobkaufmann Any progress on transferring this repo over to the ethereum org? That way we could get CircleCI running, which would help resolve our differing test results in this case.

@jacobkaufmann
Collaborator

sorry I have been out for the past few days. I will create an issue for that today.

@njgheorghita
Contributor Author

@jacobkaufmann Just in case you missed this: now that CI is building, the failing run here shows essentially the same type of error I've been experiencing while using this library in trin.

@jacobkaufmann
Collaborator

hmm, that's strange that the timeout is exceeded so quickly, because the test stage runs for less than one minute in that CI failure.

one thing that could be going on in the test is that the CIDs are overlapping with each other, so that messages for CID (100, 101) and messages for CID (101, 102) are being misdirected to each other or something like that. If so, that would point to a bug in the CID management logic. But to rule that possibility out, it would be good to see whether increasing the gap between consecutive CIDs resolves the CI failure.

@njgheorghita
Contributor Author

@jacobkaufmann IIUC, that's not the problem here. In the test I step_by(2) while incrementing CID pairs, so a sequence of consecutive CID pairs will never overlap, eg (101, 102), (103, 104), (105, 106), etc. I've confirmed locally that this is indeed the sequence of CIDs generated and used in the streams, without any overlapping CIDs.

Imo, increasing the gap b/w consecutive CIDs wouldn't reveal anything: (101, 102), (103, 104), (105, 106) is equivalent to (100, 101), (110, 111), (120, 121).
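As a minimal illustration of that pair generation (plain Rust, not the exact test code):

```rust
// Stepping the counter by 2 guarantees that no id is shared between
// consecutive (send, recv) CID pairs.
let cids: Vec<(u16, u16)> = (101u16..)
    .step_by(2)
    .map(|id| (id, id + 1))
    .take(3)
    .collect();
assert_eq!(cids, vec![(101, 102), (103, 104), (105, 106)]);
```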

The error in this CI build is the exact invalid deadline error I've described in #55, so imo the issue isn't that a validly set timeout was exceeded, but rather that we enter a "degenerative" state in which an invalid timeout is set, breaking delay_queue.insert_at(..).

@emhane mentioned this pull request on May 28, 2023: Cleanup: change the high end of the range to equal the number of concurrent streams
@carver
Contributor

carver commented Sep 23, 2023

I am pretty sure that all the important bits of this were included in #81

@carver closed this Sep 23, 2023