
Increase tested transfer count and data size #45

Closed
wants to merge 11 commits

Conversation

njgheorghita
Contributor

Ok, well I think I've been able to track down at least one of the degenerative states we've been experiencing w/ utp inside trin. But, first a little context...

  • During some debugging with @mrferris yesterday, we were observing that not a single fin packet was being sent from the sender -> receiver. This behavior persisted until, for an unrelated reason, I restarted my terminal, after which we started to observe fin packets being sent. I have no explanation for this 🤷
  • However, after this, we still observe degenerative behavior when running a bridge node against a single trin node. I'd estimate that ~30% of accept'ed content actually completes the offer/accept flow. If we gossip only headers (eg relatively small data chunks) then we see 100% successful completions. Once we start gossiping bodies / receipts (eg relatively large data chunks) then we start to see the degenerative state. This implies that there is some bug when running many concurrent utp transfers.

I've updated the socket test to initiate 50 offer/accept flows for a relatively large piece of data. The error messages here are familiar from the ones we've been seeing while debugging the live testnet. I'm happy to continue debugging this test failure and find the cause, but first some input from @jacobkaufmann would be nice, to confirm that the test is correctly formed and that you can reproduce the failure.
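For reference, the updated test has roughly the following shape (a minimal sketch modeled on this library's existing socket test; the module paths, the `ConnectionId` fields, and method names like `accept_with_cid`, `connect_with_cid`, `read_to_eof`, and `close` are assumptions and may differ slightly from the actual API):

```rust
// Sketch: N concurrent offer/accept flows over one pair of loopback uTP
// sockets, each transferring a large payload. Identifiers below follow the
// existing socket test in this repo, but treat them as assumptions.
use std::net::SocketAddr;
use std::sync::Arc;

use utp_rs::cid::ConnectionId;
use utp_rs::conn::ConnectionConfig;
use utp_rs::socket::UtpSocket;

const NUM_TRANSFERS: u16 = 50;
const PAYLOAD_SIZE: usize = 1_000_000;

#[tokio::test]
async fn many_concurrent_transfers() {
    let recv_addr: SocketAddr = "127.0.0.1:3400".parse().unwrap();
    let send_addr: SocketAddr = "127.0.0.1:3401".parse().unwrap();
    let recv = Arc::new(UtpSocket::bind(recv_addr).await.unwrap());
    let send = Arc::new(UtpSocket::bind(send_addr).await.unwrap());

    let mut handles = Vec::new();
    // Step by 2 so consecutive (send, recv) CID pairs never share an id.
    for id in (100u16..100 + 2 * NUM_TRANSFERS).step_by(2) {
        let data = vec![0xa5u8; PAYLOAD_SIZE];
        let expected = data.clone();

        let recv_cid = ConnectionId { send: id, recv: id + 1, peer: send_addr };
        let send_cid = ConnectionId { send: id + 1, recv: id, peer: recv_addr };

        // Receiver side: accept the connection and read the whole payload.
        let recv_socket = Arc::clone(&recv);
        handles.push(tokio::spawn(async move {
            let mut stream = recv_socket
                .accept_with_cid(recv_cid, ConnectionConfig::default())
                .await
                .unwrap();
            let mut buf = vec![];
            stream.read_to_eof(&mut buf).await.unwrap();
            assert_eq!(buf, expected);
        }));

        // Sender side: connect, write the payload, then close (send FIN).
        let send_socket = Arc::clone(&send);
        handles.push(tokio::spawn(async move {
            let mut stream = send_socket
                .connect_with_cid(send_cid, ConnectionConfig::default())
                .await
                .unwrap();
            stream.write(&data).await.unwrap();
            stream.close().await.unwrap();
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}
```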

@njgheorghita
Contributor Author

@jacobkaufmann Have you had the chance to look at this? Not necessarily looking for a fix, but more a confirmation that the test is correctly formed and exposes a bug that needs fixing.

@jacobkaufmann
Collaborator

apologies for the delay. I ran the test based on your code here, which appears correctly formed, but I don't see any ERROR logs. can you specify the error message(s) that you refer to above? how large (in bytes) are the payloads for bodies and receipts?

@njgheorghita
Copy link
Contributor Author

By error I guess I actually meant WARN logs wrt the utp library (the error logs are generated within trin):

2023-05-08T13:54:22.659212Z  WARN uTP{send=172 recv=173}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.659465Z  WARN uTP{send=102 recv=103}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.659783Z  WARN uTP{send=100 recv=101}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.660256Z  WARN uTP{send=196 recv=197}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.659212Z  WARN uTP{send=170 recv=171}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.685668Z  WARN uTP{send=104 recv=105}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.889021Z  WARN uTP{send=168 recv=169}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.891754Z  WARN uTP{send=166 recv=167}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.891886Z  WARN uTP{send=128 recv=129}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.892672Z  WARN uTP{send=160 recv=161}: utp_rs::conn: idle timeout expired, closing... unacked=[]
2023-05-08T13:54:22.893107Z  WARN uTP{send=176 recv=177}: utp_rs::conn: idle timeout expired, closing... unacked=[]

re bodies / receipts payload size: running a quick check on recent content values, the largest payload I'm seeing is approx 250_000 bytes out of a sample size of around 25 blocks. Maybe you're asking so we can confirm whether or not this test exercises realistic bounds of the library as it's used in the wild? If so, fair enough: 1_000_000 bytes (as used in these tests) is probably excessive. However, I'm not sure whether the cause of the problem here is payload size or the # of concurrent transfers, since a larger payload means each transfer takes longer and therefore more transfers are active at once. Adjusting the test parameters for # of concurrent transfers (50/100/1000/10000) and payload size in bytes (10_000, 100_000, 250_000, 500_000), I got the following pass/fail results:

| payload (bytes) | 50 txs | 100 txs | 1000 txs | 10000 txs |
| --- | --- | --- | --- | --- |
| 10_000 | 🟢 | 🟢 | 🔴 | 🔴 |
| 100_000 | 🟢 | 🔴 | - | - |
| 250_000 | 🟢 | 🔴 | - | - |
| 500_000 | 🔴 | - | - | - |
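(For context, the matrix above comes from re-running the test with each parameter combination; conceptually it's a sweep like the hypothetical sketch below, where `run_transfers` merely stands in for the offer/accept test body in this PR and is not a real helper in the repo.)

```rust
// Hypothetical parameter sweep over the test body; `run_transfers` is a
// stand-in name, not an actual function in this repository.
for count in [50usize, 100, 1_000, 10_000] {
    for payload_size in [10_000usize, 100_000, 250_000, 500_000] {
        run_transfers(count, payload_size).await;
    }
}
```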

I'm having trouble collecting good data on how many utp transfers we can realistically expect to be running at the same time in trin... it's certainly something I can put more effort towards if you think it's useful. But my intuition is that, at the very least while running trin in bridge mode set to backfill, we can reasonably expect a client to have 100 - 1000 concurrent utp transfers.

@jacobkaufmann
Collaborator

so that WARN could occur if the local peer sends a FIN, the remote peer ACKs that FIN, but the remote peer does not transmit a FIN of its own. however, that should not occur if both peers use this uTP implementation. do you see that WARN when you execute the test that you wrote here, or does that WARN only appear in trin?

I was curious about the payload size because the send buffer has a preconfigured finite size, but that does not appear to affect you, since that buffer is larger than one million bytes.

@njgheorghita
Contributor Author

That WARN happens when I execute the test in this pr, in the 🔴 cases from the previous comment

@jacobkaufmann
Collaborator

huh, well like I said I don't see that when I execute the tests locally. I can try to make the tests more intense and see if the WARN comes up

@njgheorghita
Contributor Author

Weird, so the test in this pr passes for you locally? It very consistently fails for me. So if it's passing for you, I guess it's time to get some outside confirmation. @ogenev @carver @mrferris can one of you pull down this branch and see if this test passes for you locally?

@njgheorghita
Contributor Author

@jacobkaufmann Any progress on transferring this repo over to the ethereum org? That way we could get CircleCI running, which would help resolve our differing test results in this case.

@jacobkaufmann
Collaborator

sorry I have been out for the past few days. I will create an issue for that today.

@njgheorghita
Contributor Author

@jacobkaufmann Just in case you missed this: now that CI is building, the failing run here shows essentially the same type of error I've been experiencing while using this library in trin.

@jacobkaufmann
Collaborator

hmm, that's strange that the timeout is exceeded so quickly, because the test stage runs for less than one minute in that CI failure.

one thing that could be going on in the test is that the CIDs are overlapping with each other, so that messages for CID (100, 101) and messages for CID (101, 102) are being misdirected to each other or something like that. If so, that would point to a bug in the CID management logic. But to rule that possibility out, it would be good to see whether increasing the gap between consecutive CIDs resolves the CI failure.

@njgheorghita
Contributor Author

@jacobkaufmann IIUC, that's not the problem here. In the test I step_by(2) while incrementing CID pairs, so a sequence of consecutive CID pairs will never overlap, eg (101, 102), (103, 104), (105, 106), etc. I've confirmed locally that this is indeed the sequence of CIDs generated and used in the streams, without any overlapping CIDs.

Imo, increasing the gap b/w consecutive CIDs wouldn't reveal anything: (101, 102), (103, 104), (105, 106) is equivalent to (100, 101), (110, 111), (120, 121).
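As a minimal illustration of that pair generation (plain Rust, not the exact test code):

```rust
// Stepping the counter by 2 guarantees that no id is shared between
// consecutive (send, recv) CID pairs.
let cids: Vec<(u16, u16)> = (101u16..)
    .step_by(2)
    .map(|id| (id, id + 1))
    .take(3)
    .collect();
assert_eq!(cids, vec![(101, 102), (103, 104), (105, 106)]);
```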

The error in this CI build is the exact invalid deadline error I've described in #55, so imo the issue isn't that a validly set timeout was exceeded, but rather that we enter a "degenerative" state in which an invalid timeout is set, breaking delay_queue.insert_at(..).

@emhane mentioned this pull request on May 28, 2023: Cleanup: change the high end of the range to equal the number of concurrent streams
@carver
Contributor

carver commented Sep 23, 2023

I am pretty sure that all the important bits of this were included in #81

@carver closed this Sep 23, 2023