Corrupted network batch on client during stress testing (Received a packet with an invalid Hash Value.) #2893
Comments
Hi @SamTheBay, thank you for letting us know about this issue. To better understand the nature of the issue, can you provide us the transport protocol you are using for your project? If your project is not using UTP (com.unity.transport), can you try it with UTP and see if you are still encountering this issue? |
Yes, I am using UTP (version 2.2.1). Attached is a screenshot of my configuration. Note that I have played around with different sizes for the Max Packet Queue Size and Max Payload Size but they don't make a difference. I have also tried increasing the Spawn Timeout to 10 seconds but that didn't help either. |
@NoelStephensUnity Any ideas? |
A new finding as I am running more tests today. This time, I see the following message pattern... ** Client Receives ** ** Server Sends ** So, we see that the messages 16439399040642367407 and 10211885624164295971 get replayed on the client after ~16 minutes even though the server did not send them a second time. After this event, the server keeps on sending messages with all the updates, but the client stops receiving any of them. It only receives time sync messages from that point on... 2024-04-23 06:23:44 33625) Hash: 15766265757608103513 Size: 24 Batch Count: 1 Messages: (17,3) |
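One way to spot this kind of replay automatically is to compare how often each batch hash appears in the server's send log versus the client's receive log. The following is only a rough sketch: it assumes both logs use the `Hash: <number>` field shown in the log line above, and the function names are made up for illustration, not part of Netcode for GameObjects or Unity Transport.

```python
import re
from collections import Counter

# Assumption: every batch log line contains "Hash: <decimal number>".
HASH = re.compile(r"Hash: (\d+)")

def batch_hashes(lines):
    """Extract the batch hash from every log line that has one, in order."""
    return [int(m.group(1)) for m in map(HASH.search, lines) if m]

def replayed_hashes(server_lines, client_lines):
    """Map each hash the client logged more often than the server sent it
    to a (times_sent, times_received) tuple."""
    sent = Counter(batch_hashes(server_lines))
    received = Counter(batch_hashes(client_lines))
    return {h: (sent[h], n) for h, n in received.items() if n > sent[h]}
```

Comparing counts rather than just looking for duplicates matters here, because two batches with identical contents would legitimately share a hash on both sides.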
@SamTheBay Once I can replicate the issue, then I can add some more debug information to determine where the duplicated messages are coming from. |
@NoelStephensUnity right now my stress test runs my full game with one client and one server. It then continually generates groups of enemies which attack the player and then eventually die. The repro will happen within 15-30 minutes of that running. Let me try hacking it down to something smaller to see if I can get an easier repro to share. Worst case, I can work on making changes so that I can share the whole project without your needing access to the databases. You are correct that I am using the ClientRpc/ServerRpc (and Network Variables and Network Lists). |
@SamTheBay
You should get a bug ticket number via email when it is received and added to the internal QA DB. If you post that here I can get QA to migrate that issue into the NGO internal bug system and then just run the stress test on my end. |
@NoelStephensUnity I have submitted the bug report along with the full project and instructions on how to run the stress test to reproduce the issue. Let me know if you need anything additional from my end. Here is the bug report id... IN-75033 - Corrupt messages using Unity Transport Thanks! |
@SamTheBay |
@NoelStephensUnity were you able to reproduce the issue using the project I sent? |
@SamTheBay |
@NoelStephensUnity @ShadauxCat is there any update on finding the root cause or an ETA for this? I saw from the bug report that it looks like you have successfully reproduced the issue with the project I sent which is great. I am losing players over this one, so I am very anxious for a fix or work around. Let me know if there is anything else I can provide you that would help. Thanks! |
Hi @SamTheBay - I have not been able to start investigating this yet due to other work that has been on my plate, but this bug is a high priority for us. I expect to start investigating it by the end of this week. I've only just joined the Transport team and the subject matter expert is currently on leave, so it may take me some time to really track down what is going on here, but it is the next item on my todo list. |
@ShadauxCat any update? |
I'm actively working on it as we speak, but I don't have any updates to share. I'm unable to reproduce it consistently (I've only managed to reproduce it twice so far in two days of running the stress test almost non stop) so it's slow going trying to get data on what's going on here. I'll let you know as soon as I have anything to share. |
@ShadauxCat any progress? |
@SamTheBay I'm still having a really hard time reproducing it. I don't know if it's something about my hardware causing the network conditions to be different, or the logging I've added to try to track down what's going on changing the timing, or what, but it just won't reproduce for me at all now. If you have time and want to help, it would be really useful if you could pull down a branch with some logging in it, reproduce it (if you can), and then send me the logs from the client the error happened on + the server. (You might have to run it like To get my branch, go to package manager, add from git URL, and use If I can just get a trace of what's going on when the error happens, it would help me immensely in coming up with a fix. |
Ok. I will give it a try soon and let you know. It could be that the tracing itself is changing the timing enough to prevent a repro. We will see if it will repro for me after I switch to your branch. |
Oh, I should warn you, it's a LOT of log output. Please turn off stack traces in your build or it might get to gigabytes of log data. |
Ok. I am getting two errors building after grabbing your git... Library\PackageCache\com.unity.netcode.gameobjects@ca5a270\Editor\AnticipatedNetworkTransformEditor.cs(9,26): error CS0246: The type or namespace name 'AnticipatedNetworkTransform' could not be found (are you missing a using directive or an assembly reference?) Any idea how I should fix these? |
That is unusual. This branch is created directly off of the 1.9.1 release branch with just some log lines added, and AnticipatedNetworkTransform is definitely there in the branch. Do you maybe have a local version of the repo embedded in your project that's conflicting with the one added through package manager? If not, maybe try closing the editor, deleting your library directory, and launching again to clear the cache? |
Oh... yes. I have a local version from my own debugging. My bad. Let me remove it and try again. |
If you aren't able to get it with that branch, can you try |
And if that one also doesn't work, try |
Another thing that would be valuable to know - you said you reproduced it with 1.7, can you also reproduce it with 1.6? There were some changes in areas that might be relevant to this issue in 1.7. If it's not reproducible in 1.6, that at least would narrow down the places I need to look to figure out what's happening. |
So far, I haven't been able to repro it with the heavy logging. I will try with your lighter logging tomorrow and I will also try with 1.6 and report back. |
I have some coworkers who have been able to reproduce the issue with another project (I haven't been able to reproduce it on my hardware, but they have). I had them test 1.6 yesterday and they confirmed it still happens with 1.6, so you can skip that test. I added some new logs to the "even lighter" branch this morning. I think that branch is the best bet right now. I think it has the minimum amount of logs possible to still give me the info I need. So I'd say focus on that one today. We're doing some tests internally using said coworkers' project as well, so hopefully between you and them, I'll be able to get what I need to start making progress on a fix. |
Great. I have been trying to repro it today with my base build but haven't been successful so far :-(. I am running experiments to see if I can find a way to make it repro more reliably. If I manage that then I will try running your lightweight logging. |
@ShadauxCat were you able to make any progress based on the repro from another project? I have been struggling to reliably repro it now on my side and don't know what changed. If the other project gives the information you need then I might not need to keep experimenting on my side. |
@ShadauxCat checking in once more. Is a repro from me still important given you have one from another project? |
Hey, sorry, I've been out of office this week. Someone else has been continuing the investigation, though, and I saw a message from them this morning that they finally managed to isolate the cause. Now that we know exactly what the logic error is, we can hopefully make fast progress on putting together a fix. |
This is amazing news! Thanks for the update. If you come up with a fix in a private branch then let me know, since I would be anxious to adopt it if the full release process takes longer. |
@ShadauxCat |
@SamTheBay @ShadauxCat |
Hi @SamTheBay and @COV-KaiGai I believe we've found a fix for this. Initial testing looks good so far. Unfortunately I can't give you a branch to test on because the fix is in the transport package, which doesn't have a public repo, but we'll get a release out as fast as we can and I'll let you know when that lands. |
@ShadauxCat any update on the release for this? Will it be released as a new package version for the Unity Transport? Like version 2.2.2 for example? |
@SamTheBay Transport version 2.3.0 has released today that should have a fix for this issue. Can you retest and confirm for us? |
This is great! I have upgraded and deployed a new build out to my players. I will wait a few days to see if there are any new repros. |
@ShadauxCat I have not heard of a repro for a couple weeks now so I think this looks resolved. Thanks for the support. I will go ahead and close it. |
Description
I am running a single client process and a single server process both on my local machine. I am running a stress test using my game which has hundreds of network objects with many network variables and also does a large number of RPCs. After about 15-30 minutes I will reproduce an issue where the client gets stuck and stops receiving any new network batches from the server. The client gives the following error log at that time...
"Received a packet with an invalid Hash Value. Please report this to the Netcode for GameObjects..."
I have instrumented the networking code to log out information about every batch on the client and the server. The result shows that the client receives a batch in which 20% of it has been reset to 0 instead of the expected messages. Here is an example...
Batch sent by the server
2024-04-20 10:24:29 30199) Hash: 2898211147757380501 Size: 5728 Batch Count: 111 Messages:(10,34)(10,48)(10,46)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,46)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,36)(10,45)(10,49)(10,49)(10,50)(10,45)(10,45)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,50)(10,49)(10,45)(10,46)(10,49)(10,45)(10,38)(10,45)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,45)(10,36)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,50)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,50)(10,49)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(5,56)(5,57)(5,57)(5,57)(5,57)(5,57)(5,69)(5,69)
Batch received by the client
2024-04-20 12:53:18 30195) Hash: 2898211147757380501 Size: 5728 Batch Count: 111 Messages: (10,34)(10,48)(10,46)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,46)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,36)(10,45)(10,49)(10,49)(10,50)(10,45)(10,45)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,50)(10,49)(10,45)(10,46)(10,49)(10,45)(10,38)(10,45)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,45)(10,36)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)
You will notice that the batch's hash matches (this is the hash sent by the server, not the one computed on the client, which doesn't match). However, towards the end of the data on the client, the messages all turn to type 0, length 0, which is obviously wrong.
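For anyone wanting to compare such log lines mechanically, here is a minimal sketch that parses two batch lines in the format above and reports where the received batch first differs from the sent one. It assumes each entry in the `Messages:` list is a `(type,length)` pair; the function names and the shortened sample lines (including their hash and size values) are made up for illustration and are not real log output.

```python
import re

# Assumption: header fields and "(type,length)" pairs are printed as in the logs above.
HEADER = re.compile(r"Hash: (\d+) Size: (\d+) Batch Count: (\d+)")
PAIR = re.compile(r"\((\d+),(\d+)\)")

def parse_batch(line):
    """Return (hash, size, count, [(type, length), ...]) for one batch log line."""
    header = HEADER.search(line)
    if header is None:
        raise ValueError("not a batch log line")
    batch_hash, size, count = (int(g) for g in header.groups())
    # Only look for pairs after "Messages:" so the timestamp is never misread.
    _, _, tail = line.partition("Messages:")
    messages = [(int(t), int(n)) for t, n in PAIR.findall(tail)]
    return batch_hash, size, count, messages

def first_divergence(sent, received):
    """Index of the first message that differs, or None if the lists are identical."""
    for i, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return i
    if len(sent) != len(received):
        return min(len(sent), len(received))
    return None

def zeroed_tail(messages):
    """Number of trailing (0,0) entries in a message list."""
    n = 0
    for m in reversed(messages):
        if m != (0, 0):
            break
        n += 1
    return n

# Hypothetical, shortened lines in the same shape as the logs above.
server_line = "10:24:29 1) Hash: 42 Size: 24 Batch Count: 4 Messages:(10,34)(10,48)(5,56)(5,57)"
client_line = "12:53:18 1) Hash: 42 Size: 24 Batch Count: 4 Messages: (10,34)(10,48)(0,0)(0,0)"

sent = parse_batch(server_line)[3]
received = parse_batch(client_line)[3]
print(first_divergence(sent, received))  # 2
print(zeroed_tail(received))             # 2
```

Run against the full lines, the divergence index together with the message lengths before it would give an approximate byte offset at which the zeroed region starts, which might help correlate the corruption with a buffer or packet boundary.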
Note that the server seems to be totally healthy. If multiple clients are connected, the other clients continue playing fine. Furthermore, disconnecting and reconnecting the affected client temporarily resolves the issue (without restarting the server).
Reproduce Steps
I have not been able to figure out a deterministic set of steps to reproduce the issue. It seems to be a timing condition in the network transport. Even under an abusive workload it can take 30 minutes to reproduce. However, my players are hitting this regularly on my production servers, which serve many different client PCs at varying latencies from the server.
My project is large and not easy to set up, so I am not sure it is practical for me to share my local repro. However, if you have any experiments you would like me to try, I can run them and report back. Furthermore, I could jump on a call and do pair debugging if you want to look at a repro in the debugger.
Actual Outcome
The client becomes stuck and has to be disconnected from and reconnected to the server to recover. This is game-breaking because the game uses permadeath mechanics, so the disconnect is very dangerous.
Expected Outcome
There should not be any combination of events that leads to the client receiving a corrupted batch that it cannot process.
Environment
Additional Context
If having the full logs of messages between the client and server would be helpful, I can provide them.