Lag with challenge response #87

Open
omair2084 opened this Issue Oct 9, 2016 · 46 comments

Projects

None yet

7 participants

@omair2084
omair2084 commented Oct 9, 2016 edited

I have an issue which I probably think is to do with the new client. (Since my server worked fine with the 1.7 clients)

The issue: When some clients connect the server, the server basically halts and every player gets a Connection Interrupted. On the server console, a hitch warning between 3-5 seconds is displayed.

After a lot of debugging I finally caught the packet which is causing the issue.

This is the ASCII dump of the packet from wireshark

....getchallenge 0 "406482d1313a35fde8f5e3fad5045d10"
....challengeResponse -1333251438 0 0 xproto
....challengeResponse -1333251438 0 0 xproto
....connect "\pbguid\406482d1313a35fde8f5e3fad5045d10\xver\16.1\qport\31844\challenge\-1333251438\protocol\16\cg_predictItems\1\cl_anonymous\0\cl_punkbuster\0\cl_voice\1\cl_wwwDownload\1\rate\25000\snaps\30\name\Debugger"
....connectResponse 
....connectResponse 
....d|.................
....d|.............
...............................
...............................
...............................C.@.....waiting.
...............................C.@.....waiting.
............................
............................
...............................
...............................
...............................C.@.....waiting.
...............................C.@.....waiting.
............................
............................
...............................
...............................
...............................C.@.....waiting.
...............................C.@.....waiting.

The following lines repeat over two hundred thousand lines (as seen in notepad) and then the client finally connects.

............................
............................
...............................
...............................
...............................C.@.....waiting.
...............................C.@.....waiting.

To add a little more, if it matters, the network is a flat /8 subnet. Meaning, the server is hosted on 10.0.0.2/255.0.0.0 and the client connects from 10.1.2.3/255.0.0.0. Where the router is basically configured as 10.0.0.1/255.0.0.0
FYI, happens to clients that connect from the internet as well.

@IceNinjaman
Collaborator

Can you maybe better screenshot the Wireshark dump you have there please?
Then the colors are at least remaining.

@omair2084

screenshot from 2016-10-11 03-07-31
Attached the screenshot.

@IceNinjaman
Collaborator

The Server sends out a lot of these packets without receiving something from the client?

@omair2084

Yes that seems to be the case.
I have shifted over to a better system. Let's see if it was actually the CPU or some VM resource choking.

@IceNinjaman
Collaborator

Well from observing your screenshot there seems to be something terrible wrong. Can you please upload the pcap file too?

@art567
art567 commented Oct 18, 2016 edited

We have pcap file from our server (COD4X 1.8 linux - i386 build 1485 9 Sep 2016)
Link: https://www.sendspace.com/file/lec0d4

There is a lot repeating packets like this, and it is very bad for our network infrastructure

wireshark_2016-10-18_21-48-34

@omair2084

Stream number 15 on the pcap that art567 shared, is the same as what I get.

@IceNinjaman
Collaborator
IceNinjaman commented Oct 20, 2016 edited

Okay well in case you want to debug it your self - please examine why the function SV_SendClientGamestate() keeps getting called very fast without client has sent a packet. I assume this function getting called multiple of times when it shouldn't happen so fast.
If you don't really know how to do it or you don't want to do it please find a way to give me control on that server so I can work with debugger on it on my own.

@art567
art567 commented Oct 20, 2016

@IceNinjaman Do you need real server to attach debugger or using test will be enought? We dont have a clue why it's happening and don't have proper condititions to 100% reproduce lag every time. Only several playerconnections could cause this bug. We need some automation for gdb/x32dbg in this case.

@IceNinjaman
Collaborator

So for your wireshark dump you did many attempts to get it?

@art567
art567 commented Oct 21, 2016

At least 1 of 5 was lag-positive with peak packetrate ~100k pkt/s , we tried to catch it using tcpdump and router's graphing. The server is installed on Ubuntu 14.04.5 LTS x86_64.
One of lag-positive dump is here.

@omair2084
omair2084 commented Oct 22, 2016 edited

Art is correct, this only happens when the server is under some load and not to every player that connects.
Since, I moved to a new server the lag has dropped from 4000msec to 600msec when a user connects. I guess the same issue remains, but since the processor is better it handles all these faster.
If you want to debug, I could give you credentials to connect. But if our time difference is huge and the number of players are not many during your convenient time, I don't think the issue will show up. I am GMT +5.30

Would it make sense if you could add some debug commands in the cod4x source and supply a patch. I could run that and send you the logs.

@art567
art567 commented Oct 23, 2016

We could provide all necessary to debug the server (including tools and credentials), It's possible to make a fork of live VM we're using and where we experience lags. We're also have time difference GMT+5.00 but i'm not limited at the time every weekend.

@IceNinjaman
Collaborator

Since the issue as you say can not be reproduced it is a lot more difficult to debug. Indeed it would only work by collecting statistics due to build in debug facilities. However that is going a time eating procedure since there will be many patches necessary i think.

@art567
art567 commented Oct 23, 2016

Maybe we just inject debugging markers using Com_PrintfNoRedirect() and watch console then?

@IceNinjaman
Collaborator

This is alone is not really efficient I think. It need to be counted or something similar and showing a message when too many hits come up in a second.

@IceNinjaman
Collaborator

Please get latest version and compile it. Please enable before compiling it debugging here: https://github.com/callofduty4x/CoD4x_Server/blob/master/src/sv_client.c#L22

@art567
art567 commented Oct 24, 2016

@IceNinjaman Thank you for support and #70a5308 Testing it now

@IceNinjaman
Collaborator

I forgot to add: Logging has to be enabled "set logfile 2". Output goes to debug_qconsole.log
Maybe the 80 on this line has to be adjusted lower: https://github.com/callofduty4x/CoD4x_Server/blob/master/src/sv_client.c#L1613

@art567
art567 commented Oct 24, 2016

It has compilation issues on Ubuntu LTS 14.04.5 x86_64: http://i.imgur.com/wlogiET.png
Debian Jessie i686 compiled successfully, maybe compiler version matters and it does a trick.

New binary installed for test on target system + config updated, we will report our test results and console logs soon

@omair2084

I have got it compiled and working. But it might take me a while before I can post some logs.

@omair2084

Currently I have put the value as 20.
This is what I am getting

Hitcount exceeded 80 in SV_SendClientGameState for client 6 Count 28
---------- Begin Backtrace ----------
10: .(SV_SendClientGameState+0xb4) [0x208c03d]
9: .(SV_ExecuteClientMessage+0x11e) [0x208c5f3]
8: .(NET_UDPPacketEvent+0x77) [0x2050f08]
7: .(NET_Event+0x7f) [0x20a7e8e]
6: .(NET_Sleep+0x1cc) [0x20a80fe]
5: .(SV_Frame+0x393) [0x209b90d]
4: .(Com_Frame+0x80) [0x20285ad]
3: .(Sys_Main+0x90) [0x20a32fa]
2: .(main+0x187) [0x20aae62]
1: /lib/libc.so.6(__libc_start_main+0xf6) [0xf734b5a6]
0: .() [0x2024063]

Tomorrow I should get more when people start connecting in the morning.

@IceNinjaman
Collaborator

Hmm well 28 isn't that much and would not explain that amount of packets. You should see if every server lag generated by connecting does result in logfile entries. If not it is probably the wrong function.

@art567
art567 commented Oct 27, 2016

Here is our debug stats:

Hitcount exceeded 80 in SV_SendClientGameState for client 9 Count 166
---------- Begin Backtrace ----------
10: /opt/games/cod4(SV_SendClientGameState+0xb4) [0x208ba56]
9: /opt/games/cod4(SV_ExecuteClientMessage+0x120) [0x208bff3]
8: /opt/games/cod4(NET_UDPPacketEvent+0x77) [0x2050f05]
7: /opt/games/cod4(NET_Event+0x7f) [0x20a75ee]
6: /opt/games/cod4(NET_Sleep+0x1e5) [0x20a7875]
5: /opt/games/cod4(SV_Frame+0x2e6) [0x209af87]
4: /opt/games/cod4(Com_Frame+0x80) [0x202855c]
3: /opt/games/cod4(Sys_Main+0x90) [0x20a29e8]
2: /opt/games/cod4(main+0x187) [0x20aa5de]
1: /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0xf743daf3]
0: /opt/games/cod4() [0x202406e]
-------- Backtrace Completed --------

There is highest value we reached yesterday

@IceNinjaman
Collaborator

Well still not really explaining the 100k packets/s
In this case it appears there was indeed an incoming packet. Did this log entry indeed generated on same time as the lag? Anyway I will make an update.

@IceNinjaman
Collaborator

I have added a new commit. Please use this new version. Please enable the define in qshared.h on line 37.
Delete old logfile. Then upload the full log file please once there are enough information logged.

@omair2084
omair2084 commented Oct 30, 2016 edited

Attached is the log for the past 8-10 hours. Updated the log file.

debug_qconsole.zip

@IceNinjaman
Collaborator

Okay please check with the latest changes again. I need the new logfile.

@omair2084

Great. Thanks, I will do that.
I have currently moved the hosting to AWS. It seems to provide better performance.
Can you tell me if the Packets Per Second for a 40 player server, at peak times goes to 100,000 pps (IN) and 40,000 (Out)? This is what I see on the Cloud Dashboard, I am just wondering if this is normal.

@IceNinjaman
Collaborator
IceNinjaman commented Nov 2, 2016 edited

IN: 40players * 333FPS = 13320pps
OUT: 40players * 20ticks = 800pps

That is what I would expect in averange

If you provide stuff like WWW-Download idk it.

@florczakraf
Contributor

I believe there's something wrong with your calculations (or my way of getting pps). Here's screenshot of my stats (over 120 players across the servers):

@art567
art567 commented Nov 2, 2016

This is our packetrate peaks, as you could see it is a hundreds times higher than usual
Almost the same event was stored in pcap file i've included above in this thread.

img

(Values/metrics are listed on graphs)

There is a logs archive from debug build 1545:
logs_2016-11-03.zip

@IceNinjaman
Collaborator

Well run the new code again please. You have to uncomment debug in q_shared.h

@omair2084

Since moving to the new server on AWS I don't see any Hitch warnings on the server. I am assuming there will be no debug log produced.

@slhosting
slhosting commented Nov 12, 2016 edited

Hi,
I have a same problem with cod4x 1.8 linux server file.
I use version "CoD4 X 1.8 linux-i386 build 1060 Sep 1 2016", and sometimes when new player connect, the server send a lot of packages (~100K PPS) to new connection. Unfortunately when the client get these huge packets, lots of people freezing from internet... (so they client site router/modem are crashing from huge pps). I have checked what was the package lengths what the server sent. These were the lengths in tcpdump capture: (89,70,73). I tried to filter with iptables, and i could eliminated these huge packets, and now the clients router/modem freezing are stopped. But this is not a 100% solution, because when the iptables dropping these bad packets the CPU usage is grow to very high and the game server start to lagging until the iptables drop working.

My firewall with iptables is working fine... i dropped only the faulty package when client connecting, so if the client connected fine and playing on the server the iptables doesn't filtering anymore for this client.

So actually I don't know what should i do for resolve this problem.

Anybody have a good idea?
Where does this connection problem come from?
And how can i resolve it?

@IceNinjaman
Collaborator

Please run the latest code with debugging enabled as mentioned before. I need the resulting logfile.

@D4edalus D4edalus added the bug label Nov 20, 2016
@Kingwolf123

Yes Windows Build having this issue. Packet rate increases causing CPU usage to spike. Which results in Server being Stuck and players teleporting

@slhosting

Someone is working on the problem?

@IceNinjaman
Collaborator

Sorry I didn't. However I did now. From what I can see this logfiles show counts for all clients instead a single one because of a function parameter mixup in SV_SetupReliableMessageProtocol() --> ReliableSetup(). So they are not useful at all.
Please do the procedure again with updated sourcecode.

@slhosting

Dear IceNinjaman,

I enabled the debugging, but I nothing see in the debug_qconsole.log what is important for us.
"SV_SpawnServer Begin" AND "SV_SpawnServer Ended" is not important because, the server never lagged while changing map. But if new player connecting to server there are big lagging ("teleporting"), to all people how are playing on the server, but just, when somebody (new player) connecting. (It takes about 10 seconds)

I checked this strange phenomenon and i saw in the TCPDUMP that, the server send lots of packets to the connecting client in a short time and the client can not answer because channel was overloaded. It's like that the server sending packet in while(1) cycle to connecting client. In my opinion this is the reason that the client can not answer to server and everybody teleporting on the server while server sending huge packets to connecting client.

Can you checked that what happening in the server site when new player starting to connect ?
For example: What kind of packets set to client? How many packets sent by server? and when?

Thanks a lot!

@IceNinjaman
Collaborator

debug_qconsole.log
Should contain packet limit exceeded messages. At least for the message type. 30 packets per minute for each player are still normal. Can you please make a tcpdump log when this issue is ongoing? Shouldn't be possible to see 200packets per second with getting messages in debug log file unless it is a different packet type.

@slhosting

Hi,

This capcture.cap is a tcpdump from new player connection.
You can see in this capcture that:

  1. client send request to server for connection.
  2. server responds to the client
  3. client responds to the server
    ...
    7. server starting to send lots of packets to client without stop. (~100.000 packets)

After that:
Client side has been frozen for lots of packets.

Unfortunately this capture was made from version "CoD4 X 1.8 linux-i386 build 1060 Sep 1 2016"
But i think this phenomenon It has't been fixed yet in in the newer versions.

I don't know that what was changed in newest version "CoD4 X 1.8 linux-i386 build 1726 Dec 10 2016" because I started to check yesterday and I have't meet with this phenomenon yet.
If I meet with it, I will take new tcpdump, and I will send for you. (I need more time for this.)

Can you check that this packet sending has been limited already in new code or not?

capture.zip

@art567
art567 commented Dec 26, 2016

Is anybody fixed this using iptables ?

@slhosting

Here is the logfile what u need.

debug_qconsole.log.txt

@IceNinjaman
Collaborator

Okay just looked at this again. A packet capture for version 1060 is not really useful. The packet which gets sent at a high rate in version 1060 shouldn't get out at such a rate without generating logdata. At least in my understanding it shouldn't be possible. From your current logfile I don't see much suspect stuff in it at all except Hitch warning few times at places when they shouldn't occur. But it isn't
too heavy either. Also for that you have to ensure to run a single instance of server on dedicated hardware so load from other processes running can be excluded. Still I would have no idea from where it comes from. Could be even script.

@IceNinjaman IceNinjaman was assigned by D4edalus Jan 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment