Lost timestamp packets at high data rates #91

Closed
tristan-ovh opened this issue Aug 11, 2015 · 19 comments

@tristan-ovh

Hi,
I am using MoonGen to fill a 10 Gb/s link with TCP SYN flooding, and I try to measure the latency using the measureLatency function.
But most or all timestamp packets are lost when the load is high. I use two ports directly connected by a cable, so the loss is not due to external equipment.
When counting packets, I see the same number of sent and received packets.
I have set up filters so that the receiving loop only gets PTP packets. This raises the load the measurement can sustain, but I still lose most or all timestamp packets at full link rate.
I do not know whether packets are lost during sending or reception.
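
For reference, my measurement loop follows the usual MoonGen timestamping pattern; a sketch with device/queue setup omitted and names illustrative:

```lua
local dpdk = require "dpdk"
local ts   = require "timestamping"
local hist = require "histogram"

-- txQueue/rxQueue: the queues reserved for timestamping (configured elsewhere)
local timestamper = ts:newTimestamper(txQueue, rxQueue)
local latencyHist = hist:new()
while dpdk.running() do
	-- measureLatency() sends one timestamped packet and waits for it on rxQueue
	latencyHist:update(timestamper:measureLatency())
end
latencyHist:save("histogram.csv")
```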

Do you have an idea of what could cause the problem?

@emmericp
Owner

It shouldn't lose packets in that scenario.

I can't reproduce this with the l3-load-latency example which still works at full line rate as expected.

Command line: ./build/MoonGen examples/l3-load-latency.lua 14 15 0 1 60
(with commit 6d6cc3b, which fixes this script when used with small packets; otherwise use the default packet size)

Can you post the script you are using?

BTW: what you are trying to do is probably not a good idea. Latency at full line rate is often problematic: every small pause causes buffers to fill up, and these buffers can never be emptied because packets keep coming in at the same rate at which you can send them out.

After a short time, the latency will simply be a function of the buffer size.
Well, unless you are testing a hardware device, which usually doesn't have this problem.
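
(For a sense of scale: a buffer of, say, 512 KiB that never drains adds about 512 · 1024 · 8 bit / 10 Gbit/s ≈ 420 µs of queueing delay, no matter how fast the device behind it is. The buffer size here is purely illustrative.)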

@tristan-ovh
Author

I do reproduce the bug with the example script: MoonGen /opt/moongen/examples/l3-load-latency.lua 0 1 0 1 60

[Device: id=0] Sent 14878234 packets, current rate 14.88 Mpps, 7617.60 MBit/s, 9998.10 MBit/s wire rate.
[Device: id=1] Received 14878826 packets, current rate 14.88 Mpps, 7617.63 MBit/s, 9998.14 MBit/s wire rate.
[Device: id=0] Sent 29758621 packets, current rate 14.88 Mpps, 7618.75 MBit/s, 9999.61 MBit/s wire rate.
[Device: id=1] Received 29759202 packets, current rate 14.88 Mpps, 7618.75 MBit/s, 9999.61 MBit/s wire rate.
^C[Device: id=0] Sent 42092496 packets with 2693919744 bytes payload (including CRC).
[Device: id=0] Sent 14.880372 (StdDev nan) Mpps, 7618.750699 (StdDev nan) MBit/s, 9999.610292 (StdDev nan) MBit/s wire rate on average.
[Device: id=1] Received 42092496 packets with 2693919744 bytes payload (including CRC).
[Device: id=1] Received 14.880374 (StdDev nan) Mpps, 7618.751910 (StdDev nan) MBit/s, 9999.611721 (StdDev nan) MBit/s wire rate on average.
Samples: 0, Average: nan ns, StdDev: 0.0 ns, Quartiles: nan/nan/nan ns
Saving histogram to 'histogram.csv'

@emmericp
Owner

Did you update to 6d6cc3b or later?
Timestamping was broken with small packets in this specific script, independent of the rate.

Try to update or use the default packet size (124).

@tristan-ovh
Author

Thank you for your answers by the way :)
I am on the most up-to-date version, and it works fine at lower rates: MoonGen /opt/moongen/examples/l3-load-latency.lua 0 1

[Device: id=0] Sent 1953096 packets, current rate 1.95 Mpps, 1999.92 MBit/s, 2312.41 MBit/s wire rate.
[Device: id=1] Received 1953183 packets, current rate 1.95 Mpps, 1999.92 MBit/s, 2312.41 MBit/s wire rate.
[Device: id=0] Sent 3907147 packets, current rate 1.95 Mpps, 2000.94 MBit/s, 2313.59 MBit/s wire rate.
[Device: id=1] Received 3907269 packets, current rate 1.95 Mpps, 2000.94 MBit/s, 2313.59 MBit/s wire rate.
[Device: id=0] Sent 5861205 packets, current rate 1.95 Mpps, 2000.95 MBit/s, 2313.60 MBit/s wire rate.
[Device: id=1] Received 5861327 packets, current rate 1.95 Mpps, 2000.95 MBit/s, 2313.60 MBit/s wire rate.
^C[Device: id=0] Sent 6496110 packets with 831501952 bytes payload (including CRC).
[Device: id=0] Sent 1.954047 (StdDev 0.000006) Mpps, 2000.944480 (StdDev 0.004315) MBit/s, 2313.592055 (StdDev 0.005216) MBit/s wire rate on average.
[Device: id=1] Received 6496110 packets with 831501952 bytes payload (including CRC).
[Device: id=1] Received 1.954047 (StdDev 0.000006) Mpps, 2000.944186 (StdDev 0.006130) MBit/s, 2313.591715 (StdDev 0.007087) MBit/s wire rate on average.
Samples: 2320, Average: 254.4 ns, StdDev: 18.4 ns, Quartiles: 243.2/256.0/268.8 ns
Saving histogram to 'histogram.csv'

@tristan-ovh
Author

Whoops, no, I was not on the latest version. It seems to work better now!

@tristan-ovh
Author

I still have problems with my own script but not with the example. I will see what you have changed. Thank you.

@emmericp
Owner

My recent changes shouldn't affect your code.

@tristan-ovh
Author

Indeed they do not. I see that you use UDP packets instead of PTP packets for your example. Is there a reason to prefer one over the other?

While using measureLatency, I have also noticed that you do not set pktLength in the UDP or PTP packets, resulting in malformed packets that happen to work anyway but can be misinterpreted by some equipment.

@emmericp
Owner

It doesn't really matter here since we don't modify any of the PTP fields. The timestamper uses a PTP packet internally.
The packet type from :get*Packet() is just a fancy way to generate a C struct with some getters and setters.

You are right: the timestamper could set the size, which would avoid problems when someone sets the wrong size here (note: my script does that in fillPacket()).
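
Something along these lines in the packet setup avoids the mismatch; a sketch, with field values illustrative:

```lua
-- Fill the timestamp packet and set pktLength explicitly, so the
-- Ethernet/IP/UDP length fields are consistent with the buffer size.
local function fillPacket(buf, size)
	buf:getUdpPacket():fill{
		pktLength = size, -- derives the IP and UDP length fields
		udpSrc = 1234,    -- illustrative
		udpDst = 319,     -- PTP event port, which the timestamp filter matches
	}
end
```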

@tristan-ovh
Author

I am trying to find the differences between your script and mine, as mine loses all timestamp packets. I see that you declare 3 TX and 3 RX queues for both devices. Is there a reason you need more than 2 TX queues (load and timestamp sending) and 1 RX queue (timestamp reception)?

@emmericp
Owner

One of the queues is for ARP rx/tx.
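
So the layout is roughly as follows; a sketch, where which queue serves which purpose is illustrative (and the config syntax may differ between MoonGen versions):

```lua
local device = require "device"

-- three queues per direction on each device
local dev = device.config{port = 0, rxQueues = 3, txQueues = 3}
-- txQueue(0): load traffic         rxQueue(0): default queue / counting
-- txQueue(1): timestamped packets  rxQueue(1): timestamped packets (filtered)
-- txQueue(2): ARP replies          rxQueue(2): ARP requests (filtered)
```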

@tristan-ovh
Author

The example l3-load-latency.lua does not report lost timestamps (the number of times measureLatency returns nil). When I add that reporting, I see that it is not zero (about 5%), even though the counts of sent and received packets are exactly identical (the same problem as with my script).
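
The reporting I added just counts the nil returns; a sketch reusing the names from my loop above:

```lua
local numLost = 0
while dpdk.running() do
	local lat = timestamper:measureLatency()
	if lat then
		latencyHist:update(lat)
	else
		numLost = numLost + 1 -- packet arrived, but no timestamp was taken
	end
end
printf("Lost timestamps: %d", numLost)
```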

@emmericp
Owner

Okay, that means that timestamping itself fails: the packet was received successfully, but the NIC did not timestamp it for whatever reason.

I guess losing 5% of the timestamp information at full line rate is okay, since latency measurements at full line rate are usually pointless anyway (see my previous comment).
I would be really worried if you were actually losing packets in a cable ;)

I'll keep this issue open and have a look at the timestamping logic, which uses sequence numbers to determine whether the timestamping was successful.

@tristan-ovh
Author

Indeed, that is very acceptable.
My script works now; the problem was the packet size.

But I do not understand how the filtering works in your example. I believe that packets matching a filter are sent to the chosen queue, but other packets may also end up in that queue.
So for my script, I had to add a lower-priority rule that sends all other packets to a different queue.

In your example, you have no such rule, which should mean that you receive all packets on this core. Moreover, the filter you use (filterTimestamps) seems to match only PTP packets, not PTP/UDP packets, so it should have no effect in your case.

Am I misunderstanding something?

@emmericp
Owner

All packets go to queue 0 by default. Only filters and RSS can redirect packets.

RSS is disabled by default; it has to be enabled explicitly when configuring the device. And you would probably use a different set of queues for RSS.

Regarding the timestamp filter: yes, this can actually be improved. It currently just checks for the PTP version at a specific offset in IP packets (mask.only_ip_flow) and ignores the L4 protocol.
This should be changed; also checking whether the packet is UDP is quite important for use cases like your TCP example, where the filter would otherwise match on specific sequence numbers.
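
For reference, the filter setup in the example boils down to a single call (a sketch; see the example script for the exact usage):

```lua
-- Redirect PTP-looking packets to the timestamping RX queue; everything
-- else stays on queue 0 by default, so no catch-all rule is needed.
rxDev:filterTimestamps(rxQueue)
```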

@emmericp
Owner

Commit 7e68758 changes the timestamp filter to check the L4 protocol.

@emmericp
Owner

I cannot reproduce the packet loss at full line rate.
I added a check for failed timestamps locally and I get all timestamps at full line rate.

What NIC are you using? I tested this with an Intel X540.

@tristan-ovh
Author

I use an Intel 82599ES. I ran the test again and still see lost timestamps.
But it isn't really a problem for me.

@emmericp
Owner

As said, I cannot reproduce the packet loss at full line rate: with my local check for failed timestamps, I get all timestamps.

But I'm using an Intel X540 which is basically the same NIC (datasheets are almost identical, same driver) just with 10GBase-T instead of SFP+ and a lot of bug fixes.

I would not be surprised if this is just a hardware problem in the 82599 NIC. I've seen some strange problems with that NIC that just don't happen on X540 NICs.
