Lost timestamp packets at high data rates #91

Closed
tristan-ovh opened this issue Aug 11, 2015 · 19 comments

@tristan-ovh

Hi,
I am using MoonGen to fill a 10 Gb/s link with TCP SYN flooding, and I try to measure the latency using the measureLatency function.
But most or all timestamp packets are lost when the load is high. I use two ports directly connected by a cable, so the loss is not due to external equipment.
When counting packets, I see the same number of sent and received packets.
I have set up filters so that the receiving loop only gets PTP packets. This raises the load the measurement can sustain, but I still lose most or all timestamp packets at full link rate.
I do not know whether packets are lost during sending or reception.
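
For reference, my measurement loop follows the usual MoonGen timestamping pattern; a sketch with device/queue setup omitted and names illustrative:

```lua
local dpdk = require "dpdk"
local ts   = require "timestamping"
local hist = require "histogram"

-- txQueue/rxQueue: the queues reserved for timestamping (configured elsewhere)
local timestamper = ts:newTimestamper(txQueue, rxQueue)
local latencyHist = hist:new()
while dpdk.running() do
	-- measureLatency() sends one timestamped packet and waits for it on rxQueue
	latencyHist:update(timestamper:measureLatency())
end
latencyHist:save("histogram.csv")
```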

Do you have an idea of what could cause the problem?

@emmericp
Owner

It shouldn't lose packets in that scenario.

I can't reproduce this with the l3-load-latency example which still works at full line rate as expected.

Command line: ./build/MoonGen examples/l3-load-latency.lua 14 15 0 1 60
(with commit 6d6cc3b, which fixes this script when used with small packets; otherwise use the default packet size)

Can you post the script you are using?

BTW: what you are trying to do is probably not a good idea. Latency at full line rate is often problematic: every small pause causes buffers to fill up, and these buffers can never be emptied because packets keep coming in at the same rate at which you can send them out.

After a short time, the latency will simply be a function of the buffer size.
Well, unless you are testing a hardware device, which usually doesn't have this problem.
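
(For a sense of scale: a buffer of, say, 512 KiB that never drains adds about 512 · 1024 · 8 bit / 10 Gbit/s ≈ 420 µs of queueing delay, no matter how fast the device behind it is. The buffer size here is purely illustrative.)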

@tristan-ovh
Author

I do reproduce the bug with the example script: MoonGen /opt/moongen/examples/l3-load-latency.lua 0 1 0 1 60

[Device: id=0] Sent 14878234 packets, current rate 14.88 Mpps, 7617.60 MBit/s, 9998.10 MBit/s wire rate.
[Device: id=1] Received 14878826 packets, current rate 14.88 Mpps, 7617.63 MBit/s, 9998.14 MBit/s wire rate.
[Device: id=0] Sent 29758621 packets, current rate 14.88 Mpps, 7618.75 MBit/s, 9999.61 MBit/s wire rate.
[Device: id=1] Received 29759202 packets, current rate 14.88 Mpps, 7618.75 MBit/s, 9999.61 MBit/s wire rate.
^C[Device: id=0] Sent 42092496 packets with 2693919744 bytes payload (including CRC).
[Device: id=0] Sent 14.880372 (StdDev nan) Mpps, 7618.750699 (StdDev nan) MBit/s, 9999.610292 (StdDev nan) MBit/s wire rate on average.
[Device: id=1] Received 42092496 packets with 2693919744 bytes payload (including CRC).
[Device: id=1] Received 14.880374 (StdDev nan) Mpps, 7618.751910 (StdDev nan) MBit/s, 9999.611721 (StdDev nan) MBit/s wire rate on average.
Samples: 0, Average: nan ns, StdDev: 0.0 ns, Quartiles: nan/nan/nan ns
Saving histogram to 'histogram.csv'

@emmericp
Owner

Did you update to 6d6cc3b or later?
Timestamping was broken with small packets in this specific script, independent of the rate.

Try to update or use the default packet size (124).

@tristan-ovh
Author

Thank you for your answers by the way :)
I am on the most up-to-date version, and it works fine at lower rates: MoonGen /opt/moongen/examples/l3-load-latency.lua 0 1

[Device: id=0] Sent 1953096 packets, current rate 1.95 Mpps, 1999.92 MBit/s, 2312.41 MBit/s wire rate.
[Device: id=1] Received 1953183 packets, current rate 1.95 Mpps, 1999.92 MBit/s, 2312.41 MBit/s wire rate.
[Device: id=0] Sent 3907147 packets, current rate 1.95 Mpps, 2000.94 MBit/s, 2313.59 MBit/s wire rate.
[Device: id=1] Received 3907269 packets, current rate 1.95 Mpps, 2000.94 MBit/s, 2313.59 MBit/s wire rate.
[Device: id=0] Sent 5861205 packets, current rate 1.95 Mpps, 2000.95 MBit/s, 2313.60 MBit/s wire rate.
[Device: id=1] Received 5861327 packets, current rate 1.95 Mpps, 2000.95 MBit/s, 2313.60 MBit/s wire rate.
^C[Device: id=0] Sent 6496110 packets with 831501952 bytes payload (including CRC).
[Device: id=0] Sent 1.954047 (StdDev 0.000006) Mpps, 2000.944480 (StdDev 0.004315) MBit/s, 2313.592055 (StdDev 0.005216) MBit/s wire rate on average.
[Device: id=1] Received 6496110 packets with 831501952 bytes payload (including CRC).
[Device: id=1] Received 1.954047 (StdDev 0.000006) Mpps, 2000.944186 (StdDev 0.006130) MBit/s, 2313.591715 (StdDev 0.007087) MBit/s wire rate on average.
Samples: 2320, Average: 254.4 ns, StdDev: 18.4 ns, Quartiles: 243.2/256.0/268.8 ns
Saving histogram to 'histogram.csv'

@tristan-ovh
Author

Whoops, no, I was not on the latest version. It seems to work better now!

@tristan-ovh
Author

I still have problems with my own script but not with the example. I will see what you have changed. Thank you.

@emmericp
Owner

My recent changes shouldn't affect your code.

@tristan-ovh
Author

Indeed they do not. I see that you use UDP packets instead of PTP packets for your example. Is there a reason to prefer one over the other?

While using measureLatency, I have also noticed that you do not set pktLength in the UDP or PTP packets, resulting in malformed packets that happen to work anyway but can be misinterpreted by some equipment.

@emmericp
Owner

It doesn't really matter here since we don't modify any of the PTP fields. The timestamper uses a PTP packet internally.
The packet type from :get*Packet() is just a fancy way to generate a C struct with some getters and setters.

You are right: the timestamper could set the size, which would avoid problems when someone sets the wrong size here (note: my script does that in fillPacket()).
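
Something along these lines in the packet setup avoids the mismatch; a sketch, with field values illustrative:

```lua
-- Fill the timestamp packet and set pktLength explicitly, so the
-- Ethernet/IP/UDP length fields are consistent with the buffer size.
local function fillPacket(buf, size)
	buf:getUdpPacket():fill{
		pktLength = size, -- derives the IP and UDP length fields
		udpSrc = 1234,    -- illustrative
		udpDst = 319,     -- PTP event port, which the timestamp filter matches
	}
end
```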

@tristan-ovh
Author

I am trying to find the differences between your script and mine, as mine loses all timestamp packets. I see that you declare 3 TX and 3 RX queues for both devices. Is there a reason you need more than 2 TX queues (load and timestamp sending) and 1 RX queue (timestamp reception)?

@emmericp
Owner

One of the queues is for ARP rx/tx.
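
So the layout is roughly as follows; a sketch, where which queue serves which purpose is illustrative (and the config syntax may differ between MoonGen versions):

```lua
local device = require "device"

-- three queues per direction on each device
local dev = device.config{port = 0, rxQueues = 3, txQueues = 3}
-- txQueue(0): load traffic         rxQueue(0): default queue / counting
-- txQueue(1): timestamped packets  rxQueue(1): timestamped packets (filtered)
-- txQueue(2): ARP replies          rxQueue(2): ARP requests (filtered)
```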

@tristan-ovh
Author

The example l3-load-latency.lua does not report lost timestamps (the number of times measureLatency returns nil). When I add that reporting, I see that it is not zero (about 5%), even though the counts of sent and received packets are exactly identical (the same problem as with my script).
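
The reporting I added just counts the nil returns; a sketch reusing the names from my loop above:

```lua
local numLost = 0
while dpdk.running() do
	local lat = timestamper:measureLatency()
	if lat then
		latencyHist:update(lat)
	else
		numLost = numLost + 1 -- packet arrived, but no timestamp was taken
	end
end
printf("Lost timestamps: %d", numLost)
```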

@emmericp
Owner

Okay, that means that timestamping itself fails: the packet was received successfully, but the NIC did not timestamp it for whatever reason.

I guess losing 5% of the timestamp information at full line rate is okay, since latency measurements at full line rate are usually pointless anyway (see my previous comment).
I would be really worried if you were actually losing packets in a cable ;)

I'll keep this issue open and have a look at the timestamping logic, which uses sequence numbers to determine whether the timestamping was successful.

@tristan-ovh
Author

Indeed, that is very acceptable.
My script works now; the problem was the packet size.

But I do not understand how the filtering works in your example. I believe that packets matching a filter are sent to the chosen queue, but other packets may also end up in that queue.
So for my script, I had to add a lower-priority rule that sends all other packets to a different queue.

In your example, you have no such rule, which should mean that you receive all packets on this core. Moreover, the filter you use (filterTimestamps) seems to match only PTP packets, not PTP/UDP packets, so it should have no effect in your case.

Am I misunderstanding something?

@emmericp
Owner

All packets go to queue 0 by default. Only filters and RSS can redirect packets.

RSS is disabled by default; it has to be enabled explicitly when configuring the device. And you would probably use a different set of queues for RSS.

Regarding the timestamp filter: yes, this can actually be improved. It currently just checks for the PTP version at a specific offset in IP packets (mask.only_ip_flow) and ignores the L4 protocol.
This should be changed; also checking whether the packet is UDP is quite important for use cases like your TCP example, where the filter would otherwise match on specific sequence numbers.
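
For reference, the filter setup in the example boils down to a single call (a sketch; see the example script for the exact usage):

```lua
-- Redirect PTP-looking packets to the timestamping RX queue; everything
-- else stays on queue 0 by default, so no catch-all rule is needed.
rxDev:filterTimestamps(rxQueue)
```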

@emmericp
Owner

Commit 7e68758 changes the timestamp filter to check the L4 protocol.

@emmericp
Owner

I cannot reproduce the packet loss at full line rate.
I added a check for failed timestamps locally and I get all timestamps at full line rate.

What NIC are you using? I tested this with an Intel X540.

@tristan-ovh
Author

I use an Intel 82599ES. I ran the test again and still see lost timestamps.
But it isn't really a problem for me.

@emmericp
Owner

As said, I cannot reproduce the packet loss at full line rate: with my local check for failed timestamps, I get all timestamps.

But I'm using an Intel X540 which is basically the same NIC (datasheets are almost identical, same driver) just with 10GBase-T instead of SFP+ and a lot of bug fixes.

I would not be surprised if this is just a hardware problem in the 82599 NIC. I've seen some strange problems with that NIC that just don't happen on X540 NICs.
