Support TCP transport for elk flow collector #2387
Conversation
Thanks for adding TCP support as an alternative to UDP.
docs/network-flow-visibility.md
```diff
@@ -264,7 +264,7 @@ then please use the address:
 `<Ipfix-Collector Cluster IP>:<port>:<TCP|UDP>`
 * If you have deployed the [ELK
 flow collector](#deployment-steps-1), then please use the address:
-`<Logstash Cluster IP>:4739:UDP`
+`<Logstash Cluster IP>:4739:UDP` or `<Logstash Cluster IP>:4738:TCP`
```
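For context, this is the address the Flow Aggregator exports to. A minimal sketch of the corresponding setting, assuming the `externalFlowCollectorAddr` field from the Flow Aggregator configuration described in this doc, with a placeholder ClusterIP:

```yaml
# Sketch only: excerpt of a flow-aggregator ConfigMap. The field name is
# assumed from the Flow Aggregator configuration; 10.96.12.34 stands in for
# the Logstash ClusterIP, and 4738/TCP matches the doc change above.
flow-aggregator.conf: |
  # Flow collector address in the format <IP>:<port>:<proto>.
  externalFlowCollectorAddr: "10.96.12.34:4738:tcp"
```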
why a different port for TCP?
I tried using the same port for TCP and UDP and found there were decoding issues in Logstash, which may need further investigation. So I separated them into two different ports.
Cleaned up my local setup and had another try. It is working now with one port.
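For reference, a minimal sketch of what the single-port setup could look like in the Logstash pipeline configuration, assuming the inputs live in a `logstash.conf` key of the ELK flow collector ConfigMap and use the logstash-codec-netflow codec (IPFIX is NetFlow version 10):

```yaml
# Sketch only: ConfigMap excerpt with UDP and TCP inputs sharing port 4739.
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-configmap
data:
  logstash.conf: |
    input {
      udp {
        port => 4739
        codec => netflow { versions => [10] }
      }
      tcp {
        port => 4739
        codec => netflow { versions => [10] }
      }
    }
```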
LGTM.
Maybe we can also support template refresh in go-ipfix for the TCP exporting process, like we do for the UDP exporting process, just for the sake of the ELK IPFIX flow collector, even though it is not part of the IANA standard. We can open an issue in the go-ipfix repo for this purpose.
@zyiou do we need to run any Jenkins tests for this?
I don't think we need to, since the changes are to the manifest YAMLs.
Previously we did not support TCP transport for the ELK flow collector because template expiration is enabled by default for TCP transport, which does not align with the IANA standard. Now we set the expiration for templates received over TCP to 365 days as a workaround to provide support for TCP transport. Signed-off-by: zyiou <zyiou@vmware.com>
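A sketch of how this workaround could look on the collector side, assuming the template cache lifetime is controlled via the `cache_ttl` option of logstash-codec-netflow (whether the actual manifest uses exactly this knob is an assumption):

```yaml
# Sketch only: TCP input whose template cache effectively never expires.
  logstash.conf: |
    input {
      tcp {
        port => 4739
        codec => netflow {
          versions  => [10]
          cache_ttl => 31536000  # 365 days in seconds
        }
      }
    }
```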
@antoninbas can you approve and merge this PR? Thanks!
/skip-all
I nominate this as a candidate for the v1.2 patch release. This patch is required in setups where the UDP transport is broken because of unconventionally large metadata.
Yes, this can be backported to release-1.2, but what do you mean by "unconventionally large metadata"?
If there are many labels for the Pods (more than two), then we are going over the minimum MTU of 512 bytes in the case of UDP.
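For a rough, hypothetical illustration of the arithmetic: an IPFIX message header is 16 bytes and a set header 4 bytes, so with on the order of 150 bytes of fixed-length fields per data record, two variable-length Pod-label strings of ~200 bytes each are already enough to push a single record past the 512-byte budget.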
I think you mean the maximum MTU of 512B when the actual path MTU is not known. I can see the following in RFC 7011:

> The maximum size of exported messages MUST be configured such that the total packet size does not exceed the path MTU. If the path MTU is unknown, a maximum packet size of 512 octets SHOULD be used.

Do you interpret this as "the exporting process should not rely on IP fragmentation by the OS / network"? If yes, then I am not sure why this is the case. This seems like a hard restriction, especially when the message size is not fixed (variable-length IEs, which we have for Antrea).
Yes, I meant maximum. My bad. When the path MTU is not known, we set 512 bytes as the default.
Yes, from the RFC MTU excerpt, I interpreted that IP fragmentation is not recommended for UDP transport in the IPFIX protocol. This may have nothing to do with IPFIX itself, but with UDP and IP fragmentation in general: we may have reordering issues and packet loss. One straightforward thing is to ignore the flow records that arrive in malformed packets after reassembly. For in-cluster traffic, relying on the Flow Aggregator Pod's interface MTU will probably be OK. We are doing the same for the Flow Exporter to Flow Aggregator IPFIX channel.
Yes, we were thinking of managing this with errors in the code and documentation on when to move to TCP transport. Do you have suggestions in mind?
Do you think it makes sense to keep supporting UDP? Maybe we should just remove UDP support altogether and focus on TCP (and possibly SCTP) testing.
Before making this decision, I think it is better to test Antrea with the (third-party) ELK flow collector using bigger UDP packets, say 9000 bytes on a ~1500-MTU interface. In this scenario, we can see how Logstash receives the flow records, and whether there is any packet loss leading to missing records at the collector. What do you think of this experiment?
I'm not sure I understand. "A bigger UDP packet, say 9000 bytes on a ~1500-MTU interface" would mean that you are fragmenting the packet, and it seems this is strongly discouraged by the RFC. If the following are true:
If we want to fragment the UDP packets, then we just need to remove the check in the ipfix code that drops large UDP packets. In any case, we should probably default to TCP for transport.
Sorry for the late response. I planned to do some experiments and hence the delay.
Yes, I wanted to go against the IPFIX RFC's recommendation on path MTU and see what happens with the ELK flow collector that is inside the cluster. I tried to send large IPFIX packets by bloating up the Pod labels in the workload traffic from the Flow Aggregator (interface MTU 1450) and watched how the ELK flow collector responds. I see that flow records are received properly at the ELK collector even after fragmentation and reassembly: there is no loss of records over a time frame of 30 minutes for 15 workload flows. Maybe my interpretation of the IPFIX RFC's recommendation of using the path MTU is not entirely correct. Do you have an alternative interpretation? [EDIT] In addition, going forward we want to make sure that the number of fields in an IPFIX packet is restricted, to keep the size within acceptable limits. This will help with the processing of records at the collector as well.