Traceflow fails with "bundle reply is timeout" #937

Closed
abhiraut opened this issue Jul 10, 2020 · 4 comments · Fixed by #951
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@abhiraut
Contributor

Describe the bug
Start a trace between two Pods using the Traceflow CRD. The Traceflow status reports a failure with the following error:

status:
  dataplaneTag: 1
  phase: Failed
  reason: 'Node: tf1, error: bundle reply is timeout'

To Reproduce
The following YAML was used for the Traceflow:

apiVersion: ops.antrea.tanzu.vmware.com/v1alpha1
kind: Traceflow
metadata:
  name: tf1
spec:
  source:
    pod: appdns
    namespace: default
  destination:
    pod: appserver
    namespace: default
  packet:
    transportHeader:
      tcp:
        dstPort: 80
    ipHeader:
      protocol: 6

Expected
Expected trace to succeed

Actual behavior
Trace failed with "bundle reply is timeout"

Versions:
Please provide the following information:

  • Linux kernel version on the Kubernetes Nodes (uname -r).
    4.15.0-88-generic

  • If you chose to compile the Open vSwitch kernel module manually instead of using the kernel module built into the Linux kernel, which version of the OVS kernel module are you using? Include the output of modinfo openvswitch for the Kubernetes Nodes.

modinfo openvswitch
filename:       /lib/modules/4.15.0-88-generic/kernel/net/openvswitch/openvswitch.ko
alias:          net-pf-16-proto-16-family-ovs_meter
alias:          net-pf-16-proto-16-family-ovs_packet
alias:          net-pf-16-proto-16-family-ovs_flow
alias:          net-pf-16-proto-16-family-ovs_vport
alias:          net-pf-16-proto-16-family-ovs_datapath
license:        GPL
description:    Open vSwitch switching datapath
srcversion:     304614E578D29023BC545F1
depends:        nf_conntrack,nf_nat,libcrc32c,nf_nat_ipv6,nf_nat_ipv4,nf_defrag_ipv6,nsh
retpoline:      Y
intree:         Y
name:           openvswitch
vermagic:       4.15.0-88-generic SMP mod_unload 
signat:         PKCS#7
signer:         
sig_key:        
sig_hashalgo:   md4
@abhiraut
Contributor Author

/cc @tnqn

@gran-vmv
Contributor

@abhiraut Could you share environment info with @wenyingd and me? We have hit this issue before, but cannot analyze the root cause without access to OVS.

@gran-vmv
Contributor

@abhiraut Wenying and I found the root cause.
I'll improve the reliability of the code snippet below in pkg/agent/openflow/packetin.go:

	wait.PollUntil(time.Second, func() (done bool, err error) {
		pktIn := <-ch
		for name, handler := range c.packetInHandlers {
			err = handler.HandlePacketIn(pktIn)
			if err != nil {
				klog.Errorf("PacketIn handler %s failed to process packet: %+v", name, err)
			}
		}
		return false, err
	}, stopCh)

You won't get this error once the review comment in #918 (review) has been addressed.
If you do hit it again, please work around it by changing return false, err to return false, nil, so that a handler error no longer stops the polling loop.
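
To illustrate the idea behind the workaround, here is a minimal sketch of a consumer loop that logs handler errors but keeps reading from the channel. It uses only the Go standard library rather than wait.PollUntil, and the names packetIn, handler, consume and runDemo are hypothetical stand-ins for this example, not Antrea or ofnet APIs. It is not the actual fix that was merged.

```go
package main

import (
	"errors"
	"fmt"
)

// packetIn is a hypothetical stand-in for a PacketIn message.
type packetIn struct{ id int }

// handler is a hypothetical stand-in for a registered PacketIn handler.
type handler func(p packetIn) error

// consume drains ch until it is closed. Handler errors are logged and
// counted but never returned, so one bad packet cannot stop the loop.
func consume(ch <-chan packetIn, handlers map[string]handler) (processed, failed int) {
	for p := range ch {
		processed++
		for name, h := range handlers {
			if err := h(p); err != nil {
				failed++
				fmt.Printf("PacketIn handler %s failed to process packet %d: %v\n", name, p.id, err)
			}
		}
	}
	return processed, failed
}

// runDemo sends three packets over an unbuffered channel; the handler
// rejects the second one. All three packets are still consumed.
func runDemo() (processed, failed int) {
	ch := make(chan packetIn)
	go func() {
		defer close(ch)
		for i := 1; i <= 3; i++ {
			ch <- packetIn{id: i}
		}
	}()
	handlers := map[string]handler{
		"traceflow": func(p packetIn) error {
			if p.id == 2 {
				return errors.New("simulated handler failure")
			}
			return nil
		},
	}
	return consume(ch, handlers)
}

func main() {
	processed, failed := runDemo()
	fmt.Printf("processed=%d failed=%d\n", processed, failed)
}
```

Because the producer writes to an unbuffered channel, it only makes progress while the consumer is still reading. Keeping the loop alive after an error is what prevents the producer from stalling.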

@wenyingd
Contributor

The root cause is that an error happened in the PacketIn handler, which caused the consuming goroutine to jump out of the for-loop. There is a channel between the PacketIn handler and ofnet, and ofnet blocks while sending a new "PacketIn" message into that channel, because no consumer is left on the other side at that point. Hence, ofnet cannot handle the next "inbound" message. ofnet's "outbound" channel still works, so Bundle control messages continue to be sent out to OVS, but ofnet can never receive the replies to them, and Antrea reports the timeout error.
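
The failure mode described above can be reproduced in isolation. The following is a rough sketch under simplifying assumptions: a single receiver goroutine handles every inbound message in order, PacketIn messages are handed to a consumer over an unbuffered channel, and the bundle reply arrives after two PacketIn messages. All type and function names here (message, inboundLoop, consumer, waitForBundleReply, demoTimeout) are hypothetical and do not correspond to real ofnet or Antrea code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// demoTimeout is an arbitrary wait chosen for this sketch.
const demoTimeout = 200 * time.Millisecond

// message is a hypothetical inbound message from the switch.
type message struct {
	kind string // "packet-in" or "bundle-reply"
}

// inboundLoop mimics a single receiver that handles inbound messages in
// order. Sending a PacketIn blocks until a consumer reads it, so a dead
// consumer stalls every later message, including the bundle reply.
func inboundLoop(msgs []message, packetInCh chan<- message, replyCh chan<- message) {
	defer close(packetInCh)
	for _, m := range msgs {
		if m.kind == "packet-in" {
			packetInCh <- m
			continue
		}
		replyCh <- m
	}
}

// consumer treats every packet as a handler failure. With stopOnError
// set it exits after the first packet, like the loop in the issue.
func consumer(packetInCh <-chan message, stopOnError bool) {
	for range packetInCh {
		if stopOnError {
			return
		}
	}
}

// waitForBundleReply reports whether the bundle reply is delivered
// before demoTimeout. When the consumer exits early, the receiver
// goroutine stays blocked (and is leaked) for the rest of the program.
func waitForBundleReply(stopOnError bool) error {
	packetInCh := make(chan message)
	replyCh := make(chan message, 1)
	msgs := []message{{kind: "packet-in"}, {kind: "packet-in"}, {kind: "bundle-reply"}}
	go consumer(packetInCh, stopOnError)
	go inboundLoop(msgs, packetInCh, replyCh)
	select {
	case <-replyCh:
		return nil
	case <-time.After(demoTimeout):
		return errors.New("bundle reply is timeout")
	}
}

func main() {
	fmt.Println("consumer exits on error:", waitForBundleReply(true))
	fmt.Println("consumer keeps reading: ", waitForBundleReply(false))
}
```

In the first call the consumer leaves after one packet, the second PacketIn send blocks forever, and the reply is never delivered, so the caller times out. In the second call the consumer keeps draining the channel and the reply arrives.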
