Fix agent nil panic in pure IPv6 cluster #2655

Merged
merged 1 commit into from Sep 3, 2021

Contributor

@wenqiq wenqiq commented Aug 26, 2021

Fixed agent panic because of nil NodeIP in pure IPv6 cluster

The nodeConfig.NodeIPv4Addr is nil which would cause panic in agent,
when starting agent with Egress feature enabled in pure IPv6 cluster.
Related #2436

Add Egress IPv6 test cases in dual-stack or pure IPv6 cluster. Related #2196

Signed-off-by: Wenqi Qiu <wenqiq@vmware.com>

Contributor Author

wenqiq commented Aug 26, 2021

/test-e2e /test-ipv6-only-conformance /test-ipv6-only-e2e

codecov-commenter commented Aug 26, 2021

Codecov Report

Merging #2655 (b0a9295) into main (bfa1bc4) will decrease coverage by 0.17%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #2655      +/-   ##
==========================================
- Coverage   60.66%   60.49%   -0.18%     
==========================================
  Files         285      285              
  Lines       23006    23017      +11     
==========================================
- Hits        13957    13923      -34     
- Misses       7550     7598      +48     
+ Partials     1499     1496       -3     
| Flag | Coverage Δ | |
|---|---|---|
| kind-e2e-tests | 48.20% <0.00%> (-0.17%) | ⬇️ |
| unit-tests | 41.01% <100.00%> (-0.04%) | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/agent/memberlist/cluster.go | 71.63% <100.00%> (+0.07%) | ⬆️ |
| pkg/agent/controller/traceflow/packetin.go | 59.49% <0.00%> (-5.49%) | ⬇️ |
| ...agent/controller/traceflow/traceflow_controller.go | 70.21% <0.00%> (-3.20%) | ⬇️ |
| pkg/controller/traceflow/controller.go | 70.30% <0.00%> (-2.43%) | ⬇️ |
| pkg/agent/cniserver/ipam/ipam_delegator.go | 48.83% <0.00%> (-2.33%) | ⬇️ |
| pkg/agent/wireguard/client_linux.go | 65.17% <0.00%> (-2.17%) | ⬇️ |
| pkg/controller/egress/controller.go | 84.12% <0.00%> (-1.02%) | ⬇️ |
| pkg/agent/openflow/pipeline.go | 74.14% <0.00%> (-1.02%) | ⬇️ |
| ...g/agent/cniserver/interface_configuration_linux.go | 16.15% <0.00%> (-0.69%) | ⬇️ |
| pkg/agent/agent.go | 50.91% <0.00%> (ø) | |
| ... and 4 more | | |

Contributor Author

wenqiq commented Aug 27, 2021

/test-e2e

/test-ipv6-only-e2e

@wenqiq wenqiq force-pushed the ipv6-e2e branch 2 times, most recently from 1577d3a to e0a6988 Compare August 27, 2021 18:43
Contributor Author

wenqiq commented Aug 27, 2021

/test-ipv6-only-e2e

Contributor Author

wenqiq commented Aug 27, 2021

/test-ipv6-only-e2e

@wenqiq wenqiq force-pushed the ipv6-e2e branch 3 times, most recently from d56d20f to b4046c7 Compare August 30, 2021 08:40
Contributor Author

wenqiq commented Aug 30, 2021

/test-ipv6-only-e2e

@wenqiq wenqiq changed the title [WIP]Add Egress IPv6 test cases Add Egress IPv6 test cases Aug 30, 2021
@wenqiq wenqiq marked this pull request as ready for review August 30, 2021 09:59
Contributor Author

wenqiq commented Aug 30, 2021

/test-e2e
/test-ipv6-only-conformance
/test-ipv6-only-e2e

@wenqiq wenqiq changed the title Add Egress IPv6 test cases Fix agent nil panic in pure IPv6 cluster Aug 30, 2021
Contributor

@antoninbas antoninbas left a comment

looks like this will need to be backported to release-1.3; @tnqn @wenqiq

cmd/antrea-agent/agent.go Outdated
Comment on lines 310 to 311
// skipIfNotIPv6Cluster(t)
if clusterInfo.podV6NetworkCIDR == "" {
Contributor

isn't that the same check as skipIfNotIPv6Cluster?

Contributor Author

Yes, good catch, will fix it.

Contributor Author

Done

for _, ipR := range ipRanges {
ipVsersion := fmt.Sprintf("-IPv%d", ipR.ipVersion)
expectedTotal := ipR.expectedTotalIpNum
t.Run(tt.name+ipVsersion, func(t *testing.T) {
Contributor

aren't you going to have several tests with the same name here? it seems the name is the same for all IPRanges sharing the same IP family.
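
To illustrate the naming concern: when several ranges share an IP family, a name built only from the family repeats, and Go's testing package then disambiguates repeated subtest names with a numeric suffix such as `#01`, which is harder to read in logs. A minimal sketch, assuming a hypothetical `ipRange` type and `subtestName` helper (neither is taken from the PR), that keeps names unique by including the range index:

```go
package main

import "fmt"

// ipRange is a hypothetical test input: an address family plus range bounds.
type ipRange struct {
	ipVersion  int
	start, end string
}

// subtestName builds a name that stays unique even when several ranges share
// an IP family, by appending the index of the range within the table.
func subtestName(base string, idx int, r ipRange) string {
	return fmt.Sprintf("%s-IPv%d-range%d", base, r.ipVersion, idx)
}

func main() {
	// Illustrative ranges only; two of them share the IPv4 family.
	ranges := []ipRange{
		{4, "169.254.100.1", "169.254.100.3"},
		{4, "169.254.101.10", "169.254.101.11"},
		{6, "2021:1::aaa1", "2021:1::aaa3"},
	}
	seen := map[string]bool{}
	for i, r := range ranges {
		name := subtestName("single matching Node", i, r)
		if seen[name] {
			panic("duplicate subtest name: " + name)
		}
		seen[name] = true
		fmt.Println(name)
	}
}
```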

Contributor Author

looks like this will need to be backported to release-1.3; @tnqn @wenqiq

I think yes. Do you have any comments? @tnqn

Member

@wenqiq Replies to comments that are not tied to a specific line should go on the conversation page. Replying here is confusing and makes the original comment hard to track.

Contributor Author

Got it. Thanks for reminding.

return exists == false, nil
})
require.NoError(t, err, "Failed to check if IP exists on Node")
// assert.False(t, exists, "Found stale IP on Node")
Contributor

remove if not needed

Contributor Author

It seems I commented out this line by mistake; will fix it.

Comment on lines 1347 to 1359
var builder strings.Builder
pods, errGet := data.clientset.CoreV1().Pods(antreaNamespace).List(context.TODO(), metav1.ListOptions{
LabelSelector: "app=antrea,component=antrea-agent",
})
if errGet != nil {
builder.WriteString(errGet.Error())
}
for _, pod := range pods.Items {
code, stdout, stderr, errCmd := provider.RunCommandOnNode(controlPlaneNodeName(), fmt.Sprintf("kubectl -n %s logs %s antrea-agent", antreaNamespace, pod.Name))
builder.WriteString(fmt.Sprintf("RunCommandOnNode, code: %d, stdout: %s, stderr: %s, error: %v", code, stdout, stderr, errCmd))
}
return fmt.Errorf("restartAntreaAgentPods error: %v, logs: %s", err, builder.String())
}
Contributor

this seems unrelated to the rest of the PR and it's unclear what the purpose is since there is no comment
if there is an issue with the e2e test framework, it should be addressed in a separate PR

Contributor Author

When the agent fails to start, the test only describes the agent Pods. I think we should also print the agent startup logs to make debugging easier.

Member

Agree with @antoninbas. We already collect the component logs in a specific directory if the test fails. I'm not sure whether logs are collected when deployment fails, but even if they aren't, it's easy to change that behavior. Dumping the whole log into the error may mess up the test output.

Contributor Author

Without access to the same test environment, it's hard to find the cause of the panic, because the describe-Pod output in the Jenkins CI log only says 'Back-off...'.

          Normal   Created    116s (x4 over 2m53s)  kubelet            Created container antrea-agent
          Normal   Started    116s (x4 over 2m53s)  kubelet            Started container antrea-agent
          Warning  BackOff    112s (x7 over 2m48s)  kubelet            Back-off restarting failed container
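
The two positions in this thread (print the agent log vs. avoid flooding the test output) could be reconciled by bounding how much log text goes into the error. This was not proposed by anyone in the thread; it is only a sketch, and the `tailLines` helper and the sample log text are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// tailLines returns at most the last n lines of s, ignoring trailing newlines.
func tailLines(s string, n int) string {
	if n < 0 {
		n = 0
	}
	lines := strings.Split(strings.TrimRight(s, "\n"), "\n")
	if len(lines) > n {
		lines = lines[len(lines)-n:]
	}
	return strings.Join(lines, "\n")
}

func main() {
	// Placeholder log content; a real test would read it from the agent Pod.
	logs := "starting agent\ninitializing node config\nfatal: startup failed\n"
	cause := errors.New("agent Pods not ready")
	err := fmt.Errorf("restartAntreaAgentPods error: %w, last log lines:\n%s", cause, tailLines(logs, 2))
	fmt.Println(err)
}
```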

} else if nodeConfig.NodeIPv6Addr != nil {
nodeIP = nodeConfig.NodeIPv6Addr.IP
} else {
return fmt.Errorf("NodeIPAddr in Node config invalid: %v", nodeConfig)
Contributor

"NodeIPAddr in Node config is invalid" or "invalid NodeIPAddr in Node config"

Contributor Author

Done

@wenqiq wenqiq force-pushed the ipv6-e2e branch 3 times, most recently from de06b87 to 8d9c26b Compare August 31, 2021 04:09
}
if egress.Spec.EgressIP != tt.expectedEgressIP {
return false, nil
for _, ipR := range ipRanges {
Member

The nested subtests don't seem necessary, and they make it hard to understand what the test focuses on and expects.
Can you just add subtests for several IPv6 cases here, e.g. "single matching Node with IPv6 range"? I don't think we need to cover every scenario for every address family and every possible IP range type. Typical cases are enough.

Contributor Author

Done

require.NoError(t, err, "Failed to check if IP exists on Node")
assert.False(t, exists, "Found stale IP on Node")
})
for _, ipR := range ipRanges {
Member

ditto

Contributor Author

Done

Contributor Author

wenqiq commented Sep 1, 2021

I would suggest the following:

Fix agent panic because of nil NodeIP in pure IPv6 cluster

The nodeConfig.NodeIPv4Addr is nil which would cause panic in agent,
when starting agent with Egress feature enabled in pure IPv6 cluster.

It also adds Egress IPv6 test cases in dual-stack and pure IPv6 cluster. 

Related #2196 

Done

/test-all
/test-e2e
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-e2e

tnqn previously approved these changes Sep 1, 2021
Member

@tnqn tnqn left a comment

LGTM

Contributor Author

wenqiq commented Sep 1, 2021

/test-e2e

Contributor Author

wenqiq commented Sep 2, 2021

@tnqn @antoninbas It seems all the related test checks have finished successfully. Do you have any more comments on this PR?

@antoninbas
Contributor

@wenqiq I see failures in the dual-stack e2e tests. Is that expected?

--- FAIL: TestEgress (120.95s)
    --- FAIL: TestEgress/testEgressClientIP (46.04s)
        --- FAIL: TestEgress/testEgressClientIP/ipv4-cluster (21.02s)
        --- PASS: TestEgress/testEgressClientIP/ipv6-cluster (25.02s)
    --- FAIL: TestEgress/testEgressCRUD (4.56s)
        --- FAIL: TestEgress/testEgressCRUD/single_matching_Node (0.75s)
        --- FAIL: TestEgress/testEgressCRUD/single_matching_Node_with_IPv6_range (0.75s)
        --- FAIL: TestEgress/testEgressCRUD/two_matching_Nodes (1.45s)
        --- PASS: TestEgress/testEgressCRUD/no_matching_Node (1.60s)
    --- PASS: TestEgress/testEgressUpdateEgressIP (5.40s)
        --- PASS: TestEgress/testEgressUpdateEgressIP/same_Node (1.80s)
        --- PASS: TestEgress/testEgressUpdateEgressIP/different_Nodes (1.80s)
        --- PASS: TestEgress/testEgressUpdateEgressIP/different_Nodes_in_IPv6_cluster (1.80s)
    --- PASS: TestEgress/testEgressUpdateNodeSelector (5.20s)
        --- PASS: TestEgress/testEgressUpdateNodeSelector/IPv4_cluster (2.60s)
        --- PASS: TestEgress/testEgressUpdateNodeSelector/IPv6_cluster (2.60s)
    --- FAIL: TestEgress/testEgressNodeFailure (3.89s)
        --- FAIL: TestEgress/testEgressNodeFailure/IPv4_cluster (1.40s)
        --- PASS: TestEgress/testEgressNodeFailure/IPv6_cluster (2.49s)

Contributor Author

wenqiq commented Sep 2, 2021

/test-ipv6-e2e
Retriggering. The dual-stack e2e test failures seem to be related to the test machine. I see error logs like the following:

=== RUN   TestEgress/testEgressNodeFailure/IPv4_cluster
    egress_test.go:554: Error when running command 'pkill -STOP antrea-agent' on Node 'antrea-ipv6-5-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-5-1' in SSH config
    egress_test.go:554: Error when running command 'pkill -CONT antrea-agent' on Node 'antrea-ipv6-5-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-5-1' in SSH config
    egress_test.go:554: Error when running command 'pkill -CONT antrea-agent' on Node 'antrea-ipv6-5-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-5-1' in SSH config
=== RUN   TestEgress/testEgressNodeFailure/IPv6_cluster

Contributor Author

wenqiq commented Sep 2, 2021

/test-ipv6-e2e

Contributor Author

wenqiq commented Sep 2, 2021

/test-all
/test-e2e
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-e2e

Contributor Author

wenqiq commented Sep 2, 2021

Dual-stack e2e tests failed.
/test-ipv6-e2e

=== RUN   TestEgress/testEgressNodeFailure/IPv4_cluster
    egress_test.go:575: Error when running command 'pkill -STOP antrea-agent' on Node 'antrea-ipv6-9-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-9-1' in SSH config
    egress_test.go:575: Error when running command 'pkill -CONT antrea-agent' on Node 'antrea-ipv6-9-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-9-1' in SSH config
    egress_test.go:575: Error when running command 'pkill -CONT antrea-agent' on Node 'antrea-ipv6-9-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-9-1' in SSH config
=== RUN   TestEgress/testEgressNodeFailure/IPv6_cluster
    egress_test.go:575: Error when running command 'pkill -STOP antrea-agent' on Node 'antrea-ipv6-9-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-9-1' in SSH config
    egress_test.go:575: Error when running command 'pkill -CONT antrea-agent' on Node 'antrea-ipv6-9-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-9-1' in SSH config
    egress_test.go:575: Error when running command 'pkill -CONT antrea-agent' on Node 'antrea-ipv6-9-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-9-1' in SSH config
=== CONT  TestEgress
    fixtures.go:257: Exporting test logs to '/var/lib/jenkins/workspace/antrea-ipv6-ds-e2e-for-pull-request/antrea-test-logs/TestEgress/beforeTeardown.Sep02-08-59-31'
    fixtures.go:363: Error when exporting kubelet logs: error when running journalctl on Node 'antrea-ipv6-9-0', is it available? Error: <nil>
    fixtures.go:384: Deleting 'antrea-test' K8s Namespace

Contributor Author

wenqiq commented Sep 2, 2021

/test-all
/test-e2e
/test-ipv6-only-conformance
/test-ipv6-only-e2e

Contributor Author

wenqiq commented Sep 2, 2021

/test-ipv6-only-e2e
The IPv6-only e2e test failed:

=== RUN   TestEgress/testEgressNodeFailure/IPv6_cluster
    egress_test.go:575: Error when running command 'pkill -STOP antrea-agent' on Node 'antrea-ipv6-8-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-8-1' in SSH config
    egress_test.go:575: Error when running command 'pkill -CONT antrea-agent' on Node 'antrea-ipv6-8-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-8-1' in SSH config
    egress_test.go:575: Error when running command 'pkill -CONT antrea-agent' on Node 'antrea-ipv6-8-1', rc: 0, stdout: , stderr: , error: unable to find 'HostName' for 'antrea-ipv6-8-1' in SSH config

Contributor Author

wenqiq commented Sep 2, 2021

/test-ipv6-conformance
/test-ipv6-networkpolicy
/test-ipv6-only-networkpolicy

@lzhecheng
Contributor

@wenqiq the "unable to find 'HostName'" failure results from the SSH config not being properly set up. In this PR, you are using runCommandFromPod to run commands from a Node other than the master/control plane, right? Previously there was no such case in the IPv6-only and dual-stack tests, so there was no error. @xliuxu has a PR to support this: #2675

@tnqn Branch 1.3 seems to be waiting for this PR, so I suggest merging Xu's PR now and ignoring the request to add a comment in the script.
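
For readers unfamiliar with this failure mode, here is a simplified sketch of how a lookup of a Node alias in ssh_config-style text can fail when only some Nodes have an entry. The `lookupHostName` function, the Node names, and the addresses are hypothetical; this is not the test framework's implementation, and it ignores wildcards, `Match`, and `Include`:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// lookupHostName scans ssh_config-style text for a "Host <alias>" block and
// returns the HostName value declared inside it.
func lookupHostName(config, alias string) (string, error) {
	inBlock := false
	sc := bufio.NewScanner(strings.NewReader(config))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		// ssh_config keywords are case-insensitive.
		switch strings.ToLower(fields[0]) {
		case "host":
			inBlock = false
			for _, pattern := range fields[1:] {
				if pattern == alias {
					inBlock = true
				}
			}
		case "hostname":
			if inBlock {
				return fields[1], nil
			}
		}
	}
	return "", fmt.Errorf("unable to find 'HostName' for '%s' in SSH config", alias)
}

func main() {
	// Only the first Node has an entry, mirroring a config that covers
	// the control-plane Node but not the workers.
	cfg := "Host node-0\n  HostName 192.0.2.10\n  User ubuntu\n"
	fmt.Println(lookupHostName(cfg, "node-0"))
	fmt.Println(lookupHostName(cfg, "node-1"))
}
```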

Member

tnqn commented Sep 2, 2021

@wenqiq DCO is missing in the second patch

Contributor Author

wenqiq commented Sep 2, 2021

Thanks @lzhecheng @tnqn @xliuxu. It seems I should rebase and merge my two commits after #2675 is merged.

Contributor Author

wenqiq commented Sep 2, 2021

/test-ipv6-e2e

Member

tnqn commented Sep 2, 2021

#2675 has been merged, you could rebase now.

The nodeConfig.NodeIPv4Addr is nil which would cause panic in agent,
when starting agent with Egress feature enabled in pure IPv6 cluster.

It also adds Egress IPv6 test cases in dual-stack and pure IPv6 cluster.

Related antrea-io#2196

Signed-off-by: Wenqi Qiu <wenqiq@vmware.com>
@@ -552,7 +717,7 @@ func (data *TestData) createEgress(t *testing.T, generateName string, matchExpre
}

func (data *TestData) waitForEgressRealized(egress *v1alpha2.Egress) (*v1alpha2.Egress, error) {
- err := wait.PollImmediate(200*time.Millisecond, 3*time.Second, func() (done bool, err error) {
+ err := wait.PollImmediate(200*time.Millisecond, 5*time.Second, func() (done bool, err error) {
Member

So 3 seconds is enough in some cases? I feel this may indicate a problem, but I don't think it's related to this PR, so I'm fine with the change. We had better look at the logs to understand why the delay is so long.

Contributor Author

@wenqiq wenqiq Sep 2, 2021

3 seconds is enough in most cases. I have tested hundreds of times in my local env (dual-stack cluster) and it never failed. It failed once in the Jenkins CI workflow:

=== RUN   TestEgress/testEgressNodeFailure/IPv6_cluster
    egress_test.go:611: 
        	Error Trace:	egress_test.go:611
        	            				egress_test.go:589
        	Error:      	Received unexpected error:
        	            	timed out waiting for the condition
        	Test:       	TestEgress/testEgressNodeFailure/IPv6_cluster

So I changed it to 5 seconds.

Contributor Author

wenqiq commented Sep 2, 2021

/test-all
/test-e2e
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-e2e

/test-ipv6-conformance
/test-ipv6-networkpolicy
/test-ipv6-only-networkpolicy

Member

tnqn commented Sep 3, 2021

The only failure in "jenkins-ipv6-only-e2e" is not related to this PR and is going to be fixed by #2712:

=== RUN   TestWireGuard/testServiceConnectivity
    wireguard_test.go:149: 
        	Error Trace:	wireguard_test.go:149
        	            				wireguard_test.go:73
        	Error:      	Received unexpected error:
        	            	nc stdout: <>, stderr: <>, err: <command terminated with exit code 1>
        	Test:       	TestWireGuard/testServiceConnectivity
        	Messages:   	Pod hostnetwork-pod should be able to connect the service's NodePort 127.0.0.1:%!s(int32=31776), but was not able to connect
=== CONT  TestWireGuard
    fixtures.go:257: Exporting test logs to '/var/lib/jenkins/workspace/antrea-ipv6-only-e2e-for-pull-request/antrea-test-logs/TestWireGuard/beforeTeardown.Sep02-18-16-05'
    fixtures.go:363: Error when exporting kubelet logs: error when running journalctl on Node 'antrea-ipv6-8-0', is it available? Error: <nil>
    fixtures.go:384: Deleting 'antrea-test' K8s Namespace
--- FAIL: TestWireGuard (102.19s)
    --- PASS: TestWireGuard/testPodConnectivity (16.26s)
    --- FAIL: TestWireGuard/testServiceConnectivity (27.28s)
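
A side note on the `%!s(int32=31776)` fragment in the message above: that is how Go's fmt package renders an argument whose type does not match the verb, here an int32 formatted with `%s`. It is separate from the connectivity failure itself. A small sketch, where the `formatAddr` helper and the format strings are illustrative and not the test's actual code:

```go
package main

import "fmt"

// formatAddr applies a caller-supplied format string to a host and an int32
// port, so the mismatched and the matching verb can be compared side by side.
func formatAddr(format, host string, port int32) string {
	return fmt.Sprintf(format, host, port)
}

func main() {
	fmt.Println(formatAddr("%s:%s", "127.0.0.1", 31776)) // 127.0.0.1:%!s(int32=31776)
	fmt.Println(formatAddr("%s:%d", "127.0.0.1", 31776)) // 127.0.0.1:31776
}
```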


5 participants