Fix OVS "flow" replay for groups #2134

antoninbas · 2021-04-28T18:01:09Z

The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make it more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we check that the number of OVS flows and groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.
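
As an illustration of the before / after comparison, here is a minimal sketch. It assumes the test has text dumps of the OVS groups in the style printed by `ovs-ofctl dump-groups`, where every group line contains `group_id=`. The helper name `countEntries` and the sample dumps are made up for this example and are not taken from the actual TestOVSFlowReplay code.

```go
package main

import (
	"fmt"
	"strings"
)

// countEntries counts the lines of a text dump that contain marker.
// Assumption: every group line contains "group_id=" and reply header
// lines do not, so headers are skipped automatically.
func countEntries(dump, marker string) int {
	n := 0
	for _, line := range strings.Split(dump, "\n") {
		if strings.Contains(line, marker) {
			n++
		}
	}
	return n
}

func main() {
	// Made-up sample dumps, for illustration only.
	before := "OFPST_GROUP_DESC reply (OF1.3) (xid=0x2):\n" +
		" group_id=1,type=select,bucket=actions=drop\n" +
		" group_id=2,type=select,bucket=actions=drop\n"
	after := "OFPST_GROUP_DESC reply (OF1.3) (xid=0x2):\n" +
		" group_id=1,type=select,bucket=actions=drop\n"

	b := countEntries(before, "group_id=")
	a := countEntries(after, "group_id=")
	if a != b {
		fmt.Printf("group count changed after replay: %d -> %d\n", b, a)
	}
}
```

The same helper could be applied to a flow dump with a different marker. Comparing counts is a coarse check, but it is enough to catch groups that silently fail to be replayed.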

Fixes #2127

codecov-commenter · 2021-04-28T19:11:12Z

Codecov Report

Merging #2134 (9b6e873) into main (7355d27) will decrease coverage by 0.00%.
The diff coverage is 25.00%.

@@            Coverage Diff             @@
##             main    #2134      +/-   ##
==========================================
- Coverage   61.22%   61.22%   -0.01%     
==========================================
  Files         269      269              
  Lines       20453    20457       +4     
==========================================
+ Hits        12523    12525       +2     
- Misses       6633     6636       +3     
+ Partials     1297     1296       -1

| Flag | Coverage Δ | |
|---|---|---|
| e2e-tests | `∅ <ø> (?)` | |
| kind-e2e-tests | `52.06% <25.00%> (-0.01%)` | ⬇️ |
| unit-tests | `41.38% <0.00%> (-0.03%)` | ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/agent/openflow/client.go | `59.25% <0.00%> (-0.32%)` | ⬇️ |
| pkg/ovs/openflow/ofctrl_group.go | `45.94% <0.00%> (-2.63%)` | ⬇️ |
| pkg/agent/agent.go | `47.74% <100.00%> (ø)` | |
| pkg/ovs/openflow/ofctrl_bridge.go | `50.00% <100.00%> (ø)` | |
| pkg/controller/networkpolicy/status_controller.go | `85.16% <0.00%> (-1.30%)` | ⬇️ |
| ...gent/controller/networkpolicy/status_controller.go | `75.34% <0.00%> (+2.73%)` | ⬆️ |
| pkg/controller/networkpolicy/tier.go | `52.50% <0.00%> (+5.00%)` | ⬆️ |

tnqn

Thanks for the quick fix! LGTM, just have a question about the comment.

tnqn · 2021-04-29T02:13:04Z

pkg/ovs/openflow/ofctrl_group.go

 func (g *ofGroup) Reset() {
-	g.ofctrl.Switch = g.bridge.ofSwitch
+	// An error ("group already exists") is not possible here since the same
+	// group was created successfully before. If something is wrong and


Is the comment correct? It seems the error is not possible because the ofSwitch is a new instance.

I tried to clarify. I wanted to communicate two things: 1) it is a new ofSwitch, as you mention, but also 2) all the groups we are creating were created successfully before (with the previous ofSwitch instance), so their creation should succeed this time as well (no duplicate group IDs).
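
To make the two points above concrete, here is a minimal sketch of a reset that re-creates a cached group on a fresh switch instance. All types here (`fakeSwitch`, `swGroup`, `bridge`, `cachedGroup`) are hypothetical stand-ins written for this example; they are not the ofnet or Antrea types, and the actual Reset() in this PR may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// swGroup is a hypothetical stand-in for the library-level group object.
type swGroup struct {
	ID      uint32
	Buckets []string
}

// fakeSwitch is a hypothetical stand-in for one switch connection. It keeps
// an in-memory table of the groups created through it.
type fakeSwitch struct {
	groups map[uint32]*swGroup
}

func newFakeSwitch() *fakeSwitch {
	return &fakeSwitch{groups: map[uint32]*swGroup{}}
}

// NewGroup registers a group and fails on a duplicate ID.
func (s *fakeSwitch) NewGroup(id uint32) (*swGroup, error) {
	if _, ok := s.groups[id]; ok {
		return nil, errors.New("group already exists")
	}
	g := &swGroup{ID: id}
	s.groups[id] = g
	return g, nil
}

type bridge struct{ sw *fakeSwitch }

// cachedGroup is the long-lived object that survives a reconnection.
type cachedGroup struct {
	br  *bridge
	grp *swGroup
}

// Reset re-creates the group on the bridge's current switch instance and
// carries the buckets over, so a later replay installs the same content.
func (g *cachedGroup) Reset() error {
	fresh, err := g.br.sw.NewGroup(g.grp.ID)
	if err != nil {
		return err
	}
	fresh.Buckets = g.grp.Buckets
	g.grp = fresh
	return nil
}

func main() {
	br := &bridge{sw: newFakeSwitch()}
	old, _ := br.sw.NewGroup(1)
	old.Buckets = []string{"10.0.0.1:80"}
	cg := &cachedGroup{br: br, grp: old}

	br.sw = newFakeSwitch() // reconnection: a new, empty switch instance
	if err := cg.Reset(); err != nil {
		fmt.Println("reset failed:", err)
		return
	}
	fmt.Println(len(br.sw.groups), cg.grp.Buckets)
}
```

In this model the duplicate-ID error cannot occur during a reset, because the new switch instance starts with an empty table and each cached group has a unique ID.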

tnqn · 2021-04-29T02:28:14Z

I'm wondering whether there is an issue if only antrea-agent restarts. I know the code takes care of cleaning up all stale flows, but not groups. If it reuses group IDs, will there be trouble when installing them?

antoninbas · 2021-04-29T03:04:10Z

When only the agent restarts (as opposed to OVS daemons), there is no cache replay. Instead AntreaProxy will trigger all needed groups to be re-created by calling this function:

https://github.com/vmware-tanzu/antrea/blob/7355d276eb957ddc65d863eead406357fb115a89/pkg/ovs/openflow/ofctrl_bridge.go#L151-L158

If the Group already exists, we will get the existing object with GetGroup, and then set the buckets correctly.
The code may not be ideal, but it should work in that case. When we start using incremental bucket insertions / deletions, things may get more complicated here. The only issue I can see is if we have some stale groups left over (e.g. a Service was deleted while the Agent was down). However, this seems relatively harmless, since there will be no flows referencing such a group (stale flows should be removed correctly by the DeleteStaleFlows goroutine).

Do you think I am missing something, or was your question about a different scenario?
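
The get-or-create behavior described above can be sketched as follows. This is a simplified model, not the linked code: `groupTable`, `group` and `createGroup` are hypothetical names, and the table is a plain in-memory map.

```go
package main

import (
	"errors"
	"fmt"
)

// group is a hypothetical, simplified group object.
type group struct {
	ID      uint32
	Buckets []string
}

// groupTable is a hypothetical in-memory table of groups, keyed by ID.
type groupTable struct{ groups map[uint32]*group }

// NewGroup adds a group to the table and fails if the ID is already used.
func (t *groupTable) NewGroup(id uint32) (*group, error) {
	if _, ok := t.groups[id]; ok {
		return nil, errors.New("group already exists")
	}
	g := &group{ID: id}
	t.groups[id] = g
	return g, nil
}

// GetGroup returns the group with the given ID, or nil if there is none.
func (t *groupTable) GetGroup(id uint32) *group { return t.groups[id] }

// createGroup returns a new group, or the existing one when the ID is taken.
func createGroup(t *groupTable, id uint32) *group {
	g, err := t.NewGroup(id)
	if err != nil {
		g = t.GetGroup(id)
	}
	return g
}

func main() {
	t := &groupTable{groups: map[uint32]*group{}}
	first := createGroup(t, 1)
	first.Buckets = []string{"10.0.0.1:80"}

	// A second call with the same ID returns the same object, whose
	// buckets can then be overwritten with the desired set.
	second := createGroup(t, 1)
	second.Buckets = []string{"10.0.0.2:80"}
	fmt.Println(first == second, first.Buckets)
}
```

Note that both calls only touch the in-memory table in this sketch; nothing here queries a real switch.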

tnqn · 2021-04-29T03:22:15Z

I mean the same scenario, but both b.ofSwitch.NewGroup and b.ofSwitch.GetGroup only operate on the in-memory map; they don't actually create a group on the switch or get one from it. The map is always empty on restart, so NewGroup will always be called and Group.isInstalled will be false. I see that the code will send openflow13.OFPGC_ADD instead of openflow13.OFPGC_MODIFY in that case, so it seems that it will fail to install groups.
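
The concern above can be modeled with a short sketch. It assumes a datapath that rejects an ADD for an existing group ID and a MODIFY for an unknown one, as the OpenFlow 1.3 specification describes; `datapath`, `localGroup` and `install` are hypothetical names for this example, not the ofnet or Antrea API.

```go
package main

import (
	"errors"
	"fmt"
)

// Group-mod command values from the OpenFlow 1.3 specification.
const (
	cmdAdd    = 0 // OFPGC_ADD
	cmdModify = 1 // OFPGC_MODIFY
)

// datapath is a hypothetical model of the groups held by the real switch.
type datapath struct{ groups map[uint32]bool }

// groupMod applies an ADD or MODIFY and reports the errors a switch would.
func (d *datapath) groupMod(cmd int, id uint32) error {
	switch {
	case cmd == cmdAdd && d.groups[id]:
		return errors.New("group already exists")
	case cmd == cmdModify && !d.groups[id]:
		return errors.New("unknown group")
	}
	d.groups[id] = true
	return nil
}

// localGroup is the agent-side object; installed is in-memory state only.
type localGroup struct {
	id        uint32
	installed bool
}

// install sends ADD for a group not yet marked installed, MODIFY otherwise.
func (g *localGroup) install(d *datapath) error {
	cmd := cmdAdd
	if g.installed {
		cmd = cmdModify
	}
	if err := d.groupMod(cmd, g.id); err != nil {
		return err
	}
	g.installed = true
	return nil
}

func main() {
	dp := &datapath{groups: map[uint32]bool{}}
	g := &localGroup{id: 1}
	fmt.Println(g.install(dp)) // first install uses ADD

	// Agent restart: the in-memory state is lost but the datapath is not.
	restarted := &localGroup{id: 1}
	fmt.Println(restarted.install(dp)) // ADD again, rejected by this model
}
```

The failure only appears when the datapath still holds the group from before the restart, which is why it matters whether groups are cleared when the agent reconnects.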

antoninbas · 2021-04-29T03:42:58Z

@tnqn thanks for clarifying, sorry I'm a bit tired :)
I couldn't reproduce the issue, so I looked at the code, and found this: https://github.com/wenyingd/ofnet/blob/14a78b27ef8762e45a0cfc858c4d07a4572a99d5/ofctrl/fgraphSwitch.go#L57-L62

It seems that ofnet takes care of deleting all groups during initialization, so an Antrea Agent restart will always clear all groups first. This may not be what we want in the long term, but in the short term it guarantees that there won't be any issue during reconciliation on restart. Let me add a comment somewhere in the Antrea code about this.
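
A rough model of why clearing groups at initialization avoids the problem is sketched below. It relies on the OpenFlow 1.3 rule that a DELETE group-mod with the reserved ID OFPG_ALL removes every group; the `datapath` type and the `connect` function are hypothetical and do not reproduce the ofnet code linked above.

```go
package main

import (
	"errors"
	"fmt"
)

// Values from the OpenFlow 1.3 specification.
const (
	cmdAdd    = 0          // OFPGC_ADD
	cmdDelete = 2          // OFPGC_DELETE
	groupAll  = 0xfffffffc // OFPG_ALL: matches every group in a delete
)

// datapath is a hypothetical model of the groups held by the real switch.
type datapath struct{ groups map[uint32]bool }

// groupMod applies an ADD or DELETE group-mod to the modeled datapath.
func (d *datapath) groupMod(cmd int, id uint32) error {
	switch cmd {
	case cmdAdd:
		if d.groups[id] {
			return errors.New("group already exists")
		}
		d.groups[id] = true
	case cmdDelete:
		if id == groupAll {
			d.groups = map[uint32]bool{}
		} else {
			delete(d.groups, id)
		}
	default:
		return errors.New("unsupported command")
	}
	return nil
}

// connect models a controller initialization that wipes all groups first.
func connect(d *datapath) error {
	return d.groupMod(cmdDelete, groupAll)
}

func main() {
	// The datapath still holds group 1 from before the agent restart.
	dp := &datapath{groups: map[uint32]bool{1: true}}

	if err := connect(dp); err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	// After the wipe, re-adding group 1 with ADD no longer conflicts.
	fmt.Println(dp.groupMod(cmdAdd, 1))
}
```

The trade-off in this model is that every restart briefly leaves the datapath without any groups, which matches the remark that this may not be the desired long-term behavior.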

tnqn

LGTM

tnqn · 2021-04-29T13:47:37Z

/test-all

tnqn · 2021-04-29T13:48:02Z

/test-windows-e2e

antoninbas · 2021-04-29T18:37:32Z

/test-all

antoninbas · 2021-04-29T20:21:33Z

/test-e2e
The test timed out; we need to increase the timeout value.

vmwclabot added the cla-not-required label Apr 28, 2021

antoninbas requested review from hongliangl, wenyingd and tnqn April 28, 2021 18:01

tnqn previously approved these changes Apr 29, 2021

Improve comment

af9b06f

antoninbas dismissed tnqn’s stale review via af9b06f April 29, 2021 02:56

Add comment to clarify Antrea Agent restart scenario

9b6e873

tnqn approved these changes Apr 29, 2021

antoninbas merged commit ce8f41f into antrea-io:main Apr 29, 2021

antoninbas deleted the fix-ovs-flow-replay-for-groups branch April 29, 2021 22:10
