Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix OVS "flow" replay for groups #2134

Merged
merged 3 commits into from
Apr 29, 2021

Conversation

antoninbas
Copy link
Contributor

The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes #2127

The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
@codecov-commenter
Copy link

codecov-commenter commented Apr 28, 2021

Codecov Report

Merging #2134 (9b6e873) into main (7355d27) will decrease coverage by 0.00%.
The diff coverage is 25.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2134      +/-   ##
==========================================
- Coverage   61.22%   61.22%   -0.01%     
==========================================
  Files         269      269              
  Lines       20453    20457       +4     
==========================================
+ Hits        12523    12525       +2     
- Misses       6633     6636       +3     
+ Partials     1297     1296       -1     
Flag Coverage Δ
e2e-tests ∅ <ø> (?)
kind-e2e-tests 52.06% <25.00%> (-0.01%) ⬇️
unit-tests 41.38% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/openflow/client.go 59.25% <0.00%> (-0.32%) ⬇️
pkg/ovs/openflow/ofctrl_group.go 45.94% <0.00%> (-2.63%) ⬇️
pkg/agent/agent.go 47.74% <100.00%> (ø)
pkg/ovs/openflow/ofctrl_bridge.go 50.00% <100.00%> (ø)
pkg/controller/networkpolicy/status_controller.go 85.16% <0.00%> (-1.30%) ⬇️
...gent/controller/networkpolicy/status_controller.go 75.34% <0.00%> (+2.73%) ⬆️
pkg/controller/networkpolicy/tier.go 52.50% <0.00%> (+5.00%) ⬆️

tnqn
tnqn previously approved these changes Apr 29, 2021
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix! LGTM, just have a question about the comment.

func (g *ofGroup) Reset() {
g.ofctrl.Switch = g.bridge.ofSwitch
// An error ("group already exists") is not possible here since the same
// group was created successfully before. If something is wrong and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment correct? It seems because the ofSwitch is a new instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clarify. I wanted to communicate 2 things 1) it is a new ofSwitch as you mention, but also 2) all the groups we are creating were created successfully before (previous ofSwitch instance) so their creation should succeed this time as well (no duplicate group IDs)

@tnqn
Copy link
Member

tnqn commented Apr 29, 2021

I'm thinking whether there is an issue if only antrea-agent restarts. I know the code has taken care of cleaning up all stale flows, but not groups. If it reuses group IDs, will there be a trouble when installing them?

@antoninbas
Copy link
Contributor Author

I'm thinking whether there is an issue if only antrea-agent restarts. I know the code has taken care of cleaning up all stale flows, but not groups. If it reuses group IDs, will there be a trouble when installing them?

When only the agent restarts (as opposed to OVS daemons), there is no cache replay. Instead AntreaProxy will trigger all needed groups to be re-created by calling this function:

https://github.com/vmware-tanzu/antrea/blob/7355d276eb957ddc65d863eead406357fb115a89/pkg/ovs/openflow/ofctrl_bridge.go#L151-L158

If the Group already exists, we will get the existing object with GetGroup, and then set the buckets correctly.
The code may not be ideal but it should work in that case. When we start using incremental bucket insertions / deletions, things may get more complicated here. The only issue I can see is if we have some stale groups left over (e.g. a Service was deleted while the Agent was down). However it seems relatively harmless in this case, since there will be no flows referencing that group (stale flows should be removed correctly by the DeleteStaleFlows goroutine).

Do you think I am missing something, or was your question about a different scenario?

@tnqn
Copy link
Member

tnqn commented Apr 29, 2021

I'm thinking whether there is an issue if only antrea-agent restarts. I know the code has taken care of cleaning up all stale flows, but not groups. If it reuses group IDs, will there be a trouble when installing them?

When only the agent restarts (as opposed to OVS daemons), there is no cache replay. Instead AntreaProxy will trigger all needed groups to be re-created by calling this function:

https://github.com/vmware-tanzu/antrea/blob/7355d276eb957ddc65d863eead406357fb115a89/pkg/ovs/openflow/ofctrl_bridge.go#L151-L158

If the Group already exists, we will get the existing object with GetGroup, and then set the buckets correctly.
The code may not be ideal but it should work in that case. When we start using incremental bucket insertions / deletions, things may get more complicated here. The only issue I can see is if we have some stale groups left over (e.g. a Service was deleted while the Agent was down). However it seems relatively harmless in this case, since there will be no flows referencing that group (stale flows should be removed correctly by the DeleteStaleFlows goroutine).

Do you think I am missing something, or was your question about a different scenario?

I mean the same scenario, but both b.ofSwitch.NewGroup and b.ofSwitch.GetGroup just operates the in-memory map, which don't really create or get group from the switch. The map is always empty on restart, NewGroup will always be called and Group.isInstalled will be false. I see the code will call openflow13.OFPGC_ADD instead of openflow13.OFPGC_MODIFY in such case. So it seems that it will fail to install groups.

@antoninbas
Copy link
Contributor Author

@tnqn thanks for clarifying, sorry I'm a bit tired :)
I couldn't reproduce the issue, so I looked at the code, and found this: https://github.com/wenyingd/ofnet/blob/14a78b27ef8762e45a0cfc858c4d07a4572a99d5/ofctrl/fgraphSwitch.go#L57-L62

It seems that ofnet takes care of deleting all groups during initialization, so an Antrea Agent restart will always clear all groups first. This may not be what we want to do for the long term, but in the short term it guarantees that there won't be any issue during reconciliation on restart. Let me add a comment somewhere in the Antrea code about this.

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Apr 29, 2021

/test-all

@tnqn
Copy link
Member

tnqn commented Apr 29, 2021

/test-windows-e2e

@antoninbas
Copy link
Contributor Author

/test-all

@antoninbas
Copy link
Contributor Author

/test-e2e
test timeout, we need to increase the timeout value

@antoninbas antoninbas merged commit ce8f41f into antrea-io:main Apr 29, 2021
@antoninbas antoninbas deleted the fix-ovs-flow-replay-for-groups branch April 29, 2021 22:10
antoninbas added a commit to antoninbas/antrea that referenced this pull request Apr 29, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
antoninbas added a commit that referenced this pull request Apr 30, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes #2127
antoninbas added a commit to antoninbas/antrea that referenced this pull request Apr 30, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
antoninbas added a commit to antoninbas/antrea that referenced this pull request Apr 30, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
antoninbas added a commit to antoninbas/antrea that referenced this pull request Apr 30, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
antoninbas added a commit that referenced this pull request Apr 30, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes #2127
antoninbas added a commit to antoninbas/antrea that referenced this pull request Apr 30, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
antoninbas added a commit that referenced this pull request May 1, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes #2127
antoninbas added a commit to antoninbas/antrea that referenced this pull request May 1, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
antoninbas added a commit to antoninbas/antrea that referenced this pull request May 1, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes antrea-io#2127
antoninbas added a commit that referenced this pull request May 3, 2021
The Group objects were not reset correctly when attempting to replay
them, leading to confusing error log messages and invalid datapath
state. We fix the implementation of Reset() for groups and we ensure
that the method is called during replay.

We also update the TestOVSFlowReplay e2e test to make sure it is more
comprehensive: instead of just checking Pod-to-Pod connectivity after a
replay, we ensure that the number of OVS flows / groups is the same
before and after a restart / replay. We confirmed that the updated test
fails when the patch is not applied.

Fixes #2127
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AntreaProxy failed to create ovs flows and groups for K8s service
4 participants