Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unexpected AltName after rename interface #6321

Merged
merged 1 commit into from
Jun 5, 2024

Conversation

gran-vmv
Copy link
Contributor

Fix #6301

@gran-vmv gran-vmv self-assigned this May 11, 2024
@gran-vmv
Copy link
Contributor Author

/test-e2e

@gran-vmv gran-vmv added kind/bug Categorizes issue or PR as related to a bug. area/interface Issues or PRs related to network interfaces area/OS/linux Issues or PRs related to the Linux operating system. labels May 13, 2024
pkg/agent/util/net_linux.go Show resolved Hide resolved
@@ -378,6 +384,17 @@ func renameHostInterface(oriName string, newName string) error {
return nil
}

func removeInterfaceAltName(name string, altName string) error {
link, err := netlinkUtil.LinkByName(name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to add check first, if altname exists, and equals to the original name, we shall delete it, otherwise, ignore it.

Copy link
Contributor

@wenyingd wenyingd May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that time delays exist between adding the altname and our deletion on the target altname, e.g., the sequence is like this,

  1. rename uplink
  2. try to delete altname (introduced in this change)
  3. system adds the altname.

If so, the change may not work as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that time delays exist between adding the altname and our deletion on the target altname, e.g., the sequence is like this,

  1. rename uplink
  2. try to delete altname (introduced in this change)
  3. system adds the altname.

If so, the change may not work as expected.

AltName will be set very quickly.
If this process get delayed, antrea-agent will create internal port and occupy this name first, and this is not an issue.

@gran-vmv gran-vmv force-pushed the ipam-renamefix branch 4 times, most recently from 303b114 to 58dbcc3 Compare May 13, 2024 06:18
pkg/agent/util/net_linux.go Outdated Show resolved Hide resolved
pkg/agent/util/net_linux.go Outdated Show resolved Hide resolved
@luolanzone luolanzone added the action/release-note Indicates a PR that should be included in release notes. label May 13, 2024
@gran-vmv
Copy link
Contributor Author

@tnqn @antoninbas

In this PR we bumped up netlink to latest dev version to support AltName operation. However, we found the default Route behavior is changed from the integration test failure, and we cannot guarantee that there is no other change which will break current logic.

Do you have any suggestion on this situation?

@antoninbas
Copy link
Contributor

antoninbas commented May 14, 2024

In this PR we bumped up netlink to latest dev version to support AltName operation. However, we found the default Route behavior is changed from the integration test failure

What is the behavior change?
Is it a matter of simply updating the tests?

we cannot guarantee that there is no other change which will break current logic.

That's true of a lot of dependency version updates.
It's unfortunate that the netlink library doesn't have a proper release process with release notes that we could refer to, and that the latest release tag is a "beta" tag from 2 years ago.
But we should still be able to update the library if we need to. Ideally, we would review the change and check if anything is likely to impact us. That's also why we have tests, to make sure that we don't break anything on our side when we bump up the dependency. In the case of the integration test failure, we need to investigate if it's a matter of updating the test or something deeper. Did an API change?

If we cannot update to a version of the netlink library that includes support for AltName while not introducing bugs that impact us, we will fork the repository and do our own release. (In that case, we should also open issues upstream as appropriate.)

@gran-vmv
Copy link
Contributor Author

In this PR we bumped up netlink to latest dev version to support AltName operation. However, we found the default Route behavior is changed from the integration test failure

What is the behavior change? Is it a matter of simply updating the tests?

we cannot guarantee that there is no other change which will break current logic.

That's true of a lot of dependency version updates. It's unfortunate that the netlink library doesn't have a proper release process with release notes that we could refer to, and that the latest release tag is a "beta" tag from 2 years ago. But we should still be able to update the library if we need to. Ideally, we would review the change and check if anything is likely to impact us. That's also why we have tests, to make sure that we don't break anything on our side when we bump up the dependency. In the case of the integration test failure, we need to investigate if it's a matter of updating the test or something deeper. Did an API change?

If we cannot update to a version of the netlink library that includes support for AltName while not introducing bugs that impact us, we will fork the repository and do our own release. (In that case, we should also open issues upstream as appropriate.)

Previously the dafault route has nil Dst, but it is filled in latest dev version by below code in route_linux.go:

	// Same logic to generate "default" dst with iproute2 implementation
	if route.Dst == nil {
		var addLen int
		var ip net.IP
		switch msg.Family {
		case FAMILY_V4:
			addLen = net.IPv4len
			ip = net.IPv4zero
		case FAMILY_V6:
			addLen = net.IPv6len
			ip = net.IPv6zero
		}

		if addLen != 0 {
			route.Dst = &net.IPNet{
				IP:   ip,
				Mask: net.CIDRMask(int(msg.Dst_len), 8*addLen),
			}
		}
	}

Antrea uses Route.Dst == nil to check if this route is default route, and we need to fix the code and inteegration test.

go.mod Outdated
@@ -236,3 +236,5 @@ require (
sigs.k8s.io/kustomize/kyaml v0.14.3-0.20230601165947-6ce0bf390ce3 // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.4.1 // indirect
)

replace github.com/vishvananda/netlink => github.com/vishvananda/netlink v0.0.0-20240425164735-856e190dd707
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why doesn't it set version in require directly?

@hongliangl have you figured out whether the new version really affects the e2e test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, still testing it on my forked Antrea repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to change the version in require directly, but it is changed back to v1.2.1-beta.2 after Goland resynced the dependency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hongliang's PR is merged. No need to bump up netlink now.

// which is also occupied by an interface name, after the interface name changed from generated value to another one,
// the interface altname will be set to previous name immediately.
// This altname must be removed to avoid unexpected conflict.
removeInterfaceAltName(to, from)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

miss error handling

@tnqn
Copy link
Member

tnqn commented May 15, 2024

Previously the dafault route has nil Dst, but it is filled in latest dev version by below code in route_linux.go:

If this is the only known case, we should adapt the code to it. But I remember @hongliangl mentioned other test failures that may be related to it.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had started a review yesterday, but didn't submit it...

go.mod Outdated
@@ -236,3 +236,5 @@ require (
sigs.k8s.io/kustomize/kyaml v0.14.3-0.20230601165947-6ce0bf390ce3 // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.4.1 // indirect
)

replace github.com/vishvananda/netlink => github.com/vishvananda/netlink v0.0.0-20240425164735-856e190dd707
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need a replace directive. Can't you just use the desired "version" in the require block directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to change the version in require directly, but it is changed back to v1.2.1-beta.2 after Goland resynced the dependency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hongliang's PR is merged. No need to bump up netlink now.

@@ -290,6 +290,12 @@ func RenameInterface(from, to string) error {
if pollErr != nil {
return fmt.Errorf("failed to rename host interface name %s to %s", from, to)
}
// Fix for the issue https://github.com/antrea-io/antrea/issues/6301.
// In some new Linux versions which support AltName, if there is only one valid value in AlternativeNamesPolicy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by "one valid value" here? In the original issue, I see that AlternativeNamesPolicy=database onboard slot path. It looks like 4 valid values to me based on https://access.redhat.com/solutions/6964829

Copy link
Contributor Author

@gran-vmv gran-vmv May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we can set many values here, it may not be valid for an specific interface.

For example, here is the default value in ubuntu 23.10

[Link]
NamePolicy=keep kernel database onboard slot path
AlternativeNamesPolicy=database onboard slot path

If slot and path is valid for this interface (it depends on the hardware information), we'll get below name/altname for a new interface. Name is slot and altname is path:

2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:a6:68:79 brd ff:ff:ff:ff:ff:ff
    altname enp11s0

If only path is valid, we'll get below name for a new interface. Name is path and no altname:

3: enp11s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:a6:68:7A brd ff:ff:ff:ff:ff:ff

If we change the name of above interface, altname will be set to path immediately. And antrea will failed to create internal OVS port for this case.

3: enp11s1~: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:a6:68:7A brd ff:ff:ff:ff:ff:ff
    altname enp11s1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe adjust it a little:

In some new Linux versions which support AltName, if the only valid altname of the interface is the same as the interface name, it would be left empty when the name is occupied by the interface name; after we rename the interface name to another value, the altname of the interface would be set to the original interface name by the system. This altname must be removed as we need to reserve the name for an OVS internal port.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@antoninbas antoninbas added the action/backport Indicates a PR that requires backports. label May 15, 2024
@hongliangl
Copy link
Contributor

Previously the dafault route has nil Dst, but it is filled in latest dev version by below code in route_linux.go:

If this is the only known case, we should adapt the code to it. But I remember @hongliangl mentioned other test failures that may be related to it.

Yes, I always got the same failure from the following test after bumping the lib netlink.

 --- FAIL: TestAntreaPolicy/TestGroupValidationWebhook/Case=CreateInvalidTier (18.07s)

@gran-vmv
Copy link
Contributor Author

Previously the dafault route has nil Dst, but it is filled in latest dev version by below code in route_linux.go:

If this is the only known case, we should adapt the code to it. But I remember @hongliangl mentioned other test failures that may be related to it.

Yes, I always got the same failure from the following test after bumping the lib netlink.

 --- FAIL: TestAntreaPolicy/TestGroupValidationWebhook/Case=CreateInvalidTier (18.07s)

Which version you bumped up and which pipeline you get the failure?

In this PR, all finished e2e pipelines are passed.

@hongliangl
Copy link
Contributor

hongliangl commented May 16, 2024

Previously the dafault route has nil Dst, but it is filled in latest dev version by below code in route_linux.go:

If this is the only known case, we should adapt the code to it. But I remember @hongliangl mentioned other test failures that may be related to it.

Yes, I always got the same failure from the following test after bumping the lib netlink.

 --- FAIL: TestAntreaPolicy/TestGroupValidationWebhook/Case=CreateInvalidTier (18.07s)

Which version you bumped up and which pipeline you get the failure?

In this PR, all finished e2e pipelines are passed.

Previously the dafault route has nil Dst, but it is filled in latest dev version by below code in route_linux.go:

If this is the only known case, we should adapt the code to it. But I remember @hongliangl mentioned other test failures that may be related to it.

Yes, I always got the same failure from the following test after bumping the lib netlink.

 --- FAIL: TestAntreaPolicy/TestGroupValidationWebhook/Case=CreateInvalidTier (18.07s)

Which version you bumped up and which pipeline you get the failure?

In this PR, all finished e2e pipelines are passed.

All e2e tests might pass sometimes, but it seems that the failure is always the same. I saw the same failure from the tests in this PR:

This is from another PR I made:

@gran-vmv gran-vmv force-pushed the ipam-renamefix branch 2 times, most recently from 43a83cd to 41a3bbe Compare May 16, 2024 09:20
@gran-vmv
Copy link
Contributor Author

#6193 also bumped up netlink lib. Need rebase after #6193 merged.

@gran-vmv
Copy link
Contributor Author

/test-all
/test-flexible-ipam-e2e

@@ -290,6 +290,12 @@ func RenameInterface(from, to string) error {
if pollErr != nil {
return fmt.Errorf("failed to rename host interface name %s to %s", from, to)
}
// Fix for the issue https://github.com/antrea-io/antrea/issues/6301.
// In some new Linux versions which support AltName, if there is only one valid value in AlternativeNamesPolicy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe adjust it a little:

In some new Linux versions which support AltName, if the only valid altname of the interface is the same as the interface name, it would be left empty when the name is occupied by the interface name; after we rename the interface name to another value, the altname of the interface would be set to the original interface name by the system. This altname must be removed as we need to reserve the name for an OVS internal port.

Comment on lines 299 to 300
klog.ErrorS(err, "Failed to remove AltName after interface renamed")
return fmt.Errorf("failed to remove AltName %s on interface %s", from, to)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.ErrorS(err, "Failed to remove AltName after interface renamed")
return fmt.Errorf("failed to remove AltName %s on interface %s", from, to)
return fmt.Errorf("failed to remove AltName %s on interface %s: %w", from, to, err)

don't handle the same error twice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks.

Fix antrea-io#6301

Signed-off-by: gran <gran@vmware.com>
Co-authored-by: Lan <luola@vmware.com>
@gran-vmv
Copy link
Contributor Author

/test-all
/test-flexible-ipam-e2e

@gran-vmv
Copy link
Contributor Author

/test-all

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn tnqn merged commit d895124 into antrea-io:main Jun 5, 2024
52 of 56 checks passed
@tnqn
Copy link
Member

tnqn commented Jun 5, 2024

@gran-vmv could you backport this to 1.14-2.0?

gran-vmv added a commit to gran-vmv/antrea that referenced this pull request Jun 5, 2024
Fix antrea-io#6301

Signed-off-by: gran <gran@vmware.com>
Co-authored-by: Lan <luola@vmware.com>
gran-vmv added a commit to gran-vmv/antrea that referenced this pull request Jun 5, 2024
Fix antrea-io#6301

Signed-off-by: gran <gran@vmware.com>
Co-authored-by: Lan <luola@vmware.com>
gran-vmv added a commit to gran-vmv/antrea that referenced this pull request Jun 7, 2024
Fix antrea-io#6301

Signed-off-by: gran <gran@vmware.com>
Co-authored-by: Lan <luola@vmware.com>
gran-vmv added a commit to gran-vmv/antrea that referenced this pull request Jun 7, 2024
Fix antrea-io#6301

Signed-off-by: gran <gran@vmware.com>
Co-authored-by: Lan <luola@vmware.com>
gran-vmv added a commit to gran-vmv/antrea that referenced this pull request Jun 7, 2024
Fix antrea-io#6301

Signed-off-by: gran <gran@vmware.com>
Co-authored-by: Lan <luola@vmware.com>
tnqn pushed a commit that referenced this pull request Jun 11, 2024
tnqn pushed a commit that referenced this pull request Jun 11, 2024
gran-vmv added a commit to gran-vmv/antrea that referenced this pull request Jun 12, 2024
tnqn pushed a commit that referenced this pull request Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/interface Issues or PRs related to network interfaces area/OS/linux Issues or PRs related to the Linux operating system. kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Moving host interface to OVS bridge failed due to interface altname
6 participants