Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gate CNI support behind the cni build tag #1767

Merged
merged 1 commit into from
Jan 31, 2024

Conversation

dcermak
Copy link
Contributor

@dcermak dcermak commented Dec 11, 2023

  • add cni build tag to libnetwork/cni
  • split libnetwork/network into multiple files so that cni support can be made optionally available

See also https://issues.redhat.com/browse/RUN-1943

@rhatdan
Copy link
Member

rhatdan commented Dec 11, 2023

/approve
LGTM
@Luap99 @mheon PTAL

Copy link
Contributor

openshift-ci bot commented Dec 11, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcermak, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two cases to consider:
First of if the cni build tag is set it should keep behaving like it does today. Although I could also see a use case for dropping this awkward cni vs netavark detection logic and only keep reading the defaultNetworkBackend file and if the file does not exists (new users) assume empty == netavark. That would mean an update to 3.X to 5.X (with CNI) could break but I don't think that is a problem given that most 5.X build should not have cni enabled anyway.

Second if the cni build tag is not set then we have the network_backend key from the config, assuming it is set to netavark all is good and we can just use it, if it is set to cni we should return an error. However if it is empty then it reads the defaultNetworkBackend file and if the content is cni it is not clear to me that we should force an error here. This means the user (or distro) never explicitly configured anything and throwing a hard error because the users just kept updating from 3.X seems harsh to me. Assuming there are no CNI networks created and the user only ever used the default network then a update should not require manually intervention. A reboot would be enough to fix the old temporary CNI state to be cleanup up and stuff should work again. And there are also cases were people only run with --network host in which case this choice here shouldn't matter so breaking them will be unjustified (this is very common in the toolbox use case).

Lastly we need to fix the containers.conf man page to document these changes. Unfortunately I don't think we have the infra to properly build man pages based on build tags used. Thus the option is either to document both options or only document netavark only and then require distros that compile with cni support to patch the man pages. I would prefer the first option.

libnetwork/types/const_cni.go Outdated Show resolved Hide resolved
Comment on lines -106 to -107
if val == string(types.CNI) {
return types.CNI, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break backwards compatibility pretty hard. I haven't made up my mind yet what exactly should happen but this alone is broken because it will break even if CNI build tag is specified you will reject this old cni configuration which should not happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you will never end up in there because it now returns unknown network backend value "cni" ... driectly below here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right, my bad!

libnetwork/network/non_cni_interface.go Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
@dcermak dcermak force-pushed the cni-build-tag branch 3 times, most recently from e9dfdf7 to 3c3e841 Compare December 11, 2023 15:28
@rhatdan
Copy link
Member

rhatdan commented Dec 13, 2023

@Luap99 PTAL

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply, added some comments but I would like to have another look when I am back from the holiday in January.

Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
libnetwork/network/non_cni_interface.go Outdated Show resolved Hide resolved
libnetwork/network/interface.go Outdated Show resolved Hide resolved
libnetwork/network/interface.go Outdated Show resolved Hide resolved
libnetwork/network/interface.go Outdated Show resolved Hide resolved
Copy link
Member

@ashley-cui ashley-cui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @dcermak!


aardvarkBin, _ := conf.FindHelperBinary(aardvarkBinary, false)
func defaultNetworkBackend(store storage.Store, conf *config.Config) (backend types.NetworkBackend, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this function would be a good candidate to have different build tagged versions - it would alleviate the need for the CNI supported const. The CNI unsupported version would just return netavark and do the update check/warn, while the supported version would go through the logic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either this, or even one level above, the NetworkBackend() function, could be the diverging point for the buildtag. I think either of these might work better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is a good idea, although I am not sure the additional duplication of code is worth it.
Either way is fine for me so I wouldn't block on this.

libnetwork/network/netavark_interface.go Outdated Show resolved Hide resolved
libnetwork/network/cni_interface.go Outdated Show resolved Hide resolved
@dcermak dcermak force-pushed the cni-build-tag branch 3 times, most recently from b0a482e to 57b951f Compare December 19, 2023 16:11
@rhatdan
Copy link
Member

rhatdan commented Dec 20, 2023

@ashley-cui PTANL

@ashley-cui
Copy link
Member

Open comment still applies, but I'm open to discussion on it. Would like @Luap99 to take another look after the holidays though.

@rhatdan
Copy link
Member

rhatdan commented Jan 3, 2024

@Luap99 waiting on you.

@Luap99
Copy link
Member

Luap99 commented Jan 3, 2024

I try to take a look tomorrow, regardless this needs buildah/podman test PR first to ensure CI is passing there before merging here. I guess this would mostly be disabling all the CNI tests there. Or at least temporarily add the CNI build tag there to ensure it continues working and we can rip it out later (I don't want to force you to deal with all of this work).

@ashley-cui
Copy link
Member

I can take a look at the podman/buildah portion

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI you have to rebase as you have a conflict due #1780. Should be easy to solve as you only need to drop the old go build tags.

Makefile Show resolved Hide resolved
libnetwork/types/const.go Outdated Show resolved Hide resolved
@Luap99
Copy link
Member

Luap99 commented Jan 5, 2024

@ashley-cui Can you open the podman and buildah test PRs. I don't think there should be any problems with buildah and we can just build test without the cni tag there but in podman will need some test changes as well. The open question is do we want to continue to test CNI in podman CI?

@rhatdan
Copy link
Member

rhatdan commented Jan 17, 2024

@dcermak @Luap99 @ashley-cui Where do we stand on this one?

@ashley-cui
Copy link
Member

Getting back onto the podman side of things after last week - will get back to this tomorrow.

}
return "", fmt.Errorf("unknown network backend value %q in %q", val, file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line is tripping our podman tests.

This test is running netavark with the netavark only build.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To help in debugging: the failure is in f39 rootless and only in two of the e2e tests: add ssh socket path and kube play using a user namespace, and both failures look like this:

→ Enter [It] (test name)
Error: Error: failed to get default network backend: unknown network backend value "" in "[share]/containers/storage/defaultNetworkBackend"
...
No future change is possible.  Bailing out early after 0.741s.

I have no idea what "No future change is possible" means, it seems to be coming from ginkgo itself. All other e2e tests pass (1985 of them). I have no idea what is special about those two tests. I'm just posting direct links in hopes that someone else can understand the failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edsantiago The test run that @ashley-cui linked includes a lot more test failures. Could it be that there is some leftover polluted cache maybe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that it says unknown network backend value "" seems to suggest there is a conditioning in which it writes (or reads) an empty file which seems very bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcermak @edsantiago FWIW I changed the test pr to always build CNI, and it looks like all the rootless runs are failing. I think this file should only be created by this network backend code? I can try to help dig a little failure but @Luap99 's diagnosis seems correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashley-cui could you please rebase the test PRs? I've fixed the issue that @Luap99 found

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased!

Comment on lines 130 to 131
// cache the network backend to make sure always the same one will be used
defer writeBackendToFile(backend)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is that this backend arg is evaluated here right away and not in the defer context as last step so backend ends up being empty all the time.

Given you restructured the function anyway you can just move this call directly before the return and not call it with defer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a silly mistake, thank you for spotting it!

I have removed the defer and adjusted the code. Now with the defer gone, we could actually properly treat the error if the defaultNetworkBackend file cannot be written. Although I am honestly unsure whether we should not just continue the current approach and just log it. Worst case: the autodetection will run again next time.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logging is fine

@ashley-cui
Copy link
Member

ashley-cui commented Jan 25, 2024

Looks like the Podman CI stuff passed - the other failures are unrelated to this PR, but need some fixing. We'll have to wait for those to be fixed to merge this. Thanks for your patience with this PR!

LGTM

@Luap99
Copy link
Member

Luap99 commented Jan 25, 2024

linter is complaining

@ashley-cui
Copy link
Member

@dcermak Thanks so much for your patience! Could you rebase once more, and I'll kick off the podman tests one more time as we just fixed out podman CI. Then, this should be good to go!

- add cni build tag to libnetwork/cni
- split libnetwork/network into multiple files so that cni support can be made
  optionally available
- add -cni build targets to Makefile and build for amd64 with and without cni
- add a simple upgrade mechanism if the user never set the network backend explicitly
- add cni build tag to .golangci.yml to prevent false positives

See also https://issues.redhat.com/browse/RUN-1943

Signed-off-by: Dan Čermák <dcermak@suse.com>
@dcermak
Copy link
Contributor Author

dcermak commented Jan 30, 2024

@dcermak Thanks so much for your patience! Could you rebase once more, and I'll kick off the podman tests one more time as we just fixed out podman CI. Then, this should be good to go!

Done

@ashley-cui
Copy link
Member

ashley-cui commented Jan 31, 2024

containers/podman#21179 is green.

LGTM from my side, and ready to merge.
@containers/podman-maintainers PTAL

@rhatdan
Copy link
Member

rhatdan commented Jan 31, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 31, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 2f09d12 into containers:main Jan 31, 2024
7 checks passed
@dcermak dcermak deleted the cni-build-tag branch February 6, 2024 06:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants