Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.11 Backports 2023-04-20 #25011

Merged
merged 7 commits into from Apr 26, 2023
Merged

Conversation

nbusseneau
Copy link
Member

@nbusseneau nbusseneau commented Apr 20, 2023

Once this PR is merged, you can update the PR labels via:

for pr in 20031 24786 24850 24785; do contrib/backporting/set-labels.py $pr done 1.11; done

or with

make add-labels BRANCH=v1.11 ISSUES=20031,24786,24850,24785

joestringer and others added 7 commits April 20, 2023 16:35
[ upstream commit 82d5adc ]

When the quiet mode was enabled, the $(CLANG) var would previously have
a '@' at the start, which caused errors while attempting to make in this
directory because it would be run in the context of a shell rather than
directly as a make instruction. Move the $(QUIET) to the start of
individual make instructions to resolve this compilation failure.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 29fe753 ]

When TESTPKGS only contains unprivileged tests, the PRIV_TEST_PKGS_EVAL
evaluation previously filtered down to an empty list of packages that
should be tested, and would pass this empty list to dirname, which then
reports:

    dirname: missing operand
    Try 'dirname --help' for more information.

This could happen multiple times during evaluation of the Makefile, and
littered the output with no meaning. This could occur even if the
privileged tests are not the target being run.

Fix this by always adding "." to the list, which evaluates to the root
directory of the repository. This causes dirname to succeed. Then, we
can filter this root directory back out since there are no privileged
tests at this level of the repository. This finally quietens the error.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 9f7e24f ]

Pass the verbosity parameters --quiet V=0 to quieten Travis output.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 7f9e0f9 ]

The travis logs are frequently polluted with >10K lines of docker pull
and build output. While this helps to track the ongoing progress of
docker builds that take a long time, it's mostly useless output that
developers must scroll past in order to see the useful output. Quieten
that output in Travis to just the trigger of building the image plus the
final summary that docker outputs.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 9bb669b ]

Etcd quorum checks are falsely reported as failing even though connection
to etcd is intact. This can cause health checks to fail in both the agent
and the operator.

This happens due to a deadlock in pkg/kvstore/etcd after a prolonged
downtime of etcd. Status check errors are being sent into a channel for the
purpose of recreating kvstore connections in clustermesh. However when
clustermesh is not used, messages from this channel are never read. The
channel uses a buffer of size 128. After etcd has been down long enough to
generate 128 errors, we enter a deadlock state. Agent / operator will
continue to report etcd quorum failures and inturn health check failures
until they're restarted.

statusChecker()
	-> isConnectedAndHasQuorum()
		-> waitForInitLock()
			-> goroutine -> for -> ( initLockSucceeded <- err )
                        -> chan initLockSucceeded returned
		-> Block on receiving messages from initLockSucceeded channel
	-> e.statusCheckErrors <- e.latestErrorStatus [Blocked after 128 entries]

Blocked goroutines captured from cilium 1.10 operator:

goroutine 3309 [chan send, 13456 minutes]:
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).statusChecker(0xc00017db30)
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:1171 +0x75a
created by github.com/cilium/cilium/pkg/kvstore.connectEtcdClient
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:801 +0x679

goroutine 7838665 [chan send, 13505 minutes]:
g.com/c/cilium/pkg/kvstore.(*etcdClient).waitForInitLock.func1(-,-,-,-)
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:433 +0x449
created by github.com/cilium/cilium/pkg/kvstore.(*etcdClient).waitForInitLock
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:425 +0x7f

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit a6d0142 ]

This commit marks the CiliumEndpointSlice feature as beta (as per the
documentation) in the agent flag description. This is necessary because
users don't always read the full documentation before turning agent
flags on.

While at it, change the flag description to match the wording of other
flags.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 7398de6 ]

Currently, the wireguard subsystem in the cilium agent caches
information about the known peers by node name only. This can lead to
conflicts in case of clustermesh, if nodes in different clusters have
the same name, causing in turn connectivity issues. Hence, let's switch
to identify peers by full name (i.e., cluster-name/node-name) to ensure
uniqueness. This modification does not introduce issues during upgrades,
since the node ID is not propagated to the datapath.

Fixes: cilium#24227
Reported-by: @oulinbao <oulinbao@163.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
@nbusseneau nbusseneau requested a review from a team as a code owner April 20, 2023 16:50
@nbusseneau nbusseneau added kind/backports This PR provides functionality previously merged into master. backport/1.11 This PR represents a backport for Cilium 1.11.x of a PR that was merged to main. labels Apr 20, 2023
@nbusseneau
Copy link
Member Author

nbusseneau commented Apr 20, 2023

/test-backport-1.11

Job 'Cilium-PR-K8s-1.19-kernel-4.9' failed:

Click to show.

Test Name

K8sHubbleTest Hubble Observe Test L3/L4 Flow

Failure Output

FAIL: Found 2 k8s-app=cilium logs matching list of errors that must be investigated:

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-4.9/2822/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.19-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.17-kernel-4.9' failed:

Click to show.

Test Name

K8sKafkaPolicyTest Kafka Policy Tests KafkaPolicies

Failure Output

FAIL: Found 2 k8s-app=cilium logs matching list of errors that must be investigated:

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/1315/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.17-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed:

Click to show.

Test Name

K8sPolicyTest Basic Test checks all kind of Kubernetes policies

Failure Output

FAIL: Found 2 k8s-app=cilium logs matching list of errors that must be investigated:

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/2589/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.20-kernel-4.9' failed:

Click to show.

Test Name

K8sPolicyTest Basic Test checks all kind of Kubernetes policies

Failure Output

FAIL: Found 2 k8s-app=cilium logs matching list of errors that must be investigated:

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.20-kernel-4.9/1819/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.19-kernel-4.9' hit: #25099 (93.06% similarity)

Job 'Cilium-PR-K8s-1.17-kernel-4.9' hit: #25100 (94.45% similarity)

Job 'Cilium-PR-K8s-1.21-kernel-4.9' has 1 failure but they might be new flake since it also hit 1 known flake: #25101 (93.46% similarity)

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR looks good. Thanks!

Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mine too. Thanks!

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for #20031

@hemanthmalla
Copy link
Member

LGTM for my commit. Thanks.

@nbusseneau
Copy link
Member Author

  • Jenkins pipelines failed with CI: [v1.11] Jenkins tests failing on 4.9, agent trying to pass BPF ISA set v3 to init.sh #24394, documenting the failures in the issue.
  • AKS pipeline failed because K8s 1.23 is not supported on AKS anymore => we are entering one of the scenarios discussed when initially introducing K8s version pinning on stable branches. Since Cilium 1.11 only officially supports up to K8s 1.23, we now have a choice to make:
    • Disable AKS testing on 1.11.
    • Bump AKS testing to use K8s 1.24 even though it's not officially supported.

I would personally advocate for the first solution.
cc @joestringer @aanm

@joestringer
Copy link
Member

👍 I think it's fair to say that Cilium branch CI should test the intersection of {k8s releases that were out at the time of cilium release} and {k8s releases supported by the cloud now}. The logical consequence is that if those groups become disjoint then we no longer need to run such tests, as you propose.

We could of course opt to add additional support, but I think it's in everyone's best interest to encourage newer software in cases like this rather than provide more reasons for folks to stick on the older releases.

@nbusseneau
Copy link
Member Author

Cool, I will disable AKS testing on 1.11 then!

@nbusseneau
Copy link
Member Author

/test-1.16-netnext

@sayboras
Copy link
Member

sayboras commented Apr 25, 2023

/mlh new-flake Cilium-PR-K8s-1.19-kernel-4.9

👍 created #25098

@sayboras
Copy link
Member

sayboras commented Apr 25, 2023

/mlh new-flake Cilium-PR-K8s-1.17-kernel-4.9

👍 created #25099

@sayboras
Copy link
Member

sayboras commented Apr 25, 2023

/mlh new-flake Cilium-PR-K8s-1.21-kernel-4.9

👍 created #25100

@sayboras
Copy link
Member

sayboras commented Apr 25, 2023

/mlh new-flake Cilium-PR-K8s-1.20-kernel-4.9

👍 created #25101

@sayboras
Copy link
Member

/test-1.17-4.9

@sayboras
Copy link
Member

/test-1.19-4.9

@sayboras
Copy link
Member

/test-1.20-4.9

@sayboras
Copy link
Member

/test-1.21-4.9

@sayboras
Copy link
Member

sayboras commented Apr 25, 2023

/test-1.21-4.9

Re-run this test suite to confirm if there is any issue or just flake.

@sayboras
Copy link
Member

/test-1.21-4.9

Re-run this test suite to confirm if there is any issue or just flake.

Re-run was good ✔️

Required reviews are in, flakes are tracked as part of the below issues. Marking this ready to merge.

#25098
#25099
#25100
#25101

@sayboras sayboras added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 25, 2023
@ti-mo ti-mo merged commit bac3efe into cilium:v1.11 Apr 26, 2023
50 of 53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.11 This PR represents a backport for Cilium 1.11.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants