
v1.11 Backports 2023-05-03 #25250

Closed
Wants to merge 11 commits.

Conversation

[ upstream commit 32473b9 ]

Log the number of backends that are identified as leaked, and
hence skipped and deleted during restore. This is useful
for debugging when debug logs are not enabled.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
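For illustration only (not the actual Cilium code), a minimal Go sketch of the idea behind this change: count leaked entries while restoring so the summary shows up at info level even when debug logging is off. All names here are hypothetical.

```go
package main

import "log"

// backend is a simplified stand-in for a restored datapath backend entry.
type backend struct {
	id     int
	leaked bool
}

// restoreBackends counts leaked entries while restoring so that the summary
// is visible at info level even when debug logging is disabled.
func restoreBackends(entries []backend) {
	restored, skipped := 0, 0
	for _, be := range entries {
		if be.leaked {
			skipped++ // leaked entry: skipped here, deleted later
			continue
		}
		restored++
	}
	log.Printf("Restored backends from maps: restored=%d skipped=%d", restored, skipped)
}

func main() {
	restoreBackends([]backend{{id: 1}, {id: 2}, {id: 3, leaked: true}})
}
```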
[ upstream commit 662c96b ]

Log orphan backends that are not associated with any
services and are cleaned up after agent restart.
This can be useful for debugging when debug logs
are not enabled.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
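Likewise, a hedged sketch of logging orphan backends as they are cleaned up after restart; the types below are made up for the example and are not Cilium's real structures.

```go
package main

import "log"

// backendRef is a simplified stand-in for a backend and the number of
// services that still reference it.
type backendRef struct {
	addr string
	refs int
}

// deleteOrphanBackends removes backends with no associated services and logs
// each one at info level, so the cleanup is visible without debug logging.
func deleteOrphanBackends(backends map[string]*backendRef) {
	for hash, be := range backends {
		if be.refs == 0 {
			log.Printf("Deleting orphan backend %s", be.addr)
			delete(backends, hash)
		}
	}
}

func main() {
	deleteOrphanBackends(map[string]*backendRef{
		"h1": {addr: "10.0.0.1:8080", refs: 1},
		"h2": {addr: "10.0.0.2:8080", refs: 0},
	})
}
```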
[ upstream commit c32b001 ]

During backend restore, the agent previously released the
backend ID in case of a conflict, even though the released
ID had already been restored for another backend. This
released the conflicted ID from the ID allocator while
still leaving the backend entry in the backends map, so
the allocator could assign the released ID to another
backend once the restore completed.
This commit removes the release of conflicted backend IDs.
The proper clean-up happens later, when backends are
deleted as duplicates or orphans that are not associated
with any of the services.

Fixes: 51b62f5 ("service: Add methods to acquire backend ID for svc v2")
Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
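A toy Go sketch of the reasoning, with made-up types rather than Cilium's allocator API: releasing a conflicted ID during restore would let the allocator hand out an ID that another restored backend still owns, so the release is simply dropped and cleanup is deferred.

```go
package main

import "fmt"

// idAllocator is a toy allocator used only to show why releasing a conflicted
// ID during restore is unsafe: that ID is already owned by another restored
// backend, so releasing it would let it be handed out twice.
type idAllocator struct{ inUse map[int]bool }

// markRestored reports whether the ID could be claimed for a restored backend.
func (a *idAllocator) markRestored(id int) bool {
	if a.inUse[id] {
		return false // conflict: ID already restored for another backend
	}
	a.inUse[id] = true
	return true
}

func restore(a *idAllocator, ids []int) {
	for _, id := range ids {
		if !a.markRestored(id) {
			// Previously the conflicted ID was released back to the allocator
			// here; now the entry is only skipped, and duplicate/orphan
			// cleanup removes it later.
			fmt.Printf("skipping backend with conflicting ID %d\n", id)
		}
	}
}

func main() {
	restore(&idAllocator{inUse: map[int]bool{}}, []int{1, 2, 2, 3})
}
```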
[ upstream commit ec8d3ce ]

This commit revises the fix made in commit 5311f81 that
handles leaked backends.
Consider the following leaked-backend scenarios, of which (1) and
(2a) were previously handled:
(1) Backend entries leaked, no duplicates: previously, such
    backends would be deleted as orphan backends after the sync
    with the Kubernetes API server.
(2) Backend entries leaked, with duplicates:
    (a) A backend with an overlapping L3nL4Addr hash is associated with service(s).
        Sequence of events:
        Backends were leaked prior to agent restart, but there was at least
        one service that the backend hash is associated with, so
        s.backendByHash has a non-zero reference count for the
        overlapping L3nL4Addr hash.
    (b) None of the backends are associated with services.
        Sequence of events:
        All the services these backends were associated with were deleted
        prior to agent restart, so s.backendByHash has no entry for the
        backends' hash.

This commit updates the logic to handle all the scenarios in one place.

Fixes: b79a4a5 (pkg/service: Gracefully terminate service backends)

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
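To make the scenarios concrete, here is a simplified, hypothetical Go sketch of handling duplicates in a single pass during restore; the real code operates on Cilium's backend and s.backendByHash structures, which are only mimicked here.

```go
package main

import "fmt"

// entry is a simplified restored backend: its datapath ID and its L3/L4
// address hash (the real code uses Cilium's backend and L3n4Addr types).
type entry struct {
	id   int
	hash string
}

// splitLeaked handles the duplicate scenarios in one pass: the first entry per
// hash is kept, and any further entry with the same hash is treated as leaked,
// whether or not a service still references that hash (scenarios 2a and 2b).
// Leaked entries with unique hashes and no service reference are removed later
// as orphans, after the sync with the Kubernetes API server (scenario 1).
func splitLeaked(entries []entry) (kept, leaked []entry) {
	seen := map[string]bool{}
	for _, e := range entries {
		if seen[e.hash] {
			leaked = append(leaked, e)
			continue
		}
		seen[e.hash] = true
		kept = append(kept, e)
	}
	return kept, leaked
}

func main() {
	kept, leaked := splitLeaked([]entry{
		{1, "10.0.0.1:8080"},
		{2, "10.0.0.2:8080"},
		{3, "10.0.0.1:8080"},
	})
	fmt.Println("kept:", kept, "leaked:", leaked)
}
```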
@joamaki joamaki requested a review from a team as a code owner May 3, 2023 15:30
@joamaki joamaki added the labels kind/backports ("This PR provides functionality previously merged into master.") and backport/1.11 ("This PR represents a backport for Cilium 1.11.x of a PR that was merged to main.") on May 3, 2023
@giorio94 (Member) left a comment:

My commit looks good. Thanks!

@harsimran-pabla (Contributor) commented:

BGP change #25043 is already backported in #25139.

@kaworu (Member) left a comment:

My commits LGTM, thanks @joamaki!

@sayboras (Member) left a comment:

Looks good for my commit, thanks 💯

@aditighag (Member) left a comment:

Looks good for my commits. Thanks!

@tommyp1ckles (Contributor) left a comment:

thanks!

aditighag and others added 7 commits May 5, 2023 10:11
[ upstream commit e1a48bb ]

This commit adds a unit test case to validate the various
leaked-backend scenarios handled during service restore.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
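A sketch of what such a unit test could look like, written against a stand-in helper rather than the real test suite, purely to illustrate the assertion style; the function and package names below are hypothetical.

```go
package service_test

import "testing"

// dedupByHash mirrors the earlier sketch: the first entry per address hash is
// restored, any duplicate is counted as leaked.
func dedupByHash(hashes []string) (restored, leaked int) {
	seen := map[string]bool{}
	for _, h := range hashes {
		if seen[h] {
			leaked++
			continue
		}
		seen[h] = true
		restored++
	}
	return restored, leaked
}

// TestRestoreLeakedBackends is a hypothetical test in the spirit of the
// backported unit test: duplicate backend entries must be skipped as leaked.
func TestRestoreLeakedBackends(t *testing.T) {
	restored, leaked := dedupByHash([]string{
		"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.1:8080",
	})
	if restored != 2 || leaked != 1 {
		t.Fatalf("expected 2 restored and 1 leaked backend, got %d and %d", restored, leaked)
	}
}
```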
[ upstream commit b0ada08 ]

This commit documents the known service backend leak issue,
along with the Cilium versions that include the fixes.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 9e83a6f ]

Cilium is currently affected by a known bug (cilium#24692) when NodePorts are
handled by the KPR implementation, which occurs when the same NodePort
is used both in the local and the remote cluster. This causes all
traffic targeting that NodePort to be redirected to a local backend,
regardless of whether the destination node belongs to the local or the
remote cluster. This also affects the clustermesh-apiserver NodePort
service, which is configured by default with a fixed port. Hence, let's
add a warning message to the corresponding values file setting.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 11e1bcc ]

This adds a short docs page with the version matrix between Cilium and
Cilium Envoy versions, which is useful for the upcoming work to move the
Envoy proxy out of the Cilium agent container.

Co-authored-by: ZSC <zacharysarah@users.noreply.github.com>
Signed-off-by: Tam Mach <sayboras@yahoo.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit db3e015 ]

Before this patch, we would hit a controller-gen[1] bug when the
temporary file would be of the form tmp.0oXXXXXX.

This patch uses a custom mktemp template that will not trigger the bug.

[1]: kubernetes-sigs/controller-tools#734

Signed-off-by: Alexandre Perrin <alex@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 9a38aec ]

We've been distributing ARM architecture images for Cilium for almost
two years, but neglected to mention this up front in the system
requirements or the main docs page. Add this to the docs.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit e695e48 ]

Running the test in a cpu constrained environment, such as:

```
docker run -v $(pwd):$(pwd) -w $(pwd) --cpus=0.1 -it golang:bullseye ./inctimer.test -test.v
```

I can fairly consistently reproduce a flake where inctimer.After does not fire in time.
If I allow it to wait for an additional couple of ms, this seems to be sufficient to prevent failure.

It appears that goroutine scheduling can be significantly delayed in CPU-restricted environments.
This seems unavoidable, so to fix the flake I'll allow the test to wait another 2ms to see if the inctimer eventually fires.

This will also log an error for delayed test fires, so if there are any other issues we can more easily debug them in the future.

Fixed: cilium#25202

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
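A rough sketch of the test shape described above, using the standard library's time.After in place of Cilium's inctimer (which is not reproduced here); durations and names are illustrative only.

```go
package timer_test

import (
	"testing"
	"time"
)

// TestTimerFires sketches the shape of the fix: if the timer does not fire
// within the expected window, allow an extra 2ms grace period and only log an
// error when it fires late, since goroutine scheduling can lag badly on
// CPU-constrained runners.
func TestTimerFires(t *testing.T) {
	fire := time.After(1 * time.Millisecond)
	select {
	case <-fire:
		return // fired within the expected window
	case <-time.After(2 * time.Millisecond):
	}
	// Grace period: tolerate a delayed fire but record it for future debugging.
	select {
	case <-fire:
		t.Log("error: timer fired late, likely due to scheduling latency")
	case <-time.After(2 * time.Millisecond):
		t.Fatal("timer never fired, even after the extra grace period")
	}
}
```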
@joamaki joamaki force-pushed the pr/v1.11-backport-2023-05-03 branch from 67e69a2 to 6e40b96 Compare May 5, 2023 07:11
@joamaki (Contributor, Author) commented May 5, 2023:

> Looks good for my commits. Thanks!

Hey, looks like I forgot to fix up pkg/service/service_test.go. After fixing it the same way as in the v1.12 backports I get a test failure:

FAIL: service_test.go:833: ManagerTestSuite.TestRestoreServiceWithTerminatingBackends

service_test.go:864:
    // Backends including terminating ones have been restored
    c.Assert(len(m.svc.backendByHash), Equals, 3)
... obtained int = 2
... expected int = 3

OOPS: 0 passed, 1 FAILED

Enabling debug logging I get:

level=debug msg="Upserting service" backends="[{0  {10.0.0.1 {TCP 8080} 0} false} {0  {10.0.0.2 {TCP 8080} 0} false} {0  {10.0.0.4 {TCP 8080} 0} false}]" loadBalancerSourceRanges="[]" serviceIP="{1.1.1.1 {TCP 80} 0}" serviceName= serviceNamespace= sessionAffinity=true sessionAffinityTimeout=100 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=NodePort
level=debug msg="Resolving service" l3n4Addr="{IP:1.1.1.1 L4Addr:{Protocol:TCP Port:80} Scope:0}" subsys=service
level=debug msg="Acquired service ID" backends="[{0  {10.0.0.1 {TCP 8080} 0} false} {0  {10.0.0.2 {TCP 8080} 0} false} {0  {10.0.0.4 {TCP 8080} 0} false}]" loadBalancerSourceRanges="[]" serviceID=1 serviceIP="{1.1.1.1 {TCP 80} 0}" serviceName= serviceNamespace= sessionAffinity=true sessionAffinityTimeout=100 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=NodePort
level=debug msg="Deleting backends from session affinity match" backends="[]" serviceID=1 subsys=service
level=debug msg="Adding new backend" backendID=1 backends="[{0  {10.0.0.1 {TCP 8080} 0} false} {0  {10.0.0.2 {TCP 8080} 0} false} {0  {10.0.0.4 {TCP 8080} 0} false}]" l3n4Addr="{10.0.0.1 {TCP 8080} 0}" loadBalancerSourceRanges="[]" serviceID=1 serviceIP="{1.1.1.1 {TCP 80} 0}" serviceName= serviceNamespace= sessionAffinity=true sessionAffinityTimeout=100 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=NodePort
level=debug msg="Adding new backend" backendID=2 backends="[{0  {10.0.0.1 {TCP 8080} 0} false} {0  {10.0.0.2 {TCP 8080} 0} false} {0  {10.0.0.4 {TCP 8080} 0} false}]" l3n4Addr="{10.0.0.2 {TCP 8080} 0}" loadBalancerSourceRanges="[]" serviceID=1 serviceIP="{1.1.1.1 {TCP 80} 0}" serviceName= serviceNamespace= sessionAffinity=true sessionAffinityTimeout=100 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=NodePort
level=debug msg="Adding new backend" backendID=3 backends="[{0  {10.0.0.1 {TCP 8080} 0} false} {0  {10.0.0.2 {TCP 8080} 0} false} {0  {10.0.0.4 {TCP 8080} 0} false}]" l3n4Addr="{10.0.0.4 {TCP 8080} 0}" loadBalancerSourceRanges="[]" serviceID=1 serviceIP="{1.1.1.1 {TCP 80} 0}" serviceName= serviceNamespace= sessionAffinity=true sessionAffinityTimeout=100 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=NodePort
level=debug msg="Adding backends to affinity match map" backends="[1 2 3]" serviceID=1 subsys=service
level=debug msg="Upserting service" backends="[{0  {10.0.0.1 {TCP 8080} 0} false} {0  {10.0.0.2 {TCP 8080} 0} false} {0  {10.0.0.4 {TCP 8080} 0} true}]" loadBalancerSourceRanges="[]" serviceIP="{1.1.1.1 {TCP 80} 0}" serviceName= serviceNamespace= sessionAffinity=true sessionAffinityTimeout=100 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=NodePort
level=debug msg="Acquired service ID" backends="[{0  {10.0.0.1 {TCP 8080} 0} false} {0  {10.0.0.2 {TCP 8080} 0} false} {0  {10.0.0.4 {TCP 8080} 0} true}]" loadBalancerSourceRanges="[]" serviceID=1 serviceIP="{1.1.1.1 {TCP 80} 0}" serviceName= serviceNamespace= sessionAffinity=true sessionAffinityTimeout=100 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=NodePort
level=debug msg="Deleting backends from session affinity match" backends="[]" serviceID=1 subsys=service
level=debug msg="Adding backends to affinity match map" backends="[1 2 3]" serviceID=1 subsys=service
level=warning msg="Unable to add entry to affinity match map" backendID=1 error="Backend 1 already exists in 1 affinity map" serviceID=1 subsys=service
level=warning msg="Unable to add entry to affinity match map" backendID=2 error="Backend 2 already exists in 1 affinity map" serviceID=1 subsys=service
level=warning msg="Unable to add entry to affinity match map" backendID=3 error="Backend 3 already exists in 1 affinity map" serviceID=1 subsys=service
level=debug msg="Restoring service" serviceID=1 serviceIP="1.1.1.1:80" subsys=service
level=debug msg="Restoring service" l3n4Addr="{IP:1.1.1.1 L4Addr:{Protocol:NONE Port:80} Scope:0}" subsys=service
level=info msg="Restored services from maps" failedServices=0 restoredServices=1 subsys=service
level=debug msg="Restoring backend" backendID=1 l3n4Addr="10.0.0.1:8080" subsys=service
level=debug msg="Restoring backend" backendID=2 l3n4Addr="10.0.0.2:8080" subsys=service
level=debug msg="Restoring backend" backendID=3 l3n4Addr="10.0.0.4:8080" subsys=service
level=debug msg="Leaked backend entry not restored" backendID=3 l3n4Addr="{10.0.0.4 {NONE 8080} 0}" subsys=service
level=info msg="Restored backends from maps" failedBackends=0 restoredBackends=2 skippedBackends=1 subsys=service

@aditighag Any idea what's going on? Are we missing something important in v1.11?

@joamaki joamaki requested a review from aditighag May 5, 2023 07:14
@aditighag (Member) left a comment:

Sorry, I missed the differences around the service name + namespace changes.

Comment on lines +840 to +847
SessionAffinity: true,
SessionAffinityTimeoutSec: 100,
HealthCheckNodePort: 0,
Name: "",
Namespace: "",
LoadBalancerSourceRanges: []*cidr.CIDR{},
}


Can you revert these changes? It looks like they were accidentally added in the backport commit.

Comment on lines +922 to +923
Name: "svc1",
Namespace: "ns1",

This should be the only difference between 1.11 and newer branches.

@YutaroHayakawa YutaroHayakawa self-assigned this May 10, 2023
@YutaroHayakawa (Member) commented:

Since this branch comes from Jussi's fork, I cannot edit the commits to apply Aditi's change. I'll move this to another PR with a branch owned by me.

@YutaroHayakawa YutaroHayakawa removed their assignment Apr 10, 2024