Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add time wrapper to test agent delays in CI #27253

Merged
merged 8 commits into from
Oct 30, 2023
Merged

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented Aug 3, 2023

Cilium's internal business logic relies on a highly parallel combination of reactive handlers for incoming information, "triggers" that ratelimit requests for processing to ensure Cilium does not over-consume resources, and "controllers" that periodically perform updates or resiliency checks of configured state. While in general most things are "eventually consistent", the presence of time-based triggers and controllers can introduce challenges when evaluating how Cilium will perform once the "eventual consistency" is resolved.

Taking an example of a situation where this eventual consistency caused issues that were not identified during pre-release testing, consider the issue fixed by #27327 . Cilium v1.14.0 was released with a bug where Cilium appeared to work correctly for the first several minutes following startup, then afterwards introduced connectivity disruption to specific peers. This was traced down to a logic delay tied to a 10 minute timer following startup which caused some state to be recomputed & configured in the system, ultimately causing the packet loss.

While better unit testing for the individual package could have caught this issue earlier, the issue was also triggered by the integration between the specific package and other logic in other packages. It is quite difficult to systematically identify time-based errors across the entire agent by relying purely on such testing in each package. This PR attempts to provide a more systematic safety net for timer-based issues by providing a hidden flag to override all timers within Cilium and ensuring that Cilium's CI runs with these timers set to a minimal value.

One of the interesting challenges with this PR is that it can be tempting for developers to rely on time-based tricks in order to ensure the execution order of specific pieces of logic. However, when the system is highly loaded, such mechanisms can become unreliable as an ordering enforcement mechanism. As a side-objective, this PR also hopes to make such tricks less viable on order to incentivize implementing better ordering mechanisms.

Review tips: The treewide commit has about 2/3 of the changes but is generated almost entirely from a script, so can be overlooked for initial review:

    treewide: Wrap time package with internal wrapper
...
178 files changed, 186 insertions(+), 191 deletions(-)

Related: #28844

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 3, 2023
@joestringer joestringer added the release-note/ci This PR makes changes to the CI. label Aug 3, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 3, 2023
@joestringer
Copy link
Member Author

/ci-ginkgo

@joestringer
Copy link
Member Author

joestringer commented Aug 3, 2023

🎣 f04-agent-policy-multi-node-1 seems to be failing consistently across k8s platforms:

• Failure in Spec Teardown (AfterEach) [130.273 seconds]
K8sAgentPolicyTest
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
  Multi-node policy test
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
    validates fromEntities policies
    /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
      with remote-node identity enabled [AfterEach]
      /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
        Validates fromEntities remote-node policy
        /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515

        Found 1 k8s-app=cilium logs matching list of errors that must be investigated:
        2023-08-03T21:42:49.751308617Z level=error msg="Unable to find identity of previously used CIDR 172.18.0.2/32" subsys=ipcache

EDIT: The referred IP address is one of the k8s nodes. The fact that releaseCIDRIdentities is being called with that prefix is a bit of a weird observation, since it should be considered as a "remote node" and hence should not have a CIDR identity in the current implementation.

In this scenario where all timers are set to 5s, for some reason the daemon allocates an identity for remote nodes on startup and seems to even insert it into the ipcache before it learns about the remote node IPs. This can cause temporary traffic disruption during startup, and should not happen. Furthermore it appears that the identity is released and then later on the ipcache gets confused that it was released.

2023-08-03T21:42:38.093177787Z level=debug msg="Resolving identity" identityLabels="cidr:172.18.0.2/32,reserved:world" subsys=identity-cache
2023-08-03T21:42:38.094183031Z level=debug msg="Upserting IP into ipcache layer" identity="{16777217 generated [] false false}" ipAddr=172.18.0.2/32 key=0 subsys=ipcache
2023-08-03T21:42:38.094192168Z level=debug msg="Daemon notified of IP-Identity cache state change" identity="{16777217 generated [] false false}" ipAddr="{172.18.0.2 ffffffff}" modification=Upsert subsys=datapath-ipcache
...
2023-08-03T21:42:38.098477940Z level=debug msg="UpdateIdentities: Adding a new identity" identity=16777217 labels="[cidr:0.0.0.0/0 cidr:128.0.0.0/1 cidr:128.0.0.0/2 cidr:160.0.0.0/3 cidr:160.0.0.0/4 cidr:168.0.0.0/5 cidr:172.0.0.0/10 cidr:172.0.0.0/11 cidr:172.0.0.0/6 cidr:172.0.0.0/7 cidr:172.0.0.0/8 cidr:172.0.0.0/9 cidr:172.16.0.0/12 cidr:172.16.0.0/13 cidr:172.16.0.0/14 cidr:172.18.0.0/15 cidr:172.18.0.0/16 cidr:172.18.0.0/17 cidr:172.18.0.0/18 cidr:172.18.0.0/19 cidr:172.18.0.0/20 cidr:172.18.0.0/21 cidr:172.18.0.0/22 cidr:172.18.0.0/23 cidr:172.18.0.0/24 cidr:172.18.0.0/25 cidr:172.18.0.0/26 cidr:172.18.0.0/27 cidr:172.18.0.0/28 cidr:172.18.0.0/29 cidr:172.18.0.0/30 cidr:172.18.0.2/31 cidr:172.18.0.2/32 reserved:world]" subsys=policy
...
2023-08-03T21:42:39.133189307Z level=debug msg="Kubernetes service definition changed" action=service-updated endpoints="172.18.0.2:6443/TCP" k8sNamespace=default k8sSvcName=kubernetes old-endpoints="172.18.0.2:6443/TCP" old-service=nil service="frontends:[10.96.0.1]/ports=[https]/selector=map[]" subsys=k8s-watcher
...
2023-08-03T21:42:40.004022127Z level=debug msg="Updating tunnel map entry" endpoint=172.18.0.2 key=0 prefix=10.0.0.0 subsys=map-tunnel
2023-08-03T21:42:40.210928109Z level=debug msg="Upserting IP into ipcache layer" identity="{kube-apiserver kube-apiserver [] false true}" ipAddr=172.18.0.2/32 key=0 subsys=ipcache
2023-08-03T21:42:40.210936205Z level=debug msg="Daemon notified of IP-Identity cache state change" identity="{kube-apiserver kube-apiserver [] false false}" ipAddr="{172.18.0.2 ffffffff}" modification=Upsert subsys=datapath-ipcache
2023-08-03T21:42:40.211844718Z level=debug msg="UpdateIdentities: Deleting identity" identity=16777217 labels="[cidr:0.0.0.0/0 cidr:128.0.0.0/1 cidr:128.0.0.0/2 cidr:160.0.0.0/3 cidr:160.0.0.0/4 cidr:168.0.0.0/5 cidr:172.0.0.0/10 cidr:172.0.0.0/11 cidr:172.0.0.0/6 cidr:172.0.0.0/7 cidr:172.0.0.0/8 cidr:172.0.0.0/9 cidr:172.16.0.0/12 cidr:172.16.0.0/13 cidr:172.16.0.0/14 cidr:172.18.0.0/15 cidr:172.18.0.0/16 cidr:172.18.0.0/17 cidr:172.18.0.0/18 cidr:172.18.0.0/19 cidr:172.18.0.0/20 cidr:172.18.0.0/21 cidr:172.18.0.0/22 cidr:172.18.0.0/23 cidr:172.18.0.0/24 cidr:172.18.0.0/25 cidr:172.18.0.0/26 cidr:172.18.0.0/27 cidr:172.18.0.0/28 cidr:172.18.0.0/29 cidr:172.18.0.0/30 cidr:172.18.0.2/31 cidr:172.18.0.2/32 reserved:world]" subsys=policy
...
2023-08-03T21:42:43.843942352Z level=debug msg="stale identity observed" identity=16777217 ipAddr=172.18.0.2 oldIdentity=7 subsys=hubble
2023-08-03T21:42:49.751308617Z level=error msg="Unable to find identity of previously used CIDR 172.18.0.2/32" subsys=ipcache

Full agent log:
logs-cilium-qc62c-cilium-agent-20230803-214347.log

EDIT2: 💡 OK so the previous test was with --enable-remote-node-identity=false, which explains why the remote node's identity previously corresponded to a CIDR identity in the ipcache map and why it was restored that way. So the transition there from a CIDR identity to using the remote-node identity likely makes sense. However that doesn't fully explain the reference counting dynamic that introduces the error log. Cilium should have been able to detect that it had allocated the identity newly once, then it should be able to release the final reference once. There's currently a double-free. That sounds a lot like the error reported against v1.14.0.

EDIT3: This was useful to catch the regression, and the bug is now fixed. I think it's still worth pursuing this testing strategy as a backup to try to catch this class of error in CI in future, ideally before we release such bugs.

@github-actions
Copy link

github-actions bot commented Sep 7, 2023

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Sep 7, 2023
@github-actions
Copy link

This pull request has not seen any activity since it was marked stale.
Closing.

@joestringer
Copy link
Member Author

/test

@github-actions github-actions bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Oct 10, 2023
@joestringer
Copy link
Member Author

/test

@joestringer joestringer force-pushed the pr/joe/messing-with-time branch 2 times, most recently from d2b9a24 to a68feb1 Compare October 11, 2023 14:48
@joestringer
Copy link
Member Author

/test

@joestringer
Copy link
Member Author

/test

pkg/time/time.go Outdated Show resolved Hide resolved
Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joestringer Nice work! Just wish there was a better way to handle this ;(

contrib/scripts/check-time.sh Show resolved Hide resolved
I noticed a couple of instances where the Cilium imports were in the
wrong order. Fix them up.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Implement wrapper functions for common sleep / timer functions in order
to place a maximum on the durations for these functions. Implemented via
a local variable in the new pkg/time package. This local variable is not
yet effective until future commits initialize it during agent startup.

Signed-off-by: Joe Stringer <joe@cilium.io>
This new flag will start to enforce maximum internal time durations and
timers for testing purposes. This relies on other commits that convert
internal packages over from upstream "time" to "pkg/time".

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Generated with 'contrib/scripts/check-time.sh update'.

Earlier versions used:

    $ git grep -l "time\"" -- \
          ':!install/' ':!api/' ':!examples/' ':!cilium*' ':!clustermesh*' \
          ':!operator/' ':!bugtool' ':!tools' ':!test' ':!vendor' ':!plugins' \
          ':!hive' ':!*_test.go' ':!pkg/testutils' ':!pkg/time' ':!pkg/lock/' \
          ':!pkg/loadinfo' ':!pkg/health' ':!Documentation/' \
          ':!pkg/monitor' ':!pkg/k8s/client' ':!pkg/k8s/slim' \
      > files.txt
    $ cat files.txt \
      | xargs sed -i '/"time"/d; /"github.com\/cilium\/cilium\/.*"/a\
"github.com/cilium/cilium/pkg/time"'
    $ cat files.txt \
      | xargs dirname \
      | sort -u \
      | xargs go run golang.org/x/tools/cmd/goimports -w
    $ cat files.txt \
      | xargs git add

Signed-off-by: Joe Stringer <joe@cilium.io>
Enforce that time usage must use pkg/time going forwards so that CI can
detect eventual consistency issues related to timers.

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer
Copy link
Member Author

/test

@aditighag aditighag merged commit 6fef314 into main Oct 30, 2023
213 of 214 checks passed
@aditighag aditighag deleted the pr/joe/messing-with-time branch October 30, 2023 21:55
gandro added a commit to gandro/cilium that referenced this pull request Nov 29, 2023
This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a Go routine calling
   `time.Sleep` to wait for expiration to occur. The default expiration
   timeout is 10 minutes. This meant, that for every IP allocated via
   CNI ADD, we had a Go routine unconditionally sleeping for 10 minutes,
   only to (in most cases) wake up and learn that the expiration timer was stopped.
   This commit improves that situation by having the expiration Go
   routine wake up and exit early if it was stopped (either via IP Release
   or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5
   seconds (see cilium#27253). This meant that the `time.Sleep`
   expiration timer would effectively be 5 seconds instead of 10 minutes. 5
   seconds however is not enough for an endpoint to be created via CNI ADD
   and complete its first endpoint regeneration. This therefore led to
   endpoint IPs being released while the endpoint was still being created.
   Due to another bug (fixed in the next commit) the expiration timer
   failed to actually  release the IP, which is why this bug was not
   discovered earlier when we introduced the 5 second limit.  This commit
   addresses this issue by adding an escape hatch to `pkg/time`, allowing
   the creation of a timer which is not subject to the
  `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Dec 4, 2023
This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a Go routine calling
   `time.Sleep` to wait for expiration to occur. The default expiration
   timeout is 10 minutes. This meant, that for every IP allocated via
   CNI ADD, we had a Go routine unconditionally sleeping for 10 minutes,
   only to (in most cases) wake up and learn that the expiration timer was stopped.
   This commit improves that situation by having the expiration Go
   routine wake up and exit early if it was stopped (either via IP Release
   or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5
   seconds (see #27253). This meant that the `time.Sleep`
   expiration timer would effectively be 5 seconds instead of 10 minutes. 5
   seconds however is not enough for an endpoint to be created via CNI ADD
   and complete its first endpoint regeneration. This therefore led to
   endpoint IPs being released while the endpoint was still being created.
   Due to another bug (fixed in the next commit) the expiration timer
   failed to actually  release the IP, which is why this bug was not
   discovered earlier when we introduced the 5 second limit.  This commit
   addresses this issue by adding an escape hatch to `pkg/time`, allowing
   the creation of a timer which is not subject to the
  `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
nbusseneau pushed a commit that referenced this pull request Dec 5, 2023
[ upstream commit be9e853 ]

[ backporter's notes: conflicts due to `pkg/time/time.go` not existing
  in v1.14, skipped changes to this file. ]

This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a Go routine calling
   `time.Sleep` to wait for expiration to occur. The default expiration
   timeout is 10 minutes. This meant, that for every IP allocated via
   CNI ADD, we had a Go routine unconditionally sleeping for 10 minutes,
   only to (in most cases) wake up and learn that the expiration timer was stopped.
   This commit improves that situation by having the expiration Go
   routine wake up and exit early if it was stopped (either via IP Release
   or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5
   seconds (see #27253). This meant that the `time.Sleep`
   expiration timer would effectively be 5 seconds instead of 10 minutes. 5
   seconds however is not enough for an endpoint to be created via CNI ADD
   and complete its first endpoint regeneration. This therefore led to
   endpoint IPs being released while the endpoint was still being created.
   Due to another bug (fixed in the next commit) the expiration timer
   failed to actually  release the IP, which is why this bug was not
   discovered earlier when we introduced the 5 second limit.  This commit
   addresses this issue by adding an escape hatch to `pkg/time`, allowing
   the creation of a timer which is not subject to the
  `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
gandro added a commit to gandro/cilium that referenced this pull request Dec 6, 2023
[ upstream commit be9e853 ]

This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a Go routine calling
   `time.Sleep` to wait for expiration to occur. The default expiration
   timeout is 10 minutes. This meant, that for every IP allocated via
   CNI ADD, we had a Go routine unconditionally sleeping for 10 minutes,
   only to (in most cases) wake up and learn that the expiration timer was stopped.
   This commit improves that situation by having the expiration Go
   routine wake up and exit early if it was stopped (either via IP Release
   or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5
   seconds (see cilium#27253). This meant that the `time.Sleep`
   expiration timer would effectively be 5 seconds instead of 10 minutes. 5
   seconds however is not enough for an endpoint to be created via CNI ADD
   and complete its first endpoint regeneration. This therefore led to
   endpoint IPs being released while the endpoint was still being created.
   Due to another bug (fixed in the next commit) the expiration timer
   failed to actually  release the IP, which is why this bug was not
   discovered earlier when we introduced the 5 second limit.  This commit
   addresses this issue by adding an escape hatch to `pkg/time`, allowing
   the creation of a timer which is not subject to the
  `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro added a commit that referenced this pull request Dec 6, 2023
[ upstream commit be9e853 ]

This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a Go routine calling
   `time.Sleep` to wait for expiration to occur. The default expiration
   timeout is 10 minutes. This meant, that for every IP allocated via
   CNI ADD, we had a Go routine unconditionally sleeping for 10 minutes,
   only to (in most cases) wake up and learn that the expiration timer was stopped.
   This commit improves that situation by having the expiration Go
   routine wake up and exit early if it was stopped (either via IP Release
   or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5
   seconds (see #27253). This meant that the `time.Sleep`
   expiration timer would effectively be 5 seconds instead of 10 minutes. 5
   seconds however is not enough for an endpoint to be created via CNI ADD
   and complete its first endpoint regeneration. This therefore led to
   endpoint IPs being released while the endpoint was still being created.
   Due to another bug (fixed in the next commit) the expiration timer
   failed to actually  release the IP, which is why this bug was not
   discovered earlier when we introduced the 5 second limit.  This commit
   addresses this issue by adding an escape hatch to `pkg/time`, allowing
   the creation of a timer which is not subject to the
  `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
pjablonski123 pushed a commit to pjablonski123/cilium that referenced this pull request Dec 15, 2023
This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a Go routine calling
   `time.Sleep` to wait for expiration to occur. The default expiration
   timeout is 10 minutes. This meant, that for every IP allocated via
   CNI ADD, we had a Go routine unconditionally sleeping for 10 minutes,
   only to (in most cases) wake up and learn that the expiration timer was stopped.
   This commit improves that situation by having the expiration Go
   routine wake up and exit early if it was stopped (either via IP Release
   or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5
   seconds (see cilium#27253). This meant that the `time.Sleep`
   expiration timer would effectively be 5 seconds instead of 10 minutes. 5
   seconds however is not enough for an endpoint to be created via CNI ADD
   and complete its first endpoint regeneration. This therefore led to
   endpoint IPs being released while the endpoint was still being created.
   Due to another bug (fixed in the next commit) the expiration timer
   failed to actually  release the IP, which is why this bug was not
   discovered earlier when we introduced the 5 second limit.  This commit
   addresses this issue by adding an escape hatch to `pkg/time`, allowing
   the creation of a timer which is not subject to the
  `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet