Add time wrapper to test agent delays in CI #27253
EDIT: The referred IP address is one of the k8s nodes. In this scenario, where all timers are set to 5s, the daemon for some reason allocates an identity for remote nodes on startup and appears to insert it into the ipcache before it learns about the remote node IPs. This can cause temporary traffic disruption during startup and should not happen. Furthermore, it appears that the identity is released, and later on the ipcache gets confused by the fact that it was released.
Full agent log:
EDIT2: 💡 OK so the previous test was with
EDIT3: This was useful to catch the regression, and the bug is now fixed. I think it's still worth pursuing this testing strategy as a backup to try to catch this class of error in CI in future, ideally before we release such bugs.
@joestringer Nice work! Just wish there was a better way to handle this ;(
I noticed a couple of instances where the Cilium imports were in the wrong order. Fix them up. Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Implement wrapper functions for common sleep / timer functions in order to place a maximum on the durations for these functions. Implemented via a local variable in the new pkg/time package. This local variable is not yet effective until future commits initialize it during agent startup. Signed-off-by: Joe Stringer <joe@cilium.io>
This new flag will start to enforce maximum internal time durations and timers for testing purposes. This relies on other commits that convert internal packages over from upstream "time" to "pkg/time". Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Generated with 'contrib/scripts/check-time.sh update'. Earlier versions used:

    $ git grep -l "time\"" -- \
        ':!install/' ':!api/' ':!examples/' ':!cilium*' ':!clustermesh*' \
        ':!operator/' ':!bugtool' ':!tools' ':!test' ':!vendor' ':!plugins' \
        ':!hive' ':!*_test.go' ':!pkg/testutils' ':!pkg/time' ':!pkg/lock/' \
        ':!pkg/loadinfo' ':!pkg/health' ':!Documentation/' \
        ':!pkg/monitor' ':!pkg/k8s/client' ':!pkg/k8s/slim' \
        > files.txt
    $ cat files.txt \
        | xargs sed -i '/"time"/d; /"github.com\/cilium\/cilium\/.*"/a\ "github.com/cilium/cilium/pkg/time"'
    $ cat files.txt \
        | xargs dirname \
        | sort -u \
        | xargs go run golang.org/x/tools/cmd/goimports -w
    $ cat files.txt \
        | xargs git add

Signed-off-by: Joe Stringer <joe@cilium.io>
Enforce that time usage must use pkg/time going forwards so that CI can detect eventual consistency issues related to timers. Signed-off-by: Joe Stringer <joe@cilium.io>
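To illustrate the kind of check this implies, here is a small sketch that detects a direct import of the standard "time" package in a Go source file. The actual enforcement is done by a script whose contents are not shown here; the function name `importsStdTime` and the approach of parsing imports with `go/parser` are assumptions made for this example only.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
)

// importsStdTime reports whether the given Go source imports the standard
// library "time" package directly. Hypothetical helper for illustration.
func importsStdTime(src string) (bool, error) {
	fset := token.NewFileSet()
	// ImportsOnly stops parsing after the import declarations.
	f, err := parser.ParseFile(fset, "input.go", src, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range f.Imports {
		// Import paths are stored as quoted string literals.
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return false, err
		}
		if path == "time" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	std := "package x\n\nimport \"time\"\n\nvar _ = time.Second\n"
	wrapped := "package x\n\nimport \"github.com/cilium/cilium/pkg/time\"\n\nvar _ = time.Second\n"
	for name, src := range map[string]string{"std": std, "wrapped": wrapped} {
		bad, err := importsStdTime(src)
		fmt.Println(name, bad, err)
	}
}
```

Matching on the parsed import path rather than on raw text avoids false positives from comments or string literals that happen to contain the word "time".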
This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a goroutine calling `time.Sleep` to wait for expiration to occur. The default expiration timeout is 10 minutes. This meant that for every IP allocated via CNI ADD, we had a goroutine unconditionally sleeping for 10 minutes, only to (in most cases) wake up and learn that the expiration timer was stopped. This commit improves that situation by having the expiration goroutine wake up and exit early if it was stopped (either via IP Release or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5 seconds (see cilium#27253). This meant that the `time.Sleep` expiration timer would effectively be 5 seconds instead of 10 minutes. However, 5 seconds is not enough for an endpoint to be created via CNI ADD and complete its first endpoint regeneration. This therefore led to endpoint IPs being released while the endpoint was still being created. Due to another bug (fixed in the next commit), the expiration timer failed to actually release the IP, which is why this bug was not discovered earlier when we introduced the 5 second limit. This commit addresses this issue by adding an escape hatch to `pkg/time`, allowing the creation of a timer which is not subject to the `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit be9e853 ]

[ backporter's notes: conflicts due to `pkg/time/time.go` not existing in v1.14, skipped changes to this file. ]

This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a goroutine calling `time.Sleep` to wait for expiration to occur. The default expiration timeout is 10 minutes. This meant that for every IP allocated via CNI ADD, we had a goroutine unconditionally sleeping for 10 minutes, only to (in most cases) wake up and learn that the expiration timer was stopped. This commit improves that situation by having the expiration goroutine wake up and exit early if it was stopped (either via IP Release or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5 seconds (see #27253). This meant that the `time.Sleep` expiration timer would effectively be 5 seconds instead of 10 minutes. However, 5 seconds is not enough for an endpoint to be created via CNI ADD and complete its first endpoint regeneration. This therefore led to endpoint IPs being released while the endpoint was still being created. Due to another bug (fixed in the next commit), the expiration timer failed to actually release the IP, which is why this bug was not discovered earlier when we introduced the 5 second limit. This commit addresses this issue by adding an escape hatch to `pkg/time`, allowing the creation of a timer which is not subject to the `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
Cilium's internal business logic relies on a highly parallel combination of reactive handlers for incoming information, "triggers" that ratelimit requests for processing to ensure Cilium does not over-consume resources, and "controllers" that periodically perform updates or resiliency checks of configured state. While in general most things are "eventually consistent", the presence of time-based triggers and controllers can introduce challenges when evaluating how Cilium will perform once the "eventual consistency" is resolved.
As an example of a situation where this eventual consistency caused issues that were not identified during pre-release testing, consider the issue fixed by #27327. Cilium v1.14.0 was released with a bug where Cilium appeared to work correctly for the first several minutes following startup, then afterwards introduced connectivity disruption to specific peers. This was traced down to a logic delay tied to a 10 minute timer following startup, which caused some state to be recomputed and configured in the system, ultimately causing the packet loss.
While better unit testing for the individual package could have caught this issue earlier, the issue was also triggered by the integration between the specific package and other logic in other packages. It is quite difficult to systematically identify time-based errors across the entire agent by relying purely on such testing in each package. This PR attempts to provide a more systematic safety net for timer-based issues by providing a hidden flag to override all timers within Cilium and ensuring that Cilium's CI runs with these timers set to a minimal value.
One of the interesting challenges with this PR is that it can be tempting for developers to rely on time-based tricks in order to ensure the execution order of specific pieces of logic. However, when the system is highly loaded, such mechanisms can become unreliable as an ordering enforcement mechanism. As a side objective, this PR also hopes to make such tricks less viable in order to incentivize implementing better ordering mechanisms.
Review tips: The treewide commit has about 2/3 of the changes but is generated almost entirely from a script, so can be overlooked for initial review.
Related: #28844