Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1.15] bgpv1: Remove disruptive error handling from BGPRouterManager #30735

Merged
merged 4 commits into from Feb 15, 2024

Conversation

YutaroHayakawa
Copy link
Member

Backport of #30382. No conflict was observed.

bgpv1: Remove disruptive error handling from BGPRouterManager

[ upstream commit 1676530 ]

Currently, when the BGP Control Plane meet the error during
reconciliation, it stops and unregisters the server. This comes at a
risk when we accidentally introduce a situation where reconcilers
repeatedly return error for a long time. In such a situation, BGP
Control Plane repeatedly create/destroy server which may cause a BGP
session flapping or frequent route update/withdraw.

We have this problem in two places. The first one is in
BGPRouterManager.register(). When the initial reconciliation fails, BGP
Control Plane stop the server it just creates. In this case, we don't
need to stop server. We can register the server to the RouterManager
before reconciliation and don't have to unregister it even if the
initial reconciliation fails. This is because we have a retry logic now.
The reconciliation failure triggers the retry and in the next cycle, the
reconciliation will be handled in the BGPRouterManager.reconcile()
because the server is already registered.

The second problem is in BGPRouterManager.reconcile(). When the
reconciliation fails, BGP Control Plane stops the BGP server in failure
and unregisters it. Since we have a retry logic, we don't need to stop
BGP servers. We can return error and retry later until we recover from
the failure state.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 7275198 ]

Extend existing reconciler test cases to always execute Reconcile()
twice. This simulates the behavior that BGPRouterManager doesn't clear
the BGP server on reconciliation failure and the next event occurs and
ensures the reconcilers are not relying on that behavior.

As a result, existing reconcilers except NeighborReconciler passed the
test. The fix for the neighbor reconciler will be followed.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 7684f35 ]

There are a lot of repetition of the same fmt.Sprintf in
NeighborReconciler that constructs the same neighbor ID. Introduce a
helper method to construct neighbor ID.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
@YutaroHayakawa YutaroHayakawa added release-note/misc This PR makes changes that have no direct user impact. area/bgp affects/v1.15 This issue affects v1.15 branch labels Feb 13, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.15 This PR represents a backport for Cilium 1.15.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Feb 13, 2024
[ upstream commit 192f37c ]

Currenly, NeighborReconciler only reconciles on old vs new configuration
differences and doesn't take the actual GoBGP's running state into
account. As a result, when the reconciliation of another reconciler
fails and CurrentServer.Config is not updated, it tries to repeat the
previous reconciliation again. However, when it involves adding the
neighbor, it fails, because adding neighbor is not idempotent in GoBGP.

To solve this issue, we track the neighbor state in metadata and
compare the diff between new config and the state instead of old config.
This makes NeighborReconciler idempotent.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
@YutaroHayakawa YutaroHayakawa removed the affects/v1.15 This issue affects v1.15 branch label Feb 14, 2024
@YutaroHayakawa YutaroHayakawa marked this pull request as ready for review February 14, 2024 10:56
@YutaroHayakawa YutaroHayakawa requested a review from a team as a code owner February 14, 2024 10:56
@YutaroHayakawa YutaroHayakawa requested review from a team and rastislavs and removed request for a team February 14, 2024 10:56
@YutaroHayakawa
Copy link
Member Author

/test-backport-1.15

@YutaroHayakawa
Copy link
Member Author

Cilium IPsec upgrade: #29987
Conformance E2E: #30717
Conformance Ginkgo: Tests are passing but failing to upload the JUnit file (https://github.com/cilium/cilium/actions/runs/7900135828/job/21561083207). Maybe rate limiting?

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 15, 2024
@YutaroHayakawa YutaroHayakawa merged commit fa5506c into cilium:v1.15 Feb 15, 2024
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/bgp backport/1.15 This PR represents a backport for Cilium 1.15.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants