Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

route: Specify "proto kernel" for ip routes and rules #31716

Merged
merged 1 commit into from
Apr 9, 2024

Conversation

jschwinger233
Copy link
Member

v1.14 installs routes and rules with specific "proto kernel", v1.15 missed them. Without "proto kernel", it causes troubles when downgrade from v1.15 to v1.14, as v1.14 is deleting routes with "proto kernel" but there are no matching ones.

Fixes a route installing issue which may cause troubles for cilium downgrade.

v1.14 installs routes and rules with specific "proto kernel", v1.15
missed them. Without "proto kernel", it causes troubles when downgrade
from v1.15 to v1.14, as v1.14 is deleting routes with "proto kernel" but
there are no matching ones.

Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024
@jschwinger233 jschwinger233 added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Apr 2, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024
@jschwinger233 jschwinger233 added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. affects/v1.15 This issue affects v1.15 branch labels Apr 2, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024
@jschwinger233 jschwinger233 added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 2, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024
@jschwinger233 jschwinger233 added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Apr 2, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024
@jschwinger233 jschwinger233 added the sig/agent Cilium agent related. label Apr 2, 2024
@jschwinger233
Copy link
Member Author

/test

@julianwiedmann julianwiedmann added the area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. label Apr 2, 2024
@jschwinger233 jschwinger233 marked this pull request as ready for review April 2, 2024 09:14
@jschwinger233 jschwinger233 requested a review from a team as a code owner April 2, 2024 09:14
@julianwiedmann julianwiedmann requested a review from rgo3 April 2, 2024 09:14
@julianwiedmann
Copy link
Member

@jschwinger233 could you take a minute to sketch out how this will affect the downgrade to an older v1.15 release?

From a quick glance

  • the fromProxyDefaultRoute* parts should be fine. On cleanup we don't reference those routes directly, but just delete the whole routing table.
  • toProxyRule looks a bit more tricky. Is older v1.15 still able to manage this rule, now that it has a different Protocol?

@jschwinger233
Copy link
Member Author

jschwinger233 commented Apr 3, 2024

@julianwiedmann It doesn't look good...

This PR installs 0x200 table 2004 proto kernel, then on downgrade to old v1.15, cilium-1.15 calls replaceRule("0x200 table 2004 proto unspec"), which luckily(or unluckily?) doesn't check spec.Protoctol at all due to spec.Protocol == 0: https://github.com/cilium/cilium/blob/v1.15/pkg/datapath/linux/route/route_linux.go#L402. Cilium will see the rule as already existed and do nothing.

Things are even worse when doing upgrade from old v1.15. With this PR, cilium-main calls replaceRule("0x200 table 2004 proto kernel"), which unluckily starts to check spec.Protocol and will find no match, then end up with installing a new but duplicate rule. We will finally have something like:

9:	from all fwmark 0x200/0xf00 lookup 2004 proto unspec
9:	from all fwmark 0x200/0xf00 lookup 2004 proto kernel

@julianwiedmann
Copy link
Member

julianwiedmann commented Apr 3, 2024

@julianwiedmann It doesn't look good...

This PR installs 0x200 table 2004 proto kernel, then on downgrade to old v1.15, cilium-1.15 calls replaceRule("0x200 table 2004 proto unspec"), which luckily(or unluckily?) doesn't check spec.Protoctol at all due to spec.Protocol == 0: https://github.com/cilium/cilium/blob/v1.15/pkg/datapath/linux/route/route_linux.go#L402. Cilium will see the rule as already existed and do nothing.

And with the same logic, an old v1.15 is also able to match & delete the proto=kernel rule when it's no longer needed ...?

Things are even worse when doing upgrade from old v1.15. With this PR, cilium-main calls replaceRule("0x200 table 2004 proto kernel"), which unluckily starts to check spec.Protocol and will find no match, then end up with installing a new but duplicate rule. We will finally have something like:

9:	from all fwmark 0x200/0xf00 lookup 2004 proto unspec
9:	from all fwmark 0x200/0xf00 lookup 2004 proto kernel

Ah, so for upgrade we would also need an unconditional clean-up of the proto=unspec rule :). This would be needed in v1.15 / v1.16, and could go away after?

@jschwinger233
Copy link
Member Author

@julianwiedmann Yes.

I think it can go easier if upgrade path is 1.15-old -> 1.15-tip -> 1.16


upgrade: 1.15-old -> 1.15-tip:

  1. lookupRule should compare proto unconditionally;
  2. ReplaceRule should install new (but duplicated) proto kernel routings;
  3. additional handle to delete stale duplicated routings

downgrade: 1.15-tip -> 1.15-old:

  1. old lookupRule skips proto comparison
  2. ReplaceRule sees routings existed and does nothing
    Then system should leave proto kernel routings only, which don't affect further upgrade.

upgrade: 1.15-tip -> 1.16

Routings are all correct without need to change. ReplaceRule does nothing.


Overall, I think this PR should follow the steps:

patch 1. Add extra code to delete "duplicate" routes / rules.
patch 2. lookupRule checks proto unconditionally. This can lead to "duplicate" routings.
patch 3. Add proto kernel for routes and rules.

The patch order is arranged above so as to not break git bisect.

@julianwiedmann
Copy link
Member

Right, we can stipulate that users will go through 1.15-tip (which cleans up the proto=unspec rule). So 1.16 won't need the cleanup code 👍.

@jschwinger233 jschwinger233 marked this pull request as draft April 3, 2024 11:29
@jschwinger233 jschwinger233 marked this pull request as ready for review April 5, 2024 05:22
@jschwinger233
Copy link
Member Author

I re-mark this as ready for review. Please see the 1.15 backport PR #31777 for upgrade/downgrade issues.

@jschwinger233 jschwinger233 added backport/author The backport will be carried out by the author of the PR. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Apr 5, 2024
Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - think we established that all the upgrade/downgrade smarts need to happen in the v1.15 branch, and we don't need to consider them here.

@julianwiedmann
Copy link
Member

@rgo3 not re-requesting review, but does the reasoning about upgrade/downgrade in #31716 (comment) check out for you? 🙏

@rgo3
Copy link
Contributor

rgo3 commented Apr 8, 2024

SGTM. We already recommend that when upgrading to a new minor version users should upgrade to the latest patch version of the previous minor version first. Do we need to communicate this additionally in some specific 1.16 upgrade guide or is the hint in our general upgrade documentation enough?

@jschwinger233
Copy link
Member Author

Do we need to communicate this additionally in some specific 1.16 upgrade guide or is the hint in our general upgrade documentation enough?

I think the upgrade guide already makes it clear that "upgrade to latest patch version" is required: https://docs.cilium.io/en/stable/operations/upgrade/#step-1-upgrade-to-latest-patch-version

@julianwiedmann julianwiedmann added this pull request to the merge queue Apr 9, 2024
Merged via the queue into cilium:main with commit 318a648 Apr 9, 2024
63 checks passed
@julianwiedmann julianwiedmann added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed affects/v1.15 This issue affects v1.15 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. backport/author The backport will be carried out by the author of the PR. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/agent Cilium agent related. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants