Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.15 Backports 2024-04-02 #31727

Merged
merged 21 commits into from
Apr 8, 2024
Merged

v1.15 Backports 2024-04-02 #31727

merged 21 commits into from
Apr 8, 2024

Commits on Apr 2, 2024

  1. docs: Restructure OpenShift installation instructions

    [ upstream commit efff613 ]
    
    This commit restructures the OpenShift installation instructions to
    point to the Red Hat Ecosystem Catalog, so users may find vendor
    maintained OLM images.
    
    The old installation instructions which refer to the deprecated
    cilium/cilium-olm repository will be moved to the
    isovalent/olm-for-cilium repository.
    
    Fixes: #24270
    
    Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    learnitall authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    3fc680a View commit details
    Browse the repository at this point in the history
  2. nodemap: add node-map-max flag to configure nodemap bpf size.

    [ upstream commit 0c3570c ]
    
    Using the node-map-max flag, one can now override the default 16k node map size.
    This may be needed for large clusters, where the number of distinct node IPs in the cluster exceeds the standard size.
    
    Also provides Size() to nodemap.Map interface such that loader can use this to set the NODE_MAP_MAX var while building bpf programs.
    
    Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
    tommyp1ckles authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    ea2b4b6 View commit details
    Browse the repository at this point in the history
  3. helm: add bpf.nodeMapMax helm val to configure node map size.

    [ upstream commit a596a5a ]
    
    This can be used to override the default node-map-max value which sets bpf node map size.
    In some cases, node map size may need to be overridden for very large clusters.
    
    Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
    tommyp1ckles authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    fdb3d81 View commit details
    Browse the repository at this point in the history
  4. nodemap: add validation to check that node map max is at least 16384.

    [ upstream commit 3d641e1 ]
    
    This is the constant default size prior to adding the flag.
    There's not much reason to lower this value so to avoid edge cases we'll just say that this is the lower bound.
    
    Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    tommyp1ckles authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    ec9b1c9 View commit details
    Browse the repository at this point in the history
  5. docs: add section for scale implications of nodemap size.

    [ upstream commit 4367ffa ]
    
    With previous commits adding the ability to adjust nodemap size, this adds a section explaining the implications of the nodemap sizing.
    
    Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    tommyp1ckles authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    ca3c006 View commit details
    Browse the repository at this point in the history
  6. docs: ipsec: document native-routing + Egress proxy case

    [ upstream commit a2bf108 ]
    
    Let the docs reflect the limitation from
    GHSA-j89h-qrvr-xc36.
    
    Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    julianwiedmann authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    7d473be View commit details
    Browse the repository at this point in the history
  7. ingress: sort all shared ingresses during model generation

    [ upstream commit 16f6afe ]
    
    Currently, when building the model for shared Ingress, all Ingresses
    in the cluster are listed and processed. The order of the Ingresses can
    differ and potentially influence the generated CiliumEnvoyConfig. This
    can lead to unnecessary reconciles. (Even though the internal translation
    already handles a stable CiliumEnvoyConfig generation where possible.)
    
    In addition to the existing stable translation logic, this commit sorts
    all shared Ingresses by their namespace and name before processing. This way
    a consistent translation is more likely to be guaranteed.
    
    Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    mhofstetter authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    ebbd145 View commit details
    Browse the repository at this point in the history
  8. ci-e2e: Add e2e test with WireGuard + Host Firewall

    [ upstream commit 54b2ce4 ]
    
    To get more coverage about the host firewall, let's add a new job in the
    e2e test suites to run it alongside WireGuard encryption.
    
    Signed-off-by: Quentin Monnet <qmo@qmon.net>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    qmonnet authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    bac890f View commit details
    Browse the repository at this point in the history
  9. fqdn: Update DNS Restore to Index to PortProto

    [ upstream commit 1941679 ]
    
    DNS Proxy needs to account for protocol when indexing
    L7 DNS rules that it needs to adhere to, otherwise
    L7 rules with differing port-protocols can override
    each other (nondeterministically) and create overly
    restrictive, and incorrect DNS rules. The problem with
    accounting for protocol is that Endpoint restoration
    logic uses DNS rules that index to port-only as JSON
    saved to disk. Adding an additional protocol index to
    a map structure changes the JSON structure and breaks
    restoration logic between Cilium versions.
    
    This change makes the map index backwards compatible,
    since it changes the index from a uint16 to a uint32,
    both of which marshal the same into a JSON structure.
    The endpoint restoration logic will succeed between
    versions, because the older version will be
    automatically differentiated with a lack of a 1-bit
    at bit position 24. Version 2 will save a 1 bit at the
    24th bit going forward to differentiate when protocol
    is indexed or not present.
    
    Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    nathanjsweet authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    397e840 View commit details
    Browse the repository at this point in the history
  10. fqdn: Add Protocol to DNS Proxy Cache

    [ upstream commit bc7fbf3 ]
    
    DNS Proxy indexes domain selectors by port
    only. In cases where protocols collide on port
    the DNS proxy may have a more restrictive selector
    than it should because it does not merge port
    protocols for L7 policies (only ports).
    
    All callers of the DNS Proxy are updated
    to add protocol to any DNS Proxy entries, and all
    tests are updated to test for port-protocol
    merge errors.
    
    Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
    nathanjsweet authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    5f04cc0 View commit details
    Browse the repository at this point in the history
  11. endpoint: Create a New Restore Field for DNS

    [ upstream commit 6baab36 ]
    
    DNSRulesV2 accounts for protocol and DNSRules does not.
    DNSProxy needs to account for both, and endpoint needs
    to be able to restore from a downgrade. DNSRulesV2 is used
    by default now, but DNSRules is maintained in case of a
    downgrade.
    
    Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
    nathanjsweet authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    9a4d00d View commit details
    Browse the repository at this point in the history
  12. fqdn: Fallback to Version 1 Port Lookups

    [ upstream commit abd7c6e ]
    
    In cases where a port-protocol is not present
    in an restored port protocol, look up
    up the Version 1 version of the PortoProto
    in case a Version 1 PortProto was restored.
    
    Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    nathanjsweet authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    616e50e View commit details
    Browse the repository at this point in the history
  13. testing: Update Restore Sort Method Signatrues

    [ upstream commit 5185252 ]
    
    The Sort methods are updated to take an unused
    testing.T structure to indicate to all callers
    that they are only for testing purposes.
    
    Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    nathanjsweet authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    6226c85 View commit details
    Browse the repository at this point in the history
  14. docs: Document No node ID drops in case of remote node deletion

    [ upstream commit e2e97f3 ]
    
    While testing cluster scale downs, we noticed that under constant
    traffic load, we sometimes had drops of type "No node ID found". We
    confirmed that these are expected when the remote node was just deleted,
    the delete event received by the local agent, but a local pod is still
    sending traffic to pods on that node. In that case, the node is removed
    from the node ID map, but information on pods hosted by that node may
    still be present.
    
    This commit documents it with the other expected reasons for "No node
    ID found" drops.
    
    Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    pchaigno authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    415ad74 View commit details
    Browse the repository at this point in the history
  15. cilium-health: Fix broken retry loop in cilium-health-ep controller

    [ upstream commit 43bd8c1 ]
    
    This commit fixes a bug in the `cilium-health-ep` controller restart
    logic where it did not give the cilium-health endpoint enough time to
    startup before it was re-created.
    
    For context, the `cilium-health-ep` performs two tasks:
    
      1. Launch the cilium-health endpoint when the controller is started
         for the first time.
      2. Ping the cilium-health endpoint, and if it does not reply, destroy
         and re-create it.
    
    The controller has a `RunInterval` of 60 seconds and a default
    `ErrorRetryBaseDuration` of 1 second. This means that after launching
    the initial cilium-health endpoint, we wait for 60 seconds before we
    attempt to ping it. If that ping succeeds, we then keep pinging the
    health endpoint every 60 seconds.
    
    However, if a ping fails, the controller deletes the existing endpoint
    and creates a new one. Because the controller then also returns an
    error, it is immediately re-run after one second, because in the failure
    case a controller retries with an interval of `consecutiveErrors *
    ErrorRetryBaseDuration`.
    
    This meant that after a failed ping, we deleted the unreachable
    endpoint, recreated a new one, and after 1s would immediately try to
    ping it. Because the newly launched endpoint will is unlikely to be
    reachable after just one second (it requires a full endpoint
    regeneration with BPF compilation), the `cilium-health-ep` logic would
    declare the still starting endpoint as dead and re-create it. This loop
    would continue endlessly, causing lots of unnecessary CPU churn, until
    enough consecutive errors have happened for the wait time between launch
    and the first ping to be long enough for a cilium-health endpoint to be
    fully regenerated.
    
    This commit attempts to fix the logic by not immediately killing a
    unreachable health endpoint and instead waiting for three minutes to
    pass before we attempt to try again. Three minutes should hopefully be
    enough time for the initial endpoint regeneration to succeed.
    
    Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
    gandro authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    36ee729 View commit details
    Browse the repository at this point in the history
  16. workflows: ipsec-e2e: add missing key types for some configs

    [ upstream commit 283cb04 ]
    
    These configs were recent additions, and missed the introduction of
    the key-type-* parameters. Add them now.
    
    Suggested-by: Paul Chaignon <paul.chaignon@gmail.com>
    Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    julianwiedmann authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    3795743 View commit details
    Browse the repository at this point in the history
  17. workflows: Debug info for key rotations

    [ upstream commit 820aa07 ]
    
    During the key rotations, we compare the number of keys to the expected
    number to know where we are in the process (started the rotation or
    finished it). The expected number of keys depends on the configuration
    so let's print it in the logs to help debug.
    
    Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    pchaigno authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    0b32c96 View commit details
    Browse the repository at this point in the history
  18. fix: Delegated ipam not configure ipv6 in ipv6 disabled case

    [ upstream commit 034aee7 ]
    
    Delegated ipam returns ipv6 address to cilium cni even if ipv6 disabled
    in cilium agent config. In this scenario, ipv6 node addressing is not
    set and its causing cilium cni to crash if delegated ipam returns ipv6
    but disabled in cilium agent.
    
    Signed-off-by: Tamilmani <tamanoha@microsoft.com>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    tamilmani1989 authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    50522a7 View commit details
    Browse the repository at this point in the history
  19. install/kubernetes: use digest for nodeinit image

    [ upstream commit 2d32dab ]
    
    Like other images used in the Cilium helm chart, use a digest in
    addition to the tag for the nodeinit image.
    
    Signed-off-by: Tobias Klauser <tobias@cilium.io>
    tklauser authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    b58f7c3 View commit details
    Browse the repository at this point in the history
  20. install/kubernetes: use renovate to update quay.io/cilium/startup-script

    [ upstream commit ac804b6 ]
    
    Make sure the latest version of the image is used in the helm charts by
    letting renovatebot update it automatically.
    
    Signed-off-by: Tobias Klauser <tobias@cilium.io>
    tklauser authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    a32bccb View commit details
    Browse the repository at this point in the history
  21. ci/ipsec: Print more info to debug credentials removal check failures

    [ upstream commit 129f2e2 ]
    
    In commit 6fee46f ("ci/ipsec: Fix downgrade version retrieval") we
    added a check to make sure that GitHub credentials are removed before
    pulling the untrusted branch from the Pull Request's author. It appears
    that this check occasionally fails and causes the whole job to abort.
    But Cilium's repository _is_ public, and it's unclear why ".private ==
    false" does not evaluate to "false" as we expected in that case. Did the
    curl request fail? Did the reply miss the expected .private field? We'll
    probably loosen the check as a workaround, but before that it would be
    interesting to understand better what's going on. Here we remove the -s
    flag from curl and print the reply from the GitHub API request, so we
    can better understand what's going on next time we observe a failure.
    
    Signed-off-by: Quentin Monnet <qmo@qmon.net>
    Signed-off-by: Jussi Maki <jussi@isovalent.com>
    qmonnet authored and joamaki committed Apr 2, 2024
    Configuration menu
    Copy the full SHA
    d2507d3 View commit details
    Browse the repository at this point in the history