
Fix the Node bootstrap problem #68

Merged
merged 2 commits into gardener:main from fix/node-bootstrap-problem on Oct 19, 2023

Conversation

@ialidzhikov (Member) commented Oct 13, 2023

How to categorize this PR?

/kind bug

What this PR does / why we need it:
Currently, when a Shoot requests caches for all upstreams used by Shoot system components, things regress in several ways. In OSS Gardener these upstreams would be quay.io for the calico images, registry.k8s.io for kube-proxy and others, and eu.gcr.io for the rest. Example extension configuration for that case:

  extensions:
  - type: registry-cache
    providerConfig:
      apiVersion: registry.extensions.gardener.cloud/v1alpha1
      kind: RegistryConfig
      caches:
      - upstream: eu.gcr.io
        size: 10Gi
      - upstream: quay.io
        size: 10Gi
      - upstream: registry.k8s.io
        size: 10Gi

On creation of such a Shoot, the image pull times of the Shoot system components are abnormally high. The quay.io/calico/cni, quay.io/calico/calico-node and registry.k8s.io/kube-proxy images are pulled in 2m, 3m or even 5m, while it usually takes up to 10s to pull these images from the upstream.
The reason for this behaviour lies in the design of the extension. We configure a containerd mirror/registry using the cluster IP of the registry cache Service in the Shoot cluster. For the Service's cluster IP to be reachable from a newly joining Node, the networking has to be set up, i.e. the kube-proxy and calico components have to be running. Otherwise, containerd cannot reach the Service's cluster IP and falls back to the upstream registry.
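For illustration, the resulting containerd mirror configuration looks roughly like this (a sketch only; the upstream, the config path and the cluster IP are example values and not taken from this PR's code):

  # Hypothetical /etc/containerd/certs.d/registry.k8s.io/hosts.toml (path and values are example assumptions)
  server = "https://registry.k8s.io"

  # Cluster IP and port of the registry cache Service in the Shoot (example value)
  [host."http://10.4.196.64:5000"]
    capabilities = ["pull", "resolve"]
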
The reason for the abnormally high image pull times is that initially (until kube-proxy starts running) image pull requests from containerd to the Service's cluster IP time out after 30s. Only after this timeout does the fallback to the upstream (for example eu.gcr.io or quay.io) happen. containerd makes many requests per image pull - a HEAD request to resolve the manifest by tag, a GET request for the manifest by SHA, and GET requests for the blobs (see the request sketch after the list below). Each of these requests times out after 30s before the fallback to the upstream host happens. In the end the image pull succeeds, but it takes minutes. Yesterday I was experimenting with docker.io/library/alpine:3.13.2 and eu.gcr.io/gardener-project/gardener/ops-toolbelt:0.18.0 in a setup where the registry mirror configured in containerd is broken (the Service is deleted and containerd refers to a no longer existing cluster IP):

  • Image pull from the upstream for docker.io/library/alpine:3.13.2 takes ~2s, while image pull of the same image with an unavailable registry cache takes ~2m2s.
  • Image pull from the upstream for eu.gcr.io/gardener-project/gardener/ops-toolbelt:0.18.0 takes ~10s, while image pull of the same image with an unavailable registry cache takes ~3m10s.
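To make the request sequence concrete, here is roughly what containerd issues against the cache for a single image pull, using the standard registry HTTP API (a sketch; the cache address is an example value from the logs below, and the digest variables are placeholders obtained from the previous responses):

  # Requests containerd makes per image pull, each of which first runs into the 30s timeout.
  CACHE="http://10.4.196.64:5000"   # cluster IP of the registry cache Service (example value)

  # 1) HEAD: resolve the tag to a manifest digest (returned in the Docker-Content-Digest header).
  curl -sI "${CACHE}/v2/library/alpine/manifests/3.13.2"

  # 2) GET: fetch the manifest by digest ($MANIFEST_DIGEST comes from the response above).
  curl -s "${CACHE}/v2/library/alpine/manifests/${MANIFEST_DIGEST}"

  # 3) GET: fetch every layer blob referenced by the manifest ($LAYER_DIGEST from the manifest).
  curl -s "${CACHE}/v2/library/alpine/blobs/${LAYER_DIGEST}"

With each of these requests first waiting 30s for the unreachable cluster IP, even a small image quickly accumulates minutes of pull time.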

We see that after kube-proxy starts running, requests to the Service IP of the registry-cache Service no longer time out after 30s (dial tcp 10.4.196.64:5000: i/o timeout) but are rejected right away (dial tcp 10.4.225.49:5000: connect: connection refused).

We believe that this is because kube-proxy starts running and configures iptables rules on the Node for the Service IPs. At that time the registry cache Pods are not yet running, hence the Service does not have any available Endpoints. Most probably kube-proxy in that case configures a "black hole" for such an Endpoint-less Service.
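For context, the iptables mode of kube-proxy installs a reject rule for the cluster IP of a Service that has no ready Endpoints, which matches the connection refused errors above. The rule has roughly the following shape (illustrative only; the namespace and Service name are hypothetical):

  # Filter-table rule kube-proxy adds for a ClusterIP Service without endpoints (illustrative).
  # "kube-system/registry-quay-io" is a hypothetical namespace/Service name.
  iptables -A KUBE-SERVICES -d 10.4.225.49/32 -p tcp --dport 5000 \
    -m comment --comment "kube-system/registry-quay-io has no endpoints" \
    -j REJECT --reject-with icmp-port-unreachable

Once the registry cache Pod becomes ready, such a rule is replaced by the regular forwarding rules to the Pod endpoint.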

In total, with the current design of the extension:

  • We cannot cache most of the Shoot system components. With the current approach we cannot cache the calico and kube-proxy images, nor the images that are pulled at the same time as them (CSI drivers, apiserver-proxy, node-problem-detector and others), because calico and kube-proxy are not yet running on the Node and have not yet set up the networking that containerd needs to reach the cache.
  • We regress the Node creation time a lot (>= 10 min when I tested with the extension in the local setup and for an AWS Shoot). This is because of the behaviour described above: while kube-proxy is not yet running, every containerd request times out after 30s and only then falls back to the upstream registry. For example, 6 containerd requests for an image pull (each timing out after 30s) result in > 3 min of image pull time. Image pulls of Shoot system components like calico and kube-proxy that usually take seconds now take minutes because of this.

To address these issues, the extension is redesigned as follows. During the OSC mutation, we no longer append the hosts.toml files with the registry config. Instead, we append a new systemd unit. The systemd unit waits for each cache to become available (its Service IP starts returning HTTP 200) and only then creates the corresponding hosts.toml file. For the uninstall case, a DaemonSet is deployed that cleans up the hosts.toml files and the unit (if needed).
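Conceptually, the script executed by that systemd unit does something like the following for each configured cache (a minimal sketch under assumptions; the readiness check, paths and values shown are illustrative and do not necessarily match the actual implementation in this PR):

  #!/bin/bash
  # Sketch: wait until the registry cache answers, then write the containerd mirror config.
  upstream="registry.k8s.io"            # upstream served by this cache (example value)
  cache_url="http://10.4.196.64:5000"   # cluster IP of the registry cache Service (example value)

  # Wait until the cache responds with HTTP 200 on the registry API root.
  until [ "$(curl -s -o /dev/null -w '%{http_code}' "${cache_url}/v2/")" = "200" ]; do
    sleep 5
  done

  # Only now create the hosts.toml, so containerd starts using the mirror for this upstream.
  mkdir -p "/etc/containerd/certs.d/${upstream}"
  cat > "/etc/containerd/certs.d/${upstream}/hosts.toml" <<EOF
  server = "https://${upstream}"

  [host."${cache_url}"]
    capabilities = ["pull", "resolve"]
  EOF

The key point of the design is the ordering: containerd only gets a hosts.toml for an upstream once the corresponding cache is actually reachable, so it never has to wait on a dead cluster IP.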

The drawback of this approach is that the registry-cache extension won't be able to cache the Shoot system components pulled during Node bootstrap, even if a cache for the corresponding upstream is requested. However, with the new design we no longer regress the image pull times of the Shoot system components.

Which issue(s) this PR fixes:
Part of #3

Special notes for your reviewer:
N/A

Release note:

NONE

@gardener-prow bot (Contributor) commented Oct 13, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added kind/bug Bug do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Oct 13, 2023
@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 13, 2023
@gardener-prow gardener-prow bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 13, 2023
@ialidzhikov ialidzhikov marked this pull request as ready for review October 13, 2023 14:31
@gardener-prow gardener-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 13, 2023
@gardener-prow gardener-prow bot requested a review from rfranzke October 13, 2023 14:31
@ialidzhikov (Member, Author) commented:
Cloning into 'gardener'...
fatal: unable to access 'https://github.com/gardener/gardener.git/': The requested URL returned error: 403
make: *** [Makefile:120: ci-e2e-kind] Error 128

/test pull-gardener-extension-registry-cache-e2e-kind

@ialidzhikov force-pushed the fix/node-bootstrap-problem branch 2 times, most recently from 0c7af68 to 2abb4f1 on October 16, 2023 11:52
@ialidzhikov (Member, Author) commented:
  [FAILED] Unexpected error:
      <*errors.StatusError | 0xc0006acaa0>: 
      error dialing backend: dial tcp 10.2.244.99:9443: connect: connection refused
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "error dialing backend: dial tcp 10.2.244.99:9443: connect: connection refused",
              Reason: "",
              Details: nil,
              Code: 500,
          },
      }
  occurred
  In [It] at: /home/prow/go/src/github.com/gardener/gardener-extension-registry-cache/test/common/common.go:107 @ 10/16/23 12:53:09.433

Seems like a networking error. The test is passing locally.

/test pull-gardener-extension-registry-cache-e2e-kind

@dimitar-kostadinov (Contributor) commented:
/assign

@dimitar-kostadinov (Contributor) left a comment:
Some minor findings, otherwise looks good

@ialidzhikov (Member, Author) commented:
/approve

@gardener-prow bot (Contributor) commented Oct 19, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ialidzhikov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. and removed cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. labels Oct 19, 2023
@dimitar-kostadinov (Contributor) left a comment:
/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2023
@gardener-prow bot (Contributor) commented Oct 19, 2023

LGTM label has been added.

Git tree hash: 6e9b98d14c61b217d9fe1fdce2aee2e03ece8869

@gardener-prow gardener-prow bot merged commit b00274f into gardener:main Oct 19, 2023
6 checks passed
@ialidzhikov ialidzhikov deleted the fix/node-bootstrap-problem branch October 20, 2023 06:31