
Fix the Node bootstrap problem #68

Merged
merged 2 commits into gardener:main from fix/node-bootstrap-problem on Oct 19, 2023

Conversation

@ialidzhikov (Member) commented Oct 13, 2023

How to categorize this PR?

/kind bug

What this PR does / why we need it:
Currently, when a Shoot requests caches for all upstreams used by Shoot system components, things regress in several ways. In OSS Gardener these upstreams would be quay.io for the calico images, registry.k8s.io for kube-proxy and others, and eu.gcr.io for the rest. Example extension configuration for that case:

  extensions:
  - type: registry-cache
    providerConfig:
      apiVersion: registry.extensions.gardener.cloud/v1alpha1
      kind: RegistryConfig
      caches:
      - upstream: eu.gcr.io
        size: 10Gi
      - upstream: quay.io
        size: 10Gi
      - upstream: registry.k8s.io
        size: 10Gi

On creation of such a Shoot, the image pull times of the Shoot system components are abnormally high. The quay.io/calico/cni, quay.io/calico/calico-node and registry.k8s.io/kube-proxy images are pulled in 2m, 3m or even 5m, while it usually takes up to 10s to pull these images from the upstream.
The reason for this behaviour lies in the design of the extension. We configure a containerd mirror/registry using the cluster IP of the registry cache Service in the Shoot cluster. For the Service's cluster IP to be reachable from a newly joining Node, the networking has to be set up, i.e. the kube-proxy and calico components have to be running. Otherwise, containerd cannot reach the Service's cluster IP and falls back to the upstream registry.
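For illustration, the resulting containerd mirror configuration looks roughly like this (a sketch only; the upstream, the config path and the cluster IP are example values and not taken from this PR's code):

  # Hypothetical /etc/containerd/certs.d/registry.k8s.io/hosts.toml (path and values are example assumptions)
  server = "https://registry.k8s.io"

  # Cluster IP and port of the registry cache Service in the Shoot (example value)
  [host."http://10.4.196.64:5000"]
    capabilities = ["pull", "resolve"]
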
The reason for the abnormally high image pull times is that initially (until kube-proxy starts running) image pull requests from containerd to the Service's cluster IP time out after 30s. Only after this timeout does the fallback to the upstream (for example eu.gcr.io or quay.io) happen. containerd makes many requests per image pull - a HEAD request to resolve the manifest by tag, a GET request for the manifest by SHA, and GET requests for the blobs (see the request sketch after the list below). Each of these requests times out after 30s before the fallback to the upstream host happens. In the end the image pull succeeds, but it takes minutes. Yesterday I was experimenting with docker.io/library/alpine:3.13.2 and eu.gcr.io/gardener-project/gardener/ops-toolbelt:0.18.0 in a setup where the registry mirror configured in containerd is broken (the Service is deleted and containerd refers to a no longer existing cluster IP):

  • Image pull from the upstream for docker.io/library/alpine:3.13.2 takes ~2s, while image pull of the same image with an unavailable registry cache takes ~2m2s.
  • Image pull from the upstream for eu.gcr.io/gardener-project/gardener/ops-toolbelt:0.18.0 takes ~10s, while image pull of the same image with an unavailable registry cache takes ~3m10s.
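To make the request sequence concrete, here is roughly what containerd issues against the cache for a single image pull, using the standard registry HTTP API (a sketch; the cache address is an example value from the logs below, and the digest variables are placeholders obtained from the previous responses):

  # Requests containerd makes per image pull, each of which first runs into the 30s timeout.
  CACHE="http://10.4.196.64:5000"   # cluster IP of the registry cache Service (example value)

  # 1) HEAD: resolve the tag to a manifest digest (returned in the Docker-Content-Digest header).
  curl -sI "${CACHE}/v2/library/alpine/manifests/3.13.2"

  # 2) GET: fetch the manifest by digest ($MANIFEST_DIGEST comes from the response above).
  curl -s "${CACHE}/v2/library/alpine/manifests/${MANIFEST_DIGEST}"

  # 3) GET: fetch every layer blob referenced by the manifest ($LAYER_DIGEST from the manifest).
  curl -s "${CACHE}/v2/library/alpine/blobs/${LAYER_DIGEST}"

With each of these requests first waiting 30s for the unreachable cluster IP, even a small image quickly accumulates minutes of pull time.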

We see that after kube-proxy starts running, requests to the Service IP of the registry-cache Service no longer time out after 30s (dial tcp 10.4.196.64:5000: i/o timeout) but are rejected right away (dial tcp 10.4.225.49:5000: connect: connection refused).

We believe that this is because kube-proxy starts running and configures iptables rules on the Node for the Service IPs. At that time the registry cache Pods are not yet running, hence the Service does not have any available Endpoints. Most probably kube-proxy in that case configures a "black hole" for such an Endpoint-less Service.
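For context, the iptables mode of kube-proxy installs a reject rule for the cluster IP of a Service that has no ready Endpoints, which matches the connection refused errors above. The rule has roughly the following shape (illustrative only; the namespace and Service name are hypothetical):

  # Filter-table rule kube-proxy adds for a ClusterIP Service without endpoints (illustrative).
  # "kube-system/registry-quay-io" is a hypothetical namespace/Service name.
  iptables -A KUBE-SERVICES -d 10.4.225.49/32 -p tcp --dport 5000 \
    -m comment --comment "kube-system/registry-quay-io has no endpoints" \
    -j REJECT --reject-with icmp-port-unreachable

Once the registry cache Pod becomes ready, such a rule is replaced by the regular forwarding rules to the Pod endpoint.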

In total, with the current design of the extension:

  • We cannot cache most of the Shoot system components. With the current approach we cannot cache the calico and kube-proxy images, nor the images that are pulled at the same time as them (CSI drivers, apiserver-proxy, node-problem-detector and others), because calico and kube-proxy are not yet running on the Node and have not yet set up the networking that containerd needs to reach the cache.
  • We regress the Node creation time a lot (>= 10 min when I tested with the extension in the local setup and for an AWS Shoot). This is because of the behaviour described above: while kube-proxy is not yet running, every containerd request times out after 30s and only then falls back to the upstream registry. For example, 6 containerd requests for an image pull (each timing out after 30s) result in > 3 min of image pull time. Image pulls of Shoot system components like calico and kube-proxy that usually take seconds now take minutes because of this.

To address these issues, the extension is redesigned as follows. During the OSC mutation, we no longer append the hosts.toml files with the registry config. Instead, we append a new systemd unit. The systemd unit waits for each cache to become available (its Service IP starts returning HTTP 200) and only then creates the corresponding hosts.toml file. For the uninstall case, a DaemonSet is deployed that cleans up the hosts.toml files and the unit (if needed).
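Conceptually, the script executed by that systemd unit does something like the following for each configured cache (a minimal sketch under assumptions; the readiness check, paths and values shown are illustrative and do not necessarily match the actual implementation in this PR):

  #!/bin/bash
  # Sketch: wait until the registry cache answers, then write the containerd mirror config.
  upstream="registry.k8s.io"            # upstream served by this cache (example value)
  cache_url="http://10.4.196.64:5000"   # cluster IP of the registry cache Service (example value)

  # Wait until the cache responds with HTTP 200 on the registry API root.
  until [ "$(curl -s -o /dev/null -w '%{http_code}' "${cache_url}/v2/")" = "200" ]; do
    sleep 5
  done

  # Only now create the hosts.toml, so containerd starts using the mirror for this upstream.
  mkdir -p "/etc/containerd/certs.d/${upstream}"
  cat > "/etc/containerd/certs.d/${upstream}/hosts.toml" <<EOF
  server = "https://${upstream}"

  [host."${cache_url}"]
    capabilities = ["pull", "resolve"]
  EOF

The key point of the design is the ordering: containerd only gets a hosts.toml for an upstream once the corresponding cache is actually reachable, so it never has to wait on a dead cluster IP.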

The drawback of this approach is that the registry-cache extension won't be able to cache the Shoot system components pulled during Node bootstrap, even if a cache for the corresponding upstream is requested. However, with the new design we no longer regress the image pull times of the Shoot system components.

Which issue(s) this PR fixes:
Part of #3

Special notes for your reviewer:
N/A

Release note:

NONE

@gardener-prow bot (Contributor) commented Oct 13, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added kind/bug Bug do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Oct 13, 2023
@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 13, 2023
@gardener-prow gardener-prow bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 13, 2023
@ialidzhikov ialidzhikov marked this pull request as ready for review October 13, 2023 14:31
@gardener-prow gardener-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 13, 2023
@gardener-prow gardener-prow bot requested a review from rfranzke October 13, 2023 14:31
@ialidzhikov (Member, Author) commented:
Cloning into 'gardener'...
fatal: unable to access 'https://github.com/gardener/gardener.git/': The requested URL returned error: 403
make: *** [Makefile:120: ci-e2e-kind] Error 128

/test pull-gardener-extension-registry-cache-e2e-kind

@ialidzhikov force-pushed the fix/node-bootstrap-problem branch 2 times, most recently from 0c7af68 to 2abb4f1 on October 16, 2023 11:52
@ialidzhikov (Member, Author) commented:
  [FAILED] Unexpected error:
      <*errors.StatusError | 0xc0006acaa0>: 
      error dialing backend: dial tcp 10.2.244.99:9443: connect: connection refused
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "error dialing backend: dial tcp 10.2.244.99:9443: connect: connection refused",
              Reason: "",
              Details: nil,
              Code: 500,
          },
      }
  occurred
  In [It] at: /home/prow/go/src/github.com/gardener/gardener-extension-registry-cache/test/common/common.go:107 @ 10/16/23 12:53:09.433

Seems like a networking error. The test is passing locally.

/test pull-gardener-extension-registry-cache-e2e-kind

@dimitar-kostadinov (Contributor) commented:
/assign

@dimitar-kostadinov (Contributor) left a comment:
Some minor findings, otherwise looks good

@ialidzhikov (Member, Author) commented:
/approve

@gardener-prow bot (Contributor) commented Oct 19, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ialidzhikov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. and removed cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. labels Oct 19, 2023
@dimitar-kostadinov (Contributor) left a comment:
/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2023
@gardener-prow bot (Contributor) commented Oct 19, 2023

LGTM label has been added.

Git tree hash: 6e9b98d14c61b217d9fe1fdce2aee2e03ece8869

@gardener-prow gardener-prow bot merged commit b00274f into gardener:main Oct 19, 2023
6 checks passed
@ialidzhikov ialidzhikov deleted the fix/node-bootstrap-problem branch October 20, 2023 06:31