
Resources under the single app limit issue #41

Closed
pavel-khritonenko opened this issue Sep 29, 2019 · 23 comments
Labels
discussion: This issue is not a bug or feature and a conversation is needed to find an appropriate resolution
helping with an issue: Debugging happening to identify the problem
in progress: Work has begun by a community member or a maintainer; this issue may be included in a future release

Comments

@pavel-khritonenko

pavel-khritonenko commented Sep 29, 2019

I use kapp to manage a set of deployments. Under a single application, I deploy about ~230 (generated) resources. At some point, deployments started taking a long time, and after adding more resources it stopped working at all. It hangs for a couple of minutes and then I get the following error when I run it locally:

Error: Listing schema.GroupVersionResource{Group:"apps", Version:"v1", Resource:"replicasets"}, namespaced: true: Stream error http2.StreamError{StreamID:0x10b, Code:0x2, Cause:error(nil)} when reading response body, may be caused by closed connection. Please retry.

When I run it closer to the target Kubernetes cluster (in the same AWS network), it works better (fails less often).

$ kapp version
Client Version: 0.11.0

Succeeded

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-19T13:57:45Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.6-eks-5047ed", GitCommit:"5047edce664593832e9b889e447ac75ab104f527", GitTreeState:"clean", BuildDate:"2019-08-21T22:32:40Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
@cppforlife added the "discussion" and "helping with an issue" labels on Oct 1, 2019
@cppforlife
Contributor

cppforlife commented Oct 1, 2019

hey @pavel-khritonenko

Under a single application, I deploy about ~230 (generated) resources. At some point, deployments started taking a long time, and after adding more resources it stopped working at all

that's interesting... how long does it hang until the error?

It hangs for a couple of minutes and then I get the following error when I run it locally:

does it hang before showing the diff (im guessing so since the error includes Listing ...)?

can you describe your cluster a bit more, more specifically:

  • is it mostly the same resource type (eg ConfigMaps) within this application?
  • how many resource types are there (kubectl api-resources)?
  • do you limit your user account (used by kapp) to specific namespace(s), or are all resources available to it?
  • what cluster provider are you using (eg GKE, EKS, etc.)?

@cppforlife
Contributor

cppforlife commented Oct 1, 2019

im also curious how long the following command takes against your cluster: https://gist.github.com/cppforlife/25890e4a9e732413bbf83c81e4a808b1 (122 resources to be created, ~3s to check the cluster and calculate the diff)

@cppforlife
Contributor

im also creating a new release of kapp that includes a debug flag so that we can get to the bottom of what's going on in your cluster.

@pavel-khritonenko
Author

Sorry for disappearing:

how long does it hang until the error?

Executed several times on my local machine (100 Mbit wifi):
4 times it failed in about 1 minute (00:01:09 - 00:01:11); the 5th and 6th times it succeeded in 0:01:30-0:01:38.

does it hang before showing the diff

correct, before diff

[screenshot attached: 2019-10-03 at 13:51:18]

is it mostly the same resource type (eg ConfigMaps) within this application?

It's about 20 instances of the same application with different settings (2 Deployments, 1 Certificate CRD, 2 PodDisruptionBudgets, 2 ConfigMaps, 1 Ingress, 2 Services each).

kubectl api-resources

https://gist.github.com/pavel-khritonenko/a4ffb3bec510a1d4d1a3b419cfd92993

do you limit your user account (used by kapp) to specific namespace(s) or are all resources available to it?

Cluster admin permissions (no limits)

what cluster provider are you using

EKS (amazon web services)

@pavel-khritonenko
Author

I have since added the deployment to our CI/CD process and run it manually from GitLab on a runner near the cluster (in the same subnet), and there it never fails.

@pavel-khritonenko
Author

im also curious how long the following command takes against your cluster: https://gist.github.com/cppforlife/25890e4a9e732413bbf83c81e4a808b1 (122 resources to be created, ~3s to check the cluster and calculate the diff)

19 seconds, success

@pavel-khritonenko
Author

pavel-khritonenko commented Oct 3, 2019

@cppforlife
Contributor

cppforlife commented Oct 3, 2019

19 seconds, success

oh, that's interesting. im using a default GKE cluster with 3 nodes and would have expected a similar response time (~3s).

EKS (amazon web services)

how beefy are the control plane machines? not sure if AWS tells you those details.

@pavel-khritonenko
Author

They don't tell us such things. If you want me to run any requests with timings, I'd be glad to test for you.

@cppforlife added the "in progress" label on Oct 3, 2019
@cppforlife
Contributor

@pavel-khritonenko would you mind trying out https://github.com/k14s/kapp/releases/tag/v0.14.0 with the --debug flag and posting the results?

@pavel-khritonenko
Author

pavel-khritonenko commented Oct 4, 2019

$ cue dump | kapp deploy --debug --wait=false -a frontends -f -
02:24:51PM: debug: CommandRun: start
02:24:51PM: debug: RecordedApp: CreateOrUpdate: start
02:24:52PM: debug: RecordedApp: CreateOrUpdate: end
02:24:54PM: debug: LabeledResources: Prepare: start
02:24:54PM: debug: LabeledResources: Prepare: end
02:24:54PM: debug: LabeledResources: AllAndMatching: start
02:24:54PM: debug: LabeledResources: All: start
02:24:54PM: debug: IdentifiedResources: List: start
02:27:15PM: debug: IdentifiedResources: List: end
02:27:15PM: debug: LabeledResources: All: end
02:27:15PM: debug: LabeledResources: AllAndMatching: end
02:27:15PM: debug: CommandRun: end

Error: Listing schema.GroupVersionResource{Group:"extensions", Version:"v1beta1", Resource:"replicasets"}, namespaced: true: Stream error http2.StreamError{StreamID:0x11f, Code:0x2, Cause:error(nil)} when reading response body, may be caused by closed connection. Please retry.

@pavel-khritonenko
Author

pavel-khritonenko commented Oct 10, 2019

I just faced the same issue deploying 76 Kubernetes deployments (nothing special: a single deployment with a single container, different env variables). The initial creation was fast and flawless, but updating the same definitions fails for the same reason.

@cppforlife
Contributor

@pavel-khritonenko i've been away on vacation, hence my slow response (coming back next week). meanwhile im intrigued that you mention creation is fast while update is not. could you attach the debug output for creation as well?

given that the above debug log showed that IdentifiedResources: List took ~3 mins, ill have to add more debug logs to that method and make a new release.

@cppforlife
Contributor

cppforlife commented Oct 17, 2019

@pavel-khritonenko would you mind building from develop and running kapp? ive made two changes: (1) cd6e6bb throttles a seemingly expensive operation on your cluster to 10 at a time, and (2) 78ab39c includes more info in --debug.

to build (requires checking out into GOPATH):

git clone https://github.com/k14s/kapp /tmp/kapp-go/src/github.com/k14s/kapp
cd /tmp/kapp-go/src/github.com/k14s/kapp
export GOPATH=/tmp/kapp-go
./hack/build.sh
./kapp ...
rm -rf /tmp/kapp-go

@cppforlife
Contributor

@pavel-khritonenko any update on this issue?

@pavel-khritonenko
Author

pavel-khritonenko commented Oct 29, 2019 via email

@cppforlife
Contributor

@pavel-khritonenko checking in, any updates on this?

@pavel-khritonenko
Author

Yes, sorry for disappearing.

Yesterday I figured out a few things that caused the timeout.

First, I don't specify the revisionHistoryLimit parameter anywhere in my deployments, assuming it will get the default value. However, when deployed with kapp, I see 2147483647 there instead of the default value (10).

Second, we use https://keel.sh to auto-update our deployments, so any commit to a branch triggers a deployment update and creates a new replica set. As a result, we get a lot of replica sets for each of our ~60 deployments; even kubectl get rs failed with a "server EOF" error or similar. kapp failed as well, because it fetches all resources generated by the deployments, so it tries to get all the replica sets to build the diff.

I specified revisionHistoryLimit explicitly in our deployments and the issue disappeared.
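For reference, a minimal sketch of that fix in a Deployment manifest (the value 3 and the app=psql label / sandbox namespace below are just examples), plus a rough way to see how many ReplicaSets a Deployment has piled up:

spec:
  revisionHistoryLimit: 3   # keep only the last few old ReplicaSets instead of the unbounded default

$ kubectl get rs -n sandbox -l app=psql --no-headers | wc -l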

@cppforlife
Contributor

Yesterday I figured out a few things that caused the timeout.

nice finds.

However, when deployed with kapp, I see 2147483647 there instead of the default value (10).

i didnt quite follow this one. are you saying 2147483647 showed up in the diff? who was setting it?

@pavel-khritonenko
Author

I'm not sure where that value comes from, but I haven't set it previously. I cannot reproduce it with the latest version of kapp (0.14) and am trying to reproduce it with earlier versions. What I see in the annotations of one deployment:

                          - type: test
                            path: /spec/progressDeadlineSeconds
                            value: 2147483647
                          - type: remove
                            path: /spec/progressDeadlineSeconds
                          - type: test
                            path: /spec/revisionHistoryLimit
                            value: 2147483647
                          - type: remove
                            path: /spec/revisionHistoryLimit

My build agent is still using version 0.13; I'll share a report when I'm able to reproduce it.

@pavel-khritonenko
Author

pavel-khritonenko commented Nov 27, 2019

Managed to reproduce with version 0.13:

I manually changed revisionHistoryLimit of the psql deployment to 3, then applied the following definition using kapp:

---
apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
  labels:
    app: "psql"
    reloader.stakater.com/auto: "true"
  name: "psql"
  namespace: "sandbox"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "psql"
  template:
    metadata:
      labels:
        app: "psql"
    spec:
      containers:
        - args:
            - "while true; do sleep 30; done;"
          command:
            - "/bin/sh"
            - "-c"
            - "--"
          env:
            - name: "PGHOST"
              valueFrom:
                secretKeyRef:
                  key: "address"
                  name: "db"
            - name: "PGDATABASE"
              valueFrom:
                secretKeyRef:
                  key: "database"
                  name: "db"
            - name: "PGUSER"
              valueFrom:
                secretKeyRef:
                  key: "POSTGRES_USER"
                  name: "db-auth"
            - name: "PGPASSWORD"
              valueFrom:
                secretKeyRef:
                  key: "POSTGRES_PASSWORD"
                  name: "db-auth"
          image: "jbergknoff/postgresql-client"
          imagePullPolicy: "Always"
          name: "psql"
          resources:
            limits:
              cpu: "100m"
              memory: "128Mi"

What I see after applying:

spec:
  progressDeadlineSeconds: 2147483647
  replicas: 1
  revisionHistoryLimit: 2147483647

Then I deleted that deployment manually (kubectl delete deployment psql -n sandbox) and reapplied the manifest above with kapp v0.13; the result was the same definition.
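A quick way to check what the API server actually defaulted on the live object (assuming the same psql deployment in the sandbox namespace):

$ kubectl get deployment psql -n sandbox -o jsonpath='{.spec.revisionHistoryLimit}{"\n"}{.spec.progressDeadlineSeconds}{"\n"}'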

@pavel-khritonenko
Author

It seems it's not related to kapp, because deploying the same manifest with kubectl leads to the same issue.

Finally got it: it seems it's because of the apiVersion. When I specify apps/v1, everything works just fine. With extensions/v1beta1, the server sets the default value of revisionHistoryLimit to 2147483647.
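For anyone following along, a minimal sketch of the relevant change against the manifest above (apps/v1 requires spec.selector, which that manifest already sets; the explicit limit is optional, since apps/v1 defaults revisionHistoryLimit to 10):

apiVersion: "apps/v1"       # instead of "extensions/v1beta1"
kind: "Deployment"
# ...rest of the manifest unchanged...
spec:
  revisionHistoryLimit: 3   # optional; apps/v1 defaults this to 10 instead of 2147483647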

@cppforlife
Contributor

It seems it's not related to kapp, because deploying the same manifest with kubectl leads to the same issue.

yup, sounds like server-side behaviour.

ill close the issue (ill probably file a separate issue about adding a warning when fetching resources takes a long time). thanks for digging in.
