
Resources under the single app limit issue #41

Closed
pavel-khritonenko opened this issue Sep 29, 2019 · 23 comments
Labels
discussion: This issue is not a bug or feature and a conversation is needed to find an appropriate resolution
helping with an issue: Debugging happening to identify the problem
in progress: Work has begun by a community member or a maintainer; this issue may be included in a future release

Comments

@pavel-khritonenko

pavel-khritonenko commented Sep 29, 2019

I use kapp to manage a set of deployments. Under a single application, I deploy about ~230 (generated) resources. At some point, deployments started taking a long time, and after adding more resources it stopped working at all. It hangs for a couple of minutes and then I get the following error when I run it locally:

Error: Listing schema.GroupVersionResource{Group:"apps", Version:"v1", Resource:"replicasets"}, namespaced: true: Stream error http2.StreamError{StreamID:0x10b, Code:0x2, Cause:error(nil)} when reading response body, may be caused by closed connection. Please retry.

When I run it closer to the target Kubernetes cluster (in the same AWS network), it works better (fails less often).

$ kapp version
Client Version: 0.11.0

Succeeded

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-19T13:57:45Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.6-eks-5047ed", GitCommit:"5047edce664593832e9b889e447ac75ab104f527", GitTreeState:"clean", BuildDate:"2019-08-21T22:32:40Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
@cppforlife added the "discussion" and "helping with an issue" labels on Oct 1, 2019
@cppforlife
Contributor

cppforlife commented Oct 1, 2019

hey @pavel-khritonenko

Under a single application, I deploy about ~230 (generated) resources. At some point, deployments started taking a long time, and after adding more resources it stopped working at all

that's interesting... how long does it hang until the error?

It hangs for a couple of minutes and then I get the following error when I run it locally:

does it hang before showing the diff (im guessing so since the error includes Listing ...)?

can you describe your cluster a bit more, more specifically:

  • is it mostly the same resource type (eg ConfigMaps) within this application?
  • how many resource types are there (kubectl api-resources)?
  • do you limit your user account (used by kapp) to specific namespace(s), or are all resources available to it?
  • what cluster provider are you using (eg GKE, EKS, etc.)?

@cppforlife
Contributor

cppforlife commented Oct 1, 2019

im also curious how long the following command takes against your cluster: https://gist.github.com/cppforlife/25890e4a9e732413bbf83c81e4a808b1 (122 resources to be created, ~3s to check the cluster and calculate the diff)

@cppforlife
Contributor

im also creating a new release of kapp that includes a debug flag so that we can get to the bottom of what's going on in your cluster.

@pavel-khritonenko
Author

Sorry for disappearing:

how long does it hang until the error?

Executed several times on my local machine (100 Mbit wifi):
4 times it failed in about 1 minute (00:01:09 - 00:01:11); the 5th and 6th times it succeeded in 0:01:30-0:01:38.

does it hang before showing the diff

correct, before diff

[screenshot attached: 2019-10-03 at 13:51:18]

is it mostly the same resource type (eg ConfigMaps) within this application?

It's about 20 instances of the same application with different settings (2 Deployments, 1 Certificate CRD, 2 PodDisruptionBudgets, 2 ConfigMaps, 1 Ingress, 2 Services each).

kubectl api-resources

https://gist.github.com/pavel-khritonenko/a4ffb3bec510a1d4d1a3b419cfd92993

do you limit your user account (used by kapp) to specific namespace(s) or are all resources available to it?

Cluster admin permissions (no limits)

what cluster provider are you using

EKS (amazon web services)

@pavel-khritonenko
Author

I have since added the deployment to our CI/CD process and run it manually from GitLab on a runner near the cluster (in the same subnet), and there it never fails.

@pavel-khritonenko
Author

im also curious how long the following command takes against your cluster: https://gist.github.com/cppforlife/25890e4a9e732413bbf83c81e4a808b1 (122 resources to be created, ~3s to check the cluster and calculate the diff)

19 seconds, success

@pavel-khritonenko
Author

pavel-khritonenko commented Oct 3, 2019

@cppforlife
Contributor

cppforlife commented Oct 3, 2019

19 seconds, success

oh, that's interesting. im using a default GKE cluster with 3 nodes and would have expected a similar response time (~3s).

EKS (amazon web services)

how beefy are the control plane machines? not sure if AWS tells you those details.

@pavel-khritonenko
Author

They don't tell us such things. If you want me to run any requests with timings, I'd be glad to test for you.

@cppforlife added the "in progress" label on Oct 3, 2019
@cppforlife
Contributor

@pavel-khritonenko would you mind trying out https://github.com/k14s/kapp/releases/tag/v0.14.0 with the --debug flag and posting the results?

@pavel-khritonenko
Author

pavel-khritonenko commented Oct 4, 2019

$ cue dump | kapp deploy --debug --wait=false -a frontends -f -
02:24:51PM: debug: CommandRun: start
02:24:51PM: debug: RecordedApp: CreateOrUpdate: start
02:24:52PM: debug: RecordedApp: CreateOrUpdate: end
02:24:54PM: debug: LabeledResources: Prepare: start
02:24:54PM: debug: LabeledResources: Prepare: end
02:24:54PM: debug: LabeledResources: AllAndMatching: start
02:24:54PM: debug: LabeledResources: All: start
02:24:54PM: debug: IdentifiedResources: List: start
02:27:15PM: debug: IdentifiedResources: List: end
02:27:15PM: debug: LabeledResources: All: end
02:27:15PM: debug: LabeledResources: AllAndMatching: end
02:27:15PM: debug: CommandRun: end

Error: Listing schema.GroupVersionResource{Group:"extensions", Version:"v1beta1", Resource:"replicasets"}, namespaced: true: Stream error http2.StreamError{StreamID:0x11f, Code:0x2, Cause:error(nil)} when reading response body, may be caused by closed connection. Please retry.

@pavel-khritonenko
Author

pavel-khritonenko commented Oct 10, 2019

I just faced the same issue deploying 76 Kubernetes deployments (nothing special: a single deployment with a single container, different env variables). The initial creation was fast and flawless, but updating the same definitions fails for the same reason.

@cppforlife
Contributor

@pavel-khritonenko i've been away on vacation, hence my slow response (coming back next week). meanwhile im intrigued that you mention creation is fast while update is not. could you attach the debug output for creation as well?

given that the above debug log showed that IdentifiedResources: List took ~3 mins, ill have to add more debug logs to that method and make a new release.

@cppforlife
Contributor

cppforlife commented Oct 17, 2019

@pavel-khritonenko would you mind building from develop and running kapp? ive made two changes: (1) cd6e6bb throttles a seemingly expensive operation on your cluster to 10 at a time, and (2) 78ab39c includes more info in --debug.

to build (requires checking out into GOPATH):

git clone https://github.com/k14s/kapp /tmp/kapp-go/src/github.com/k14s/kapp
cd /tmp/kapp-go/src/github.com/k14s/kapp
export GOPATH=/tmp/kapp-go
./hack/build.sh
./kapp ...
rm -rf /tmp/kapp-go

@cppforlife
Contributor

@pavel-khritonenko any update on this issue?

@pavel-khritonenko
Author

pavel-khritonenko commented Oct 29, 2019 via email

@cppforlife
Contributor

@pavel-khritonenko checking in, any updates on this?

@pavel-khritonenko
Author

Yes, sorry for disappearing.

Yesterday I figured out a few things that caused the timeout.

First, I don't specify the revisionHistoryLimit parameter anywhere in my deployments, assuming it will get the default value. However, when deployed with kapp, I see 2147483647 there instead of the default value (10).

Second, we use https://keel.sh to auto-update our deployments, so any commit to a branch triggers a deployment update and creates a new replica set. As a result, we get a lot of replica sets for each of our ~60 deployments; even kubectl get rs failed with a "server EOF" error or similar. kapp failed as well, because it fetches all resources generated by the deployments, so it tries to get all the replica sets to build the diff.

I specified revisionHistoryLimit explicitly in our deployments and the issue disappeared.
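For reference, a minimal sketch of that fix in a Deployment manifest (the value 3 and the app=psql label / sandbox namespace below are just examples), plus a rough way to see how many ReplicaSets a Deployment has piled up:

spec:
  revisionHistoryLimit: 3   # keep only the last few old ReplicaSets instead of the unbounded default

$ kubectl get rs -n sandbox -l app=psql --no-headers | wc -l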

@cppforlife
Contributor

Yesterday I figured out a few things that caused the timeout.

nice finds.

However, when deployed with kapp, I see 2147483647 there instead of the default value (10).

i didnt quite follow this one. are you saying 2147483647 showed up in the diff? who was setting it?

@pavel-khritonenko
Author

I'm not sure where that value comes from, but I haven't set it previously. I cannot reproduce it with the latest version of kapp (0.14) and am trying to reproduce it with earlier versions. What I see in the annotations of one deployment:

                          - type: test
                            path: /spec/progressDeadlineSeconds
                            value: 2147483647
                          - type: remove
                            path: /spec/progressDeadlineSeconds
                          - type: test
                            path: /spec/revisionHistoryLimit
                            value: 2147483647
                          - type: remove
                            path: /spec/revisionHistoryLimit

My build agent is still using version 0.13; I'll share a report when I'm able to reproduce it.

@pavel-khritonenko
Author

pavel-khritonenko commented Nov 27, 2019

Managed to reproduce with version 0.13:

I manually changed revisionHistoryLimit of the psql deployment to 3, then applied the following definition using kapp:

---
apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
  labels:
    app: "psql"
    reloader.stakater.com/auto: "true"
  name: "psql"
  namespace: "sandbox"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "psql"
  template:
    metadata:
      labels:
        app: "psql"
    spec:
      containers:
        - args:
            - "while true; do sleep 30; done;"
          command:
            - "/bin/sh"
            - "-c"
            - "--"
          env:
            - name: "PGHOST"
              valueFrom:
                secretKeyRef:
                  key: "address"
                  name: "db"
            - name: "PGDATABASE"
              valueFrom:
                secretKeyRef:
                  key: "database"
                  name: "db"
            - name: "PGUSER"
              valueFrom:
                secretKeyRef:
                  key: "POSTGRES_USER"
                  name: "db-auth"
            - name: "PGPASSWORD"
              valueFrom:
                secretKeyRef:
                  key: "POSTGRES_PASSWORD"
                  name: "db-auth"
          image: "jbergknoff/postgresql-client"
          imagePullPolicy: "Always"
          name: "psql"
          resources:
            limits:
              cpu: "100m"
              memory: "128Mi"

What I see after applying:

spec:
  progressDeadlineSeconds: 2147483647
  replicas: 1
  revisionHistoryLimit: 2147483647

Then I deleted that deployment manually (kubectl delete deployment psql -n sandbox) and reapplied the manifest above with kapp v0.13; the result was the same definition.
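A quick way to check what the API server actually defaulted on the live object (assuming the same psql deployment in the sandbox namespace):

$ kubectl get deployment psql -n sandbox -o jsonpath='{.spec.revisionHistoryLimit}{"\n"}{.spec.progressDeadlineSeconds}{"\n"}'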

@pavel-khritonenko
Author

It seems it's not related to kapp, because deploying the same manifest with kubectl leads to the same issue.

Finally got it: it seems it's because of the apiVersion. When I specify apps/v1, everything works just fine. With extensions/v1beta1, the server sets the default value of revisionHistoryLimit to 2147483647.
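For anyone following along, a minimal sketch of the relevant change against the manifest above (apps/v1 requires spec.selector, which that manifest already sets; the explicit limit is optional, since apps/v1 defaults revisionHistoryLimit to 10):

apiVersion: "apps/v1"       # instead of "extensions/v1beta1"
kind: "Deployment"
# ...rest of the manifest unchanged...
spec:
  revisionHistoryLimit: 3   # optional; apps/v1 defaults this to 10 instead of 2147483647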

@cppforlife
Contributor

It seems it's not related to kapp, because deploying the same manifest with kubectl leads to the same issue.

yup, sounds like server-side behaviour.

ill close the issue (ill probably file a separate issue about adding a warning when fetching resources takes a long time). thanks for digging in.
