@MichaelKatsoulis
Contributor

@MichaelKatsoulis MichaelKatsoulis commented Aug 3, 2021

This PR tries to fix the bug introduced by #452.

This PR adds a retry when downloading the Elastic Agent manifest from upstream. If the response status code is not 200, or the downloaded file is smaller than 2000 bytes, the code retries fetching the file 5 times before failing.
This way, short network issues can be tolerated.
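
The retry described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `downloadWithRetry`, `fetchFunc`, `httpFetch` and the `pause` parameter are hypothetical, and "5 times" is read here as five attempts in total. The download step is passed in as a function so the loop can be exercised without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

const (
	maxAttempts = 5    // assumption: "5 times" means five attempts in total
	minBytes    = 2000 // size threshold taken from the PR description
)

// fetchFunc performs a single download attempt and returns the HTTP
// status code and the response body. (Hypothetical helper type.)
type fetchFunc func() (int, []byte, error)

// httpFetch returns a fetchFunc that issues a GET request to url.
func httpFetch(url string) fetchFunc {
	return func() (int, []byte, error) {
		resp, err := http.Get(url)
		if err != nil {
			return 0, nil, err
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		return resp.StatusCode, body, err
	}
}

// downloadWithRetry calls fetch until it returns status 200 with at least
// minBytes of content, or until maxAttempts is reached. The error of the
// last failed attempt is kept so it can be returned after the loop.
func downloadWithRetry(fetch fetchFunc, pause time.Duration) ([]byte, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, body, err := fetch()
		switch {
		case err != nil:
			lastErr = fmt.Errorf("attempt %d: request failed: %w", attempt, err)
		case status != http.StatusOK:
			lastErr = fmt.Errorf("attempt %d: unexpected status code %d", attempt, status)
		case len(body) < minBytes:
			lastErr = fmt.Errorf("attempt %d: downloaded %d bytes, expected at least %d", attempt, len(body), minBytes)
		default:
			return body, nil
		}
		if attempt < maxAttempts {
			time.Sleep(pause)
		}
	}
	return nil, lastErr
}

func main() {
	// Stub that fails twice with a 503 and then returns a large enough body.
	calls := 0
	stub := func() (int, []byte, error) {
		calls++
		if calls < 3 {
			return http.StatusServiceUnavailable, nil, nil
		}
		return http.StatusOK, make([]byte, 5084), nil
	}
	body, err := downloadWithRetry(stub, 10*time.Millisecond)
	fmt.Println(len(body), err, calls)
}
```

For a real download, `httpFetch(url)` would be passed in place of the stub. Building each error with `fmt.Errorf` also means a non-200 response yields a non-nil error even when the request itself succeeded.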

@MichaelKatsoulis MichaelKatsoulis marked this pull request as draft August 3, 2021 11:28
@elasticmachine
Collaborator

elasticmachine commented Aug 3, 2021

💚 Build Succeeded

Build stats

  • Start Time: 2021-08-04T10:21:56.763+0000

  • Duration: 45 min 56 sec

  • Commit: 127f587

Test stats 🧪

Test Results
Failed 0
Passed 425
Skipped 4
Total 429

@MichaelKatsoulis
Contributor Author

/test

@MichaelKatsoulis MichaelKatsoulis marked this pull request as ready for review August 3, 2021 13:26
defer resp.Body.Close()
logger.Debugf("status code when downloading elastic-agent-managed-kubernetes.yaml is %d", resp.StatusCode)
if resp.StatusCode != 200 {
return nil, errors.Wrapf(err, "downloading failed due to status code %d", resp.StatusCode)
Contributor

Maybe also print the body here?

return elasticAgentManagedYaml, nil
}
err = fmt.Errorf("bytes downloaded should be more than 2000 but where: %d", len(elasticAgentManagedYaml))
logger.Debugf("failed because of %s", err)
Contributor

Did you check these debugf calls (for example with an invalid URL)?

}
err = fmt.Errorf("bytes downloaded should be more than 2000 but where: %d", len(elasticAgentManagedYaml))
logger.Debugf("failed because of %s", err)
logger.Debugf("file downloaded is %s", string(elasticAgentManagedYaml))
Contributor

I'd discourage printing the document here. If you really want to check whether the content is right, maybe print just its MD5?
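
The suggestion above, logging a digest instead of the whole document, could look roughly like this. It is only a sketch: the helper name `checksum` and the sample content are made up for illustration, and `fmt.Printf` stands in for the project's `logger.Debugf`.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// checksum returns the hex-encoded MD5 digest of the downloaded content.
// (Hypothetical helper, not part of the PR.)
func checksum(content []byte) string {
	return fmt.Sprintf("%x", md5.Sum(content))
}

func main() {
	// Placeholder content standing in for the downloaded manifest.
	manifest := []byte("apiVersion: v1\nkind: ServiceAccount\n")
	// Log the size and digest rather than the full document.
	fmt.Printf("downloaded %d bytes, md5 %s\n", len(manifest), checksum(manifest))
}
```

The digest is short enough for a debug line and can be compared against `md5sum` of the upstream file to confirm the content matches.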

Comment on lines 228 to 229
err = fmt.Errorf("bytes downloaded should be more than 2000 but where: %d", len(elasticAgentManagedYaml))
logger.Debugf("failed because of %s", err)
Contributor

I think you can combine these two lines.

Contributor Author

I need the err variable to be set here so that it can be returned later.


# Dump kubectl details
kubectl describe pods --all-namespaces > build/kubectl-dump.txt
kubectl logs -l app=kind-fleet-agent-clusterscope -n kube-system >> build/kubectl-dump.txt
Contributor

Please grep the codebase to check that there are no remaining references to kind-fleet-agent-clusterscope.

Contributor Author

There are only references in the tests and docs of the kubernetes test package.

@MichaelKatsoulis
Contributor Author

I updated the error handling and error messages. I tested scenarios where the URL is invalid and where fewer bytes than expected are downloaded. The errors are the expected ones.

@mtojek
Contributor

mtojek commented Aug 3, 2021

This is unrelated to this PR, but maybe worth defining in the beats repo:

[2021-08-03T14:50:34.562Z] 2021/08/03 14:50:34 DEBUG downloading elastic-agent-managed-kubernetes.yaml from https://raw.githubusercontent.com/elastic/beats/7.x/deploy/kubernetes/elastic-agent-managed-kubernetes.yaml
[2021-08-03T14:50:34.827Z] 2021/08/03 14:50:34 DEBUG status code when downloading elastic-agent-managed-kubernetes.yaml is 200
[2021-08-03T14:50:34.827Z] 2021/08/03 14:50:34 DEBUG downloaded 5084 bytes
[2021-08-03T14:50:34.827Z] 2021/08/03 14:50:34 DEBUG Apply Kubernetes stdin
[2021-08-03T14:50:34.827Z] 2021/08/03 14:50:34 DEBUG run command: /var/lib/jenkins/workspace/t-manager_elastic-package_PR-461/bin/kubectl apply -f - -o yaml
[2021-08-03T14:50:35.399Z] 2021/08/03 14:50:35 DEBUG Handle "apply" command output
[2021-08-03T14:50:35.399Z] 2021/08/03 14:50:35 DEBUG Extract resources from command output
[2021-08-03T14:50:35.399Z] 2021/08/03 14:50:35 DEBUG Wait for ready resources
[2021-08-03T14:50:35.399Z] 2021/08/03 14:50:35 DEBUG Sync resource info: elastic-agent (kind: DaemonSet, namespace: kube-system)
[2021-08-03T14:50:35.399Z] 2021/08/03 14:50:35 DEBUG Sync resource info: elastic-agent (kind: ClusterRoleBinding, namespace: )
[2021-08-03T14:50:35.659Z] 2021/08/03 14:50:35 DEBUG Sync resource info: elastic-agent (kind: RoleBinding, namespace: kube-system)
[2021-08-03T14:50:35.659Z] 2021/08/03 14:50:35 DEBUG Sync resource info: elastic-agent-kubeadm-config (kind: RoleBinding, namespace: kube-system)
[2021-08-03T14:50:35.659Z] 2021/08/03 14:50:35 DEBUG Sync resource info: elastic-agent (kind: ClusterRole, namespace: )
[2021-08-03T14:50:35.659Z] 2021/08/03 14:50:35 DEBUG Sync resource info: elastic-agent (kind: Role, namespace: kube-system)
[2021-08-03T14:50:35.659Z] 2021/08/03 14:50:35 DEBUG Sync resource info: elastic-agent-kubeadm-config (kind: Role, namespace: kube-system)
[2021-08-03T14:50:35.659Z] 2021/08/03 14:50:35 DEBUG Sync resource info: elastic-agent (kind: ServiceAccount, namespace: kube-system)
[2021-08-03T14:50:35.659Z] 2021/08/03 14:50:35 DEBUG beginning wait for 8 resources with timeout of 10m0s
[2021-08-03T14:50:35.659Z] 2021/08/03 14:50:35 DEBUG install custom Kubernetes definitions (directory: /var/lib/jenkins/workspace/t-manager_elastic-package_PR-461/src/github.com/elastic/elastic-package/test/packages/kubernetes/data_stream/apiserver/_dev/deploy/k8s)
[2021-08-03T14:50:35.659Z] 2021/08/03 14:50:35 DEBUG no custom definitions found (directory: /var/lib/jenkins/workspace/t-manager_elastic-package_PR-461/src/github.com/elastic/elastic-package/test/packages/kubernetes/data_stream/apiserver/_dev/deploy/k8s). Nothing else will be installed.
[2021-08-03T14:50:35.660Z] 2021/08/03 14:50:35 DEBUG GET http://127.0.0.1:5601/api/fleet/agents
[2021-08-03T14:50:35.660Z] 2021/08/03 14:50:35 DEBUG filter agents using criteria: NamePrefix=kind-control-plane
[2021-08-03T14:50:35.660Z] 2021/08/03 14:50:35 DEBUG found 0 enrolled agent(s)
[2021-08-03T14:50:36.597Z] 2021/08/03 14:50:36 DEBUG GET http://127.0.0.1:5601/api/fleet/agents
[2021-08-03T14:50:36.597Z] 2021/08/03 14:50:36 DEBUG filter agents using criteria: NamePrefix=kind-control-plane
[2021-08-03T14:50:36.597Z] 2021/08/03 14:50:36 DEBUG found 0 enrolled agent(s)
[2021-08-03T14:50:37.552Z] 2021/08/03 14:50:37 DEBUG GET http://127.0.0.1:5601/api/fleet/agents
[2021-08-03T14:50:37.811Z] 2021/08/03 14:50:37 DEBUG filter agents using criteria: NamePrefix=kind-control-plane
[2021-08-03T14:50:37.811Z] 2021/08/03 14:50:37 DEBUG found 0 enrolled agent(s)
[2021-08-03T14:50:38.752Z] 2021/08/03 14:50:38 DEBUG GET http://127.0.0.1:5601/api/fleet/agents
[2021-08-03T14:50:38.752Z] 2021/08/03 14:50:38 DEBUG filter agents using criteria: NamePrefix=kind-control-plane
[2021-08-03T14:50:38.752Z] 2021/08/03 14:50:38 DEBUG found 0 enrolled agent(s)

As you can see, the wait logic doesn't work for the resources created here the way it does for kube-state-metrics:

[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Handle "apply" command output
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Extract resources from command output
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Wait for ready resources
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: kube-state-metrics (kind: ClusterRoleBinding, namespace: )
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: kube-state-metrics (kind: ClusterRole, namespace: )
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: hello (kind: CronJob, namespace: default)
[2021-08-03T14:56:34.402Z] W0803 14:56:34.222531  100343 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: kube-state-metrics (kind: Deployment, namespace: kube-system)
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: hello (kind: Job, namespace: default)
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: task-pv-volume (kind: PersistentVolume, namespace: )
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: task-pv-claim (kind: PersistentVolumeClaim, namespace: default)
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: pods-high (kind: ResourceQuota, namespace: default)
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: kube-state-metrics (kind: ServiceAccount, namespace: kube-system)
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: kube-state-metrics (kind: Service, namespace: kube-system)
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Sync resource info: web (kind: StatefulSet, namespace: default)
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG beginning wait for 11 resources with timeout of 10m0s
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready
[2021-08-03T14:56:36.307Z] 2021/08/03 14:56:36 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready
[2021-08-03T14:56:38.871Z] 2021/08/03 14:56:38 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready
[2021-08-03T14:56:40.780Z] 2021/08/03 14:56:40 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready
[2021-08-03T14:56:42.693Z] 2021/08/03 14:56:42 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready
[2021-08-03T14:56:44.604Z] 2021/08/03 14:56:44 DEBUG GET http://127.0.0.1:5601/api/fleet/agents
[2021-08-03T14:56:44.604Z] 2021/08/03 14:56:44 DEBUG filter agents using criteria: NamePrefix=kind-control-plane
[2021-08-03T14:56:44.604Z] 2021/08/03 14:56:44 DEBUG found 1 enrolled agent(s)
[2021-08-03T14:56:44.604Z] 2021/08/03 14:56:44 DEBUG creating test policy...
[2021-08-03T14:56:44.604Z] 2021/08/03 14:56:44 DEBUG POST http://127.0.0.1:5601/api/fleet/agent_policies

I was wondering if this is something we can improve in Beats.

@MichaelKatsoulis
Contributor Author

[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG beginning wait for 11 resources with timeout of 10m0s
[2021-08-03T14:56:34.402Z] 2021/08/03 14:56:34 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready
[2021-08-03T14:56:36.307Z] 2021/08/03 14:56:36 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready
[2021-08-03T14:56:38.871Z] 2021/08/03 14:56:38 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready
[2021-08-03T14:56:40.780Z] 2021/08/03 14:56:40 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready
[2021-08-03T14:56:42.693Z] 2021/08/03 14:56:42 DEBUG Deployment is not ready: kube-system/kube-state-metrics. 0 out of 1 expected pods are ready

Do you mean that the step that checks the readiness of the pods is missing?

@mtojek
Contributor

mtojek commented Aug 4, 2021

Yes. As you can see here, we're using Helm internals (same logic) to wait for resources. I'm wondering what is missing, since it doesn't wait for them. Is it just the lack of a health check or of a Deployment?

Please compare it with: https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Ingest-manager/pipelines/integrations/branches/master/runs/647/nodes/304/steps/5473/log/?start=0

@mtojek mtojek self-requested a review August 4, 2021 07:17
Contributor

@mtojek mtojek left a comment

To be honest, this sounds like a blocker, because with the remote YAML elastic-package doesn't wait for the agent deployment.

@mtojek mtojek self-requested a review August 4, 2021 07:20
Contributor

@mtojek mtojek left a comment

Please also remove this file. I understand that it won't be used anymore.

Contributor

@mtojek mtojek left a comment

LGTM. Please merge it if CI is happy.

3 participants