sysdump: Collect also init container logs #414

Merged

Conversation

joestringer (Member)

Previously we only collected regular container logs and missed the init
container logs. Gather these as well, as they can be useful for debugging.
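
For illustration, here is a minimal client-go sketch of the idea, not the actual cilium-cli implementation: when streaming logs, iterate over a pod's init containers in addition to its regular containers. The helper name `dumpPodLogs` is hypothetical.

```go
package sysdump

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpPodLogs streams the logs of every container in a pod, including init
// containers, into out. Hypothetical helper for illustration only.
func dumpPodLogs(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, out io.Writer) error {
	// Walk init containers first, then regular containers, so the log
	// order matches pod startup order.
	containers := append([]corev1.Container{}, pod.Spec.InitContainers...)
	containers = append(containers, pod.Spec.Containers...)
	for _, c := range containers {
		req := client.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{
			Container:  c.Name,
			Timestamps: true,
		})
		stream, err := req.Stream(ctx)
		if err != nil {
			return err
		}
		_, copyErr := io.Copy(out, stream)
		stream.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}
```

The manual equivalent for a single pod is `kubectl logs <pod> -c <init-container-name>`, since init containers must be selected explicitly with `-c`.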

@michi-covalent (Contributor)

AKS failure expected from forked repos.

@michi-covalent (Contributor)

Multicluster failure is not expected; GKE cluster creation failed 😐

btw it looks like the cilium command isn't there, so the cilium sysdump command failed to run. I guess @nbusseneau's suspicion was right. cc @bmcustodio

https://github.com/cilium/cilium-cli/pull/414/checks?check_run_id=3032486960#step:16:48

@joestringer force-pushed the submit/sysdump-collect-init-containers branch from 5e60cac to f6767ec on July 9, 2021 20:46
@joestringer temporarily deployed to ci on July 9, 2021 20:46
@nbusseneau left a comment (Member)

Code changes LGTM, nice catch.

@nbusseneau (Member) commented Jul 12, 2021

> btw it looks like the cilium command isn't there, so the cilium sysdump command failed to run. I guess @nbusseneau's suspicion was right. cc @bmcustodio

Not sure what you're referring to, but in the multicluster workflow run (https://github.com/cilium/cilium-cli/runs/3032615794) it's expected that cilium is not there, because the cluster did not even start. Since we run the command in-cluster, there was no time to set that up :D

PS: I recommend linking using permanent Actions links rather than Check links, which are temporary. For example, for the multicluster workflow linked in my comment:
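
Both URLs below are taken verbatim from earlier in this thread, to illustrate the difference:

```
temporary Check link:       https://github.com/cilium/cilium-cli/pull/414/checks?check_run_id=3032486960
permanent Actions run link: https://github.com/cilium/cilium-cli/runs/3032615794
```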

@bmcustodio left a comment (Contributor)

Nice one! 🚀

@joestringer (Member, Author)

The following actions are failing:

  • AKS - Failed because I submitted the PR from a separate fork.
  • External Workloads - Failed due to the "Missing or incomplete configuration info." error:

    ```
    kubectl logs --timestamps -n kube-system job/cilium-cli-install
    shell: /usr/bin/bash -e ***0***
    ...
    error: Missing or incomplete configuration info.  Please point to an existing, complete config file:
    ```

  • GKE - Failed a connectivity check by the looks of it.
    • The test report says:

      ```
      2021-07-09T20:57:46.525388447Z 📋 Test Report
      2021-07-09T20:57:46.525503713Z ❌ 1/9 tests failed (0/74 actions), 0 tests skipped, 0 scenarios skipped:
      2021-07-09T20:57:46.525590033Z Test [dns-only]:
      2021-07-09T20:57:46.525875565Z
      2021-07-09T20:57:46.525901701Z Error: Connectivity test failed: 1 tests failed
      error: cannot exec into a container in a completed pod; current phase is Failed
      ...
      /home/runner/work/_temp/d5b10418-40fb-4e1d-9c33-5f27c4d98078.sh: line 4: cilium: command not found
      Error: Process completed with exit code 127.
      ```

    • The only string matches for "failed" or "dns-only" in the "Post-test installation logs" section of the action output are this one and one outside the logs. Despite reporting that there was a test failure, I can't find the failure itself.
    • The only other odd thing is that the CLI container is crashing:

      ```
      kube-system   cilium-cli-ckwld   0/1   Error   0   15m   10.168.15.215   gke-cilium-cilium-cli-10-default-pool-f2d20591-55n8   <none>   <none>
      ```

    • Upload artifacts failed with:

      ```
      Warning: No files were found with the provided path: cilium-sysdump-out.zip. No artifacts will be uploaded.
      ```

  • Multicluster - Failed provisioning due to an invalid argument:

    ```
    ERROR: (gcloud.container.clusters.create) Operation [<Operation
    clusterConditions: [<StatusCondition
    canonicalCode: CanonicalCodeValueValuesEnum(INVALID_ARGUMENT, 3)
    message: 'The network "default" does not have available private IP space in 10.0.0.0/9 to reserve a /14 block for pods for cluster ***Zone=us-west2-a, ProjectNum=185287498374, ProjectName=***, ClusterName=cilium-cilium-cli-1016427082-mesh-1, ClusterHash=e186c3abb89943bdaf240e0b6aad55b993854d496a384d0dbe801fe7c4c858c5***.'>]
    ```

How do we proceed?

@tklauser (Member)

> How do we proceed?

Thanks for the detailed analysis @joestringer!

The Multicluster and External Workloads failures look like some sort of temporary infrastructure issue to me. For the GKE workflow we lack a sysdump which would allow further investigation. Given that, I'd propose rebasing this PR onto latest master, now that #423 and #426 are merged. These should restore sysdump creation on workflow failure/cancellation and would give us at least a sysdump for the failing GKE workflow, in case it turns out not to be a flake.
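
For reference, the restored behaviour is roughly of the following shape: run the sysdump and upload it as an artifact even when an earlier step failed or the workflow was cancelled. This is an illustrative sketch rather than the exact steps from #423/#426; the step names and the precise sysdump invocation are assumptions, while actions/upload-artifact and the failure()/cancelled() status functions are standard GitHub Actions features.

```yaml
# Hypothetical workflow excerpt; not copied from #423/#426.
- name: Post-test sysdump
  if: failure() || cancelled()   # run even when an earlier step broke
  run: cilium sysdump --output-filename cilium-sysdump-out

- name: Upload sysdump artifact
  if: failure() || cancelled()
  uses: actions/upload-artifact@v2
  with:
    name: cilium-sysdump-out.zip
    path: cilium-sysdump-out.zip
```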

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer force-pushed the submit/sysdump-collect-init-containers branch from f6767ec to 4b37c97 on July 14, 2021 21:35
@joestringer temporarily deployed to ci on July 14, 2021 21:35
@nbusseneau (Member)

Agree with Tobias. The only blocking flake was the GKE one, but lacking a sysdump it's kinda hard to know which one it was :/

@tklauser (Member)

The only failing test is on AKS, which is expected as this PR was opened from a fork.

Merging, thanks!

@tklauser merged commit fb0f0f6 into cilium:master on Jul 15, 2021
@joestringer deleted the submit/sysdump-collect-init-containers branch on July 15, 2021 16:33