Stacktrace when scanning k8s containers and no info about the problem #2101

Closed
ymettier opened this issue May 26, 2024 · 8 comments · Fixed by #2107
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. target/kubernetes Issues relating to kubernetes cluster scanning

Comments

@ymettier

Using

  • Trivy-operator 0.21.1
  • from official helm chart 0.23.1
  • with values.yaml at the end of the report (see below)
  • on Kubernetes 1.30.1

I have 2 problems. The first one, I cannot investigate because of the second one. So this issue is not about the first problem but about the second one.

How I reproduce the problem

Here is the second problem.

➜ kubectl -n trivy-system logs trivy-operator-8797d6d8b-78zhh
...
{"level":"error","ts":"2024-05-26T16:19:53Z","logger":"reconciler.scan job","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-687bbcc85b","container":"<REDACTED>","status.reason":"Error","status.message":"","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).completedContainers\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:353\ngithub.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:80\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/reconcile/reconcile.go:113\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:222"}

To make it easier to read, here is the stacktrace field of the above message, printed with printf:

github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).completedContainers
	/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:353
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1
	/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:80
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/reconcile/reconcile.go:113
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:222
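
For readers who prefer not to do this by hand, here is a minimal Go sketch of the same idea: decode one JSON log line and print its "stacktrace" field so the escaped newlines and tabs become real ones. The function name stacktraceOf and the shortened sample line are made up for illustration; only the "stacktrace" key comes from the log shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stacktraceOf extracts the "stacktrace" field from one JSON log line.
// Decoding the JSON turns the escaped \n and \t sequences into real
// newlines and tabs, which is what makes the trace readable.
// (Hypothetical helper, not part of trivy-operator.)
func stacktraceOf(logLine string) (string, error) {
	var entry struct {
		Stacktrace string `json:"stacktrace"`
	}
	if err := json.Unmarshal([]byte(logLine), &entry); err != nil {
		return "", fmt.Errorf("decoding log line: %w", err)
	}
	return entry.Stacktrace, nil
}

func main() {
	// Shortened, made-up log line in the same shape as the operator log.
	line := `{"level":"error","msg":"Scan job container","stacktrace":"pkg.Func\n\t/path/file.go:1"}`

	trace, err := stacktraceOf(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(trace)
}
```

A line that is not valid JSON yields an error instead of an empty trace, so truncated log lines are easy to spot.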

In the source code I understand that

  • the container scan was completed
  • it failed for some reason
  • the status.message points to itself in the log.Error
  • the stacktrace also points to the log.Error

The issue

The issue is that something goes wrong when the operator scans a container. I can see it in the logs, but

  • I get a stacktrace: I should not get one, because stacktraces are for when the software itself is buggy
  • neither the log nor the stacktrace gives any information about what does not work.

I will try to investigate my first problem, but for this one, could you make the logs more explicit please?
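
To illustrate what "more explicit" could mean, here is a hypothetical Go sketch, not the operator's actual code and not the change made in #2107. It builds a single message from the terminated container's reason, exit code and message, and falls back to the tail of the scan pod's logs when the message is empty, as it is in the log above. The terminatedState type is a local stand-in for the Reason, Message and ExitCode fields of a Kubernetes terminated container state; describeFailure and the sample log text are invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// terminatedState is a local stand-in for the fields of a terminated
// container state that matter here (assumption: Reason, Message, ExitCode).
type terminatedState struct {
	Reason   string
	Message  string
	ExitCode int32
}

// describeFailure builds one human-readable line about a failed scan
// container. It prefers the termination message and falls back to the
// tail of the container logs when that message is empty.
// (Hypothetical helper, not part of trivy-operator.)
func describeFailure(container string, st terminatedState, logTail string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "container %q terminated: reason=%s exitCode=%d", container, st.Reason, st.ExitCode)
	switch {
	case st.Message != "":
		fmt.Fprintf(&b, " message=%q", st.Message)
	case strings.TrimSpace(logTail) != "":
		fmt.Fprintf(&b, " lastLogs=%q", strings.TrimSpace(logTail))
	default:
		b.WriteString(" (no termination message or logs available)")
	}
	return b.String()
}

func main() {
	// Same situation as in the report: reason "Error", empty message.
	st := terminatedState{Reason: "Error", ExitCode: 1}
	// Placeholder text standing in for whatever the scan pod printed.
	fmt.Println(describeFailure("my-container", st, "example: could not pull image\n"))
}
```

The point of the fallback is that an empty status.message no longer leaves the reader with only the container name.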

my helm values

fullnameOverride: "trivy-operator"

operator:
  # -- builtInTrivyServer The flag enables the usage of built-in trivy server in cluster. It also overrides the following trivy params with built-in values
  # trivy.mode = ClientServer and serverURL = http://<serverServiceName>.<trivy operator namespace>:4975
  builtInTrivyServer: true
  privateRegistryScanSecretsNames: { "trivy-system": "<REDACTED>" }

serviceMonitor:
  enabled: true
  labels:
    k8s-app: "trivy-operator"

trivy:
  # -- mode is the Trivy client mode. Either Standalone or ClientServer. Depending
  # on the active mode other settings might be applicable or required.
  mode: ClientServer

  # -- whether to use a storage class for trivy server or emptydir (one may want to use ephemeral storage)
  storageClassEnabled: true

  # -- storageClassName is the name of the storage class to be used for trivy server PVC. If empty, tries to find default storage class
  storageClassName: "<REDACTED>"

  # -- storageSize is the size of the trivy server PVC
  storageSize: "5Gi"

  # -- serverInsecure is the flag to enable insecure connection to the Trivy server.
  serverInsecure: false
  insecureRegistries:
    internalRegistry: <REDACTED>

nodeCollector:
  tolerations:
    - operator: Exists
      effect: NoSchedule
@ymettier ymettier added the kind/bug Categorizes issue or PR as related to a bug. label May 26, 2024
@ymettier
Author

I forgot to say that the pod running the operator is stable and has been up and running for days. No issue there.

The stacktrace I mention is not in the operator but probably in the pod that runs the vulnerability scan.

And the fix for issue 2101 (this issue) is probably just to return something other than a stacktrace in the logs, because Trivy-operator is stable and working. It should be something that helps to find out why the pod running the vulnerability scan failed (I have its name in the log, but only its name).

@chen-keinan
Collaborator

@ymettier does it happen on a specific image? Every time? Can you share it (if it is public)?

@ymettier
Author

I fixed my initial problem. I was switching from one private registry to another. Both are "insecure", but I forgot to add the new one to the configuration. My first problem is fixed.
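
This kind of omission can be caught with a small check. The Go sketch below is purely illustrative and not part of trivy-operator: registryHost extracts the registry part of an image reference using the usual Docker convention (the first path component counts as a registry only if it contains a dot or a colon, or is "localhost"; otherwise the image lives on docker.io), and isListed reports whether that host appears among the values of a map shaped like trivy.insecureRegistries. Both function names and the sample hosts are assumptions, and the check only makes sense for registries already known to be insecure.

```go
package main

import (
	"fmt"
	"strings"
)

// registryHost returns the registry part of an image reference.
// Convention assumed here: the first path component is a registry only
// if it contains '.' or ':' or equals "localhost"; otherwise docker.io.
func registryHost(image string) string {
	first, _, found := strings.Cut(image, "/")
	if found && (strings.ContainsAny(first, ".:") || first == "localhost") {
		return first
	}
	return "docker.io"
}

// isListed reports whether the image's registry host appears among the
// values of a name -> host map (same shape as trivy.insecureRegistries).
// (Hypothetical helper for illustration only.)
func isListed(image string, registries map[string]string) bool {
	host := registryHost(image)
	for _, configured := range registries {
		if configured == host {
			return true
		}
	}
	return false
}

func main() {
	// Made-up configuration: only the old registry is listed.
	insecure := map[string]string{"internalRegistry": "old-registry.example:5000"}

	for _, image := range []string{
		"old-registry.example:5000/team/app:1.0",
		"new-registry.example:5000/team/app:1.0",
	} {
		fmt.Printf("%s listed=%v\n", image, isListed(image, insecure))
	}
}
```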

About this issue, it happened on all images that were on my private registry. Every time of course.

In fact, the problem should appear at least every time that Trivy fails to retrieve an image.

You can reproduce my problem by creating a deployment/statefulset with a broken image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: issue-2101
spec:
  replicas: 1
  selector:
    matchLabels:
      app: issue-2101
  template:
    metadata:
      name: issue-2101
      labels:
        app: issue-2101
    spec:
      containers:
      - image: nowhere.nodomain.com:12345/nothing/noimage:latest
        imagePullPolicy: IfNotPresent
        name: issue-2101

How can you troubleshoot the problem when the only log you have is the one with the stacktrace?

@chen-keinan
Collaborator

chen-keinan commented May 27, 2024

@ymettier can you please elaborate on the problems:

  • are you able to scan images from a private registry at all?

@sheeeng

sheeeng commented May 27, 2024

To make it easier to read, let's show the above message with printf :

github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).completedContainers
	/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:353
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1
	/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:80
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/reconcile/reconcile.go:113
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:222

I was able to reproduce the same stacktrace with a minimal example as described in #2095 (comment).

@ymettier
Author

When the private registries are correctly configured: it works.
When an insecure registry is not configured: I get the stacktrace.

However, given that the above deployment also reproduces the problem, I guess that badly configured private registries are just another way to trigger it.

@chen-keinan chen-keinan added priority/backlog Higher priority than priority/awaiting-more-evidence. target/kubernetes Issues relating to kubernetes cluster scanning labels May 28, 2024
@urcus

urcus commented May 31, 2024

Hello,

I encountered the same issue. From your discussion, I understand that it seems to come from my private registry configuration. But where? I don't know (even after a lot of research and testing)...
Is it possible to make the error in the log more explicit than just the stacktrace? That might help people find where the issue comes from, as "controller-runtime@v0.18.2/pkg/internal/controller/controller.go:222" is not really helpful.

@ymettier
Author

I share @urcus's concern, as it was my "first problem" in the description of the issue.

However, this issue is explicitly not about that, and it is solved and closed. A new issue should be opened to get more explicit information when Trivy fails.

(and thanks @chen-keinan for the fix! )
