Faulty scan jobs blocking further scans from being executed #295

Open
VF-mbrauer opened this issue Jul 12, 2022 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. target/kubernetes Issues relating to kubernetes cluster scanning

Comments

@VF-mbrauer
Contributor

What steps did you take and what happened:

Due to the error reported in #206, scan jobs get stuck.
When this happens, other pods will not be scanned anymore: once the OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT is reached, no new scan pods are spawned because trivy-operator is still waiting for the stuck ones to finish.

Example (due to the error in #206):

scan-vulnerabilityreport-5759f44647--1-qf7sh   0/1     Completed   0          7m49s
scan-vulnerabilityreport-7d57cffd5f--1-47vds   0/1     Completed   0          2m58s
scan-vulnerabilityreport-849fffd5c7--1-p9fdt   0/1     Completed   0          6m58s
scan-vulnerabilityreport-dc5fb6cf--1-xq5kw     0/1     Completed   0          7m28s
scan-vulnerabilityreport-f49679dcc--1-cvd8x    0/1     Completed   0          118s

What did you expect to happen:
Even if jobs get stuck due to an unforeseen error, they should be released after some time to make sure that scanning
can continue with other repositories/registries. Otherwise, no further scans happen at all.

Anything else you would like to add:

If the job/pod is deleted manually, trivy-operator will likely pick up another remaining deployment to scan, and
scanning continues; but as soon as it comes back to the deployment that triggers the error, the pod gets stuck again.
So to get all deployments scanned you need to increase OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT to a high value,
and you need to frequently delete all hung jobs/pods to give trivy-operator the freedom to spawn new scans.

Environment:

  • Trivy-Operator version: 0.1.0
  • Kubernetes version: 1.22
@VF-mbrauer VF-mbrauer added the kind/bug Categorizes issue or PR as related to a bug. label Jul 12, 2022
@chen-keinan
Contributor

chen-keinan commented Jul 12, 2022

@VF-mbrauer thanks for reporting this, I will review it and provide an update.

@chen-keinan chen-keinan self-assigned this Jul 12, 2022
@chen-keinan chen-keinan added target/kubernetes Issues relating to kubernetes cluster scanning priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jul 12, 2022
@erikgb
Contributor

erikgb commented Jul 12, 2022

I suggest fixing this by making the limiter not count finished (completed/failed) jobs. That will also be a requirement for #228 when used with the limiting feature, which I want to do. WDYT @chen-keinan? I can work on it if you agree with the suggested approach.
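
For reference, here is a minimal sketch (not trivy-operator code; hasCapacity, scanJobs and limit are illustrative names) of what counting only in-flight jobs against the limit could look like, assuming the operator lists its scan Jobs via the upstream batch/v1 API:

package limiter

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFinished reports whether a Job already has a Complete or Failed
// condition set to true.
func jobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// hasCapacity reports whether another scan Job may be scheduled:
// only jobs that are still in flight count against the concurrency limit.
func hasCapacity(scanJobs []batchv1.Job, limit int) bool {
	inFlight := 0
	for i := range scanJobs {
		if !jobFinished(&scanJobs[i]) {
			inFlight++
		}
	}
	return inFlight < limit
}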

@VF-mbrauer
Contributor Author

VF-mbrauer commented Jul 12, 2022

@erikgb, we need to be careful, because this will also lead to higher resource consumption, as the completed ones will still occupy vCPU and MEM at that time. Therefore, we need to calculate and document a slight increase in resources, even if you limit them with OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT set to a specific value.

@chen-keinan cc

@erikgb
Contributor

erikgb commented Jul 12, 2022

I think you mean completed ones in

lead to resource consumption, as the not completed ones will still occupy vCPU and MEM at that time.

?

@VF-mbrauer
Contributor Author

Yes, you are right, I corrected my statement already. Thanks for that.

@VF-mbrauer
Contributor Author

Something like activeDeadlineSeconds would make sense to remove jobs/pods after some time and make room again for new scans to be initiated. This is just to prevent the stoppage; we should still drive the fix for ticket #206 as well.
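
For illustration, a minimal sketch of where such a deadline (and, where the cluster supports it, a TTL) would sit on a Job spec built with k8s.io/api/batch/v1; the function name and the values are assumptions, not the operator's actual code:

package scanjob

import (
	batchv1 "k8s.io/api/batch/v1"
)

// withTimeouts sets a hard runtime deadline and an opt-in TTL on a scan Job.
func withTimeouts(job *batchv1.Job) {
	deadline := int64(300) // terminate pods of a Job still running after 5 minutes
	ttl := int32(120)      // delete the finished Job object 2 minutes after it finishes
	job.Spec.ActiveDeadlineSeconds = &deadline
	// Requires the TTL-after-finished controller, which may not be enabled
	// by default on older clusters (see the comment further down).
	job.Spec.TTLSecondsAfterFinished = &ttl
}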

@chen-keinan
Contributor

chen-keinan commented Jul 12, 2022

@erikgb sure, pick it up. I agree with @VF-mbrauer that there is a concern around completed jobs piling up before cleanup has taken place. We can't count on the opt-in TTL, as it probably won't be available by default in all k8s versions.

@chen-keinan chen-keinan assigned erikgb and unassigned chen-keinan Jul 12, 2022
@erikgb
Contributor

erikgb commented Jul 12, 2022

@VF-mbrauer You can actually set activeDeadlineSeconds on scan jobs by configuring OPERATOR_SCAN_JOB_TIMEOUT. It seems to have a default value of 5m. I don't have an environment where I can reproduce your problem. Maybe you can try it and see if it helps?

@VF-mbrauer
Contributor Author

VF-mbrauer commented Jul 12, 2022

@erikgb The default of 5 minutes does not seem to help in this case either, as jobs were sitting there for hours/days. So even lowering that value to 1 minute or similar seems to be useless.
We should also check the meaning of this setting and where it really has an effect.

Reproducing the issue is not necessary as the result is already in my statement above.

@erikgb
Contributor

erikgb commented Jul 12, 2022

@erikgb The default of 5 minutes does not seem to help in this case either, as jobs were sitting there for hours/days. So even lowering that value to 1 minute or similar seems to be useless. We should also check the meaning of this setting and where it really has an effect.

Hmmm, interesting! Can you check your scan Job YAML to see whether it includes activeDeadlineSeconds, and what the value is?

@VF-mbrauer
Contributor Author

@erikgb The default of 5 minutes does not seem to help in this case either, as jobs were sitting there for hours/days. So even lowering that value to 1 minute or similar seems to be useless. We should also check the meaning of this setting and where it really has an effect.

Hmmm, interesting! Can you check your scan Job YAML to see whether it includes activeDeadlineSeconds, and what the value is?

This is an extract of the Job YAML:

 spec:
    activeDeadlineSeconds: 300
    backoffLimit: 0
    completionMode: NonIndexed
    completions: 1
    parallelism: 1

It contains activeDeadlineSeconds, which is set to 300 seconds (5 minutes).

But the Age is over 14 hours:

#kubectl get job -n trivy-system                                                                               
NAME                                 COMPLETIONS   DURATION   AGE
scan-vulnerabilityreport-6f4658c97   1/1           112s       14h

#kubectl get pod -n trivy-system                                                                            
NAME                                          READY   STATUS      RESTARTS   AGE
scan-vulnerabilityreport-6f4658c97--1-slxns   0/1     Completed   0          14h
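
A likely explanation: activeDeadlineSeconds only limits how long a Job may stay active; it terminates still-running pods, but it never deletes the Job object, so a Job that already reached Completed sits there until something deletes it (the operator, a TTL, or a human). Below is a minimal client-go sketch of a cleanup that deletes finished Jobs; the package, function and namespace handling are assumptions, not the operator's implementation:

package cleanup

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupFinishedScanJobs deletes every Job in the namespace that has a
// Complete or Failed condition, together with its pods.
func cleanupFinishedScanJobs(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	jobs, err := cs.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	propagation := metav1.DeletePropagationBackground // also remove the Job's pods
	for i := range jobs.Items {
		if !jobFinished(&jobs.Items[i]) {
			continue
		}
		if err := cs.BatchV1().Jobs(namespace).Delete(ctx, jobs.Items[i].Name,
			metav1.DeleteOptions{PropagationPolicy: &propagation}); err != nil {
			return err
		}
	}
	return nil
}

// jobFinished reports whether a Job has a Complete or Failed condition set to true.
func jobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}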

@VF-mbrauer
Contributor Author

VF-mbrauer commented Jul 14, 2022

Here is one of the installations, which shows that scanning is completely blocked:

kubectl get pod -n trivy-system                                                                                                                 
NAME                                           READY   STATUS      RESTARTS   AGE
scan-vulnerabilityreport-547874795b--1-pfdb7   0/1     Completed   0          41h
scan-vulnerabilityreport-b7d7f6874--1-9dj5j    0/1     Completed   0          41h
trivy-exporter-77cdf45fc6-79v7s                1/1     Running     0          41h
trivy-operator-895486674-vm686                 1/1     Running     0          41h

kubectl get job -n trivy-system                                                                                                               
NAME                                  COMPLETIONS   DURATION   AGE
scan-vulnerabilityreport-547874795b   1/1           52s        41h
scan-vulnerabilityreport-b7d7f6874    1/1           3m56s      41h

So no vulnerability reports were produced at all:

kubectl get vuln -A                                                                                                                           
No resources found

@VF-mbrauer
Contributor Author

@erikgb @chen-keinan Any news on this one? Until this is finally solved, I hesitate to roll out any further.
Jobs are still in a stuck state.

@chen-keinan
Contributor

chen-keinan commented Jul 26, 2022

@VF-mbrauer this issue is under investigation, I will update you once we have a solid solution.

@VF-mbrauer
Contributor Author

@chen-keinan Any news on this one? Independent of any issue related to trivy-operator or the trivy scanner, the job should be properly released and not get stuck forever.

@chen-keinan
Contributor

chen-keinan commented Sep 11, 2022

@VF-mbrauer You mention at the top that scan jobs get stuck due to error #206 (meaning the scan job is completed but trivy-operator is unable to process the report). I assume this is not the case now, am I right?

@VF-mbrauer
Contributor Author

@chen-keinan Yes, that is correct. That has been fixed with compression and will be slimmed down further with the work you are doing on splitting the CRDs and reducing unnecessary data.

But if in the future there is some new issue which blocks jobs, we should be prepared. That is why this ticket is still open.

@chen-keinan
Contributor

chen-keinan commented Sep 11, 2022

@chen-keinan Yes, that is correct. That has been fixed with compression and will be slimmed down further with the work you are doing on splitting the CRDs and reducing unnecessary data.

But if in the future there is some new issue which blocks jobs, we should be prepared. That is why this ticket is still open.

trivy-operator has logic that knows to delete jobs on completion (once the report has been processed) or on failure. The case where a scan has completed but trivy-operator is unable to process the report (example: #206) should be fixed as soon as it is reported; bypassing it with a generic solution might lead to data loss or a jobs overflow. WDYT?
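
For the sake of discussion, a sketch of what such a generic safety net could look like: only Jobs that completed more than a grace period ago (so their reports were presumably never processed) are reaped, trading the loss of that one report against freeing up the concurrency limit. Names and behaviour are assumptions, not trivy-operator code:

package cleanup

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reapStaleScanJobs deletes scan Jobs that completed longer than gracePeriod
// ago and are still present, i.e. the normal delete-after-report-processing
// path apparently never ran for them.
func reapStaleScanJobs(ctx context.Context, cs kubernetes.Interface, namespace string, gracePeriod time.Duration) error {
	jobs, err := cs.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	propagation := metav1.DeletePropagationBackground // also remove the Job's pods
	for i := range jobs.Items {
		finishedAt := jobs.Items[i].Status.CompletionTime
		if finishedAt == nil || time.Since(finishedAt.Time) < gracePeriod {
			continue // still running, or recent enough that the report may still be processed
		}
		if err := cs.BatchV1().Jobs(namespace).Delete(ctx, jobs.Items[i].Name,
			metav1.DeleteOptions{PropagationPolicy: &propagation}); err != nil {
			return err
		}
	}
	return nil
}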
