Faulty scan jobs blocking further scans from being executed #295

Open
VF-mbrauer opened this issue Jul 12, 2022 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. target/kubernetes Issues relating to kubernetes cluster scanning

Comments

@VF-mbrauer
Contributor

What steps did you take and what happened:

Due to the error reported in #206, scan jobs get stuck.
When this happens, other pods will not be scanned anymore: once the OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT is reached, no new scan pods are spawned because trivy-operator is still waiting for the stuck ones to finish.

Example (due to the error in #206):

scan-vulnerabilityreport-5759f44647--1-qf7sh   0/1     Completed   0          7m49s
scan-vulnerabilityreport-7d57cffd5f--1-47vds   0/1     Completed   0          2m58s
scan-vulnerabilityreport-849fffd5c7--1-p9fdt   0/1     Completed   0          6m58s
scan-vulnerabilityreport-dc5fb6cf--1-xq5kw     0/1     Completed   0          7m28s
scan-vulnerabilityreport-f49679dcc--1-cvd8x    0/1     Completed   0          118s

What did you expect to happen:
Even if jobs get stuck due to an unforeseen error, they should be released after some time to make sure that scanning
can continue with other repositories/registries. Otherwise, no further scans happen at all.

Anything else you would like to add:

If the job/pod is deleted manually, trivy-operator will likely pick up another remaining deployment to scan, and
scanning continues; but as soon as it comes back to the deployment that triggers the error, the pod gets stuck again.
So to get all deployments scanned you need to increase OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT to a high value,
and you need to frequently delete all hung jobs/pods to give trivy-operator the freedom to spawn new scans.

Environment:

  • Trivy-Operator version: 0.1.0
  • Kubernetes version: 1.22
@VF-mbrauer VF-mbrauer added the kind/bug Categorizes issue or PR as related to a bug. label Jul 12, 2022
@chen-keinan
Contributor

chen-keinan commented Jul 12, 2022

@VF-mbrauer thanks for reporting this, I will review it and provide an update.

@chen-keinan chen-keinan self-assigned this Jul 12, 2022
@chen-keinan chen-keinan added target/kubernetes Issues relating to kubernetes cluster scanning priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jul 12, 2022
@erikgb
Contributor

erikgb commented Jul 12, 2022

I suggest fixing this by making the limiter not count finished (completed/failed) jobs. That will also be a requirement for #228 when used with the limiting feature, which I want to do. WDYT @chen-keinan? I can work on it if you agree with the suggested approach.
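
For reference, here is a minimal sketch (not trivy-operator code; hasCapacity, scanJobs and limit are illustrative names) of what counting only in-flight jobs against the limit could look like, assuming the operator lists its scan Jobs via the upstream batch/v1 API:

package limiter

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFinished reports whether a Job already has a Complete or Failed
// condition set to true.
func jobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// hasCapacity reports whether another scan Job may be scheduled:
// only jobs that are still in flight count against the concurrency limit.
func hasCapacity(scanJobs []batchv1.Job, limit int) bool {
	inFlight := 0
	for i := range scanJobs {
		if !jobFinished(&scanJobs[i]) {
			inFlight++
		}
	}
	return inFlight < limit
}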

@VF-mbrauer
Contributor Author

VF-mbrauer commented Jul 12, 2022

@erikgb, we need to be careful, because this will also lead to higher resource consumption, as the completed ones will still occupy vCPU and MEM at that time. Therefore, we need to calculate and document a slight increase in resources, even if you limit them with OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT set to a specific value.

@chen-keinan cc

@erikgb
Contributor

erikgb commented Jul 12, 2022

I think you mean completed ones in

lead to resource consumption, as the not completed ones will still occupy vCPU and MEM at that time.

?

@VF-mbrauer
Contributor Author

Yes, you are right, I corrected my statement already. Thanks for that.

@VF-mbrauer
Contributor Author

Something like activeDeadlineSeconds would make sense to remove jobs/pods after some time and make room again for new scans to be initiated. This is just to prevent the stoppage; we should still drive the fix for ticket #206 as well.
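
For illustration, a minimal sketch of where such a deadline (and, where the cluster supports it, a TTL) would sit on a Job spec built with k8s.io/api/batch/v1; the function name and the values are assumptions, not the operator's actual code:

package scanjob

import (
	batchv1 "k8s.io/api/batch/v1"
)

// withTimeouts sets a hard runtime deadline and an opt-in TTL on a scan Job.
func withTimeouts(job *batchv1.Job) {
	deadline := int64(300) // terminate pods of a Job still running after 5 minutes
	ttl := int32(120)      // delete the finished Job object 2 minutes after it finishes
	job.Spec.ActiveDeadlineSeconds = &deadline
	// Requires the TTL-after-finished controller, which may not be enabled
	// by default on older clusters (see the comment further down).
	job.Spec.TTLSecondsAfterFinished = &ttl
}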

@chen-keinan
Contributor

chen-keinan commented Jul 12, 2022

@erikgb sure, pick it up. I agree with @VF-mbrauer that there is a concern around completed jobs piling up before cleanup has taken place. We can't count on the opt-in TTL, as it probably won't be available by default in all k8s versions.

@chen-keinan chen-keinan assigned erikgb and unassigned chen-keinan Jul 12, 2022
@erikgb
Contributor

erikgb commented Jul 12, 2022

@VF-mbrauer You can actually set activeDeadlineSeconds on scan jobs by configuring OPERATOR_SCAN_JOB_TIMEOUT. It seems to have a default value of 5m. I don't have an environment where I can reproduce your problem. Maybe you can try it and see if it helps?

@VF-mbrauer
Contributor Author

VF-mbrauer commented Jul 12, 2022

@erikgb The default of 5 minutes does not seem to help in this case either, as jobs were sitting there for hours/days. So even lowering that value to 1 minute or similar seems to be useless.
We should also check the meaning of this setting and where it really has an effect.

Reproducing the issue is not necessary as the result is already in my statement above.

@erikgb
Contributor

erikgb commented Jul 12, 2022

@erikgb The default of 5 minutes does not seem to help in this case either, as jobs were sitting there for hours/days. So even lowering that value to 1 minute or similar seems to be useless. We should also check the meaning of this setting and where it really has an effect.

Hmmm, interesting! Can you check your scan Job YAML to see whether it includes activeDeadlineSeconds, and what the value is?

@VF-mbrauer
Contributor Author

@erikgb The default of 5 minutes does not seem to help in this case either, as jobs were sitting there for hours/days. So even lowering that value to 1 minute or similar seems to be useless. We should also check the meaning of this setting and where it really has an effect.

Hmmm, interesting! Can you check your scan Job YAML to see whether it includes activeDeadlineSeconds, and what the value is?

This is an extract of the Job YAML:

 spec:
    activeDeadlineSeconds: 300
    backoffLimit: 0
    completionMode: NonIndexed
    completions: 1
    parallelism: 1

It contains activeDeadlineSeconds, which is set to 300 seconds (5 minutes).

But the Age is over 14 hours:

#kubectl get job -n trivy-system                                                                               
NAME                                 COMPLETIONS   DURATION   AGE
scan-vulnerabilityreport-6f4658c97   1/1           112s       14h

#kubectl get pod -n trivy-system                                                                            
NAME                                          READY   STATUS      RESTARTS   AGE
scan-vulnerabilityreport-6f4658c97--1-slxns   0/1     Completed   0          14h
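
A likely explanation: activeDeadlineSeconds only limits how long a Job may stay active; it terminates still-running pods, but it never deletes the Job object, so a Job that already reached Completed sits there until something deletes it (the operator, a TTL, or a human). Below is a minimal client-go sketch of a cleanup that deletes finished Jobs; the package, function and namespace handling are assumptions, not the operator's implementation:

package cleanup

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupFinishedScanJobs deletes every Job in the namespace that has a
// Complete or Failed condition, together with its pods.
func cleanupFinishedScanJobs(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	jobs, err := cs.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	propagation := metav1.DeletePropagationBackground // also remove the Job's pods
	for i := range jobs.Items {
		if !jobFinished(&jobs.Items[i]) {
			continue
		}
		if err := cs.BatchV1().Jobs(namespace).Delete(ctx, jobs.Items[i].Name,
			metav1.DeleteOptions{PropagationPolicy: &propagation}); err != nil {
			return err
		}
	}
	return nil
}

// jobFinished reports whether a Job has a Complete or Failed condition set to true.
func jobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}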

@VF-mbrauer
Contributor Author

VF-mbrauer commented Jul 14, 2022

Here is one of the installations, which shows that scanning is completely blocked:

kubectl get pod -n trivy-system                                                                                                                 
NAME                                           READY   STATUS      RESTARTS   AGE
scan-vulnerabilityreport-547874795b--1-pfdb7   0/1     Completed   0          41h
scan-vulnerabilityreport-b7d7f6874--1-9dj5j    0/1     Completed   0          41h
trivy-exporter-77cdf45fc6-79v7s                1/1     Running     0          41h
trivy-operator-895486674-vm686                 1/1     Running     0          41h

kubectl get job -n trivy-system                                                                                                               
NAME                                  COMPLETIONS   DURATION   AGE
scan-vulnerabilityreport-547874795b   1/1           52s        41h
scan-vulnerabilityreport-b7d7f6874    1/1           3m56s      41h

So no vulnerability reports were produced at all:

kubectl get vuln -A                                                                                                                           
No resources found

@VF-mbrauer
Contributor Author

@erikgb @chen-keinan Any news on this one? Until this is finally solved, I hesitate to roll out any further.
Jobs are still in a stuck state.

@chen-keinan
Contributor

chen-keinan commented Jul 26, 2022

@VF-mbrauer this issue is under investigation, I will update you once we have a solid solution.

@VF-mbrauer
Contributor Author

@chen-keinan Any news on this one? Independent of any issue related to trivy-operator or the trivy scanner, the job should be properly released and not get stuck forever.

@chen-keinan
Contributor

chen-keinan commented Sep 11, 2022

@VF-mbrauer You mention at the top that scan jobs get stuck due to error #206 (meaning the scan job is completed but trivy-operator is unable to process the report). I assume this is not the case now, am I right?

@VF-mbrauer
Contributor Author

@chen-keinan Yes, that is correct. That has been fixed with compression and will be slimmed down further with the work you are doing on splitting the CRDs and reducing unnecessary data.

But if in the future there is some new issue which blocks jobs, we should be prepared. That is why this ticket is still open.

@chen-keinan
Contributor

chen-keinan commented Sep 11, 2022

@chen-keinan Yes, that is correct. That has been fixed with compression and will be slimmed down further with the work you are doing on splitting the CRDs and reducing unnecessary data.

But if in the future there is some new issue which blocks jobs, we should be prepared. That is why this ticket is still open.

trivy-operator has logic that knows to delete jobs on completion (once the report has been processed) or on failure. The case where a scan has completed but trivy-operator is unable to process the report (example: #206) should be fixed as soon as it is reported; bypassing it with a generic solution might lead to data loss or a jobs overflow. WDYT?
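
For the sake of discussion, a sketch of what such a generic safety net could look like: only Jobs that completed more than a grace period ago (so their reports were presumably never processed) are reaped, trading the loss of that one report against freeing up the concurrency limit. Names and behaviour are assumptions, not trivy-operator code:

package cleanup

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reapStaleScanJobs deletes scan Jobs that completed longer than gracePeriod
// ago and are still present, i.e. the normal delete-after-report-processing
// path apparently never ran for them.
func reapStaleScanJobs(ctx context.Context, cs kubernetes.Interface, namespace string, gracePeriod time.Duration) error {
	jobs, err := cs.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	propagation := metav1.DeletePropagationBackground // also remove the Job's pods
	for i := range jobs.Items {
		finishedAt := jobs.Items[i].Status.CompletionTime
		if finishedAt == nil || time.Since(finishedAt.Time) < gracePeriod {
			continue // still running, or recent enough that the report may still be processed
		}
		if err := cs.BatchV1().Jobs(namespace).Delete(ctx, jobs.Items[i].Name,
			metav1.DeleteOptions{PropagationPolicy: &propagation}); err != nil {
			return err
		}
	}
	return nil
}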
