Faulty scan jobs blocking further scans from being executed #295
Comments
@VF-mbrauer thanks for reporting this, I will review it and update.
I suggest fixing this by making the limiter not count finished (completed/failed) jobs. That would also be a requirement for #228 to be used together with the limiting feature, which I want to do. WDYT @chen-keinan? I can work on it if you agree with the suggested approach.
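A minimal sketch of that idea (not the actual trivy-operator code; package and function names are illustrative, using the upstream Kubernetes batch/v1 API):

```go
// Sketch only: count scan jobs against the concurrency limit only while they
// are still running, so completed or failed jobs no longer block new scans.
package limiter

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// isFinished reports whether a Job has reached a terminal state,
// i.e. it carries a Complete or Failed condition with status True.
func isFinished(job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if (cond.Type == batchv1.JobComplete || cond.Type == batchv1.JobFailed) &&
			cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// countJobsTowardsLimit returns how many of the given scan jobs should count
// against the concurrent-scan-jobs limit: only jobs that are not yet finished.
func countJobsTowardsLimit(jobs []batchv1.Job) int {
	active := 0
	for i := range jobs {
		if !isFinished(&jobs[i]) {
			active++
		}
	}
	return active
}
```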
@erikgb, we need to be careful, because it will also lead to resource consumption, as the … cc @chen-keinan
I think you mean completed ones in …?
Yes, you are right, I corrected my statement already. Thanks for that.
Something like …
@erikgb sure, pick it up. I agree with @VF-mbrauer that there is a concern around …
@VF-mbrauer You can actually set …
@erikgb The default 5 minutes also does not seem to help in this case, as jobs were sitting there for hours/days. So even lowering that value to 1 minute or similar seems useless. Reproducing the issue is not necessary, as the result is already in my statement above.
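For reference, such a timeout would typically be set as an environment variable on the operator Deployment. The fragment below is illustrative only; it assumes the setting discussed above is the operator's OPERATOR_SCAN_JOB_TIMEOUT variable and that the operator runs in the trivy-system namespace:

```yaml
# Illustrative fragment of the operator Deployment; variable name, namespace
# and value are assumptions, not taken from this issue.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trivy-operator
  namespace: trivy-system
spec:
  template:
    spec:
      containers:
        - name: trivy-operator
          env:
            - name: OPERATOR_SCAN_JOB_TIMEOUT
              value: "1m"   # default is 5m; as noted above, lowering it did not help here
```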
Hmmm, interesting! Can you check your scan Job YAML? If it includes …
This is an extract of the Job YAML:
It contains the … but the Age is over 14 hours:
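Since the extract above was lost in formatting, the following is a hypothetical illustration of what such a scan Job spec might look like; the field under discussion is assumed (not confirmed by the truncated comment) to be activeDeadlineSeconds, and all names and values are made up:

```yaml
# Hypothetical scan Job extract; name, namespace, image and values are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: scan-vulnerabilityreport-abc123   # hypothetical name
  namespace: trivy-system                 # assumed namespace
spec:
  activeDeadlineSeconds: 300   # would normally terminate the Job after 5 minutes
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trivy
          image: aquasec/trivy:latest     # illustrative image
```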
One of the installations, which proves that it blocks completely:
So no vulnerabilities were scanned at all:
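For completeness, a quick way to confirm that no reports exist (assuming the standard trivy-operator CRDs are installed) would be:

```sh
# An empty result across all namespaces indicates no workloads were scanned.
kubectl get vulnerabilityreports --all-namespaces
```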
@erikgb @chen-keinan Any news about that one? Until this gets finally solved, I hesitate to roll out further.
@VF-mbrauer this issue is under investigation, I will update you once we have a solid solution. |
@chen-keinan any news on this one? Independent of any issue related to trivy-operator or the trivy scanner, the job should be properly released and not get stuck forever.
@VF-mbrauer you mention at the top that due to error #206 scan jobs are getting stuck (meaning the scan job is completed but …).
@chen-keinan Yes, that is correct. That has been fixed with compression and will be further slimmed down by the work you are doing to split the CRDs and reduce unnecessary stuff. But if in the future there is some new issue which blocks jobs, we should be prepared. And therefore this ticket here is still open.
|
What steps did you take and what happened:
Due to an error reported in #206, scan jobs are getting stuck.
In this case, other pods will not be scanned anymore: once the OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT is reached, no more scan pods are spawned, because trivy-operator still waits for the stuck jobs to finish. Example (due to the error in #206):
What did you expect to happen:
Even though jobs get stuck due to an unforeseen error, they should get released after some time to make sure that the scan continues with other repositories/registries. Otherwise, no more scanning happens.
Anything else you would like to add:
If the Job/Pod gets manually deleted, it is likely that trivy-operator picks up any other remaining deployment to scan, and then scanning continues; but once it comes back to the deployment which triggers the error, the pod gets stuck again.
So to get all deployments scanned, you need to increase the OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT to a high value, and you need to frequently delete all jobs/pods which got hung, to give trivy-operator the freedom to re-spawn new scans.
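A rough sketch of that workaround, assuming the operator runs in the trivy-system namespace and that scan jobs carry an app.kubernetes.io/managed-by=trivy-operator label (both are assumptions; adjust them to the actual installation):

```sh
# Raise the concurrency limit on the operator Deployment (value is illustrative).
kubectl -n trivy-system set env deployment/trivy-operator \
  OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT=20

# Periodically delete hung scan jobs so trivy-operator can re-spawn new scans.
kubectl -n trivy-system delete jobs -l app.kubernetes.io/managed-by=trivy-operator
```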
Environment: