docs(design): Caching Scan Results by Image Reference #740
Conversation
If the operator has permissions to the cluster, and the user has access to the operator (since it's namespaced), then effectively a user could get access to the cluster reports, which might contain reports for other namespaces (tenants)?
Not sure which user we're talking about here. In the proposed solution the Starboard operator pod runs with its own service account. A user who creates a workload is a different K8s RBAC subject; hence an explicit role binding has to be created to grant them view access to namespaced or cluster-scoped vulnerability reports.
> As you can see, Starboard eventually created two VulnerabilityReports by spinning up only one K8s Job.
Which problem are we trying to solve: 1. spinning up fewer K8s Jobs, or 2. creating fewer VulnerabilityReports?
I ask this because https://github.com/aquasecurity/starboard/pull/740/files#diff-fa88466a8291355bf7dddafd314bcf4277f669d956110f8fdfd319dd48f8147eR32-R35 mentions 2 Jobs and 2 VulnerabilityReports. With this solution all we've done is go from 2 Jobs to 1 Job, still with 2 VulnerabilityReports.
If long-running Jobs are the problem we're trying to solve, wouldn't using the Trivy Client/Server model be a better idea? (Reasons: less code, fewer bugs, a single cache, and ultimately faster-finishing Jobs.)
In this proposal we are just trying to create fewer K8s Jobs.
For VulnerabilityReports we'll have a separate design if needed. Changing the way we create VulnerabilityReports won't be backward compatible, so we have to plan for that carefully.
Starboard is already integrated with Trivy in Standalone and ClientServer mode.
By default we use Trivy in Standalone mode because it does not require deploying a Trivy server. It's simpler and good for demos and trying things out, but it is also the least efficient.
Even if you switch from Standalone to ClientServer mode, Starboard would still create K8s Jobs; the only difference is that the Jobs would complete faster. Leveraging a cache, on the other hand, allows us to skip K8s Job creation entirely if an image with the same digest has already been scanned.
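To make the cache-lookup idea concrete, here is a minimal sketch of how an operator could derive a deterministic ClusterVulnerabilityReport name from an image digest, so every workload referencing the same digest resolves to the same cached report. The `clusterReportName` function and the FNV hashing scheme are illustrative assumptions, not Starboard's actual naming code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// clusterReportName derives a deterministic name for a cached
// ClusterVulnerabilityReport from an image digest. Because the name
// depends only on the digest, the operator can check whether a report
// with this name already exists and, if so, skip creating a scan Job.
// NOTE: this hashing scheme is a sketch, not Starboard's real one.
func clusterReportName(imageDigest string) string {
	h := fnv.New32a()
	h.Write([]byte(imageDigest))
	return fmt.Sprintf("image-%x", h.Sum32())
}

func main() {
	digest := "sha256:4a1c4b21597c1b4415bdbecb28a3296c6b5e23ca4f9feeb599860a1dac6a0108"
	// Two workloads referencing the same digest map to the same report
	// name, which is what lets the operator reuse the cached result.
	fmt.Println(clusterReportName(digest) == clusterReportName(digest)) // true
}
```

In a real controller the existence check would be a `Get` against the cluster-scoped CRD before enqueueing a scan Job; the sketch only shows the deterministic key derivation that makes that lookup possible.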
I like the idea. The issue I see is that today a developer who only has namespace access can delete a VulnerabilityReport, and Starboard will retrigger a new scan Job. Assuming Starboard is using an external Trivy server, Trivy could have received a new CVE database update in the meantime, so rescanning would actually be valuable.
This could probably be solved with a config value. If we add a TTL on the ClusterVulnerabilityReport that is, for example, 24 hours by default, I think we could skip the majority of Jobs, especially when creating a new ReplicaSet in a Deployment each day, which I think would be really useful.
One thing we would have to keep in mind is what should happen if we start the operator, or if two ReplicaSets are created very close together. Should we have some logic in the vulnerabilityreport controller that makes sure only one of the image scan Jobs gets started?
Gave this some more thought, read through the document again, and added a small comment that hopefully will help with the TODO mentioned.
In the design document I think it would be good to add a ClusterVulnerabilityReport example.
I assume it will be very similar to what a VulnerabilityReport looks like.
I would also suggest creating a new annotation for VulnerabilityReports that get generated from a ClusterVulnerabilityReport without running a separate Job.
For example: starboard.aquasecurity.github.io/ClusterVulnerabilityReportName: 84bcb5cd46
This way it would be very clear that the VulnerabilityReport was created from a ClusterVulnerabilityReport, which I think could help with debugging.
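A hypothetical sketch of what such a pair could look like. The schema is an assumption extrapolated from the existing VulnerabilityReport CRD, and the report name `84bcb5cd46` is reused from the annotation example above; none of this is a confirmed design:

```yaml
# Hypothetical cluster-scoped report keyed by image digest
# (schema assumed from the existing VulnerabilityReport CRD).
apiVersion: aquasecurity.github.io/v1alpha1
kind: ClusterVulnerabilityReport
metadata:
  name: 84bcb5cd46
report:
  artifact:
    repository: library/nginx
    digest: sha256:4a1c4b21597c1b4415bdbecb28a3296c6b5e23ca4f9feeb599860a1dac6a0108
  scanner:
    name: Trivy
---
# Namespaced report copied from the cached cluster-scoped report,
# carrying the suggested annotation for traceability.
apiVersion: aquasecurity.github.io/v1alpha1
kind: VulnerabilityReport
metadata:
  name: replicaset-nginx-6d4cf56db6-nginx
  namespace: default
  annotations:
    starboard.aquasecurity.github.io/ClusterVulnerabilityReportName: 84bcb5cd46
```

With the annotation in place, `kubectl get vulnerabilityreports -o jsonpath` could be used to trace any namespaced report back to the cluster-scoped report it was copied from.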
@danielpacak do you already have some code ready for this?
> ## TODO
>
> * Specify the life-cycle of ClusterVulnerabilityReport objects, e.g. how they are invalidated.
I assume the main goal is to delete the ClusterVulnerabilityReport after a specific time.
One way could be to follow the steps of #879, which is to have a separate controller sitting and listening for ClusterVulnerabilityReports. In this case we wouldn't do any requeue; instead we would just delete the existing ClusterVulnerabilityReport after a specific amount of time.
The good thing about this solution is that it would be very similar to what we do in #863.
The second way could be, when checking whether a Job should be started versus copying an existing ClusterVulnerabilityReport into a VulnerabilityReport, to simply check the age of the ClusterVulnerabilityReport. If it's older than x, we delete the existing ClusterVulnerabilityReport and run the Job as we normally would, thus generating a new VulnerabilityReport and ClusterVulnerabilityReport.
The good thing here is that we don't rely on another controller.
@NissesSenap I wrote some code initially to POC this idea, but then I dropped it and started with this design doc.
@danielpacak how do you want to proceed with this design document? If you want, I can start working on it. Or some other solution?
What if we merge this PR and then you can refine it? This way we'll have your PR to collect feedback and make progress. WDYT?
I think that is a great idea.
Signed-off-by: Daniel Pacak <pacak.daniel@gmail.com>
I marked it as draft and merged. We can refine it by opening follow-up PR(s).