Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(design): Caching Scan Results by Image Reference #740

Merged
merged 1 commit into from Mar 6, 2022
Merged

docs(design): Caching Scan Results by Image Reference #740

merged 1 commit into from Mar 6, 2022

Conversation

danielpacak
Copy link
Contributor

Signed-off-by: Daniel Pacak pacak.daniel@gmail.com

@danielpacak danielpacak added the 🎨 design More about design and architecture than writing Go code label Oct 8, 2021
@danielpacak danielpacak changed the title docs(design): Caching Scan Results by Repository Digest docs(design): Caching Scan Results by Image Reference Oct 8, 2021
@danielpacak danielpacak added the 🙏 help wanted Extra attention is needed label Oct 8, 2021
Copy link

@itaysk itaysk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the operator has permissions to the cluster, and the user has access to the operator (since it's namespaced), then effectively a user could get access to the cluster reports, which might contain other namespaces (tenants) reports?

@danielpacak
Copy link
Contributor Author

danielpacak commented Oct 11, 2021

if the operator has permissions to the cluster, and the user has access to the operator (since it's namespaced), then effectively a user could get access to the cluster reports, which might contain other namespaces (tenants) reports?

Not sure which user we're talking about here. In the proposed solution the Starboard operator pod is run with the service account (by default starboard-operator), which would require extended permissions to manage ClusterVulnerabilityReport on top of existing access to VulnerabilityReport kinds.

A user who's creating a workload is a different K8s RBAC subject. Hence, explicit role binding has to be created to grant him view access to namespaced or cluster-scoped vulnerability reports.

@danielpacak danielpacak marked this pull request as ready for review October 14, 2021 12:22
Comment on lines +95 to +96
As you can see, Starboard eventually created two VulnerabilityReports by spinning
up only one K8s Job.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which problem are we trying to solve: 1. Spinning less K8s jobs or 2. Creating less VulnerabilityReports?

I ask this because https://github.com/aquasecurity/starboard/pull/740/files#diff-fa88466a8291355bf7dddafd314bcf4277f669d956110f8fdfd319dd48f8147eR32-R35 mentions 2 jobs and 2 VulnerabilityReports. From this solution all we've done is 1 job and 2 VulnerabilityReports.

If long running jobs are the problem we're trying to solve, wouldn't using Trivy Client/Server model be a better idea? (Reasons: Less code, less bugs, single cache, and ultimately faster finishing jobs).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this proposal we are just trying to create less K8s jobs.

For VulnerabilityReports we'll have a separate design if needed. Changing they way we create VulnerabilityReports won't be backward compatible and we have to plan for that carefully.

Starboard is already integrated with Trivy in Standalone and ClientServer mode.

By default we use Trivy in Standalone mode because it does not require deploying Trivy server. It's simpler, good for demos and trying things out, but is also least efficient.

Even if you switch from Standalone to ClientServer mode Starboard would still create K8s jobs. The only difference is that jobs will complete faster. Therefore, leveraging cache allows us to skip K8s job creation if an image with the same digest has already been scanned.

@NissesSenap
Copy link
Contributor

I like the idea, the issue I see is that today can a developer that only have namespace access delete a vulnerabilityreport and starboard would retrigger a new job. Assuming that starboard is using a external trivy server trivy could have gotten a new CVE update and thanks to that it would be interesting to scan the new job.

This could probably be solved with a config value, If we add a TTL of the ClusterVulnerabilityReport that is for example 24 hours by default I think we could skip the majority of jobs especially when creating new RS in a deployment per day which I think would be really useful.

One thing that we would have to keep in mind is what should happen if we start the operator or we have two RS that is created really close. Should we have some logic in the vulnerabilityreport controller that makes sure that only one of the image scan job gets started?

@danielpacak danielpacak requested review from chen-keinan and removed request for yoavrotems January 5, 2022 18:39
Copy link
Contributor

@NissesSenap NissesSenap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this some more thought and read through the document again and added a small comment that hopefully will help out with the TODO mentioned.

In the design document I think it would be good to add a ClusterVulnerabilityReport example.
I assume it will be very similar to how a VulnerabilityReport look like.

I would suggest also creating a new annotation for VulnerabilityReports that gets generated from a ClusterVulnerabilityReport without running a separate job.

For example: starboard.aquasecurity.github.io/ClusterVulnerabilityReportName: 84bcb5cd46 this way it would be very clear if the VulnerabilityReport was created from a ClusterVulnerabilityReport and I think this could help out with debugging.

@danielpacak do you already have some code ready for this?


## TODO

* Specify the life-cycle of ClusterVulnerabilityReport objects, e.g. how they are invalidated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the main goal is to delete the ClusterVulnerabilityReport after a specific time.

One way could be to follow the steps of: #879 which is to have a seperate controller sitting and listening for ClusterVulnerabilityReport. But in this case we woulden't do any reque instead we would just delete the existing ClusterVulnerabilityReport after a specific amount of time.

The good thing with this solution is that it would be very similar to what we do in #863.

The second way could be when checking if a job should be started or not vs copying an existing ClusterVulnerabilityReport into a VulnerabilityReports we could just check and see the age of the ClusterVulnerabilityReport. If it's above x time we could just delete the existing ClusterVulnerabilityReport and run the job as we normally would, thus generating a new VulnerabilityReports and ClusterVulnerabilityReport.

The good thing here is that we don't rely on another controller.

@danielpacak
Copy link
Contributor Author

danielpacak commented Feb 1, 2022

Gave this some more thought and read through the document again and added a small comment that hopefully will help out with the TODO mentioned.

In the design document I think it would be good to add a ClusterVulnerabilityReport example. I assume it will be very similar to how a VulnerabilityReport look like.

I would suggest also creating a new annotation for VulnerabilityReports that gets generated from a ClusterVulnerabilityReport without running a separate job.

For example: starboard.aquasecurity.github.io/ClusterVulnerabilityReportName: 84bcb5cd46 this way it would be very clear if the VulnerabilityReport was created from a ClusterVulnerabilityReport and I think this could help out with debugging.

@danielpacak do you already have some code ready for this?

@NissesSenap I wrote some code initially to POC this idea, but then I dropped it and started with this design doc.

@NissesSenap
Copy link
Contributor

@danielpacak how do you want to proceed on this design document?
It would be really nice to get this feature, especially in environments where you are running sidecars like linkerd.

If you want I can start working on the design document.
I could write a PR to your fork and you can approve but it will be hard for people to give feedback on.
Or should I create a new PR?

Some other solution?

@danielpacak
Copy link
Contributor Author

@danielpacak how do you want to proceed on this design document? It would be really nice to get this feature, especially in environments where you are running sidecars like linkerd.

If you want I can start working on the design document. I could write a PR to your fork and you can approve but it will be hard for people to give feedback on. Or should I create a new PR?

Some other solution?

What if we merge this PR and then you can refine it? This way we'll have your PR to collect feedback and make progress. WDYT?

@NissesSenap
Copy link
Contributor

I think that is a great idea.

Signed-off-by: Daniel Pacak <pacak.daniel@gmail.com>
@danielpacak danielpacak merged commit afe2521 into aquasecurity:main Mar 6, 2022
@danielpacak
Copy link
Contributor Author

I think that is a great idea.

I marked it as draft and merged. We can refine it by opening follow up PR(s).

@danielpacak danielpacak deleted the docs_design_cache_scan_results_by_repo_digest branch March 6, 2022 10:37
dirien pushed a commit to dirien/starboard that referenced this pull request Mar 8, 2022
Signed-off-by: Daniel Pacak <pacak.daniel@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🎨 design More about design and architecture than writing Go code 🙏 help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants