[agent-smith] account for egress traffic #4677

Merged
merged 6 commits into from
Jul 9, 2021

Contributor

@fntlnz fntlnz commented Jul 1, 2021

Introducing the first iteration of a mechanism to count excessive egress traffic over a period of time. Based on the original code from @csweichel - the only difference is the way we account for processes using bpf.

This is what happens when a workspace uses too much egress.

{"level":"info","message":"Found violation","penalties":null,"serviceContext":{"service":"agent-smith","version":""},"severity":"INFO","time":"2021-07-01T15:00:58Z","violation":{"SupervisorPID":410501,"Namespace":"","Pod":"","Owner":"","InstanceID":"2eb1dab1-77c3-42fd-99b9-ba3f98800be1","WorkspaceID":"magenta-swallow-alg0r4xe","Infringements":[{"Description":"egress traffic is 144.233 megabytes over limit","Kind":" excessive egress"}],"GitRemoteURL":["https://github.com/fntlnz/bpf-harness"]}}
{"level":"info","message":"Found violation","penalties":["stop workspace"],"serviceContext":{"service":"agent-smith","version":""},"severity":"INFO","time":"2021-07-01T15:01:28Z","violation":{"SupervisorPID":410501,"Namespace":"","Pod":"","Owner":"","InstanceID":"2eb1dab1-77c3-42fd-99b9-ba3f98800be1","WorkspaceID":"magenta-swallow-alg0r4xe","Infringements":[{"Description":"egress traffic is 972.820 megabytes over limit","Kind":"very excessive egress"}],"GitRemoteURL":["https://github.com/fntlnz/bpf-harness"]}}
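
For readers following along: the two log lines above suggest a two-tier check, where a softer `excessive egress` infringement carries no penalty and a `very excessive egress` one triggers `stop workspace`. A minimal Go sketch of that classification, under stated assumptions, might look like this. `EgressLimits`, `classifyEgress`, the threshold values and the 1024*1024 definition of a megabyte are all hypothetical; only the `Kind` strings and the description format come from the logs.

```go
package main

import "fmt"

// Infringement mirrors the two fields visible in the log lines.
type Infringement struct {
	Description string
	Kind        string
}

// EgressLimits holds hypothetical byte thresholds; the real values are not shown in this PR.
type EgressLimits struct {
	ExcessiveBytes     int64
	VeryExcessiveBytes int64
}

// megabyte is assumed to be 1024*1024 bytes; the agent may use a different unit.
const megabyte = 1024 * 1024

// overLimit formats the description the same way the logs do.
func overLimit(sent, limit int64) string {
	return fmt.Sprintf("egress traffic is %.3f megabytes over limit", float64(sent-limit)/megabyte)
}

// classifyEgress returns nil while traffic stays within the lower limit.
// The higher tier is checked first so that it wins when both limits are exceeded.
func classifyEgress(sentBytes int64, lim EgressLimits) *Infringement {
	switch {
	case sentBytes > lim.VeryExcessiveBytes:
		return &Infringement{
			Description: overLimit(sentBytes, lim.VeryExcessiveBytes),
			Kind:        "very excessive egress",
		}
	case sentBytes > lim.ExcessiveBytes:
		return &Infringement{
			Description: overLimit(sentBytes, lim.ExcessiveBytes),
			Kind:        "excessive egress",
		}
	}
	return nil
}

func main() {
	lim := EgressLimits{ExcessiveBytes: 100 * megabyte, VeryExcessiveBytes: 500 * megabyte}
	for _, sent := range []int64{50 * megabyte, 150 * megabyte, 600 * megabyte} {
		if infr := classifyEgress(sent, lim); infr != nil {
			fmt.Printf("%s: %s\n", infr.Kind, infr.Description)
		} else {
			fmt.Println("within limits")
		}
	}
}
```

Whether the real description is computed against the lower or the higher limit is not visible from the logs; the sketch measures against whichever tier was exceeded.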

@codecov

codecov bot commented Jul 1, 2021

Codecov Report

Merging #4677 (9a45f53) into main (9628909) will increase coverage by 26.92%.
The diff coverage is 34.00%.

@@            Coverage Diff            @@
##           main    #4677       +/-   ##
=========================================
+ Coverage      0   26.92%   +26.92%     
=========================================
  Files         0        5        +5     
  Lines         0      739      +739     
=========================================
+ Hits          0      199      +199     
- Misses        0      518      +518     
- Partials      0       22       +22     

| Flag | Coverage Δ |
|---|---|
| components-ee-agent-smith-app | 26.92% <34.00%> (?) |

| Impacted Files | Coverage Δ |
|---|---|
| components/ee/agent-smith/pkg/agent/egress.go | 0.00% <0.00%> (ø) |
| components/ee/agent-smith/pkg/agent/agent.go | 18.96% <28.00%> (ø) |
| components/ee/agent-smith/pkg/agent/metrics.go | 61.33% <66.66%> (ø) |
| ...omponents/ee/agent-smith/pkg/signature/sinature.go | 44.27% <0.00%> (ø) |
| components/ee/agent-smith/pkg/agent/actions.go | 0.00% <0.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9628909...9a45f53.

@fntlnz fntlnz requested a review from csweichel July 1, 2021 15:09
Contributor

@csweichel csweichel left a comment

Had a read through the code, currently deploying, will try then

components/ee/agent-smith/pkg/agent/agent.go (outdated, resolved)
components/ee/agent-smith/pkg/agent/agent.go (resolved)
	t := value.(time.Time)
	infr, err := agent.checkEgressTrafficCallback(p, t)
	if err != nil {
		return true
Contributor

I reckon at least some debug logging would be handy here

Contributor Author

A log here would be extremely verbose; I can add one at a level that is not logged by default.
Keep in mind that this check runs for every found process every 30 seconds.

Contributor

Excellent point. We could add a metric here also, to see how these checks behave, akin to the signature check failures.
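
To make the metric idea concrete, here is a minimal sketch of counting failed checks instead of logging each one. It uses only the standard library: the atomic counter stands in for what would presumably be a Prometheus counter in agent-smith, and `egressChecker`, `checkFunc`, `runOnce` and `demoRun` are hypothetical names, not code from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// checkFunc is a hypothetical stand-in for agent.checkEgressTrafficCallback.
type checkFunc func(pid int, firstSeen time.Time) (infringement string, err error)

// egressChecker counts failed checks rather than logging every failure.
type egressChecker struct {
	failures int64    // accessed atomically; kept first for 64-bit alignment
	pids     sync.Map // pid (int) -> first-seen time.Time
}

// runOnce performs one pass over all tracked pids and returns the infringements found.
func (c *egressChecker) runOnce(check checkFunc) []string {
	var found []string
	c.pids.Range(func(key, value interface{}) bool {
		pid, okPid := key.(int)
		seen, okTime := value.(time.Time)
		if !okPid || !okTime {
			return true // skip malformed entries, keep iterating
		}
		infr, err := check(pid, seen)
		if err != nil {
			atomic.AddInt64(&c.failures, 1) // cheap even when run for every pid
			return true
		}
		if infr != "" {
			found = append(found, infr)
		}
		return true
	})
	return found
}

// Failures reports how many checks have failed so far.
func (c *egressChecker) Failures() int64 { return atomic.LoadInt64(&c.failures) }

// demoRun tracks three pids: pid 2 fails its check, pid 3 is over the limit.
func demoRun() ([]string, int64) {
	c := &egressChecker{}
	now := time.Now()
	for _, pid := range []int{1, 2, 3} {
		c.pids.Store(pid, now)
	}
	check := func(pid int, _ time.Time) (string, error) {
		switch pid {
		case 2:
			return "", errors.New("cannot read traffic counters")
		case 3:
			return "excessive egress", nil
		}
		return "", nil
	}
	return c.runOnce(check), c.Failures()
}

func main() {
	found, failures := demoRun()
	fmt.Println("infringements:", found, "failed checks:", failures)
}
```

Returning `true` from the `Range` callback keeps the iteration going after a failed check, which matches the `return true` in the snippet under review.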

components/ee/agent-smith/pkg/agent/agent.go (resolved)
components/ee/agent-smith/pkg/agent/agent.go (resolved)
components/ee/agent-smith/pkg/agent/agent.go (resolved)
components/ee/agent-smith/pkg/network/egress.go (outdated, resolved)
@csweichel
Contributor

Awesome! I was able to provoke agent smith to step in on excessive egress

@@ -47,6 +51,7 @@ type Smith struct {

	notifiedInfringements *lru.Cache
	perfHandler           chan perfHandlerFunc
	pidsMap               sync.Map
Contributor Author

We could probably expose a Prometheus metric to see how this map grows, wdyt @csweichel?

@fntlnz
Contributor Author

fntlnz commented Jul 5, 2021

/werft run

👍 started the job as gitpod-build-lf-agent-smith-egress-policing.16

@fntlnz fntlnz force-pushed the lf/agent-smith-egress-policing branch from 0168e35 to e8d313b Compare July 5, 2021 15:38
@fntlnz fntlnz requested a review from a team as a code owner July 5, 2021 15:38
@fntlnz fntlnz requested a review from a team July 5, 2021 15:38
@fntlnz fntlnz force-pushed the lf/agent-smith-egress-policing branch from 03094bc to 7ebfcf5 Compare July 6, 2021 09:03
Contributor

@csweichel csweichel left a comment

One last nit

components/ee/agent-smith/pkg/agent/agent.go (outdated, resolved)
@fntlnz fntlnz force-pushed the lf/agent-smith-egress-policing branch from 4fd86f7 to 0d6dede Compare July 6, 2021 13:16
@fntlnz
Contributor Author

fntlnz commented Jul 6, 2021

/werft run

👍 started the job as gitpod-build-lf-agent-smith-egress-policing.22

@fntlnz fntlnz force-pushed the lf/agent-smith-egress-policing branch from 0d6dede to 6be1f3b Compare July 6, 2021 13:57
Contributor

@csweichel csweichel left a comment

Awesome. Thank you 🚀

@fntlnz
Contributor Author

fntlnz commented Jul 6, 2021

/werft run

👍 started the job as gitpod-build-lf-agent-smith-egress-policing.24

@fntlnz fntlnz force-pushed the lf/agent-smith-egress-policing branch from 6be1f3b to da335ba Compare July 6, 2021 16:22
@fntlnz
Contributor Author

fntlnz commented Jul 6, 2021

I'd love to do some performance tuning before merging.

I noticed that CPU usage is quite high compared to other services:

NAME                                CPU(cores)   MEMORY(bytes)   
agent-smith-592nd                   280m         102Mi           
agent-smith-5k6wk                   332m         107Mi           
agent-smith-6vh5l                   321m         93Mi            
agent-smith-fvq7r                   309m         126Mi           
blobserve-7877498bcf-kw5r5          1m           105Mi           
content-service-77d7c84b7c-pqmzx    1m           5Mi             
dashboard-7f798db967-ll2tj          2m           14Mi            
image-builder-7bd7f874df-z68rp      8m           560Mi           
jaeger-7f8f8ff74-hzgkm              2m           12Mi            
messagebus-0                        177m         109Mi           
minio-58d57d5f7d-8sngl              1m           25Mi            
mysql-0                             22m          243Mi           
proxy-68cfb7c965-pmlvz              1m           16Mi            
registry-facade-dclr8               1m           22Mi            
registry-facade-fj7qg               1m           18Mi            
registry-facade-j5z69               1m           21Mi            
registry-facade-qnwbc               1m           18Mi            
server-5886b66cf7-wdrzr             30m          144Mi           
sweeper-665d98b88c-prb2h            1m           4Mi             
ws-daemon-2w265                     3m           28Mi            
ws-daemon-9gp29                     1m           9Mi             
ws-daemon-c8p6x                     1m           8Mi             
ws-daemon-n968m                     1m           8Mi             
ws-manager-7bff5c5d5d-m5rqh         3m           29Mi            
ws-manager-bridge-db56448cb-9mcpn   1m           87Mi            
ws-proxy-7955dcf85-xxrfc            2m           9Mi             
ws-scheduler-96c8bb45-k466q         9m           50Mi   

I need to understand whether this is because of a resource leak or because of the volume of data we process.

I also added a counter of active pids, and it suggests that we are currently processing many of them at once:

# HELP gitpod_agent_smith_monitored_pids Current count of pids under investigation
# TYPE gitpod_agent_smith_monitored_pids counter
gitpod_agent_smith_monitored_pids{process_state="deleted"} 71124
gitpod_agent_smith_monitored_pids{process_state="stored"} 147847

In the case above, 147847 - 71124 = 76723 pids are processed every 30 seconds. I'm trying to understand how this number changes in a typical scenario before deploying this.
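
As a rough illustration of how the `stored` and `deleted` series relate to the number of pids still being tracked, here is a standard-library sketch. `pidTracker` and its methods are hypothetical; the real code keeps pids in `pidsMap` and exports Prometheus metrics, which this sketch replaces with plain atomic counters.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// pidTracker wraps a sync.Map and counts how many pids were ever stored and deleted.
type pidTracker struct {
	stored  int64 // accessed atomically
	deleted int64 // accessed atomically
	pids    sync.Map
}

// Store starts tracking a pid; storing the same pid twice is counted once.
func (t *pidTracker) Store(pid int) {
	if _, loaded := t.pids.LoadOrStore(pid, struct{}{}); !loaded {
		atomic.AddInt64(&t.stored, 1)
	}
}

// Delete stops tracking a pid; deleting an unknown pid is not counted.
func (t *pidTracker) Delete(pid int) {
	if _, loaded := t.pids.LoadAndDelete(pid); loaded {
		atomic.AddInt64(&t.deleted, 1)
	}
}

// Active is stored minus deleted, i.e. the pids visited on each pass.
func (t *pidTracker) Active() int64 {
	return atomic.LoadInt64(&t.stored) - atomic.LoadInt64(&t.deleted)
}

func main() {
	tr := &pidTracker{}
	for pid := 1; pid <= 5; pid++ {
		tr.Store(pid)
	}
	tr.Delete(1)
	tr.Delete(2)
	fmt.Println("active pids:", tr.Active())
}
```

With the numbers from the metrics above, the same subtraction gives 147847 - 71124 = 76723 active pids.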

@fntlnz
Contributor Author

fntlnz commented Jul 6, 2021

This is the `top` output for agent-smith in the current situation (before this change). There's probably something wrong: the whole pid-processing pipeline needs to be improved.

NAME                                 CPU(cores)   MEMORY(bytes)   
agent-smith-t4knf                    9m           37Mi 

@fntlnz fntlnz self-assigned this Jul 7, 2021
@fntlnz
Contributor Author

fntlnz commented Jul 7, 2021

/werft run

👍 started the job as gitpod-build-lf-agent-smith-egress-policing.27

@fntlnz
Contributor Author

fntlnz commented Jul 7, 2021

Made a change so that only supervisor processes are taken into account when checking egress.
This was needed because the network namespace is owned by the process tree and not by the individual process, so we now perform the check only once per workspace.

This translates into a performance improvement because we handle 1 pid instead of thousands.

This can be easily seen in the metrics

gitpod_agent_smith_monitored_pids{process_state="deleted"} 1
gitpod_agent_smith_monitored_pids{process_state="stored"} 1
# HELP gitpod_agent_smith_penalty_attempts_total The total amount of attempts that agent-smith is trying to apply a penalty.
# TYPE gitpod_agent_smith_penalty_attempts_total counter
gitpod_agent_smith_penalty_attempts_total{penalty="stop workspace"} 1

These metrics say that one workspace was started (`process_state="stored"`), it went above the egress limit (`penalty="stop workspace"`), and it was then garbage collected (`process_state="deleted"`).
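
The supervisor-only change described in this comment can be pictured with a small sketch. It assumes each observed process record carries the pid of its workspace supervisor; how agent-smith actually identifies supervisor processes is not shown in this thread, so `proc` and `supervisorsOnly` are hypothetical.

```go
package main

import "fmt"

// proc is a hypothetical, simplified view of a process seen by the agent.
type proc struct {
	PID           int
	SupervisorPID int // pid of the supervisor at the root of the workspace's process tree
}

// supervisorsOnly keeps one pid per workspace: the supervisor itself.
// Every process in the tree shares the same network namespace, so checking
// the supervisor once covers the egress of the whole tree.
func supervisorsOnly(procs []proc) []int {
	seen := make(map[int]struct{})
	var out []int
	for _, p := range procs {
		if p.PID != p.SupervisorPID {
			continue // child process, already covered by its supervisor
		}
		if _, dup := seen[p.PID]; dup {
			continue // supervisor reported more than once
		}
		seen[p.PID] = struct{}{}
		out = append(out, p.PID)
	}
	return out
}

func main() {
	procs := []proc{
		{PID: 10, SupervisorPID: 10},
		{PID: 11, SupervisorPID: 10},
		{PID: 12, SupervisorPID: 10},
		{PID: 20, SupervisorPID: 20},
		{PID: 21, SupervisorPID: 20},
	}
	fmt.Println("pids to check:", supervisorsOnly(procs))
}
```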

@fntlnz
Contributor Author

fntlnz commented Jul 7, 2021

Resource usage also seems better now; I will investigate further optimizations.

NAME                                      CPU(cores)   MEMORY(bytes)   
agent-smith-4stl6                         233m         74Mi            
agent-smith-fxj4t                         206m         40Mi            
agent-smith-l6xng                         410m         59Mi            
agent-smith-t6skf                         371m         101Mi           
agent-smith-thjrf                         136m         68Mi            
agent-smith-wgtr6                         185m         69Mi            
agent-smith-zmk7k                         270m         60Mi  

@fntlnz
Contributor Author

fntlnz commented Jul 7, 2021

Looks like the performance differences between this and production are due to signature matching (which is not changed at all here). I will investigate further, but I believe this one is ready to go.

@fntlnz fntlnz moved this from In Progress to In Review in [DEPRECATED] Product Engineering Groundwork Jul 7, 2021
@fntlnz fntlnz requested a review from csweichel July 7, 2021 12:24
@fntlnz
Contributor Author

fntlnz commented Jul 7, 2021

Just tested against the current main. The cpu usage was not introduced by this PR, main already has it.

agent-smith-hjfjz                    317m         60Mi            
agent-smith-jttmr                    257m         56Mi            
agent-smith-kfwst                    233m         69Mi            
agent-smith-rgplr                    298m         74Mi            
agent-smith-rj7r4                    301m         89Mi            
agent-smith-wq8rr                    357m         70Mi            
agent-smith-zj2x9                    234m         125Mi    

Will investigate on main.

@csweichel
Contributor

I reckon we can decouple the CPU use of agent smith and this PR, considering how unoptimised the signature checks are.

@csweichel
Contributor

Agent Smith found the excessive egress event, but failed to stop my workspace.

{"level":"info","message":"Found violation","penalties":null,"serviceContext":{"service":"agent-smith","version":""},"severity":"INFO","time":"2021-07-08T07:47:25Z","violation":{"SupervisorPID":1155632,"Namespace":"","Pod":"","Owner":"","InstanceID":"8b6ddb44-0945-4946-93f5-5f36b8d27b34","WorkspaceID":"plum-harrier-1srrlksl","Infringements":[{"Description":"egress traffic is 718.802 megabytes over limit","Kind":"excessive egress"}],"GitRemoteURL":["https://github.com/gitpod-io/template-typescript-node"]}}
{"level":"info","message":"Found violation","penalties":["stop workspace"],"serviceContext":{"service":"agent-smith","version":""},"severity":"INFO","time":"2021-07-08T07:49:25Z","violation":{"SupervisorPID":1600186,"Namespace":"","Pod":"","Owner":"","InstanceID":"47a32706-da82-4475-9bdf-25d4d4d8b7e4","WorkspaceID":"harlequin-horse-w3q83efn","Infringements":[{"Description":"egress traffic is 4111.466 megabytes over limit","Kind":"very excessive egress"}],"GitRemoteURL":["https://github.com/gitpod-io/template-typescript-node"]}}

@fntlnz
Contributor Author

fntlnz commented Jul 8, 2021

That seems very strange because I didn't change anything in that code path since your last review @csweichel - is that happening consistently? I just tried it again now and it worked for me, maybe there's something I'm not considering?

@csweichel
Contributor

That is most odd indeed. Let's get this in and observe the behaviour in the real world.

@fntlnz fntlnz merged commit f779dec into main Jul 9, 2021
@fntlnz fntlnz deleted the lf/agent-smith-egress-policing branch July 9, 2021 07:16
[DEPRECATED] Product Engineering Groundwork automation moved this from In Review to Awaiting Deployment Jul 9, 2021
@ArthurSens ArthurSens moved this from Awaiting Deployment to Done in [DEPRECATED] Product Engineering Groundwork Jul 13, 2021
2 participants