Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BBS] Use scheduling info instead of the whole desiredLRP #883

Closed
klapkov opened this issue Jan 8, 2024 · 5 comments
Closed

[BBS] Use scheduling info instead of the whole desiredLRP #883

klapkov opened this issue Jan 8, 2024 · 5 comments

Comments

@klapkov
Copy link
Contributor

klapkov commented Jan 8, 2024

Use scheduling info instead of the whole desiredLRP

Summary

There are two places where I see we can make some optimisations:
When crashing a ActualLRP:
https://github.com/cloudfoundry/bbs/blob/main/controllers/actual_lrp_lifecycle_controller.go#L207

In requestAuction() which is called when we evacuate a Claimed or Running actualLRP: https://github.com/cloudfoundry/bbs/blob/main/controllers/evacuation_controller.go#L387

In both cases we fetch a whole desiredLRP and use only the scheduling info. This could be avoided by directly fetching the scheduling info only. This way we in cases where the RunInfo of an desiredLRP is really large in size, we will not waist resources and time fetching in when we do not use it. The change is small, but I can see some improvements to be gained here.

The only point against the change I see is that sadly the DesiredLRPSchedulingInfo returns an array. so we need to check if it is empty and use the first element if not. This maybe introduce a little bit of confusion and the code is not as clean as before. Anyhow, if anyone has ideas on how to better do this, please leave a comment. Will try to collect some performance data in the following days to share as well.

Diego repo

https://github.com/cloudfoundry/bbs

Describe alternatives you've considered (optional)

Proposal PR: cloudfoundry/bbs#79

@winkingturtle-vmw
Copy link
Contributor

@klapkov Can you please quantify the improvements? I'd like to see a before/after analysis to better understand what we are gaining with this change.

@dsabeti
Copy link
Contributor

dsabeti commented Feb 20, 2024

Hey @klapkov. I reviewed the PR with @ebroberson , and besides a few style changes (e.g. test descriptions, variable names), we think the PR looks good. We'd definitely like to understand the performance improvements as @winkingturtle-vmw mentioned. Once we see some data about that, we'll be happy to merge the PR.

@klapkov
Copy link
Contributor Author

klapkov commented Feb 29, 2024

Hello @dsabeti and @winkingturtle-vmw,

Sorry for the late response. I did some tests to see what would we gain with this change.

What I did was deploy 150 applications - each with 3 instances and around 2000 app security groups. This means that app of those applications have a very large run_info. Then I tracked how much time it takes to fetch a desiredLRP in the evacuation controller like this: klapkov/bbs@b42b9f6

I did this for both the current way with desiredLRPByProcessGuid and then with the new endpoint. I triggered a evacuation and observed the logs.

Here are the results from 500 evacuating lrp's:

Using desiredLRPByProcessGuid:

  | Feb 28, 2024 @ 23:38:38.156 | bbs | {"timestamp":"2024-02-28T23:38:38.156950563Z","level":"info","source":"bbs","message":"bbs.request.evacuate-claimed-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61034.1","time":1.312609736}}

  | Feb 28, 2024 @ 23:38:38.052 | bbs | {"timestamp":"2024-02-28T23:38:38.052007918Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61096.1","time":1.395905079}}

  | Feb 28, 2024 @ 23:38:38.051 | bbs | {"timestamp":"2024-02-28T23:38:38.051072004Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61093.1","time":0.969127116}}

  | Feb 28, 2024 @ 23:38:38.042 | bbs | {"timestamp":"2024-02-28T23:38:38.042025336Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61073.1","time":1.556825014}}

  | Feb 28, 2024 @ 23:38:38.022 | bbs | {"timestamp":"2024-02-28T23:38:38.022147510Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61042.1","time":2.840410677}}

  | Feb 28, 2024 @ 23:38:37.910 | bbs | {"timestamp":"2024-02-28T23:38:37.910684925Z","level":"info","source":"bbs","message":"bbs.request.evacuate-claimed-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60977.1","time":2.8429662799999997}}

  | Feb 28, 2024 @ 23:38:37.879 | bbs | {"timestamp":"2024-02-28T23:38:37.879589211Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61063.1","time":1.235602573}}

  | Feb 28, 2024 @ 23:38:37.879 | bbs | {"timestamp":"2024-02-28T23:38:37.879539399Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60968.1","time":1.688915266}}

  | Feb 28, 2024 @ 23:38:37.847 | bbs | {"timestamp":"2024-02-28T23:38:37.847163637Z","level":"info","source":"bbs","message":"bbs.request.evacuate-claimed-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61068.1","time":2.727919954}}

  | Feb 28, 2024 @ 23:38:37.846 | bbs | {"timestamp":"2024-02-28T23:38:37.846022114Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60984.1","time":1.281320477}}

  | Feb 28, 2024 @ 23:38:37.846 | bbs | {"timestamp":"2024-02-28T23:38:37.846547979Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60980.1","time":2.731387767}}

  | Feb 28, 2024 @ 23:38:37.846 | bbs | {"timestamp":"2024-02-28T23:38:37.846022090Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61019.1","time":2.120372531}}

  | Feb 28, 2024 @ 23:38:37.820 | bbs | {"timestamp":"2024-02-28T23:38:37.820867847Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61059.1","time":1.179043058}}

  | Feb 28, 2024 @ 23:38:37.816 | bbs | {"timestamp":"2024-02-28T23:38:37.816351230Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60960.1","time":2.585209338}}

  | Feb 28, 2024 @ 23:38:37.816 | bbs | {"timestamp":"2024-02-28T23:38:37.816698606Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61001.1","time":1.173293566}}

The average time for 500 LRP's was: 1.48s per LRP

And not here is what I saw using the new endpoint:

  | Feb 29, 2024 @ 09:57:36.834 | bbs | {"timestamp":"2024-02-29T09:57:36.834070858Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69788.1","time":0.053956742}}


  | Feb 29, 2024 @ 09:57:36.832 | bbs | {"timestamp":"2024-02-29T09:57:36.832609309Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69840.1","time":0.05343809}}

  | Feb 29, 2024 @ 09:57:36.831 | bbs | {"timestamp":"2024-02-29T09:57:36.831256636Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69811.1","time":0.048165686}}

  | Feb 29, 2024 @ 09:57:36.831 | bbs | {"timestamp":"2024-02-29T09:57:36.831095012Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69803.1","time":0.063661647}}

  | Feb 29, 2024 @ 09:57:36.829 | bbs | {"timestamp":"2024-02-29T09:57:36.829210118Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69819.1","time":0.063480223}}

  | Feb 29, 2024 @ 09:57:36.828 | bbs | {"timestamp":"2024-02-29T09:57:36.828220971Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69781.1","time":0.055027243}}

  | Feb 29, 2024 @ 09:57:36.828 | bbs | {"timestamp":"2024-02-29T09:57:36.828688531Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69793.1","time":0.055143658}}

  | Feb 29, 2024 @ 09:57:36.827 | bbs | {"timestamp":"2024-02-29T09:57:36.827719274Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69818.1","time":0.061275584}}

  | Feb 29, 2024 @ 09:57:36.826 | bbs | {"timestamp":"2024-02-29T09:57:36.826195576Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69785.1","time":0.056747965}}

  | Feb 29, 2024 @ 09:57:36.826 | bbs | {"timestamp":"2024-02-29T09:57:36.826666573Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69762.1","time":0.051233733}}

  | Feb 29, 2024 @ 09:57:36.825 | bbs | {"timestamp":"2024-02-29T09:57:36.825402685Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69801.1","time":0.048784051}}

  | Feb 29, 2024 @ 09:57:36.824 | bbs | {"timestamp":"2024-02-29T09:57:36.824379962Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69776.1","time":0.046138081}}

The average time was 0.058s.

So we can see that there is some time we can save here. I also imagine that the load on BBS and the diegoDB will be decreased as well, since we would not need to fetch the large run_info in such cases.

Please let me know if you need anything more. We think this change is worth it.

Regards,
Konstantin Lapkov

@ebroberson
Copy link
Contributor

This looks great! I made some small changes for styling and have merged the PR.

@klapkov
Copy link
Contributor Author

klapkov commented Feb 29, 2024

Thanks @ebroberson !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

4 participants