OSv vSphere deployment gives high CPU usage #1208

Open

sammekh opened this issue Aug 4, 2022 · 2 comments

@sammekh

sammekh commented Aug 4, 2022

Hi, I've reimplemented an application to run on a unikernel; for comparison, the same implementation is also running on an Ubuntu server. Both the unikernel and the VM are deployed on vSphere. The unikernel image was built using manifest_from_host.sh, as it's a Linux-based application.

While testing, I compared CPU usage, which is much higher on OSv than on the Ubuntu server. When sending requests, OSv CPU usage is around 75-80% and keeps increasing with the number of requests; for the Ubuntu server, CPU usage is around 25-30% for the same number of requests.
I've tried including the provided httpserver when building the image, but it gives the following error:
{"message": "Not found", "code": 404}
Is there any way I can check why the application is using this much CPU on the unikernel?

@wkozaczuk
Collaborator

Hi,

The httpserver API app that you are adding to your image (but which, as I understand it, does not work for you) has an endpoint /os/threads that can give you a top-like view of what is running, especially if you use the cli module command-line app.
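For example, something like this (a sketch only: the module name, the default port 8000 and the placeholders in angle brackets are the usual defaults of standard OSv builds, not details from your setup):

    # build the app image together with the REST API module
    # (called httpserver-api in current trees, plain httpserver in older ones)
    ./scripts/build image=<your-app>,httpserver-api

    # then query the thread statistics endpoint from the host;
    # the API listens on port 8000 by default - adjust for your deployment
    curl http://<osv-guest-ip>:8000/os/threads

The JSON 404 you quoted suggests that some server is answering, so it may simply be a question of which path (or port) is being requested.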

Regarding the high CPU utilization on vSphere, I wonder if it may be related to the idle polling count, which can lead to higher CPU utilization than it should be. Can you try lowering the count in do_idle() from 10000 to 1000, as this commit does - SpirentOrion@c6fc5f7 - and see if this helps? It is interesting that at some point this number was 100 but was increased to 10000 with this commit - 4e3177e.
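One way to try it (a sketch only - the fork URL is inferred from the commit reference above, and the build command and placeholders are the standard OSv ones, so adjust them to your tree):

    cd osv
    git remote add spirent https://github.com/SpirentOrion/osv.git
    git fetch spirent
    git cherry-pick c6fc5f7            # lowers the do_idle() polling count from 10000 to 1000
    ./scripts/build image=<your-app>   # rebuild the image, then redeploy it to vSphere

Then compare the CPU usage of the rebuilt image against what you measured before.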

Would it be possible for you to give me a recipe or makefile of your app so I can reproduce the issues you are seeing? I am especially curious about your inability to hit the httpserver API endpoints.

PS. Is the issue #1187 related to the same app?

@nyh
Contributor

nyh commented Aug 10, 2022

Are you worried about this "70%-80%" vs "25%-30%" number because you think it indicates lower performance, or because it wastes energy or host CPU?

If it's the first (higher CPU percentage supposedly indicating lower performance), I want to say this: in the distant past, on large multi-user timesharing systems, it was indeed common practice to benchmark software by feeding it a workload and checking the CPU time (or CPU percentage) it used. Faster software would use less CPU time to do the same work. But the story is different for a VM in the cloud. In the cloud, you don't pay less if you don't use all the CPU time you are given. If you perform one request and then go idle, the next request will have higher latency when it arrives. Additionally, the cost of these context switches ("exits" in hypervisor lingo, plus the later interrupts to wake up) is high, so if a server does 100,000 requests per second and does an exit after each one just to go idle, these exits will reduce performance.

This is why OSv has the idle loop which @wkozaczuk mentioned - before really exiting the guest, it deliberately spends some time proactively looking for more work and avoiding the exit (as well as the expensive interrupts to wake it up later). On an almost-idle server, the impact of this idle loop is negligible. But you are right that if your server does do 10,000 requests per second and idles between these requests, then the idle loops do add up. Of course, if you increase the load further to 100,000 requests per second (or whatever), you'll see fewer and fewer of these idle loops - because they only happen when OSv is idle.
So if you're worried about performance, then you should measure it directly, not CPU usage. Try to measure the peak throughput (you will then have 100% CPU usage, not 30% or 80%, but even then you should try to increase the throughput further until you can't any more), or the latency at some fixed throughput.
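As a concrete (hypothetical) example, assuming the app speaks HTTP, an HTTP load generator such as wrk reports both throughput and the latency distribution; the host, port and parameters below are placeholders:

    # drive the server with 4 threads and 64 connections for 60 seconds,
    # reporting requests/sec and latency percentiles
    wrk -t 4 -c 64 -d 60s --latency http://<guest-ip>:<port>/<endpoint>

    # to compare latency at a fixed throughput instead, use a rate-limiting
    # load generator (for example the wrk2 fork, which adds a -R requests/sec option)

Run the same load against both the OSv guest and the Ubuntu VM, and compare requests per second and latency percentiles rather than CPU percentage.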

If what is worrying you is the energy cost or CPU usage on a shared system, then @wkozaczuk is absolutely right - you can reduce the length of the idle loop (that "10000" number) to something much lower and hopefully this will help. This will reduce peak throughput somewhat, but you can measure how much this bothers you (and it won't bother you at all if you stay around 30% CPU usage, never reaching peak throughput).

Perhaps we should have made this "10000" number easier to configure, or documented it somewhere.
