
Changes the value of config vm_memballoon_stats_period to 60 #8520

Open · wants to merge 2 commits into base: main

Conversation

@RodrigoDLopez (Contributor) commented Jan 16, 2024

Description

Currently, the default value of the vm_memballoon_stats_period configuration is 0. As a result, out of the box, ACS memory metrics for Windows instances (with virtio drivers correctly installed and configured) are compromised when using the KVM hypervisor. This pull request proposes setting the vm_memballoon_stats_period configuration value to 60, addressing situations such as the one reported in issue #8453.

fixes: #8453
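
For context, this setting maps to the polling period of the virtio balloon driver in the libvirt domain definition. A minimal way to inspect the effect on a KVM host (the instance name below is hypothetical; virsh ships with libvirt):

    # With a non-zero period, the memballoon device of a newly started guest
    # should carry a <stats> element with the configured interval:
    virsh dumpxml i-2-10-VM | grep -A 2 '<memballoon'
    #   <memballoon model='virtio'>
    #     <stats period='60'/>
    #   </memballoon>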

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

  • Before changes: [screenshot]

  • After changes: [screenshot]

How Has This Been Tested?

A Windows VM was created, and it was confirmed that the VirtIO drivers were correctly installed. However, upon inspecting the metrics collected by ACS, it was observed that they did not align with the actual consumption of the VM. Subsequently, after changing the configuration value from 0 to 60, the metrics collected by ACS matched the consumption values reported by the VM.
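
To double-check the figures outside of ACS, the balloon statistics can be read directly from libvirt (domain name hypothetical; the --period flag sets the stats refresh interval on the live guest):

    # Ask the balloon driver to refresh its stats every 60 seconds, then read them;
    # with a period of 0, only host-side values such as 'actual' and 'rss' appear.
    virsh dommemstat --domain i-2-10-VM --period 60 --live
    virsh dommemstat i-2-10-VM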


codecov bot commented Jan 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c3b77cb) 18.28% compared to head (fa25d98) 30.80%.
Report is 5 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main    #8520       +/-   ##
=============================================
+ Coverage     18.28%   30.80%   +12.52%     
- Complexity    16826    33998    +17172     
=============================================
  Files          4848     5341      +493     
  Lines        324301   375027    +50726     
  Branches      45561    54554     +8993     
=============================================
+ Hits          59290   115525    +56235     
+ Misses       256437   244211    -12226     
- Partials       8574    15291     +6717     
Flag Coverage Δ
simulator-marvin-tests 24.68% <0.00%> (+6.39%) ⬆️
uitests 4.39% <ø> (?)
unit-tests 16.50% <100.00%> (?)


@DaanHoogland (Contributor)

@andrijapanicsb @alexandremattioli any ideas on this?

@weizhouapache (Member)

This value can be set manually, or by Ansible/Chef/Puppet.
I prefer to let users decide.

@andrijapanicsb (Contributor)

Many of the global settings are set to a certain value, many of them to 0, and they can be easily changed (like most other global settings). I prefer to keep it as it is and allow operators to decide what to change.

@RodrigoDLopez (Contributor, Author)

The value of this configuration was originally proposed as 60 to avoid misunderstandings like the ones operators are running into now. However, the value was changed, and the change was never clearly communicated to operators. As a result, numerous users may be experiencing the same issue reported in #8453. It might be a good idea to follow the steps taken with another configuration in 6c9e8a0, where the value of vm.stats.max.retention.time was changed from 0 to 720 for the sake of simplicity for users/operators. Therefore, I suggest doing the same here; if operators wish, they can modify the value back to 0.

BTW: using the value 0 makes the monitoring stop working in KVM.

@weizhouapache (Member)

> The value of this configuration was originally proposed as 60 to avoid misunderstandings like the ones operators are running into now. However, the value was changed, and the change was never clearly communicated to operators. As a result, numerous users may be experiencing the same issue reported in #8453. It might be a good idea to follow the steps taken with another configuration in 6c9e8a0, where the value of vm.stats.max.retention.time was changed from 0 to 720 for the sake of simplicity for users/operators. Therefore, I suggest doing the same here; if operators wish, they can modify the value back to 0.
>
> BTW: using the value 0 makes the monitoring stop working in KVM.

We are free to change the default value in a PR or before an official release. Once the configuration is included in an official release, we should be careful about changing the default value. @RodrigoDLopez

@weizhouapache (Member)

> BTW: using the value 0 makes the monitoring stop working in KVM.

Worth mentioning that I have faced an issue caused by it in the past: #8453 (comment).

Although @shaerul said it worked well in his testing, I think it would be better to enable/disable memory ballooning and statistics at the VM level.

@andrijapanicsb (Contributor)

(For what it's worth: I was told by a RH engineer in 2019 that memory auto-ballooning is an abandoned project, so be careful with the feature.)

@RodrigoDLopez changed the title from "Restores value of config vm_memballoon_stats_period back to 60" to "Changes the value of config vm_memballoon_stats_period to 60" on Jan 18, 2024
@RodrigoDLopez (Contributor, Author)

I believe that metrics monitoring should work without operators having to worry about configurations and enabling/disabling the feature. Changing the configuration value from '60' to '0' compromised the functionality for the KVM hypervisor, which in turn can make it seem as if it does not work out of the box. An example of this is operators lacking visibility into the need to add this configuration to the agent.properties file (see: #8453 (comment)).

From my perspective, using the value '60' resolves issues similar to those faced by @shaerul in #8453. What are the main impacts you foresee with this change?


Regarding the issue encountered during the migration of a Windows instance with the memory balloon enabled (#8453 (comment)), could you explain the error scenario in detail? This would allow me to attempt to reproduce it locally and, if necessary, extend this PR to address the issue related to the migration of Windows VMs with the memory balloon feature enabled.
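
A minimal sketch of the per-host workaround referenced above, assuming the agent.properties key name used by the CloudStack KVM agent (verify the exact key against your ACS version):

    # On each KVM host: set the balloon stats period and restart the agent.
    echo 'vm.memballoon.stats.period=60' >> /etc/cloudstack/agent/agent.properties
    systemctl restart cloudstack-agent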

@weizhouapache (Member)

> I believe that metrics monitoring should work without operators having to worry about configurations and enabling/disabling the feature. Changing the configuration value from '60' to '0' compromised the functionality for the KVM hypervisor, which in turn can make it seem as if it does not work out of the box. An example of this is operators lacking visibility into the need to add this configuration to the agent.properties file (see: #8453 (comment)).

In my opinion, the wrong memory usage is not a critical issue for operators.
I believe most cloud operators use automation tools (Chef, Ansible, Puppet, etc.), so it is very easy to apply the change on all KVM hosts.

> From my perspective, using the value '60' resolves issues similar to those faced by @shaerul in #8453. What are the main impacts you foresee with this change?

As I said, I have faced an issue in the past. I am not sure if the issue was caused by the Windows virtio drivers or by qemu/libvirt.
Anyway, we do not need to make this decision for users.
There is already a configuration for users to enable/disable it or set a different value.

> Regarding the issue encountered during the migration of a Windows instance with the memory balloon enabled (#8453 (comment)), could you explain the error scenario in detail? This would allow me to attempt to reproduce it locally and, if necessary, extend this PR to address the issue related to the migration of Windows VMs with the memory balloon feature enabled.

I do not remember the details very clearly; the KVM host was Ubuntu 18 or 20.
The issue was difficult to reproduce: it happened to some VMs with virtio drivers, large memory, and memory-intensive applications.
But I am sure it was caused by memory stats, because the issue went away after disabling memory stats on the Windows VMs.

cc @RodrigoDLopez

@GutoVeronezi (Contributor) left a comment

@andrijapanicsb @weizhouapache, what are your concerns with this change? Without it, a nice ACS feature that enables users to monitor their VMs via the UI does not work out of the box for KVM deployments, which can cause disappointment and misunderstandings like the ones already seen on the mailing list and in GitHub issues.

@weizhouapache (Member)

> What are your concerns with this change? Without it, a nice ACS feature that enables users to monitor their VMs via the UI does not work out of the box for KVM deployments, which can cause disappointment and misunderstandings like the ones already seen on the mailing list and in GitHub issues.

Can you please see my comments above? @GutoVeronezi

@GutoVeronezi (Contributor)

> Can you please see my comments above? @GutoVeronezi

@weizhouapache you said you have faced an issue in the past, but did not specify how to reproduce it or its details. On the other hand, you said that you are not sure if the issue was caused by the virtio drivers; therefore, we do not have a solid reason to not change it.

@weizhouapache (Member)

> Can you please see my comments above? @GutoVeronezi

> @weizhouapache you said you have faced an issue in the past, but did not specify how to reproduce it or its details. On the other hand, you said that you are not sure if the issue was caused by the virtio drivers; therefore, we do not have a solid reason to not change it.

@GutoVeronezi
I have mentioned this before, and I am sure it was caused by memory stats, because the issue went away after disabling memory stats on Windows VMs.

Anyone who wants to enable memory stats collection and is happy to take the risk, go for it. ACS already provides the option to do so.

@GutoVeronezi (Contributor)

> @GutoVeronezi I have mentioned this before, and I am sure it was caused by memory stats, because the issue went away after disabling memory stats on Windows VMs.
>
> Anyone who wants to enable memory stats collection and is happy to take the risk, go for it. ACS already provides the option to do so.

@weizhouapache do you have steps to reproduce it?

@weizhouapache (Member)

> @GutoVeronezi I have mentioned this before, and I am sure it was caused by memory stats, because the issue went away after disabling memory stats on Windows VMs.
> Anyone who wants to enable memory stats collection and is happy to take the risk, go for it. ACS already provides the option to do so.

> @weizhouapache do you have steps to reproduce it?

@GutoVeronezi
You know some issues are difficult to reproduce. The issue could be related to the template (Windows version, virtio drivers, etc.) and the host (Linux OS version, qemu/libvirt version), etc.
I did not face any issue with Linux VMs caused by it, so I made some changes to enable it on Linux VMs but disable it for Windows VMs.

@GutoVeronezi (Contributor)

> @GutoVeronezi You know some issues are difficult to reproduce. The issue could be related to the template (Windows version, virtio drivers, etc.) and the host (Linux OS version, qemu/libvirt version), etc. I did not face any issue with Linux VMs caused by it, so I made some changes to enable it on Linux VMs but disable it for Windows VMs.

But then we do not have a solid reason to not change the value, just speculation. If we could at least reproduce the error you say exists, we could look for the real cause and solve the problem (or at least find an explanation).

@weizhouapache (Member)

> @GutoVeronezi You know some issues are difficult to reproduce. The issue could be related to the template (Windows version, virtio drivers, etc.) and the host (Linux OS version, qemu/libvirt version), etc. I did not face any issue with Linux VMs caused by it, so I made some changes to enable it on Linux VMs but disable it for Windows VMs.

> But then we do not have a solid reason to not change the value, just speculation. If we could at least reproduce the error you say exists, we could look for the real cause and solve the problem (or at least find an explanation).

@GutoVeronezi
If you are running production, you might know that when we face an issue (a Windows VM freezing after live migration, in my case), the first thing we need to do is find the workaround that fixes the issue and then avoid it (in my case, the workaround was disabling memory stats).
Root cause? In my case, we would need to dive into the source code of QEMU and the virtio driver. The root cause is not important to me, as I can accept that a Windows VM reports wrong memory usage (100%), which is a minor issue, especially compared with a Windows VM freezing.

Changing vm_memballoon_stats_period to 60 will add a new setting to the VM definition, which will probably cause some regression that is out of our control. We should keep it disabled by default.

@GutoVeronezi (Contributor)

> Changing vm_memballoon_stats_period to 60 will add a new setting to the VM definition, which will probably cause some regression that is out of our control. We should keep it disabled by default.

But again, that is speculation. We have solid cases of operators having to contact the community to understand why their Windows VMs show wrong stats, and every time the solution is to change agent.properties. However, we do not have cases that corroborate the situation you said you faced a long time ago. Indeed, we do have cases showing that the aforementioned situation does not happen; see #8453 (comment). Therefore, we do not have a solid use case/reason to not change the value.

Please bring solid use cases so we can discuss it further.

@weizhouapache (Member)

> Changing vm_memballoon_stats_period to 60 will add a new setting to the VM definition, which will probably cause some regression that is out of our control. We should keep it disabled by default.

> But again, that is speculation. We have solid cases of operators having to contact the community to understand why their Windows VMs show wrong stats, and every time the solution is to change agent.properties. However, we do not have cases that corroborate the situation you said you faced a long time ago. Indeed, we do have cases showing that the aforementioned situation does not happen; see #8453 (comment). Therefore, we do not have a solid use case/reason to not change the value.
>
> Please bring solid use cases so we can discuss it further.

Interesting.
Your use case is solid.
My issue is not solid.
