
Changes the value of config vm_memballoon_stats_period to 60 #8520

Open · wants to merge 2 commits into base: main

Conversation

@RodrigoDLopez (Contributor) commented Jan 16, 2024

Description

Currently, the default value of the vm_memballoon_stats_period configuration is 0. As a result, out of the box, ACS memory metrics for Windows instances (with virtio drivers correctly installed and configured) are compromised when using the KVM hypervisor. This pull request proposes setting the vm_memballoon_stats_period configuration value to 60, addressing situations such as the one reported in issue #8453.

fixes: #8453
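
For context, this setting maps to the polling period of the virtio balloon driver in the libvirt domain definition. A minimal way to inspect the effect on a KVM host (the instance name below is hypothetical; virsh ships with libvirt):

    # With a non-zero period, the memballoon device of a newly started guest
    # should carry a <stats> element with the configured interval:
    virsh dumpxml i-2-10-VM | grep -A 2 '<memballoon'
    #   <memballoon model='virtio'>
    #     <stats period='60'/>
    #   </memballoon>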

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

  • Before changes: [screenshot]

  • After changes: [screenshot]

How Has This Been Tested?

A Windows VM was created, and it was confirmed that the VirtIO drivers were correctly installed. However, upon inspecting the metrics collected by ACS, it was observed that they did not align with the actual consumption of the VM. Subsequently, after changing the configuration value from 0 to 60, the metrics collected by ACS matched the consumption values reported by the VM.
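
To double-check the figures outside of ACS, the balloon statistics can be read directly from libvirt (domain name hypothetical; the --period flag sets the stats refresh interval on the live guest):

    # Ask the balloon driver to refresh its stats every 60 seconds, then read them;
    # with a period of 0, only host-side values such as 'actual' and 'rss' appear.
    virsh dommemstat --domain i-2-10-VM --period 60 --live
    virsh dommemstat i-2-10-VM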


codecov bot commented Jan 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c3b77cb) 18.28% compared to head (fa25d98) 30.80%.
Report is 5 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main    #8520       +/-   ##
=============================================
+ Coverage     18.28%   30.80%   +12.52%     
- Complexity    16826    33998    +17172     
=============================================
  Files          4848     5341      +493     
  Lines        324301   375027    +50726     
  Branches      45561    54554     +8993     
=============================================
+ Hits          59290   115525    +56235     
+ Misses       256437   244211    -12226     
- Partials       8574    15291     +6717     
Flag Coverage Δ
simulator-marvin-tests 24.68% <0.00%> (+6.39%) ⬆️
uitests 4.39% <ø> (?)
unit-tests 16.50% <100.00%> (?)


@DaanHoogland (Contributor)

@andrijapanicsb @alexandremattioli any ideas on this?

@weizhouapache (Member)

This value can be set manually, or by Ansible/Chef/Puppet.
I prefer to let users decide.

@andrijapanicsb (Contributor)

Many of the global settings are set to a certain value, many of them to 0, and they can be easily changed (like most other global settings). I prefer to keep it as it is and allow operators to decide what to change.

@RodrigoDLopez (Contributor, Author)

The value of this configuration was originally proposed as 60 to avoid misunderstandings like the ones operators are running into now. However, the value was changed, and the change was never clearly communicated to operators. As a result, numerous users may be experiencing the same issue reported in #8453. It might be a good idea to follow the steps taken with another configuration in 6c9e8a0, where the value of vm.stats.max.retention.time was changed from 0 to 720 for the sake of simplicity for users/operators. Therefore, I suggest doing the same here; if operators wish, they can modify the value back to 0.

BTW: using the value 0 makes the monitoring stop working in KVM.

@weizhouapache (Member)

> The value of this configuration was originally proposed as 60 to avoid misunderstandings like the ones operators are running into now. However, the value was changed, and the change was never clearly communicated to operators. As a result, numerous users may be experiencing the same issue reported in #8453. It might be a good idea to follow the steps taken with another configuration in 6c9e8a0, where the value of vm.stats.max.retention.time was changed from 0 to 720 for the sake of simplicity for users/operators. Therefore, I suggest doing the same here; if operators wish, they can modify the value back to 0.
>
> BTW: using the value 0 makes the monitoring stop working in KVM.

We are free to change the default value in a PR or before an official release. Once the configuration is included in an official release, we should be careful about changing the default value. @RodrigoDLopez

@weizhouapache (Member)

> BTW: using the value 0 makes the monitoring stop working in KVM.

Worth mentioning that I have faced an issue caused by it in the past: #8453 (comment).

Although @shaerul said it worked well in his testing, I think it would be better to enable/disable memory ballooning and statistics at the VM level.

@andrijapanicsb (Contributor)

(For what it's worth: I was told by a RH engineer in 2019 that memory auto-ballooning is an abandoned project, so be careful with the feature.)

@RodrigoDLopez changed the title from "Restores value of config vm_memballoon_stats_period back to 60" to "Changes the value of config vm_memballoon_stats_period to 60" on Jan 18, 2024
@RodrigoDLopez (Contributor, Author)

I believe that metrics monitoring should work without operators having to worry about configurations and enabling/disabling the feature. Changing the configuration value from '60' to '0' compromised the functionality for the KVM hypervisor, which in turn can make it seem as if it does not work out of the box. An example of this is operators lacking visibility into the need to add this configuration to the agent.properties file (see: #8453 (comment)).

From my perspective, using the value '60' resolves issues similar to those faced by @shaerul in #8453. What are the main impacts you foresee with this change?


Regarding the issue encountered during the migration of a Windows instance with the memory balloon enabled (#8453 (comment)), could you explain the error scenario in detail? This would allow me to attempt to reproduce it locally and, if necessary, extend this PR to address the issue related to the migration of Windows VMs with the memory balloon feature enabled.
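
A minimal sketch of the per-host workaround referenced above, assuming the agent.properties key name used by the CloudStack KVM agent (verify the exact key against your ACS version):

    # On each KVM host: set the balloon stats period and restart the agent.
    echo 'vm.memballoon.stats.period=60' >> /etc/cloudstack/agent/agent.properties
    systemctl restart cloudstack-agent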

@weizhouapache (Member)

> I believe that metrics monitoring should work without operators having to worry about configurations and enabling/disabling the feature. Changing the configuration value from '60' to '0' compromised the functionality for the KVM hypervisor, which in turn can make it seem as if it does not work out of the box. An example of this is operators lacking visibility into the need to add this configuration to the agent.properties file (see: #8453 (comment)).

In my opinion, the wrong memory usage is not a critical issue for operators.
I believe most cloud operators use automation tools (Chef, Ansible, Puppet, etc.), so it is very easy to apply the change on all KVM hosts.

> From my perspective, using the value '60' resolves issues similar to those faced by @shaerul in #8453. What are the main impacts you foresee with this change?

As I said, I have faced an issue in the past. I am not sure if the issue was caused by the Windows virtio drivers or by qemu/libvirt.
Anyway, we do not need to make this decision for users.
There is already a configuration for users to enable/disable it or set a different value.

> Regarding the issue encountered during the migration of a Windows instance with the memory balloon enabled (#8453 (comment)), could you explain the error scenario in detail? This would allow me to attempt to reproduce it locally and, if necessary, extend this PR to address the issue related to the migration of Windows VMs with the memory balloon feature enabled.

I do not remember the details very clearly; the KVM host was Ubuntu 18 or 20.
The issue was difficult to reproduce: it happened to some VMs with virtio drivers, large memory, and memory-intensive applications.
But I am sure it was caused by memory stats, because the issue went away after disabling memory stats on the Windows VMs.

cc @RodrigoDLopez

@GutoVeronezi (Contributor) left a comment

@andrijapanicsb @weizhouapache, what are your concerns with this change? Without it, a nice ACS feature that enables users to monitor their VMs via the UI does not work out of the box for KVM deployments, which can cause disappointment and misunderstandings like the ones already seen on the mailing list and in GitHub issues.

@weizhouapache (Member)

> What are your concerns with this change? Without it, a nice ACS feature that enables users to monitor their VMs via the UI does not work out of the box for KVM deployments, which can cause disappointment and misunderstandings like the ones already seen on the mailing list and in GitHub issues.

Can you please see my comments above? @GutoVeronezi

@GutoVeronezi (Contributor)

> Can you please see my comments above? @GutoVeronezi

@weizhouapache you said you have faced an issue in the past, but did not specify how to reproduce it or its details. On the other hand, you said that you are not sure if the issue was caused by the virtio drivers; therefore, we do not have a solid reason to not change it.

@weizhouapache (Member)

> Can you please see my comments above? @GutoVeronezi

> @weizhouapache you said you have faced an issue in the past, but did not specify how to reproduce it or its details. On the other hand, you said that you are not sure if the issue was caused by the virtio drivers; therefore, we do not have a solid reason to not change it.

@GutoVeronezi
I have mentioned this before, and I am sure it was caused by memory stats, because the issue went away after disabling memory stats on Windows VMs.

Anyone who wants to enable memory stats collection and is happy to take the risk, go for it. ACS already provides the option to do so.

@GutoVeronezi (Contributor)

> @GutoVeronezi I have mentioned this before, and I am sure it was caused by memory stats, because the issue went away after disabling memory stats on Windows VMs.
>
> Anyone who wants to enable memory stats collection and is happy to take the risk, go for it. ACS already provides the option to do so.

@weizhouapache do you have steps to reproduce it?

@weizhouapache (Member)

> @GutoVeronezi I have mentioned this before, and I am sure it was caused by memory stats, because the issue went away after disabling memory stats on Windows VMs.
> Anyone who wants to enable memory stats collection and is happy to take the risk, go for it. ACS already provides the option to do so.

> @weizhouapache do you have steps to reproduce it?

@GutoVeronezi
You know some issues are difficult to reproduce. The issue could be related to the template (Windows version, virtio drivers, etc.) and the host (Linux OS version, qemu/libvirt version), etc.
I did not face any issue with Linux VMs caused by it, so I made some changes to enable it on Linux VMs but disable it for Windows VMs.

@GutoVeronezi (Contributor)

> @GutoVeronezi You know some issues are difficult to reproduce. The issue could be related to the template (Windows version, virtio drivers, etc.) and the host (Linux OS version, qemu/libvirt version), etc. I did not face any issue with Linux VMs caused by it, so I made some changes to enable it on Linux VMs but disable it for Windows VMs.

But then we do not have a solid reason to not change the value, just speculation. If we could at least reproduce the error you say exists, we could look for the real cause and solve the problem (or at least find an explanation).

@weizhouapache (Member)

> @GutoVeronezi You know some issues are difficult to reproduce. The issue could be related to the template (Windows version, virtio drivers, etc.) and the host (Linux OS version, qemu/libvirt version), etc. I did not face any issue with Linux VMs caused by it, so I made some changes to enable it on Linux VMs but disable it for Windows VMs.

> But then we do not have a solid reason to not change the value, just speculation. If we could at least reproduce the error you say exists, we could look for the real cause and solve the problem (or at least find an explanation).

@GutoVeronezi
If you are running production, you might know that when we face an issue (a Windows VM freezing after live migration, in my case), the first thing we need to do is find the workaround that fixes the issue and then avoid it (in my case, the workaround was disabling memory stats).
Root cause? In my case, we would need to dive into the source code of QEMU and the virtio driver. The root cause is not important to me, as I can accept that a Windows VM reports wrong memory usage (100%), which is a minor issue, especially compared with a Windows VM freezing.

Changing vm_memballoon_stats_period to 60 will add a new setting to the VM definition, which will probably cause some regression that is out of our control. We should keep it disabled by default.

@GutoVeronezi (Contributor)

> Changing vm_memballoon_stats_period to 60 will add a new setting to the VM definition, which will probably cause some regression that is out of our control. We should keep it disabled by default.

But again, that is speculation. We have solid cases of operators having to contact the community to understand why their Windows VMs show wrong stats, and every time the solution is to change agent.properties. However, we do not have cases that corroborate the situation you said you faced a long time ago. Indeed, we do have cases showing that the aforementioned situation does not happen; see #8453 (comment). Therefore, we do not have a solid use case/reason to not change the value.

Please bring solid use cases so we can discuss it further.

@weizhouapache (Member)

> Changing vm_memballoon_stats_period to 60 will add a new setting to the VM definition, which will probably cause some regression that is out of our control. We should keep it disabled by default.

> But again, that is speculation. We have solid cases of operators having to contact the community to understand why their Windows VMs show wrong stats, and every time the solution is to change agent.properties. However, we do not have cases that corroborate the situation you said you faced a long time ago. Indeed, we do have cases showing that the aforementioned situation does not happen; see #8453 (comment). Therefore, we do not have a solid use case/reason to not change the value.
>
> Please bring solid use cases so we can discuss it further.

Interesting.
Your use case is solid.
My issue is not solid.
