Improve System VM startup and memory usage #3126

Merged: 7 commits, Jun 26, 2019

Conversation

PaulAngus
Member

@PaulAngus PaulAngus commented Jan 11, 2019

Description

To reduce the memory footprint and improve boot speed and predictability, the following changes have been made:

  • Add vm.min_free_kbytes to sysctl
  • periodically clear disk cache (depending on memory size)
  • only start guest services specific to hypervisor
  • use systemvm code to determine hypervisor type (not systemd)
  • start cloud service at end of post init rather than through systemd
  • reduce initial threads started for httpd
  • fix vmtools config file
  • disable all required services (do not start on boot)
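
The cache-clearing item above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names, memory thresholds, and intervals are all hypothetical. The only things assumed to be real are the standard Linux interfaces `/proc/meminfo` (for `MemTotal`) and `/proc/sys/vm/drop_caches`.

```python
import os
import re
from typing import Optional


def mem_total_kb(meminfo_text: str) -> int:
    """Extract MemTotal (in kB) from the contents of /proc/meminfo."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found in meminfo text")
    return int(match.group(1))


def cache_drop_interval_minutes(total_kb: int) -> Optional[int]:
    """Pick how often to clear the page cache; None means never.

    Hypothetical policy: these thresholds are illustrative and are
    not taken from the PR.
    """
    if total_kb <= 256 * 1024:
        return 15
    if total_kb <= 512 * 1024:
        return 60
    return None


def drop_page_cache(path: str = "/proc/sys/vm/drop_caches") -> None:
    """Flush dirty pages, then ask the kernel to free clean page cache.

    Requires root. Writing "1" frees the page cache only.
    """
    os.sync()
    with open(path, "w") as handle:
        handle.write("1\n")
```

A scheduler (cron, a systemd timer, or a loop) would call `drop_page_cache()` at the chosen interval; which mechanism the system VM actually uses is not stated in this description.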

Some changes require a new systemvmtemplate, which can be done in 4.14:

  • start only required services during post init.
  • allow one-shot ACS configuration services to terminate after running.
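
As a rough sketch of starting only the guest services that match the detected hypervisor: the mapping below is an assumption for illustration. The hypervisor type strings and the service names are hypothetical placeholders and may not match what the system VM template ships; only the `systemctl start <unit>` command form is standard.

```python
from typing import Dict, List

# Hypothetical mapping from hypervisor type to guest service units.
# The keys and unit names are assumptions, not taken from the PR.
GUEST_SERVICES: Dict[str, List[str]] = {
    "vmware": ["open-vm-tools"],
    "xen": ["xe-daemon"],
    "kvm": ["qemu-guest-agent"],
    "hyperv": ["hyperv-daemons"],
}


def start_commands(hypervisor: str) -> List[List[str]]:
    """Build the systemctl commands for one hypervisor type.

    Unknown hypervisors yield no commands, so nothing extra is started.
    """
    services = GUEST_SERVICES.get(hypervisor.lower(), [])
    return [["systemctl", "start", name] for name in services]
```

Returning the commands instead of running them keeps the selection logic easy to check; a caller could pass each list to `subprocess.run`.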

Fixes: #3039

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

Changes have been subjected to regression testing and a burn-in test in which port-forwarding rules were constantly updated over a period of 3 days. No swapping occurred in a VR with 256 MB of RAM.
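
One way to check for swapping over a test window is to compare the kernel's cumulative swap counters before and after. The sketch below is an assumption about how such a check could be done, not how the test above was actually performed. It relies on the `pswpin` and `pswpout` fields of `/proc/vmstat`; the function names are hypothetical.

```python
from typing import Dict


def swap_counters(vmstat_text: str) -> Dict[str, int]:
    """Return cumulative pswpin/pswpout page counts from /proc/vmstat text."""
    counters = {"pswpin": 0, "pswpout": 0}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in counters:
            counters[parts[0]] = int(parts[1])
    return counters


def swapped_between(before: str, after: str) -> bool:
    """True if any pages were swapped in or out between two snapshots."""
    old, new = swap_counters(before), swap_counters(after)
    return any(new[key] > old[key] for key in old)
```

Taking one snapshot of `/proc/vmstat` at the start of the burn-in and one at the end would be enough, since the counters only increase while the VM is up.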

@PaulAngus
Member Author

@blueorangutan package

@blueorangutan

@PaulAngus a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2529

@PaulAngus
Member Author

@blueorangutan test matrix

@blueorangutan

@PaulAngus a Trillian-Jenkins matrix job (centos6 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@wido
Contributor

wido commented Jan 24, 2019

See my comments here: #3127

I would not touch those sysctl settings

@GabrielBrascher GabrielBrascher modified the milestones: 4.12.0.0, 5.0.0.0 Feb 5, 2019
@apache apache deleted a comment from blueorangutan Mar 26, 2019
@apache apache deleted a comment from blueorangutan Mar 26, 2019
@apache apache deleted a comment from blueorangutan Mar 26, 2019
@PaulAngus
Member Author

[Screenshot: Annotation 2019-03-28 134017]

Continuing to monitor the above VR while running:
while true; do ssh -p3922 -i /var/cloudstack/management/.ssh/id_rsa root@10.2.7.101 "echo hi"; done
Memory usage growth is negligible.

@PaulAngus
Member Author

PaulAngus commented Apr 1, 2019

RESULTS of tests
The following script was left running from the management server:
while true; do ssh -p3922 -i /var/cloudstack/management/.ssh/id_rsa root@10.2.7.101 "echo hi"; done

The memory usage was then monitored by periodically running htop from the VR console.

Uptime      Memory Usage
08:50       82.3 MB
1d 00:43    94.3 MB
1d 04:51    94.8 MB
1d 22:57    97.4 MB
2d 07:42    91.9 MB
5d 05:35    92.0 MB

I believe that the variance is due to the timing of the check relative to the disk cache flushing.
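
To put numbers on that variation, the readings above can be summarized with a short script. This is a sketch under assumptions: the first uptime value is read as hours:minutes, and "warm" is a label introduced here for every sample after the first one. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (uptime, memory usage in MB) copied from the readings above.
SAMPLES: List[Tuple[str, float]] = [
    ("08:50", 82.3),
    ("1d 00:43", 94.3),
    ("1d 04:51", 94.8),
    ("1d 22:57", 97.4),
    ("2d 07:42", 91.9),
    ("5d 05:35", 92.0),
]


def uptime_hours(uptime: str) -> float:
    """Convert 'Nd HH:MM' or 'HH:MM' to hours (assumes hours:minutes)."""
    days = 0
    clock = uptime
    if "d" in uptime:
        day_part, clock = uptime.split("d")
        days = int(day_part)
    hours, minutes = clock.strip().split(":")
    return days * 24 + int(hours) + int(minutes) / 60


def summarize(samples: List[Tuple[str, float]]) -> Dict[str, float]:
    """Net growth over the run and the spread among samples after the first."""
    usage = [mb for _, mb in samples]
    warm = usage[1:]
    return {
        "net_growth_mb": round(usage[-1] - usage[0], 1),
        "warm_spread_mb": round(max(warm) - min(warm), 1),
        "span_hours": uptime_hours(samples[-1][0]) - uptime_hours(samples[0][0]),
    }


if __name__ == "__main__":
    print(summarize(SAMPLES))
```

The summary separates the jump between the first and second readings from the fluctuation among the later ones; it does not say anything about the cause of either.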

IMO we can consider this a successful test.

cc @andrijapanic @onitake

@andrijapanicsb
Contributor

Based on the original issues reported, this LGTM.

I'm also OK with those sysctl settings: the kernel should do its job, but in some cases it doesn't (I had a silly experience with the same fix on systems with 64 GB of RAM running GFS2).

@onitake
Contributor

onitake commented Apr 7, 2019

I wonder... If swapping is such a big concern, why not simply turn off swap completely?
If the VM is under so much pressure that it needs to swap, it's probably better to just let it OOM-kill and reboot itself right away than to suffer terrible performance due to heavy I/O load.

Paul Angus and others added 2 commits June 24, 2019 13:51
Add vm.min_free_kbytes to sysctl
periodically clear disk cache (depending on memory size)
only start guest services specific to hypervisor
use systemvm code to determine hypervisor type (not systemd)
start cloud service at end of post init rather than through systemd
reduce initial threads started for httpd
fix vmtools config file
disable all required services (do not start on boot)
start only required services during post init.

add '@include null' to /etc/pam.d/systemd-user
as per systemd/systemd#8015 (comment)

remove cloud agent service startup from VR
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@rohityadavcloud
Member

Rebased to latest master and fixed conflicts.
I think @PaulAngus intends to get rid of the postinit service, but any change to the cloud-early-config systemd service would require creating a new systemvmtemplate. So I propose that changes requiring systemd/cloud-early-config modifications be addressed in 4.14, where we'll need a new Debian 10 based systemvmtemplate.

I'll run some tests; this is still WIP.

@rohityadavcloud rohityadavcloud changed the title Improve System VM startup and memory usage [WIP] Improve System VM startup and memory usage Jun 24, 2019
Description=Service for virtual machines hosted on VMware
Documentation=http://open-vm-tools.sourceforge.net/about.php
DefaultDependencies=no
Before=cloud-init-local.service
Member

@PaulAngus I see references to cloud-init-local.service and cloud-init.service, but they are not in the codebase or the template. Is there a file to be added, or did you mean cloud-early-config.service?

@rohityadavcloud rohityadavcloud changed the title [WIP] Improve System VM startup and memory usage Improve System VM startup and memory usage Jun 25, 2019
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@rohityadavcloud
Member

Pinging for review - @nvazquez @shwstppr @anuragaw @borisstoyanov @andrijapanic
@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

Member

@rohityadavcloud rohityadavcloud left a comment

LGTM pending regression testing.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-19

@borisstoyanov
Contributor

@blueorangutan test matrix

@blueorangutan

@borisstoyanov a Trillian-Jenkins matrix job (centos6 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-20)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33083 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3126-t20-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 70 look OK, 1 have error(s)
Only failed tests results shown below:

Test                                            Result  Time (s)  Test File
test_hostha_enable_ha_when_host_in_maintenance  Error   302.61    test_hostha_kvm.py

@blueorangutan

Trillian test result (tid-21)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 39680 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3126-t21-vmware-65u2.zip
Intermittent failure detected: /marvin/tests/smoke/test_routers.py
Smoke tests completed. 71 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan

Trillian test result (tid-19)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 40239 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3126-t19-xenserver-71.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_password_server.py
Intermittent failure detected: /marvin/tests/smoke/test_scale_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Smoke tests completed. 70 look OK, 1 have error(s)
Only failed tests results shown below:

Test              Result   Time (s)  Test File
test_01_scale_vm  Failure  47.16     test_scale_vm.py

Contributor

@borisstoyanov borisstoyanov left a comment

LGTM based on test results: one failure is a known issue, the other is related to Xen licensing.

@rohityadavcloud rohityadavcloud merged commit 0331999 into apache:master Jun 26, 2019
@rohityadavcloud
Member

Created this issue to track further changes: #3426


Successfully merging this pull request may close these issues.

VMware 4.11.2.0 system VMs memory consumption grows overtime until heavy swapping occurs
8 participants