Get Diagnostics: Download logs and diagnostics data from SSVM, CPVM, Router #3350

Merged

PaulAngus
Member

@PaulAngus PaulAngus commented May 23, 2019

Description

This implements a new feature to get logs and diagnostics data from system VMs (CPVM, SSVM) and virtual routers as a downloadable zip file from the secondary storage. The diagnostics zip files live in the diagnostics directory of the secondary storage. The feature is supported only for NFS-based secondary storage and is available only to root admins.

FS: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Get+Diagnostics+Data+API

The feature adds the following global settings:

diagnostics.data.gc.enable
diagnostics.data.gc.interval
diagnostics.data.retrieval.timeout
diagnostics.data.max.file.age
diagnostics.data.disable.threshold
diagnostics.data.systemvm.defaults
diagnostics.data.router.defaults
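
As a rough illustration of how the garbage-collection settings might interact, the sketch below removes archives older than a maximum age from a diagnostics directory. It is not the code from this PR: the function name `purge_old_archives`, the assumption that `diagnostics.data.max.file.age` is expressed in seconds, and the use of file modification time as the age criterion are all assumptions made for illustration.

```python
import os
import time


def purge_old_archives(directory, max_age_seconds, now=None):
    """Delete regular files in `directory` older than `max_age_seconds`.

    Hypothetical helper: the real GC task in CloudStack is written in Java
    and may use different criteria.
    """
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # leave sub-directories alone
        age = now - os.path.getmtime(path)
        if age > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

A periodic task, driven by something like `diagnostics.data.gc.interval`, could call such a helper only when `diagnostics.data.gc.enable` is true.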

In the UI, the root admin can now see a download button to get the diagnostics data from CPVM, SSVM and VRs.

Fixes: #3593

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

Download icon in menu:
image

Prompt to override default files:
image

Download link:
image

Files in the tarball:
image

How Has This Been Tested?

Manual testing of downloading diagnostics for VRs, SSVMs and CPVMs

Contributor

@DaanHoogland DaanHoogland left a comment

One structural comment: the use of polymorphism in the DiagnosticsFiles fileprocessor package may not be the most efficient, but it seems functional. Heavy testing is needed too, but both unit and Marvin tests are provided.

@rohityadavcloud
Member

Design review:

  • Large archives transported via the cmd-answer pattern can cause OOM in both the management server and the SSVM/KVM agent
  • The feature could benefit from an agnostic distributed file-sharing manager/service (for example, one based on BitTorrent or rsync+ssh)

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2788

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3591)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 35133 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3350-t3591-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 70 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@nvazquez
Contributor

nvazquez commented Jun 8, 2019

@rhtyd the current implementation uses the cmd-answer pattern to indicate which files need to be copied, but copies the files over SCP using the com.trilead.ssh2.SCPClient library. Can you explain a bit further how you think it should be redesigned?

@nvazquez nvazquez force-pushed the retrieve-diagnostics-data-rebase branch from d2f3096 to a3067f5 Compare June 10, 2019 13:54
@rohityadavcloud
Member

@nvazquez let me revisit this PR soon; I need to check between which points we're scp-ing. The VR logs can grow to as much as 0.5 GB in size, so if we're passing the payload via the cmd-answer pattern it can potentially cause out-of-memory issues in the JVM process/agent that does this.
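
To make the memory concern concrete: holding a whole archive in a single in-memory payload needs memory proportional to the file size, whereas a streamed copy needs only one buffer at a time. The sketch below is a generic Python illustration, not CloudStack code; the name `copy_in_chunks` and the 1 MiB default buffer are made up for the example.

```python
import io


def copy_in_chunks(src, dst, chunk_size=1024 * 1024):
    """Copy from file-like `src` to `dst`, reading at most `chunk_size` bytes at a time.

    Returns the total number of bytes copied.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break  # end of stream
        dst.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory streams stand in for a log file and a network/file target.
    source = io.BytesIO(b"x" * (3 * 1024 * 1024 + 5))
    target = io.BytesIO()
    print(copy_in_chunks(source, target))
```

With this shape, peak memory use is bounded by `chunk_size` regardless of how large the log archive grows.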

@rohityadavcloud rohityadavcloud self-assigned this Jun 18, 2019
@borisstoyanov borisstoyanov changed the title Retrieve diagnostics data rebase WIP:Retrieve diagnostics data rebase Jun 21, 2019
@rohityadavcloud
Member

Rebased against master, will test and review the design and implementation.

@rohityadavcloud rohityadavcloud changed the title WIP:Retrieve diagnostics data rebase [WIP] Retrieve diagnostics data rebase Jul 1, 2019
@rohityadavcloud
Member

Fixed several issues with the implementation. Two global settings essentially configure the lists of CPVM/SSVM files and router files, which are hosted in the diagnostics directory of the NFS-based secondary storage.

Global settings added by this PR:

diagnostics.data.gc.enable
diagnostics.data.gc.interval
diagnostics.data.retrieval.timeout
diagnostics.data.max.file.age
diagnostics.data.disable.threshold
diagnostics.data.systemvm.defaults
diagnostics.data.router.defaults

Member

@rohityadavcloud rohityadavcloud left a comment

Works OK; however, it adds a dependency on the SSVM/secondary storage, and the copying of files assumes that the secondary storage is NFS-based and accessible. On non-NFS storage, the admin may not be told explicitly that the feature does not work (an unlikely case in most deployments).

Pending testing on XenServer and VMware. KVM LGTM.
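
One way to surface the non-NFS limitation mentioned above would be an explicit up-front check on the secondary storage URL. The following is a hypothetical sketch, not what the PR implements: the function name, the error message, and the assumption that the store is identified by a URL whose scheme is `nfs` are all illustrative.

```python
from urllib.parse import urlparse


def require_nfs_store(store_url):
    """Return `store_url` unchanged if its scheme is nfs, else raise ValueError.

    Hypothetical guard: fails fast with a message the admin can act on.
    """
    scheme = urlparse(store_url).scheme.lower()
    if scheme != "nfs":
        raise ValueError(
            "Diagnostics download requires NFS-based secondary storage, "
            "got scheme %r" % scheme
        )
    return store_url
```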

@rohityadavcloud rohityadavcloud changed the title [WIP] Retrieve diagnostics data rebase [WIP] Retrieve Diagnostics from SSVM, CPVM, Router Jul 1, 2019
@rohityadavcloud rohityadavcloud changed the title [WIP] Retrieve Diagnostics from SSVM, CPVM, Router [WIP] Get Diagnostics: Download logs and diagnostics data from SSVM, CPVM, Router Jul 1, 2019
@borisstoyanov
Contributor

@blueorangutan package

@blueorangutan

Packaging result: ✖centos6 ✖centos7 ✖debian. JID-555

@anuragaw
Contributor

anuragaw commented Jan 9, 2020

@blueorangutan package

@blueorangutan

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-559

@borisstoyanov
Contributor

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-736)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 61141 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3350-t736-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermittent failure detected: /marvin/tests/smoke/test_reset_vm_on_reboot.py
Intermittent failure detected: /marvin/tests/smoke/test_resource_accounting.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dhcphosts.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dns.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dnsservice.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_iptables_default_policy.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_routers.py
Intermittent failure detected: /marvin/tests/smoke/test_secondary_storage.py
Intermittent failure detected: /marvin/tests/smoke/test_service_offerings.py
Intermittent failure detected: /marvin/tests/smoke/test_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_ssvm.py
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_router_nics.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 57 look OK, 21 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestResetVmOnReboot>:setup | Error | 0.00 | test_reset_vm_on_reboot.py |
| ContextSuite context=TestRAMCPUResourceAccounting>:setup | Error | 0.00 | test_resource_accounting.py |
| ContextSuite context=TestRouterServices>:setup | Error | 0.00 | test_routers.py |
| ContextSuite context=TestRouterDHCPHosts>:setup | Error | 0.00 | test_router_dhcphosts.py |
| ContextSuite context=TestRouterDHCPOpts>:setup | Error | 0.00 | test_router_dhcphosts.py |
| ContextSuite context=TestRouterDns>:setup | Error | 0.00 | test_router_dns.py |
| ContextSuite context=TestRouterDnsService>:setup | Error | 0.00 | test_router_dnsservice.py |
| ContextSuite context=TestRouterIpTablesPolicies>:setup | Error | 0.00 | test_routers_iptables_default_policy.py |
| ContextSuite context=TestVPCIpTablesPolicies>:setup | Error | 0.00 | test_routers_iptables_default_policy.py |
| ContextSuite context=TestIsolatedNetworks>:setup | Error | 0.00 | test_routers_network_ops.py |
| ContextSuite context=TestRedundantIsolateNetworks>:setup | Error | 0.00 | test_routers_network_ops.py |
| test_01_sys_vm_start | Failure | 0.10 | test_secondary_storage.py |
| ContextSuite context=TestCpuCapServiceOfferings>:setup | Error | 0.00 | test_service_offerings.py |
| ContextSuite context=TestServiceOfferings>:setup | Error | 0.24 | test_service_offerings.py |
| ContextSuite context=TestSnapshotRootDisk>:setup | Error | 0.00 | test_snapshots.py |
| test_01_list_sec_storage_vm | Failure | 0.04 | test_ssvm.py |
| test_02_list_cpvm_vm | Failure | 0.03 | test_ssvm.py |
| test_03_ssvm_internals | Failure | 0.03 | test_ssvm.py |
| test_04_cpvm_internals | Failure | 0.03 | test_ssvm.py |
| test_05_stop_ssvm | Failure | 0.03 | test_ssvm.py |
| test_06_stop_cpvm | Failure | 0.03 | test_ssvm.py |
| test_07_reboot_ssvm | Failure | 0.04 | test_ssvm.py |
| test_08_reboot_cpvm | Failure | 0.03 | test_ssvm.py |
| test_09_destroy_ssvm | Failure | 0.03 | test_ssvm.py |
| test_10_destroy_cpvm | Failure | 0.03 | test_ssvm.py |
| test_02_create_template_with_checksum_sha1 | Error | 65.47 | test_templates.py |
| test_03_create_template_with_checksum_sha256 | Error | 65.46 | test_templates.py |
| test_04_create_template_with_checksum_md5 | Error | 65.47 | test_templates.py |
| test_05_create_template_with_no_checksum | Error | 65.50 | test_templates.py |
| test_02_deploy_vm_from_direct_download_template | Error | 1.32 | test_templates.py |
| test_03_deploy_vm_wrong_checksum | Error | 1.35 | test_templates.py |
| ContextSuite context=TestTemplates>:setup | Error | 17.67 | test_templates.py |
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestLBRuleUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestNatRuleUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestPublicIPUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestSnapshotUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestVmUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestVolumeUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestVpnUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=Test01DeployVM>:setup | Error | 0.00 | test_vm_life_cycle.py |
| ContextSuite context=Test02VMLifeCycle>:setup | Error | 0.00 | test_vm_life_cycle.py |
| test_14_secure_to_secure_vm_migration | Error | 11.41 | test_vm_life_cycle.py |
| test_15_secured_to_nonsecured_vm_migration | Error | 74.25 | test_vm_life_cycle.py |
| test_16_nonsecured_to_secured_vm_migration | Error | 1.28 | test_vm_life_cycle.py |
| ContextSuite context=TestVmSnapshot>:setup | Error | 2.01 | test_vm_snapshots.py |
| ContextSuite context=TestCreateVolume>:setup | Error | 0.00 | test_volumes.py |
| ContextSuite context=TestVolumes>:setup | Error | 0.00 | test_volumes.py |
| ContextSuite context=TestVPCRedundancy>:setup | Error | 0.00 | test_vpc_redundant.py |
| ContextSuite context=TestVPCNics>:setup | Error | 0.00 | test_vpc_router_nics.py |
| ContextSuite context=TestRVPCSite2SiteVpn>:setup | Error | 0.00 | test_vpc_vpn.py |
| ContextSuite context=TestVPCSite2SiteVPNMultipleOptions>:setup | Error | 0.00 | test_vpc_vpn.py |
| ContextSuite context=TestVpcRemoteAccessVpn>:setup | Error | 0.00 | test_vpc_vpn.py |
| ContextSuite context=TestVpcSite2SiteVpn>:setup | Error | 0.00 | test_vpc_vpn.py |
| test_disable_oobm_ha_state_ineligible | Error | 1513.47 | test_hostha_kvm.py |

@borisstoyanov
Contributor

@blueorangutan package

@blueorangutan

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-568

@borisstoyanov
Contributor

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-747)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33207 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3350-t747-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 78 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@anuragaw
Contributor

@blueorangutan package

@blueorangutan

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-592

@anuragaw anuragaw force-pushed the retrieve-diagnostics-data-rebase branch from 4e622d8 to b3d46d8 Compare January 14, 2020 14:31
@anuragaw
Contributor

@blueorangutan package

@blueorangutan

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-598

@anuragaw
Contributor

@blueorangutan test

@blueorangutan

@anuragaw a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@andrijapanicsb
Contributor

LGTM.

Tested manually by @borisstoyanov and reviewed by me

@blueorangutan

Trillian test result (tid-772)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34935 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3350-t772-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Smoke tests completed. 78 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@andrijapanicsb
Contributor

Merging based on LGTMs/Approvals and extensive manual testing (in-house)

@andrijapanicsb andrijapanicsb merged commit be97470 into apache:master Jan 15, 2020
ustcweizhou pushed a commit to ustcweizhou/cloudstack that referenced this pull request Feb 28, 2020
…Router (apache#3350)

* * Complete API implementation
* Complete UI integration
* Complete marvin test
* Complete Secondary storage GC background task

* improve UI labels

* slight reword and add another missing description

* improve download message clarity

* Address comments

* multiple fixes and cleanups

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

* fix more bugs, let it return ip rule list in another log file

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

* fix missing iprule bug

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

* add support for ARCHIVE type of object to be linked/setup on secstorage

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

* Fix retrieving files for Xenserver

* Update get_diagnostics_files.py

* Fix bug where executable scripts weren't handled

* Fixed error on script cmd generation

* Do not filter name for log files as it would override similar prefix script names

* Addressed code review comments

* log error instead of printstacktrace

* Treat script as executable and shell script

* Check missing script name case and write to output instead of catching exception

* Use shell = true instead of shlex to support any executable

* fix xenserver bug

* don't set dir permission for vmware

* Code review comments - refactoring

* Add check for possible NPE

* Remove unused import after rebase

* Add better description for configs

Co-authored-by: Nicolas Vazquez <nicovazquez90@gmail.com>
Co-authored-by: Rohit Yadav <rohit@apache.org>
Co-authored-by: Anurag Awasthi <anurag.awasthi@shapeblue.com>
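
The "Use shell = true instead of shlex" item in the commit list above refers to a general trade-off in Python's `subprocess` module, which the following sketch illustrates. It is not the code from `get_diagnostics_files.py`; the two helper names are invented for the example, and it assumes a POSIX system with Python 3.7 or later.

```python
import shlex
import subprocess


def run_with_shlex(command):
    """Split the command into an argv list and execute it without a shell."""
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    return result.returncode, result.stdout


def run_with_shell(command):
    """Hand the command string to the system shell, so pipes and redirects work."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.returncode, result.stdout


if __name__ == "__main__":
    cmd = "echo hello | tr a-z A-Z"
    print(run_with_shlex(cmd))  # the pipe is passed to echo as literal arguments
    print(run_with_shell(cmd))  # the shell interprets the pipe
```

The shell variant accepts any command line a shell would, but it also interprets shell metacharacters, so it should only be given trusted input.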
Development

Successfully merging this pull request may close these issues.

Not able to call scripts with get diagnostics
9 participants