Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: ssh-keygen in PEM format and reduce main systemvm patching script #3333

Merged
merged 13 commits into from May 23, 2019

Conversation

@rhtyd
Copy link
Member

commented May 17, 2019

On first startup, the management server creates and saves a random
ssh keypair using ssh-keygen in the database. The command does
not specify keys in PEM format which is not the default as generated
by latest ssh-keygen tool.

The systemvmtemplate always needs re-building whenever there is a change
in the cloud-early-config file. This also tries to fix that by introducing a
stage 2 bootstrap.sh where the changes specific to hypervisor detection
etc are refactored/moved. The initial cloud-early-config only patches
before the other scripts are called.

Additional commentary:
On latest CentOS7 release (CentOS Linux release 7.6.1810 (Core)) I hit a
bug where router_proxy (ssh) failed due to public key issue and kept
prompting for a passphrase (the default is blank/empty). By adding a
newline to the id_rsa.cloud private key it did not prompt the same and
ssh worked. By performing a visual diff on bytes of the file I found
that the DB version of the keypair and the version that gets put on the
KVM host had a missing line feed (0xA0) character. This patch fixes
the issue by ensuring we save the ssh keypair in the database exactly
as it is generated by ssh-keygen. Furthermore, I found that sometimes
the VR startup could fail if the VR took more time to patch, so I
increased the number of retries it previously used to do.

Screenshot showing the binary difference between ssh private key files (as on management server vs on the kvm host):

Screenshot from 2019-05-17 13-43-46

After the fix the checksums match on both the kvm host and on the management server:

Screenshot from 2019-05-17 13-59-30

Screenshot from 2019-05-17 13-59-41

SSH packages details on my CentOS7 env:
libssh2.x86_64 1.4.3-12.el7_6.2 @updates
openssh.x86_64 7.4p1-16.el7 @anaconda
openssh-askpass.x86_64 7.4p1-16.el7 @base
openssh-clients.x86_64 7.4p1-16.el7 @anaconda
openssh-server.x86_64 7.4p1-16.el7 @anaconda

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
server: save SSH keypair exactly as generated by ssh-keygen
On first startup, the management server creates and saves a random
ssh keypair using `ssh-keygen` in the database. The previous code used
to call `trim()` on the string and imposed a restriction of 4k length on
the bytes read from the file.

On latest CentOS7 release (CentOS Linux release 7.6.1810 (Core)) I hit a
bug where router_proxy (ssh) failed due to public key issue and kept
prompting for a passphrase (the default is blank/empty). By adding a
newline to the id_rsa.cloud private key it did not prompt the same and
ssh worked. By performing a visual diff on bytes of the file I found
that the DB version of the keypair and the version that gets put on the
KVM host had a missing `line feed (0xA0)` character. This patch fixes
the issue by ensuring we save the ssh keypair in the database exactly
as it is generated by ssh-keygen. Furthermore, I found that sometimes
the VR startup could fail if the VR took more time to patch, so I
increased the number of retries it previously used to do.

SSH packages details on my CentOS7 env:
libssh2.x86_64                             1.4.3-12.el7_6.2            @updates
openssh.x86_64                             7.4p1-16.el7                @anaconda
openssh-askpass.x86_64                     7.4p1-16.el7                @base
openssh-clients.x86_64                     7.4p1-16.el7                @anaconda
openssh-server.x86_64                      7.4p1-16.el7                @anaconda

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
fix race condition when VR starts
Don't resend/set the cmdline against which can cause intermittent ssh
failures due to ssh public key issue.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

@apache apache deleted a comment from blueorangutan May 17, 2019

@apache apache deleted a comment from blueorangutan May 17, 2019

@apache apache deleted a comment from blueorangutan May 17, 2019

Fix VR patching that may fail if file is missing for latest 4.11 and …
…master

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

@rhtyd rhtyd changed the title server: save SSH keypair exactly as generated by ssh-keygen [WIP] server: save SSH keypair exactly as generated by ssh-keygen May 17, 2019

rhtyd added some commits May 17, 2019

fix failing VR patching
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
fix slow VR patching
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
make cloud-early-config smaller, patch scripts first before bootstrap…
…ping

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

@apache apache deleted a comment from blueorangutan May 20, 2019

@apache apache deleted a comment from blueorangutan May 20, 2019

@apache apache deleted a comment from blueorangutan May 20, 2019

rhtyd added some commits May 20, 2019

systemvm bootstrapping changes and keygen command changes
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
exclude failing test due to jdk version issue, same as master
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 20, 2019

The main issue with failing VR came out to be due to the ssh private key having 'OPENSS... PRIVATE KEY' instead of 'RSA PRIVATE KEY' that ssh trilead-ssh2 library expects. This happens when you update to latest openssh-client

@blueorangutan package

@rhtyd rhtyd changed the title [WIP] server: save SSH keypair exactly as generated by ssh-keygen server: ssh-keygen in PEM format and reduce main systemvm patching script May 20, 2019

@blueorangutan

This comment has been minimized.

Copy link

commented May 20, 2019

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

final VirtualRoutingResource virtRouterResource = libvirtComputingResource.getVirtRouterResource();
if (libvirtComputingResource.passCmdLine(vmName, vmSpec.getBootArgs())) {
break;
}

This comment has been minimized.

Copy link
@ustcweizhou

ustcweizhou May 20, 2019

Contributor

@rhtyd
we pass cmdline to systemvm multiple times here, is because the operation passCmdLine will always succeed but actually systemvm cannot get the data via /dev/vport01p1 inside it, so we pass the cmdline to systemvm around every 15 seconds (virtRouterResource.connect) until control ip is reachable.

This comment has been minimized.

Copy link
@rhtyd

rhtyd May 21, 2019

Author Member

@ustcweizhou the blind waiting and passing cmdline is not necessary now, in the new patch.sh script it waits until it's able to communicate with the qemu-guest-agent running inside the systemvm appliance. See: https://github.com/apache/cloudstack/blob/master/scripts/vm/hypervisor/kvm/patch.sh#L61

This comment has been minimized.

Copy link
@ustcweizhou

ustcweizhou May 21, 2019

Contributor

@rhtyd ok, sorry I did not notice this change.
in this case, do we still need to pass cmd line for max 10 times ?

which ssh-keygen version is impacted ? I would like to test it.

This comment has been minimized.

Copy link
@rhtyd

rhtyd May 21, 2019

Author Member

@ustcweizhou I can reduce that to 3 maybe.
I could consistently reproduce the issue with latest Ubuntu 19.04 with latest Centos7.6 used as kvm host.

@blueorangutan

This comment has been minimized.

Copy link

commented May 20, 2019

Trillian test result (tid-3560)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 28167 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3333-t3560-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 67 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_04_rvpc_network_garbage_collector_nics Failure 295.01 test_vpc_redundant.py
@blueorangutan

This comment has been minimized.

Copy link

commented May 21, 2019

Trillian test result (tid-3562)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41975 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3333-t3562-vmware-65u2.zip
Intermittent failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Smoke tests completed. 65 look OK, 3 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_04_extract_template Failure 136.57 test_templates.py
ContextSuite context=TestISOUsage>:setup Error 0.00 test_usage.py
test_06_download_detached_volume Failure 171.64 test_volumes.py

@apache apache deleted a comment from blueorangutan May 21, 2019

@blueorangutan

This comment has been minimized.

Copy link

commented May 21, 2019

Trillian test result (tid-3564)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 6
Total time taken: 31352 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3333-t3564-xenserver-71.zip
Intermittent failure detected: /marvin/tests/smoke/test_scale_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Smoke tests completed. 56 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_scale_vm Error 16.37 test_scale_vm.py
test_01_ISO_usage Error 1.06 test_usage.py
test_01_lb_usage Error 8.28 test_usage.py
test_01_nat_usage Error 15.43 test_usage.py
test_01_public_ip_usage Error 1.08 test_usage.py
test_01_snapshot_usage Error 57.15 test_usage.py
test_01_template_usage Error 50.26 test_usage.py
test_01_vm_usage Error 27.69 test_usage.py
test_01_volume_usage Error 28.00 test_usage.py
test_01_vpn_usage Error 19.65 test_usage.py
@blueorangutan

This comment has been minimized.

Copy link

commented May 21, 2019

Trillian test result (tid-3565)
Environment: kvm-centos6 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33066 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3333-t3565-kvm-centos6.zip
Intermittent failure detected: /marvin/tests/smoke/test_service_offerings.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 67 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_service_offering_cpu_limit_use Error 1.35 test_service_offerings.py
@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 22, 2019

Tests LGTM, the one usage related tests failing on xenserver-71 env failed due to env issues:

  File "/usr/lib64/python2.7/site-packages/mysql/connector/connection_cext.py", line 182, in _open_connection
    sqlstate=exc.sqlstate)
OperationalError: 1043 (08S01): Bad handshake

However, the usage related tests are agnostic of hypervisor and have passed in other environments.

@borisstoyanov
Copy link
Contributor

left a comment

Good fix @rhtyd, LGTM based on test results and code review.

travis: split tests that are taking time to finish
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@PaulAngus

This comment has been minimized.

Copy link
Member

commented May 22, 2019

@rhtyd #3126 &3127 has been waiting to be merged in 4.11 and master for some time. Can we merge that and then rebase this?

@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 22, 2019

@PaulAngus I can help review/merge #3126 after this PR.

rearrange tests to avoid timeout on travisci
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@shwstppr
Copy link
Contributor

left a comment

LGTM

@@ -1362,7 +1362,7 @@ private boolean checkOvsNetwork(final String networkName) {
}

public boolean passCmdLine(final String vmName, final String cmdLine) throws InternalErrorException {
final Script command = new Script(_patchScriptPath, 30 * 1000, s_logger);
final Script command = new Script(_patchScriptPath, 300 * 1000, s_logger);

This comment has been minimized.

Copy link
@anuragaw

anuragaw May 23, 2019

Contributor

Curious question - Is that a desired timeout value? Seems too high.

This comment has been minimized.

Copy link
@rhtyd

rhtyd May 23, 2019

Author Member

In environment with slow cpu and storage network, the timeout may be even higher. The logic of 5minute timeout comes from the libvirt wrapper command handler:

                // try to patch and SSH into the systemvm for up to 5 minutes
                for (int count = 0; count < 10; count++) {

For historic reasons, you can see CLOUDSTACK-2823

This comment has been minimized.

Copy link
@anuragaw

anuragaw May 26, 2019

Contributor

Thanks!

@anuragaw
Copy link
Contributor

left a comment

LGTM

(tested locally on KVM Based environment built from scratch)

rhtyd added some commits May 23, 2019

travis: further re-arrange test groups to complete within 40mins
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
travis: another retry
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 23, 2019

Merging based on code reviews from Bobby, Abhishek and Anurag and smoke test results and Travis.

@rhtyd rhtyd merged commit 0929866 into apache:4.11 May 23, 2019

1 of 2 checks passed

Jenkins Jenkins is validating pull request ...
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

gmueller-ewerk added a commit to gmueller-ewerk/cloudstack that referenced this pull request May 28, 2019

server: ssh-keygen in PEM format and reduce main systemvm patching sc…
…ript (apache#3333)

On first startup, the management server creates and saves a random
ssh keypair using ssh-keygen in the database. The command does
not specify keys in PEM format which is not the default as generated
by latest ssh-keygen tool.

The systemvmtemplate always needs re-building whenever there is a change
in the cloud-early-config file. This also tries to fix that by introducing a
stage 2 bootstrap.sh where the changes specific to hypervisor detection
etc are refactored/moved. The initial cloud-early-config only patches
before the other scripts are called.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.