HA for Management Server - roundrobin: Certificate ownership #2930

Closed
DennisKonrad opened this issue Oct 26, 2018 · 18 comments

@DennisKonrad
Contributor

ISSUE TYPE
  • Bug Report
COMPONENT NAME

Management Server HA

CLOUDSTACK VERSION

master

CONFIGURATION

indirect.agent.lb.algorithm = roundrobin
host = 10.24.48.46,10.24.48.47

SUMMARY

When running the CloudStack management servers in roundrobin load-balanced mode, we get errors, for example
when creating a VPC.

If we use indirect.agent.lb.algorithm = static it works like a charm for both management servers.

The log states it has something to do with the certificates issued:

image

It's not really clear what the error message itself is trying to say or how I can debug this further.

STEPS TO REPRODUCE

as stated above

EXPECTED RESULTS
ACTUAL RESULTS
@rohityadavcloud
Member

@DennisKonrad are you also able to reproduce this for 4.11?

@rohityadavcloud
Member

@DennisKonrad can you also describe your setup, is it kvm/xenserver/vmware etc?

@rohityadavcloud rohityadavcloud added this to the 4.11.2.0 milestone Oct 27, 2018
@rohityadavcloud
Member

@DennisKonrad did you deploy multiple management servers concurrently? Ideally you should wait for the first management server to fully start before starting secondary management server.

From the screenshot, the certificate was generated without the IPs of the mgmt server, so the certificate validation logic failed the SSL connection (the certificate's alt name/IP must match the connecting agent/mgmt server's address). For example, the following is a valid mgmt server cert that has IPv4/IPv6 addresses in its alt names:

Certificate [1] :
 Serial: da32e26467ff7a4d
  Not Before:Sat Oct 27 07:44:03 UTC 2018
  Not After:Mon Oct 19 19:44:03 UTC 2048
  Signature Algorithm:SHA256withRSA
  Version:3
  Subject DN:CN=pr2376-t3127-kvm-centos7-mgmt2
  Issuer DN:CN=ca.cloudstack.apache.org
  Alternative Names:[[7, fe80:0:0:0:4af:4ff:fe01:7a8], [7, 10.2.2.176], [2, pr2376-t3127-kvm-centos7-mgmt2]]

I could not reproduce this with 4.11 branch, so will move to milestone 4.12.0.0/master. Please re-test and keep us posted, thanks.
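
As a rough illustration of the check described above, here is a minimal Python sketch that parses an `Alternative Names` line in the format shown in the certificate dump and tests whether a given IP is listed. The function names are hypothetical and this is not CloudStack's own validation code; the numeric tags follow the GeneralName types of RFC 5280 (2 = dNSName, 7 = iPAddress).

```python
import ipaddress
import re

# GeneralName type tags (RFC 5280) as they appear in the dump above.
DNS_NAME, IP_ADDRESS = 2, 7


def parse_alt_names(line):
    """Turn '[[7, 10.2.2.176], [2, host]]' into [(7, '10.2.2.176'), (2, 'host')]."""
    return [(int(tag), value.strip())
            for tag, value in re.findall(r"\[(\d+),\s*([^\[\]]+)\]", line)]


def cert_covers_ip(alt_names_line, ip):
    """Return True if `ip` appears among the iPAddress entries."""
    wanted = ipaddress.ip_address(ip)
    for tag, value in parse_alt_names(alt_names_line):
        # Compare parsed addresses so different IPv6 spellings still match.
        if tag == IP_ADDRESS and ipaddress.ip_address(value) == wanted:
            return True
    return False


# The Alternative Names line copied from the certificate dump above.
ALT_NAMES = ("[[7, fe80:0:0:0:4af:4ff:fe01:7a8], [7, 10.2.2.176], "
             "[2, pr2376-t3127-kvm-centos7-mgmt2]]")
```

With this, `cert_covers_ip(ALT_NAMES, "10.2.2.176")` is true, while an address from the reporter's setup such as `10.24.48.47` is not covered by this example cert.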

@rohityadavcloud rohityadavcloud modified the milestones: 4.11.2.0, 4.12.0.0 Oct 29, 2018
@DennisKonrad
Contributor Author

@rhtyd
So, unfortunately we are not able to test this on a 4.11 setup.

The setup is kvm. Anything more you want to know?

It is possible that we deployed our management servers at the same time. I will try to give the first management server the time to start before starting the secondary one. I will try this and report the results.

A question: When are those certificates generated? Is it sufficient to restart the management server to newly generate the certificates?

@rohityadavcloud
Member

@DennisKonrad thanks. When a new management server starts, it first upgrades the DB and then starts the various managers/components. During this stage, ConfigurationManagerImpl starts first and configures the default settings, offerings and accounts (such as system, admin etc). When the CA manager starts, it asks the configured/default plugin to initialize, which is the RootCA plugin by default. This plugin checks whether a keypair (private/public) exists in the db and uses the CA cert to create and sign a self-signed cert for the mgmt server host. If another mgmt server is started during this initialization, the two can conflict badly over the operations/defaults created by both ConfigurationManager and CAManager. Therefore, based on the output you've shared, it's more of an env/setup issue than a bug. Please re-test master, this time making sure that the first/primary mgmt server completes initialization before secondary mgmt servers are added.

For an existing env, to force a re-kick of cert generation, shut down all mgmt servers, then in the db set these global settings in the cloud.configuration table to null:

  • ca.plugin.root.private.key
  • ca.plugin.root.public.key
  • ca.plugin.root.ca.certificate

Then start the first mgmt server, let it complete initialization, and only then start the other mgmt servers.
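
The reset described above can be sketched as follows. This is only an illustration under assumptions: it uses an in-memory SQLite database as a stand-in for the MySQL `cloud` database, and it assumes the settings live in a table called `configuration` with `name` and `value` columns. The helper name `backup_and_clear` is hypothetical. It returns the old values first so they can be kept as a backup.

```python
import sqlite3

ROOT_CA_KEYS = (
    "ca.plugin.root.private.key",
    "ca.plugin.root.public.key",
    "ca.plugin.root.ca.certificate",
)


def backup_and_clear(conn, keys=ROOT_CA_KEYS):
    """Return the current values of `keys`, then set them to NULL."""
    marks = ",".join("?" for _ in keys)
    rows = conn.execute(
        f"SELECT name, value FROM configuration WHERE name IN ({marks})", keys
    ).fetchall()
    with conn:  # commit on success, roll back on error
        conn.execute(
            f"UPDATE configuration SET value = NULL WHERE name IN ({marks})", keys
        )
    return dict(rows)


# Stand-in database (assumption); a real deployment uses the MySQL `cloud` db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE configuration (name TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO configuration VALUES (?, ?)",
    [(k, "old-" + k) for k in ROOT_CA_KEYS] + [("host", "10.24.48.46,10.24.48.47")],
)
backup = backup_and_clear(conn)
```

Only the three root CA settings are touched; other rows such as `host` are left as they were. All mgmt servers should be stopped before running anything like this against a real database.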

@DennisKonrad
Contributor Author

@rhtyd
Ok, so are the certificates regenerated for the primary management server when adding a secondary one? Because the alternative name has to be updated, right? I don't see how it would work otherwise.

So can we do the cert re-kick safely even when we have already added multiple hosts? When we leave the CN at
CN=ca.cloudstack.apache.org
I see no problem but maybe you know something I'm not thinking about right now.

Can you also tell me which script is doing the issuing of the certificates? Then I would be able to check what it's doing.

@rohityadavcloud
Member

@DennisKonrad I've already explained how cert generation works. tl;dr: each mgmt server generates its own cert on startup based on the keypair/ca-cert from the db; the alt names are obtained by the mgmt server by reading the IPs on its network interfaces. The mgmt server certs are only used when they peer/cluster with other mgmt servers. If you re-generate the core keypair/ca-cert, then the kvm hosts' certs will need to be re-provisioned. Please go figure: https://github.com/apache/cloudstack/blob/master/plugins/ca/root-ca/src/main/java/org/apache/cloudstack/ca/provider/RootCAProvider.java#L409

@rohityadavcloud
Member

Given this is a setup/env issue, if the advised process fixes your issue and you're unable to reproduce the errors, please close this issue @DennisKonrad

@DennisKonrad
Contributor Author

Hi @rhtyd,

so I don't really get where this issue comes from. Some questions:

  1. What did we do to get into this problem? How can we avoid this?
  2. Can I somehow connect to the management server to see the certificate? openssl s_client isn't working with the java tcp connection...

I did some research in our logfiles and do not fully understand why we have this problem. The following happens:

Second management server starts, is detected and joins the high availability cluster:
image

Static Loadbalancing is starting and assigning hosts to the second management server:
image

After that the hosts try to connect to the second management server and fail??? Is this assumption right?
image

@rohityadavcloud
Member

@DennisKonrad alright let me ask some questions:

  • Are you using 4.11.1.0 or 4.11.2.0-rcX or latest 4.11 branch?
  • Are all of your mgmt servers connected to the same mysql server? Or is it a cross-dc env?
  • Did you try my advice to stop all mgmt servers, delete the root private/public/certs global settings and start one mgmt server, wait for it to come online and then start the rest?
  • Based on the logs, the mgmt server SSL handshake failed due to invalid CA cert. You can add a breakpoint, attach a debugger from an IDE or jdb and see where in RootCAProvider this is failing and how.

@DennisKonrad
Contributor Author

@rhtyd

  • We are building directly from master. At the moment we are using the master as of today.

  • All management servers are connected to the same mysql-cluster. Using the same database.

  • We did not delete the private/public/certs global settings because we have hosts connected with vms on them. It seems to me your solution suggests basically rebuilding the whole cluster and that's not possible for us.

  • I'm currently under a heavy workload, but it would be possible to check with a debugger attached.

I'm wondering how it is possible to introduce such a problem. In the beginning there can only exist one management server, as you stated. It would start up with only its IP in the global setting (host). After that, is the only way to add a second management server to change that setting and delete the root private/public/certs global settings?

If that is the case is that documented somewhere?

@rohityadavcloud
Member

@DennisKonrad can you at least stop all your mgmt servers and start them one by one? It's possible that you had a conflict b/w the root ca cert/priv/pub keys. You can temporarily disable auth strictness, then remove the global settings and use the provisionCertificate API to re-provision certificates with a new keypair/cert. You're currently using a (by the sound of it) messed up unstable/master branch, and we cannot help until you fix your env. Also, please re-read my comments; I'm not going to restate the same things again and again. About IPs: when a mgmt server starts, it discovers the IPs it needs to use to create a self-signed cert, and NO, you don't need to delete the certs/keypair every time you add a new mgmt server. I advised that because you may have a case where you started multiple mgmt servers at once during install/setup, which stepped on each other and wrote an incorrect ca keypair/cert to the db (i.e. a conflict+concurrency issue).

@rohityadavcloud rohityadavcloud removed their assignment Nov 12, 2018
@DennisKonrad
Contributor Author

@rhtyd Starting one management server first and adding the second one later did not work.

So I tried the following with one of two management server running:

  1. set auth strictness = false (restart management afterwards)
  2. stop management server (both off now)
  3. backup/clear db keys:
    ca.plugin.root.private.key
    ca.plugin.root.public.key
    ca.plugin.root.ca.certificate
  4. start first mgmt-server and wait for completion of reissue

After that I could not use the "provisionCertificate" api call because all hosts were disconnected.
I thought the "ca.plugin.root.auth.strictness" would allow me to push the new certificates to all hosts.

Your help is much appreciated, and rephrasing (or restating, as you call it) helps to figure out what could be wrong here. A thorough investigation will also help others with similar problems. Thanks for that.

@rohityadavcloud
Member

@DennisKonrad can you check in the logs whether your kvm hosts are trying to connect at all? You can try to delete the old keystore file at /etc/cloudstack/agent/cloud.jks and restart the agent, then try the provision API.
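
A cautious variant of the keystore step is to move the file aside rather than delete it. The sketch below is a minimal Python example of that idea; the helper name `set_aside_keystore` and the backup naming scheme are assumptions, and restarting the agent is left to whatever service manager the host uses.

```python
import shutil
import time
from pathlib import Path


def set_aside_keystore(keystore="/etc/cloudstack/agent/cloud.jks"):
    """Move the keystore to a timestamped backup; return the backup path or None."""
    src = Path(keystore)
    if not src.exists():
        return None  # nothing to move, e.g. already set aside
    # Hypothetical naming scheme: cloud.jks.bak-YYYYmmddHHMMSS next to the original.
    dst = src.with_name(f"{src.name}.bak-{time.strftime('%Y%m%d%H%M%S')}")
    shutil.move(str(src), str(dst))
    return dst
```

Calling `set_aside_keystore()` on a KVM host (with sufficient privileges) leaves a timestamped copy next to the original path, so the old keystore can be restored if re-provisioning fails.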

@DennisKonrad
Contributor Author

Hi @rhtyd,

I have three questions left before I can try this on our cluster. To get the hosts to try to connect (your last question), deleting/moving the cloud.jks works. It's not even necessary to restart the agent as far as I could tell. After that, reprovisioning the kvm host with the "provisionCertificate" api call worked.

After creating a new CA/private/public key I can now reprovision the hosts. I would also like to leave the system VMs in place.
So I need to reprovision certs for agents in console proxy, storage vm and all virtual routers in the same manner? (1.)
Is this even possible with the "provisionCertificate" api call? (2.)
I was not able to list ALL virtual routers via listRouters. Even with listall and/or isrecursive flag. Is there an easy way to list all virtual routers? (3.)

I'm aiming for a process that leaves everything except the management-server running without any interruption.

My process looks as follows:
Prerequisites:
get all relevant IDs (Hosts, SystemVMs) (api: listHosts, )
set auth strictness = false

Downtime:
stop all management servers

backup and NULL the following db keys:
ca.plugin.root.private.key
ca.plugin.root.public.key
ca.plugin.root.ca.certificate

backup and delete per Host & SystemVM: /etc/cloudstack/agent/cloud.jks

start first mgmt-server and wait for completion
start second mgmt-server
reissue certificates for all hosts
turn auth strictness on again

@rohityadavcloud
Member

@DennisKonrad

  1. Not virtual routers, only ssvm, cpvm and kvm hosts
  2. Yes, the API accepts hostid and reconnect params
  3. N/A virtual routers don't need to be touched, VRs don't run any java agent
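
For reference, a call to provisionCertificate with the hostid and reconnect params mentioned above could be built roughly like this. It is a sketch under assumptions: the signing steps follow the usual CloudStack API scheme as I understand it (URL-encode values, sort by field name, lowercase, HMAC-SHA1 with the secret key, Base64, URL-encode), and the endpoint, keys and host ID shown in the usage note are placeholders. The function names are hypothetical.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def signed_url(endpoint, params, api_key, secret_key):
    """Build a signed API URL: HMAC-SHA1 over the sorted, lowercased query string."""
    params = dict(params, apikey=api_key, response="json")
    # URL-encode each value and sort the pairs by (lowercased) field name.
    pairs = sorted(((k, quote(str(v), safe="")) for k, v in params.items()),
                   key=lambda kv: kv[0].lower())
    query = "&".join(f"{k}={v}" for k, v in pairs)
    digest = hmac.new(secret_key.encode(), query.lower().encode(),
                      hashlib.sha1).digest()
    signature = quote(base64.b64encode(digest).decode(), safe="")
    return f"{endpoint}?{query}&signature={signature}"


def provision_certificate_url(endpoint, host_id, api_key, secret_key, reconnect=True):
    """URL for provisionCertificate on one host (kvm host, ssvm or cpvm)."""
    params = {"command": "provisionCertificate", "hostid": host_id,
              "reconnect": str(reconnect).lower()}
    return signed_url(endpoint, params, api_key, secret_key)
```

For example, `provision_certificate_url("http://mgmt.example:8080/client/api", "<host-uuid>", "<api-key>", "<secret-key>")` yields a URL that can be fetched with any HTTP client, once per host ID collected via listHosts.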

@rohityadavcloud rohityadavcloud removed this from the 4.12.0.0 milestone Dec 6, 2018
@rohityadavcloud
Member

Hi @DennisKonrad, I'm closing the issue as it could not be reproduced. If you're able to consistently reproduce the error, please re-open the ticket with details and steps. Thanks.

@DennisKonrad
Contributor Author

@rhtyd Yes.
I have not been able to work on this for quite some time now. I think the issue is gone anyway, since we fixed a lot of other things concerning multiple management servers. At the moment, either I'm just not seeing it or the problem is gone.
