
Confusing behaviour after monitoring is enabled #921

Closed
adelton opened this issue Jan 7, 2020 · 13 comments

adelton commented Jan 7, 2020

General information

  • OS: Red Hat Enterprise Linux release 8.1 (Ootpa)
  • Hypervisor: KVM
  • Did you run crc setup before starting it (Yes/No)? Yes

CRC version

crc version: 1.3.0+918756b
OpenShift version: 4.2.10 (embedded in binary)

CRC status

CRC VM:          Running
OpenShift:       Running (v4.2.10)
Disk Usage:      11.63GB of 32.2GB (Inside the CRC VM)
Cache Usage:     13.75GB
Cache Directory: /home/test/.crc/cache

CRC config

Empty output.

Host Operating System

NAME="Red Hat Enterprise Linux"
VERSION="8.1 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.1"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.1 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.1:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.1
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.1"

Steps to reproduce

  1. crc setup
  2. crc start -p /tmp/pull.secret -m 16000 to give the VM enough memory to allow monitoring to work, also see [BUG] Documentation should state that monitoring will not work unless memory is increased from default #810
  3. oc login -u kubeadmin
  4. oc scale --replicas=1 statefulset --all -n openshift-monitoring; oc scale --replicas=1 deployment --all -n openshift-monitoring based on https://code-ready.github.io/crc/#starting-monitoring-alerting-telemetry_gsg
  5. Log in to console at https://console-openshift-console.apps-crc.testing/dashboards
  6. Try to make sense of the health of the cluster.
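
For reference, steps 1–4 can be run back to back as a plain shell sketch. The API endpoint passed to oc login is an assumption (step 3 above only shows oc login -u kubeadmin); adjust it to whatever crc start prints:

    # set up and start CRC with enough memory for monitoring
    crc setup
    crc start -p /tmp/pull.secret -m 16000

    # log in as kubeadmin (API URL assumed to be the CRC default)
    oc login -u kubeadmin https://api.crc.testing:6443

    # re-enable the monitoring stack that CRC scales down by default
    oc scale --replicas=1 statefulset --all -n openshift-monitoring
    oc scale --replicas=1 deployment --all -n openshift-monitoring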

Expected

https://console-openshift-console.apps-crc.testing/dashboards says "Cluster is healthy" and there are no alerts.

https://console-openshift-console.apps-crc.testing/settings/cluster/ lists Last Completed Version and the Update History as completed.

https://console-openshift-console.apps-crc.testing/settings/cluster/clusteroperators shows all cluster operators "green" with no error messages.

Actual

https://console-openshift-console.apps-crc.testing/dashboards says "Cluster is healthy" but it lists a bunch of alerts:

  • [orange] 100% of the etcd targets are down.
  • [red] etcd cluster "etcd": insufficient members (0).
  • [red] etcd cluster "etcd": members are down (1).
  • [orange] Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.
  • [red] machine api operator is down
  • [red] Cluster version operator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
  • [orange] Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.

https://console-openshift-console.apps-crc.testing/settings/cluster/ shows

  • Channel: stable-4.2
  • Last Completed Version: None
  • Update Status: Failing, View details

https://console-openshift-console.apps-crc.testing/settings/cluster/clusteroperators at the top shows message

Cluster update in progress. Unable to apply 4.2.10: the cluster operator monitoring has not yet successfully rolled out

However, all operators have their Status listed as Available with a green check mark; monitoring has the Message "Successfully rolled out the stack." and machine-api has the Message "-".

Overall, it is not clear what the status of the machine-api operator is and why the alert reports it as down when the operator list shows it running and available; it is not clear why the cluster version operator has the "has disappeared" alert and how to fix it; and it is not clear why the message speaks about the monitoring cluster operator not being rolled out when the monitoring operator itself says otherwise.

Logs

You can start crc with crc start --log-level debug to collect logs.
Please consider posting this on http://gist.github.com/ and post the link in the issue.

adelton commented Jan 7, 2020

It may well be that some of these are issues in OpenShift itself rather than in CRC and the environment it sets up for OpenShift. However, it is also possible that CRC could or should do something differently to remediate the problem and have the OpenShift cluster fully healthy, except perhaps for some alerts about there being only one node in the cluster.

gbraad commented Jan 7, 2020

https://console-openshift-console.apps-crc.testing/dashboards says "Cluster is healthy" but it lists a bunch of alerts:

This is part of a known issue:

  • Machine API operator
  • Cluster version operator

are disabled for CRC, as they do not provide functionality we need and would otherwise consume resources. Also see: https://access.redhat.com/documentation/en-us/red_hat_codeready_containers/1.0/html-single/release_notes_and_known_issues/index#metrics_are_disabled_by_default

Take note that https://access.redhat.com/documentation/en-us/red_hat_codeready_containers/1.3/html-single/getting_started_guide/index#common-tasks_gsg does mention:

 The machine-config cluster operator is expected to report False availability. The monitoring cluster operator is expected to report Unknown availability. 

which affects the results from metrics.

Note: we dropped the statement related to the cluster state reporting "Healthy" when this is not the case. I believe we did so on the premise of adding a message as part of the startup. Perhaps this got neglected/miscommunicated. Will bring this up again.

adelton commented Jan 8, 2020

Thanks @gbraad for the explanation.

Is there an easy way to enable the machine API and cluster version operators if I don't worry about the resource consumption that much, similar to the ability to enable monitoring per https://access.redhat.com/documentation/en-us/red_hat_codeready_containers/1.0/html-single/release_notes_and_known_issues/index#metrics_are_disabled_by_default ? I'd like to have the CRC-based testing cluster as close to a real OCP installation as possible, as a basis for future investigation.

@praveenkumar

@adelton The machine config operator doesn't work with a single-node cluster (openshift/machine-config-operator#579), and enabling the cluster version operator is fairly simple: just run oc scale --replicas=1 deployment --all -n openshift-cluster-version
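
A quick sketch of that, with a follow-up check that the operator actually comes back (not an official procedure, just the command from this thread plus standard oc queries):

    # scale the cluster version operator back up from 0 replicas
    oc scale --replicas=1 deployment --all -n openshift-cluster-version

    # verify the CVO pod starts and the ClusterVersion resource reports conditions again
    oc get pods -n openshift-cluster-version
    oc get clusterversion version -o yaml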

gbraad commented Jan 8, 2020

But @praveenkumar, wouldn't enabling the CVO have unwanted side effects? It wants to pull the cluster into a certain state ...

@praveenkumar

wouldn't enabling the CVO have unwanted side effects? It wants to pull the cluster into a certain state ...

@gbraad No, it shouldn't, because @adelton has already enabled the monitoring operator, which is what we disable. It will also bring the cluster state back from the changes we make here: https://github.com/code-ready/snc/blob/master/snc.sh#L221-L255
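
To get an idea of what has been scaled down on a given cluster (a sketch; the linked snc.sh lines remain the source of truth for what CRC disables), one can list the deployments that currently sit at zero replicas:

    # show the header plus every deployment whose READY column reads 0/0
    oc get deployments --all-namespaces | awk 'NR==1 || $3 == "0/0"'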

adelton commented Jan 8, 2020

enabling the cluster version operator is fairly simple: just run oc scale --replicas=1 deployment --all -n openshift-cluster-version

Thank you. Should I expect it to appear at https://console-openshift-console.apps-crc.testing/settings/cluster/clusteroperators?

The

Cluster version operator has disappeared

alert is gone but I see https://console-openshift-console.apps-crc.testing/k8s/cluster/config.openshift.io~v1~ClusterVersion/version reporting Failing because of

Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (383 of 433)

And in OCM, in the Cluster operators listing, I see "version" Failing, with a link pointing to https://console-openshift-console.apps-crc.testing/k8s/cluster/config.openshift.io~v1~ClusterOperator/version which says 404 Not Found. Note the ClusterOperator/version URL, as opposed to the ClusterVersion/version URL which has some content in it.
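
As a sketch, the same Failing condition can also be read from the command line instead of the console:

    # print the ClusterVersion conditions, including the Failing one quoted above
    oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'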

@praveenkumar

Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (383 of 433)

@adelton This is expected: per the CVO, we should have 3 instances of etcd to form a quorum, which is not possible for CRC :(
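
A quick way to see the mismatch on the cluster itself, using the namespace and deployment name from the error message above (a sketch):

    # the quorum guard wants 3 replicas spread across 3 masters; CRC has a single node
    oc get deployment etcd-quorum-guard -n openshift-machine-config-operator
    oc get nodes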

adelton commented Jan 9, 2020

Digging into it a bit more, it seems like oc scale --replicas=1 deployment --all -n openshift-machine-config-operator makes the CRC-based cluster report itself as fully healthy, with no alerts, and fully updated to version 4.2.10. At least temporarily, before the CVO (?) raises the number of replicas back to 3.

adelton commented Jan 10, 2020

For the record, running this (enabling monitoring, openshift-cluster-version, and openshift-machine-config-operator) allows me to initiate an upgrade to 4.2.13. When the upgrade process gets stuck waiting for something (insights), bumping its replicas value to 1 and back to 0 after the upgrade makes the upgrade proceed up to 88 per cent, at which point it gets stuck because machine-config reports

Failed to resync 4.2.10 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)

I've searched around but did not find a definitive cause (or workaround) for the pool master is not ready error, at least for the 4.2 version of OCP.
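
For completeness, a hedged sketch of the sequence described above. How the upgrade was initiated is not stated in this thread, and the insights operator's namespace and deployment name are assumptions, so treat this only as an illustration:

    # initiate the upgrade (assumed here to be done via oc adm upgrade; the console works too)
    oc adm upgrade --to=4.2.13

    # if the upgrade hangs on insights, bump its replicas up and back down
    # (namespace and deployment name are assumptions, not taken from this thread)
    oc scale --replicas=1 deployment insights-operator -n openshift-insights
    oc scale --replicas=0 deployment insights-operator -n openshift-insights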

praveenkumar commented Jan 13, 2020

When the upgrade process gets stuck waiting for something (insights), bumping its replicas value to 1 and back to 0 after the upgrade makes the upgrade proceed up to 88 per cent, at which point it gets stuck because machine-config reports

@adelton You can't perform an upgrade of the cluster using CRC, since it is a single-node cluster and the machine config operator is not going to be supported for this. In a real cluster, the upgrade happens one by one on the worker/master nodes, rebooting them. If you really want to test upgrades, then CRC might not be a good choice :(

adelton commented Jan 13, 2020

Understood. Thank you.

adelton closed this as completed Jan 13, 2020
@nelsonspbr

@praveenkumar Then I suggest that https://access.redhat.com/documentation/en-us/red_hat_codeready_containers/1.3/html-single/getting_started_guide/index (and possibly other places) should explicitly mention that CRC does not support machine-config, instead of just saying it is disabled by default. From that wording I had understood that I could enable it if I wanted. I spent some time trying to make it work until I finally found this GitHub issue.
