
Confusing behaviour after monitoring is enabled #921

Closed
adelton opened this issue Jan 7, 2020 · 13 comments

adelton commented Jan 7, 2020

General information

  • OS: Red Hat Enterprise Linux release 8.1 (Ootpa)
  • Hypervisor: KVM
  • Did you run crc setup before starting it (Yes/No)? Yes

CRC version

crc version: 1.3.0+918756b
OpenShift version: 4.2.10 (embedded in binary)

CRC status

CRC VM:          Running
OpenShift:       Running (v4.2.10)
Disk Usage:      11.63GB of 32.2GB (Inside the CRC VM)
Cache Usage:     13.75GB
Cache Directory: /home/test/.crc/cache

CRC config

Empty output.

Host Operating System

NAME="Red Hat Enterprise Linux"
VERSION="8.1 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.1"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.1 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.1:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.1
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.1"

Steps to reproduce

  1. crc setup
  2. crc start -p /tmp/pull.secret -m 16000 to give the VM enough memory to allow monitoring to work, also see [BUG] Documentation should state that monitoring will not work unless memory is increased from default #810
  3. oc login -u kubeadmin
  4. oc scale --replicas=1 statefulset --all -n openshift-monitoring; oc scale --replicas=1 deployment --all -n openshift-monitoring based on https://code-ready.github.io/crc/#starting-monitoring-alerting-telemetry_gsg
  5. Log in to console at https://console-openshift-console.apps-crc.testing/dashboards
  6. Try to make sense of the health of the cluster.
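
For reference, steps 1–4 can be run back to back as a plain shell sketch. The API endpoint passed to oc login is an assumption (step 3 above only shows oc login -u kubeadmin); adjust it to whatever crc start prints:

    # set up and start CRC with enough memory for monitoring
    crc setup
    crc start -p /tmp/pull.secret -m 16000

    # log in as kubeadmin (API URL assumed to be the CRC default)
    oc login -u kubeadmin https://api.crc.testing:6443

    # re-enable the monitoring stack that CRC scales down by default
    oc scale --replicas=1 statefulset --all -n openshift-monitoring
    oc scale --replicas=1 deployment --all -n openshift-monitoring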

Expected

https://console-openshift-console.apps-crc.testing/dashboards says "Cluster is healthy" and there are no alerts.

https://console-openshift-console.apps-crc.testing/settings/cluster/ lists Last Completed Version and the Update History as completed.

https://console-openshift-console.apps-crc.testing/settings/cluster/clusteroperators shows all cluster operators "green" with no error messages.

Actual

https://console-openshift-console.apps-crc.testing/dashboards says "Cluster is healthy" but it lists a bunch of alerts:

  • [orange] 100% of the etcd targets are down.
  • [red] etcd cluster "etcd": insufficient members (0).
  • [red] etcd cluster "etcd": members are down (1).
  • [orange] Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.
  • [red] machine api operator is down
  • [red] Cluster version operator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.
  • [orange] Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.

https://console-openshift-console.apps-crc.testing/settings/cluster/ shows

  • Channel: stable-4.2
  • Last Completed Version: None
  • Update Status: Failing, View details

https://console-openshift-console.apps-crc.testing/settings/cluster/clusteroperators at the top shows message

Cluster update in progress. Unable to apply 4.2.10: the cluster operator monitoring has not yet successfully rolled out

However, all operators have their Status listed as Available with a green check mark; monitoring has the Message "Successfully rolled out the stack." and machine-api has the Message "-".

Overall, it is not clear what the status of the machine-api operator is and why the alert reports it as down when the operator list shows it running and available; it is not clear why the cluster version operator has the "has disappeared" alert and how to fix it; and it is not clear why the message speaks about the monitoring cluster operator not being rolled out when the monitoring operator itself says otherwise.

Logs

You can start crc with crc start --log-level debug to collect logs.
Please consider posting this on http://gist.github.com/ and post the link in the issue.

adelton commented Jan 7, 2020

It may well be that some of these are issues in OpenShift itself rather than in CRC and the environment it sets up for OpenShift. However, it is also possible that CRC could or should do something differently to remediate the problem and have the OpenShift cluster fully healthy, except perhaps for some alerts about there being only one node in the cluster.

gbraad commented Jan 7, 2020

https://console-openshift-console.apps-crc.testing/dashboards says "Cluster is healthy" but it lists a bunch of alerts:

This is part of a known issue:

  • Machine API operator
  • Cluster version operator

are disabled for CRC, as they do not provide functionality we need and would otherwise consume resources. Also see: https://access.redhat.com/documentation/en-us/red_hat_codeready_containers/1.0/html-single/release_notes_and_known_issues/index#metrics_are_disabled_by_default

Take note that https://access.redhat.com/documentation/en-us/red_hat_codeready_containers/1.3/html-single/getting_started_guide/index#common-tasks_gsg does mention:

 The machine-config cluster operator is expected to report False availability. The monitoring cluster operator is expected to report Unknown availability. 

which affects the results from metrics.

Note: we dropped the statement related to the cluster state reporting "Healthy" when this is not the case. I believe we did so on the premise of adding a message as part of the startup. Perhaps this got neglected/miscommunicated. Will bring this up again.

adelton commented Jan 8, 2020

Thanks @gbraad for the explanation.

Is there an easy way to enable the machine API and cluster version operators if I don't worry about the resource consumption that much, similar to the ability to enable monitoring per https://access.redhat.com/documentation/en-us/red_hat_codeready_containers/1.0/html-single/release_notes_and_known_issues/index#metrics_are_disabled_by_default ? I'd like to have the CRC-based testing cluster as close to a real OCP installation as possible, as a basis for future investigation.

@praveenkumar

@adelton The machine config operator doesn't work with a single-node cluster (openshift/machine-config-operator#579), and enabling the cluster version operator is fairly simple: just run oc scale --replicas=1 deployment --all -n openshift-cluster-version
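
A quick sketch of that, with a follow-up check that the operator actually comes back (not an official procedure, just the command from this thread plus standard oc queries):

    # scale the cluster version operator back up from 0 replicas
    oc scale --replicas=1 deployment --all -n openshift-cluster-version

    # verify the CVO pod starts and the ClusterVersion resource reports conditions again
    oc get pods -n openshift-cluster-version
    oc get clusterversion version -o yaml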

gbraad commented Jan 8, 2020

But @praveenkumar, wouldn't enabling the CVO have unwanted side effects? It wants to pull the cluster into a certain state ...

@praveenkumar

wouldn't enabling the CVO have unwanted side effects? It wants to pull the cluster into a certain state ...

@gbraad No, it shouldn't, because @adelton has already enabled the monitoring operator, which is what we disable. It will also bring the cluster state back from the changes we make here: https://github.com/code-ready/snc/blob/master/snc.sh#L221-L255
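
To get an idea of what has been scaled down on a given cluster (a sketch; the linked snc.sh lines remain the source of truth for what CRC disables), one can list the deployments that currently sit at zero replicas:

    # show the header plus every deployment whose READY column reads 0/0
    oc get deployments --all-namespaces | awk 'NR==1 || $3 == "0/0"'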

adelton commented Jan 8, 2020

enabling the cluster version operator is fairly simple: just run oc scale --replicas=1 deployment --all -n openshift-cluster-version

Thank you. Should I expect it to appear at https://console-openshift-console.apps-crc.testing/settings/cluster/clusteroperators?

The

Cluster version operator has disappeared

alert is gone but I see https://console-openshift-console.apps-crc.testing/k8s/cluster/config.openshift.io~v1~ClusterVersion/version reporting Failing because of

Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (383 of 433)

And in OCM, in the Cluster operators listing, I see "version" Failing, with a link pointing to https://console-openshift-console.apps-crc.testing/k8s/cluster/config.openshift.io~v1~ClusterOperator/version which says 404 Not Found. Note the ClusterOperator/version URL, as opposed to the ClusterVersion/version URL which has some content in it.
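
As a sketch, the same Failing condition can also be read from the command line instead of the console:

    # print the ClusterVersion conditions, including the Failing one quoted above
    oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'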

@praveenkumar

Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (383 of 433)

@adelton This is expected: per the CVO, we should have 3 instances of etcd to form a quorum, which is not possible for CRC :(
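
A quick way to see the mismatch on the cluster itself, using the namespace and deployment name from the error message above (a sketch):

    # the quorum guard wants 3 replicas spread across 3 masters; CRC has a single node
    oc get deployment etcd-quorum-guard -n openshift-machine-config-operator
    oc get nodes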

adelton commented Jan 9, 2020

Digging into it a bit more, it seems like oc scale --replicas=1 deployment --all -n openshift-machine-config-operator makes the CRC-based cluster report itself as fully healthy, with no alerts, and fully updated to version 4.2.10. At least temporarily, before the CVO (?) raises the number of replicas back to 3.

adelton commented Jan 10, 2020

For the record, running this (enabling monitoring, openshift-cluster-version, and openshift-machine-config-operator) allows me to initiate an upgrade to 4.2.13. When the upgrade process gets stuck waiting for something (insights), bumping its replicas value to 1 and back to 0 after the upgrade makes the upgrade proceed up to 88 per cent, at which point it gets stuck because machine-config reports

Failed to resync 4.2.10 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)

I've searched around but did not find a definitive cause (or workaround) for the pool master is not ready error, at least for the 4.2 version of OCP.
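
For completeness, a hedged sketch of the sequence described above. How the upgrade was initiated is not stated in this thread, and the insights operator's namespace and deployment name are assumptions, so treat this only as an illustration:

    # initiate the upgrade (assumed here to be done via oc adm upgrade; the console works too)
    oc adm upgrade --to=4.2.13

    # if the upgrade hangs on insights, bump its replicas up and back down
    # (namespace and deployment name are assumptions, not taken from this thread)
    oc scale --replicas=1 deployment insights-operator -n openshift-insights
    oc scale --replicas=0 deployment insights-operator -n openshift-insights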

praveenkumar commented Jan 13, 2020

When the upgrade process gets stuck waiting for something (insights), bumping its replicas value to 1 and back to 0 after the upgrade makes the upgrade proceed up to 88 per cent, at which point it gets stuck because machine-config reports

@adelton You can't perform an upgrade of the cluster using CRC, since it is a single-node cluster and the machine config operator is not going to be supported for this. In a real cluster, the upgrade happens one by one on the worker/master nodes, rebooting them. If you really want to test upgrades, then CRC might not be a good choice :(

adelton commented Jan 13, 2020

Understood. Thank you.

adelton closed this as completed Jan 13, 2020
@nelsonspbr

@praveenkumar Then I suggest that https://access.redhat.com/documentation/en-us/red_hat_codeready_containers/1.3/html-single/getting_started_guide/index (and possibly other places) should explicitly mention that CRC does not support machine-config, instead of just saying it is disabled by default. From that wording I had understood that I could enable it if I wanted. I spent some time trying to make it work until I finally found this GitHub issue.
