[WIP][DNM]machine-check: Use node problem detector to detect machine check #116

Closed
wants to merge 1 commit into from


mcastelino
Contributor

Use node problem detector to detect machine check errors reported
in a cluster. When a machine check happens in a cluster, an event is
generated and reported to Kubernetes:

clear@clr-01 ~/clr-k8s-examples $ kubectl get events
LAST SEEN   TYPE      REASON           OBJECT        MESSAGE
11m         Warning   Hardware Error   node/clr-01   mce: [Hardware Error]: Machine check events logged
11m         Warning   Hardware Error   node/clr-01   Hardware event. This is not a software error.
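
The two events above correspond to kernel log lines that the detector matched. As a rough illustration of that idea only, here is a minimal Python sketch of matching log lines against patterns and turning a match into an event-like record. The names `RULES` and `match_line` and the patterns themselves are hypothetical; they are not taken from this PR or from node problem detector's actual configuration or API.

```python
import re

# Hypothetical rules: (reason, compiled pattern). These patterns are
# assumptions modelled on the two messages shown above, not the PR's config.
RULES = [
    ("Hardware Error", re.compile(r"mce: \[Hardware Error\]: .*")),
    ("Hardware Error", re.compile(r"Hardware event\. This is not a software error\.")),
]


def match_line(line):
    """Return an event-like dict for the first rule matching the line, else None."""
    for reason, pattern in RULES:
        found = pattern.search(line)
        if found:
            # The matched text becomes the event message.
            return {"type": "Warning", "reason": reason, "message": found.group(0)}
    return None
```

Lines that match no rule return `None`, so unrelated kernel messages would produce no event in this sketch.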

Signed-off-by: Manohar Castelino <manohar.r.castelino@intel.com>

To get cluster wide status:

```
$ kubectl get events --field-selector reason="Hardware Error"
LAST SEEN   TYPE      REASON           OBJECT        MESSAGE
22m         Warning   Hardware Error   node/clr-01   mce: [Hardware Error]: Machine check events logged
22m         Warning   Hardware Error   node/clr-01   Hardware event. This is not a software error.
```

For full details on all reported events:

```
$ kubectl get events --field-selector reason="Hardware Error" -o json
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "v1",
            "count": 2,
            "eventTime": null,
            "firstTimestamp": "2019-07-16T00:33:23Z",
            "involvedObject": {
                "kind": "Node",
                "name": "clr-01",
                "uid": "clr-01"
            },
            "kind": "Event",
            "lastTimestamp": "2019-07-16T00:33:23Z",
            "message": "mce: [Hardware Error]: Machine check events logged",
            "metadata": {
                "creationTimestamp": "2019-07-16T00:33:23Z",
                "name": "clr-01.15b1bbf43c5827c5",
                "namespace": "default",
                "resourceVersion": "4800",
                "selfLink": "/api/v1/namespaces/default/events/clr-01.15b1bbf43c5827c5",
                "uid": "13e0a6c3-34e0-47c9-93b3-7738db0f9f68"
            },
            "reason": "Hardware Error",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "kernel-monitor",
                "host": "clr-01"
            },
            "type": "Warning"
        },
        {
            "apiVersion": "v1",
            "count": 2,
            "eventTime": null,
            "firstTimestamp": "2019-07-16T00:33:23Z",
            "involvedObject": {
                "kind": "Node",
                "name": "clr-01",
                "uid": "clr-01"
            },
            "kind": "Event",
            "lastTimestamp": "2019-07-16T00:33:23Z",
            "message": "Hardware event. This is not a software error.",
            "metadata": {
                "creationTimestamp": "2019-07-16T00:33:23Z",
                "name": "clr-01.15b1bbf45382ac9e",
                "namespace": "default",
                "resourceVersion": "4802",
                "selfLink": "/api/v1/namespaces/default/events/clr-01.15b1bbf45382ac9e",
                "uid": "f1255bd1-8614-4d30-b558-ffbae2daef79"
            },
            "reason": "Hardware Error",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "mce-monitor",
                "host": "clr-01"
            },
            "type": "Warning"
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

```
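
The JSON listing above can be condensed into a per-node summary. The following is a minimal sketch, assuming the output has the `items`, `involvedObject.name`, `source.component`, and `count` fields shown above. The function name `summarize_hardware_errors` is hypothetical and is not part of this PR.

```python
import json
from collections import Counter


def summarize_hardware_errors(events_json):
    """Sum reported occurrences per (node, source component).

    `events_json` is the text produced by `kubectl get events ... -o json`.
    """
    doc = json.loads(events_json)
    totals = Counter()
    for item in doc.get("items", []):
        # Skip anything that is not a hardware error, in case the
        # input was not filtered with --field-selector.
        if item.get("reason") != "Hardware Error":
            continue
        node = item.get("involvedObject", {}).get("name", "unknown")
        component = item.get("source", {}).get("component", "unknown")
        # `count` may be missing or null; treat that as one occurrence.
        totals[(node, component)] += item.get("count") or 1
    return dict(totals)
```

One way to use it would be to capture the command's output, for example with `subprocess.run(["kubectl", "get", "events", "--field-selector", "reason=Hardware Error", "-o", "json"], capture_output=True, text=True).stdout`, and pass that string to the function.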

Signed-off-by: Manohar Castelino <manohar.r.castelino@intel.com>
@ahsan518
Contributor

Reviewing changes and testing

@krsna1729
Contributor

#34

@krsna1729 krsna1729 mentioned this pull request Aug 28, 2019
@ahsan518
Contributor

Tested this patch and it works as expected. Shall we merge this and add feature #34 later in a separate pull request, @krsna1729?

@krsna1729
Contributor

@ahsan518 I was cross-referencing the initial issue opened to look at node problem detector. Don't mind me.

@jascott1
Contributor

jascott1 commented Sep 3, 2019

Closing this as it is superseded by #160

@jascott1 jascott1 closed this Sep 3, 2019