Report status per unit #36183

belimawr · 2023-07-31T13:53:07Z

What does this PR do?

This PR updates the ManagerV2 to set status per Unit based on its inputs. If any input on a Unit returns an error when starting, the whole Unit is set as failed. If multiple inputs return an error, all errors are reported in the Message field.

If the output unit returns an error when starting, only the output Unit is set as failed. All other input unit states are not modified (this was the behaviour before this PR).

Why is it important?

It allows users to better understand which unit has failed and which ones are working.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~- [ ] I have made corresponding changes to the documentation~~
~~- [ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
I have made my commit title and message explanatory about the purpose and the reason of the change

How to test this PR locally

1. Build a development version of the Elastic-Agent
2. Build Filebeat from this PR (from the x-pack folder)
3. Replace Filebeat's binary in the Elastic-Agent components subfolder with the one you built
4. Run the Elastic-Agent with a standalone policy and two input units, one of which must contain an error (use the policy below)
5. Run the status command: ./elastic-agent status --output full
6. Only one input unit must report as FAILED.

elastic-agent.yml

outputs:
  default:
    type: elasticsearch
    hosts:
      - https://localhost:9200
    username: "elastic"
    password: "changeme"
    ssl.verification_mode: none


inputs:
  - type: filestream
    id: input-1
    streams:
      - id: filestream-input-id-stream-block
        data_stream:
          dataset: generic
        paths:
          - /var/log/*.log
  - type: filestream
    id: input-2
    streams:
      - id: filestream-input-id-stream-block
        data_stream:
          dataset: generic
        pathsBroken:
          - /var/log/*.log

agent.monitoring:
  enabled: false
  logs: false
  metrics: false

Expected output

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ info
   │  ├─ id: dde03f6c-733d-420b-b9b5-075923d14b9b
   │  ├─ version: 8.10.0
   │  └─ commit: f2e4f71775af532b0e256287829df41b6fd8a962
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '1664568'
      ├─ filestream-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ filestream-default-input-1
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ filestream-default-input-2
         ├─ status: (FAILED) no path is configured accessing config
         └─ type: INPUT

Related issues

Closes #35874 (Errors loading inputs and outputs configured by the Elastic Agent should be reported per unit)

mergify · 2023-07-31T13:53:56Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine · 2023-07-31T15:06:21Z

💚 Build Succeeded

Build stats

Start Time: 2023-08-02T18:36:29.261+0000
Duration: 71 min 10 sec

Test stats 🧪

Test	Results
Failed	0
Passed	27595
Skipped	2008
Total	29603

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine · 2023-07-31T17:24:11Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

ycombinator · 2023-07-31T20:43:42Z

Followed the instructions for testing this PR locally.

Before this PR

$ sudo elastic-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '92195'
      ├─ filestream-default
      │  └─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
      ├─ filestream-default-input-1
      │  └─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
      └─ filestream-default-input-2
         └─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]

With this PR

$ sudo elastic-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '93096'
      └─ filestream-default-input-2
         └─ status: (FAILED) no path is configured accessing config

In the output of elastic-agent status with this PR, I only saw filestream-default-input-2. Shouldn't filestream-default-input-1 also be included?

belimawr · 2023-08-01T09:08:30Z

> In the output of elastic-agent status with this PR, I only saw filestream-default-input-2. Shouldn't filestream-default-input-1 also be included?

Did you run ./elastic-agent status --output full? Without that parameter you won't get the full status. Here is the difference between them:

./elastic-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '2262415'
      └─ filestream-default-input-2
         └─ status: (FAILED) no path is configured accessing config

./elastic-agent status --output full
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ info
   │  ├─ id: dde03f6c-733d-420b-b9b5-075923d14b9b
   │  ├─ version: 8.10.0
   │  └─ commit: f2e4f71775af532b0e256287829df41b6fd8a962
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '2262415'
      ├─ filestream-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ filestream-default-input-1
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ filestream-default-input-2
         ├─ status: (FAILED) no path is configured accessing config
         └─ type: INPUT

ycombinator · 2023-08-01T13:16:30Z

Thanks @belimawr, I did not run with --output full and will try that now.

But I'm concerned that the output of ./elastic-agent status before this PR shows both filestream-default-input-1 and filestream-default-input-2 inputs as well as the filestream-default output whereas, with this PR, the same command's output shows only the failing filestream-default-input-2 input. This feels like a regression?

ycombinator · 2023-08-01T13:31:38Z

I tried elastic-agent status --output full and it looks good (see below)!

However, I'm still concerned about the potential regression in the output of elastic-agent status (see previous comment).

Before this PR

$ sudo elastic-agent status --output full
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ info
   │  ├─ id: e4197319-ce44-44c5-aff6-50aaa8456999
   │  ├─ version: 8.10.0
   │  └─ commit: b60b8b04c41f67a07958b59d18977ba0229bf233
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '27711'
      ├─ filestream-default
      │  ├─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
      │  └─ type: OUTPUT
      ├─ filestream-default-input-1
      │  ├─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
      │  └─ type: INPUT
      └─ filestream-default-input-2
         ├─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
         └─ type: INPUT

With this PR

$ sudo elastic-agent status --output full
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ info
   │  ├─ id: 627093e4-ad6b-4afd-8a50-a3cfa84d6190
   │  ├─ version: 8.10.0
   │  └─ commit: b60b8b04c41f67a07958b59d18977ba0229bf233
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '27494'
      ├─ filestream-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ filestream-default-input-1
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ filestream-default-input-2
         ├─ status: (FAILED) no path is configured accessing config
         └─ type: INPUT

pierrehilbert · 2023-08-02T05:36:11Z

In your last command, you have every input and not only the failing one @ycombinator.
Is it something that is not happening every time?

ycombinator · 2023-08-02T06:18:00Z

> In your last command, you have every input and not only the failing one @ycombinator. Is it something that is not happening every time?

elastic-agent status --output full is working as expected and is an improvement with this PR. It's elastic-agent status that I'm concerned about — it feels like a regression to me (see #36183 (comment)).

belimawr · 2023-08-02T10:05:54Z

inputs:
  - type: filestream
    id: input-1
    streams:
      - id: filestream-input-id-stream-block
        data_stream:
          dataset: generic
        paths:
          - /var/log/*.log
  - type: filestream
    id: input-2
    streams:
      - id: filestream-input-id-stream-block
        data_stream:
          dataset: generic
        pathsBroken:
          - /var/log/*.log

./elastic-agent status only reports units that are not healthy. Before this PR the status was reported per Beat, so all units of a Beat were marked as failed together even when only some of them had failed.

With this PR only the failed units go into an unhealthy state, hence fewer units are reported by ./elastic-agent status when not all of them have failed.

An important thing to note is that, due to the Beats internals, when the output fails to start the Beat will not start any input. Trying to start the log input with a failed output will deadlock; trying to start filestream with a failed output will not return an error, but the input will never be fully started.

This behaviour was there before this PR.

If you have a failed output the status command will show this:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '3196877'
      ├─ filestream-default
      │  └─ status: (FAILED) could not start output: failed to reload output: missing required field accessing 'elasticsearch.hosts'
      ├─ filestream-default-input-1
      │  └─ status: (STARTING) Starting
      └─ filestream-default-input-2
         └─ status: (STARTING) Starting
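
The difference between the two listings discussed in this thread can be illustrated with a small filter. This is only a sketch of the observed behaviour, not the Elastic-Agent implementation: the `Unit` type and the `visibleUnits` function are hypothetical, and the `full` flag merely stands in for running the command with `--output full`.

```go
package main

import "fmt"

// Unit is a hypothetical, simplified view of one unit in the status tree.
type Unit struct {
	ID    string
	State string // e.g. "HEALTHY", "FAILED", "STARTING"
}

// visibleUnits returns the units a status listing would print.
// With full set to true every unit is kept; otherwise only units
// that are not healthy are kept, in their original order.
func visibleUnits(units []Unit, full bool) []Unit {
	if full {
		return units
	}
	var shown []Unit
	for _, u := range units {
		if u.State != "HEALTHY" {
			shown = append(shown, u)
		}
	}
	return shown
}

func main() {
	// States taken from the example with one broken input.
	units := []Unit{
		{ID: "filestream-default", State: "HEALTHY"},
		{ID: "filestream-default-input-1", State: "HEALTHY"},
		{ID: "filestream-default-input-2", State: "FAILED"},
	}
	for _, u := range visibleUnits(units, false) {
		fmt.Printf("%s: (%s)\n", u.ID, u.State)
	}
}
```

With a failed output and inputs stuck in STARTING, none of the units is healthy, so the same filter keeps all three, which matches the listing above.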

ycombinator · 2023-08-02T13:00:34Z

I see, thanks for clarifying @belimawr. I missed the fact that the default value of --output, which is human, was always intended to show only non-healthy details. In that case, the behavior with this PR for elastic-agent status and elastic-agent status --output full is 👍. I'll proceed to reviewing the code now...

libbeat/cfgfile/list.go

x-pack/libbeat/management/managerV2.go

ycombinator

LGTM.

This commit updates the ManagerV2 to set status per Unit based on its inputs. If any input on a Unit returns an error when starting, the whole Unit is set as failed. If multiple inputs return an error, all errors are reported in the `Message` field. If the output unit returns an error when starting, only the output Unit is set as failed. All other input unit states are not modified (this was the behaviour before this commit).

botelastic bot added the needs_team label Jul 31, 2023

mergify bot assigned belimawr Jul 31, 2023

belimawr force-pushed the errors-per-unit branch from 4a98c0e to c4d4c5f Compare July 31, 2023 13:54

belimawr force-pushed the errors-per-unit branch 2 times, most recently from f6d5c97 to e65bd85 Compare July 31, 2023 16:03

belimawr marked this pull request as ready for review July 31, 2023 16:03

belimawr requested a review from a team as a code owner July 31, 2023 16:03

belimawr requested review from ycombinator and fearful-symmetry July 31, 2023 16:03

pierrehilbert added the Team:Elastic-Agent label Jul 31, 2023

botelastic bot removed the needs_team label Jul 31, 2023

ycombinator reviewed Aug 2, 2023

libbeat/cfgfile/list.go (resolved)

ycombinator reviewed Aug 2, 2023

libbeat/cfgfile/list.go (resolved)

ycombinator reviewed Aug 2, 2023

x-pack/libbeat/management/managerV2.go (outdated, resolved)

ycombinator approved these changes Aug 2, 2023

belimawr added 2 commits August 2, 2023 20:35

PR improvements

ad18fff

belimawr force-pushed the errors-per-unit branch from e8d6326 to ad18fff Compare August 2, 2023 18:36

belimawr merged commit a2eaff7 into elastic:main Aug 4, 2023
83 of 86 checks passed

AndersonQ mentioned this pull request Sep 28, 2023: Agent remains Unhealthy even on updating invalid integration configuration to valid input. elastic/elastic-agent#2954 (Closed, 2 tasks)

cmacknz mentioned this pull request Oct 10, 2023: [Fleet] Implement per-integration health reporting for output elastic/kibana#159300 (Open)
