Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report status per unit #36183

Merged
merged 2 commits into from
Aug 4, 2023
Merged

Report status per unit #36183

merged 2 commits into from
Aug 4, 2023

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Jul 31, 2023

What does this PR do?

This PR updates the ManagerV2 to set status per Unit based on its inputs. If any input on a Unit returns an error when starting the whole Unit is set as failed. If multiple inputs return an error, all errors are reported in the Message field.

If the output unit returns an error when starting, only the output Unit is set as failed. All other input unit states are not modified (this was the behaviour before this PR).

Why is it important?

It allow users to better understand which unit has failed and which ones are working.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
  • I have made my commit title and message explanatory about the purpose and the reason of the change

## Author's Checklist

How to test this PR locally

  1. Build a development version of the Elastic-Agent
  2. Build Filebeat from this PR (from the x-pack folder)
  3. Replace Filebeat's binary from the Elastic-Agent components sub folder by the one you built
  4. Run the Elastic-Agent with a standalone policy and two input units, one must contain an error (use the policy below)
  5. Run the status command: ./elastic-agent status --output full
  6. Only one input unit must report as FAILED.
elastic-agent.yml

outputs:
  default:
    type: elasticsearch
    hosts:
      - https://localhost:9200
    username: "elastic"
    password: "changeme"
    ssl.verification_mode: none


inputs:
  - type: filestream
    id: input-1
    streams:
      - id: filestream-input-id-stream-block
        data_stream:
          dataset: generic
        paths:
          - /var/log/*.log
  - type: filestream
    id: input-2
    streams:
      - id: filestream-input-id-stream-block
        data_stream:
          dataset: generic
        pathsBroken:
          - /var/log/*.log

agent.monitoring:
  enabled: false
  logs: false
  metrics: false

Expected output

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ info
   │  ├─ id: dde03f6c-733d-420b-b9b5-075923d14b9b
   │  ├─ version: 8.10.0
   │  └─ commit: f2e4f71775af532b0e256287829df41b6fd8a962
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '1664568'
      ├─ filestream-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ filestream-default-input-1
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ filestream-default-input-2
         ├─ status: (FAILED) no path is configured accessing config
         └─ type: INPUT

Related issues

## Use cases
## Screenshots
## Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jul 31, 2023
@mergify
Copy link
Contributor

mergify bot commented Jul 31, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Jul 31, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-08-02T18:36:29.261+0000

  • Duration: 71 min 10 sec

Test stats 🧪

Test Results
Failed 0
Passed 27595
Skipped 2008
Total 29603

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@belimawr belimawr force-pushed the errors-per-unit branch 2 times, most recently from f6d5c97 to e65bd85 Compare July 31, 2023 16:03
@belimawr belimawr marked this pull request as ready for review July 31, 2023 16:03
@belimawr belimawr requested a review from a team as a code owner July 31, 2023 16:03
@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Jul 31, 2023
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 31, 2023
@ycombinator
Copy link
Contributor

Followed the instructions for testing this PR locally.

Before this PR

$ sudo elastic-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '92195'
      ├─ filestream-default
      │  └─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
      ├─ filestream-default-input-1
      │  └─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
      └─ filestream-default-input-2
         └─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]

With this PR

$ sudo elastic-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '93096'
      └─ filestream-default-input-2
         └─ status: (FAILED) no path is configured accessing config

In the output of elastic-agent status with this PR, I only saw filestream-default-input-2. Shouldn't filestream-default-input-1 also be included?

@belimawr
Copy link
Contributor Author

belimawr commented Aug 1, 2023

In the output of elastic-agent status with this PR, I only saw filestream-default-input-2. Shouldn't filestream-default-input-1 also be included?

Did you run ./elastic-agent status --output full? Without the parameter you won't get the full status, here is the difference between them.

./elastic-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '2262415'
      └─ filestream-default-input-2
         └─ status: (FAILED) no path is configured accessing config

./elastic-agent status --output full
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ info
   │  ├─ id: dde03f6c-733d-420b-b9b5-075923d14b9b
   │  ├─ version: 8.10.0
   │  └─ commit: f2e4f71775af532b0e256287829df41b6fd8a962
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '2262415'
      ├─ filestream-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ filestream-default-input-1
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ filestream-default-input-2
         ├─ status: (FAILED) no path is configured accessing config
         └─ type: INPUT

@ycombinator
Copy link
Contributor

ycombinator commented Aug 1, 2023

Thanks @belimawr, I did not run with --output full and will try that now.

But I'm concerned that the output of ./elastic-agent status before this PR shows both filestream-default-input-1 and filestream-default-input-2 inputs as well as the filestream-default output whereas, with this PR, the same command's output shows only the failing filestream-default-input-2 input. This feels like a regression?

@ycombinator
Copy link
Contributor

I tried elastic-agent status --output full and it looks good (see below)!

However, I'm still concerned about the potential regression in the output of elastic-agent status (see previous comment).

Before this PR

$ sudo elastic-agent status --output full
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ info
   │  ├─ id: e4197319-ce44-44c5-aff6-50aaa8456999
   │  ├─ version: 8.10.0
   │  └─ commit: b60b8b04c41f67a07958b59d18977ba0229bf233
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '27711'
      ├─ filestream-default
      │  ├─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
      │  └─ type: OUTPUT
      ├─ filestream-default-input-1
      │  ├─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
      │  └─ type: INPUT
      └─ filestream-default-input-2
         ├─ status: (FAILED) [failed to reload inputs: 1 error: Error creating runner from config: no path is configured accessing config]
         └─ type: INPUT

With this PR

$ sudo elastic-agent status --output full
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ info
   │  ├─ id: 627093e4-ad6b-4afd-8a50-a3cfa84d6190
   │  ├─ version: 8.10.0
   │  └─ commit: b60b8b04c41f67a07958b59d18977ba0229bf233
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '27494'
      ├─ filestream-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ filestream-default-input-1
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ filestream-default-input-2
         ├─ status: (FAILED) no path is configured accessing config
         └─ type: INPUT

@pierrehilbert
Copy link
Collaborator

In your last command, you have every input and not only the failing one @ycombinator.
Is it something that is not happening every time?

@ycombinator
Copy link
Contributor

In your last command, you have every input and not only the failing one @ycombinator. Is it something that is not happening every time?

elastic-agent status --output full is working as expected and is an improvement with this PR. It's elastic-agent status that I'm concerned about — it feels like a regression to me (see #36183 (comment)).

@belimawr
Copy link
Contributor Author

belimawr commented Aug 2, 2023

inputs:
  - type: filestream
    id: input-1
    streams:
      - id: filestream-input-id-stream-block
        data_stream:
          dataset: generic
        paths:
          - /var/log/*.log
  - type: filestream
    id: input-2
    streams:
      - id: filestream-input-id-stream-block
        data_stream:
          dataset: generic
        pathsBroken:
          - /var/log/*.log

./elastic-agent status will only report units that are not healthy, because before this PR the status was reported per Beat, all units of a Beat would be marked as failed together even though not all had failed.

With this PR only the the failed units go into a unhealthy state, hence there is less units reported by /elastic-agent status when not all of them have failed.

An important thing to notice is that due to the Beats internals when the output fails to start the Beat will not start any input. Trying to start the log input with a failed output will deadlock, trying to start filestream with a failed output will not return an error but the input will never be fully started.

This behaviour was there before this PR.

If you have a failed output the status command will show this:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ filestream-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '3196877'
      ├─ filestream-default
      │  └─ status: (FAILED) could not start output: failed to reload output: missing required field accessing 'elasticsearch.hosts'
      ├─ filestream-default-input-1
      │  └─ status: (STARTING) Starting
      └─ filestream-default-input-2
         └─ status: (STARTING) Starting

@ycombinator
Copy link
Contributor

ycombinator commented Aug 2, 2023

I see, thanks for clarifying @belimawr. I missed the fact that the default value of --output, which is human, was always intended to show only non-healthy details. In that case, the behavior with this PR for elastic-agent status and elastic-agent status --output full is 👍. I'll proceed to reviewing the code now...

Copy link
Contributor

@ycombinator ycombinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

This commit updates the ManagerV2 to set status per Unit based on its
inputs. If any input on a Unit returns an error when starting the
whole Unit is set as failed. If multiple inputs return an error, all
errors are reported in the `Message` field.

If the output unit returns an error when starting, only the output
Unit is set as failed. All other input unit states are not
modified (this was the behaviour before this commit).
@belimawr belimawr merged commit a2eaff7 into elastic:main Aug 4, 2023
83 of 86 checks passed
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
This commit updates the ManagerV2 to set status per Unit based on its
inputs. If any input on a Unit returns an error when starting the whole
Unit is set as failed. If multiple inputs return an error, all errors
are reported in the `Message` field.

If the output unit returns an error when starting, only the output Unit
is set as failed. All other input unit states are not modified (this was
the behaviour before this commit).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Errors loading inputs and outputs configured by the Elastic Agent should be reported per unit
4 participants