
Add SafeProcessor wrapper #34647

Merged

merged 5 commits into elastic:main from fix-add-kubernetes-metadata-panic on Feb 24, 2023

Conversation

@rdner (Member) commented Feb 22, 2023

What does this PR do?

Each processor might end up in multiple processor groups. Every group has its own Close that calls Close on each processor in that group, which leads to multiple Close calls on the same processor.

If the SafeProcessor wrapper were not used, each processor would have to handle multiple Close calls itself, for example by adding sync.Once to its Close function.

We make it easier for processor developers and take care of it in the processor registry instead.
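
For illustration, a minimal sketch of the idea; the type names and the error value are assumed for this sketch, while the actual implementation lives in libbeat/processors/safe_processor.go and wraps libbeat's processor interface:

package processors

import (
	"errors"
	"sync/atomic"
)

// Processor stands in for libbeat's processor interface; the real one
// operates on *beat.Event, but a string event keeps the sketch
// self-contained.
type Processor interface {
	Run(event string) (string, error)
	Close() error
}

// ErrClosed is an assumed sentinel error for use-after-Close.
var ErrClosed = errors.New("attempt to use a closed processor")

// SafeProcessor wraps a Processor so that Close is idempotent and Run
// refuses to process events once the processor has been closed.
type SafeProcessor struct {
	Processor
	closed atomic.Bool
}

// Run rejects events after Close instead of letting a closed
// processor run on freed resources.
func (p *SafeProcessor) Run(event string) (string, error) {
	if p.closed.Load() {
		return "", ErrClosed
	}
	return p.Processor.Run(event)
}

// Close forwards to the wrapped processor only once, no matter how
// many processor groups call it.
func (p *SafeProcessor) Close() error {
	if p.closed.CompareAndSwap(false, true) {
		return p.Processor.Close()
	}
	return nil
}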

Why is it important?

Our customers could see panics in their logs, which makes the logs noisy.

Also, there are reports of higher impact.

Checklist

  - [x] My code follows the style guidelines of this project
  - [x] I have commented my code, particularly in hard-to-understand areas
  - [ ] I have made corresponding changes to the documentation
  - [ ] I have made corresponding changes to the default configuration files
  - [x] I have added tests that prove my fix is effective or that my feature works
  - [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Unfortunately, it's very difficult to reproduce the panic described in the issue; it seems to be the result of some kind of race condition when Beats shuts down.

The change is covered by unit tests, and I made sure that a simple processor with the following config still works and there is no regression:

filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    enabled: true
    paths:
      - "metadata-panic/logs/log*.ndjson"
    processors:
      - add_fields:
          target: project
          fields:
            name: myproject
            id: 'UNIQUE_VALUE'

path.data: "metadata-panic/data"
logging:
  level: debug
output.console:
  enabled: true

and I got this event published:

{
  "@timestamp": "2023-02-23T10:33:22.673Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "8.8.0"
  },
  "log": {
    "offset": 32170800,
    "file": {
      "path": "metadata-panic/logs/log1.ndjson"
    }
  },
  "message": "{\"data\": \"sample\"}",
  "input": {
    "type": "filestream"
  },
  "project": {
    "name": "myproject",
    "id": "UNIQUE_VALUE"
  },
  "agent": {
    "ephemeral_id": "926e973d-f35b-4f1e-95aa-6b9cf239e21c",
    "id": "2f6fa2ce-1a37-4259-91e8-1df581639d07",
    "name": "MacBook-Pro.localdomain",
    "type": "filebeat",
    "version": "8.8.0"
  },
  "ecs": {
    "version": "8.0.0"
  },
  "host": {
    "name": "MacBook-Pro.localdomain"
  }
}

Note the fields added by the add_fields processor:

"project": {
    "name": "myproject",
    "id": "UNIQUE_VALUE"
  }

Related issues

  • add_kubernetes_metadata panics when a Beat shuts down

@rdner rdner self-assigned this Feb 22, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 22, 2023
@rdner rdner added bug bugfix and removed needs_team Indicates that the issue/PR needs a Team:* label labels Feb 22, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 22, 2023
mergify bot (Contributor) commented Feb 22, 2023

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @rdner? 🙏 For that, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • `backport-v8./d.0` is the label to automatically backport to the `8./d` branch, where `/d` is the digit

elasticmachine (Collaborator) commented Feb 22, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-02-24T08:20:13.190+0000

  • Duration: 66 min 5 sec

Test stats 🧪

  • Failed: 0
  • Passed: 25993
  • Skipped: 1962
  • Total: 27955

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@rdner rdner added backport-v7.17.0 Automated backport with mergify backport-v8.7.0 Automated backport with mergify labels Feb 23, 2023
@rdner rdner added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Feb 23, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 23, 2023
@rdner rdner marked this pull request as ready for review February 23, 2023 12:24
@rdner rdner requested a review from a team as a code owner February 23, 2023 12:24
@rdner rdner requested review from belimawr and leehinman and removed request for a team February 23, 2023 12:24
elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@rdner rdner requested a review from cmacknz February 23, 2023 12:24
@cmacknz cmacknz added the backport-v8.6.0 Automated backport with mergify label Feb 23, 2023
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
@rdner (Member, Author) commented Feb 24, 2023

@cmacknz there will be no more 8.6 releases, so I don't think we need that backport label.

@rdner rdner requested a review from cmacknz February 24, 2023 14:49
@leehinman (Contributor) left a comment

Code LGTM.

One optional nit on the name SafeProcessor: the name doesn't say what the processor is safe from, so if you just encountered it in registry.go you wouldn't have many clues as to why the constructor is being wrapped. Names like CloseOnce or AtomicCloser might be better.

@rdner (Member, Author) commented Feb 24, 2023

@leehinman the name comes from the fact that it addresses not just the Close function call but also Run: you can't run a processor that was closed before, and therefore can't cause panics by doing so.

I've put a lot of thought into the naming and I'd rather keep it this way, so we can perhaps extend the functionality of this wrapper in the future, e.g. blocking certain events from being processed by any processor.
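
Continuing the sketch from the PR description (compiled as package main, with "fmt" and "strings" added to its imports), the behavior described here, an idempotent Close plus a guarded Run, would look roughly like this:

// upper is a toy processor used only for this demonstration.
type upper struct{}

func (upper) Run(event string) (string, error) { return strings.ToUpper(event), nil }
func (upper) Close() error                     { fmt.Println("wrapped Close ran"); return nil }

func main() {
	p := &SafeProcessor{Processor: upper{}}

	out, _ := p.Run("hello")
	fmt.Println(out) // HELLO

	_ = p.Close() // prints "wrapped Close ran"
	_ = p.Close() // no-op: the second Close never reaches the wrapped processor

	if _, err := p.Run("again"); err != nil {
		fmt.Println(err) // attempt to use a closed processor
	}
}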

@rdner rdner merged commit 5007b58 into elastic:main Feb 24, 2023
@rdner rdner deleted the fix-add-kubernetes-metadata-panic branch February 24, 2023 16:03
@rdner rdner removed the backport-v8.6.0 Automated backport with mergify label Feb 24, 2023
mergify bot pushed a commit that referenced this pull request Feb 24, 2023
Each processor might end up in multiple processor groups.
Every group has its own `Close` that calls `Close` on each
processor of that group which leads to multiple `Close` calls
on the same processor.

If the `SafeProcessor` wrapper was not used,
the processor would have to handle multiple `Close` calls by
adding `sync.Once` in its `Close` function.

We make it easier for processor developers and take care of it
in the processor registry instead.

(cherry picked from commit 5007b58)
mergify bot pushed a commit that referenced this pull request Feb 24, 2023
Each processor might end up in multiple processor groups.
Every group has its own `Close` that calls `Close` on each
processor of that group which leads to multiple `Close` calls
on the same processor.

If the `SafeProcessor` wrapper was not used,
the processor would have to handle multiple `Close` calls by
adding `sync.Once` in its `Close` function.

We make it easier for processor developers and take care of it
in the processor registry instead.

(cherry picked from commit 5007b58)
mergify bot pushed a commit that referenced this pull request Feb 24, 2023
Each processor might end up in multiple processor groups.
Every group has its own `Close` that calls `Close` on each
processor of that group which leads to multiple `Close` calls
on the same processor.

If the `SafeProcessor` wrapper was not used,
the processor would have to handle multiple `Close` calls by
adding `sync.Once` in its `Close` function.

We make it easier for processor developers and take care of it
in the processor registry instead.

(cherry picked from commit 5007b58)
rdner added a commit that referenced this pull request Feb 24, 2023
* Add `SafeProcessor` wrapper (#34647)

Each processor might end up in multiple processor groups.
Every group has its own `Close` that calls `Close` on each
processor of that group which leads to multiple `Close` calls
on the same processor.

If the `SafeProcessor` wrapper was not used,
the processor would have to handle multiple `Close` calls by
adding `sync.Once` in its `Close` function.

We make it easier for processor developers and take care of it
in the processor registry instead.

(cherry picked from commit 5007b58)

* Remove extra changelog entries

---------

Co-authored-by: Denis <denis.rechkunov@elastic.co>
rdner added a commit that referenced this pull request Feb 24, 2023
* Add `SafeProcessor` wrapper (#34647)

Each processor might end up in multiple processor groups.
Every group has its own `Close` that calls `Close` on each
processor of that group which leads to multiple `Close` calls
on the same processor.

If the `SafeProcessor` wrapper was not used,
the processor would have to handle multiple `Close` calls by
adding `sync.Once` in its `Close` function.

We make it easier for processor developers and take care of it
in the processor registry instead.

(cherry picked from commit 5007b58)

* Remove extra changelog entries

* Fix different constructor signature

---------

Co-authored-by: Denis <denis.rechkunov@elastic.co>
rdner added a commit that referenced this pull request Feb 27, 2023
Each processor might end up in multiple processor groups.
Every group has its own `Close` that calls `Close` on each
processor of that group which leads to multiple `Close` calls
on the same processor.

If the `SafeProcessor` wrapper was not used,
the processor would have to handle multiple `Close` calls by
adding `sync.Once` in its `Close` function.

We make it easier for processor developers and take care of it
in the processor registry instead.

(cherry picked from commit 5007b58)

Co-authored-by: Denis <denis.rechkunov@elastic.co>
rdner added a commit to rdner/beats that referenced this pull request Mar 7, 2023
After introducing the `SafeProcessor` wrapper in
elastic#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation, the code that was re-using processors was
finally found.

This is the change that stops re-using the processors and
instantiates them on each input restart.
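
A hedged sketch of the shape of that change, with names invented purely for illustration (the real change touches filebeat/channel/runner.go, per the conflict lists later in this thread): build the processors fresh on every input (re)start instead of re-running a set that an earlier run already closed.

package main

import "fmt"

// proc stands in for a configured processor; only Close matters here.
type proc struct{ name string }

func (p *proc) Close() { fmt.Println("closed", p.name) }

// buildProcessors plays the role of constructing the processor pipeline
// from the input's configuration.
func buildProcessors() []*proc {
	return []*proc{{name: "add_fields"}, {name: "add_kubernetes_metadata"}}
}

// runInput simulates one input run: fresh processors in, all closed on stop.
func runInput(run int) {
	procs := buildProcessors() // fresh instances, never re-used after Close
	fmt.Printf("run %d started with %d processors\n", run, len(procs))
	for _, p := range procs {
		p.Close()
	}
}

func main() {
	// Each restart gets its own processors instead of re-using ones that
	// a previous run already closed.
	runInput(1)
	runInput(2)
}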
cmacknz pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation, the code that was re-using processors was
finally found.

This is the change that stops re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation, the code that was re-using processors was
finally found.

This is the change that stops re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation, the code that was re-using processors was
finally found.

This is the change that stops re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation, the code that was re-using processors was
finally found.

This is the change that stops re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

# Conflicts:
#	filebeat/channel/runner.go
#	libbeat/processors/safe_processor.go
cmacknz pushed a commit that referenced this pull request Mar 8, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation, the code that was re-using processors was
finally found.

This is the change that stops re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

Co-authored-by: Denis <denis.rechkunov@elastic.co>
cmacknz pushed a commit that referenced this pull request Mar 8, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation, the code that was re-using processors was
finally found.

This is the change that stops re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

Co-authored-by: Denis <denis.rechkunov@elastic.co>
rdner added a commit that referenced this pull request Mar 8, 2023
Stop re-using processors defined in the config (#34761) (#34764)

* Stop re-using processors defined in the config (#34761)

* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation, the code that was re-using processors was
finally found.

This is the change that stops re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

# Conflicts:
#	filebeat/channel/runner.go
#	libbeat/processors/safe_processor.go

* Resolve conflicts, fix changelog

* Add new line to changelog

* Revert comment auto-formatting

---------

Co-authored-by: Denis <denis.rechkunov@elastic.co>
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
Each processor might end up in multiple processor groups.
Every group has its own `Close` that calls `Close` on each
processor of that group which leads to multiple `Close` calls
on the same processor.

If the `SafeProcessor` wrapper was not used,
the processor would have to handle multiple `Close` calls by
adding `sync.Once` in its `Close` function.

We make it easier for processor developers and take care of it
in the processor registry instead.
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation, the code that was re-using processors was
finally found.

This is the change that stops re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry
Labels

  • backport-v7.17.0 (Automated backport with mergify)
  • backport-v8.7.0 (Automated backport with mergify)
  • bug
  • bugfix
  • Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)

Projects

None yet
Development

Successfully merging this pull request may close these issues.

add_kubernetes_metadata panics when a Beat shuts down
4 participants