
Only log events at debug level #37229

Merged
merged 3 commits into elastic:main on Nov 30, 2023

Conversation

@belimawr (Contributor) commented Nov 29, 2023

Proposed commit message

The Elasticsearch client was logging raw events at error and warn levels; this commit makes it log the raw events (or any event data) only at debug level. This means the error message returned by Elasticsearch is now only available at debug level, because it can contain the whole value of a field causing a mapping conflict.

Why is the Elasticsearch message only logged at debug level?

The new code removes two things from any level other than debug:

  • The raw event
  • The error message sent by Elasticsearch because it can contain the whole value of a field
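The resulting split can be sketched as follows. This is a hypothetical, self-contained mock of the pattern, not the real Beats logger: the warn line carries only the status, while the raw event and the Elasticsearch response only ever reach the debug line.

```go
package main

import "fmt"

// Minimal level-gated logger used only to illustrate the pattern
// this PR introduces (the real code uses the Beats logger).
type logger struct{ debugEnabled bool }

func (l logger) Warnf(format string, args ...any) string {
	return fmt.Sprintf("warn: "+format, args...)
}

func (l logger) Debugf(format string, args ...any) string {
	if !l.debugEnabled {
		return ""
	}
	return fmt.Sprintf("debug: "+format, args...)
}

// logDroppedEvent mirrors the split described above:
// status-only at warn, raw event plus Elasticsearch msg at debug.
func logDroppedEvent(l logger, event string, status int, msg string) []string {
	var lines []string
	// Safe at warn level: only the status, no event data.
	lines = append(lines, l.Warnf("Cannot index event (status=%v): dropping event!", status))
	// Raw event and Elasticsearch response only at debug level.
	if d := l.Debugf("Cannot index event %s (status=%v): %s, dropping event!", event, status, msg); d != "" {
		lines = append(lines, d)
	}
	return lines
}

func main() {
	for _, ln := range logDroppedEvent(logger{debugEnabled: false}, `{"int":"not a number"}`, 400, "document_parsing_exception") {
		fmt.Println(ln)
	}
}
```

With debug disabled, only the status-bearing warn line is emitted; enabling debug adds the line that embeds the raw event and the server's error message.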

Given the steps to test below, we get the following raw logs indicating indexing failure:

{"log.level":"warn","@timestamp":"2023-11-29T18:17:16.977+0100","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":450},"message":"Cannot index event (status=400): dropping event!","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-29T18:17:16.977+0100","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":451},"message":"Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2023, time.November, 29, 18, 17, 6, 949459738, time.Local), Meta:null, Fields:{\"agent\":{\"ephemeral_id\":\"af12ba86-b25c-4831-9bfd-ea3115507bd7\",\"id\":\"9e4935c9-fa26-465c-b18c-53fd5b4c70c2\",\"name\":\"millennium-falcon\",\"type\":\"filebeat\",\"version\":\"8.12.0\"},\"ecs\":{\"version\":\"8.0.0\"},\"host\":{\"name\":\"millennium-falcon\"},\"input\":{\"type\":\"filestream\"},\"int\":\"not a number\",\"log\":{\"file\":{\"device_id\":\"34\",\"inode\":\"144884\",\"path\":\"/tmp/flog.log\"},\"offset\":101},\"message\":\"index failure\",\"string\":10}, Private:(*input_logfile.updateOp)(0xc0006e5e90), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"document_parsing_exception\",\"reason\":\"[1:437] failed to parse field [int] of type [long] in document with id 'CdYVHIwBAsOaLnJkwT8o'. Preview of field's value: 'not a number'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"For input string: \\\"not a number\\\"\"}}, dropping event!","service.name":"filebeat","ecs.version":"1.6.0"}

The first line is at level warn with no event data; the second is at level debug with the raw event and the response sent by Elasticsearch.

The failure refers to the third line of the example log file used in the test steps below:

{"message":"index failure","int":"not a number","string":10}

Reformatted to be more human-readable, the second log entry is:

{
  "log.level": "debug",
  "@timestamp": "2023-11-29T18:17:16.977+0100",
  "log.logger": "elasticsearch",
  "log.origin": {
    "file.name": "elasticsearch/client.go",
    "file.line": 451
  },
  "message": "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2023, time.November, 29, 18, 17, 6, 949459738, time.Local), Meta:null, Fields:{\"agent\":{\"ephemeral_id\":\"af12ba86-b25c-4831-9bfd-ea3115507bd7\",\"id\":\"9e4935c9-fa26-465c-b18c-53fd5b4c70c2\",\"name\":\"millennium-falcon\",\"type\":\"filebeat\",\"version\":\"8.12.0\"},\"ecs\":{\"version\":\"8.0.0\"},\"host\":{\"name\":\"millennium-falcon\"},\"input\":{\"type\":\"filestream\"},\"int\":\"not a number\",\"log\":{\"file\":{\"device_id\":\"34\",\"inode\":\"144884\",\"path\":\"/tmp/flog.log\"},\"offset\":101},\"message\":\"index failure\",\"string\":10}, Private:(*input_logfile.updateOp)(0xc0006e5e90), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"document_parsing_exception\",\"reason\":\"[1:437] failed to parse field [int] of type [long] in document with id 'CdYVHIwBAsOaLnJkwT8o'. Preview of field's value: 'not a number'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"For input string: \\\"not a number\\\"\"}}, dropping event!",
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}

Looking at the code that generates the debug line, we're interested in the msg argument.

client.log.Debugf("Cannot index event %#v (status=%v): %s, dropping event!", data[i], status, msg)

msg happens to be in JSON format; reformatted to be more human-readable, it is:

{
  "type": "document_parsing_exception",
  "reason": "[1:437] failed to parse field [int] of type [long] in document with id 'CdYVHIwBAsOaLnJkwT8o'. Preview of field's value: 'not a number'",
  "caused_by": {
    "type": "illegal_argument_exception",
    "reason": "For input string: \"not a number\""
  }
}

This contains the whole value of the field that caused the indexing failure.

Checklist

  • My code follows the style guidelines of this project
    - [ ] I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding changes to the default configuration files
    - [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.


How to test this PR locally

Start Filebeat with the following configuration:

filebeat.inputs:
  - type: filestream
    id: filestream-input-id
    enabled: true
    parsers:
      - ndjson:
          target: ""
          overwrite_keys: true
          expand_keys: true
          add_error_key: true
          ignore_decoding_error: false
    paths:
      - /tmp/flog.log

output:
  elasticsearch:
    hosts:
      - localhost:9200
    protocol: https
    username: elastic
    password: changeme
    allow_older_versions: true
    ssl.verification_mode: none

logging:
  level: debug

Create the log file /tmp/flog.log with the following content:

{"message":"foo bar","int":10,"string":"str"}
{"message":"another message","int":20,"string":"str2"}
{"message":"index failure","int":"not a number","string":10}
{"message":"second index failure","int":"not a number","string":10}
A broken JSON

Raw events should only be logged at debug level.
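As a quick sanity check on the resulting logs, the NDJSON lines can be scanned to confirm that event data only ever appears at debug level. A minimal sketch (a test aid, not part of the PR; the two sample lines are abbreviated versions of the logs shown above):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// eventOnlyAtDebug reports whether every log line whose message embeds
// the given event marker is at debug level. Lines that are not valid
// JSON are skipped.
func eventOnlyAtDebug(logLines []string, eventMarker string) bool {
	for _, raw := range logLines {
		var entry map[string]any
		if err := json.Unmarshal([]byte(raw), &entry); err != nil {
			continue
		}
		msg, _ := entry["message"].(string)
		level, _ := entry["log.level"].(string)
		if strings.Contains(msg, eventMarker) && level != "debug" {
			return false
		}
	}
	return true
}

func main() {
	logs := []string{
		`{"log.level":"warn","message":"Cannot index event (status=400): dropping event!"}`,
		`{"log.level":"debug","message":"Cannot index event {\"int\":\"not a number\"} (status=400): dropping event!"}`,
	}
	fmt.Println(eventOnlyAtDebug(logs, "not a number"))
}
```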


Logs

This is how the new log lines look when an event is dropped:

{"log.level":"warn","@timestamp":"2023-11-29T18:17:16.977+0100","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":450},"message":"Cannot index event (status=400): dropping event!","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-29T18:17:16.977+0100","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":451},"message":"Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2023, time.November, 29, 18, 17, 6, 949459738, time.Local), Meta:null, Fields:{\"agent\":{\"ephemeral_id\":\"af12ba86-b25c-4831-9bfd-ea3115507bd7\",\"id\":\"9e4935c9-fa26-465c-b18c-53fd5b4c70c2\",\"name\":\"millennium-falcon\",\"type\":\"filebeat\",\"version\":\"8.12.0\"},\"ecs\":{\"version\":\"8.0.0\"},\"host\":{\"name\":\"millennium-falcon\"},\"input\":{\"type\":\"filestream\"},\"int\":\"not a number\",\"log\":{\"file\":{\"device_id\":\"34\",\"inode\":\"144884\",\"path\":\"/tmp/flog.log\"},\"offset\":101},\"message\":\"index failure\",\"string\":10}, Private:(*input_logfile.updateOp)(0xc0006e5e90), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"document_parsing_exception\",\"reason\":\"[1:437] failed to parse field [int] of type [long] in document with id 'CdYVHIwBAsOaLnJkwT8o'. Preview of field's value: 'not a number'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"For input string: \\\"not a number\\\"\"}}, dropping event!","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-11-29T18:17:16.977+0100","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":450},"message":"Cannot index event (status=400): dropping event!","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-29T18:17:16.977+0100","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":451},"message":"Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2023, time.November, 29, 18, 17, 6, 949531339, time.Local), Meta:null, Fields:{\"agent\":{\"ephemeral_id\":\"af12ba86-b25c-4831-9bfd-ea3115507bd7\",\"id\":\"9e4935c9-fa26-465c-b18c-53fd5b4c70c2\",\"name\":\"millennium-falcon\",\"type\":\"filebeat\",\"version\":\"8.12.0\"},\"ecs\":{\"version\":\"8.0.0\"},\"host\":{\"name\":\"millennium-falcon\"},\"input\":{\"type\":\"filestream\"},\"int\":\"not a number\",\"log\":{\"file\":{\"device_id\":\"34\",\"inode\":\"144884\",\"path\":\"/tmp/flog.log\"},\"offset\":162},\"message\":\"second index failure\",\"string\":10}, Private:(*input_logfile.updateOp)(0xc000364330), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"document_parsing_exception\",\"reason\":\"[1:432] failed to parse field [int] of type [long] in document with id 'CtYVHIwBAsOaLnJkwT8o'. Preview of field's value: 'not a number'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"For input string: \\\"not a number\\\"\"}}, dropping event!","service.name":"filebeat","ecs.version":"1.6.0"}

@belimawr belimawr added the Team:Elastic-Agent Label for the Agent team label Nov 29, 2023
mergify bot (Contributor) commented Nov 29, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏 To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch, where /d is the digit

The Elasticsearch client was logging raw events in error and warn
level, this commit makes it only log the raw events or the returned
message from Elasticsearch, that might contain some fields from the
event, on debug level.

Log messages containing only the returned status from Elasticsearch
are maintained on their original log level.
@belimawr belimawr force-pushed the log-events-only-on-debug-level branch from 7ae0bb9 to 9b4430c Compare November 29, 2023 12:18
@belimawr belimawr added backport-skip Skip notification from the automated backport with mergify backport-v7.17.0 Automated backport with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Nov 29, 2023
@belimawr belimawr marked this pull request as ready for review November 29, 2023 12:25
@belimawr belimawr requested a review from a team as a code owner November 29, 2023 12:25
Comment on lines 450 to 451
client.log.Warnf("Cannot index event (status=%v): dropping event!", status)
client.log.Debugf("Cannot index event %#v (status=%v): %s, dropping event!", data[i], status, msg)
Member commented:

Are we able to safely log the msg? Does it contain event data if you simulate a mapping conflict (map something as keyword and then write a json object to it or something similar)? I would hope just the field names involved.

Collaborator replied:

If it is just the field names, then I guess we can remove the debug log and only log the msg, right?

Member replied:

Correct, I want us to keep the error message if we can since that is the most valuable part. It usually tells you enough to know exactly what is wrong. We just need to do a quick test to confirm this is safe.

@belimawr (author) replied:

Unfortunately, Elasticsearch returns both the field name and its value. I updated the PR description with detailed reasoning, but here is the TL;DR:

Given the event:

{"message":"index failure","int":"not a number","string":10}

The field int is supposed to be a number (see "How to test this PR locally" above); that causes the mapping failure with the following message:

{
  "type": "document_parsing_exception",
  "reason": "[1:437] failed to parse field [int] of type [long] in document with id 'CdYVHIwBAsOaLnJkwT8o'. Preview of field's value: 'not a number'",
  "caused_by": {
    "type": "illegal_argument_exception",
    "reason": "For input string: \"not a number\""
  }
}

Member commented:

We have seen instances of payload events showing up in the RLC logs of the Synthetics service when there were mapping conflicts (most likely the customer modified mappings on their ES stack). This especially happened for some of the network payload events, which had cookies/auth headers that shouldn't be logged.

Collaborator commented:

@belimawr I agree with what Craig just said above. Let's change the log level for now, but remove the mention of Preview of field's value: 'xxx' in a follow-up PR.

Member commented:

@belimawr I think we should open an issue in the Elasticsearch repo to change this error message to not include this information by default. This would benefit not only Beats, but also Logstash and language clients by providing a more secure-by-default behavior.

@joshdover (Member) commented Nov 30, 2023:

To be clear - we can continue with this PR as written to get a fix in the short term, but I do think losing the error message is going to make debugging issues quite painful. We should have Elasticsearch change this (or add a different field with just the error message without the field value, or an opt-out flag) and then come back and log this information again once we're sure it doesn't contain sensitive information.

Member replied:

Agreed, there is already a follow-up issue in the security repository about having ES make a change here as well.

@belimawr (author) replied:

I added @cmacknz's suggestions. The way it is now:

  1. The original log entries are at debug level
  2. For every entry whose level I changed, there is a shorter one at the original log level stating that more details are available at debug level.

@elasticmachine (Collaborator) commented:

💚 Build Succeeded

Build stats

  • Duration: 130 min 7 sec

❕ Flaky test report

No test was executed to be analysed.

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.
  • /package : Generate the packages and run the E2E tests.
  • /beats-tester : Run the installation tests with beats-tester.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

CHANGELOG.next.asciidoc: review thread resolved.
belimawr and others added 2 commits November 30, 2023 18:36
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
@belimawr belimawr enabled auto-merge (squash) November 30, 2023 17:39
@elasticmachine (Collaborator) commented:

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

Build stats

  • Duration: 6 min 41 sec

@cmacknz cmacknz added the backport-v8.11.0 Automated backport with mergify label Nov 30, 2023
@elasticmachine (Collaborator) commented:

💚 Build Succeeded

Build stats

  • Start Time: 2023-11-30T17:38:13.817+0000
  • Duration: 132 min 55 sec

Test stats 🧪

  • Failed: 0
  • Passed: 28694
  • Skipped: 2015
  • Total: 30709

💚 Flaky test report

Tests succeeded.

@belimawr belimawr merged commit ac7309a into elastic:main Nov 30, 2023
88 checks passed
mergify bot pushed a commit that referenced this pull request Nov 30, 2023
The Elasticsearch client was logging raw events in error and warn level, this commit makes it only log the raw events (or any event data) at debug level. This means the error message returned by Elasticsearch is now only available at debug level because it can contain the whole value of a field causing a mapping conflict.

---------

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
(cherry picked from commit ac7309a)
mergify bot pushed a commit that referenced this pull request Nov 30, 2023
(cherry picked from commit ac7309a)
cmacknz pushed a commit that referenced this pull request Nov 30, 2023
(cherry picked from commit ac7309a)

Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>
belimawr pushed a commit that referenced this pull request Dec 1, 2023
(cherry picked from commit ac7309a)
@belimawr belimawr mentioned this pull request Dec 4, 2023
@neiljbrookes commented:

One question and one request!

Request:
Can the status code returned from Elasticsearch be added to an http.response.status_code field? Being able to filter on this would be useful. A 429 and a 400 can mean different things in terms of the overall status of the pipeline.

Question:
Does this change mean that, in the case of a mapping conflict, the only warn-level logging will be:

"message":"Cannot index event (status=400): dropping event!"

This isn't useful (sorry!). There is plenty of useful information about this error in the response from Elasticsearch which (as an operator of elastic-agent/beats) I would like to see. Needing to enable debug-level logs for every 400 we may see isn't a great experience.
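For illustration, the requested field could be emitted as a structured, filterable key alongside the message. A minimal sketch assuming an ECS-style http.response.status_code field and a hand-rolled JSON entry (this is not the actual Beats logger API):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// warnEntry builds a warn-level log entry that carries the HTTP status
// as a structured field instead of burying it inside the message text,
// so operators can filter on it (e.g. 429 vs 400). The field name
// follows ECS; the entry layout here is hypothetical.
func warnEntry(status int) string {
	entry := map[string]any{
		"log.level":                 "warn",
		"message":                   "Cannot index event: dropping event!",
		"http.response.status_code": status,
	}
	b, _ := json.Marshal(entry)
	return string(b)
}

func main() {
	fmt.Println(warnEntry(429))
}
```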

@cmacknz (Member) commented Dec 7, 2023

This isn't useful (sorry !). There is plenty of useful information about this error in the response from elasticsearch which (as an operator of elastic-agent/beats) I would like to see. Needing to enable debug level logs for every 400 we may see isn't a great experience.

Agreed we need a better solution for this.

The fundamental problem is that we don't want secrets or PII ending up in logs that are shipped to remote clusters that do not want to store this information, but the Elasticsearch errors themselves can contain field values. What we want is a sanitized version of reason that has just the field names and types. AFAIK there is also no strict schema for the content of reason that would make parsing it simpler.

Example:

{
  "type": "document_parsing_exception",
  "reason": "[1:437] failed to parse field [int] of type [long] in document with id 'CdYVHIwBAsOaLnJkwT8o'. Preview of field's value: 'not a number'",
  "caused_by": {
    "type": "illegal_argument_exception",
    "reason": "For input string: \"not a number\""
  }
}
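A sanitized reason like the one described could, for illustration, be produced by stripping the value preview before logging; a sketch that assumes the current wording of the Elasticsearch error message (this is not implemented in this PR or anywhere else):

```go
package main

import (
	"fmt"
	"regexp"
)

// previewRe matches the "Preview of field's value: '...'" fragment
// of the Elasticsearch reason. The pattern assumes the current,
// unversioned wording of that message and a value without embedded
// single quotes.
var previewRe = regexp.MustCompile(`\s*Preview of field's value: '[^']*'`)

// sanitizeReason removes the field-value preview, leaving only the
// field name, type, and document id, which are safer to log.
func sanitizeReason(reason string) string {
	return previewRe.ReplaceAllString(reason, "")
}

func main() {
	reason := "[1:437] failed to parse field [int] of type [long] in document with id 'CdYVHIwBAsOaLnJkwT8o'. Preview of field's value: 'not a number'"
	fmt.Println(sanitizeReason(reason))
}
```

Note that the nested caused_by.reason would need the same treatment, since it can also repeat the raw input string.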

Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
Labels
backport-v7.17.0 Automated backport with mergify backport-v8.11.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team

9 participants