
[Filebeat] Add postgresql csv fileset #23334

Merged · 31 commits · Feb 17, 2021

Conversation

@azlev (Contributor) commented Jan 4, 2021

As a configuration option, create another pipeline inside the module to parse CSV log files.

Although it is necessary to configure the database to emit such logs, there are some advantages: configured properly, events like statement timeout and lock timeout will display the query in the same event, as opposed to multiple lines, one with the error message and another with the statement.

What does this PR do?

This PR creates a new fileset/pipeline inside the PostgreSQL module. The new fileset parses CSV log files from PostgreSQL and creates events from them.
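For illustration, enabling such a fileset from the Filebeat side could look roughly like this in modules.d/postgresql.yml. This is a hedged sketch: the fileset name and the path follow this PR's original separate-fileset approach, and the paths are illustrative; the merged result ultimately folds CSV support into the existing log fileset instead.

# modules.d/postgresql.yml — hypothetical sketch, paths are illustrative
- module: postgresql
  csv:
    enabled: true
    # point the input at the server's CSV log files
    var.paths: ["/var/lib/postgresql/data/log/*.csv"]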

Why is it important?

PostgreSQL plain-text logs weren't designed to be parsed, so without very complicated logic it's not possible to encapsulate all the information in a single event. One example of that is a statement timeout:

2021-01-04 02:07:15.470 UTC [13470] [user=postgres,db=postgres,app=[unknown]] ERROR:  canceling statement due to statement timeout
2021-01-04 02:07:15.470 UTC [13470] [user=postgres,db=postgres,app=[unknown]] STATEMENT:  SELECT pg_sleep(100)

Using the CSV log, those 2 lines become 1:

2021-01-04 02:07:15.470 UTC,"postgres","postgres",13470,"127.21.1.7:28182",5ff27834.349e,1,"SELECT",2021-01-04 02:06:44 UTC,16/4181895,0,ERROR,57014,"canceling statement due to statement timeout",,,,,,"SELECT pg_sleep(100)",,,""
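For context, the server only produces lines like this when CSV logging is enabled. A minimal postgresql.conf sketch (values are illustrative and not part of this PR):

# postgresql.conf — illustrative sketch for enabling CSV logs
logging_collector = on        # required: the collector process writes the server's log files
log_destination = 'csvlog'    # emit CSV logs ('stderr,csvlog' keeps both formats)
log_directory = 'log'         # relative to the data directory
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'   # CSV output is written with a .csv suffix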

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Documentation

@botelastic bot added the needs_team label on Jan 4, 2021
@elasticmachine (Collaborator) commented Jan 4, 2021

💚 Build Succeeded

Build stats

  • Build Cause: Pull request #23334 updated

  • Start Time: 2021-02-17T12:54:54.177+0000

  • Duration: 49 min 46 sec

  • Commit: a518be6

Test stats 🧪

  • Failed: 0
  • Passed: 13056
  • Skipped: 2215
  • Total: 15271


💚 Flaky test report

Tests succeeded.


@andresrc added the Team:Integrations label on Jan 4, 2021
@elasticmachine (Collaborator) commented:

Pinging @elastic/integrations (Team:Integrations)

@botelastic bot removed the needs_team label on Jan 4, 2021
@jsoriano added the review label on Jan 9, 2021
@jsoriano (Member) left a comment:

Thanks for this contribution, very interesting, and quite complete!

Three review comments on filebeat/module/postgresql/csv/_meta/fields.yml (outdated, resolved)
@jsoriano (Member) commented Jan 9, 2021

jenkins run the tests please

@jsoriano self-assigned this on Jan 9, 2021
Do not generate duplicate field groups.

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
@jsoriano (Member) commented Jan 9, 2021

jenkins run the tests please

@jsoriano (Member) commented:

By the way, there is another approach to support multiple formats. For example the elasticsearch module contains pipelines for plain and JSON logs, and it chooses automatically which one to use.

Multiple pipelines can be included like this in the fileset manifest:

ingest_pipeline:
- ingest/pipeline.yml
- ingest/pipeline-plaintext.yml
- ingest/pipeline-json.yml

The first pipeline is executed when an event is received, in this case pipeline.yml. It contains some common processing for all formats, but at a specific point it chooses to run one of the other pipelines depending on the format of the message:

- pipeline:
    if: ctx.first_char != '{'
    name: '{< IngestPipeline "pipeline-plaintext" >}'
- pipeline:
    if: ctx.first_char == '{'
    name: '{< IngestPipeline "pipeline-json" >}'

This approach is also used in other modules like kibana, logstash, coredns, envoyproxy... you can look for examples that use the IngestPipeline helper in the pipeline.

If we can do something similar for these postgresql logs, we could add a CSV-specific pipeline to the existing log metricset, and we could more easily reuse the common parts of the pipelines, as well as the common fields. And it would be easier for users, as they wouldn't need to configure a different fileset depending on the format.

@azlev what do you think? would it be possible to do it this way in this case?

Commit cacde06 changed field names.

I committed on GitHub without paying attention to the tests. This commit fixes
all tests and the pipeline.
@azlev (Contributor, Author) commented Jan 16, 2021

(quoting @jsoriano's suggestion above about selecting the pipeline automatically within a single fileset)

I tried a minimal approach to avoid breaking changes, and I assumed that it would be easier to merge. So now I'll try to write a more integrated implementation. I'll give it a try, but those changes will take a while to implement.

* Adding postgresql.log.query_name in grok

The query_step was ignored because it's a subset of command_tag

* Separate message and query fields

Message is the main message, and query is what the message is referencing
@azlev (Contributor, Author) commented Jan 17, 2021

@jsoriano I found 2 ways to choose the ingest pipeline:

  • the file extension (PostgreSQL always uses .csv for those files)
  • a regular expression like this: ctx =~ /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(.\d{1,3})( \w{3,4}),/

I couldn't find any way to get the file name, and this would break the tests. The regular expression doesn't look like a good idea in a critical path like this, and it seems fragile.

Do you have other ideas?

@jsoriano (Member) commented:

(quoting @azlev's two options above)

@azlev I agree that the simpler we can do the selection, the better, to avoid affecting the critical path.

Checking the file extension could be a good option. The file path should be available in log.file.path. You don't see it in tests because it is removed to avoid issues validating the expected events, but at processing time the path should be there.
The problem I see with this approach is that not all inputs collect from files, e.g. the postgresql module could be used with logs collected with syslog, or from s3 buckets. So even if this is more straightforward, it may not work in all cases.

Ideally, the decision of using one or the other pipeline should be based on something available in the message itself, as you are doing with the regular expression.

Another possible approach would be to add a format setting to the fileset; this value could be set on the events with add_fields, so pipelines can decide what to do. But it would be better if the module could do the parsing automatically without depending on user configuration.

Another approach you could try is to parse with grok in the current pipeline and, if grok fails, fall back to the CSV pipeline with on_failure. This can also affect the performance of the pipeline, but maybe you could give it a try if the other options are not good enough.

If we are not convinced by any of these approaches I am ok with going back to the approach of using two different filesets.
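For illustration, the file-extension option discussed here could look roughly like the following conditional pipeline processors (a sketch only, relying on the log.file.path field mentioned above; the PR ultimately dispatches on the character that follows the timestamp instead):

- pipeline:
    # hypothetical: route CSV files based on the collected file's extension
    if: ctx?.log?.file?.path != null && ctx.log.file.path.endsWith('.csv')
    name: '{< IngestPipeline "pipeline-csv" >}'
- pipeline:
    if: ctx?.log?.file?.path == null || !ctx.log.file.path.endsWith('.csv')
    name: '{< IngestPipeline "pipeline-log" >}'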

There is no `core_id` in PostgreSQL logs. Deprecate the field and move its
contents to `session_line_number`.
This commit opens the path to adding 2 pipelines:

- The default one, using plain-text log files
- The CSV pipeline

Since both file formats represent the same information, it makes sense to have only one fileset.
Create a new grok pattern that matches only the timestamp and the next character, the separator.

Based on the separator, choose the right pipeline.
Both of them represent the same information, so use the same pipeline to do the job.
@azlev (Contributor, Author) commented Feb 14, 2021

I've split the grok pipeline into two parts: the first one parses the timestamp plus one character. That character is used to choose the correct pipeline: log or csv.

In the process, I deprecated the core_id field since I couldn't find it in the PostgreSQL documentation.

There are still some fields that I need to work on; the two I'm currently thinking about are error.code and event.type.
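A rough sketch of that dispatch step as ingest processors (field names and the exact grok pattern are illustrative; the merged pipeline.yml in this PR is the authoritative version). The idea is that after the timezone, CSV lines continue with a comma while plain-text lines continue with a space:

- grok:
    field: message
    patterns:
      # capture the timestamp, the timezone, and the single character that follows
      - '^%{TIMESTAMP_ISO8601:postgresql.log.timestamp} %{WORD:event.timezone}(?<separator>.)'
- pipeline:
    if: ctx.separator == ','
    name: '{< IngestPipeline "pipeline-csv" >}'
- pipeline:
    if: ctx.separator != ','
    name: '{< IngestPipeline "pipeline-log" >}'
- remove:
    field: separator
    ignore_missing: true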

- remove unused `thread_id` field
- issue a `make update`
Drop `postgresql.log.error_severity` in favor of `log.level`
Based on PostgreSQL error codes, everything that does not start with `00`,
`01` or `02` is an error.
@azlev (Contributor, Author) left a comment:

@jsoriano I have a question here: https://github.com/elastic/beats/blob/master/filebeat/module/postgresql/log/ingest/pipeline.yml#L40

In the current code, if there is an error, the final event will have: "event.type": ["info", "error"]. Is that right?

Regenerate files to have a consistent field order.
@jsoriano (Member) commented:

jenkins run the tests please

@jsoriano added the needs_backport label on Feb 16, 2021
@jsoriano (Member) left a comment:

@azlev I think this is almost ready to merge, I have added some small questions/suggestions, let me know what you think.

Review comments on filebeat/module/postgresql/log/_meta/fields.yml and filebeat/module/postgresql/log/ingest/pipeline-csv.yml (outdated, resolved)
GREEDYDATA: |-
  (.|
  | )*
POSTGRESQL_QUERY_STEP: '(parse|bind|statement|fastpath function call|execute|execute fetch from)'
@jsoriano (Member) commented:

Query step is captured in postgresql.log.query_step in the plain-text pipeline; should it also be captured in this pipeline?

@azlev (Contributor, Author) replied:

I was in doubt about that one. query_step overlaps with command_tag, but not 100%, so I opted to leave it empty since I was creating a new fileset at first. Now that I've merged them into one fileset, I can review this. I'll populate it, at least for some backward compatibility.

@@ -41,15 +47,22 @@ processors:
    field: event.type
    value:
      - info
    if: "ctx?.postgresql?.log?.sql_state_code == null || (ctx.postgresql.log.sql_state_code ==~ /^0[012].*/)"
- append:
@jsoriano (Member) commented:

The set processor could be used to add event.type instead of append, so it is not set as an array in the document.
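A minimal sketch of that suggestion, reusing the condition from the diff above (the exact processors in the merged pipeline may differ):

- set:
    # single keyword value instead of appending to an array
    field: event.type
    value: info
    if: "ctx?.postgresql?.log?.sql_state_code == null || (ctx.postgresql.log.sql_state_code ==~ /^0[012].*/)"
- set:
    field: event.type
    value: error
    if: "ctx?.postgresql?.log?.sql_state_code != null && !(ctx.postgresql.log.sql_state_code ==~ /^0[012].*/)"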

@jsoriano (Member) commented:

jenkins run the tests again please

azlev and others added 2 commits February 16, 2021 07:26
Change part of the description to a comment

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
- Run update since the last `fields.yml` change was made only on GitHub
- Use `set` instead of `append` in the pipeline so as not to create an array
- Capture `postgresql.log.query_step` in the CSV pipeline
@jsoriano (Member) left a comment:

LGTM, let's wait for a green build. Thanks @azlev!

@jsoriano (Member) commented:

jenkins run the tests

@jsoriano (Member) commented:

/test

Now PostgreSQL has 3 files in the ingest directory:

- pipeline.yml
- pipeline-log.yml
- pipeline-csv.yml

Change the count to 3 to reflect that.
postresql.log.sql_state_code -> `postgresql.log.sql_state_code`.
@jsoriano self-requested a review on February 17, 2021 10:34
@jsoriano (Member) left a comment:

Looks good. Only a couple of small details, sorry I didn't see them before.

Review comment on filebeat/module/postgresql/log/ingest/pipeline-csv.yml (outdated, resolved)
  description: >
    A possible solution to solve an error.
- name: internal_query
  description: internal query that led to the error (if any).
@jsoriano (Member) commented:

Please uppercase the first letter of sentences in these descriptions.

Suggested change:

- description: internal query that led to the error (if any).
+ description: >
+   Internal query that led to the error (if any).

@@ -34,7 +34,7 @@ func TestGetPipelinePath(t *testing.T) {
    },
    {
      pipelinePath: "../../module/postgresql/log/ingest",
-     count: 1,
+     count: 3,
@jsoriano (Member) commented:

🙂

@azlev (Contributor, Author) commented Feb 17, 2021

👀

@jsoriano (Member) left a comment:

Great 👍

@jsoriano merged commit bb78931 into elastic:master on Feb 17, 2021
@jsoriano added the v7.12.0 label and removed the needs_backport label on Feb 17, 2021
jsoriano pushed a commit to jsoriano/beats that referenced this pull request Feb 17, 2021
Refactor PostgreSQL module to support logs in CSV format.

The fileset now has three pipelines: the first one is executed to parse the initial part of each line until it decides whether the logs are plain text or CSV; once decided, it invokes one of the other two pipelines, one specific to plain-text logs and the other to CSV.

Several test cases and documentation are added for CSV
support.

Additional fields are available now, and some others have
been renamed to represent more accurately the values
they store.

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
(cherry picked from commit bb78931)
jsoriano added a commit that referenced this pull request Feb 17, 2021
Refactor PostgreSQL module to support logs in CSV format (same commit description as above).

(cherry picked from commit bb78931)

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
Co-authored-by: José Arthur Benetasso Villanova <jose.arthur@gmail.com>
@jsoriano (Member) commented:

Merged and backported to 7.x, this will hopefully be available in 7.12, thanks @azlev!

v1v added a commit to v1v/beats that referenced this pull request Feb 17, 2021
…dows-7

* upstream/master: (332 commits)
  Use ECS v1.8.0 (elastic#24086)
  Add support for postgresql csv logs (elastic#23334)
  [Heartbeat] Refactor config system (elastic#23467)
  [CI] install docker-compose with retry (elastic#24069)
  Add nodes to filebeat-kubernetes.yaml ClusterRole - fixes elastic#24051 (elastic#24052)
  updating manifest files for filebeat threatintel module (elastic#24074)
  Add Zeek Signatures (elastic#23772)
  Update Beats to ECS 1.8.0 (elastic#23465)
  Support running Docker logging plugin on ARM64 (elastic#24034)
  Fix ec2 metricset fields.yml and add integration test (elastic#23726)
  Only build targz and zip versions of Beats if PACKAGES is set in agent (elastic#24060)
  [Filebeat] Add field definitions for known Netflow/IPFIX vendor fields (elastic#23773)
  [Elastic Agent] Enroll with Fleet Server (elastic#23865)
  [Filebeat] Convert logstash logEvent.action objects to strings (elastic#23944)
  [Ingest Management] Fix reloading of log level for services (elastic#24055)
  Add Agent standalone k8s manifest (elastic#23679)
  [Metricbeat][Kubernetes] Extend state_node with more conditions (elastic#23905)
  [CI] googleStorageUploadExt step (elastic#24048)
  Check fields are documented for aws metricsets (elastic#23887)
  Update go-concert to 0.1.0 (elastic#23770)
  ...
Labels: review, Team:Integrations, v7.12.0
Projects: none
Linked issues: none
4 participants