
[Filebeat] Add postgresql csv fileset #23334

Merged · 31 commits · Feb 17, 2021

Conversation

@azlev (Contributor) commented Jan 4, 2021

As a configuration option, create another pipeline inside the module to parse CSV log files.

Although it is necessary to configure the database to emit such logs, there are some advantages: configured properly, events like statement timeout and lock timeout will display the query in the same event, as opposed to multiple lines, one with the error message and another with the statement.

What does this PR do?

This PR creates a new fileset/pipeline inside the PostgreSQL module. The new fileset parses CSV log files from PostgreSQL and creates events from them.
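For illustration, enabling such a fileset from the Filebeat side could look roughly like this in modules.d/postgresql.yml. This is a hedged sketch: the fileset name and the path follow this PR's original separate-fileset approach, and the paths are illustrative; the merged result ultimately folds CSV support into the existing log fileset instead.

# modules.d/postgresql.yml — hypothetical sketch, paths are illustrative
- module: postgresql
  csv:
    enabled: true
    # point the input at the server's CSV log files
    var.paths: ["/var/lib/postgresql/data/log/*.csv"]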

Why is it important?

PostgreSQL plain-text logs weren't designed to be parsed, so without very complicated logic it's not possible to encapsulate all the information in a single event. One example of that is a statement timeout:

2021-01-04 02:07:15.470 UTC [13470] [user=postgres,db=postgres,app=[unknown]] ERROR:  canceling statement due to statement timeout
2021-01-04 02:07:15.470 UTC [13470] [user=postgres,db=postgres,app=[unknown]] STATEMENT:  SELECT pg_sleep(100)

Using the CSV log, those 2 lines become 1:

2021-01-04 02:07:15.470 UTC,"postgres","postgres",13470,"127.21.1.7:28182",5ff27834.349e,1,"SELECT",2021-01-04 02:06:44 UTC,16/4181895,0,ERROR,57014,"canceling statement due to statement timeout",,,,,,"SELECT pg_sleep(100)",,,""
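For context, the server only produces lines like this when CSV logging is enabled. A minimal postgresql.conf sketch (values are illustrative and not part of this PR):

# postgresql.conf — illustrative sketch for enabling CSV logs
logging_collector = on        # required: the collector process writes the server's log files
log_destination = 'csvlog'    # emit CSV logs ('stderr,csvlog' keeps both formats)
log_directory = 'log'         # relative to the data directory
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'   # CSV output is written with a .csv suffix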

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Documentation

@botelastic bot added the needs_team label on Jan 4, 2021
@elasticmachine (Collaborator) commented Jan 4, 2021

💚 Build Succeeded

Build stats

  • Build Cause: Pull request #23334 updated

  • Start Time: 2021-02-17T12:54:54.177+0000

  • Duration: 49 min 46 sec

  • Commit: a518be6

Test stats 🧪

  • Failed: 0
  • Passed: 13056
  • Skipped: 2215
  • Total: 15271


💚 Flaky test report

Tests succeeded.


@andresrc added the Team:Integrations label on Jan 4, 2021
@elasticmachine (Collaborator) commented:

Pinging @elastic/integrations (Team:Integrations)

@botelastic bot removed the needs_team label on Jan 4, 2021
@jsoriano added the review label on Jan 9, 2021
@jsoriano (Member) left a comment:

Thanks for this contribution, very interesting, and quite complete!

Three review comments on filebeat/module/postgresql/csv/_meta/fields.yml (outdated, resolved)
@jsoriano (Member) commented Jan 9, 2021

jenkins run the tests please

@jsoriano self-assigned this on Jan 9, 2021
Do not generate duplicate field groups.

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
@jsoriano (Member) commented Jan 9, 2021

jenkins run the tests please

@jsoriano (Member) commented:

By the way, there is another approach to support multiple formats. For example the elasticsearch module contains pipelines for plain and JSON logs, and it chooses automatically which one to use.

Multiple pipelines can be included like this in the fileset manifest:

ingest_pipeline:
- ingest/pipeline.yml
- ingest/pipeline-plaintext.yml
- ingest/pipeline-json.yml

The first pipeline is executed when an event is received, in this case pipeline.yml. It contains some common processing for all formats, but at a specific point it chooses to run one of the other pipelines depending on the format of the message:

- pipeline:
    if: ctx.first_char != '{'
    name: '{< IngestPipeline "pipeline-plaintext" >}'
- pipeline:
    if: ctx.first_char == '{'
    name: '{< IngestPipeline "pipeline-json" >}'

This approach is also used in other modules like kibana, logstash, coredns, envoyproxy... you can look for examples that use the IngestPipeline helper in the pipeline.

If we can do something similar for these postgresql logs, we could add a CSV-specific pipeline to the existing log metricset, and we could more easily reuse the common parts of the pipelines, as well as the common fields. And it would be easier for users, as they wouldn't need to configure a different fileset depending on the format.

@azlev what do you think? would it be possible to do it this way in this case?

Commit cacde06 changed field names.

I committed on GitHub without paying attention to the tests. This commit fixes
all tests and the pipeline.
@azlev (Contributor, Author) commented Jan 16, 2021

(quoting @jsoriano's suggestion above about selecting the pipeline automatically within a single fileset)

I tried a minimal approach to avoid breaking changes, and I assumed that it would be easier to merge. So now I'll try to write a more integrated implementation. I'll give it a try, but those changes will take a while to implement.

* Adding postgresql.log.query_name in grok

The query_step was ignored because it's a subset of command_tag

* Separate message and query fields

Message is the main message, and query is what the message is referencing
@azlev (Contributor, Author) commented Jan 17, 2021

@jsoriano I found 2 ways to choose the ingest pipeline:

  • the file extension (PostgreSQL always uses .csv for those files)
  • a regular expression like this: ctx =~ /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(.\d{1,3})( \w{3,4}),/

I couldn't find any way to get the file name, and this would break the tests. The regular expression doesn't look like a good idea in a critical path like this, and it seems fragile.

Do you have other ideas?

@jsoriano (Member) commented:

(quoting @azlev's two options above)

@azlev I agree that the simpler we can do the selection, the better, to avoid affecting the critical path.

Checking the file extension could be a good option. The file path should be available in log.file.path. You don't see it in tests because it is removed to avoid issues validating the expected events, but at processing time the path should be there.
The problem I see with this approach is that not all inputs collect from files, e.g. the postgresql module could be used with logs collected with syslog, or from s3 buckets. So even if this is more straightforward, it may not work in all cases.

Ideally, the decision of using one or the other pipeline should be based on something available in the message itself, as you are doing with the regular expression.

Another possible approach would be to add a format setting to the fileset; this value could be set on the events with add_fields, so pipelines can decide what to do. But it would be better if the module could do the parsing automatically without depending on user configuration.

Another approach you could try is to parse with grok in the current pipeline and, if grok fails, fall back to the CSV pipeline with on_failure. This can also affect the performance of the pipeline, but maybe you could give it a try if the other options are not good enough.

If we are not convinced by any of these approaches I am ok with going back to the approach of using two different filesets.
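For illustration, the file-extension option discussed here could look roughly like the following conditional pipeline processors (a sketch only, relying on the log.file.path field mentioned above; the PR ultimately dispatches on the character that follows the timestamp instead):

- pipeline:
    # hypothetical: route CSV files based on the collected file's extension
    if: ctx?.log?.file?.path != null && ctx.log.file.path.endsWith('.csv')
    name: '{< IngestPipeline "pipeline-csv" >}'
- pipeline:
    if: ctx?.log?.file?.path == null || !ctx.log.file.path.endsWith('.csv')
    name: '{< IngestPipeline "pipeline-log" >}'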

There is no `core_id` in PostgreSQL logs. Deprecate the field and move its
contents to `session_line_number`.
This commit opens the path to adding 2 pipelines:

- The default one, using plain-text log files
- The CSV pipeline

Since both file formats represent the same information, it makes sense to have only one fileset.
Create a new grok pattern that matches only the timestamp and the next character, the separator.

Based on the separator, choose the right pipeline.
Both of them represent the same information, so use the same pipeline to do the job.
@azlev (Contributor, Author) commented Feb 14, 2021

I've split the grok pipeline into two parts: the first one parses the timestamp plus one character. That character is used to choose the correct pipeline: log or csv.

In the process, I deprecated the core_id field since I couldn't find it in the PostgreSQL documentation.

There are still some fields that I need to work on; the two I'm currently thinking about are error.code and event.type.
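A rough sketch of that dispatch step as ingest processors (field names and the exact grok pattern are illustrative; the merged pipeline.yml in this PR is the authoritative version). The idea is that after the timezone, CSV lines continue with a comma while plain-text lines continue with a space:

- grok:
    field: message
    patterns:
      # capture the timestamp, the timezone, and the single character that follows
      - '^%{TIMESTAMP_ISO8601:postgresql.log.timestamp} %{WORD:event.timezone}(?<separator>.)'
- pipeline:
    if: ctx.separator == ','
    name: '{< IngestPipeline "pipeline-csv" >}'
- pipeline:
    if: ctx.separator != ','
    name: '{< IngestPipeline "pipeline-log" >}'
- remove:
    field: separator
    ignore_missing: true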

- remove unused `thread_id` field
- issue a `make update`
Drop `postgresql.log.error_severity` in favor of `log.level`
Based on PostgreSQL error codes, everything that does not start with `00`,
`01` or `02` is an error.
@azlev (Contributor, Author) left a comment:

@jsoriano I have a question here: https://github.com/elastic/beats/blob/master/filebeat/module/postgresql/log/ingest/pipeline.yml#L40

In the current code, if there is an error, the final event will have: "event.type": ["info", "error"]. Is that right?

Regenerate files to have a consistent field order.
@jsoriano (Member) commented:

jenkins run the tests please

@jsoriano added the needs_backport label on Feb 16, 2021
@jsoriano (Member) left a comment:

@azlev I think this is almost ready to merge, I have added some small questions/suggestions, let me know what you think.

Review comments on filebeat/module/postgresql/log/_meta/fields.yml and filebeat/module/postgresql/log/ingest/pipeline-csv.yml (outdated, resolved)
GREEDYDATA: |-
  (.|
  | )*
POSTGRESQL_QUERY_STEP: '(parse|bind|statement|fastpath function call|execute|execute fetch from)'
@jsoriano (Member) commented:

Query step is captured in postgresql.log.query_step in the plain-text pipeline; should it also be captured in this pipeline?

@azlev (Contributor, Author) replied:

I was in doubt about that one. query_step overlaps with command_tag, but not 100%, so I opted to leave it empty since I was creating a new fileset at first. Now that I've merged them into one fileset, I can review this. I'll populate it, at least for some backward compatibility.

@@ -41,15 +47,22 @@ processors:
    field: event.type
    value:
      - info
    if: "ctx?.postgresql?.log?.sql_state_code == null || (ctx.postgresql.log.sql_state_code ==~ /^0[012].*/)"
- append:
@jsoriano (Member) commented:

The set processor could be used to add event.type instead of append, so it is not set as an array in the document.
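A minimal sketch of that suggestion, reusing the condition from the diff above (the exact processors in the merged pipeline may differ):

- set:
    # single keyword value instead of appending to an array
    field: event.type
    value: info
    if: "ctx?.postgresql?.log?.sql_state_code == null || (ctx.postgresql.log.sql_state_code ==~ /^0[012].*/)"
- set:
    field: event.type
    value: error
    if: "ctx?.postgresql?.log?.sql_state_code != null && !(ctx.postgresql.log.sql_state_code ==~ /^0[012].*/)"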

@jsoriano (Member) commented:

jenkins run the tests again please

azlev and others added 2 commits February 16, 2021 07:26
Change part of the description to a comment

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
- Run update since the last `fields.yml` change was made only on GitHub
- Use `set` instead of `append` in the pipeline so as not to create an array
- Capture `postgresql.log.query_step` in the CSV pipeline
@jsoriano (Member) left a comment:

LGTM, let's wait for a green build. Thanks @azlev!

@jsoriano (Member) commented:

jenkins run the tests

@jsoriano (Member) commented:

/test

Now PostgreSQL has 3 files in the ingest directory:

- pipeline.yml
- pipeline-log.yml
- pipeline-csv.yml

Change the count to 3 to reflect that.
postresql.log.sql_state_code -> `postgresql.log.sql_state_code`.
@jsoriano self-requested a review on February 17, 2021 10:34
@jsoriano (Member) left a comment:

Looks good. Only a couple of small details, sorry I didn't see them before.

Review comment on filebeat/module/postgresql/log/ingest/pipeline-csv.yml (outdated, resolved)
  description: >
    A possible solution to solve an error.
- name: internal_query
  description: internal query that led to the error (if any).
@jsoriano (Member) commented:

Please uppercase the first letter of sentences in these descriptions.

Suggested change:

- description: internal query that led to the error (if any).
+ description: >
+   Internal query that led to the error (if any).

@@ -34,7 +34,7 @@ func TestGetPipelinePath(t *testing.T) {
    },
    {
      pipelinePath: "../../module/postgresql/log/ingest",
-     count: 1,
+     count: 3,
@jsoriano (Member) commented:

🙂

@azlev (Contributor, Author) commented Feb 17, 2021

👀

@jsoriano (Member) left a comment:

Great 👍

@jsoriano merged commit bb78931 into elastic:master on Feb 17, 2021
@jsoriano added the v7.12.0 label and removed the needs_backport label on Feb 17, 2021
jsoriano pushed a commit to jsoriano/beats that referenced this pull request Feb 17, 2021
Refactor PostgreSQL module to support logs in CSV format.

The fileset now has three pipelines: the first one is executed to parse the initial part of each line until it decides whether the logs are plain text or CSV; once decided, it invokes one of the other two pipelines, one specific to plain-text logs and the other to CSV.

Several test cases and documentation are added for CSV
support.

Additional fields are available now, and some others have
been renamed to represent more accurately the values
they store.

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
(cherry picked from commit bb78931)
jsoriano added a commit that referenced this pull request Feb 17, 2021
Refactor PostgreSQL module to support logs in CSV format (same commit description as above).

(cherry picked from commit bb78931)

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
Co-authored-by: José Arthur Benetasso Villanova <jose.arthur@gmail.com>
@jsoriano (Member) commented:

Merged and backported to 7.x, this will hopefully be available in 7.12, thanks @azlev!

v1v added a commit to v1v/beats that referenced this pull request Feb 17, 2021
…dows-7

* upstream/master: (332 commits)
  Use ECS v1.8.0 (elastic#24086)
  Add support for postgresql csv logs (elastic#23334)
  [Heartbeat] Refactor config system (elastic#23467)
  [CI] install docker-compose with retry (elastic#24069)
  Add nodes to filebeat-kubernetes.yaml ClusterRole - fixes elastic#24051 (elastic#24052)
  updating manifest files for filebeat threatintel module (elastic#24074)
  Add Zeek Signatures (elastic#23772)
  Update Beats to ECS 1.8.0 (elastic#23465)
  Support running Docker logging plugin on ARM64 (elastic#24034)
  Fix ec2 metricset fields.yml and add integration test (elastic#23726)
  Only build targz and zip versions of Beats if PACKAGES is set in agent (elastic#24060)
  [Filebeat] Add field definitions for known Netflow/IPFIX vendor fields (elastic#23773)
  [Elastic Agent] Enroll with Fleet Server (elastic#23865)
  [Filebeat] Convert logstash logEvent.action objects to strings (elastic#23944)
  [Ingest Management] Fix reloading of log level for services (elastic#24055)
  Add Agent standalone k8s manifest (elastic#23679)
  [Metricbeat][Kubernetes] Extend state_node with more conditions (elastic#23905)
  [CI] googleStorageUploadExt step (elastic#24048)
  Check fields are documented for aws metricsets (elastic#23887)
  Update go-concert to 0.1.0 (elastic#23770)
  ...
Labels: review, Team:Integrations, v7.12.0
Projects: none
Linked issues: none
4 participants