
[Logs+] Add JSON parsing pipeline #95522

Closed
felixbarny opened this issue Apr 25, 2023 · 6 comments · Fixed by #96083
Labels: :Data Management/Data streams · >enhancement · Team:Data Management

felixbarny commented Apr 25, 2023

Enhance the logs-*-* index template with a default ingest pipeline that first performs a pre-flight check on whether the message field might be JSON, and then uses the JSON processor to decode the JSON and merge it top-level with the document.

See also this prototype: https://gist.github.com/felixbarny/a9a2f6243153d5508643fd95ac968a88#file-routing-yml-L114-L174
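A minimal sketch of what such a default pipeline could look like (the pipeline name and the pre-flight condition are illustrative assumptions, not the final implementation; `field`, `add_to_root`, `if`, and `ignore_failure` are standard options of the json processor):

```json
PUT _ingest/pipeline/logs-json-message
{
  "description": "Sketch: parse the message field as JSON and merge it top-level into the document",
  "processors": [
    {
      "json": {
        "if": "ctx.message instanceof String && ctx.message.startsWith('{') && ctx.message.endsWith('}')",
        "field": "message",
        "add_to_root": true,
        "ignore_failure": true
      }
    }
  ]
}
```

The `if` condition is the pre-flight check: JSON decoding is only attempted when the message looks like a JSON object, and `ignore_failure` keeps documents with malformed JSON from being rejected.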

Open questions and things to consider

  • How do users opt out of default JSON parsing? They could override the index template. Maybe that's good enough.
  • Similarly to the logs@custom component template, should we call out to a custom ingest pipeline? Align with the naming in Fleet.
@felixbarny felixbarny added the :Data Management/Data streams Data streams and their lifecycles label Apr 25, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Apr 25, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@eyalkoren (Contributor)

This issue is currently blocked on #95782.

@eyalkoren eyalkoren changed the title Automatically parse JSON logs by default [Logs+] Automatically parse JSON logs by default May 10, 2023
@eyalkoren (Contributor)

> Open questions and things to consider
>
>   • How do users opt out of default JSON parsing? They could override the index template. Maybe that's good enough.
>   • Similarly to the logs@custom component template, should we call out to a custom ingest pipeline? Align with the naming in Fleet.

@felixbarny I'm thinking of adding exactly what you proposed in your prototype, only with "ignore_missing_pipeline": true for the logs-json pipeline processor. Would this be a good enough opt-out solution (removing the logs-json pipeline)? In addition (or instead), the pipeline processor's condition could also look for a dedicated field (e.g. "log.json.parse": false) in the original document, thus allowing opting out for specific log events.
Any other ideas for disabling this specific pipeline?
Anything else required to make it more "custom"?
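For illustration, the calling processor could be sketched like this (the pipeline name is hypothetical; `ignore_missing_pipeline` and per-processor `if` conditions are existing pipeline-processor options, and the condition assumes the field is ingested as a nested object rather than a flat dotted field name):

```json
{
  "pipeline": {
    "name": "logs-json",
    "ignore_missing_pipeline": true,
    "if": "ctx.log?.json?.parse != false"
  }
}
```

With this condition, parsing runs unless a document explicitly carries "log": {"json": {"parse": false}}, and deleting the logs-json pipeline disables parsing without breaking ingestion.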


eyalkoren commented May 11, 2023

One issue with having a managed pipeline refer to another managed pipeline is that we'd need to add dependency validation to make sure that the dependee pipeline is installed before the dependent pipeline. This would add complexity for multiple reasons:

  • it would require parsing and resolving the pipeline configuration in order to know which processors, as well as any additional configurations, are used
  • the validation would need to be recursive, because each pipeline can in turn depend on another pipeline
  • it is not restricted to pipeline processors; for example, on_failure can refer to a pipeline (though I don't know if this has to be an inlined pipeline or can refer to an independent pipeline)
  • we'd need to take ignore_missing_pipeline into account

One option is to add a dependeePipelines list to IngestPipelineConfig. Maybe not the prettiest, but it would enforce proper ordering and data consistency through a robust and simple implementation. One caveat, though: we'd need to keep track of non-required dependee pipelines that get explicitly removed.

@felixbarny (Member, Author)

> Would this be a good enough opt-out solution (removing the logs-json pipeline)?

I don't think so, as the pipeline would be re-installed.

> the pipeline processor condition can also look for a dedicated field (e.g. "log.json.parse": false) in the original document, thus allowing opting out for specific log events

I think something like that could work. The question is where users would set that, though. I suppose the answer is that they can set it in the logs@custom pipeline.
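For example, a user could set that field for all documents of a data stream in their logs@custom pipeline with a set processor (a sketch; the field name follows the proposal above):

```json
PUT _ingest/pipeline/logs@custom
{
  "processors": [
    {
      "set": {
        "description": "Opt out of automatic JSON parsing",
        "field": "log.json.parse",
        "value": false
      }
    }
  ]
}
```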

> make sure that the dependee pipeline is installed before the dependent pipeline

As a simple workaround, can we just set ignore_missing_pipeline when invoking the JSON pipeline?

@eyalkoren (Contributor)

> I don't think so, as the pipeline would be re-installed

That's what I meant by:

> One caveat, though: we'd need to keep track of non-required dependee pipelines that get explicitly removed

Not ideal, and I'm not sure how to do that, but it's an option.

> As a simple workaround, can we just set ignore_missing_pipeline when invoking the JSON pipeline?

It's not only a question of whether there is an error due to the missing pipeline; there is also a race condition when installing the pipelines concurrently, which may lead to indexing inconsistency. Maybe we can simply live with that.
But if we don't rely on the removal of the JSON pipeline for opting out, I think my proposal for the implementation should be simple and robust.

@felixbarny felixbarny changed the title [Logs+] Automatically parse JSON logs by default [Logs+] Add JSON parsing pipeline Jun 7, 2023