ECS tooling rewrite brought about by the need to reuse field sets as a different name #864

webmat · 2020-06-04T20:49:00Z

This PR introduces a significant rewrite of the ECS tooling. It's ultimately meant to allow us to reuse field sets as a different name inside themselves, which was not possible before. Examples of this are reusing process at process.parent, or the upcoming ways to capture multiple users in an event (#809).

This PR does not use the new features to fix how process.parent.* fields are currently defined (they're currently explicitly duplicated, and there are subtle mistakes), nor does it introduce the new user fields. The new features are however thoroughly tested in the unit tests.

The Plan

Since this was a significant rewrite, I've tried to reduce the amount of changes to ECS per se to a minimum.

This way reviewers can review the generated files (the asciidoc, and ecs_nested.yml), in order to confirm that things are still working as expected. Of course a deep review of the code is welcome for those who have time, as well 😉

I will open up follow-up PRs to this one:

Put in place many improvements that were noticed during this rewrite. Some of these are already in this PR, as commented code that starts with a TODO. Additional cleanup and improvements noticed when working on #864 #871
Define process.parent.* fields with the new nesting mechanism instead of manual duplication Define process.parent via the new field reuse mechanism #868
Define the new user.* fields to represent multiple users in an event, as described in the [ECS] Multiple users in an event proposal #809 proposal Define fields to allow representing multiple users in an event. #869

TODO

I'm opening this PR as a draft because there's still one thing that must be fixed before this is fully ready. However I wanted to open it as early as possible, to leave more time for review.

Finalize the changelog to reflect more recent changes introduced by this rewrite.
We need to fix issues around the tracing fields. Tracing is a special field set in that it's marked as root=true just like Base fields (meaning they're not nested under tracing.). However it contains two nested ID fields, trace.id and transaction.id. This rewrite didn't account for this, and one of them gets overwritten.

Major changes

This PR adds the ability to reuse field sets as a different field name (more details below in schemas/*.yml)
The ECS tooling no longer makes efforts to ensure that custom fields merged in via --include result in schema definitions that are "good ECS citizens".
- The --include CLI flag will soon be used by the new RFC process, to build ECS artifacts that include early stage (unofficial) major additions. See Document ECS RFC process #833 for the first step of introducing this new process.
- This means the tooling must now accept included files as they are, with all of the power this entails. We can still re-introduce a safety net later, but a this time the ECS tools are as safe to use as a loaded shotgun (and it's pointed at your feet) 😄
Documentation about how schema files are defined (schemas/README.md) was updated significantly.
generated/ecs/ecs_nested.yml
- breaking in generated/ecs/ecs_nested.yml, the format of the array reusable.expected is changing in a breaking way. It used to be an array of flat names (e.g. client.user, server.user...). Now it's an array of objects ({at:client, as:user, full: client.user}, {at:user, as:target, full: user.target}). Programs that consume this file will need to adjust to this new format, in order to be able to consume this file as of ECS 1.6.0. This is necessary to support field set reuse as a different name.
- addition (non-breaking): field sets that are the recipients of reused fields (e.g. the source field set, if you think of reusing user at source.user) now have a new attribute called reused_here, listing all field sets that are, well, reused here :-)
  - This new attribute is meant to replace nestings which was a list of flat names and is not enough to capture reuse as a different name.
schemas/*.yml
- The old way to define field reuse still works (list of destinations as flat names). However to support nesting as a different name, the array now also supports objects such as {at: process, as: parent} or {at: user, as: target}. See comments in schemas/process.yml and schemas/user.yml.
- We're introducing an optional attribute reusable.order, to ensure chained reuses happen in the right order (group => user, then user => many places). We used to do reuse by reference, which caused problems and added complexity.
  - A better approach long term may be to walk the reuse tree instead (think of it like a dependency tree). But at this time, order is used exactly once in all of ECS, for group. So this better approach sounded overkill.
  - The introduction of this attribute has one big advantage over how it used to be done: we can now deep copy fields immediately when performing reuse, and right away fill in new values specific to the new nesting location.
We're introducing a third file in generated/ecs/ that contains the deeply nested representation of fields. This is a representation we've had to develop a while ago, to allow field merging. But it's not meant to be published. So the file is generated at generated/ecs/ecs.yml, but it's ignored by git. Its purpose is to help us debug the code, not to publish a third file format :-)
For a while the asciidoc generator operated on this deeply nested structure, but now it's back to operating on the "ecs_nested" representation. The introduction of reused_here made this possible.
generator.py contains almost no logic now. All of the logic is now in new scripts with narrower focus. This makes them much easier to test, without massive test setups. The new scripts are in scripts/schema/*.py. They also supersede scripts/schema_reader.py which is now gone.
The Python unit tests now run in verbose mode.

Caveats

In generated/ecs/ecs_nested.yml: chained reuses don't populate nestings all the way anymore. Example: group is reused in user, then user in client: client's nestings attribute will no longer list client.user.group.
- This is something that could be fixed, but I'm not sure this is worth it.

…tion

… name

…d of 'top level').

Bunch of stuff moving to new script schema_processor.py

Anything can be merged over other fields. Here's not the place to enforce good ECS citizenry.

… nested tree

My brain can't deal with Python list comprehensions

webmat · 2020-06-09T15:22:51Z

I'm not super happy with the way I'm currently fixing the tracing fields getting mangled.

I would have preferred a change that didn't create so many changes in ecs_nested.yml. The nesting structure there used to be under the "contextual name" of a field, now it's nested under the full name of the field. I would expect programs reading this file to use the attributes (name or flat_name), not the structure of the nesting, for information. So I think this may be an acceptable change in the YAML format.

To illustrate, where we used to have

client:
  description: 'A client is defined ...'
  fields:
    address: # <-- contextual name
      name: address
      flat_name: client.address
      type: keyword

Now we have this instead:

client:
  description: 'A client is defined ...'
  fields:
    client.address: # <-- full name
      name: address
      flat_name: client.address
      type: keyword

I took this approach because the current attribute name had problems. In general it acts as a contextual name, or the name of the field within the field set (e.g. client.address has the contextual name address, when inside client). But this was never correctly adjusted for fields that were being reused. I suppose most folks used flat_name anyway, as the correct full name. The only place where the contextual name is absolutely necessary is in the Beats generator. My recent changes reconstruct it there only.

The new approach of adding node_name attributes everywhere simplifies edge cases like the tracing field set. This new attribute is internal only, and is therefore not output in ecs_flat.yml nor ecs_nested.yml.

I'm open to suggestions on how we can solve this better, if there's any :-)

ebeahan

A couple more observations, asking more-so for my own better understanding of the changes.

docs/field-details.asciidoc

… them.

marshallmain · 2020-06-09T19:45:58Z

One more use case that throws a wrench in things...in our custom endpoint schemas we have a target fieldset where we reuse the process fieldset. It's useful to have target.process.parent as well in this case, but when process.parent is switched over to use this implementation it won't be carried over to target.process.parent anymore since the self-reused fields don't come with regular reuses.

Since this PR isn't making that change yet it's not a blocker but something to consider since that change is coming up soon. Could we implement a flag that determines whether self-reused fieldsets get carried with the fieldset in regular reuse?

webmat · 2020-06-09T20:16:19Z

Haha I had been wondering if this assumption would hold. Seems like it doesn't 😂

There's no inherent reason to prevent this wholesale. The two cases that currently make use of self-nestings fall on both sides of the question:

I think it makes sense that process.parent would follow along, when reusing process elsewhere, as you describe.
I do not want the 3 self-nestings of user described in [ECS] Multiple users in an event proposal #809 to follow along in all the places where user is reused. I don't think it's needed.

So I'm open to adding an option to handle whether the self-nestings follow along.

Another way we may be able to work around this on the endpoint side is by you explicitly reusing twice, like this:

- name: process
  ...
  reusable:
    expected:
      - at: target
        as: process
      - at: target.process
        as: parent

Right now this would probably break, as I've been using the "nest as" notation exclusively for self-nestings. So the code makes assumptions in this direction for now.

But the semantics of the schema DSL make it sensible to define things like this. We could adjust the tools to support this as well.

So these are two ways we can make this work, I think. Happy to go either way.

marshallmain · 2020-06-09T20:48:09Z

Ah I didn't think about reusing it twice, that's a good idea and probably the simplest and most versatile solution.

webmat · 2020-06-09T21:06:59Z

FYI 2 of the branches for follow-up PRs (once this is merged), if you're curious:

define process.parent via reuse, fixing a few issues at the same time: nest-as-process-parent
define multiple users in an event (from [ECS] Multiple users in an event proposal #809): multiple-users

I haven't started work on the 3rd PR to put in place the various improvements I've noted along the way.

webmat · 2020-06-09T21:07:51Z

@marshallmain The code may still need adjustments to make it work, though.

webmat · 2020-06-10T18:22:54Z

@marshallmain Are there any blockers for you here?

marshallmain

Looks good!

ebeahan

LGTM - no additional comments or outstanding issues from my end.

simitt · 2020-06-11T07:14:34Z

@webmat I was recently working on go struct generation using the ecs_nested.yml.
One thing I was wondering was, what are your thoughts on fully nesting the fields? E.g. instead of having agent.build.original have a structure like:

agent:
  fields: 
    build:
      fields:
        original:

It would make parsing the file more straight forward, and since every field additionally has the flat_name attribute, no information would be lost.
Switching over to a fully nested structure would also allow for having a description on every field set. E.g. having cloud.project.name and cloud.project.id, there is no description available for cloud.project. When creating structs out of the file one can use the description for auto-generating documentation, therefore it would be helpful if it was complete.

Don't feel blocked by my comments, just some thoughts and feedback around working with the nested fields.

webmat · 2020-06-11T19:04:19Z

@simitt Funny you ask that.

First, to directly answer your question of documenting the intermediate parent fields, this is already supported. We do it for dns.answers, log.syslog and network.inner. All of these are object fields that have additional fields defined under them, either directly or via field reuse. They do end up on the documentation (see log.syslog, for example).

So if you need to do this to add context on a parent field in your custom fields, or in fields you'd like to submit as a PR to ECS, go ahead! It works 🙂

As an aside, the format you describe is precisely the format of the new git-ignored file introduced by this PR, at generated/ecs/ecs.yml 🙂 If you're curious to check it out, you can simply run make and it'll be generated. Alternatively, the format is described at the top of the scripts/schema/loader.py as well. We've introduced it a while ago (thanks Marshall) to simplify merging additional fields on top of ECS. But saving it as a file is new to this PR, as a debugging help.

For now I'm holding off on publishing a third official representation for the fields. It seems like 3 would be too much. Although I'm curious if other folks also think it would be good to replace ecs_nested.yml with this new format.

I think ecs_nested is easier to parse in that there's only 2 levels of details, the field sets details, and a simple list of all of their fields. On the other hand, when building some of the artifacts like the Elasticsearch template, we do have to re-create the deeply nested structure again, to match ES' way of defining all the fields 😂

So we're not adding it for now, but if people think there's a need for that, let us know 🙂

webmat · 2020-06-11T19:07:08Z

Thanks for the feedback everyone

webmat · 2020-06-17T20:51:49Z

All 3 follow-up PRs to this are now linked in the body of the PR: #868, #869 and #871

Mathieu Martin added 30 commits April 16, 2020 16:20

Function to turn single word reuse locations into the dictionary nota…

6568fe2

…tion

Validate that keys 'as' and 'at' are present in dictionary notation

e0befee

Actually resolve the shorthands when reading schemas

53556a0

Unrelated: removed dead code

88313e7

Precompute the full path of a nesting, considering nesting as another…

05d2847

… name

First pass at self-nesting

ee51313

Missed a generated file

108dd35

Code formatting

ab82e6e

Changelog

a3b5fde

Fix asciidoc link when nesting as another name

2918cdd

Change the wording of using a field set at the root of events (instea…

d0c0271

…d of 'top level').

Remove commented code

c90885f

WIP

df9843d

Start massive refactor of schema_reader...

503f90b

Bunch of stuff moving to new script schema_processor.py

Make the 'reusable' field set properties explicit

abee212

Move a bunch of code to schema_processor.py

e11fe38

Give schemas/README.md some much needed attention

4b2039c

new schema loader WIP

819575c

About to turn the schema loader into a shotgun...

0af1aa3

Anything can be merged over other fields. Here's not the place to enforce good ECS citizenry.

The schema loader now merges EVERYTHING, taking special care of arrays

4725167

Import the changes from PR elastic#821

d16c361

Move nesting structure comment to top of file

5b27a8b

Adjust tests for safe loading of schemas for new script name

50a49a3

Start future schema cleaner w/ basic visitor function to navigate the…

a1c20d6

… nested tree

Raise clear error on missing schema name

94ce10b

Add 'mock' testing dependency

5344341

Make 'title' a schema_detail as well

3fe758b

Keep improving schema/readme

e28c737

Create a helper for list subtraction...

9bb6722

My brain can't deal with Python list comprehensions

schema level cleanups looking pretty good

6202462

Mathieu Martin added 3 commits June 9, 2020 09:53

Fix merge_fields bug

91e4660

Fix issue with tracing fields getting mangled

98745ca

Adjust sanity check for new nesting under complete field name

866f1b9

webmat marked this pull request as ready for review June 9, 2020 15:03

webmat mentioned this pull request Jun 9, 2020

Add flexibility to the field set reuse mechanism #820

Closed

ebeahan reviewed Jun 9, 2020

View reviewed changes

docs/field-details.asciidoc Outdated Show resolved Hide resolved

docs/field-details.asciidoc Show resolved Hide resolved

Mathieu Martin added 5 commits June 9, 2020 11:40

Changelog

34595a3

Space. The final frontier.

97613cf

Raise if either reused schema or destination schema have root=true

c841c53

Fix spacing

d6ec0ac

Marshall was right, these two functions were the same. Deleted one of…

91e7f16

… them.

marshallmain approved these changes Jun 10, 2020

View reviewed changes

ebeahan approved these changes Jun 10, 2020

View reviewed changes

webmat merged commit b217459 into elastic:master Jun 11, 2020

webmat mentioned this pull request Jun 17, 2020

Additional cleanup and improvements noticed when working on #864 #871

Merged

webmat pushed a commit that referenced this pull request Jun 26, 2020

Additional cleanup and improvements noticed when working on #864 (#871)

d8a5e2a

ebeahan mentioned this pull request Oct 13, 2020

Merge custom and core multi_fields array #982

Merged

ebeahan mentioned this pull request Dec 18, 2020

Document reuseable.order attribute #1207

Open

ebeahan mentioned this pull request Jun 16, 2021

Add support for multi-level self-nestings #1459

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ECS tooling rewrite brought about by the need to reuse field sets as a different name #864

ECS tooling rewrite brought about by the need to reuse field sets as a different name #864

webmat commented Jun 4, 2020 •

edited

Loading

webmat commented Jun 9, 2020 •

edited

Loading

ebeahan left a comment

marshallmain commented Jun 9, 2020

webmat commented Jun 9, 2020

marshallmain commented Jun 9, 2020

webmat commented Jun 9, 2020

webmat commented Jun 9, 2020

webmat commented Jun 10, 2020

marshallmain left a comment

ebeahan left a comment

simitt commented Jun 11, 2020

webmat commented Jun 11, 2020

webmat commented Jun 11, 2020

webmat commented Jun 17, 2020

ECS tooling rewrite brought about by the need to reuse field sets as a different name #864

ECS tooling rewrite brought about by the need to reuse field sets as a different name #864

Conversation

webmat commented Jun 4, 2020 • edited Loading

webmat commented Jun 9, 2020 • edited Loading

ebeahan left a comment

Choose a reason for hiding this comment

marshallmain commented Jun 9, 2020

webmat commented Jun 9, 2020

marshallmain commented Jun 9, 2020

webmat commented Jun 9, 2020

webmat commented Jun 9, 2020

webmat commented Jun 10, 2020

marshallmain left a comment

Choose a reason for hiding this comment

ebeahan left a comment

Choose a reason for hiding this comment

simitt commented Jun 11, 2020

webmat commented Jun 11, 2020

webmat commented Jun 11, 2020

webmat commented Jun 17, 2020

webmat commented Jun 4, 2020 •

edited

Loading

webmat commented Jun 9, 2020 •

edited

Loading