Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ECS tooling rewrite brought about by the need to reuse field sets as a different name #864

Merged
merged 113 commits into from
Jun 11, 2020

Conversation

webmat
Copy link
Contributor

@webmat webmat commented Jun 4, 2020

This PR introduces a significant rewrite of the ECS tooling. It's ultimately meant to allow us to reuse field sets as a different name inside themselves, which was not possible before. Examples of this are reusing process at process.parent, or the upcoming ways to capture multiple users in an event (#809).

This PR does not use the new features to fix how process.parent.* fields are currently defined (they're currently explicitly duplicated, and there are subtle mistakes), nor does it introduce the new user fields. The new features are however thoroughly tested in the unit tests.

The Plan

Since this was a significant rewrite, I've tried to reduce the amount of changes to ECS per se to a minimum.

This way reviewers can review the generated files (the asciidoc, and ecs_nested.yml), in order to confirm that things are still working as expected. Of course a deep review of the code is welcome for those who have time, as well 😉

I will open up follow-up PRs to this one:

TODO

I'm opening this PR as a draft because there's still one thing that must be fixed before this is fully ready. However I wanted to open it as early as possible, to leave more time for review.

  • Finalize the changelog to reflect more recent changes introduced by this rewrite.
  • We need to fix issues around the tracing fields. Tracing is a special field set in that it's marked as root=true just like Base fields (meaning they're not nested under tracing.). However it contains two nested ID fields, trace.id and transaction.id. This rewrite didn't account for this, and one of them gets overwritten.

Major changes

  • This PR adds the ability to reuse field sets as a different field name (more details below in schemas/*.yml)
  • The ECS tooling no longer makes efforts to ensure that custom fields merged in via --include result in schema definitions that are "good ECS citizens".
    • The --include CLI flag will soon be used by the new RFC process, to build ECS artifacts that include early stage (unofficial) major additions. See Document ECS RFC process #833 for the first step of introducing this new process.
    • This means the tooling must now accept included files as they are, with all of the power this entails. We can still re-introduce a safety net later, but a this time the ECS tools are as safe to use as a loaded shotgun (and it's pointed at your feet) 😄
  • Documentation about how schema files are defined (schemas/README.md) was updated significantly.
  • generated/ecs/ecs_nested.yml
    • breaking in generated/ecs/ecs_nested.yml, the format of the array reusable.expected is changing in a breaking way. It used to be an array of flat names (e.g. client.user, server.user...). Now it's an array of objects ({at:client, as:user, full: client.user}, {at:user, as:target, full: user.target}). Programs that consume this file will need to adjust to this new format, in order to be able to consume this file as of ECS 1.6.0. This is necessary to support field set reuse as a different name.
    • addition (non-breaking): field sets that are the recipients of reused fields (e.g. the source field set, if you think of reusing user at source.user) now have a new attribute called reused_here, listing all field sets that are, well, reused here :-)
      • This new attribute is meant to replace nestings which was a list of flat names and is not enough to capture reuse as a different name.
  • schemas/*.yml
    • The old way to define field reuse still works (list of destinations as flat names). However to support nesting as a different name, the array now also supports objects such as {at: process, as: parent} or {at: user, as: target}. See comments in schemas/process.yml and schemas/user.yml.
    • We're introducing an optional attribute reusable.order, to ensure chained reuses happen in the right order (group => user, then user => many places). We used to do reuse by reference, which caused problems and added complexity.
      • A better approach long term may be to walk the reuse tree instead (think of it like a dependency tree). But at this time, order is used exactly once in all of ECS, for group. So this better approach sounded overkill.
      • The introduction of this attribute has one big advantage over how it used to be done: we can now deep copy fields immediately when performing reuse, and right away fill in new values specific to the new nesting location.
  • We're introducing a third file in generated/ecs/ that contains the deeply nested representation of fields. This is a representation we've had to develop a while ago, to allow field merging. But it's not meant to be published. So the file is generated at generated/ecs/ecs.yml, but it's ignored by git. Its purpose is to help us debug the code, not to publish a third file format :-)
  • For a while the asciidoc generator operated on this deeply nested structure, but now it's back to operating on the "ecs_nested" representation. The introduction of reused_here made this possible.
  • generator.py contains almost no logic now. All of the logic is now in new scripts with narrower focus. This makes them much easier to test, without massive test setups. The new scripts are in scripts/schema/*.py. They also supersede scripts/schema_reader.py which is now gone.
  • The Python unit tests now run in verbose mode.

Caveats

  • In generated/ecs/ecs_nested.yml: chained reuses don't populate nestings all the way anymore. Example: group is reused in user, then user in client: client's nestings attribute will no longer list client.user.group.
    • This is something that could be fixed, but I'm not sure this is worth it.

Mathieu Martin added 30 commits April 16, 2020 16:20
Bunch of stuff moving to new script schema_processor.py
Anything can be merged over other fields. Here's not the place to enforce
good ECS citizenry.
My brain can't deal with Python list comprehensions
@webmat webmat marked this pull request as ready for review June 9, 2020 15:03
@webmat
Copy link
Contributor Author

webmat commented Jun 9, 2020

I'm not super happy with the way I'm currently fixing the tracing fields getting mangled.

I would have preferred a change that didn't create so many changes in ecs_nested.yml. The nesting structure there used to be under the "contextual name" of a field, now it's nested under the full name of the field. I would expect programs reading this file to use the attributes (name or flat_name), not the structure of the nesting, for information. So I think this may be an acceptable change in the YAML format.

To illustrate, where we used to have

client:
  description: 'A client is defined ...'
  fields:
    address: # <-- contextual name
      name: address
      flat_name: client.address
      type: keyword

Now we have this instead:

client:
  description: 'A client is defined ...'
  fields:
    client.address: # <-- full name
      name: address
      flat_name: client.address
      type: keyword

I took this approach because the current attribute name had problems. In general it acts as a contextual name, or the name of the field within the field set (e.g. client.address has the contextual name address, when inside client). But this was never correctly adjusted for fields that were being reused. I suppose most folks used flat_name anyway, as the correct full name. The only place where the contextual name is absolutely necessary is in the Beats generator. My recent changes reconstruct it there only.

The new approach of adding node_name attributes everywhere simplifies edge cases like the tracing field set. This new attribute is internal only, and is therefore not output in ecs_flat.yml nor ecs_nested.yml.

I'm open to suggestions on how we can solve this better, if there's any :-)

Copy link
Member

@ebeahan ebeahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more observations, asking more-so for my own better understanding of the changes.

docs/field-details.asciidoc Outdated Show resolved Hide resolved
docs/field-details.asciidoc Show resolved Hide resolved
@marshallmain
Copy link
Contributor

One more use case that throws a wrench in things...in our custom endpoint schemas we have a target fieldset where we reuse the process fieldset. It's useful to have target.process.parent as well in this case, but when process.parent is switched over to use this implementation it won't be carried over to target.process.parent anymore since the self-reused fields don't come with regular reuses.

Since this PR isn't making that change yet it's not a blocker but something to consider since that change is coming up soon. Could we implement a flag that determines whether self-reused fieldsets get carried with the fieldset in regular reuse?

@webmat
Copy link
Contributor Author

webmat commented Jun 9, 2020

Haha I had been wondering if this assumption would hold. Seems like it doesn't 😂

There's no inherent reason to prevent this wholesale. The two cases that currently make use of self-nestings fall on both sides of the question:

  • I think it makes sense that process.parent would follow along, when reusing process elsewhere, as you describe.
  • I do not want the 3 self-nestings of user described in [ECS] Multiple users in an event proposal #809 to follow along in all the places where user is reused. I don't think it's needed.

So I'm open to adding an option to handle whether the self-nestings follow along.

Another way we may be able to work around this on the endpoint side is by you explicitly reusing twice, like this:

- name: process
  ...
  reusable:
    expected:
      - at: target
        as: process
      - at: target.process
        as: parent

Right now this would probably break, as I've been using the "nest as" notation exclusively for self-nestings. So the code makes assumptions in this direction for now.

But the semantics of the schema DSL make it sensible to define things like this. We could adjust the tools to support this as well.

So these are two ways we can make this work, I think. Happy to go either way.

@marshallmain
Copy link
Contributor

Ah I didn't think about reusing it twice, that's a good idea and probably the simplest and most versatile solution.

@webmat
Copy link
Contributor Author

webmat commented Jun 9, 2020

FYI 2 of the branches for follow-up PRs (once this is merged), if you're curious:

I haven't started work on the 3rd PR to put in place the various improvements I've noted along the way.

@webmat
Copy link
Contributor Author

webmat commented Jun 9, 2020

@marshallmain The code may still need adjustments to make it work, though.

@webmat
Copy link
Contributor Author

webmat commented Jun 10, 2020

@marshallmain Are there any blockers for you here?

Copy link
Contributor

@marshallmain marshallmain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Member

@ebeahan ebeahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - no additional comments or outstanding issues from my end.

@simitt
Copy link
Contributor

simitt commented Jun 11, 2020

@webmat I was recently working on go struct generation using the ecs_nested.yml.
One thing I was wondering was, what are your thoughts on fully nesting the fields? E.g. instead of having agent.build.original have a structure like:

agent:
  fields: 
    build:
      fields:
        original: 

It would make parsing the file more straight forward, and since every field additionally has the flat_name attribute, no information would be lost.
Switching over to a fully nested structure would also allow for having a description on every field set. E.g. having cloud.project.name and cloud.project.id, there is no description available for cloud.project. When creating structs out of the file one can use the description for auto-generating documentation, therefore it would be helpful if it was complete.

Don't feel blocked by my comments, just some thoughts and feedback around working with the nested fields.

@webmat
Copy link
Contributor Author

webmat commented Jun 11, 2020

@simitt Funny you ask that.

First, to directly answer your question of documenting the intermediate parent fields, this is already supported. We do it for dns.answers, log.syslog and network.inner. All of these are object fields that have additional fields defined under them, either directly or via field reuse. They do end up on the documentation (see log.syslog, for example).

So if you need to do this to add context on a parent field in your custom fields, or in fields you'd like to submit as a PR to ECS, go ahead! It works 🙂

As an aside, the format you describe is precisely the format of the new git-ignored file introduced by this PR, at generated/ecs/ecs.yml 🙂 If you're curious to check it out, you can simply run make and it'll be generated. Alternatively, the format is described at the top of the scripts/schema/loader.py as well. We've introduced it a while ago (thanks Marshall) to simplify merging additional fields on top of ECS. But saving it as a file is new to this PR, as a debugging help.

For now I'm holding off on publishing a third official representation for the fields. It seems like 3 would be too much. Although I'm curious if other folks also think it would be good to replace ecs_nested.yml with this new format.

I think ecs_nested is easier to parse in that there's only 2 levels of details, the field sets details, and a simple list of all of their fields. On the other hand, when building some of the artifacts like the Elasticsearch template, we do have to re-create the deeply nested structure again, to match ES' way of defining all the fields 😂

So we're not adding it for now, but if people think there's a need for that, let us know 🙂

@webmat
Copy link
Contributor Author

webmat commented Jun 11, 2020

Thanks for the feedback everyone

@webmat
Copy link
Contributor Author

webmat commented Jun 17, 2020

All 3 follow-up PRs to this are now linked in the body of the PR: #868, #869 and #871

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants