Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use safe_merge to handle multiple definitions of a single field #821

Closed

Conversation

jonathan-buttner
Copy link
Contributor

This PR addresses a potential issue where a user defining their own schema files accidentally defines a custom extension for the same field twice. For example in endpoint we extend the file schema:

---
- name: file
  title: File
  short: Fields describing files.
  description: ...
  type: group
  fields:
    - name: entry_modified
      level: custom
      type: double
      description: >
        Time of last status change.  See `st_ctim` member of `struct stat`.

If we had another file that redefines fields for file, the actual type used would depend on which schema file was read last by the ecs generator.

This probably won't happen in practice, but I figured it was worth pointing out just in case.

With this change, it will throw an error indicating that file was found more than once:

python scripts/generator.py --out ../gen --include ../endpoint-app-team/custom_schemas --subset ../endpoint-app-team/custom_subsets/elastic_endpoint/events/*
Running generator. ECS version 1.6.0-dev
Loading default schemas
Loading user defined schemas: ['../endpoint-app-team/custom_schemas/custom_call_stack.yml', '../endpoint-app-team/custom_schemas/custom_dll.yml', '../endpoint-app-team/custom_schemas/custom_elastic.yml', '../endpoint-app-team/custom_schemas/custom_endpoint.yml', '../endpoint-app-team/custom_schemas/custom_file.yml', '../endpoint-app-team/custom_schemas/custom_hash.yml', '../endpoint-app-team/custom_schemas/custom_http.yml', '../endpoint-app-team/custom_schemas/custom_macro.yml', '../endpoint-app-team/custom_schemas/custom_malware_classification.yml', '../endpoint-app-team/custom_schemas/custom_os.yml', '../endpoint-app-team/custom_schemas/custom_process.yml', '../endpoint-app-team/custom_schemas/custom_target.yml', '../endpoint-app-team/custom_schemas/custom_test.yml', '../endpoint-app-team/custom_schemas/custom_token.yml']
Traceback (most recent call last):
  File "scripts/generator.py", line 91, in <module>
    main()
  File "scripts/generator.py", line 33, in main
    intermediate_custom = schema_reader.load_schemas(include_glob)
  File "/Users/jbuttner/proj/es/ecs/scripts/schema_reader.py", line 22, in load_schemas
    fields_intermediate = load_schema_files(files)
  File "/Users/jbuttner/proj/es/ecs/scripts/schema_reader.py", line 31, in load_schema_files
    fields_nested = ecs_helpers.safe_merge_dicts(fields_nested, new_fields)
  File "/Users/jbuttner/proj/es/ecs/scripts/generators/ecs_helpers.py", line 50, in safe_merge_dicts
    raise ValueError('Duplicate key found when merging dictionaries: {0}'.format(key))
ValueError: Duplicate key found when merging dictionaries: file

@webmat
Copy link
Contributor

webmat commented Apr 21, 2020

Actually the goal is specifically to allow that. A few ways this can be used are:

  • To add a single field within an existing schema. This is discouraged in general, but there can be legit reasons to do that (e.g. be forward-compatible with an upcoming version of ECS)
  • To nest ECS field sets within the user's custom schemas:
- name: file
  reusable:
    expected:
    - my_custom_stuff

@webmat webmat closed this Apr 21, 2020
@jonathan-buttner
Copy link
Contributor Author

Actually the goal is specifically to allow that. A few ways this can be used are:

  • To add a single field within an existing schema. This is discouraged in general, but there can be legit reasons to do that (e.g. be forward-compatible with an upcoming version of ECS)
  • To nest ECS field sets within the user's custom schemas:
- name: file
  reusable:
    expected:
    - my_custom_stuff

Hmm, I think we might be talking about different things. I believe both of the cases you mentioned should still work correctly even with this change.

The scenario I'm talking about is this:

Let's say I have a custom schema for file, and I accidentally create another custom schema for file again:

Intentional custom schema
file.yml

---
- name: file
  title: File
  group: 2
  short: Fields describing files.
  description: >
    TODO
  type: group
  fields:
    - name: entry_modified
      level: custom
      type: keyword <---- keyword in this file
      description: >
        Time of last status change.  See `st_ctim` member of `struct stat`.

Accidental custom schema
other_file.yml

---
- name: file
  title: File
  group: 2
  short: Fields describing files.
  description: >
    TODO
  type: group
  fields:
    - name: entry_modified
      level: custom
      type: double <--- double in this file
      description: >
        Time of last status change.  See `st_ctim` member of `struct stat`.

subset file:

---
base:
  fields:
    message:
      fields: "*"
file:
  fields:
    entry_modified:
      fields: "*"

If I run the generator script without the change, this template will be produced:

{
  "index_patterns": [
    "ecs-*"
  ],
  "mappings": {
    "_meta": {
      "version": "1.6.0-dev"
    },
    "date_detection": false,
    "dynamic_templates": [
      {
        "strings_as_keyword": {
          "mapping": {
            "ignore_above": 1024,
            "type": "keyword"
          },
          "match_mapping_type": "string"
        }
      }
    ],
    "properties": {
      "file": {
        "properties": {
          "entry_modified": {
            "type": "double" <--- uses a double from the accidental other_file.yml
          }
        }
      },
      "message": {
        "norms": false,
        "type": "text"
      }
    }
  },
  "order": 1,
  "settings": {
    "index": {
      "mapping": {
        "total_fields": {
          "limit": 10000
        }
      },
      "refresh_interval": "5s"
    }
  }
}

It choses the double type because the other_file.yml was alphabetically later:

python scripts/generator.py --out ../gen --include ../test_schema/schema --subset ../test_schema/*.yml
Running generator. ECS version 1.6.0-dev
Loading default schemas
Loading user defined schemas: ['../test_schema/schema/file.yml', '../test_schema/schema/other_file.yml']

What the change is trying to prevent is a user (or ecs core) accidentally defining the top level file schema twice. If that happens, this line fields_nested.update(new_fields) will overwrite the whatever is in the map, and use what is defined in the file that was read last.

@webmat webmat reopened this Apr 21, 2020
@webmat
Copy link
Contributor

webmat commented Apr 21, 2020

Ok I misunderstood. Could you add unit tests for the following cases, please?

  • Raises when an existing field gets redefined (as you explain here)
  • Allows merging 2x the same field sets with only different fields

@jonathan-buttner
Copy link
Contributor Author

Ok I misunderstood. Could you add unit tests for the following cases, please?

  • Raises when an existing field gets redefined (as you explain here)
  • Allows merging 2x the same field sets with only different fields

Yep will do!

@webmat
Copy link
Contributor

webmat commented Apr 29, 2020

@jonathan-buttner I haven't forgotten this. However I'm working on a massive PR that touches some of this code. I may just get this change in my PR. This is a good additional safeguard.

@jonathan-buttner
Copy link
Contributor Author

@jonathan-buttner I haven't forgotten this. However I'm working on a massive PR that touches some of this code. I may just get this change in my PR. This is a good additional safeguard.

Thanks @webmat sounds good!

webmat pushed a commit to webmat/ecs that referenced this pull request May 7, 2020
@webmat
Copy link
Contributor

webmat commented May 11, 2020

@elasticmachine, run elasticsearch-ci/docs

@ebeahan ebeahan added the review label Jul 8, 2020
@ebeahan
Copy link
Member

ebeahan commented Jul 8, 2020

Looks these changes were merged as part of d16c361 (#864)

@ebeahan ebeahan closed this Jul 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants