@MikePaquette MikePaquette commented Aug 15, 2025

1. What does this PR do?

Adds newly proposed entity.* fields to the Entity Fields RFC
Adds currently used entity.* fields to this RFC
Proposes a nested location in the naming hierarchy for entity.* fields populated by ECS producers
Proposes a root namespace location for entity.* fields when used by ECS consumers and entity-related data stores.

2. Which ECS fields are affected/introduced?

@MikePaquette MikePaquette requested a review from a team as a code owner August 15, 2025 15:35

@eyalkraft

Hi Mike, Thanks for the proposal!

Is there a reason for having the new entity.schema_version instead of using the standard ecs.version?

@MikePaquette
Contributor Author

> Is there a reason for having the new entity.schema_version instead of using the standard ecs.version?

Good question. I don't know when this was added, or how it is used, but I found it in the existing mappings in 8.19.1.

"entity": {
          "properties": {
            "definition_id": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "definition_version": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "display_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 1024
                }
              }
            },
            "id": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "identity_fields": {
              "type": "keyword"
            },
            "last_seen_timestamp": {
              "type": "date"
            },
            "name": {
              "type": "keyword"
            },
            "schema_version": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "source": {
              "type": "keyword"
            },
            "type": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },

| Field | Type | Description |
|-------|------|-------------|
| entity.definition_id | keyword | Used Elastic solutions (e.g., Security, Observability) to denote the ID of the entity definition which is used to extract entity details from ingested logs, events, intelligence, and other data types. Use of this value is reserved, and ECS producers, including data ingestion pipelines, must not populate this field|
| entity.definition_id | keyword | Used by Elastic solutions (e.g., Security, Observability) to denote the version of the entity definition which is used to extract entity details from ingested logs, events, intelligence, and other data types. Use of this value is reserved, and ECS producers, including data ingestion pipelines, must not populate this field|
Contributor

entity.definition_id is repeated here


| Field | Type | Description |
|-------|------|-------------|
| entity.definition_id | keyword | Used Elastic solutions (e.g., Security, Observability) to denote the ID of the entity definition which is used to extract entity details from ingested logs, events, intelligence, and other data types. Use of this value is reserved, and ECS producers, including data ingestion pipelines, must not populate this field|
Contributor

I'm not too sure about this field being reserved. There are no other fields in ECS that are reserved for use by Elastic, and I don't really know if it makes sense to have them. Since the intention of ECS is a common schema that's shared with others, I don't know if this would make sense to include.

Are there alternatives to adding this, such as using a custom field that's not defined in ECS?

Contributor Author

Yes, I agree, it's an internal-only field, so there is no need to have it defined in the ECS spec. We already intend to publish the entity store schema in our docs. Any problem with using entity.Definition_id as a custom field (that is, nesting a custom leaf field under the ECS-defined root object entity.*)?

Member

@MikePaquette as we speak we are storing this information under entity.Metadata.EngineType. What do you think about keeping it under metadata?

| entity.reference | keyword | A URI, URL, or other direct reference to access or locate the entity in its source system. This could be an API endpoint, web console URL, or other addressable location. Format may vary by entity type and source system. |
| entity.attributes.* | object | Normalized entity attributes using capitalized field names (e.g., `entity.attributes.StorageClass`, `entity.attributes.MfaEnabled`). Use this field set when you need specific data types, advanced search capabilities, or normalized values across different providers/sources. The capitalization pattern indicates these are entity-specific fields that won't be enumerated in the ECS schema. |
| entity.raw.* | flattened | Original, unmodified fields from the source system stored in a flattened format that maintains basic searchability. While `entity.attributes` should be used for normalized fields requiring advanced queries, this field preserves all source metadata with basic search capabilities. Supports existence queries, exact value matches, and simple aggregations. |
| entity.attributes.* | object | A set of static or semi-static attributes of the entity. Usually boolean or keyword field data types. Examples include: `entity.attributes.Storage_class`, `entity.attributes.Mfa_enabled` , `entity.attributes.Privileged` , `entity.attributes.Granted_permissions` , `entity.attributes.Known_redirect` , `entity.attributes.Asset` , `entity.attributes.Managed` ,`entity.attrbitues.Os_current` , `entity.attibutes.Os_patch_current` , `entity.attributes.Oauth_consent_restriction`). Use this field set when you need to track static or semi-static characterstics of an entity for advanced searching and correlation of normalized values across different providers/sources and entity types. Note the initial capitalization pattern for Examples indicates that any such fields are custom entity-specific fields that won't be enumerated in the ECS schema, and won't collide with any fields that may be defined by ECS in the future. |
Contributor

Suggested change
| entity.attributes.* | object | A set of static or semi-static attributes of the entity. Usually boolean or keyword field data types. Examples include: `entity.attributes.Storage_class`, `entity.attributes.Mfa_enabled` , `entity.attributes.Privileged` , `entity.attributes.Granted_permissions` , `entity.attributes.Known_redirect` , `entity.attributes.Asset` , `entity.attributes.Managed` ,`entity.attrbitues.Os_current` , `entity.attibutes.Os_patch_current` , `entity.attributes.Oauth_consent_restriction`). Use this field set when you need to track static or semi-static characterstics of an entity for advanced searching and correlation of normalized values across different providers/sources and entity types. Note the initial capitalization pattern for Examples indicates that any such fields are custom entity-specific fields that won't be enumerated in the ECS schema, and won't collide with any fields that may be defined by ECS in the future. |
| entity.attributes | object | A set of static or semi-static attributes of the entity. Usually boolean or keyword field data types. Examples include: `entity.attributes.Storage_class`, `entity.attributes.Mfa_enabled` , `entity.attributes.Privileged` , `entity.attributes.Granted_permissions` , `entity.attributes.Known_redirect` , `entity.attributes.Asset` , `entity.attributes.Managed` ,`entity.attrbitues.Os_current` , `entity.attibutes.Os_patch_current` , `entity.attributes.Oauth_consent_restriction`). Use this field set when you need to track static or semi-static characterstics of an entity for advanced searching and correlation of normalized values across different providers/sources and entity types. Note the initial capitalization pattern for Examples indicates that any such fields are custom entity-specific fields that won't be enumerated in the ECS schema, and won't collide with any fields that may be defined by ECS in the future. |

The existing flattened fields in ECS don't use .* in the name. I think it makes sense to remove here too, to be consistent

Contributor Author

Thanks @mjwolf, what do you mean by "flattened" here? As proposed, the object will contain keyword or boolean leaf fields, not the flattened field datatype. No problem with removing the .* though, thanks.


For ECS producers, such as Beats, Elastic Agent integrations, ingest pipelines, and other methods for shipping data to Elastic, the `entity.*` fields are expected to be nested as follows:
- If the entity type is one of host, user, service, cloud, orchestrator), then the entity fields should be nested under the respecitve root field set, for example `host.entity.*` , `user.entity.*`, etc.
- If the entity type is not one of the above, then that `entity.*` fields should be nested under a new root-level object, called `generic`, as `generic.entity.*`
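
The nesting rule quoted above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `nest_entity_for_producer` and the parameter `entity_class` are hypothetical, and the RFC text does not specify how a concrete `entity.type` value is mapped to one of the root field sets.

```python
# Root field sets that, per the proposal quoted above, carry their own nested
# entity object. Everything else falls back to the proposed `generic` root.
NESTED_ENTITY_ROOTS = {"host", "user", "service", "cloud", "orchestrator"}


def nest_entity_for_producer(entity_class: str, entity: dict) -> dict:
    """Wrap entity fields under the root field set a producer would use.

    `entity_class` is a hypothetical input naming the class of the entity;
    how it is derived from source data is outside the scope of this sketch.
    """
    root = entity_class if entity_class in NESTED_ENTITY_ROOTS else "generic"
    # Copy the entity so the caller's dict is not shared with the result.
    return {root: {"entity": dict(entity)}}


host_doc = nest_entity_for_producer("host", {"id": "h-1"})
bucket_doc = nest_entity_for_producer("storage_bucket", {"id": "b-1"})
```

Here `host_doc` nests the fields under `host.entity`, while the unrecognized class ends up under `generic.entity`.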
Contributor

Do you think generic will have any other fields apart from generic.entity.*?

Contributor Author

Not foreseen. Can anyone else think of a reason why we'd use the root field set generic.* for another purpose?

Member

I can't think of any other reason


### ECS Consumers or Data Stores

For ECS consumers, such as the Elastic Security Solution entity store indices, the `entity.*` fields should be used directly at the root of the events.
Contributor

Could you expand on what this means? I don't think I understand it. What do you mean by "used"? Why can a data store only use entity.* if the producers can write to other top-level field sets?

I think this is also an implementation specific detail that doesn't need to be part of ECS, so maybe it can be removed.

Contributor Author

I agree we can remove this from the ECS spec, but the idea is that the entity store schema will include entity.* at the root level in each document. This will allow us and users to query the entity store for entity IDs (and other attributes) without burdening them with knowing under which entity class {host, user, service, generic, etc.} it might be stored.

MikePaquette and others added 2 commits September 4, 2025 07:51
typo

Co-authored-by: Uri Weisman <68195305+uri-weisman@users.noreply.github.com>
fix typos

Co-authored-by: Rômulo Farias <romulodefarias@gmail.com>

| Field | Type | Description |
|-------|------|-------------|
| entity.definition_id | keyword | Used Elastic solutions (e.g., Security, Observability) to denote the ID of the entity definition which is used to extract entity details from ingested logs, events, intelligence, and other data types. Use of this value is reserved, and ECS producers, including data ingestion pipelines, must not populate this field|
| entity.definition_id | keyword | Used by Elastic solutions (e.g., Security, Observability) to denote the version of the entity definition which is used to extract entity details from ingested logs, events, intelligence, and other data types. Use of this value is reserved, and ECS producers, including data ingestion pipelines, must not populate this field|
| entity.schema_version | keyword | Denotes the version of the entity schema,as published in Elastic Security documentation, to which this entity information conforms. Usually conforms to the Elastic Stack version.
Member

Is this a new guideline? Or is it something that already happens? I'm not aware of an entity schema published in the Elastic Security docs that usually conforms to the Elastic Stack version.

Contributor Author

No, I found this in the existing 8.19.1 index mappings, and was not aware of it. I included it here in case it was already in use, but it does not appear to be used, so we can remove it from this RFC.

| entity.reference | keyword | A URI, URL, or other direct reference to access or locate the entity in its source system. This could be an API endpoint, web console URL, or other addressable location. Format may vary by entity type and source system. |
| entity.attributes.* | object | Normalized entity attributes using capitalized field names (e.g., `entity.attributes.StorageClass`, `entity.attributes.MfaEnabled`). Use this field set when you need specific data types, advanced search capabilities, or normalized values across different providers/sources. The capitalization pattern indicates these are entity-specific fields that won't be enumerated in the ECS schema. |
| entity.raw.* | flattened | Original, unmodified fields from the source system stored in a flattened format that maintains basic searchability. While `entity.attributes` should be used for normalized fields requiring advanced queries, this field preserves all source metadata with basic search capabilities. Supports existence queries, exact value matches, and simple aggregations. |
| entity.attributes | object | A set of static or semi-static attributes of the entity. Usually boolean or keyword field data types. Examples include: `entity.attributes.Storage_class`, `entity.attributes.Mfa_enabled` , `entity.attributes.Privileged` , `entity.attributes.Granted_permissions` , `entity.attributes.Known_redirect` , `entity.attributes.Asset` , `entity.attributes.Managed` ,`entity.attributes.Os_current` , `entity.attributes.Os_patch_current` , `entity.attributes.Oauth_consent_restriction`). Use this field set when you need to track static or semi-static characterstics of an entity for advanced searching and correlation of normalized values across different providers/sources and entity types. Note the initial capitalization pattern for Examples indicates that any such fields are custom entity-specific fields that won't be enumerated in the ECS schema, and won't collide with any fields that may be defined by ECS in the future. |
Member

The current pattern of snake case with capital first letter isn't friendly for most coding tools and linters, since it's not a standard pattern.

Could we adopt a pattern such as PascalCase? Reading the documentation on capitalization for non-ECS fields, I see precedent for it, since it's mentioned in both the HAProxy and NGINX examples.

Contributor Author

@mjwolf do you have a position on this topic?

Contributor

ECS uses snake case for multiword fields, so I think it makes the most sense to keep using that. OTel semantic conventions also use snake case.

I think for these examples in documentation it should keep using snake case. For the actual implementations, we don't require it, so other cases are allowed (as long as it starts with a capital).

Contributor Author

For ECS, this is irrelevant because we should not provide examples using custom fields.

| entity.attributes | object | A set of static or semi-static attributes of the entity. Usually boolean or keyword field data types. Examples include: `entity.attributes.Storage_class`, `entity.attributes.Mfa_enabled` , `entity.attributes.Privileged` , `entity.attributes.Granted_permissions` , `entity.attributes.Known_redirect` , `entity.attributes.Asset` , `entity.attributes.Managed` ,`entity.attributes.Os_current` , `entity.attributes.Os_patch_current` , `entity.attributes.Oauth_consent_restriction`). Use this field set when you need to track static or semi-static characterstics of an entity for advanced searching and correlation of normalized values across different providers/sources and entity types. Note the initial capitalization pattern for Examples indicates that any such fields are custom entity-specific fields that won't be enumerated in the ECS schema, and won't collide with any fields that may be defined by ECS in the future. |
Member

I understand what you mean by "static or semi-static attributes" as opposed to lifecycle and behaviour fields, but maybe there is a better way of describing what we mean by that? By the static or semi-static definition, first_seen and issued_at would also fit under it, wouldn't they?

As I see it, attributes should be used to describe non-temporal entity properties that cannot be expressed in other parts of the entity field set.

Contributor Author

I see your point, perhaps we can add "non-temporal" to the definition of the entity.attributes.* fields? Would that help?

Member

I think so!

| entity.attributes | object | A set of static or semi-static attributes of the entity. Usually boolean or keyword field data types. Examples include: `entity.attributes.Storage_class`, `entity.attributes.Mfa_enabled` , `entity.attributes.Privileged` , `entity.attributes.Granted_permissions` , `entity.attributes.Known_redirect` , `entity.attributes.Asset` , `entity.attributes.Managed` ,`entity.attributes.Os_current` , `entity.attributes.Os_patch_current` , `entity.attributes.Oauth_consent_restriction`). Use this field set when you need to track static or semi-static characterstics of an entity for advanced searching and correlation of normalized values across different providers/sources and entity types. Note the initial capitalization pattern for Examples indicates that any such fields are custom entity-specific fields that won't be enumerated in the ECS schema, and won't collide with any fields that may be defined by ECS in the future. |
| entity.lifecycle.* | object | A set of temporal characteristics of the entity. Usually date field data type. Examples include: `entity.lifecycle.First_seen`, `entity.lifecycle.Last_activity` , `entity.lifecycle.Issued_at` , `entity.lifecycle.Last_password_change` ,etc. ). Use this field set when you need to track temporal characterstics of an entity for advanced searching and correlation of normalized values across different providers/sources and entity types. Note the initial capitalization pattern indicates that any such fields are custom entity-specific fields that won't be enumerated in the ECS schema, and won't collide with any fields that may be defined by ECS in the future. |
| entity.behavior.* | object | A set of ephemeral characteristics of the entity, derived from observed behaviors during a specific time period. Behaviors are usually captured in event logs under fields such as `event.action` and other fields, but this field set captures "attributified" behavior indicators, using semantics like "this behavior was seen one or more times during this time period." Sytems using this field set may need to force a "reset" of these behavioral indicators at the end of their current period. Usually boolean field data type. Examples include: `entity.behavior.Used_usb_device`, `entity.behavior.Brute_force_victim` , `entity.behavior.New_country_login` ,etc. ). Use this field set when you need to capture and track ephemeral characterstics of an entity for advanced searching, correlation of normalized values across different providers/sources and entity types. Note the initial capitalization pattern indicates that any such fields are custom entity-specific fields that won't be enumerated in the ECS schema, and won't collide with any fields that may be defined by ECS in the future. |
| entity.raw.* | object | Original, unmodified fields from the source system stored in a flattened format that maintains basic searchability. While `entity.attributes` should be used for normalized fields requiring advanced queries, this field preserves all source metadata with basic search capabilities. Supports existence queries, exact value matches, and simple aggregations. |
Member

Should be of type flattened instead of object, no?

Contributor Author

I think so, for entity.raw. We've agreed to remove the .* from the objects, and this would be the same.


MikePaquette and others added 3 commits September 9, 2025 07:35
clarification.

Co-authored-by: Rômulo Farias <romulodefarias@gmail.com>
two typos.

Co-authored-by: Rômulo Farias <romulodefarias@gmail.com>
extraneous ")"

Co-authored-by: Rômulo Farias <romulodefarias@gmail.com>
romulets added a commit to elastic/kibana that referenced this pull request Sep 23, 2025
## Summary

Add Upsert Entity API which reflects changes made via the API directly
in the final entities index.

#### What is implemented
- Update documents
  - Allowed fields:
    - `entity.attributes.*`
    - `entity.lifecycle.*`
    - `entity.behavior.*`
- Force update documents


#### Added ES Assets:

- Component Template `security_${type}_default-updates@platform`
- Index Template
`entities_v1_updates_security_${type}_default_index_template`
- Index `.entities.v1.updates.security_${type}_default`

#### What is not implemented
- Create
- ILM Policy to delete update documents

#### How to test

Ingest entities and run in the dev console:
```
PUT kbn:/api/entity_store/entities/generic
{
    "entity": {
        "id": "<ID>",
        "attributes": {
          "StorageClass": "hot"
        }
    }
}
```


### How it works

Before explaining the API itself, here is a refresher on the entity store.

<details>
<summary> Entity Store Diagram </summary>

```mermaid
flowchart TB
    subgraph Main Flow
        A[(.logs*)] ~~~~ B[Transform]
        B ---> |Fetches raw entity data| A
        B ---> | Sends Aggregated Data | G{Ingest Pipeline}
        G --> | Combines new and old data and stores it| C[(.entity.v1.latest*)]
    end

    G -.-> | Fetches data older than transform retention policy| D[(.enrich-index-entities)]

    subgraph Retention Policy Flow
        direction LR 
        E((Kibana Task)) -->|trigger every hour| F[Enrich Policy Entities]
        F -.->| Fetches most upto date entities| C
        F --->| Stores data | D
    end
 ```

</details>

The entity store is built on a Transform with a look-back period of X hours (currently 3h). That means data older than the look-back period isn't retained by the transform. To solve that, an Enrich Policy takes hourly snapshots of the current state of the entity store and makes them available, via the ingest pipeline, to enrich entity updates, so that data older than the look-back period is still present. Awesome.

This adds complexity to this feature. The goal is to add an API that, once called, reflects data changes immediately in the latest index. A few things were considered:

- ❌ Add a new document to an update index to be picked up by the transform.
  - That doesn't satisfy the requirement, because changes become available only after the transform finishes its run.
- ❌ Perform an update by query on the latest index.
  - That works great as long as the entity in the latest index doesn't get any other update via the transform, which of course we can't guarantee.
  
So the solution we arrived at was to both perform an update by query on the latest index and publish an update document to be picked up by the transform; this way we get the best of both worlds.

- First, it runs an update by query on `.entities.v1.latest.security_$TYPE_default` (the update is made via Painless).
- Then, it indexes a new document in `.entities.v1.updates.security_$TYPE_default` to be picked up by the transform.
```mermaid
flowchart LR
    A[User] -->|PUT /api/entity_store/entities/$TYPE| B[Kibana]
B --> |update by query| C[(.entities.v1.latest.security_$TYPE_default)]
B --> |create new doc| D[(.entities.v1.updates.security_$TYPE_default)]
```
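
The two-step write described above can be sketched without any Elasticsearch client. This is a minimal in-memory illustration, not the Kibana implementation: the function name `upsert_entity`, the dict standing in for the latest index, and the list standing in for the updates index are all assumptions made for the example. Whether the real API still publishes an update document when the entity is missing from the latest index is not stated here; the sketch simply does so.

```python
from copy import deepcopy


def upsert_entity(latest_index: dict, updates_index: list,
                  entity_id: str, changes: dict) -> None:
    """Apply `changes` to the latest index and queue them for the transform.

    `latest_index` maps entity id -> entity document (hypothetical stand-in
    for the latest index); `updates_index` is an append-only list
    (hypothetical stand-in for the updates index).
    """
    # Step 1: stand-in for the update by query, so the change is visible
    # immediately. Creation is not implemented, so unknown ids are skipped.
    if entity_id in latest_index:
        for field_set, values in changes.items():
            latest_index[entity_id].setdefault(field_set, {}).update(values)

    # Step 2: stand-in for indexing an update document, so the next transform
    # run sees the same change and does not overwrite it with older data.
    updates_index.append({"entity": {"id": entity_id, **deepcopy(changes)}})


latest = {"e-1": {"id": "e-1", "attributes": {"Managed": True}}}
updates = []
upsert_entity(latest, updates, "e-1", {"attributes": {"Storage_class": "hot"}})
```

After the call, `latest["e-1"]` carries both attributes and `updates` holds one document describing the change.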

We considered adding a priority mechanism to the update index to make sure that documents published to it would be picked up. First, we found out that we don't need to make sure a document is seen by the transform: by definition, a transform processes every document and has no mechanism to drop documents when processing takes too long. Second, we couldn't do it anyway, because the aggregations we run already sort to find the latest values, and sorting on multiple fields is not possible.

### Fields and Schema

Prior to this PR, non-generic entities (`user`, `host`, and `service`) had no exposure to the concepts defined in the proposed `entity.*` ECS schema. We had to address this to be able to make changes to the `entity.attributes`, `entity.lifecycle` and `entity.behavior` fields.

[The current direction](elastic/ecs#2513) is that `entity.*` fields will be nested under `user`, `host`, `service` and `generic` for data input, while the latest index, which holds the final entities, will have a root `entity.*` field set.

In other words, there is a difference between entity data input location and entity data output location.
The document
```json
{ "user": { "entity": { "id" : "romulo", "type": "aws-user" } } }
```
will be represented in the latest index as
```json
{ "entity": { "id" : "romulo", "type": "aws-user" } } 
```
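
To make the input/output difference concrete, here is a small sketch of lifting a nested entity object to the root. It is illustrative only: `to_latest_index_doc` and `ENTITY_ROOTS` are hypothetical names, and the real mapping is done by the entity definitions and transform, not by code like this.

```python
# Root field sets under which input documents nest `entity.*`, as listed above.
ENTITY_ROOTS = ("user", "host", "service", "generic")


def to_latest_index_doc(input_doc: dict) -> dict:
    """Return the root-level `entity` representation of an input document."""
    for root in ENTITY_ROOTS:
        field_set = input_doc.get(root)
        if isinstance(field_set, dict) and isinstance(field_set.get("entity"), dict):
            # Copy so the output does not alias the input document.
            return {"entity": dict(field_set["entity"])}
    raise ValueError("no nested entity object found in input document")


source = {"user": {"entity": {"id": "romulo", "type": "aws-user"}}}
latest_doc = to_latest_index_doc(source)
```

With the input document shown above, `latest_doc` has the same `id` and `type`, but under a root-level `entity` object.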

Because of the current direction of the discussion, we decided to move that way already. Therefore this PR contains changes to the entity definitions themselves, adding entity fields that use `{TYPE}.entity.*` as the data source and `entity.*` as the destination (`x-pack/solutions/security/plugins/security_solution/server/lib/entity_analytics/entity_store/entity_definitions/entity_descriptions/common.ts`).

That also posed another question: what will the input look like? Will the API accept the entity "input" format or the entity "output" format?

I decided to stay close to the "output" format, so the API accepts `entity.*` JSON fields and applies them to the entity store. The reason behind it is simplicity of the API. I believe that having an inconsistent placement for `entity` in the API isn't a great experience, therefore always accepting
```json
{ "entity": { "id" : "romulo", "type": "aws-user" } } 
```
is better imo.

**That's contradictory to the input via logs however**. Curious to hear people's opinion.

There is another problem that further deviates the API from any ECS definition (input or output). For fields under `entity.attributes`, `entity.lifecycle` and `entity.behavior` we decided to define them on ECS. And because they are "custom fields", product would like them to have a `Capital_snake_case` format, which is not a traditional format, and developing in TS with such a format is not really allowed at the moment. To curb that, the API exposes those fields as `snake_case` and converts them to `Capital_snake_case` before storing them. That was the best way I found while still having field definitions in the OpenAPI spec.
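
The `snake_case` to `Capital_snake_case` conversion mentioned above amounts to upper-casing the first character of each custom field name. A minimal sketch, assuming the conversion applies only to the direct children of the three custom field sets; the helper names are hypothetical and this is not the Kibana code:

```python
# Field sets whose children are custom fields, per the description above.
CUSTOM_FIELD_SETS = ("attributes", "lifecycle", "behavior")


def to_capital_snake_case(name: str) -> str:
    """Upper-case only the first character: storage_class -> Storage_class."""
    return name[:1].upper() + name[1:]


def capitalize_custom_fields(entity: dict) -> dict:
    """Return a copy of `entity` with custom field names capitalized."""
    converted = dict(entity)
    for field_set in CUSTOM_FIELD_SETS:
        values = entity.get(field_set)
        if isinstance(values, dict):
            converted[field_set] = {
                to_capital_snake_case(key): value for key, value in values.items()
            }
    return converted


api_payload = {"id": "e-1", "attributes": {"storage_class": "hot", "mfa_enabled": True}}
stored = capitalize_custom_fields(api_payload)
```

Fields outside the three custom field sets, such as `id`, are left untouched.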

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Mark Hopkin <mark.hopkin@elastic.co>
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Sep 24, 2025
## Summary

Add Upsert Entity API which reflects changes made via the API directly
in the final entities index.

#### What is implemented
- Update documents
  - Allowed fields:
    - `entity.attributes.*`
    - `entity.lifecyle.*`
    - `entity.behavior.*`
- Force update documents


#### Added ES Assets:

- Component Template `security_${type}_default-updates@platform`
- Index Template
`entities_v1_updates_security_${type}_default_index_template`
- Index `.entities.v1.updates.security_${type}_default`

#### What is not implemented
- Create
- ILM Policy to delete update documents

#### How to test

Ingest entities and run in the dev console:
```
PUT kbn:/api/entity_store/entities/generic
{
    "entity": {
        "id": "<ID>",
        "attributes": {
          "StorageClass": "hot"
        }
    }
}
```


### How it works

Before explaining the API itself, here is a refresher on the entity store.

<details>
<summary> Entity Store Diagram </summary>

```mermaid
flowchart TB
    subgraph Main Flow
        A[(.logs*)] ~~~~ B[Transform]
        B ---> |Fetches raw entity data| A
        B ---> | Sends Aggregated Data | G{Ingest Pipeline}
        G --> | Combines new and old data and stores it| C[(.entity.v1.latest*)]
    end

    G -.-> | Fetches data older than transform retention policy| D[(.enrich-index-entities)]

    subgraph Retention Policy Flow
        direction LR 
        E((Kibana Task)) -->|trigger every hour| F[Enrich Policy Entities]
        F -.->| Fetches most up to date entities| C
        F --->| Stores data | D
    end
 ```

</details>

The entity store is built on a Transform with a look-back period of X hours (currently 3h). That means data older than the look-back period won't be retained. To solve that, an Enrich Policy takes hourly snapshots of the current state of the entity store and makes them available, via the ingest pipeline, to enrich entity updates, so that data older than the look-back period is still present. Awesome.
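To make the interplay between the look-back window and the hourly snapshot concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the function name `merge_entity`, the document shape, and the "newest value wins" merge rule are all hypothetical and are not taken from the actual transform or ingest pipeline.

```python
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(hours=3)  # look-back period quoted in the text


def merge_entity(transform_docs, snapshot, now):
    """Toy model: combine fresh transform input with an enrich snapshot."""
    # Only documents inside the look-back window are seen by the transform.
    fresh = [d for d in transform_docs if now - d["timestamp"] <= LOOKBACK]
    # Start from the hourly snapshot so that older values are not lost.
    merged = dict(snapshot)
    # Apply fresh documents oldest first, so the newest value wins
    # (an assumed merge rule, for illustration only).
    for doc in sorted(fresh, key=lambda d: d["timestamp"]):
        merged.update(doc["fields"])
    return merged


now = datetime(2025, 9, 24, 12, 0, tzinfo=timezone.utc)
docs = [
    # Older than 3h: outside the window, ignored by the transform.
    {"timestamp": now - timedelta(hours=5), "fields": {"host.os": "linux"}},
    # Inside the window: picked up.
    {"timestamp": now - timedelta(hours=1), "fields": {"host.ip": "10.0.0.2"}},
]
snapshot = {"host.os": "linux", "host.ip": "10.0.0.1"}
result = merge_entity(docs, snapshot, now)
```

In this model the `host.os` value survives only because the snapshot carries it, which is the role the Enrich Policy plays in the diagram.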

This adds complexity to this feature. The goal is to add an API that, once called, reflects data changes immediately in the latest index. A few things were considered:

- ❌ Add a new document to an update index to be picked up by the transform.
  - That doesn't satisfy the requirement, because changes only become available after a transform finishes its run.
- ❌ Perform an update by query in the latest index.
  - That works great only if the entity in the latest index doesn't get any other update via the transform, which we can't guarantee of course.

So the solution we arrived at was to do both: perform an update by query in the latest index and publish an update document to be picked up by the transform. This way we get the best of both worlds.

- First, run an update by query on `.entities.v1.latest.security_$TYPE_default` (the update is made via Painless).
- Then, index a new document in `.entities.v1.updates.security_$TYPE_default` to be picked up by the transform.
```mermaid
flowchart LR
    A[User] -->|PUT /api/entity_store/entities/$TYPE| B[Kibana]
B --> |update by query| C[(.entities.v1.latest.security_$TYPE_default)]
B --> |create new doc| D[(.entities.v1.updates.security_$TYPE_default)]
```
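
The two writes in the flow above can be sketched as plain request bodies. This is a minimal sketch, not the PR's implementation: the helper name `build_upsert_requests`, the `entity.id` term query, the `@timestamp` field, and the Painless script are assumptions made for illustration. Only the two index name patterns come from the description.

```python
from datetime import datetime, timezone


def build_upsert_requests(entity_type, entity_id, attributes, now=None):
    """Build the two request payloads for one API call (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    latest_index = f".entities.v1.latest.security_{entity_type}_default"
    updates_index = f".entities.v1.updates.security_{entity_type}_default"

    # 1) Update by query: change the entity in the latest index right away.
    #    The script below is illustrative only, not the PR's actual script.
    update_by_query = {
        "index": latest_index,
        "body": {
            "query": {"term": {"entity.id": entity_id}},  # assumed match field
            "script": {
                "lang": "painless",
                "source": (
                    "if (ctx._source.entity.attributes == null) "
                    "{ ctx._source.entity.attributes = [:]; } "
                    "ctx._source.entity.attributes.putAll(params.attributes);"
                ),
                "params": {"attributes": attributes},
            },
        },
    }

    # 2) Update document: lets the next transform run see the same change,
    #    so it is not overwritten by older aggregated data.
    update_doc = {
        "index": updates_index,
        "document": {
            "@timestamp": now.isoformat(),  # assumed timestamp field
            "entity": {"id": entity_id, "attributes": attributes},
        },
    }
    return update_by_query, update_doc


by_query, doc = build_upsert_requests("generic", "<ID>", {"StorageClass": "hot"})
```

Keeping both payloads derived from the same arguments is the point of the design: the immediate write and the transform input cannot drift apart.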

We considered adding a priority mechanism to the update index to make sure that documents published to it would be picked up. First, we found out that we don't need to make sure a document is seen by the transform: by definition, a transform processes every document, and it has no mechanism to drop documents when processing takes too long. Second, we couldn't do it anyway, because the aggregations we run already sort to find the latest values, and sorting on multiple fields is not possible.

### Fields and Schema

Prior to this PR, non-generic entities (`user`, `host`, and `service`) had no exposure to the concepts defined in the proposed `entity.*` ECS schema. We had to address this to be able to make changes to `entity.attributes`, `entity.lifecyle` and `entity.behavior` fields.

[The current direction](elastic/ecs#2513) is that `entity.*` fields will be nested under `user`, `host`, `service` and `generic` for data input, while the latest index, which holds the final entities, will have a root `entity.*` field set.

In other words, there is a difference between the entity data input location and the entity data output location.
The document
```json
{ "user": { "entity": { "id" : "romulo", "type": "aws-user" } } }
```
will be represented in the latest index as
```json
{ "entity": { "id" : "romulo", "type": "aws-user" } } 
```
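
The input-to-output relocation shown above can be sketched as a small function. This is an illustration only: `to_output_format` is a hypothetical name, and it deliberately ignores every other field in the document, which the real entity definitions would also map.

```python
ENTITY_TYPES = ("user", "host", "service", "generic")  # types named in the text


def to_output_format(doc):
    """Move a nested `{type}.entity` object to a root `entity` object."""
    for entity_type in ENTITY_TYPES:
        nested = doc.get(entity_type)
        if isinstance(nested, dict) and "entity" in nested:
            # Copy so the input document is left untouched.
            return {"entity": dict(nested["entity"])}
    raise ValueError("no nested entity object found")


source = {"user": {"entity": {"id": "romulo", "type": "aws-user"}}}
latest = to_output_format(source)
```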

Given the current direction of the discussion, we decided to adopt it already. This PR therefore changes the entity definitions themselves, adding entity fields that use `{TYPE}.entity.*` as the data source and `entity.*` as the destination (`x-pack/solutions/security/plugins/security_solution/server/lib/entity_analytics/entity_store/entity_definitions/entity_descriptions/common.ts`).

That also raised another question: what should the API input look like? Should it accept the entity "input" format or the entity "output" format?

I decided to stay close to the "output" format: the API accepts `entity.*` JSON fields and applies them to the entity store. The reason is API simplicity. I believe that an inconsistent placement for `entity` in the API isn't a great experience, so always accepting
```json
{ "entity": { "id" : "romulo", "type": "aws-user" } } 
```
is better imo.

**That's contradictory to the input via logs however**. Curious to hear people's opinion.

There is another problem that makes the API deviate further from any ECS definition (input or output). We decided to define the fields under `entity.attributes`, `entity.lifecyle` and `entity.behavior` on ECS. Because they are "custom fields", product would like them to use a `Capital_snake_case` format, which is not a traditional format, and developing with TS in that case is not really allowed at the moment. To work around that, the API exposes those fields as `snake_case` and converts them to `Capital_snake_case` before storing. That was the best way I found while still keeping the field definitions in the OpenAPI spec.
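
As a rough illustration of that conversion step, the sketch below renames keys just before storage. It assumes that `Capital_snake_case` means "snake_case with only the first letter uppercased". The description does not spell out the exact rule, so treat both helper names and the rule itself as hypothetical.

```python
def to_capital_snake_case(name):
    """Uppercase only the first character (assumed meaning of the format)."""
    return name[:1].upper() + name[1:]


def to_stored_fields(api_fields):
    """Rename API-facing snake_case keys to their assumed stored form."""
    return {to_capital_snake_case(key): value for key, value in api_fields.items()}


stored = to_stored_fields({"storage_class": "hot"})
```

Doing the rename at the storage boundary keeps the OpenAPI spec and the TS types in plain `snake_case`, which is the trade-off the paragraph above describes.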

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Mark Hopkin <mark.hopkin@elastic.co>
@mjwolf mjwolf merged commit 7df7f75 into main Sep 24, 2025
8 checks passed
niros1 pushed a commit to elastic/kibana that referenced this pull request Sep 30, 2025