Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip field from events for event search #149113

Merged
merged 11 commits into from Jan 27, 2023

Conversation

nkhristinin
Copy link
Contributor

@nkhristinin nkhristinin commented Jan 18, 2023

Strip source, and keep only fields used for mappings for events search of IM rule.

Tests on cloud env:
Screenshot 2023-01-19 at 16 42 19

Threat Indicators - 1.500.000 documents
Source - 1.000.000 documents.

No changes:
Screenshot 2023-01-19 at 16 30 19

Strip fields:

Screenshot 2023-01-19 at 16 30 27

@nkhristinin nkhristinin added Team:Security Solution Platform Security Solution Platform Team ci:cloud-deploy Create or update a Cloud deployment release_note:fix labels Jan 18, 2023
@nkhristinin nkhristinin self-assigned this Jan 18, 2023
@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin nkhristinin marked this pull request as ready for review January 19, 2023 15:38
@nkhristinin nkhristinin requested a review from a team as a code owner January 19, 2023 15:38
Copy link
Contributor

@dhurley14 dhurley14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!!! One minor typo fix in the test and another suggestion to add a comment explaining what / why we added the new property for the search after query. Thanks for taking the time to investigate the potential (and now proven!) benefits of ignoring the source of the body from the query response. LGTM!

@@ -555,6 +555,61 @@ describe('create_signals', () => {
});
});

test('it respects overriderBody params', () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test('it respects overriderBody params', () => {
test('it respects overrideBody params', () => {

small typo

@@ -27,6 +28,7 @@ interface BuildEventsSearchQuery {
secondaryTimestamp: TimestampOverride | undefined;
trackTotalHits?: boolean;
additionalFilters?: estypes.QueryDslQueryContainer[];
overrideBody?: OverrideBodyQuery;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkhristinin could we add some js docs comments explaining what this property is used for? Something like this suggestion?

Suggested change
overrideBody?: OverrideBodyQuery;
/**
* overrideBody allows the search after to ignore the _source property of the result, thus reducing the size of the response and increasing the performance of the query.
*/
overrideBody?: OverrideBodyQuery;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks! Added comment to a single search function.

Copy link
Contributor

@e40pud e40pud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream


export interface OverrideBodyQuery {
_source: estypes.SearchSourceConfig;
fields: estypes.Fields;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search query body has some other fields as well, as: sort, query etc.
Defining only 2 of these might be unnecessary restrictive, given its name suggests the whole body override.

On second note - would it make sense to define them as non required fields in this interface? It would make this parameter more flexible, if you would want to override _source field only, for example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion!

I changed those fields to optional.

I was trying to change OverrideBodyQuery to estypes.SearchRequest['body'], but it wasn't working, because looks like
buildEventsSearchQuery return compatible but not the exact type.

Later in a single search after we cast this query to estypes.SearchRequest

So if somebody need to add new fields, it shouldn't be a problem to add it into OverrideBodyQuery

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking into it.
I think it's good to go.
With added comment it should not be an issue to add new fields if needed

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

kibana-ci commented Jan 26, 2023

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Security Solution Tests #3 / Timelines Creates a timeline by clicking untitled timeline from bottom bar can be added notes

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @nkhristinin

@nkhristinin nkhristinin merged commit 9e96440 into elastic:main Jan 27, 2023
@kibanamachine kibanamachine added v8.7.0 backport:skip This commit does not require backporting labels Jan 27, 2023
nkhristinin added a commit that referenced this pull request Feb 2, 2023
## Terms query for Indicator Match rule

TODO: [] need more unit/integrations tests, but ready for review

The indicator match rule will use terms query when it is possible to
search for matches for threat-first-search and for events-first-search.

## How the match query worked:

Example for threat-first-search.
If we have matching conditions like:
`host.ip ==== indicator.host.ip` or (`source.name ===
indicator.source.name` AND `host.name === indicator.host.name`)

It will generate queries like:
```
match: {host.ip: "1"},  
or
match: {host.ip: "2"}
or
match: {host.ip: "3"}
or
(match: {source.name: "1"} and match: {host.name: "1"})
or
(match: {source.name: "2"} and match: {host.name: "2"})
or
(match: {source.name: "3"} and match: {host.name: "3"})
```

Each match will also have `_name` fields like:
`${threatId}_${threatIndex}_${threatFields}_${sourceField}`
So and because it's 1:1 relation between match and response, later at
enrichment stage will be clear which threat matches which event.


## Terms query.

We do fetch info about mapping for fields which use for match conditions
of the IM rule.
Terms query doesn't support all field types, this is why there is some
allowed list which field types.
Terms query not applied for AND conditions. 


For example:
Fields types

host.ip - `ip`
user.name - `keyword`
user.description - `text`
indicator.host.ip_range - `ip_range`

`host.ip === indicator.host.ip` or `host.ip_range === indicator.host.ip`
or (`source.name === indicator.source.name` AND `host.name ===
indicator.host.name`)

It will generate queries like:
```
terms: {host.ip: ["1","2","3"]},  
or
match: {host.ip_range: "1"} // terms query support range fields, but it will be difficult later to understand which threat match which event, because we can have more than 1 response for this condition
or
match: {host.ip_range: "2"}
or
(match: {source.name: "1"} and match: {host.name: "1"})
or
(match: {source.name: "2"} and match: {host.name: "2"})
or
(match: {source.name: "3"} and match: {host.name: "3"})
```

For terms query, we don't know which response matches with events, this
is why we do match it back in the code.

## Other changes
Threat-first-search - will do one extra request to have all matched
threats.
For example:
The threat index has 1.000.000 documents.
IM rule gets the first batch of 9.000 threats and builds a query to the
events index.
It returns 100 events (max_signal = 100).
Then it tries to enrich those 100 events with threat info. 
The problem is that the original implementation will enrich with the
only threats from this 9.000 batch.
And it will ignore other matches in 1.000.000 threats.

This way we do one extra request in the end from potential alerts to
threat index.


# Tests performance

In the best case, it can improve performance by around 3x times.

[Base](#149113)
Threat Indicators - 1.500.000 documents
Source - 1.000.000 documents.
1 field for match condition
<img width="557" alt="213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775"
src="https://user-images.githubusercontent.com/7609147/215526984-ff027ba1-2f64-49fe-8fe8-a23ff4eda4dc.png">

This PR:
<img width="537" alt="Screenshot 2023-01-30 at 20 20 32"
src="https://user-images.githubusercontent.com/7609147/215575128-730514ac-a186-4ab8-87fd-af2ea8f79cec.png">

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
ogupte pushed a commit to ogupte/kibana that referenced this pull request Feb 3, 2023
## Terms query for Indicator Match rule

TODO: [] need more unit/integrations tests, but ready for review

The indicator match rule will use terms query when it is possible to
search for matches for threat-first-search and for events-first-search.

## How the match query worked:

Example for threat-first-search.
If we have matching conditions like:
`host.ip ==== indicator.host.ip` or (`source.name ===
indicator.source.name` AND `host.name === indicator.host.name`)

It will generate queries like:
```
match: {host.ip: "1"},  
or
match: {host.ip: "2"}
or
match: {host.ip: "3"}
or
(match: {source.name: "1"} and match: {host.name: "1"})
or
(match: {source.name: "2"} and match: {host.name: "2"})
or
(match: {source.name: "3"} and match: {host.name: "3"})
```

Each match will also have `_name` fields like:
`${threatId}_${threatIndex}_${threatFields}_${sourceField}`
So and because it's 1:1 relation between match and response, later at
enrichment stage will be clear which threat matches which event.


## Terms query.

We do fetch info about mapping for fields which use for match conditions
of the IM rule.
Terms query doesn't support all field types, this is why there is some
allowed list which field types.
Terms query not applied for AND conditions. 


For example:
Fields types

host.ip - `ip`
user.name - `keyword`
user.description - `text`
indicator.host.ip_range - `ip_range`

`host.ip === indicator.host.ip` or `host.ip_range === indicator.host.ip`
or (`source.name === indicator.source.name` AND `host.name ===
indicator.host.name`)

It will generate queries like:
```
terms: {host.ip: ["1","2","3"]},  
or
match: {host.ip_range: "1"} // terms query support range fields, but it will be difficult later to understand which threat match which event, because we can have more than 1 response for this condition
or
match: {host.ip_range: "2"}
or
(match: {source.name: "1"} and match: {host.name: "1"})
or
(match: {source.name: "2"} and match: {host.name: "2"})
or
(match: {source.name: "3"} and match: {host.name: "3"})
```

For terms query, we don't know which response matches with events, this
is why we do match it back in the code.

## Other changes
Threat-first-search - will do one extra request to have all matched
threats.
For example:
The threat index has 1.000.000 documents.
IM rule gets the first batch of 9.000 threats and builds a query to the
events index.
It returns 100 events (max_signal = 100).
Then it tries to enrich those 100 events with threat info. 
The problem is that the original implementation will enrich with the
only threats from this 9.000 batch.
And it will ignore other matches in 1.000.000 threats.

This way we do one extra request in the end from potential alerts to
threat index.


# Tests performance

In the best case, it can improve performance by around 3x times.

[Base](elastic#149113)
Threat Indicators - 1.500.000 documents
Source - 1.000.000 documents.
1 field for match condition
<img width="557" alt="213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775"
src="https://user-images.githubusercontent.com/7609147/215526984-ff027ba1-2f64-49fe-8fe8-a23ff4eda4dc.png">

This PR:
<img width="537" alt="Screenshot 2023-01-30 at 20 20 32"
src="https://user-images.githubusercontent.com/7609147/215575128-730514ac-a186-4ab8-87fd-af2ea8f79cec.png">

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
kqualters-elastic pushed a commit to kqualters-elastic/kibana that referenced this pull request Feb 6, 2023
Strip source, and keep only fields used for mappings for events search
of IM rule.

Tests on cloud env:
<img width="1182" alt="Screenshot 2023-01-19 at 16 42 19"
src="https://user-images.githubusercontent.com/7609147/213486861-1a0e5a20-b575-4ed6-ac9b-cf9fb8a8c14f.png">

Threat Indicators - 1.500.000 documents
Source - 1.000.000 documents.

No changes:
<img width="564" alt="Screenshot 2023-01-19 at 16 30 19"
src="https://user-images.githubusercontent.com/7609147/213484465-ad20d5b2-c7a7-42d3-9d92-478c5d046636.png">

Strip fields:

<img width="557" alt="Screenshot 2023-01-19 at 16 30 27"
src="https://user-images.githubusercontent.com/7609147/213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775.png">

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
darnautov pushed a commit to darnautov/kibana that referenced this pull request Feb 7, 2023
Strip source, and keep only fields used for mappings for events search
of IM rule.

Tests on cloud env:
<img width="1182" alt="Screenshot 2023-01-19 at 16 42 19"
src="https://user-images.githubusercontent.com/7609147/213486861-1a0e5a20-b575-4ed6-ac9b-cf9fb8a8c14f.png">

Threat Indicators - 1.500.000 documents
Source - 1.000.000 documents.

No changes:
<img width="564" alt="Screenshot 2023-01-19 at 16 30 19"
src="https://user-images.githubusercontent.com/7609147/213484465-ad20d5b2-c7a7-42d3-9d92-478c5d046636.png">

Strip fields:

<img width="557" alt="Screenshot 2023-01-19 at 16 30 27"
src="https://user-images.githubusercontent.com/7609147/213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775.png">

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
darnautov pushed a commit to darnautov/kibana that referenced this pull request Feb 7, 2023
## Terms query for Indicator Match rule

TODO: [] need more unit/integrations tests, but ready for review

The indicator match rule will use terms query when it is possible to
search for matches for threat-first-search and for events-first-search.

## How the match query worked:

Example for threat-first-search.
If we have matching conditions like:
`host.ip ==== indicator.host.ip` or (`source.name ===
indicator.source.name` AND `host.name === indicator.host.name`)

It will generate queries like:
```
match: {host.ip: "1"},  
or
match: {host.ip: "2"}
or
match: {host.ip: "3"}
or
(match: {source.name: "1"} and match: {host.name: "1"})
or
(match: {source.name: "2"} and match: {host.name: "2"})
or
(match: {source.name: "3"} and match: {host.name: "3"})
```

Each match will also have `_name` fields like:
`${threatId}_${threatIndex}_${threatFields}_${sourceField}`
So and because it's 1:1 relation between match and response, later at
enrichment stage will be clear which threat matches which event.


## Terms query.

We do fetch info about mapping for fields which use for match conditions
of the IM rule.
Terms query doesn't support all field types, this is why there is some
allowed list which field types.
Terms query not applied for AND conditions. 


For example:
Fields types

host.ip - `ip`
user.name - `keyword`
user.description - `text`
indicator.host.ip_range - `ip_range`

`host.ip === indicator.host.ip` or `host.ip_range === indicator.host.ip`
or (`source.name === indicator.source.name` AND `host.name ===
indicator.host.name`)

It will generate queries like:
```
terms: {host.ip: ["1","2","3"]},  
or
match: {host.ip_range: "1"} // terms query support range fields, but it will be difficult later to understand which threat match which event, because we can have more than 1 response for this condition
or
match: {host.ip_range: "2"}
or
(match: {source.name: "1"} and match: {host.name: "1"})
or
(match: {source.name: "2"} and match: {host.name: "2"})
or
(match: {source.name: "3"} and match: {host.name: "3"})
```

For terms query, we don't know which response matches with events, this
is why we do match it back in the code.

## Other changes
Threat-first-search - will do one extra request to have all matched
threats.
For example:
The threat index has 1.000.000 documents.
IM rule gets the first batch of 9.000 threats and builds a query to the
events index.
It returns 100 events (max_signal = 100).
Then it tries to enrich those 100 events with threat info. 
The problem is that the original implementation will enrich with the
only threats from this 9.000 batch.
And it will ignore other matches in 1.000.000 threats.

This way we do one extra request in the end from potential alerts to
threat index.


# Tests performance

In the best case, it can improve performance by around 3x times.

[Base](elastic#149113)
Threat Indicators - 1.500.000 documents
Source - 1.000.000 documents.
1 field for match condition
<img width="557" alt="213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775"
src="https://user-images.githubusercontent.com/7609147/215526984-ff027ba1-2f64-49fe-8fe8-a23ff4eda4dc.png">

This PR:
<img width="537" alt="Screenshot 2023-01-30 at 20 20 32"
src="https://user-images.githubusercontent.com/7609147/215575128-730514ac-a186-4ab8-87fd-af2ea8f79cec.png">

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
benakansara pushed a commit to benakansara/kibana that referenced this pull request Feb 7, 2023
## Terms query for Indicator Match rule

TODO: [] need more unit/integrations tests, but ready for review

The indicator match rule will use terms query when it is possible to
search for matches for threat-first-search and for events-first-search.

## How the match query worked:

Example for threat-first-search.
If we have matching conditions like:
`host.ip ==== indicator.host.ip` or (`source.name ===
indicator.source.name` AND `host.name === indicator.host.name`)

It will generate queries like:
```
match: {host.ip: "1"},  
or
match: {host.ip: "2"}
or
match: {host.ip: "3"}
or
(match: {source.name: "1"} and match: {host.name: "1"})
or
(match: {source.name: "2"} and match: {host.name: "2"})
or
(match: {source.name: "3"} and match: {host.name: "3"})
```

Each match will also have `_name` fields like:
`${threatId}_${threatIndex}_${threatFields}_${sourceField}`
So and because it's 1:1 relation between match and response, later at
enrichment stage will be clear which threat matches which event.


## Terms query.

We do fetch info about mapping for fields which use for match conditions
of the IM rule.
Terms query doesn't support all field types, this is why there is some
allowed list which field types.
Terms query not applied for AND conditions. 


For example:
Fields types

host.ip - `ip`
user.name - `keyword`
user.description - `text`
indicator.host.ip_range - `ip_range`

`host.ip === indicator.host.ip` or `host.ip_range === indicator.host.ip`
or (`source.name === indicator.source.name` AND `host.name ===
indicator.host.name`)

It will generate queries like:
```
terms: {host.ip: ["1","2","3"]},  
or
match: {host.ip_range: "1"} // terms query support range fields, but it will be difficult later to understand which threat match which event, because we can have more than 1 response for this condition
or
match: {host.ip_range: "2"}
or
(match: {source.name: "1"} and match: {host.name: "1"})
or
(match: {source.name: "2"} and match: {host.name: "2"})
or
(match: {source.name: "3"} and match: {host.name: "3"})
```

For terms query, we don't know which response matches with events, this
is why we do match it back in the code.

## Other changes
Threat-first-search - will do one extra request to have all matched
threats.
For example:
The threat index has 1.000.000 documents.
IM rule gets the first batch of 9.000 threats and builds a query to the
events index.
It returns 100 events (max_signal = 100).
Then it tries to enrich those 100 events with threat info. 
The problem is that the original implementation will enrich with the
only threats from this 9.000 batch.
And it will ignore other matches in 1.000.000 threats.

This way we do one extra request in the end from potential alerts to
threat index.


# Tests performance

In the best case, it can improve performance by around 3x times.

[Base](elastic#149113)
Threat Indicators - 1.500.000 documents
Source - 1.000.000 documents.
1 field for match condition
<img width="557" alt="213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775"
src="https://user-images.githubusercontent.com/7609147/215526984-ff027ba1-2f64-49fe-8fe8-a23ff4eda4dc.png">

This PR:
<img width="537" alt="Screenshot 2023-01-30 at 20 20 32"
src="https://user-images.githubusercontent.com/7609147/215575128-730514ac-a186-4ab8-87fd-af2ea8f79cec.png">

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:cloud-deploy Create or update a Cloud deployment release_note:enhancement Team:Security Solution Platform Security Solution Platform Team v8.7.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants