New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strip field from events for event search #149113
Conversation
…-ref HEAD~1..HEAD --fix'
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!!! One minor typo fix in the test and another suggestion to add a comment explaining what / why we added the new property for the search after query. Thanks for taking the time to investigate the potential (and now proven!) benefits of ignoring the source of the body from the query response. LGTM!
@@ -555,6 +555,61 @@ describe('create_signals', () => { | |||
}); | |||
}); | |||
|
|||
test('it respects overriderBody params', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test('it respects overriderBody params', () => { | |
test('it respects overrideBody params', () => { |
small typo
@@ -27,6 +28,7 @@ interface BuildEventsSearchQuery { | |||
secondaryTimestamp: TimestampOverride | undefined; | |||
trackTotalHits?: boolean; | |||
additionalFilters?: estypes.QueryDslQueryContainer[]; | |||
overrideBody?: OverrideBodyQuery; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nkhristinin could we add some js docs comments explaining what this property is used for? Something like this suggestion?
overrideBody?: OverrideBodyQuery; | |
/** | |
* overrideBody allows the search after to ignore the _source property of the result, thus reducing the size of the response and increasing the performance of the query. | |
*/ | |
overrideBody?: OverrideBodyQuery; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, thanks! Added comment to a single search function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@elasticmachine merge upstream |
|
||
export interface OverrideBodyQuery { | ||
_source: estypes.SearchSourceConfig; | ||
fields: estypes.Fields; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
search query body has some other fields as well, as: sort
, query
etc.
Defining only 2 of these might be unnecessary restrictive, given its name suggests the whole body override.
On second note - would it make sense to define them as non required fields in this interface? It would make this parameter more flexible, if you would want to override _source
field only, for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion!
I changed those fields to optional.
I was trying to change OverrideBodyQuery
to estypes.SearchRequest['body']
, but it wasn't working, because looks like
buildEventsSearchQuery
return compatible but not the exact type.
Later in a single search after we cast this query to estypes.SearchRequest
So if somebody need to add new fields, it shouldn't be a problem to add it into OverrideBodyQuery
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for looking into it.
I think it's good to go.
With added comment it should not be an issue to add new fields if needed
@elasticmachine merge upstream |
@elasticmachine merge upstream |
@elasticmachine merge upstream |
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]
History
To update your PR or re-run it, just comment with: cc @nkhristinin |
## Terms query for Indicator Match rule TODO: [] need more unit/integrations tests, but ready for review The indicator match rule will use terms query when it is possible to search for matches for threat-first-search and for events-first-search. ## How the match query worked: Example for threat-first-search. If we have matching conditions like: `host.ip ==== indicator.host.ip` or (`source.name === indicator.source.name` AND `host.name === indicator.host.name`) It will generate queries like: ``` match: {host.ip: "1"}, or match: {host.ip: "2"} or match: {host.ip: "3"} or (match: {source.name: "1"} and match: {host.name: "1"}) or (match: {source.name: "2"} and match: {host.name: "2"}) or (match: {source.name: "3"} and match: {host.name: "3"}) ``` Each match will also have `_name` fields like: `${threatId}_${threatIndex}_${threatFields}_${sourceField}` So and because it's 1:1 relation between match and response, later at enrichment stage will be clear which threat matches which event. ## Terms query. We do fetch info about mapping for fields which use for match conditions of the IM rule. Terms query doesn't support all field types, this is why there is some allowed list which field types. Terms query not applied for AND conditions. For example: Fields types host.ip - `ip` user.name - `keyword` user.description - `text` indicator.host.ip_range - `ip_range` `host.ip === indicator.host.ip` or `host.ip_range === indicator.host.ip` or (`source.name === indicator.source.name` AND `host.name === indicator.host.name`) It will generate queries like: ``` terms: {host.ip: ["1","2","3"]}, or match: {host.ip_range: "1"} // terms query support range fields, but it will be difficult later to understand which threat match which event, because we can have more than 1 response for this condition or match: {host.ip_range: "2"} or (match: {source.name: "1"} and match: {host.name: "1"}) or (match: {source.name: "2"} and match: {host.name: "2"}) or (match: {source.name: "3"} and match: {host.name: "3"}) ``` For terms query, we don't know which response matches with events, this is why we do match it back in the code. ## Other changes Threat-first-search - will do one extra request to have all matched threats. For example: The threat index has 1.000.000 documents. IM rule gets the first batch of 9.000 threats and builds a query to the events index. It returns 100 events (max_signal = 100). Then it tries to enrich those 100 events with threat info. The problem is that the original implementation will enrich with the only threats from this 9.000 batch. And it will ignore other matches in 1.000.000 threats. This way we do one extra request in the end from potential alerts to threat index. # Tests performance In the best case, it can improve performance by around 3x times. [Base](#149113) Threat Indicators - 1.500.000 documents Source - 1.000.000 documents. 1 field for match condition <img width="557" alt="213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775" src="https://user-images.githubusercontent.com/7609147/215526984-ff027ba1-2f64-49fe-8fe8-a23ff4eda4dc.png"> This PR: <img width="537" alt="Screenshot 2023-01-30 at 20 20 32" src="https://user-images.githubusercontent.com/7609147/215575128-730514ac-a186-4ab8-87fd-af2ea8f79cec.png"> --------- Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
## Terms query for Indicator Match rule TODO: [] need more unit/integrations tests, but ready for review The indicator match rule will use terms query when it is possible to search for matches for threat-first-search and for events-first-search. ## How the match query worked: Example for threat-first-search. If we have matching conditions like: `host.ip ==== indicator.host.ip` or (`source.name === indicator.source.name` AND `host.name === indicator.host.name`) It will generate queries like: ``` match: {host.ip: "1"}, or match: {host.ip: "2"} or match: {host.ip: "3"} or (match: {source.name: "1"} and match: {host.name: "1"}) or (match: {source.name: "2"} and match: {host.name: "2"}) or (match: {source.name: "3"} and match: {host.name: "3"}) ``` Each match will also have `_name` fields like: `${threatId}_${threatIndex}_${threatFields}_${sourceField}` So and because it's 1:1 relation between match and response, later at enrichment stage will be clear which threat matches which event. ## Terms query. We do fetch info about mapping for fields which use for match conditions of the IM rule. Terms query doesn't support all field types, this is why there is some allowed list which field types. Terms query not applied for AND conditions. For example: Fields types host.ip - `ip` user.name - `keyword` user.description - `text` indicator.host.ip_range - `ip_range` `host.ip === indicator.host.ip` or `host.ip_range === indicator.host.ip` or (`source.name === indicator.source.name` AND `host.name === indicator.host.name`) It will generate queries like: ``` terms: {host.ip: ["1","2","3"]}, or match: {host.ip_range: "1"} // terms query support range fields, but it will be difficult later to understand which threat match which event, because we can have more than 1 response for this condition or match: {host.ip_range: "2"} or (match: {source.name: "1"} and match: {host.name: "1"}) or (match: {source.name: "2"} and match: {host.name: "2"}) or (match: {source.name: "3"} and match: {host.name: "3"}) ``` For terms query, we don't know which response matches with events, this is why we do match it back in the code. ## Other changes Threat-first-search - will do one extra request to have all matched threats. For example: The threat index has 1.000.000 documents. IM rule gets the first batch of 9.000 threats and builds a query to the events index. It returns 100 events (max_signal = 100). Then it tries to enrich those 100 events with threat info. The problem is that the original implementation will enrich with the only threats from this 9.000 batch. And it will ignore other matches in 1.000.000 threats. This way we do one extra request in the end from potential alerts to threat index. # Tests performance In the best case, it can improve performance by around 3x times. [Base](elastic#149113) Threat Indicators - 1.500.000 documents Source - 1.000.000 documents. 1 field for match condition <img width="557" alt="213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775" src="https://user-images.githubusercontent.com/7609147/215526984-ff027ba1-2f64-49fe-8fe8-a23ff4eda4dc.png"> This PR: <img width="537" alt="Screenshot 2023-01-30 at 20 20 32" src="https://user-images.githubusercontent.com/7609147/215575128-730514ac-a186-4ab8-87fd-af2ea8f79cec.png"> --------- Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Strip source, and keep only fields used for mappings for events search of IM rule. Tests on cloud env: <img width="1182" alt="Screenshot 2023-01-19 at 16 42 19" src="https://user-images.githubusercontent.com/7609147/213486861-1a0e5a20-b575-4ed6-ac9b-cf9fb8a8c14f.png"> Threat Indicators - 1.500.000 documents Source - 1.000.000 documents. No changes: <img width="564" alt="Screenshot 2023-01-19 at 16 30 19" src="https://user-images.githubusercontent.com/7609147/213484465-ad20d5b2-c7a7-42d3-9d92-478c5d046636.png"> Strip fields: <img width="557" alt="Screenshot 2023-01-19 at 16 30 27" src="https://user-images.githubusercontent.com/7609147/213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775.png"> Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Strip source, and keep only fields used for mappings for events search of IM rule. Tests on cloud env: <img width="1182" alt="Screenshot 2023-01-19 at 16 42 19" src="https://user-images.githubusercontent.com/7609147/213486861-1a0e5a20-b575-4ed6-ac9b-cf9fb8a8c14f.png"> Threat Indicators - 1.500.000 documents Source - 1.000.000 documents. No changes: <img width="564" alt="Screenshot 2023-01-19 at 16 30 19" src="https://user-images.githubusercontent.com/7609147/213484465-ad20d5b2-c7a7-42d3-9d92-478c5d046636.png"> Strip fields: <img width="557" alt="Screenshot 2023-01-19 at 16 30 27" src="https://user-images.githubusercontent.com/7609147/213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775.png"> Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
## Terms query for Indicator Match rule TODO: [] need more unit/integrations tests, but ready for review The indicator match rule will use terms query when it is possible to search for matches for threat-first-search and for events-first-search. ## How the match query worked: Example for threat-first-search. If we have matching conditions like: `host.ip ==== indicator.host.ip` or (`source.name === indicator.source.name` AND `host.name === indicator.host.name`) It will generate queries like: ``` match: {host.ip: "1"}, or match: {host.ip: "2"} or match: {host.ip: "3"} or (match: {source.name: "1"} and match: {host.name: "1"}) or (match: {source.name: "2"} and match: {host.name: "2"}) or (match: {source.name: "3"} and match: {host.name: "3"}) ``` Each match will also have `_name` fields like: `${threatId}_${threatIndex}_${threatFields}_${sourceField}` So and because it's 1:1 relation between match and response, later at enrichment stage will be clear which threat matches which event. ## Terms query. We do fetch info about mapping for fields which use for match conditions of the IM rule. Terms query doesn't support all field types, this is why there is some allowed list which field types. Terms query not applied for AND conditions. For example: Fields types host.ip - `ip` user.name - `keyword` user.description - `text` indicator.host.ip_range - `ip_range` `host.ip === indicator.host.ip` or `host.ip_range === indicator.host.ip` or (`source.name === indicator.source.name` AND `host.name === indicator.host.name`) It will generate queries like: ``` terms: {host.ip: ["1","2","3"]}, or match: {host.ip_range: "1"} // terms query support range fields, but it will be difficult later to understand which threat match which event, because we can have more than 1 response for this condition or match: {host.ip_range: "2"} or (match: {source.name: "1"} and match: {host.name: "1"}) or (match: {source.name: "2"} and match: {host.name: "2"}) or (match: {source.name: "3"} and match: {host.name: "3"}) ``` For terms query, we don't know which response matches with events, this is why we do match it back in the code. ## Other changes Threat-first-search - will do one extra request to have all matched threats. For example: The threat index has 1.000.000 documents. IM rule gets the first batch of 9.000 threats and builds a query to the events index. It returns 100 events (max_signal = 100). Then it tries to enrich those 100 events with threat info. The problem is that the original implementation will enrich with the only threats from this 9.000 batch. And it will ignore other matches in 1.000.000 threats. This way we do one extra request in the end from potential alerts to threat index. # Tests performance In the best case, it can improve performance by around 3x times. [Base](elastic#149113) Threat Indicators - 1.500.000 documents Source - 1.000.000 documents. 1 field for match condition <img width="557" alt="213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775" src="https://user-images.githubusercontent.com/7609147/215526984-ff027ba1-2f64-49fe-8fe8-a23ff4eda4dc.png"> This PR: <img width="537" alt="Screenshot 2023-01-30 at 20 20 32" src="https://user-images.githubusercontent.com/7609147/215575128-730514ac-a186-4ab8-87fd-af2ea8f79cec.png"> --------- Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
## Terms query for Indicator Match rule TODO: [] need more unit/integrations tests, but ready for review The indicator match rule will use terms query when it is possible to search for matches for threat-first-search and for events-first-search. ## How the match query worked: Example for threat-first-search. If we have matching conditions like: `host.ip ==== indicator.host.ip` or (`source.name === indicator.source.name` AND `host.name === indicator.host.name`) It will generate queries like: ``` match: {host.ip: "1"}, or match: {host.ip: "2"} or match: {host.ip: "3"} or (match: {source.name: "1"} and match: {host.name: "1"}) or (match: {source.name: "2"} and match: {host.name: "2"}) or (match: {source.name: "3"} and match: {host.name: "3"}) ``` Each match will also have `_name` fields like: `${threatId}_${threatIndex}_${threatFields}_${sourceField}` So and because it's 1:1 relation between match and response, later at enrichment stage will be clear which threat matches which event. ## Terms query. We do fetch info about mapping for fields which use for match conditions of the IM rule. Terms query doesn't support all field types, this is why there is some allowed list which field types. Terms query not applied for AND conditions. For example: Fields types host.ip - `ip` user.name - `keyword` user.description - `text` indicator.host.ip_range - `ip_range` `host.ip === indicator.host.ip` or `host.ip_range === indicator.host.ip` or (`source.name === indicator.source.name` AND `host.name === indicator.host.name`) It will generate queries like: ``` terms: {host.ip: ["1","2","3"]}, or match: {host.ip_range: "1"} // terms query support range fields, but it will be difficult later to understand which threat match which event, because we can have more than 1 response for this condition or match: {host.ip_range: "2"} or (match: {source.name: "1"} and match: {host.name: "1"}) or (match: {source.name: "2"} and match: {host.name: "2"}) or (match: {source.name: "3"} and match: {host.name: "3"}) ``` For terms query, we don't know which response matches with events, this is why we do match it back in the code. ## Other changes Threat-first-search - will do one extra request to have all matched threats. For example: The threat index has 1.000.000 documents. IM rule gets the first batch of 9.000 threats and builds a query to the events index. It returns 100 events (max_signal = 100). Then it tries to enrich those 100 events with threat info. The problem is that the original implementation will enrich with the only threats from this 9.000 batch. And it will ignore other matches in 1.000.000 threats. This way we do one extra request in the end from potential alerts to threat index. # Tests performance In the best case, it can improve performance by around 3x times. [Base](elastic#149113) Threat Indicators - 1.500.000 documents Source - 1.000.000 documents. 1 field for match condition <img width="557" alt="213484531-3ab68c61-c3f5-4e28-b2c4-c1e90a5b1775" src="https://user-images.githubusercontent.com/7609147/215526984-ff027ba1-2f64-49fe-8fe8-a23ff4eda4dc.png"> This PR: <img width="537" alt="Screenshot 2023-01-30 at 20 20 32" src="https://user-images.githubusercontent.com/7609147/215575128-730514ac-a186-4ab8-87fd-af2ea8f79cec.png"> --------- Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Strip source, and keep only fields used for mappings for events search of IM rule.
Tests on cloud env:
Threat Indicators - 1.500.000 documents
Source - 1.000.000 documents.
No changes:
Strip fields: