
[Security Solution] Adds event log telemetry specific for security solution rules #128216

Merged
FrankHassanabad merged 24 commits into elastic:main from add-event-log on Mar 29, 2022

Conversation

@FrankHassanabad (Contributor) commented Mar 21, 2022

Summary

Adds event log telemetry specific to security solution rules.

This adds a new section to the telemetry, under detectionMetrics, called detection_rule_status.

A full JSON sample document:

{
  "detection_rule_status": {
    "all_rules": {
      "eql": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "indicator": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "mlRule": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "query": {
        "failed": 4,
        "top_failed": {
          "1": {
            "message": "an hour were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name * id rule id execution id space ID default",
            "count": 4
          },
          "2": {
            "message": "an hour were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name Endpoint Security id rule id execution id space ID default",
            "count": 2
          },
          "3": {
            "message": "an hour were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name Telnet Port Activity id rule id execution id space ID default",
            "count": 2
          },
          "4": {
            "message": "hours were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name * id rule id execution id space ID default",
            "count": 2
          },
          "5": {
            "message": "hours were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name Endpoint Security id rule id execution id space ID default",
            "count": 1
          },
          "6": {
            "message": "hours were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name Telnet Port Activity id rule id execution id space ID default",
            "count": 1
          }
        },
        "partial_failure": 2,
        "top_partial_failure": {
          "1": {
            "message": "This rule is attempting to query data from Elasticsearch indices listed in the Index pattern section of the rule definition however no index matching blah frank was found This warning will continue to appear until matching index is created or this rule is disabled",
            "count": 189
          },
          "2": {
            "message": "This rule is attempting to query data from Elasticsearch indices listed in the Index pattern section of the rule definition however no index matching logs-endpoint.alerts was found This warning will continue to appear until matching index is created or this rule is disabled If you have recently enrolled agents enabled with Endpoint Security through Fleet this warning should stop once an alert is sent from an agent",
            "count": 187
          }
        },
        "succeeded": 2,
        "index_duration": {
          "max": 228568,
          "avg": 2292.8852459016393,
          "min": 0
        },
        "search_duration": {
          "max": 324,
          "avg": 21.661202185792348,
          "min": 1
        },
        "gap_duration": {
          "max": 5651,
          "avg": 3929.4166666666665,
          "min": 2811
        },
        "gap_count": 10
      },
      "savedQuery": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "threshold": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "total": {
        "failed": 4,
        "partial_failure": 2,
        "succeeded": 2
      }
    },
    "elastic_rules": {
      "eql": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "indicator": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "mlRule": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "query": {
        "failed": 2,
        "top_failed": {
          "1": {
            "message": "an hour were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name Endpoint Security id rule id execution id space ID default",
            "count": 2
          },
          "2": {
            "message": "an hour were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name Telnet Port Activity id rule id execution id space ID default",
            "count": 2
          },
          "3": {
            "message": "hours were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name Endpoint Security id rule id execution id space ID default",
            "count": 1
          },
          "4": {
            "message": "hours were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name Telnet Port Activity id rule id execution id space ID default",
            "count": 1
          }
        },
        "partial_failure": 1,
        "top_partial_failure": {
          "1": {
            "message": "This rule is attempting to query data from Elasticsearch indices listed in the Index pattern section of the rule definition however no index matching logs-endpoint.alerts was found This warning will continue to appear until matching index is created or this rule is disabled If you have recently enrolled agents enabled with Endpoint Security through Fleet this warning should stop once an alert is sent from an agent",
            "count": 187
          }
        },
        "succeeded": 1,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 278,
          "avg": 8.165745856353592,
          "min": 1
        },
        "gap_duration": {
          "max": 5474,
          "avg": 3831,
          "min": 2811
        },
        "gap_count": 6
      },
      "savedQuery": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "threshold": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "total": {
        "failed": 2,
        "partial_failure": 1,
        "succeeded": 1
      }
    },
    "custom_rules": {
      "eql": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "indicator": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "mlRule": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "query": {
        "failed": 2,
        "top_failed": {
          "1": {
            "message": "an hour were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name * id rule id execution id space ID default",
            "count": 4
          },
          "2": {
            "message": "hours were not queried between this rule execution and the last execution so signals may have been missed Consider increasing your look behind time or adding more Kibana instances name * id rule id execution id space ID default",
            "count": 2
          }
        },
        "partial_failure": 1,
        "top_partial_failure": {
          "1": {
            "message": "This rule is attempting to query data from Elasticsearch indices listed in the Index pattern section of the rule definition however no index matching blah frank was found This warning will continue to appear until matching index is created or this rule is disabled",
            "count": 189
          }
        },
        "succeeded": 1,
        "index_duration": {
          "max": 228568,
          "avg": 4536.1945945945945,
          "min": 0
        },
        "search_duration": {
          "max": 324,
          "avg": 34.86486486486486,
          "min": 8
        },
        "gap_duration": {
          "max": 5651,
          "avg": 4027.8333333333335,
          "min": 3051
        },
        "gap_count": 4
      },
      "savedQuery": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "threshold": {
        "failed": 0,
        "top_failed": {},
        "partial_failure": 0,
        "top_partial_failure": {},
        "succeeded": 0,
        "index_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "search_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_duration": {
          "max": 0,
          "avg": 0,
          "min": 0
        },
        "gap_count": 0
      },
      "total": {
        "failed": 2,
        "partial_failure": 1,
        "succeeded": 1
      }
    }
  }
}

Manual testing

Add some rules and force partial failures by giving them index patterns that do not exist, then force failures and gaps by shutting Kibana down for a while and starting it back up so that some rule executions are missed. Then go to Advanced Settings, scroll to the bottom, and click "cluster data". After scrolling down for a while you should see data like this:

(Screenshot: "Screen Shot 2022-03-21 at 4.18.58 PM" — cluster data showing the new detection_rule_status telemetry)

Checklist

Delete any items that are not applicable to this PR.

@FrankHassanabad FrankHassanabad self-assigned this Mar 22, 2022
@FrankHassanabad FrankHassanabad changed the title Add event log [Security Solution] Adds event log telemetry and metrics Mar 22, 2022
@FrankHassanabad FrankHassanabad added v8.2.0 backport:skip This commit does not require backporting release_note:enhancement labels Mar 22, 2022
@FrankHassanabad FrankHassanabad changed the title [Security Solution] Adds event log telemetry and metrics [Security Solution] Adds event log telemetry specific for security solution rules Mar 22, 2022
@FrankHassanabad FrankHassanabad added Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. telemetry Issues related to the addition of telemetry to a feature labels Mar 22, 2022
@FrankHassanabad FrankHassanabad marked this pull request as ready for review March 22, 2022 18:00
@FrankHassanabad FrankHassanabad requested review from a team as code owners March 22, 2022 18:00
@elasticmachine (Contributor) commented:

Pinging @elastic/security-solution (Team: SecuritySolution)

@pgayvallet (Contributor) left a comment:

That's a 6k lines addition to the x-pack/plugins/telemetry_collection_xpack/schema/xpack_plugins.json file.

cc @afharo and @Bamieh

{
  range: {
    '@timestamp': {
      gte: 'now-24h',
Contributor commented:
No action is needed here, but I'm starting to wonder if we need to start finding the previous usage collector run timestamp and use that. I have noticed unusual collection cadences within some of the security clusters.

Member replied:
@pjhampton after #121656 clusters should report only once every 24h. Just FYI :)

Member commented:
Had the same question above 😅 https://github.com/elastic/kibana/pull/128216/files#r837923596

If consistency can't be guaranteed (i.e., the collector doesn't run at the same time each day, so drift creeps in), it would be nice to have a service that pushes the right day-window datetimes down for these queries to consume, as sketched below.
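A minimal sketch of what such a helper could look like (nothing like this exists in the PR; getPreviousRunTimestamp is a hypothetical function that would read the last collector run time from wherever it is persisted):

// Hypothetical helper: compute the collection window from the previous run's
// timestamp instead of a hard-coded 'now-24h'.
interface CollectionWindow {
  gte: string; // start of the window (previous run, or 24h ago as a fallback)
  lte: string; // end of the window (this run)
}

const getCollectionWindow = async (
  // Assumed to read the last successful collector run time from wherever it is persisted.
  getPreviousRunTimestamp: () => Promise<string | undefined>
): Promise<CollectionWindow> => {
  const now = new Date();
  const previousRun = await getPreviousRunTimestamp();
  const fallback = new Date(now.getTime() - 24 * 60 * 60 * 1000).toISOString();
  return {
    gte: previousRun ?? fallback,
    lte: now.toISOString(),
  };
};

// The window could then replace the fixed range above:
// { range: { '@timestamp': { gte: window.gte, lte: window.lte } } }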

@pjhampton (Contributor) left a comment:

🌔 🚀 ✨ LGTM ✨ 🚀 🌔

Code looks solid as always, but this amount of new data in the usage collector makes me a little nervous for a new release. I'm starting to wonder if we can add this as a task instead. Regardless, as long as Kbn core is good with it, so am I!

@afharo (Member) left a comment:

It looks good to me. I don't think this increase will make us hit the current payload limit.

However, I'd like to challenge the decision in the schema. Is there a particular reason why we want to report it as 1, 2, 3, ... vs. an array?

IMO, using an array may be more useful because an error message may be top 1 for some clusters and top 10 for some others. However, in the big picture, we would be more interested in tackling the ones that occur in most clusters, wouldn't we?

It will also simplify the schema by removing a lot of fields 😇

What do you think?

Comment on lines 319 to 325
top_failed: {
  '1': {
    message: {
      type: 'keyword',
      _meta: { description: 'Top 1 failed rule message' },
    },
  },
Member commented:
Q: Would it make sense to declare these as arrays? This way we can query detection_rule_status.all_rules.top_failed: "error message" and it will show up regardless if it's the top 1 or top N.

Suggested change

From:

top_failed: {
  '1': {
    message: {
      type: 'keyword',
      _meta: { description: 'Top 1 failed rule message' },
    },
  },

To:

top_failed: {
  type: 'array',
  items: {
    message: {
      type: 'keyword',
      _meta: { description: 'One of top 10 failed rule messages' },
    },
  },

@FrankHassanabad (Contributor, Author) replied:

@afharo

I will update the schema, as I made a mistake; what I left out was the count. It really should be this:

"1": {
  "properties": {
    "message": {
      "type": "keyword",
      "_meta": {
        "description": "Top 1 partial failed rule message"
      }
    },
    "count": {
      "type": "long",
      "_meta": {
        "description": "Top 1 partial failed rule message"
      }
    }
  }
}

I haven't done a visualization with arrays that contain objects before. Is that well supported when the array contains objects? I know that nested fields are still lacking support, which is being tracked here:
#1084

Are nested types supported in this schema? I didn't see an example of it.

If I changed this to an array with objects inside it, I don't know how easy that would be to visualize compared to just numbering them and stopping at 10.

Member replied:
Are nested types supported in this schema? I didn't see an example of it.

The schema is not a direct ES mapping (even though it looks pretty similar 😅). You can specify the array of objects like this:

Suggested change

From:

top_failed: {
  '1': {
    message: {
      type: 'keyword',
      _meta: { description: 'Top 1 failed rule message' },
    },
  },

To:

top_failed: {
  type: 'array',
  items: {
    message: {
      type: 'keyword',
      _meta: { description: 'One of top 10 failed rule messages' },
    },
    count: {
      type: 'long',
      _meta: { description: 'Number of times the message occurred' }
    }
  },

Then on the receiving end, we can work out different ways of storing it: for instance, I'd say that for easier consumption, we could loop through the array and index one document per object in the array with some common fields for correlation.

This way you get a "simple" event-like index to query and aggregate without depending on tree structures or nested types.
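As a rough illustration of that idea (not code from this PR; the correlation field names are made up for the example), flattening on the receiving side could look something like this:

// Illustrative sketch: turn one telemetry payload's top_failed array into
// one event-like document per entry, carrying common fields for correlation.
interface TopFailedEntry {
  message: string;
  count: number;
}

interface FlattenedFailureDoc {
  cluster_uuid: string; // hypothetical correlation field
  rule_type: string; // hypothetical correlation field
  rank: number; // position within the reported top-N list
  message: string;
  count: number;
}

const flattenTopFailed = (
  clusterUuid: string,
  ruleType: string,
  topFailed: TopFailedEntry[]
): FlattenedFailureDoc[] =>
  topFailed.map((entry, index) => ({
    cluster_uuid: clusterUuid,
    rule_type: ruleType,
    rank: index + 1,
    message: entry.message,
    count: entry.count,
  }));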


As a separate comment, I was wondering whether it makes sense to send this type of data as events with our Event-Based Telemetry client coming up in 8.3 (meta issue here).

@FrankHassanabad @pjhampton @peluja1012 wdyt? Is it worth the wait for one extra minor?

@FrankHassanabad (Contributor, Author) replied:

Ok, I will change this to being an array with objects and update the code and tests with the array. Sounds good to me 👍 .

Contributor commented:

@afharo Thank you for reviewing this PR and for your suggestion of taking advantage of the new Telemetry client. While I think it might make sense to use this new client in the future, we have been waiting for the Telemetry data this PR provides for a couple of releases. It's important that we start collecting it as soon as possible so we can get ahead of customer issues related to rule health and performance. I suggest we go with the approach presented in this PR in 8.2 and then migrate over to the Event-Based system in a future release after it's ready for consumption.

Member replied:

@peluja1012 sounds like a good compromise to me. All clear! Thank you!


@afharo (Member) left a comment:

LGTM! Thank you @FrankHassanabad for the latest changes to simplify the schema

Comment on lines 312 to 318
detection_rule_status: {
  all_rules: {
    eql: {
      failed: {
        type: 'long',
        _meta: { description: 'The number of failed rules' },
      },
Member commented:
Real bummer this can't be broken up into smaller files and destructured all nicely so it's more easily consumed. Sounds like there's some telemetry tooling that is preventing this? We should open an issue to fix that tooling so these schemas can be better digested.

Do you have more details on this @afharo?

Member commented:
I'm also curious whether the existing tooling does anything to 'lock' the file. @FrankHassanabad mentioned in another comment below how important it is not to change telemetry keys, so I'm wondering if there is any tooling in place for this. Once broken out into separate files, we could always do a server-side hook to lock them in read-only mode, perhaps? Same goes for any downstream files generated from these schemas.

  type: 'long',
  _meta: { description: 'The number of partial failure rules' },
},
top_partial_failure: {
Member commented:
nit: maybe make this plural? Is this the first top partial failure, or a list (it seems to be an array) of top partials? If the latter, it would be more descriptive to make it plural for folks building dashboards.

@FrankHassanabad (Contributor, Author) replied:
Sounds good. Will make this plural. Also will change top_failed

So you will have the changes:

top_partial_failure -> top_partial_failures
top_failed -> top_failures
failed -> failures

@spong (Member) commented Mar 29, 2022

Wasn't able to verify this data on any of these three servers:

Using this query:

GET cluster_telemetry*/_search
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "stack_stats-kibana-plugins.security_solution.detectionMetrics.detection_rule_status"
          }
        }
      ]
    }
  }
}

Would be nice if there was a 'flush' feature so I could run this locally and not have to wait 24hrs. Does this exist? Also not really clear where this data would go if I ran this locally anyway... 🤔 Trying to use the Rosetta Stone doc to understand this all, but very hard to grok with how complex telemetry is.

Comment on lines +783 to +786
index_duration: {
  max: {
    type: 'float',
    _meta: { description: 'The max duration' },
Member commented:
ux nit: Should these _meta:description's include the metric name in it too? Not sure how it shows up in the UI/dashboards, but this wouldn't be helpful if you're hovering on a pie chart or something and it says The max duration instead of something like The max indexing duration.

@FrankHassanabad (Contributor, Author) replied:
It doesn't show up. I think these are not used anywhere other than here in the code. As far as I know, this doesn't get converted into the EDN files or into mappings either.

Member replied:
Ahh, well if it doesn't make it to the cluster schema, I question its value here, no? 🤔

@FrankHassanabad (Contributor, Author) replied:
I would talk to someone from Kibana core about its usefulness and usage. I'm not sure where it's used or whether it's used within tooling somewhere else.

Comment on lines +8859 to +8862
"detection_rule_status": {
"properties": {
"all_rules": {
"properties": {
@spong (Member) commented Mar 29, 2022:

Oh my gosh, another one of these!? 😅 I hope this one is generated at least...?

@FrankHassanabad (Contributor, Author) replied:

It is generated with:

 node scripts/telemetry_check.js --fix

@spong (Member) replied Mar 29, 2022:

Ahh, I see there's a README.md in the root folder (telemetry_collection_xpack/schema/README.md). That appears to be out of date though (it says 2 files when there are 3?), and it doesn't include the script you just mentioned to generate it... it just says:

The X-Pack related schema for the content that will be nested in stack_stats.kibana.plugins.
It is automatically generated by @kbn/telemetry-tools based on the schema property provided by all the registered Usage Collectors via the usageCollection.makeUsageCollector API.

Which I guess does link off to another README from there (src/plugins/usage_collection/README.mdx) and it appears to be mentioned in that one, so all good. Just coming in at a different angle so a little confusing here, all good now! 👍

@FrankHassanabad (Contributor, Author) replied:

Those are Kibana core's files.

Comment on lines +265 to +270
maxTotalSearchDuration: {
  value: null,
},
gapCount: {
  value: 0,
},
Member commented:
Note: the difference here is that the cardinality agg will return 0 when nothing matches, but the max/min/avg aggs will return null, since providing `missing` to them would mess with the actual value.
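For illustration, the contrast described above looks roughly like this in an aggregation request; this is a sketch only, and the field names are assumptions rather than the PR's exact fields:

// Sketch: cardinality naturally reports 0 when nothing matches, while max/min/avg
// report null in that case. Passing `missing` to max/min/avg would pull the real
// statistic toward 0, so it is deliberately left out.
const aggs = {
  gapCount: {
    cardinality: {
      field: 'kibana.alert.rule.execution.metrics.gap_duration_s', // assumed field name
    },
  },
  maxTotalSearchDuration: {
    max: {
      field: 'kibana.alert.rule.execution.metrics.total_search_duration_ms', // assumed field name
      // no `missing` here: with no matching docs this comes back as { value: null }
    },
  },
};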

@kibana-ci (Collaborator) commented:

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled in files

id before after diff
apm 15 14 -1
osquery 5 4 -1
securitySolution 69 68 -1
uptime 7 6 -1
total -4

ESLint disabled line counts

id before after diff
apm 88 85 -3
enterpriseSearch 9 7 -2
fleet 47 46 -1
osquery 122 119 -3
uptime 49 43 -6
total -15

References to deprecated APIs

id before after diff
canvas 70 64 -6
dashboard 78 72 -6
data 475 465 -10
dataEnhanced 55 49 -6
discover 26 20 -6
fleet 20 19 -1
lens 18 14 -4
management 2 1 -1
maps 456 330 -126
monitoring 40 28 -12
upgradeAssistant 12 7 -5
visDefaultEditor 205 155 -50
visTypeVega 4 3 -1
visualizations 17 13 -4
total -238

Total ESLint disabled count

id before after diff
apm 103 99 -4
enterpriseSearch 9 7 -2
fleet 55 54 -1
osquery 127 123 -4
securitySolution 509 508 -1
uptime 56 49 -7
total -19

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @FrankHassanabad

},
},
},
elastic_rules: {
Member commented:
Another note, there is no 'space-awareness' in this telemetry, so if a user installs the prebuilt rules in 10 spaces, this telemetry will say there's some 6k elastic rules installed (6xx prebuilt rules shipped x 10).

Member commented:
Another note to future self: rules are queried directly to determine elastic vs custom rules, and so are capped to 10k across all spaces:

const query: SavedObjectsCreatePointInTimeFinderOptions = {
  type: 'alert',
  perPage: maxPerPage,
  namespaces: ['*'],
  filter,
};

So if a user has >10k rules across spaces, things could start getting wonky -- we should test this, verify functionality, and consider an implementation that takes space-awareness into account. It would also be good to test these queries where the terms list is going to be a full 10k and see what that does to perf... 😮
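For reference, a sketch of how the point-in-time finder can be iterated page by page so collection is not limited to a single capped response; the types here are simplified stand-ins, not this PR's actual code:

// Sketch only: iterate the saved-objects point-in-time finder page by page.
// The interfaces below are simplified stand-ins for Kibana's real contracts.
interface PointInTimeFinderLike {
  find(): AsyncIterableIterator<{ saved_objects: Array<{ id: string }> }>;
  close(): Promise<void>;
}

interface SavedObjectsClientLike {
  createPointInTimeFinder(options: {
    type: string;
    perPage: number;
    namespaces: string[];
    filter?: string;
  }): PointInTimeFinderLike;
}

const collectAllRuleIds = async (
  savedObjectsClient: SavedObjectsClientLike,
  filter: string,
  maxPerPage: number
): Promise<string[]> => {
  const finder = savedObjectsClient.createPointInTimeFinder({
    type: 'alert',
    perPage: maxPerPage,
    namespaces: ['*'],
    filter,
  });

  const ruleIds: string[] = [];
  for await (const page of finder.find()) {
    for (const savedObject of page.saved_objects) {
      ruleIds.push(savedObject.id);
    }
  }
  await finder.close();
  return ruleIds;
};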

@@ -108,6 +122,7 @@ export const getRuleMetrics = async ({
     return {
       detection_rule_detail: elasticRuleObjects,
       detection_rule_usage: rulesUsage,
+      detection_rule_status: eventLogMetricsTypeStatus,
     };
   } catch (e) {
     // ignore failure, usage will be zeroed. We use debug mode to not unnecessarily worry users as this will not affect them.
Member commented:
How do you know that a failure occurred in the actual telemetry? On failures we're returning empty values?

* transformed. If it malfunctions or times out then it will not malfunction other
* parts of telemetry.
* NOTE: This takes in "ruleResults" to filter against elastic rules and custom rules.
* If the event log recorded information about which rules were elastic vs. custom this
Member commented:
Yeah, would be nice if we logged more metadata about the rule itself so there's not so much overhead in processing like this.

@spong (Member) left a comment:

A bit to grok at first (especially being new to telemetry), but I zoomed with @FrankHassanabad and we pair-reviewed the whooole thing. Thanks for implementing some of the nits from my feedback! And note, I was not able to verify that any of this telemetry made it to any of the staging servers, but I'm sure we'll be performing some additional validation testing once we get the first builds with this code in, and we can always tweak before the final BC. @FrankHassanabad did verify independently during development that the data made it, so we should be all good here.

All that said, thanks for taking the time to review this all with me @FrankHassanabad, and LGTM from the Security Solution side! Appreciate all the tests, and how modular and re-usable you structured this code (++ all the tests too!). 👍 🙌 🚀

@FrankHassanabad FrankHassanabad merged commit de894d1 into elastic:main Mar 29, 2022
@FrankHassanabad FrankHassanabad deleted the add-event-log branch March 29, 2022 22:32
Labels
backport:skip This commit does not require backporting release_note:enhancement Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. telemetry Issues related to the addition of telemetry to a feature v8.2.0

8 participants