Capture important data events verbatim in filter settings #49
mjcheetham merged 3 commits into git-ecosystem:main
Conversation
```yaml
# Data pattern rules - capture values from data events matching
# a (category, key prefix) pair. Matched values are promoted into
# the summary regardless of detail level or nesting.
data_patterns:
  # Capture curl/http error details from gvfs-helper
  - category: "gvfs-helper"
    key_prefix: "error/"
    field_name: "gvfs_helper_errors"
```
So data_patterns log the actual objects from the event_data events, whereas the message_patterns just log a count.
Something about the difference in summarisation behaviour between two similarly named things makes me feel a little uneasy from a 'permanent API' perspective.
message_patterns aggregates, but data_patterns collects.
How do you feel about something like string_patterns to signal "match which strings to escalate up the stack" or do you feel we need to move these patterns out of the "summary" config?
(I'm pushing a version that includes that model, just in case.)
I guess really it's not a question of the name of the fields, but that message_patterns performs aggregation (counts of matching messages, which makes sense as a summary) vs the new feature that performs matching+elevation (actual data).
Maybe I'd feel better if this was not tied to the summary. I saw summary as a way to define aggregated statistics.
If some types of events are important to capture verbatim, perhaps this is more an "important_events" feature or something like that (placeholder name).
Latest push includes a rewrite to put important_events in the filter config instead of summary config.
I did a second full test with the internal telemetry service consuming this branch, getting this important_events value in the telemetry stream:
```json
{
  "gvfs_helper_caches": [
    "<redacted>"
  ],
  "gvfs_helper_errors": [
    "(curl:35) SSL connect error [hard_fail]"
  ],
  "gvfs_helper_remotes": [
    "https://dev.azure.com/<redacted>"
  ]
}
```
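The JSON object above can be produced roughly as sketched below. This is a hedged illustration only: the real `emitProcessSpan` signature is not shown in this thread, and `marshalImportantEvents` is a hypothetical helper name. It only demonstrates marshaling the collected map into a single attribute value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalImportantEvents turns the collected values into the JSON object
// emitted as the trace2.process.important_events span attribute.
// It returns "" when nothing was captured, so the attribute can be skipped.
func marshalImportantEvents(events map[string][]string) string {
	if len(events) == 0 {
		return ""
	}
	b, err := json.Marshal(events) // map keys are emitted in sorted order
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	events := map[string][]string{
		"gvfs_helper_errors": {"(curl:35) SSL connect error [hard_fail]"},
	}
	fmt.Println(marshalImportantEvents(events))
}
```

Emitting a single JSON-valued attribute (rather than one attribute per field) keeps the span schema stable no matter which rules an operator configures.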
Force-pushed from 6dd6c7e to 12d81b1
Context: Data events with nesting level n are logically attached to the region at regionStack[n-2]. The prior bounds check only tested whether the stack was shorter than rWant, missing two cases: a negative rWant (when mf_nesting is 0 or 1, which should have been caught by the nesting <= 1 guard but might slip through) and the case where rWant equals the stack length exactly, which would be an out-of-bounds index.

Justification: Adding `rWant < 0` makes the guard explicit about the lower bound. Changing `<` to `<=` corrects the upper bound: a valid index into a slice of length L is 0..L-1, so `len >= rWant+1`, i.e. `rWant < len`, i.e. reject when `rWant >= len`, which is `len <= rWant`. The fix ensures the event is silently ignored (with a TODO log warning) rather than causing an out-of-bounds panic.

Implementation: Single comparison change in apply__data_generic. No behavior change for well-formed event streams; only prevents a potential panic on malformed or unexpectedly nested data events.
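The corrected guard can be sketched as follows. The function and parameter names here are hypothetical; only the comparison logic mirrors what the commit describes.

```go
package main

import "fmt"

// attachRegion returns the region-stack index that a data event at the
// given nesting level should attach to, or -1 if the event is malformed
// or orphaned. Data events at nesting level n attach to regionStack[n-2].
func attachRegion(regionStackLen, nesting int) int {
	rWant := nesting - 2
	// Reject both bounds explicitly: a negative index (nesting 0 or 1),
	// and rWant == len, the exact-length case the old `<` check missed.
	if rWant < 0 || regionStackLen <= rWant {
		return -1 // silently ignore rather than panic out of bounds
	}
	return rWant
}

func main() {
	fmt.Println(attachRegion(3, 4)) // valid: index 2
	fmt.Println(attachRegion(3, 5)) // rWant == len: rejected
	fmt.Println(attachRegion(3, 1)) // negative rWant: rejected
}
```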
Context: Operators monitoring Git processes that run gvfs-helper subcommands need visibility into specific data event values (error strings, curl codes, etc.) even when verbose telemetry is disabled. At dl:summary level all data events are currently dropped, so there is no way to guarantee critical diagnostic values surface in the OTEL process span without enabling expensive full-verbose collection site-wide.

Justification: The rules live in filter.yml, not summary.yml, because capture is a filtering concern: it determines which data events are guaranteed to appear regardless of detail level, paralleling how filter rules control verbosity. The captured values are stored in their own OTEL attribute (trace2.process.important_events) separate from trace2.process.summary, keeping aggregated metrics distinct from verbatim strings. The capture call in apply__data_generic runs before any early-return paths so that orphaned nested events (nesting > 1 with an empty region stack) are never silently dropped.

Implementation: ImportantEventRule (category exact match, key_prefix prefix match, field_name output key) is added to FilterSettings and validated in parseFilterSettingsFromBuffer. TrProcess gains an importantEvents map that is allocated only when rules exist, so the nil check in apply__important_events serves as a fast no-op when the feature is unconfigured. apply__important_events in filter_settings.go matches each data event against the configured rules and appends matching values. emitProcessSpan emits trace2.process.important_events as a JSON object whenever the map is non-empty, at all detail levels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
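The matching described in this commit message can be sketched as below. The `ImportantEventRule` name comes from the message itself; the struct field names and the helper signature are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// ImportantEventRule mirrors the rule described in the commit message:
// exact category match, key prefix match, named output field.
// (Field names here are illustrative, not copied from the source.)
type ImportantEventRule struct {
	Category  string
	KeyPrefix string
	FieldName string
}

// applyImportantEvents appends a data event's value to the output map
// when any rule matches. A nil map acts as a fast no-op, mirroring the
// "allocated only when rules exist" behavior described above.
func applyImportantEvents(out map[string][]string, rules []ImportantEventRule, category, key, value string) {
	if out == nil {
		return
	}
	for _, r := range rules {
		if category == r.Category && strings.HasPrefix(key, r.KeyPrefix) {
			out[r.FieldName] = append(out[r.FieldName], value)
		}
	}
}

func main() {
	rules := []ImportantEventRule{
		{Category: "gvfs-helper", KeyPrefix: "error/", FieldName: "gvfs_helper_errors"},
	}
	events := map[string][]string{}
	applyImportantEvents(events, rules, "gvfs-helper", "error/curl", "(curl:35) SSL connect error [hard_fail]")
	applyImportantEvents(events, rules, "gvfs-helper", "cache/hit", "not captured")
	fmt.Println(events["gvfs_helper_errors"])
}
```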
Context: The important_events feature added in the previous commit has no user-facing documentation. Operators need to know the YAML syntax, the semantics of each field, the output attribute name, and how the feature relates to summary.yml so they can distinguish verbatim capture from aggregated metrics.

Implementation: config-filter-settings.md gains a new "Important Events" section with YAML syntax, field descriptions, and a worked example showing gvfs-helper error capture. The filter syntax summary at the bottom is extended with the important_events block so the complete schema appears in one place. configure-custom-collector.md documents the filter: pathname key (previously undocumented in the receiver config example) and adds a cross-reference to important_events. The summary example YAML adds a note directing readers to filter.yml for verbatim value capture, distinguishing it from the aggregated summary output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from 12d81b1 to 4bc9129
mjcheetham left a comment:
Thank you for persisting with my concerns and delays in getting back to you on this @derrickstolee - you know your contributions are always thoroughly appreciated ❤️
> In addition to controlling verbosity, the `filter.yml` file can declare a list of data events that should always be captured verbatim, regardless of the active detail level. This lets operators guarantee that specific Trace2 data values are always surfaced in the OTEL process span even when verbose telemetry is disabled.
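Based on the discussion above, the `filter.yml` block plausibly looks like the following. This is a sketch: the top-level `important_events` key follows the rename discussed in this thread, and the per-rule field names are carried over from the earlier `data_patterns` draft, so treat the exact spelling as an assumption.

```yaml
# filter.yml -- hypothetical sketch; key names inferred from this thread
important_events:
  - category: "gvfs-helper"           # exact match on the Trace2 data event category
    key_prefix: "error/"              # prefix match on the data event key
    field_name: "gvfs_helper_errors"  # output array inside trace2.process.important_events
```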
I like this description of the feature - fits well!
This is now included in the v0.7.0 tag.
In #44, we created a custom model for generating performance summaries based on region timings and printf events.
This change introduces a new customization to allow promoting trace2 string data messages into the output telemetry event.
My motivation is that the microsoft/git fork includes a git-gvfs-helper tool that triggers certain string data failures when certain kinds of errors occur. Those messages are not being elevated to the telemetry, so we don't know how often they are happening or if they will go away when we make certain server-side changes.

- Adds `important_events` rules to `filter.yml` that capture specific Trace2 `data` event values verbatim into a new `trace2.process.important_events` OTEL span attribute.
- Rules match on `category` (exact) and `key_prefix` (prefix), and collect all matching values into a named array field.
- Works at `dl:summary` without enabling verbose telemetry.
- Implemented in `filter_settings.go` alongside an `ImportantEventRule` type and validation; stored separately from `SummaryAccumulator`.

This will take the events sent that match the category exactly and the key by prefix and put them into a JSON dictionary as follows:
This is a real example that I was able to trigger in my own testing, along with other non-error strings, using an internal telemetry service that consumes the `git-ecosystem/trace2receiver` via a fork of the `git-ecosystem/sample-trace2-otel-collector`. I used agentic coding to produce these results, including the substantial test cases.