From 2ade5dddf30ba917cae83d041f06bc57759cd57a Mon Sep 17 00:00:00 2001 From: Matt Johnson-Pint Date: Fri, 7 Apr 2023 09:22:53 -0700 Subject: [PATCH] rfc(feature): Exception Groups (#79) Co-authored-by: Abhijeet Prasad --- README.md | 1 + text/0079-exception-groups.md | 622 ++++++++++++++++++++++++++++++++++ 2 files changed, 623 insertions(+) create mode 100644 text/0079-exception-groups.md diff --git a/README.md b/README.md index a5decae6..07b3c33d 100644 --- a/README.md +++ b/README.md @@ -39,4 +39,5 @@ This repository contains RFCs and DACIs. Lost? - [0073-usage-of-transaction-types](text/0073-usage-of-transaction-types.md): Usage of transaction types - [0074-source-context-via-links](text/0074-source-context-via-links.md): Source context via links - [0078-escalating-issues](text/0078-escalating-issues.md): Escalating Issues +- [0079-exception-groups](text/0079-exception-groups.md): Exception Groups - [0080-issue-states](text/0080-issue-states.md): Issue States diff --git a/text/0079-exception-groups.md b/text/0079-exception-groups.md new file mode 100644 index 00000000..da2eb81d --- /dev/null +++ b/text/0079-exception-groups.md @@ -0,0 +1,622 @@ +- Start Date: 2023-03-26 +- RFC Type: feature +- RFC PR: https://github.com/getsentry/rfcs/pull/79 +- RFC Status: approved + +# Summary + +This RFC is intended to capture the changes required to Sentry to handle *exception groups*, +including changes to SDKs, protocol schema, event processing, and UI. + +### Important + +This RFC is `approved`, however SDKs should not release this feature until the back-end work is completed. +The status will be changed to `active` at that time. + +If desired, SDK teams _may_ implement this in an unmerged draft PR if they choose, or release in "preview" form. + +# Motivation + +Several programming languages have a concept of an unrelated group of multiple exception, aggregated into a single exception. + +- Python [`ExceptionGroup`](https://docs.python.org/3/library/exceptions.html#exception-groups) ([PEP 654](https://peps.python.org/pep-0654/)) +- .NET [`AggregateException`](https://docs.microsoft.com/dotnet/api/system.aggregateexception) +- JavaScript [`AggregateError`](https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/AggregateError) +- Go [`Unwrap() []error`](https://pkg.go.dev/errors) ([see also](https://tip.golang.org/doc/go1.20#errors)) +- Java [`Throwable.getSuppressed()`](https://docs.oracle.com/javase/7/docs/api/java/lang/Throwable.html#getSuppressed()) + +There may be others. We will use the term "exception group" throughout the design. +It applies to the concept, regardless of language. + +Sentry needs a way to capture exception groups, and present them in a meaningful way. +Simply capturing the exception group by itself is insufficient, and current workarounds (in .NET) are problematic. + +See also: +- [Python user feedback](https://github.com/getsentry/sentry/issues/37716) +- [.NET user feedback](https://github.com/getsentry/sentry-dotnet/issues/270) +- [JavaScript user feedback](https://github.com/getsentry/sentry-javascript/issues/5469) + +# Background + +## About Exception Groups + +An exception group is an exception unto itself. It may have been caused by the exceptions in the group, +but there is no implied causal relationship between those exceptions. + +In other words, given the following: + +- Group A + - Exception 1 + - Exception 2 + +`Group A` is an exception group that may have been caused by either `Exception 1`, `Exception 2`, or both of them. +However, it is generally _not_ true that `Exception 1` was caused by `Exception 2` or vice versa. + +Furthermore: +- `Exception 1` and `Exception 2` might be of the same type, or they might be of different types. +- There can be `n` number of exceptions in an exception group (`n >= 1`). +- There can be a stack trace on each of the exceptions within an exception group, as well as on the group itself. +- Just like any other exception, each exception within an exception group can have a chain of inner exceptions. +- An inner exception can *also* be an exception group. +- Exception groups can be present at any level. There is no requirement that the start of the chain is an exception group. + +Thus, other valid examples that could occur in an application include the following: + +- Group B + - Exception 1 + - Exception 2 + - Exception 3 + +- Group C + - Exception 1 + - Exception 1a + - Exception 1b + - Exception 2 + - Exception 2a + - Exception 2b + +- Group D + - Exception 1 + - Exception 1a + - Exception 1b + - Group E + - Exception 2 + - Exception 2a + - Exception 2b + - Exception 3 + - Exception 3a + - Exception 3b + +- Exception 1 + - Exception 1a + - Group E + - Exception 2 + - Exception 2a + - Exception 2b + - Exception 3 + - Exception 3a + - Exception 3b + +## Interpreting Exception Groups + +The meaning of a _normal_ exception is fairly straightforward: +- An exception is something that went wrong in the application, and thus represents an issue to resolve. +- If there is another exception that occurred to cause this one, that is assigned to the "inner exception" or "cause". +- The chain of exceptions thus is linear. + +This gets more complicated when an exception group is involved: +- The exception group _might_ represent the issue to resolve, or the issue might better be represented by one or more of the inner exceptions within the group. +- Depending on language, there might be a cause that is separate from the _any_ of the exceptions in the group. +- Thus, the chain of exceptions is more tree-like than linear. + +## Exception Handling in Sentry + +SDKs send exceptions to Sentry using the [Exception Interface](https://develop.sentry.dev/sdk/event-payloads/exception/) +on an event. One or more exceptions can be sent in the `values` array of the interface. Multiple values represent +a chain of exceptions in a causal relationship, sorted from oldest to newest. For example: + +```json +{ + "exception": { + "values": [ + {"type": "TypeError", "value": "Invalid Type!"}, + {"type": "ValueError", "value": "Invalid Value!"}, + {"type": "RuntimeError", "value": "Something went wrong!"} + ] + } +} +``` + +In the above example, an issue is created in Sentry for the `RuntimeError` exception. + +``` +Issue Title: "RuntimeError: Something went wrong!" + ++============================== ++ RuntimeError: Something went wrong! ++------------------------------ ++ (stack trace) ++============================== + ++============================== ++ ValueError: Invalid Value! ++------------------------------ ++ (stack trace) ++============================== + ++============================== ++ TypeError: Invalid Type! ++------------------------------ ++ (stack trace) ++============================== +``` + +This design is linear, and thus cannot support exception groups having more than one exception. +But even in that case, the exception presented in Sentry might be titled and grouped in an undesirable manner. + +Consider: + +```json +{ + "exception": { + "values": [ + {"type": "TypeError", "value": "Invalid Type!"}, + {"type": "ValueError", "value": "Invalid Value!"}, + {"type": "RuntimeError", "value": "Something went wrong!"}, + {"type": "ExceptionGroup", "value": "Exception Group (1 sub-exception)"} + ] + } +} +``` + +Sending this to Sentry would result in an issue titled `"ExceptionGroup: Exception Group (1 sub-exception)"`. +In some contexts that may be desired, but in others the expectation is that the issue would be titled `"RuntimeError: Something went wrong!"`. + +# Language Specifics + +## .NET + +The exception group type in .NET is [`AggregateException`](https://learn.microsoft.com/dotnet/api/system.aggregateexception). +- The exceptions of the group are stored in the `InnerExceptions` property. +- Like other exceptions, it also has an `InnerException` property, which is interpreted as the cause of this exception. + - Its value is always the same as `InnerExceptions[0]`. + +## Python + +The exception group type in .Python is [`ExceptionGroup`](https://docs.python.org/3/library/exceptions.html#exception-groups). +- The exceptions of the group are stored in the `exceptions` attribute. +- Like other exceptions, it can have a `__cause__` and/or a `__context__` attribute. + - `__context__` is an indirect cause, assigned if another exception occurs while handling the exception. + - `__cause__` is a direct cause, assigned if raised with the exception (using the `from` keyword). + - Setting it suppresses any `__context__` value, when displayed in the stack trace. + - There is no requirement that `exceptions[0]` be either of these. + +## JavaScript + +The exception group type in JavaScript is [`AggregateError`](https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/AggregateError). +- The errors of the group are stored in the `errors` property. +- Like other errors, it has a `cause` (singular) property. + - There is no requirement that `errors[0]` be the same as `cause`. + +# Proposed Solution + +**NOTE**: _This has changed significantly from prior versions of the draft RFC._ + +We will focus having the SDKs capture all available information, and adjust the issue grouping rules in Sentry to use that new information. + +**The SDKs will do the following:** + +- Capture the entire tree of exceptions represented by the exception group. +- Capture any other chained exceptions that are not part of the exception group. +- Use the existing `exception` value of the Sentry event. +- Add a few new fields to each exception's `mechanism` data, as described below. + +**Sentry will do the following:** + +- Take the new mechanism fields into account when grouping issues. See [Sentry Issue Grouping](#sentry-issue-grouping) below for more details. +- Present a user interface that depicts the structure of the exception group on the issue details page. See [Sentry UI Changes](#sentry-ui-changes) below for more details. + +## Important Notice + +The new grouping rules _MUST_ be fully implemented in Sentry.io and a published version of self-hosted +Sentry before any SDK can release a non-preview version that includes these changes. The SDK's release notes +should mention the required version of Sentry. + +Although the protocol changes are backwards compatible, the current problems +with exception groups will persist or be exacerbated without the new grouping rules in place. + +SDKs that currently require the user to opt-in to send chained exceptions can continue to do so without side-effects. +SDKs that send chained exceptions _by default_ may see an immediate change to exception grouping after implementing this feature. + +## New Mechanism Fields + +The [Exception Mechanism Interface](https://develop.sentry.dev/sdk/event-payloads/exception/#exception-mechanism) +will have the following new fields added: + +### `source` + +An optional string value describing the source of the exception. + +- The SDK should populate this with the name of the property or attribute of the parent exception that this exception + was acquired from. In the case of an array, it should include the zero-based array index as well. +- Python Examples: `"__context__"`, `"__cause__"`, `"exceptions[0]"`, `"exceptions[1]"` +- .NET Examples: `"InnerException"`, `"InnerExceptions[0]"`, `"InnerExceptions[1]"` +- JavaScript Examples: `"cause"`, `"errors[0]"`, `"errors[1]"` + +### `is_exception_group` + +An optional boolean value, set `true` when the exception is the exception group type specific to the platform or language. +The default is `false` when omitted. + +- For example, exceptions of type `ExceptionGroup` (Python), `AggregateException` (.NET), and `AggregateError` (JavaScript) + should have `"is_exception_group": true`. Other exceptions can omit this field. + +### `exception_id` + +An optional numeric value providing an ID for the exception relative to this specific event. + +- The SDK should assign simple incrementing integers to each exception in the tree, starting with `0` for the root of the tree. + In other words, when flattened into the list provided in the `exception` values on the event, the last exception + in the list should have ID `0`, the previous one should have ID `1`, the next previous should have ID `2`, etc. + +### `parent_id` + +An optional numeric value pointing at the `exception_id` that is the parent of this exception. + +- The SDK should assign this to all exceptions _except_ the root exception (the last to be listed in the `exception` values). + +## Interpretation + +The `exception_id` and `parent_id` fields work in conjunction to represent the hierarchical nature of the tree of exceptions. +If not provided, the previous interpretation will be assumed - which is that each exception in the list of `exception` values +is a child of the one immediately following it in the list. + +The [Exception Interface](https://develop.sentry.dev/sdk/event-payloads/exception/) will not change in structure, +but it will change in _interpretation_ with regard to multiple exception values. + +- The previous interpretation was: _"Multiple values represent chained exceptions."_. +- The new interpretation will be: _"Multiple values are related by the optional `mechanism.exception_id` + and `mechanism.parent_id` fields. When not present, multiple values represent chained exceptions."_ + + +## Additional SDK Requirements + +### Mechanism Type + +When setting the `mechanism.type` field, SDKs should use the following guidelines: + +- For the root exception (the last to be in the `exception.values` list), set `mechanism.type` to the name + of the integration that produced the exception (as was the case before this proposal). If the exception was + captured manually, set the `mechanism.type` to `"generic"`. + +- For all other exceptions in the list, set the `mechanism.type` to `"chained"`. This will indicate that the exception + is part of the chain of exceptions stemming from the root exception (regardless of whether it is in an exception group or not). + +Do not omit the `mechanism.type` field, nor send it empty or null. + +### Exception Value + +When setting the `value` field of the exception for an exception group type, SDKs should only deliver the meaningful part +of the exception message, excluding any string that may have been automatically added by their platform. + +For example: + +- In Python, the `value` field should not contain details such as `" (2 sub-exceptions)"`. Use the `message` attribute to get the raw message from an `ExceptionGroup`. If there is no message, omit the `value` and just send `type`. + +- In .NET, the `value` field should not contain details such as `" (Exception 1) (Exception 2)"`. The `Message` property may need to be modified to remove the appended inner messages. + +### Keep Aggregate Exceptions + +The .NET SDK previously had implemented an option called `KeepAggregateExceptions`. This flag should be deprecated, +in favor of _always_ sending the entire chain of aggregate exceptions as explained in this SDK. This may affect +existing issue grouping, and should be noted in the change log when released. + +Other SDKs should _not_ implement a similar option. + +## Example Event + +Given the Python code: + +```python +try: + raise RuntimeError("something") +except: + raise ExceptionGroup("nested", + [ + ValueError(654), + ExceptionGroup("imports", + [ + ImportError("no_such_module"), + ModuleNotFoundError("another_module"), + ] + ), + TypeError("int"), + ] + ) +``` + +The event would contain: + +```json +{ + "exception": { + "values": [ + { + "type": "TypeError", + "value": "int", + "mechanism": { + "type": "chained", + "source": "exceptions[2]", + "exception_id": 6, + "parent_id": 0 + } + }, + { + "type": "ModuleNotFoundError", + "value": "another_module", + "mechanism": { + "type": "chained", + "source": "exceptions[1]", + "exception_id": 5, + "parent_id": 3 + } + }, + { + "type": "ImportError", + "value": "no_such_module", + "mechanism": { + "type": "chained", + "source": "exceptions[0]", + "exception_id": 4, + "parent_id": 3 + } + }, + { + "type": "ExceptionGroup", + "value": "imports", + "mechanism": { + "type": "chained", + "source": "exceptions[1]", + "is_exception_group": true, + "exception_id": 3, + "parent_id": 0 + } + }, + { + "type": "ValueError", + "value": "654", + "mechanism": { + "type": "chained", + "source": "exceptions[0]", + "exception_id": 2, + "parent_id": 0 + } + }, + { + "type": "RuntimeError", + "value": "something", + "mechanism": { + "type": "chained", + "source": "__context__", + "exception_id": 1, + "parent_id": 0 + } + }, + { + "type": "ExceptionGroup", + "value": "nested", + "mechanism": { + "type": "exceptionhook", + "handled": false, + "is_exception_group": true, + "exception_id": 0 + } + }, + ] + } +} +``` + +**Reminder:** In .NET, `InnerException` is always the same as `InnerExceptions[0]`, thus it does not need to be reported separately. +However, Python's `__cause__` and `__context__`, and JavaScript's `cause`, are independent and thus _should_ be reported separately +if they have values. + +## Sentry Issue Grouping + +Issue grouping rules for exception groups are complex, because the nature of exception groups is that they may or may not represent +more than one distinct issue. While this may require some further experimentation to get right, the initial plan is as follows: + +First, determine the list of "top-level" exceptions. These are the exceptions that represent distinct issues contained in +the exception group. + +1. Start from the exception having `mechanism.exception_id:0`. +2. If it has `mechanism.is_exception_group:true`, then recursively search each child. +3. When reaching one where `mechanism.is_exception_group:false` (or not present), include it as a "top-level" exception, + and do not traverse any of its child exceptions. + +Next, determine from the top-level exceptions which of them would have been grouped together, had they been in separate events. + +- Apply grouping rules between the top-level exception to determine the distinct number of issues represented by the group. +- For each top-level exception, only consider the first-path through any child exceptions. + +Finally: + +- If there is only one distinct group of top-level exceptions, group the event with other events based on that top-level exception only. + Ignore any parent exception groups. + +- If more than one distinct top-level exception exists, then group the event based on the parent exception group that they have in common. + This will often be the root-level exception group. + +As an example, consider simplified issue grouping rules that only considered the exception `type`. When applied to an exception group such as: + +- `ExceptionGroup` + - `ValueError` + - `TypeError` + - `TypeError` + +There are two distinct top-level exceptions, `ValueError` and `TypeError`. They have the `ExceptionGroup` in common. +Thus the three exceptions considered for issue grouping are `ExceptionGroup`, `ValueError`, and the first `TypeError`. + +Now consider this example: + +- `ExceptionGroup` + - `ExceptionGroup` + - `ValueError` + - `ValueError` + - `TypeError` + - `ValueError` + - `ValueError` + - `ValueError` + +There are 5 top-level exceptions, all of type `ValueError`. Thus, the event should only be grouped based on a single `ValueError`, +and the others should be ignored for purposes of issue grouping. That one of them has a chained `TypeError` is not relevant, +at least not in this initial plan. + +A further modification to the plan might consider all possible branches of chained exceptions, but that is not proposed at this time. + +### Additional Issue Grouping Requirements + +As mentioned earlier, SDKs will set a meaningful `mechanism.type` on the root exception only (the last item in the `exception.values` list). +Other exceptions will have type `"chained"` to indicate that they are a chained exception stemming from the root exception. + +For purposes of issue grouping, `"chained"` mechanism typed should be excluded. Additionally, the mechanism type of the root exception must +_always_ be considered as part of issue grouping, even if the rest of the exception group is being ignored. + +## Issue Titles + +As a side-effect of issue grouping, issues will be titled (and subtitled) based on the top-most exception that is not ignored from +the grouping. In other words, if there is more than one distinct top-level exception, the issue will be titled by the exception group itself. In the above examples, the first issue would be titled as `ExceptionGroup`, and the second issue would be titled as `ValueError`. + +However, the `mechanism.handled` field (which determines `error.handled` and `error.unhandled` event attributes in Sentry) should always be +taken from the root exception, even if the title and subtitle are derived from one of the chained exceptions. +This is because SDKs are only expected to supply the `mechanism.handled` field on the root exception. + +## Sentry UI Changes + +The Issue Details page will be updated to improve usability of exception groups. The exact details are at the discretion of the design team, +however the following should be considered: + +- A condensed tree-like visualization of the exception group should be added somewhere on the page. + Each exception in the tree should have an in-page link to jump to that exception and ensure it is expanded. +- Some exceptions should be collapsed by default, including any where `mechanism.is_exception_group === true`, and perhaps others. +- The `mechanism.source` field, if available, should be displayed on each exception in the exceptions section. +- We may want to include a way to navigate from each exception to its parent exception, or back to the exception group. + +# Drawbacks + +- Modifying the way Exceptions are sent to Sentry will affect existing issue grouping rules + that customers may have set up. This change could create new alerts when first deployed. + +- The design proposed above retains backwards compatibility with older versions of Sentry. + However, without the proposed UI changes, previous versions (self-hosted, etc.) of Sentry + will treat the exceptions list as if they were all one long chain of direct exceptions. + This could be a bit confusing to the user, until such time they upgrade their Sentry + instance to a version that includes the UI and issue grouping changes. + +# Other Options Considered + +We considered the following, each had pitfalls that led to the plan described above. + +## Do Nothing + +This would mean leaving things the way they currently are. + +Pros: + +- Nothing to do. + +Cons: + +- Overall, reduced ability to use Sentry for error monitoring, as the usage of exception groups increases. +- Events created by the .NET SDK have several problems for exception groups (`AggregateException`), such as: + - There's no structure represented by the chain of events, so every relationship appears as parent/child, even + those that should actually be siblings. + - In some cases, some stack traces are relocated from the exception group to the first child exception, + otherwise no code location will be represented. Doing so grossly misrepresents the true nature of the + exception caused at the highlighted stack frame. + - In other cases, there are already stack traces on both the exception group and the + first child exception. Thus the location of the exception group is lost completely. + - The `KeepAggregateException` option is global for the entire application, and can't be adjusted on a case-by-case basis. +- Issues created by other SDKs such as Python and JavaScript are not prepared to deal with exception groups at all. + - Issues are always titled and grouped by the exception group, even when there's only a single type of exception contained within. + - Because none of the items in the `exceptions` or `errors` lists are part of the _cause_, they're currently not passed to Sentry at all. + This makes it impossible to identify the actual cause of an exception raised via an exception group. + +## Sending Hierarchical Data + +This approach was seriously considered. It would involve creating a new tree-like data structure that more closely +resembles the original tree of exceptions. It would have been placed on either a new `exception_group` interface, +or added to the existing `contexts` or `extra` collections. + +Pros: + +- The event would contain a more direct representation of the exception data. +- Less work for the SDKs. + +Cons: + +- Much of the server-side processing would have to be reconsidered, including relay, symbolication, and trimming. +- It would not be backwards compatible, without duplicating significant data into the exceptions list anyway. + +## Sending One Exception Chain Only + +This approach would involve not capturing the entire exception group, but trying to determine which top-level +exception was worth capturing, from within the SDK. + +Pros: + +- Fully compatible with existing Sentry, without any changes to grouping rules or UI. +- Fully backwards compatible as well. + +Cons: + +- Potential to loose a lot of useful data. +- Misrepresents the exception that was actually raised. +- Looses track of the actual location in source code where the exception group was raised. + +## Splitting Events in the SDK + +This approach would involve the SDK sending multiple separate events for each top-level exception in an exception group. + +Pros: + +- All exceptions would come through at once. +- No server-side processing would need to be performed. + +Cons: + +- Too much duplicate date is sent from the SDK at run time. +- It can quickly exceed the SDKs internal maximum queue length. +- It can trigger rate limits and spike-protection mechanisms. + +## Splitting Events in Relay + +This approach would involve the SDK sending one event containing the exception group, and relying on Sentry Relay +to split out top-level exceptions into separate events. +It would require the creation of a new `exception_group` interface, placed directly on the incoming event. + +Pros: + +- One event sent from the SDK, so none of the cons involved with splitting in the SDK. + +Cons: + +- Could be very CPU intensive. +- Too much business logic. +- Could back up overall ingestion throughput. +- Would not be backwards compatible with the existing event schema. + +## Using Synthetic Exceptions + +This approach would set `mechanism.synthetic:true` on exception groups types, to attempt to keep them from +being considered during issue grouping. + +Pros: + +- If it worked, issue grouping would need less adjustment. + +Cons: + +- It doesn't work for this use case. + - Title of issue would be incorrect, referring to the first in-app frame of the exception group. + - Issue grouping would be incorrect, including details of the exception group.