From 2ade5dddf30ba917cae83d041f06bc57759cd57a Mon Sep 17 00:00:00 2001
From: Matt Johnson-Pint <mattjohnsonpint@gmail.com>
Date: Fri, 7 Apr 2023 09:22:53 -0700
Subject: [PATCH] rfc(feature): Exception Groups (#79)

Co-authored-by: Abhijeet Prasad <aprasad@sentry.io>
---
 README.md                     |   1 +
 text/0079-exception-groups.md | 622 ++++++++++++++++++++++++++++++++++
 2 files changed, 623 insertions(+)
 create mode 100644 text/0079-exception-groups.md

diff --git a/README.md b/README.md
index a5decae6..07b3c33d 100644
--- a/README.md
+++ b/README.md
@@ -39,4 +39,5 @@ This repository contains RFCs and DACIs. Lost?
 - [0073-usage-of-transaction-types](text/0073-usage-of-transaction-types.md): Usage of transaction types
 - [0074-source-context-via-links](text/0074-source-context-via-links.md): Source context via links
 - [0078-escalating-issues](text/0078-escalating-issues.md): Escalating Issues
+- [0079-exception-groups](text/0079-exception-groups.md): Exception Groups
 - [0080-issue-states](text/0080-issue-states.md): Issue States
diff --git a/text/0079-exception-groups.md b/text/0079-exception-groups.md
new file mode 100644
index 00000000..da2eb81d
--- /dev/null
+++ b/text/0079-exception-groups.md
@@ -0,0 +1,622 @@
+- Start Date: 2023-03-26
+- RFC Type: feature
+- RFC PR: https://github.com/getsentry/rfcs/pull/79
+- RFC Status: approved
+
+# Summary
+
+This RFC is intended to capture the changes required to Sentry to handle *exception groups*,
+including changes to SDKs, protocol schema, event processing, and UI.
+
+### Important
+
+This RFC is `approved`, however SDKs should not release this feature until the back-end work is completed.
+The status will be changed to `active` at that time.
+
+If desired, SDK teams _may_ implement this in an unmerged draft PR if they choose, or release in "preview" form.
+
+# Motivation
+
+Several programming languages have a concept of an unrelated group of multiple exception, aggregated into a single exception.
+
+- Python [`ExceptionGroup`](https://docs.python.org/3/library/exceptions.html#exception-groups) ([PEP 654](https://peps.python.org/pep-0654/))
+- .NET [`AggregateException`](https://docs.microsoft.com/dotnet/api/system.aggregateexception)
+- JavaScript [`AggregateError`](https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/AggregateError)
+- Go [`Unwrap() []error`](https://pkg.go.dev/errors) ([see also](https://tip.golang.org/doc/go1.20#errors))
+- Java [`Throwable.getSuppressed()`](https://docs.oracle.com/javase/7/docs/api/java/lang/Throwable.html#getSuppressed())
+
+There may be others. We will use the term "exception group" throughout the design.
+It applies to the concept, regardless of language.
+
+Sentry needs a way to capture exception groups, and present them in a meaningful way.
+Simply capturing the exception group by itself is insufficient, and current workarounds (in .NET) are problematic.
+
+See also:
+- [Python user feedback](https://github.com/getsentry/sentry/issues/37716)
+- [.NET user feedback](https://github.com/getsentry/sentry-dotnet/issues/270)
+- [JavaScript user feedback](https://github.com/getsentry/sentry-javascript/issues/5469)
+
+# Background
+
+## About Exception Groups
+
+An exception group is an exception unto itself.  It may have been caused by the exceptions in the group,
+but there is no implied causal relationship between those exceptions.
+
+In other words, given the following:
+
+- Group A
+  - Exception 1
+  - Exception 2
+  
+`Group A` is an exception group that may have been caused by either `Exception 1`, `Exception 2`, or both of them.
+However, it is generally _not_ true that `Exception 1` was caused by `Exception 2` or vice versa.
+
+Furthermore:
+- `Exception 1` and `Exception 2` might be of the same type, or they might be of different types.
+- There can be `n` number of exceptions in an exception group (`n >= 1`).
+- There can be a stack trace on each of the exceptions within an exception group, as well as on the group itself.
+- Just like any other exception, each exception within an exception group can have a chain of inner exceptions.
+- An inner exception can *also* be an exception group.
+- Exception groups can be present at any level.  There is no requirement that the start of the chain is an exception group.
+
+Thus, other valid examples that could occur in an application include the following:
+
+- Group B
+  - Exception 1
+    - Exception 2
+      - Exception 3
+
+- Group C
+  - Exception 1
+    - Exception 1a
+      - Exception 1b
+  - Exception 2
+    - Exception 2a
+      - Exception 2b
+
+- Group D
+  - Exception 1
+    - Exception 1a
+      - Exception 1b
+  - Group E
+    - Exception 2
+      - Exception 2a
+        - Exception 2b
+    - Exception 3
+      - Exception 3a
+        - Exception 3b
+
+- Exception 1
+  - Exception 1a
+    - Group E
+      - Exception 2
+        - Exception 2a
+          - Exception 2b
+      - Exception 3
+        - Exception 3a
+          - Exception 3b
+
+## Interpreting Exception Groups
+
+The meaning of a _normal_ exception is fairly straightforward:
+- An exception is something that went wrong in the application, and thus represents an issue to resolve.
+- If there is another exception that occurred to cause this one, that is assigned to the "inner exception" or "cause".
+- The chain of exceptions thus is linear.
+
+This gets more complicated when an exception group is involved:
+- The exception group _might_ represent the issue to resolve, or the issue might better be represented by one or more of the inner exceptions within the group.
+- Depending on language, there might be a cause that is separate from the _any_ of the exceptions in the group.
+- Thus, the chain of exceptions is more tree-like than linear.
+
+## Exception Handling in Sentry
+
+SDKs send exceptions to Sentry using the [Exception Interface](https://develop.sentry.dev/sdk/event-payloads/exception/)
+on an event. One or more exceptions can be sent in the `values` array of the interface.  Multiple values represent
+a chain of exceptions in a causal relationship, sorted from oldest to newest.  For example:
+
+```json
+{
+  "exception": {
+    "values": [
+      {"type": "TypeError", "value": "Invalid Type!"},
+      {"type": "ValueError", "value": "Invalid Value!"},
+      {"type": "RuntimeError", "value": "Something went wrong!"}
+    ]
+  }
+}
+```
+
+In the above example, an issue is created in Sentry for the `RuntimeError` exception.
+
+```
+Issue Title: "RuntimeError: Something went wrong!"
+
++==============================
++ RuntimeError: Something went wrong!
++------------------------------
++ (stack trace)
++==============================
+
++==============================
++ ValueError: Invalid Value!
++------------------------------
++ (stack trace)
++==============================
+
++==============================
++ TypeError: Invalid Type!
++------------------------------
++ (stack trace)
++==============================
+```
+
+This design is linear, and thus cannot support exception groups having more than one exception.
+But even in that case, the exception presented in Sentry might be titled and grouped in an undesirable manner.
+
+Consider:
+
+```json
+{
+  "exception": {
+    "values": [
+      {"type": "TypeError", "value": "Invalid Type!"},
+      {"type": "ValueError", "value": "Invalid Value!"},
+      {"type": "RuntimeError", "value": "Something went wrong!"},
+      {"type": "ExceptionGroup", "value": "Exception Group (1 sub-exception)"}
+    ]
+  }
+}
+```
+
+Sending this to Sentry would result in an issue titled `"ExceptionGroup: Exception Group (1 sub-exception)"`.
+In some contexts that may be desired, but in others the expectation is that the issue would be titled `"RuntimeError: Something went wrong!"`.
+
+# Language Specifics
+
+## .NET
+
+The exception group type in .NET is [`AggregateException`](https://learn.microsoft.com/dotnet/api/system.aggregateexception).
+- The exceptions of the group are stored in the `InnerExceptions` property.
+- Like other exceptions, it also has an `InnerException` property, which is interpreted as the cause of this exception.
+  - Its value is always the same as `InnerExceptions[0]`.
+
+## Python
+
+The exception group type in .Python is [`ExceptionGroup`](https://docs.python.org/3/library/exceptions.html#exception-groups).
+- The exceptions of the group are stored in the `exceptions` attribute.
+- Like other exceptions, it can have a `__cause__` and/or a `__context__` attribute.
+  - `__context__` is an indirect cause, assigned if another exception occurs while handling the exception.
+  - `__cause__` is a direct cause, assigned if raised with the exception (using the `from` keyword).
+    - Setting it suppresses any `__context__` value, when displayed in the stack trace.
+  - There is no requirement that `exceptions[0]` be either of these.
+
+## JavaScript
+
+The exception group type in JavaScript is [`AggregateError`](https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/AggregateError).
+- The errors of the group are stored in the `errors` property.
+- Like other errors, it has a `cause` (singular) property.
+  - There is no requirement that `errors[0]` be the same as `cause`.
+
+# Proposed Solution
+
+**NOTE**: _This has changed significantly from prior versions of the draft RFC._
+
+We will focus having the SDKs capture all available information, and adjust the issue grouping rules in Sentry to use that new information.
+
+**The SDKs will do the following:**
+
+- Capture the entire tree of exceptions represented by the exception group.
+- Capture any other chained exceptions that are not part of the exception group.
+- Use the existing `exception` value of the Sentry event.
+- Add a few new fields to each exception's `mechanism` data, as described below.
+
+**Sentry will do the following:**
+
+- Take the new mechanism fields into account when grouping issues.  See [Sentry Issue Grouping](#sentry-issue-grouping) below for more details.
+- Present a user interface that depicts the structure of the exception group on the issue details page.  See [Sentry UI Changes](#sentry-ui-changes) below for more details.
+
+## Important Notice
+
+The new grouping rules _MUST_ be fully implemented in Sentry.io and a published version of self-hosted
+Sentry before any SDK can release a non-preview version that includes these changes.  The SDK's release notes
+should mention the required version of Sentry.
+
+Although the protocol changes are backwards compatible, the current problems
+with exception groups will persist or be exacerbated without the new grouping rules in place.
+
+SDKs that currently require the user to opt-in to send chained exceptions can continue to do so without side-effects.
+SDKs that send chained exceptions _by default_ may see an immediate change to exception grouping after implementing this feature.
+
+## New Mechanism Fields
+
+The [Exception Mechanism Interface](https://develop.sentry.dev/sdk/event-payloads/exception/#exception-mechanism)
+will have the following new fields added:
+
+### `source`
+
+An optional string value describing the source of the exception.
+
+- The SDK should populate this with the name of the property or attribute of the parent exception that this exception
+  was acquired from.  In the case of an array, it should include the zero-based array index as well.
+- Python Examples: `"__context__"`, `"__cause__"`, `"exceptions[0]"`, `"exceptions[1]"`
+- .NET Examples:  `"InnerException"`, `"InnerExceptions[0]"`, `"InnerExceptions[1]"`
+- JavaScript Examples: `"cause"`, `"errors[0]"`, `"errors[1]"`
+
+### `is_exception_group`
+
+An optional boolean value, set `true` when the exception is the exception group type specific to the platform or language.
+The default is `false` when omitted.
+
+- For example, exceptions of type `ExceptionGroup` (Python), `AggregateException` (.NET), and `AggregateError` (JavaScript)
+  should have `"is_exception_group": true`.  Other exceptions can omit this field.
+
+### `exception_id`
+
+An optional numeric value providing an ID for the exception relative to this specific event.
+
+- The SDK should assign simple incrementing integers to each exception in the tree, starting with `0` for the root of the tree.
+  In other words, when flattened into the list provided in the `exception` values on the event, the last exception
+  in the list should have ID `0`, the previous one should have ID `1`, the next previous should have ID `2`, etc.
+
+### `parent_id`
+
+An optional numeric value pointing at the `exception_id` that is the parent of this exception.
+
+- The SDK should assign this to all exceptions _except_ the root exception (the last to be listed in the `exception` values).
+
+## Interpretation
+
+The `exception_id` and `parent_id` fields work in conjunction to represent the hierarchical nature of the tree of exceptions.
+If not provided, the previous interpretation will be assumed - which is that each exception in the list of `exception` values
+is a child of the one immediately following it in the list.
+
+The [Exception Interface](https://develop.sentry.dev/sdk/event-payloads/exception/) will not change in structure,
+but it will change in _interpretation_ with regard to multiple exception values.
+
+- The previous interpretation was: _"Multiple values represent chained exceptions."_.
+- The new interpretation will be: _"Multiple values are related by the optional `mechanism.exception_id`
+  and `mechanism.parent_id` fields. When not present, multiple values represent chained exceptions."_
+
+
+## Additional SDK Requirements
+
+### Mechanism Type
+
+When setting the `mechanism.type` field, SDKs should use the following guidelines:
+
+- For the root exception (the last to be in the `exception.values` list), set `mechanism.type` to the name
+  of the integration that produced the exception (as was the case before this proposal).  If the exception was
+  captured manually, set the `mechanism.type` to `"generic"`.
+
+- For all other exceptions in the list, set the `mechanism.type` to `"chained"`.  This will indicate that the exception
+  is part of the chain of exceptions stemming from the root exception (regardless of whether it is in an exception group or not).
+
+Do not omit the `mechanism.type` field, nor send it empty or null.
+
+### Exception Value
+ 
+When setting the `value` field of the exception for an exception group type, SDKs should only deliver the meaningful part
+of the exception message, excluding any string that may have been automatically added by their platform.
+
+For example:
+
+- In Python, the `value` field should not contain details such as `" (2 sub-exceptions)"`.  Use the `message` attribute to get the raw message from an `ExceptionGroup`.  If there is no message, omit the `value` and just send `type`.
+
+- In .NET, the `value` field should not contain details such as `" (Exception 1) (Exception 2)"`.  The `Message` property may need to be modified to remove the appended inner messages.
+
+### Keep Aggregate Exceptions
+
+The .NET SDK previously had implemented an option called `KeepAggregateExceptions`.  This flag should be deprecated,
+in favor of _always_ sending the entire chain of aggregate exceptions as explained in this SDK.  This may affect
+existing issue grouping, and should be noted in the change log when released.
+
+Other SDKs should _not_ implement a similar option.
+
+## Example Event
+
+Given the Python code:
+
+```python
+try:
+  raise RuntimeError("something")
+except:
+  raise ExceptionGroup("nested",
+    [
+      ValueError(654),
+      ExceptionGroup("imports",
+        [
+          ImportError("no_such_module"),
+          ModuleNotFoundError("another_module"),
+        ]
+      ),
+      TypeError("int"),
+    ]
+  )
+```
+
+The event would contain:
+
+```json
+{
+  "exception": {
+    "values": [
+      {
+        "type": "TypeError",
+        "value": "int",
+        "mechanism": {
+          "type": "chained",
+          "source": "exceptions[2]",
+          "exception_id": 6,
+          "parent_id": 0
+        }
+      },
+      {
+        "type": "ModuleNotFoundError",
+        "value": "another_module",
+        "mechanism": {
+          "type": "chained",
+          "source": "exceptions[1]",
+          "exception_id": 5,
+          "parent_id": 3
+        }
+      },
+      {
+        "type": "ImportError",
+        "value": "no_such_module",
+        "mechanism": {
+          "type": "chained",
+          "source": "exceptions[0]",
+          "exception_id": 4,
+          "parent_id": 3
+        }
+      },
+      {
+        "type": "ExceptionGroup",
+        "value": "imports",
+        "mechanism": {
+          "type": "chained",
+          "source": "exceptions[1]",
+          "is_exception_group": true,
+          "exception_id": 3,
+          "parent_id": 0
+        }
+      },
+      {
+        "type": "ValueError",
+        "value": "654",
+        "mechanism": {
+          "type": "chained",
+          "source": "exceptions[0]",
+          "exception_id": 2,
+          "parent_id": 0
+        }
+      },
+      {
+        "type": "RuntimeError",
+        "value": "something",
+        "mechanism": {
+          "type": "chained",
+          "source": "__context__",
+          "exception_id": 1,
+          "parent_id": 0
+        }
+      },
+      {
+        "type": "ExceptionGroup",
+        "value": "nested",
+        "mechanism": {
+          "type": "exceptionhook",
+          "handled": false,
+          "is_exception_group": true,
+          "exception_id": 0
+        }
+      },
+    ]
+  }
+}
+```
+
+**Reminder:** In .NET, `InnerException` is always the same as `InnerExceptions[0]`, thus it does not need to be reported separately.
+However, Python's `__cause__` and `__context__`, and JavaScript's `cause`, are independent and thus _should_ be reported separately
+if they have values.
+
+## Sentry Issue Grouping
+
+Issue grouping rules for exception groups are complex, because the nature of exception groups is that they may or may not represent
+more than one distinct issue.  While this may require some further experimentation to get right, the initial plan is as follows:
+
+First, determine the list of "top-level" exceptions.  These are the exceptions that represent distinct issues contained in
+the exception group.
+
+1. Start from the exception having `mechanism.exception_id:0`.
+2. If it has `mechanism.is_exception_group:true`, then recursively search each child.
+3. When reaching one where `mechanism.is_exception_group:false` (or not present), include it as a "top-level" exception,
+   and do not traverse any of its child exceptions.
+
+Next, determine from the top-level exceptions which of them would have been grouped together, had they been in separate events.
+
+- Apply grouping rules between the top-level exception to determine the distinct number of issues represented by the group.
+- For each top-level exception, only consider the first-path through any child exceptions.
+
+Finally:
+
+- If there is only one distinct group of top-level exceptions, group the event with other events based on that top-level exception only.
+  Ignore any parent exception groups.
+
+- If more than one distinct top-level exception exists, then group the event based on the parent exception group that they have in common.
+  This will often be the root-level exception group.
+
+As an example, consider simplified issue grouping rules that only considered the exception `type`.  When applied to an exception group such as:
+
+- `ExceptionGroup`
+  - `ValueError`
+  - `TypeError`
+  - `TypeError`
+
+There are two distinct top-level exceptions, `ValueError` and `TypeError`.  They have the `ExceptionGroup` in common.
+Thus the three exceptions considered for issue grouping are `ExceptionGroup`, `ValueError`, and the first `TypeError`.
+
+Now consider this example:
+
+- `ExceptionGroup`
+  - `ExceptionGroup`
+    - `ValueError`
+    - `ValueError`
+      - `TypeError`
+    - `ValueError`
+  - `ValueError`
+  - `ValueError`
+
+There are 5 top-level exceptions, all of type `ValueError`.  Thus, the event should only be grouped based on a single `ValueError`,
+and the others should be ignored for purposes of issue grouping.  That one of them has a chained `TypeError` is not relevant,
+at least not in this initial plan.
+
+A further modification to the plan might consider all possible branches of chained exceptions, but that is not proposed at this time.
+
+### Additional Issue Grouping Requirements
+
+As mentioned earlier, SDKs will set a meaningful `mechanism.type` on the root exception only (the last item in the `exception.values` list).
+Other exceptions will have type `"chained"` to indicate that they are a chained exception stemming from the root exception.
+
+For purposes of issue grouping, `"chained"` mechanism typed should be excluded.  Additionally, the mechanism type of the root exception must
+_always_ be considered as part of issue grouping, even if the rest of the exception group is being ignored.
+
+## Issue Titles
+
+As a side-effect of issue grouping, issues will be titled (and subtitled) based on the top-most exception that is not ignored from
+the grouping.  In other words, if there is more than one distinct top-level exception, the issue will be titled by the exception group itself.  In the above examples, the first issue would be titled as `ExceptionGroup`, and the second issue would be titled as `ValueError`.
+
+However, the `mechanism.handled` field (which determines `error.handled` and `error.unhandled` event attributes in Sentry) should always be
+taken from the root exception, even if the title and subtitle are derived from one of the chained exceptions.
+This is because SDKs are only expected to supply the `mechanism.handled` field on the root exception.
+
+## Sentry UI Changes
+
+The Issue Details page will be updated to improve usability of exception groups.  The exact details are at the discretion of the design team,
+however the following should be considered:
+
+- A condensed tree-like visualization of the exception group should be added somewhere on the page.
+  Each exception in the tree should have an in-page link to jump to that exception and ensure it is expanded.
+- Some exceptions should be collapsed by default, including any where `mechanism.is_exception_group === true`, and perhaps others.
+- The `mechanism.source` field, if available, should be displayed on each exception in the exceptions section.
+- We may want to include a way to navigate from each exception to its parent exception, or back to the exception group.
+
+# Drawbacks 
+
+- Modifying the way Exceptions are sent to Sentry will affect existing issue grouping rules
+  that customers may have set up.  This change could create new alerts when first deployed.
+
+- The design proposed above retains backwards compatibility with older versions of Sentry.
+  However, without the proposed UI changes, previous versions (self-hosted, etc.) of Sentry
+  will treat the exceptions list as if they were all one long chain of direct exceptions.
+  This could be a bit confusing to the user, until such time they upgrade their Sentry
+  instance to a version that includes the UI and issue grouping changes.
+
+# Other Options Considered
+
+We considered the following, each had pitfalls that led to the plan described above.
+
+## Do Nothing
+
+This would mean leaving things the way they currently are.
+
+Pros:
+
+- Nothing to do.
+
+Cons:
+
+- Overall, reduced ability to use Sentry for error monitoring, as the usage of exception groups increases.
+- Events created by the .NET SDK have several problems for exception groups (`AggregateException`), such as:
+  - There's no structure represented by the chain of events, so every relationship appears as parent/child, even
+    those that should actually be siblings.
+  - In some cases, some stack traces are relocated from the exception group to the first child exception,
+    otherwise no code location will be represented.  Doing so grossly misrepresents the true nature of the
+    exception caused at the highlighted stack frame.
+  - In other cases, there are already stack traces on both the exception group and the
+    first child exception.  Thus the location of the exception group is lost completely.
+  - The `KeepAggregateException` option is global for the entire application, and can't be adjusted on a case-by-case basis.
+- Issues created by other SDKs such as Python and JavaScript are not prepared to deal with exception groups at all.
+  - Issues are always titled and grouped by the exception group, even when there's only a single type of exception contained within.
+  - Because none of the items in the `exceptions` or `errors` lists are part of the _cause_, they're currently not passed to Sentry at all.
+    This makes it impossible to identify the actual cause of an exception raised via an exception group.
+
+## Sending Hierarchical Data
+
+This approach was seriously considered.  It would involve creating a new tree-like data structure that more closely
+resembles the original tree of exceptions.  It would have been placed on either a new `exception_group` interface,
+or added to the existing `contexts` or `extra` collections.
+
+Pros:
+
+- The event would contain a more direct representation of the exception data.
+- Less work for the SDKs.
+
+Cons:
+
+- Much of the server-side processing would have to be reconsidered, including relay, symbolication, and trimming.
+- It would not be backwards compatible, without duplicating significant data into the exceptions list anyway.
+
+## Sending One Exception Chain Only
+
+This approach would involve not capturing the entire exception group, but trying to determine which top-level
+exception was worth capturing, from within the SDK.
+
+Pros:
+
+- Fully compatible with existing Sentry, without any changes to grouping rules or UI.
+- Fully backwards compatible as well.
+
+Cons:
+
+- Potential to loose a lot of useful data.
+- Misrepresents the exception that was actually raised.
+- Looses track of the actual location in source code where the exception group was raised.
+
+## Splitting Events in the SDK
+
+This approach would involve the SDK sending multiple separate events for each top-level exception in an exception group.
+
+Pros:
+
+- All exceptions would come through at once.
+- No server-side processing would need to be performed.
+
+Cons:
+
+- Too much duplicate date is sent from the SDK at run time.
+- It can quickly exceed the SDKs internal maximum queue length.
+- It can trigger rate limits and spike-protection mechanisms.
+
+## Splitting Events in Relay
+
+This approach would involve the SDK sending one event containing the exception group, and relying on Sentry Relay
+to split out top-level exceptions into separate events.
+It would require the creation of a new `exception_group` interface, placed directly on the incoming event.
+
+Pros:
+
+- One event sent from the SDK, so none of the cons involved with splitting in the SDK.
+
+Cons:
+
+- Could be very CPU intensive.
+- Too much business logic.
+- Could back up overall ingestion throughput.
+- Would not be backwards compatible with the existing event schema.
+
+## Using Synthetic Exceptions
+
+This approach would set `mechanism.synthetic:true` on exception groups types, to attempt to keep them from
+being considered during issue grouping.
+
+Pros:
+
+- If it worked, issue grouping would need less adjustment.
+
+Cons:
+
+- It doesn't work for this use case.
+  - Title of issue would be incorrect, referring to the first in-app frame of the exception group.
+  - Issue grouping would be incorrect, including details of the exception group.