Set and distribution metrics are currently sent as float arrays in JSON, which causes significant overhead in various parts of the system.
The float arrays are expensive to parse, and consumers that are not interested in the individual values still need to parse each float (without a specialised JSON parser). In practice the float arrays are also a very inefficient encoding, requiring much more than 8 bytes (f64) per value.
All array values are converted to little-endian byte order before compression and encoding.
To increase overall throughput we want to use a more efficient encoding of the float arrays for set and distribution metrics.
Before switching from JSON to a binary format like CBOR we want to explore different ways to encode and compress the data with JSON first.
Benchmarks regarding throughput and compression have been collected in getsentry/bucket-compression.
We identified zstd as a very good general-purpose algorithm that should give us good wins across the board. To encode binary (zstd-compressed) data we will use Base64 (without padding).
Note: zstd is used in its streaming mode.
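A minimal sketch of the encoding pipeline described above (the function name is ours; the zstd step is shown only as a comment, since it requires a third-party library such as `zstandard`):

```python
import base64
import struct

def encode_distribution(values: list[float]) -> str:
    """Pack f64 values as little-endian bytes, then Base64-encode them.

    This yields the "base64" format; for the "zstd" format the packed
    bytes would additionally be zstd-compressed (streaming mode) before
    the Base64 step.
    """
    raw = struct.pack(f"<{len(values)}d", *values)  # little-endian f64
    # For "zstd": raw = zstandard.ZstdCompressor().compress(raw)
    return base64.b64encode(raw).decode("ascii").rstrip("=")  # no padding
```

For example, `encode_distribution([3.0, 1.0, 2.0])` produces exactly the Base64 distribution payload used in the test section below.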
Proposed schema change for distribution and set values (`value` field):
```json
{
  "oneOf": [
    { "type": "array", "items": { "type": "number" } },
    {
      "type": "object",
      "properties": {
        "format": { "const": "array" },
        "data": { "type": "array", "items": { "type": "number" } }
      }
    },
    {
      "type": "object",
      "properties": {
        "format": { "const": "base64" },
        "data": { "type": "string" }
      }
    },
    {
      "type": "object",
      "properties": {
        "format": { "const": "zstd" },
        "data": { "type": "string" }
      }
    }
  ]
}
```
Examples:

```json
{
  "org_id": 1,
  "project_id": "12345...",
  "timestamp": 1615889440,
  "width": 10,
  "name": "endpoint.response_time",
  "tags": {
    "route": "user_index"
  },
  "type": "d",
  "value": {"format": "zstd", "data": "<base64>"}
}
```

```json
{
  "name": "endpoint.response_time",
  ...
  "type": "s",
  "value": {"format": "array", "data": [13.37, 42, 3.14159265358979323846264338327950288]}
}
```
Note: We want to stay compatible with the current JSON float array values, since we have components which write directly to Kafka and do not use Relay as their ingestion path.
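To illustrate this compatibility requirement, a consumer-side decoder could accept both the legacy plain array and the new tagged objects. This is a sketch under our own naming; the `zstd` branch is stubbed out because it needs a third-party zstd library:

```python
import base64
import struct

def decode_distribution_value(value) -> list[float]:
    """Decode a distribution "value" field in any of the supported shapes."""
    if isinstance(value, list):          # legacy plain float array
        return value
    fmt, data = value["format"], value["data"]
    if fmt == "array":                   # new schema, values still inline
        return data
    if fmt == "base64":                  # packed little-endian f64, Base64
        raw = base64.b64decode(data + "=" * (-len(data) % 4))  # tolerate missing padding
        return list(struct.unpack(f"<{len(raw) // 8}d", raw))
    if fmt == "zstd":
        # Would additionally zstd-decompress before unpacking, e.g. with
        # zstandard.ZstdDecompressor() (third-party library).
        raise NotImplementedError("zstd decoding needs a zstd library")
    raise ValueError(f"unknown format: {fmt}")
```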
## Milestone 1 - New JSON Schema
As a first milestone we want to support the new JSON schema with the `array` format. This does not change the encoding of the values yet.
We will have to add support for the new format in multiple systems:
### Relay
- [ ] https://github.com/getsentry/relay/pull/3137
- [ ] https://github.com/getsentry/sentry/pull/65410
### Consumers
- [x] Rework Schema Validation
- [x] Increase observability with additional metrics
- [x] Check Last-Seen-Updater
- [ ] https://github.com/getsentry/sentry-kafka-schemas/pull/222
### Snuba
- [ ] https://github.com/getsentry/snuba/pull/5560
- [ ] https://github.com/getsentry/snuba/pull/5617
- [x] Tweak DLQ Configuration
- [x] Additional Observability. Messages in the new format / time to decode a message
## Milestone 2 - base64 and zstd
Add support for the `base64` and `zstd` encodings.
### Relay
- [ ] https://github.com/getsentry/relay/pull/3218
- [ ] https://github.com/getsentry/sentry/pull/66588
- [ ] https://github.com/getsentry/relay/pull/3252
### Snuba
- [x] Add support for the `zstd` format
- [ ] https://github.com/getsentry/snuba/pull/5761
## Milestone X - Future
- [ ] Update other producers which do not ingest via Relay
- [ ] Deprecate old format and/or remove support for the old format
- [ ] Add support for additional compressions (lossy?)
- [ ] Switch to a binary protocol like CBOR (instead of JSON)
- [ ] Relay: Use optimized JSON format for the metrics bulk endpoint
- [ ] Relay: Switch between different compressions based on the bucket contents (e.g. small buckets vs. big buckets etc.)
## Test
### Base64
#### Base64 - Distribution
Expected distribution values: `[3, 1, 2]`.

```json
{
  "name": "d:transactions/foo@none",
  "org_id": 0,
  "project_id": 42,
  "retention_days": 90,
  "tags": {},
  "timestamp": 1712219392,
  "type": "d",
  "value": {
    "data": "AAAAAAAACEAAAAAAAADwPwAAAAAAAABA",
    "format": "base64"
  }
}
```
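The Base64 payload above can be checked by hand with the standard library alone: it is three little-endian f64 values.

```python
import base64
import struct

raw = base64.b64decode("AAAAAAAACEAAAAAAAADwPwAAAAAAAABA")
values = struct.unpack(f"<{len(raw) // 8}d", raw)  # little-endian f64
print(list(values))  # [3.0, 1.0, 2.0]
```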
#### Base64 - Set
Expected set values: `{1, 7}`.

```json
{
  "name": "s:transactions/bar@none",
  "org_id": 0,
  "project_id": 42,
  "retention_days": 90,
  "tags": {},
  "timestamp": 1712219392,
  "type": "s",
  "value": {
    "data": "AQAAAAcAAAA=",
    "format": "base64"
  }
}
```
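The set payload is 8 bytes for two members, so set values appear to be encoded as little-endian u32 rather than f64 (an inference from the fixture size, not stated explicitly above):

```python
import base64
import struct

raw = base64.b64decode("AQAAAAcAAAA=")
members = set(struct.unpack(f"<{len(raw) // 4}I", raw))  # little-endian u32
print(members)  # {1, 7}
```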
### Zstd
#### Zstd - Distribution
Expected distribution values: `[1, 2, 3]`.

```json
{
  "name": "d:transactions/foo@none",
  "org_id": 0,
  "project_id": 42,
  "retention_days": 90,
  "tags": {},
  "timestamp": 1712219148,
  "type": "d",
  "value": {
    "data": "KLUv/QBYrQAAcAAA8D8AQAAAAAAAAAhAAgBgRgCw",
    "format": "zstd"
  }
}
```
#### Zstd - Set
Expected set values: `{1, 7}`.

```json
{
  "name": "s:transactions/bar@none",
  "org_id": 0,
  "project_id": 42,
  "retention_days": 90,
  "tags": {},
  "timestamp": 1712219148,
  "type": "s",
  "value": {
    "data": "KLUv/QBYQQAAAQAAAAcAAAA=",
    "format": "zstd"
  }
}
```
## Rollout Plan (for each Milestone)
- Roll out each service independently with support for both the old and the new JSON encoding.
- S4S: Enable the new format by letting Relay send the updated JSON, staggered for each namespace independently (start with `custom`, end with `transactions`).
- SaaS: Enable the new format, starting with `custom` and ending with `transactions` (see Step 2).
## Rollback
The option to stop Relay from sending the new format can be rolled back immediately; Relay will stop sending the new format within 10 seconds.