20 changes: 15 additions & 5 deletions docs/ingestion/data-formats.md
@@ -564,7 +564,7 @@ For example:
### Kafka

The `kafka` input format lets you parse the Kafka metadata fields in addition to the Kafka payload value contents.
It should only be used when ingesting from Apache Kafka.
It should only be used when ingesting from Apache Kafka.

The `kafka` input format wraps around the payload parsing input format and augments the data it outputs with the Kafka event timestamp, topic name, event headers, and the key field that itself can be parsed using any available input format.

@@ -583,6 +583,8 @@ Configure the Kafka `inputFormat` as follows:
| `headerFormat` | Object | Specifies how to parse the Kafka headers. Supports String types. Because Kafka header values are bytes, the parser decodes them as UTF-8 encoded strings. To change this behavior, implement your own parser based on the encoding style. Change the `encoding` type in `KafkaStringHeaderFormat` to match your custom implementation. See [Header format](#header-format) for supported encoding formats.| no ||
| `keyFormat` | [InputFormat](#input-format) | The [input format](#input-format) to parse the Kafka key. It only processes the first entry of the `inputFormat` field. If your key values are simple strings, you can use the `tsv` format to parse them. Note that for `tsv`, `csv`, and `regex` formats, you need to provide a `columns` array to make a valid input format. Only the first column is used, and its name is ignored in favor of `keyColumnName`. | no ||
| `keyColumnName` | String | The name of the column for the Kafka key.| no |`kafka.key`|
| `partitionColumnName` | String | The name of the column for the Kafka partition number. | no | `kafka.partition` |
| `offsetColumnName` | String | The name of the column for the Kafka record offset. Ingesting this column enables filtering by offset in `transformSpec`, which is useful for recovering data from a specific offset range. | no | `kafka.offset` |
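
As a rough sketch of the offset filtering mentioned for `offsetColumnName`, a `transformSpec` could keep only the rows whose offset falls inside a given range. The bounds `12000` and `13000` are placeholder values, and the sketch assumes the default column name `kafka.offset` and Druid's `bound` filter with numeric ordering. Treat it as an illustration rather than the only way to express the filter:

```json
{
  "transformSpec": {
    "filter": {
      "type": "bound",
      "dimension": "kafka.offset",
      "lower": "12000",
      "upper": "13000",
      "ordering": "numeric"
    }
  }
}
```

If you rename the column with `offsetColumnName`, the `dimension` value in the filter needs to match the new name.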

#### Header format

@@ -604,6 +606,8 @@ For example, consider the following structure for a Kafka message that represent

- **Kafka timestamp**: `1680795276351`
- **Kafka topic**: `wiki-edits`
- **Kafka partition**: `0`
- **Kafka offset**: `12345`
- **Kafka headers**:
- `env=development`
- `zone=z1`
@@ -632,6 +636,8 @@ You would configure it as follows:
"columns": ["x"]
},
"keyColumnName": "kafka.key",
"partitionColumnName": "kafka.partition",
"offsetColumnName": "kafka.offset"
}
}
```
@@ -649,7 +655,9 @@ You would parse the example message as follows:
"kafka.topic": "wiki-edits",
"kafka.header.env": "development",
"kafka.header.zone": "z1",
"kafka.key": "wiki-edit"
"kafka.key": "wiki-edit",
"kafka.partition": 0,
"kafka.offset": 12345
}
```

@@ -734,16 +742,18 @@ After Druid ingests the data, you can query the Kafka metadata columns as follow
SELECT
"kafka.header.env",
"kafka.key",
"kafka.partition",
"kafka.offset",
"kafka.timestamp",
"kafka.topic"
FROM "wikiticker"
```

This query returns:

| `kafka.header.env` | `kafka.key` | `kafka.timestamp` | `kafka.topic` |
|--------------------|-----------|---------------|---------------|
| `development` | `wiki-edit` | `1680795276351` | `wiki-edits` |
| `kafka.header.env` | `kafka.key` | `kafka.partition` | `kafka.offset` | `kafka.timestamp` | `kafka.topic` |
|--------------------|-------------|-------------------|----------------|-------------------|---------------|
| `development` | `wiki-edit` | `0` | `12345` | `1680795276351` | `wiki-edits` |

### Kinesis

8 changes: 6 additions & 2 deletions docs/ingestion/ingestion-spec.md
@@ -188,9 +188,13 @@ Treat `__time` as a millisecond timestamp: the number of milliseconds since Jan
The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
configuring [dimensions](./schema-model.md#dimensions).

You can either manually specify the dimensions or take advantage of schema auto-discovery where you allow Druid to infer all or some of the schema for your data. This means that you don't have to explicitly specify your dimensions and their type.
You can either manually specify the dimensions or take advantage of type-aware schema auto-discovery where you allow Druid to infer all or some of the schema for your data. This means that you don't have to explicitly specify your dimensions and their type.

To use schema auto-discovery, set `useSchemaDiscovery` to `true`.
:::caution
When using type-aware schema auto-discovery, Druid discovers the type for all dimensions unless you use the `dimensionExclusions` field to explicitly specify dimensions to ignore. This helps you control storage costs by preventing Druid from unintentionally ingesting dimensions.
:::

To use type-aware schema auto-discovery, set `useSchemaDiscovery` to `true`.

Alternatively, you can use the string-based schemaless ingestion where any discovered dimensions are treated as strings. To do so, leave `useSchemaDiscovery` set to `false` (default). Then, set the dimensions list to empty or set the `includeAllDimensions` property to `true`.
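
A minimal sketch of the string-based schemaless option described above, assuming no dimensions are declared explicitly. `useSchemaDiscovery` is written out only for clarity, since `false` is the default:

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": false,
    "dimensions": []
  }
}
```

With this fragment, any dimensions Druid discovers are treated as strings.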

9 changes: 5 additions & 4 deletions docs/ingestion/schema-design.md
@@ -249,12 +249,13 @@ Druid can infer the schema for your data in one of two ways:

#### Type-aware schema discovery

:::info
Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns.
:::

You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list.

Before you use type-aware schema discovery, keep the following in mind:

- There may be an impact on downstream BI tools depending on how they handle ARRAY-typed columns.
- Be aware of all the potential dimensions. Druid discovers all available dimensions unless you specify an exclusion list. Without an exclusion list, you may ingest more columns than you intend. For example, if you use type-aware schema discovery and the Kafka input format, Druid discovers dimensions like the Kafka offset and partition unless you add them to the exclusion list.
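
To illustrate the exclusion list mentioned above, the following fragment sketches a `dimensionsSpec` that enables type-aware schema discovery while leaving out the Kafka partition and offset columns. It assumes the Kafka input format uses its default column names, `kafka.partition` and `kafka.offset`. Substitute your own names if you changed them:

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensionExclusions": ["kafka.partition", "kafka.offset"]
  }
}
```

Keep in mind that excluding `kafka.offset` from the dimensions means the column is not stored, so you cannot query it later.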

When performing type-aware schema discovery, Druid can discover all the columns of your input data (that are not present in
the exclusion list). Druid automatically chooses the most appropriate native Druid type among `STRING`, `LONG`,
`DOUBLE`, `ARRAY<STRING>`, `ARRAY<LONG>`, `ARRAY<DOUBLE>`, or `COMPLEX<json>` for nested data. For input formats with