apache · 317brian · May 14, 2026 · May 4, 2026
diff --git a/docs/ingestion/data-formats.md b/docs/ingestion/data-formats.md
@@ -564,7 +564,7 @@ For example:
 ### Kafka
 
 The `kafka` input format lets you parse the Kafka metadata fields in addition to the Kafka payload value contents.
-It should only be used when ingesting from Apache Kafka.
+It should only be used when ingesting from Apache Kafka. 
 
 The `kafka` input format wraps around the payload parsing input format and augments the data it outputs with the Kafka event timestamp, topic name, event headers, and the key field that itself can be parsed using any available input format.
 
@@ -583,6 +583,8 @@ Configure the Kafka `inputFormat` as follows:
 | `headerFormat` | Object | Specifies how to parse the Kafka headers. Supports String types. Because Kafka header values are bytes, the parser decodes them as UTF-8 encoded strings. To change this behavior, implement your own parser based on the encoding style. Change the `encoding` type in `KafkaStringHeaderFormat` to match your custom implementation. See [Header format](#header-format) for supported encoding formats.| no ||
 | `keyFormat` | [InputFormat](#input-format) | The [input format](#input-format) to parse the Kafka key. It only processes the first entry of the `inputFormat` field. If your key values are simple strings, you can use the `tsv` format to parse them. Note that for `tsv`,`csv`, and `regex` formats, you need to provide a `columns` array to make a valid input format. Only the first one is used, and its name will be ignored in favor of `keyColumnName`. | no ||
 | `keyColumnName` | String | The name of the column for the Kafka key.| no |`kafka.key`|
+| `partitionColumnName` | String | The name of the column for the Kafka partition number. | no | `kafka.partition` |
+| `offsetColumnName` | String | The name of the column for the Kafka record offset. Ingesting this column enables filtering by offset in `transformSpec`, which is useful for recovering data from a specific offset range. | no | `kafka.offset` |
 
 #### Header format
 
@@ -604,6 +606,8 @@ For example, consider the following structure for a Kafka message that represent
 
 - **Kafka timestamp**: `1680795276351`
 - **Kafka topic**: `wiki-edits`
+- **Kafka partition**: `0`
+- **Kafka offset**: `12345`
 - **Kafka headers**:
   - `env=development`
   - `zone=z1`
@@ -632,6 +636,8 @@ You would configure it as follows:
       "columns": ["x"]
     },
     "keyColumnName": "kafka.key",
+    "partitionColumnName": "kafka.partition",
+    "offsetColumnName": "kafka.offset"
   }
 }
 ```
@@ -649,7 +655,9 @@ You would parse the example message as follows:
   "kafka.topic": "wiki-edits",
   "kafka.header.env": "development",
   "kafka.header.zone": "z1",
-  "kafka.key": "wiki-edit"
+  "kafka.key": "wiki-edit",
+  "kafka.partition": 0,
+  "kafka.offset": 12345
 }
 ```
 
@@ -734,16 +742,18 @@ After Druid ingests the data, you can query the Kafka metadata columns as follow
 SELECT
   "kafka.header.env",
   "kafka.key",
+  "kafka.partition",
+  "kafka.offset",
   "kafka.timestamp",
   "kafka.topic"
 FROM "wikiticker"
 ```
 
 This query returns:
 
-| `kafka.header.env` | `kafka.key` | `kafka.timestamp` | `kafka.topic` |
-|--------------------|-----------|---------------|---------------|
-| `development`      | `wiki-edit` | `1680795276351` | `wiki-edits`  |
+| `kafka.header.env` | `kafka.key` | `kafka.partition` | `kafka.offset` | `kafka.timestamp` | `kafka.topic` |
+|--------------------|-------------|-------------------|----------------|-------------------|---------------|
+| `development`      | `wiki-edit` | `0`               | `12345`        | `1680795276351`   | `wiki-edits`  |
 
 ### Kinesis
 

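Note on `offsetColumnName` in `data-formats.md`: the new table row says that ingesting the offset column enables filtering by offset in `transformSpec`. A minimal sketch of what such a filter might look like follows. It assumes the default column names `kafka.partition` and `kafka.offset`, and the offset bounds `12000` and `13000` are hypothetical values chosen only for illustration. Because Kafka offsets are scoped to a partition, the sketch pairs the offset range with a partition match.

```json
{
  "transformSpec": {
    "filter": {
      "type": "and",
      "fields": [
        { "type": "selector", "dimension": "kafka.partition", "value": "0" },
        {
          "type": "bound",
          "dimension": "kafka.offset",
          "lower": "12000",
          "upper": "13000",
          "ordering": "numeric"
        }
      ]
    }
  }
}
```

With the `bound` filter's default settings, both ends of the range are inclusive. If you rename the columns through `partitionColumnName` or `offsetColumnName`, the `dimension` values in the filter would need to match the new names.
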
diff --git a/docs/ingestion/ingestion-spec.md b/docs/ingestion/ingestion-spec.md
@@ -188,9 +188,13 @@ Treat `__time` as a millisecond timestamp: the number of milliseconds since Jan
 The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
 configuring [dimensions](./schema-model.md#dimensions).
 
-You can either manually specify the dimensions or take advantage of schema auto-discovery where you allow Druid to infer all or some of the schema for your data. This means that you don't have to explicitly specify your dimensions and their type. 
+You can either specify the dimensions manually or use type-aware schema auto-discovery, which lets Druid infer all or some of the schema for your data, so you don't have to explicitly specify your dimensions and their types.
 
-To use schema auto-discovery, set `useSchemaDiscovery` to `true`. 
+:::caution
+When using type-aware schema auto-discovery, Druid discovers and ingests all dimensions unless you list the ones to ignore in the `dimensionExclusions` field. Excluding dimensions you don't need helps you control storage costs by preventing Druid from ingesting them unintentionally.
+:::
+
+To use type-aware schema auto-discovery, set `useSchemaDiscovery` to `true`. 
 
 Alternatively, you can use the string-based schemaless ingestion where any discovered dimensions are treated as strings. To do so, leave `useSchemaDiscovery` set to `false` (default). Then, set the dimensions list to empty or set the  `includeAllDimensions` property to `true`.
 

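Note on the caution added to `ingestion-spec.md`: a minimal sketch of a `dimensionsSpec` that keeps type-aware schema auto-discovery on while excluding specific columns is shown below. The excluded names are the default Kafka metadata columns from the `data-formats.md` change above and serve only as an illustration. Substitute whichever dimensions you do not want ingested.

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensionExclusions": ["kafka.partition", "kafka.offset"]
  }
}
```

Leaving `dimensionExclusions` out of this sketch would let Druid discover and store every available column, including the two Kafka metadata columns.
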
diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md
@@ -249,12 +249,13 @@ Druid can infer the schema for your data in one of two ways:
 
 #### Type-aware schema discovery
 
-:::info
- Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns.
-:::
-
 You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
 
+Before you use type-aware schema discovery, keep the following in mind:
+
+- There may be an impact on downstream BI tools depending on how they handle ARRAY-typed columns.
+- Be aware of all the potential dimensions. Druid discovers all available dimensions unless you specify an exclusion list. Without an exclusion list, you may ingest more columns than you intend. For example, if you use type-aware schema discovery and the Kafka input format, Druid discovers dimensions like the Kafka offset and partition unless you add them to the exclusion list.
+
 When performing type-aware schema discovery, Druid can discover all the columns of your input data (that are not present in
 the exclusion list). Druid automatically chooses the most appropriate native Druid type among `STRING`, `LONG`,
 `DOUBLE`, `ARRAY<STRING>`, `ARRAY<LONG>`, `ARRAY<DOUBLE>`, or `COMPLEX<json>` for nested data. For input formats with