
Clear "lineSplittable" for JSON when using KafkaInputFormat. #15692

Merged
gianm merged 3 commits into apache:master from gianm:kafka-json-linesplittable
Jan 18, 2024

gianm (Contributor) commented Jan 16, 2024

Fixes a bug where KafkaInputFormat would parse incoming JSON as newline-delimited (as in batch ingestion) rather than treating each message as a whole entity (as is typical for streaming ingestion).

Background:

JsonInputFormat has a withLineSplittable method that can be used to control whether JSON is read line-by-line, or as a whole. The intent is that in streaming ingestion, lineSplittable is false (although it can be overridden by assumeNewlineDelimited), and in batch ingestion, lineSplittable is true.

When a json format is wrapped by a kafka format, lineSplittable isn't set properly. This patch updates KafkaInputFormat to clear lineSplittable on an underlying json format.
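
To illustrate the idea, here is a minimal, self-contained sketch. It does not use Druid's real classes: JsonFormat and KafkaFormat below are hypothetical stand-ins, and only the withLineSplittable name is taken from the description above. It also ignores assumeNewlineDelimited. The point is just to show why a wrapper that forces lineSplittable to false reads a multi-line JSON message as one entity.

```java
import java.util.ArrayList;
import java.util.List;

class LineSplittableSketch {

    // Hypothetical stand-in for a JSON input format (not Druid's JsonInputFormat).
    static final class JsonFormat {
        private final boolean lineSplittable;

        JsonFormat(boolean lineSplittable) {
            this.lineSplittable = lineSplittable;
        }

        // Returns a copy with the flag changed, leaving this instance untouched.
        JsonFormat withLineSplittable(boolean value) {
            return new JsonFormat(value);
        }

        // Splits a payload into the chunks that would each be parsed as JSON.
        List<String> split(String payload) {
            List<String> chunks = new ArrayList<>();
            if (!lineSplittable) {
                // Streaming-style: the whole payload is one entity.
                chunks.add(payload);
                return chunks;
            }
            // Batch-style: every non-blank line is treated as its own entity.
            for (String line : payload.split("\n")) {
                if (!line.trim().isEmpty()) {
                    chunks.add(line);
                }
            }
            return chunks;
        }
    }

    // Hypothetical stand-in for a Kafka wrapper around a value format.
    static final class KafkaFormat {
        private final JsonFormat valueFormat;

        KafkaFormat(JsonFormat valueFormat) {
            // Assumption: the wrapper always clears the flag on a wrapped JSON format.
            this.valueFormat = valueFormat.withLineSplittable(false);
        }

        List<String> readValue(String payload) {
            return valueFormat.split(payload);
        }
    }

    public static void main(String[] args) {
        String pretty = "{\n  \"a\": 1\n}";
        JsonFormat batchStyle = new JsonFormat(true);
        System.out.println("line-split chunks: " + batchStyle.split(pretty).size());
        System.out.println("kafka-wrapped chunks: " + new KafkaFormat(batchStyle).readValue(pretty).size());
    }
}
```

In this sketch, a pretty-printed message is broken into fragments that are not valid JSON on their own when read line-by-line, while the wrapped format hands the whole message to the parser.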

The tests for KafkaInputFormat were overriding the lineSplittable parameter explicitly, which made them unrepresentative of what happens in production. Now they omit the parameter and get the production behavior.

asdf2014 (Member) left a comment

LGTM 👍

gianm merged commit 764f41d into apache:master Jan 18, 2024
gianm deleted the kafka-json-linesplittable branch January 18, 2024 11:22
adarshsanjeev added this to the 30.0.0 milestone May 6, 2024