
Support csv input format in Kafka ingestion with header #16630

Merged: 6 commits into apache:master on Jun 25, 2024

Conversation

@kfaraz (Contributor) commented Jun 20, 2024

Description

When Kafka ingestion is set up with ioConfig.type = kafka (i.e. "Parse Kafka metadata" enabled) and a csv input format, the following parse error occurs both while sampling the data and while running an actual ingestion task (a sketch of the kind of spec involved appears below).
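For context, a minimal sketch of the kind of spec fragment that triggers this (the column names are taken from the sampler error below; the other fields are illustrative assumptions, not copied from an actual spec):

    "ioConfig": {
      "type": "kafka",
      "inputFormat": {
        "type": "kafka",
        "valueFormat": {
          "type": "csv",
          "findColumnsFromHeader": false,
          "columns": ["time", "name", "value"]
        }
      }
    }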

[Screenshot: the parse error shown in the web console]

org.apache.druid.java.util.common.parsers.ParseException:
  Unsupported input format in valueFormat. KafkaInputFormat only supports input format
  that return MapBasedInputRow rows

This error eventually fails the sampling with

org.apache.druid.indexing.overlord.sampler.SamplerExceptionMapper
- Failed to sample data: 
   Size of rawColumnsList([[{kafka.timestamp=1718857393187, name=a, kafka.topic=abc, time=2024-06-14T01:00:00Z, value=1}]])
   does not correspond to size of inputRows([[]])

and ingestion simply rejects the events due to the parse exception.

The root cause is that KafkaInputReader expects the input rows to be MapBasedInputRows
so that it can use the underlying event map to blend the parsed values with the Kafka headers and key. Formats like csv produce ListBasedInputRows instead.
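A simplified, hypothetical sketch of the pre-fix logic (row and headerAndKeyColumns are illustrative names, not the actual variables; the real diff context is quoted in the review thread below):

    // Only map-based rows expose the event map that the reader merges with
    // the Kafka header/key columns, so list-based rows (e.g. from the csv
    // format) were rejected outright.
    if (!(row instanceof MapBasedInputRow)) {
      throw new ParseException(
          null,
          "Unsupported input format in valueFormat."
          + " KafkaInputFormat only supports input format that return MapBasedInputRow rows"
      );
    }
    final Map<String, Object> blended = new HashMap<>(headerAndKeyColumns);
    blended.putAll(((MapBasedInputRow) row).getEvent());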

Changes

  • Convert ListBasedInputRow to MapBasedInputRow using .asMap()
    while building blended rows that contain values from the Kafka headers, key, and value (see the sketch after this list).
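A minimal sketch of the conversion described above (variable names are illustrative; the merged change ultimately reworked buildBlendedEventMap as suggested in the review below, but the idea is the same):

    // Normalize any supported row type to a plain event map before blending
    // it with the Kafka header/key columns.
    final Map<String, Object> valueEvent;
    if (row instanceof MapBasedInputRow) {
      valueEvent = ((MapBasedInputRow) row).getEvent();
    } else if (row instanceof ListBasedInputRow) {
      // csv rows are list-based; expose them as a map via asMap()
      valueEvent = ((ListBasedInputRow) row).asMap();
    } else {
      throw new ParseException(null, "Unsupported input format in valueFormat");
    }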

Screenshot after the fix

[Screenshot: sampling succeeds after the fix]

Testing

  • Added a unit test to KafkaInputFormatTest with a csv record payload.
  • Tested ingestion and sampling on a local cluster with csv values (refer to the screenshot above).

Release note

Allow use of the csv input format for Kafka record values when "Parse Kafka metadata" is also enabled.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz requested a review from @clintropolis on Jun 20, 2024
@kfaraz changed the title from "Support ListBasedInputRow in Kafka ingestion with header" to "Support csv input format in Kafka ingestion with header" on Jun 20, 2024
@AmatyaAvadhanula (Contributor) left a comment:

Thanks for the fix, @kfaraz. LGTM!

On this part of the diff:

    // Return type for the value parser should be of type MapBasedInputRow
    // Parsers returning other types are not compatible currently.
    valueRow = (MapBasedInputRow) r;
    if (r instanceof ListBasedInputRow) {
@cryptoe (Contributor) commented Jun 20, 2024:

I think the performance implication of this if check should be okay. Anyway, can we add a UT for this?

@cryptoe (Contributor) left a comment:

Let's add a UT for the above change.

@clintropolis (Member) left a comment:

What if we just made buildBlendedEventMap less picky about the row type? I attached a patch that instead makes buildBlendedEventMap look something like this:

private static Map<String, Object> buildBlendedEventMap(
      Function<String, Object> getRowValue,
      Set<String> rowDimensions,
      Map<String, Object> fallback
  )

so then usage is like:

    return valueParser.read().map(
        r -> {

          final HashSet<String> newDimensions = new HashSet<>(r.getDimensions());
          final Map<String, Object> event = buildBlendedEventMap(r::getRaw, newDimensions, headerKeyList);
...

Attachment: kafka-reader.patch

I didn't test much, but KafkaInputFormatTest passed, minus the parse exception test, which fails with a different message due to different key ordering; that's probably easy to fix.
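For readers without the patch handy, a hypothetical sketch of what such a method body could look like (the precedence semantics are inferred from the fallback parameter name, i.e. row values shadow the header/key map; the attached kafka-reader.patch is authoritative):

    private static Map<String, Object> buildBlendedEventMap(
        Function<String, Object> getRowValue,
        Set<String> rowDimensions,
        Map<String, Object> fallback
    )
    {
      // Start from the header/key columns, then let the row's own
      // dimensions take precedence over them.
      final Map<String, Object> event = new HashMap<>(fallback);
      for (String dimension : rowDimensions) {
        event.put(dimension, getRowValue.apply(dimension));
      }
      return event;
    }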

@kfaraz (Contributor, Author) commented Jun 20, 2024:

Thanks for the patch, @clintropolis! I have tested the changes (with a minor tweak for sampling) and updated the PR accordingly. It works as expected.

@kfaraz requested a review from @clintropolis on Jun 20, 2024
@clintropolis (Member) left a comment:

It might be nice to add a test that uses csv to KafkaInputFormatTest.

Also, I think that the keyFormat handling might have a similar problem, which can be changed to something like this:

          InputRow keyRow = keyIterator.next();
          // Add the key to the mergeList only if the key string is not already present
          mergedHeaderMap.putIfAbsent(
              keyColumnName,
              keyRow.getRaw(Iterables.getOnlyElement(keyRow.getDimensions()))
          );

If we also change the KafkaInputFormat.java key parser to not use the regular input schema, it would look something like this:

        (keyFormat == null) ?
            null :
            record ->
                (record.getRecord().key() == null) ?
                    null :
                    JsonInputFormat.withLineSplittable(keyFormat, false).createReader(
                        new InputRowSchema(
                            dummyTimestampSpec,
                            DimensionsSpec.EMPTY,
                            null
                        ),
                        new ByteEntity(record.getRecord().key()),
                        temporaryDirectory
                    ),

@kfaraz (Contributor, Author) commented Jun 21, 2024:

@clintropolis, I have added a test for CSV values. Do you think it would be okay if we fix the handling of the key format in a follow-up PR?

@kfaraz requested a review from @clintropolis on Jun 21, 2024
@clintropolis (Member) left a comment:

Changes overall LGTM; it's fine to do the other fix as a follow-up.

@kfaraz merged commit f1043d2 into apache:master on Jun 25, 2024
87 checks passed
@kfaraz (Contributor, Author) commented Jun 25, 2024:

Thanks for the reviews, @AmatyaAvadhanula and @clintropolis!

@kfaraz deleted the support_list_row branch on Jun 25, 2024
@asdf2014 (Member) commented Jul 8, 2024:

I believe this is worth mentioning in the release notes 👍
