
NIFI-9822 - ConsumeKafkaRecord allows writing out Kafka record key#6045

Closed
greyp9 wants to merge 3 commits into apache:main from greyp9:NIFI-9822-B

Conversation

@greyp9 greyp9 (Contributor) commented May 13, 2022

Summary

NIFI-9822

  • ConsumeKafkaRecord allows writing out Kafka record key

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 8
    • JDK 11
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@greyp9 greyp9 (Contributor, Author) commented May 13, 2022

@markap14 , @pvillard31

I wanted to vet an initial implementation of this ticket, including a first pass at documentation. It seems to work in the happy-path case. I would like to hold off on error handling, unit tests, and (properly formatted) documentation until you both are happy with the approach.

@markap14 markap14 (Contributor) left a comment

Hey @greyp9, thanks for the updates! I think this is a good direction. I left several inline comments about things that I noticed. Some of these you may well already have realized, given that you said you need to do more work on tests & docs. Some are minor recommendations that you can accept or ignore.

I do think we need to improve the Publisher a bit more, though. I think the publisher needs a new property that tells it whether the incoming data should be sent as-is, as a single record, or if the incoming data is in the "Wrapper" format. And only expose the Key Record Writer if using the Wrapper format. That way, we can easily receive a message from ConsumeKafkaRecord, do some processing, etc. and then push to another kafka topic, for instance, and easily retain the headers, the key, and the value.

@@ -426,11 +465,13 @@ protected ConsumerPool createConsumerPool(final ProcessContext context, final Co
}

return new ConsumerPool(maxLeases, readerFactory, writerFactory, props, topics, maxUncommittedTime, securityProtocol,
markap14 (Contributor) commented:
Doesn't necessarily need to be done in this ticket. But probably makes sense to introduce a Builder pattern here instead of so many constructor args. It made sense before this, too, though :)
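The Builder pattern markap14 suggests might look roughly like the following. This is a hypothetical, simplified sketch: `ConsumerPoolSketch` and its field names are illustrative stand-ins, not NiFi's actual `ConsumerPool` API, which takes many more constructor arguments.

```java
// Hypothetical, simplified sketch of a Builder replacing a long constructor
// argument list. Field names are illustrative, not NiFi's actual API.
class ConsumerPoolSketch {
    private final int maxLeases;
    private final long maxUncommittedTimeMillis;
    private final String securityProtocol;

    private ConsumerPoolSketch(Builder b) {
        this.maxLeases = b.maxLeases;
        this.maxUncommittedTimeMillis = b.maxUncommittedTimeMillis;
        this.securityProtocol = b.securityProtocol;
    }

    public int getMaxLeases() { return maxLeases; }
    public String getSecurityProtocol() { return securityProtocol; }

    static class Builder {
        private int maxLeases;
        private long maxUncommittedTimeMillis;
        private String securityProtocol;

        Builder maxLeases(int v) { this.maxLeases = v; return this; }
        Builder maxUncommittedTimeMillis(long v) { this.maxUncommittedTimeMillis = v; return this; }
        Builder securityProtocol(String v) { this.securityProtocol = v; return this; }

        // Each setter returns the Builder, so construction reads as a chain
        // of named values rather than a positional argument list.
        ConsumerPoolSketch build() { return new ConsumerPoolSketch(this); }
    }
}
```

The main win is readability at the call site: each value is named, and optional arguments can simply be omitted rather than passed as nulls in a long positional list.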

Comment on lines +697 to +698
final InputStream is = new ByteArrayInputStream(key);
final RecordReader reader = keyReaderFactory.createRecordReader(attributes, is, key.length, logger);
markap14 (Contributor) commented:
Should use try-with-resources here to ensure that we close the InputStream and the Record Reader.
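The try-with-resources shape markap14 is asking for might look like the sketch below. A `BufferedReader` stands in for NiFi's `RecordReader` (both are `Closeable`), since the point is the resource-management pattern rather than the NiFi API itself.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch of the try-with-resources pattern: both the InputStream and the
// reader wrapping it are declared in the try header, so both are closed
// automatically even if reading throws. A BufferedReader stands in for
// NiFi's RecordReader here.
class KeyReadSketch {
    static String readKey(byte[] key) throws IOException {
        try (InputStream is = new ByteArrayInputStream(key);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(is, StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
        // is and reader are closed here, in reverse declaration order
    }
}
```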

throws IOException, SchemaNotFoundException, MalformedRecordException {
final Tuple<RecordField, Object> tuple;
final byte[] key = consumerRecord.key() == null ? new byte[0] : consumerRecord.key();
if (KafkaProcessorUtils.RECORD.getValue().equals(keyFormat)) {
markap14 (Contributor) commented:
Took me a minute to figure out what this means - KafkaProcessorUtils.RECORD wasn't immediately obvious to me. Perhaps it makes sense to rename RECORD, STRING, etc. to something that makes more sense outside this context, such as KEY_AS_RECORD, KEY_AS_STRING, etc.? It's a bit of a nitpick, and you can feel free to ignore it if you want.

tuple = new Tuple<>(recordField, record);
} else if (KafkaProcessorUtils.STRING.getValue().equals(keyFormat)) {
final RecordField recordField = new RecordField("key", RecordFieldType.STRING.getDataType());
tuple = new Tuple<>(recordField, new String(key, StandardCharsets.UTF_8));
markap14 (Contributor) commented:
We'll definitely want to make sure that we document that this strategy requires the key to be a UTF-8 compatible String. And we should probably ensure that we test with a non-UTF-8 compatible key; in that case, the record should probably go to the parse.failure relationship.
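One way to detect the non-UTF-8 case markap14 describes: `new String(key, StandardCharsets.UTF_8)` never fails, because it silently substitutes replacement characters for malformed bytes. Detecting a bad key so it can be routed to parse.failure requires a strict `CharsetDecoder`. The helper name below is hypothetical, not part of the PR.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode the key strictly, so a non-UTF-8 key raises
// CharacterCodingException instead of being silently replaced. The caller
// could catch that exception and route the record to parse.failure.
class Utf8KeySketch {
    static String decodeStrict(byte[] key) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail, don't replace
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(key)).toString();
    }
}
```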

} else {
final RecordField recordField = new RecordField("key",
RecordFieldType.ARRAY.getArrayDataType(RecordFieldType.BYTE.getDataType()));
tuple = new Tuple<>(recordField, key);
markap14 (Contributor) commented:
Eventually we need to introduce a BYTES data type for Records. Right now, when we have an Array of type Byte, the record API expects this to be an array of Byte objects, not primitive bytes. So, as inefficient as it is, in this case I think we need to create a Byte[] for the key instead of providing the byte[].
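The boxing markap14 describes is a simple element-by-element copy; Java's autoboxing handles each element, but there is no built-in conversion for the whole array. The helper name is illustrative.

```java
// Minimal sketch of the boxing described above: the record API expects
// Byte[] (objects) for an ARRAY[BYTE] field, so the primitive byte[] key
// must be copied element by element.
class ByteBoxSketch {
    static Byte[] box(byte[] key) {
        Byte[] boxed = new Byte[key.length];
        for (int i = 0; i < key.length; i++) {
            boxed[i] = key[i]; // autoboxing: byte -> Byte
        }
        return boxed;
    }
}
```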

.map(s -> s.getBytes(StandardCharsets.UTF_8)).orElse(null);
} else {
final ByteArrayOutputStream os = new ByteArrayOutputStream(1024);
final MapRecord keyRecord = (MapRecord) record.getValue(messageKeyField);
markap14 (Contributor) commented:
Should probably be using Record here - not MapRecord

Comment on lines +262 to +264
- Output Strategy "Write Value Only" (the default) emits flowfile records containing only the Kafka record value.
- Output Strategy "Use Wrapper" (new) emits flowfile records containing the Kafka record key, value, and headers, as
well as additional metadata from the Kafka record.</p>
markap14 (Contributor) commented:
Should probably use <ul> with <li> rather than - for denoting lists.
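The suggested markup change might look like the following, reusing the list content quoted above from the processor's documentation:

```html
<ul>
  <li>Output Strategy "Write Value Only" (the default) emits flowfile records containing only the
      Kafka record value.</li>
  <li>Output Strategy "Use Wrapper" (new) emits flowfile records containing the Kafka record key,
      value, and headers, as well as additional metadata from the Kafka record.</li>
</ul>
```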

<p>This processor (NiFi 1.17+) offers multiple output strategies (configured via processor property "Output Strategy")
for converting Kafka records into flow files.
- Output Strategy "Write Value Only" (the default) emits flowfile records containing only the Kafka record value.
- Output Strategy "Use Wrapper" (new) emits flowfile records containing the Kafka record key, value, and headers, as
markap14 (Contributor) commented:
We need to be sure that we call out the Record Schema that will be used here.


<p>Additionally, the choice of the "Output Strategy" processor property affects the related properties "Headers to Add
as Attributes (Regex)" and "Key Attribute Encoding". These properties are available only when "Output Strategy" is set
to "Write Value Only".</p>
markap14 (Contributor) commented:
Might make sense to mention the reason they are only available when Output Strategy = Write Value Only. I.e., because it doesn't make sense when writing Records, as the Headers and keys are not attributes, they are part of the Record/FlowFile content.

descriptors.add(KafkaProcessorUtils.TOKEN_AUTH);
descriptors.add(KafkaProcessorUtils.SSL_CONTEXT_SERVICE);
descriptors.add(SEPARATE_BY_KEY);
descriptors.add(OUTPUT_STRATEGY);
markap14 (Contributor) commented:
Output Strategy (and related properties) is going to be a very important thing for the user to think through when configuring this. Because of that, I'd recommend moving this property up in the list to just after Group ID

@greyp9 greyp9 (Contributor, Author) commented Jun 16, 2022

Closing in favor of #6131.

@greyp9 greyp9 closed this Jun 16, 2022