
Introduce optional external.version.header config #697

Merged

Conversation

rjosal-indeed
Contributor

Introduce optional external.version.header config to use for ES external version instead of kafka offset for non-datastream indices

Problem

After changing an index mapping, we need to reindex. To do this, we create a new index and run a batch job over all existing input documents while simultaneously streaming realtime updates, ensuring we do not miss any documents. Today the two cannot run at the same time, because a batch write might overwrite a newer streamed update.

Solution

ES external versioning can be used to compare document versions for idempotency. We will be able to pull a version out of the document into a Kafka header and use the same version in the batch job. We chose a header, rather than a field in the record value, so that deletes (records with null values) are also idempotent.
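As a sketch, the feature described above might be enabled with a sink connector configuration like the following. Everything except `external.version.header` is a standard Elasticsearch sink connector setting, and the header name `doc_version` is our own illustrative choice, not something defined by the PR:

```properties
name=es-sink-with-external-version
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
topics=documents
connection.url=http://localhost:9200
key.ignore=false
# New in this PR: take the ES external version from this Kafka record header
# instead of the record's offset (non-datastream indices only).
external.version.header=doc_version
```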

Does this solution apply anywhere else?
  • [x] yes
  • no
If yes, where?

There are a couple of other use cases that may be solved by this control:
https://github.com/confluentinc/kafka-connect-elasticsearch/issues?q=is%3Aissue+is%3Aopen+external+version

Test Strategy

Testing done:
  • [x] Unit tests
  • [ ] Integration tests
  • [ ] System tests
  • [ ] Manual tests

Release Plan

If merged, we will run this connector in Confluent Cloud; otherwise, we will run our fork. Because the feature is optional, with a default that does not affect existing usage, it is backwards compatible.

master

n/a

Thank you, and please pick this apart!

@rjosal-indeed rjosal-indeed requested a review from a team as a code owner May 9, 2023 23:25

cla-assistant bot commented Sep 11, 2023

CLA assistant check
All committers have signed the CLA.

Comment on lines 253 to 254
```java
request.version((Long) record.headers().lastWithName(
        config.externalVersionHeader()).value());
```
Contributor Author


When we put this in service, we noticed that even though we used an SMT to write a long-typed header, it didn't come out that way in this code. We got it to work by using org.apache.kafka.connect.storage.SimpleHeaderConverter#fromConnectHeader and then reading the result as a String and parsing it to a Long. Does that sound correct, or did we misunderstand something about Kafka Connect headers?

Member

@sp-gupta sp-gupta Nov 22, 2023


Hey @rjosal-indeed
Apologies, I missed this before.

Would it be possible for you to elaborate on it? I am not sure I quite understood how you used SimpleHeaderConverter.

Thanks!

Contributor Author


Yes, the code for these lines with the Connect header converter looks like:

```java
// Look up the configured version header on the record.
final Header versionHeader = record.headers().lastWithName(config.externalVersionHeader());
// SimpleHeaderConverter serializes the header value to bytes (a numeric value
// becomes its string form), so decode and parse it back into a Long.
final byte[] versionValue = HEADER_CONVERTER.fromConnectHeader(
        record.topic(),
        versionHeader.key(),
        versionHeader.schema(),
        versionHeader.value()
);
request.version(Long.parseLong(new String(versionValue)));
```

where HEADER_CONVERTER is a constant new SimpleHeaderConverter().
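To make the parse step above concrete without a Kafka Connect dependency, here is a stdlib-only sketch of just that step, assuming the converter has already serialized the header value to UTF-8 bytes. The class and method names are ours for illustration, not the connector's:

```java
import java.nio.charset.StandardCharsets;

public class HeaderVersionParser {

    // versionValue stands in for what fromConnectHeader returns for a
    // numeric header: the number's string form, encoded as UTF-8 bytes.
    static long parseVersion(byte[] versionValue) {
        return Long.parseLong(new String(versionValue, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] serialized = "1699999999".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseVersion(serialized)); // prints 1699999999
    }
}
```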

Member

@sp-gupta sp-gupta Nov 23, 2023


Hey @rjosal-indeed
Great! Thank you so much, this makes sense.

I think we can go ahead with this change. Also, can you refer to the suggestions mentioned here (ignore 2nd point) and make the changes accordingly? After, we can go ahead and merge the changes real quick.

Contributor Author


@sp-gupta ready for review, thanks!

Member


Thanks @rjosal-indeed
There was one import missing due to which the build was failing. I have corrected it.

Again, thank you so much for your contribution. Really appreciated!

@rjosal-indeed rjosal-indeed changed the base branch from master to 10.0.x November 28, 2023 00:16
@rjosal-indeed rjosal-indeed changed the base branch from 10.0.x to master November 29, 2023 18:54
…nal version instead of kafka offset for non-datastream indices
@rjosal-indeed rjosal-indeed changed the base branch from master to 11.0.x December 4, 2023 18:51
@sp-gupta sp-gupta merged commit 9e95174 into confluentinc:11.0.x Dec 5, 2023
2 checks passed