New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-14517: Implement regex subscriptions #14327
base: trunk
Are you sure you want to change the base?
Conversation
be7f5ef
to
c68c37d
Compare
@a961370183 Thanks for the PR. The approach taken in this PR is incorrect. As explained in KIP-848, the goal is to move the resolution of the regex to the server side. Therefore, the regex is passed to the server via the new ConsumerGroupHeartbeat API. That being said, I think that we should start by implementing the server side. Are you interested in doing do? Note that the part adding the validation command to the consumer group command line tool is correct. We could keep this part. |
b00cedc
to
12bd434
Compare
36c09b3
to
833fefb
Compare
@dajac Almost done. Could you pls help to review the code? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JimmyWang6 for the PR.
As @dajac mentioned, we don't want to have the client include the RE2J dependency. The evaluation of the regex is performed strictly on the broker.
/** | ||
* local variable to avoid multiple compile | ||
*/ | ||
private Pattern complitedPattern; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks
KAFKA-14048:remove client side code
11850f2
to
23a4d9b
Compare
@JimmyWang6 I am really sorry but I haven't had the time to dive into this yet. At a high level, I think that the approach taken so far does not cover all the aspects. For instance, how is the target assignment updates based on the resolved regular expressions? |
@JimmyWang6 , Apart from all the suggestions made in the PR reviews, I have a minor one. In the PR title, you have mentioned |
95495c0
to
d98f801
Compare
@vamossagar12 Much thanks for your review. Will change to KAFKA-14517 later |
@dajac |
704670b
to
62d5608
Compare
Hi @dajac |
@
@dajac @vamossagar12 The latest code appears to be functioning well. Do you have any further comments or suggestions regarding this approach? |
77a2f41
to
cf8c05e
Compare
cf8c05e
to
d5b0393
Compare
@JimmyWang6, are we still using re2j.Pattern for server-side regex checking, I don't see its usage in the PR? |
@Phuc-Hong-Tran The work to replace the existing regex matching with re4j is not included in this PR. I think we should create a new subtask to address this. |
Sure @JimmyWang6, but we need your PR to get merged first before we can process with that task, otherwise whoever gonna work on that task need to fork your branch, which is not an ideal workflow |
b91e9b1
to
902684e
Compare
@dajac |
Hey @JimmyWang6. I am really sorry for the delay on this pull request. We have too many other things to finish before I could really focus on it. I will have more time in the coming weeks. If you're still interested and you also have time, we could get it done now. |
As a first step, it would be great if we could keep this pull request focused on the RPC and its implementation on the server side. I would extract the command line tool part into its own pull request. Would it be possible? |
@dajac I have a question regarding the usage of RE2J on the server side. Shouldn't RE2J be the primary engine that will be used for compiling regular expressions. Why is it not in the scope of this ticket? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JimmyWang6 I left a few high level comments to start with. Are you on the Apache slack? We could also discuss there offline if you want.
{ "name": "SubscribedTopicRegex", "type": "string", "versions": "0+", "nullableVersions": "0+", "default": "null", | ||
"about": "null if it didn't change since the last heartbeat; the subscribed topic regex otherwise" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We must bump the version of the API to version 1 and use version 1+
for this field in order to make it backward compatible.
@@ -1040,6 +1051,7 @@ private CoordinatorResult<ConsumerGroupHeartbeatResponseData, Record> consumerGr | |||
.maybeUpdateRebalanceTimeoutMs(ofSentinel(rebalanceTimeoutMs)) | |||
.maybeUpdateServerAssignorName(Optional.ofNullable(assignorName)) | |||
.maybeUpdateSubscribedTopicNames(Optional.ofNullable(subscribedTopicNames)) | |||
.maybeUpdateSubscribedTopicRegex(Optional.ofNullable(subscribedTopicRegex)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we miss a few fundamental things in consumerGroupHeartbeat
. Let me explain. The consumerGroupHeartbeat
is structured in 3 parts.
- We update the member and its subscriptions.
- We compute the new target assignment if needed.
- We reconcile the member.
I think that in 1), we need to update the subscription for the member as you do here. However, we also need to verify it before storing it and we also need to update the subscription metadata if the regex was changed. See L1080. In step 2), we also need to change the logic to include the topic matching the regex. See L1115. Step 3) is fine as it is.
We also need a mechanism to periodically refresh the regexes in order to catch new topics or deleted topics. What was your plan for this? I thought that we could piggy back the mechanism to refresh the subscription metadata (L1076) to also refresh the regexes.
I think that we also need to store the resolved regular expressions somehow. I mean a mapping from the regex (as string) to the matching topics because we need this for step 2). For this, I was considering whether we could just use a LRU cache. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dajac, +1 on using LRU cache, it would also save us from unnecessary assignment computations when member rejoining
Pattern pattern = Pattern.compile(regex); | ||
for (String topicName : metadataImage.topics().topicsByName().keySet()) { | ||
if (pattern.matcher(topicName).matches()) { | ||
subscribeGroupToTopic(groupId, topicName); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this for two reasons: (1) Computing the list of topics is quite expensive so doing it here may not be the best place; and (2) It would not catch the changes as it applies it only when the regex is stored. I think that we need to discuss the high level approach. Take a look at my previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one question, why is computing the list of topics to subscribe to expensive here? My guess is that metadataImage is something that we need to load from disk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Phuc-Hong-Tran It is definitely expensive wherever it will be :). The major downside of doing it here is that when you reload the group coordinator state from the partition, you may have obsolete records containing regex which are not compacted yet. For those, we would compute the list for nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dajac thanks for the explanation
@Phuc-Hong-Tran Yes, you're right. We need to use RE2J in this ticket. cc @JimmyWang6 |
@dajac |
@dajac Can you invite me to the apache slack as well? The email is phuctran3289@gamil.com, TIA |
@Phuc-Hong-Tran Done. You should receive an invitation when the admin approves my request. |
@dajac I miss-spelled the email, the correct one is phuctran3289@gmail.com |
@dajac |
hi @JimmyWang6, are you still working on this? |
This pull request implements the server-side functionality for regex subscriptions in the next generation of the consumer rebalance protocol.
For the ConsumerGroupHeartbeat API, the server now accepts a regex pattern, and a constraint is enforced that only one of either the regular subscription pattern or topic name subscription is allowed. The subscribedTopicRegex is eventually transformed into subscribedTopicName in the ConsumerGroupMember.
To handle this, an additional argument, TopicsImage, is introduced in the maybeUpdateSubscribedTopicNames() function. This argument is used to filter the topics that match the regex pattern. During replay, when the function is called with a null TopicsImage argument, the regex expression is not considered.
Committer Checklist (excluded from commit message)