KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark #13776

yashmayya · 2023-05-29T14:38:35Z

https://issues.apache.org/jira/browse/KAFKA-15034
The ReplaceField SMT can be configured with a list of fields that are to be included or excluded during every record transformation.
Currently, it stores these fields in an ArrayList, so each filter check is O(N) in the number of configured fields, which results in poor performance when the SMT is configured with a large number of include / exclude fields.
This patch refactors it to use a HashSet instead (O(1) expected time complexity) and adds a JMH benchmark to demonstrate the performance improvements.
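
To make the complexity argument concrete, here is a minimal sketch, not the actual ReplaceField code. The class and method names (`FieldFilterSketch`, `keep`, `keptFields`) are hypothetical, and the filter semantics (drop excluded names; when an include list is configured, keep only included names) are an assumption made for illustration. The filter calls `contains` once or twice per field, so the container type decides whether each check scans the configured names or hashes into them.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration; names and filter semantics are assumptions, not the ReplaceField source.
class FieldFilterSketch {

    // Keep a field if it is not excluded and, when an include list is configured, it is included.
    static boolean keep(String field, Collection<String> include, Collection<String> exclude) {
        return !exclude.contains(field) && (include.isEmpty() || include.contains(field));
    }

    // Applies the filter to every field name of a record value.
    static List<String> keptFields(List<String> valueFields,
                                   Collection<String> include,
                                   Collection<String> exclude) {
        List<String> kept = new ArrayList<>();
        for (String field : valueFields) {
            if (keep(field, include, exclude)) {
                kept.add(field);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> valueFields = List.of("id", "name", "password", "email");
        List<String> excludeList = List.of("password");           // contains() scans the list: O(N)
        HashSet<String> excludeSet = new HashSet<>(excludeList);  // contains() hashes: O(1) expected

        // Both containers produce the same result; only the per-lookup cost differs.
        System.out.println(keptFields(valueFields, List.of(), excludeList)); // [id, name, email]
        System.out.println(keptFields(valueFields, new HashSet<>(), excludeSet)); // [id, name, email]
    }
}
```

With V value fields and N configured names, the list-backed variant does roughly V × N comparisons per record in the worst case, while the set-backed variant does about V hash lookups, which is consistent with the shape of the benchmark numbers in this description.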

JMH Benchmark Result Before (`ArrayList` based implementation):

Benchmark                                                  (includeExcludeFieldCount)  (valueFieldCount)  Mode  Cnt          Score         Error  Units
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1                100  avgt    5        928.115 ±      20.251  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1               1000  avgt    5      10380.401 ±     286.643  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1              10000  avgt    5      97058.104 ±    3834.409  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100                100  avgt    5      15052.629 ±     112.337  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100               1000  avgt    5     301212.390 ±    6400.402  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100              10000  avgt    5    2218226.090 ±   38198.098  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000                100  avgt    5     582789.404 ±   11436.565  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000               1000  avgt    5    6263588.619 ±  530370.435  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000              10000  avgt    5  133424024.627 ± 9497024.791  ns/op

JMH Benchmark Result After (`HashSet` based implementation):

Benchmark                                                  (includeExcludeFieldCount)  (valueFieldCount)  Mode  Cnt       Score      Error  Units
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1                100  avgt    5    1205.928 ±   32.611  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1               1000  avgt    5   10124.067 ±  212.876  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1              10000  avgt    5  105143.540 ± 1813.534  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100                100  avgt    5    1602.392 ±   15.756  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100               1000  avgt    5   11543.659 ±  193.129  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100              10000  avgt    5  171689.002 ± 5691.014  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000                100  avgt    5    1686.155 ±   21.922  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000               1000  avgt    5   20584.457 ±  429.614  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000              10000  avgt    5  221015.401 ± 8108.798  ns/op

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

C0urante

Thanks Yash. I like the small-change, large-gain performance improvement that's made possible here, but I'm curious about potential fallout under certain scenarios. LMKWYT

jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java

connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java

C0urante · 2023-05-30T15:54:17Z

jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java

+        replaceFieldConfigs.put("exclude",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 0).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));
+        replaceFieldConfigs.put("include",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 1).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));


We may want to add a separate parameter for the number of included/excluded fields (can be a single parameter to control both, or a separate parameter for each) in order to cover the case of a value with a large number of fields and a small number of included/excluded fields.
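
One way the suggestion above could look, as a sketch using only the standard library rather than the JMH harness: the number of configured include / exclude names is controlled separately from the number of fields in the value. The class and helper names (`ReplaceFieldConfigSketch`, `fieldNames`, `configs`, `valueFieldNames`) are hypothetical; the parameter names `includeExcludeFieldCount` and `valueFieldCount` are borrowed from the benchmark result columns, and the even/odd split follows the snippet under review. How the merged benchmark actually derives its names is not shown here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical illustration; not the code from ReplaceFieldBenchmark.
class ReplaceFieldConfigSketch {

    // Joins `count` names of the form "Field-<n>", using even n for parity 0 and odd n for parity 1.
    static String fieldNames(int count, int parity) {
        return IntStream.range(0, count)
                .map(i -> 2 * i + parity)
                .mapToObj(x -> "Field-" + x)
                .collect(Collectors.joining(","));
    }

    // Builds the SMT configuration from the include/exclude count alone.
    static Map<String, String> configs(int includeExcludeFieldCount) {
        Map<String, String> configs = new HashMap<>();
        configs.put("exclude", fieldNames(includeExcludeFieldCount, 0));
        configs.put("include", fieldNames(includeExcludeFieldCount, 1));
        return configs;
    }

    // Builds the value's field names independently of the configuration size.
    static List<String> valueFieldNames(int valueFieldCount) {
        return IntStream.range(0, valueFieldCount)
                .mapToObj(x -> "Field-" + x)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A large value with a single included and a single excluded field.
        Map<String, String> configs = configs(1);
        System.out.println("exclude=" + configs.get("exclude"));
        System.out.println("include=" + configs.get("include"));
        System.out.println("value fields=" + valueFieldNames(10000).size());
    }
}
```

Keeping the two counts independent is what allows a case such as 1 configured field against 10,000 value fields to be measured at all.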

yashmayya · May 30, 2023

Given the above observations, do you feel like this is still required?

vamossagar12 · May 31, 2023

I feel they are going to perform the same. The time complexity of String#hashCode is O(n), and the same is true of the equals method. No harm in trying though :)

C0urante · May 31, 2023

Object lookup in a hash set usually involves both computing the object's hash and performing an equality check, since multiple objects may occupy the same bucket. Lookup in a single-element list may theoretically be faster, since it involves only a single equality check.

I think it's worth including in the benchmark for a few reasons:

- Saves people the trouble of having to look up this PR discussion
- Covers a more-common case (it's much more likely that someone configures this SMT with 1 field than 10,000)
- Guards against performance regressions if we change things in the future

But if it's too much work then we can merge as-is. @yashmayya let me know what your decision is.
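
The point about hash set lookup versus a single-element list can be observed directly by counting method calls. The sketch below is an illustration only: `LookupCostSketch` and `CountingName` are hypothetical names, and the exact call counts depend on the JDK's collection implementations, so treat them as an observation rather than a guarantee.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

// Hypothetical illustration only; none of these names come from the PR.
class LookupCostSketch {
    static int hashCalls;
    static int equalsCalls;

    // Wraps a field name and counts how often hashCode() and equals() are invoked.
    static final class CountingName {
        private final String name;

        CountingName(String name) {
            this.name = name;
        }

        @Override
        public int hashCode() {
            hashCalls++;
            return name.hashCode();
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof CountingName && name.equals(((CountingName) other).name);
        }
    }

    // Stores one element, then returns {hashCode calls, equals calls} made by a single contains().
    static int[] lookupCost(Collection<CountingName> container, String stored, String probe) {
        container.add(new CountingName(stored));
        hashCalls = 0;   // discard any calls made while inserting
        equalsCalls = 0;
        container.contains(new CountingName(probe));
        return new int[] {hashCalls, equalsCalls};
    }

    public static void main(String[] args) {
        int[] list = lookupCost(new ArrayList<>(), "Field-0", "Field-0");
        int[] set = lookupCost(new HashSet<>(), "Field-0", "Field-0");
        System.out.println("ArrayList: hashCode=" + list[0] + ", equals=" + list[1]);
        System.out.println("HashSet:   hashCode=" + set[0] + ", equals=" + set[1]);
    }
}
```

For a successful lookup, the list variant is expected to need only the equality check, while the set variant pays for a hash computation as well. That extra step is one plausible reason the single-field, 100-value-field row is slightly slower in the HashSet results, though the benchmark alone does not prove the cause.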

yashmayya · Jun 1, 2023

Makes sense. I've added a separate parameter controlling the number of include and exclude fields, since it was a pretty minor change. These are the new results:

JMH Benchmark Result Before (ArrayList based implementation):

Benchmark                                                  (includeExcludeFieldCount)  (valueFieldCount)  Mode  Cnt          Score         Error  Units
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1                100  avgt    5        928.115 ±      20.251  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1               1000  avgt    5      10380.401 ±     286.643  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1              10000  avgt    5      97058.104 ±    3834.409  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100                100  avgt    5      15052.629 ±     112.337  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100               1000  avgt    5     301212.390 ±    6400.402  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100              10000  avgt    5    2218226.090 ±   38198.098  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000                100  avgt    5     582789.404 ±   11436.565  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000               1000  avgt    5    6263588.619 ±  530370.435  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000              10000  avgt    5  133424024.627 ± 9497024.791  ns/op

JMH Benchmark Result After (HashSet based implementation):

Benchmark                                                  (includeExcludeFieldCount)  (valueFieldCount)  Mode  Cnt       Score      Error  Units
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1                100  avgt    5    1205.928 ±   32.611  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1               1000  avgt    5   10124.067 ±  212.876  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1              10000  avgt    5  105143.540 ± 1813.534  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100                100  avgt    5    1602.392 ±   15.756  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100               1000  avgt    5   11543.659 ±  193.129  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100              10000  avgt    5  171689.002 ± 5691.014  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000                100  avgt    5    1686.155 ±   21.922  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000               1000  avgt    5   20584.457 ±  429.614  ns/op
ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000              10000  avgt    5  221015.401 ± 8108.798  ns/op

C0urante · Jun 1, 2023

Awesome, thanks!

yashmayya

Thanks for taking a look Chris!

connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java

yashmayya · 2023-05-30T17:51:50Z

Given the above observations, do you feel like this is still required?

C0urante

LGTM!

yashmayya added connect performance labels May 29, 2023

yashmayya changed the title ~~KAFKA-15034: Use HashSet for include / exclude fields in ReplaceField SMT; add JMH benchmark~~ KAFKA-15034: Improve performance of ReplaceField SMT when configured with lots of include / exclude fields; add JMH benchmark May 29, 2023

yashmayya requested review from C0urante, mimaison and viktorsomogyi May 29, 2023 14:40

yashmayya force-pushed the KAFKA-15034 branch from 7795088 to 7b01ec8 Compare May 29, 2023 15:30

KAFKA-15034: Improve performance of ReplaceField SMT when configured …

1eaacd0

…with a large number of include / exclude fields; add JMH benchmark

yashmayya force-pushed the KAFKA-15034 branch from 7b01ec8 to 1eaacd0 Compare May 30, 2023 05:41

yashmayya changed the title ~~KAFKA-15034: Improve performance of ReplaceField SMT when configured with lots of include / exclude fields; add JMH benchmark~~ KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark May 30, 2023

C0urante reviewed May 30, 2023

Explicitly set default setup level

4419350

yashmayya commented May 30, 2023

Introduce a new parameter controlling the number of include/exclude f…

9a1dee7

…ields in the ReplaceField JMH benchmark

C0urante approved these changes Jun 1, 2023

C0urante merged commit 9bb2f78 into apache:trunk Jun 1, 2023

mimaison mentioned this pull request Dec 19, 2023

KAFKA-15996: Improve JsonConverter performance #14992

Merged
