IGNITE-28836 DirectMessageWriter: reduce per-field overhead and per-message allocations on the message serialization hot path#13296
Open
anton-vinogradov wants to merge 1 commit into
Conversation
221a9c9 to
5f3d62e
Compare
Contributor
Author
|
Benchmark harness used for the numbers above, provided for reproducibility (not part of this PR). Drop it under JmhDirectMessageWriterBenchmark.java/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.ignite.internal.benchmarks.jmh.direct;
import java.nio.ByteBuffer;
import org.apache.ignite.internal.benchmarks.jmh.JmhAbstractBenchmark;
import org.apache.ignite.internal.benchmarks.jmh.runner.JmhIdeBenchmarkRunner;
import org.apache.ignite.internal.direct.DirectMessageWriter;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.options.TimeValue;
/**
* Benchmarks for {@link DirectMessageWriter} performance changes:
* <ul>
* <li>{@link #hotPathPrimitiveFields} measures the per-field write overhead (the {@code curStream} caching that
* avoids re-resolving {@code state.item().stream} on every primitive write). Run it on the baseline build and on
* the patched build and compare throughput.</li>
* <li>{@link #compressedScratchOld} vs {@link #compressedScratchNew} model the per-field scratch allocation done by
* {@code writeCompressedMessage}: a fresh direct buffer + a brand-new writer (old) versus a reused writer + heap
* buffer (new). Run together with {@code -prof gc} to see the allocation-rate / direct-memory difference.</li>
* </ul>
*/
@State(Scope.Thread)
public class JmhDirectMessageWriterBenchmark extends JmhAbstractBenchmark {
/** Network buffer the hot-path benchmark serializes into (reused, like the real comms write loop). */
private static final int NET_BUF = 64 * 1024;
/** Scratch buffer capacity, matches {@code DirectMessageWriter.TMP_BUF_CAPACITY}. */
private static final int TMP_BUF_CAPACITY = 10 * 1024;
/** Writer under test for the hot path. */
private DirectMessageWriter writer;
/** Reused network buffer. */
private ByteBuffer buf;
/** Reusable writer modeling the patched compressed path. */
private DirectMessageWriter reusable;
/** */
@Setup
public void setup() {
// The message factory is only used by writeMessage(); primitive writes and scratch allocation never touch it.
writer = new DirectMessageWriter(null);
buf = ByteBuffer.allocate(NET_BUF);
reusable = new DirectMessageWriter(null);
}
/**
* Hot path: serialize a batch of mixed primitive fields into a reused buffer, mirroring how a generated serializer
* calls the writer field by field. Stresses the {@code state.item().stream} resolution done on every call.
*
* @return Accumulated {@code lastFinished} flag (consumed to prevent dead-code elimination).
*/
@Benchmark
public boolean hotPathPrimitiveFields() {
ByteBuffer b = buf;
b.clear();
writer.setBuffer(b);
boolean ok = true;
// 256 "messages" of 7 mixed fields = 1792 writer calls, ~7 KB, fits the 64 KB buffer.
for (int i = 0; i < 256; i++) {
ok &= writer.writeByte((byte)i);
ok &= writer.writeShort((short)i);
ok &= writer.writeInt(i);
ok &= writer.writeLong(i);
ok &= writer.writeBoolean((i & 1) == 0);
ok &= writer.writeFloat(i);
ok &= writer.writeDouble(i);
}
return ok;
}
/**
* Old compressed-path scratch allocation: fresh direct buffer + brand-new writer per compressed field.
*
* @return Allocated writer (returned to prevent dead-code elimination).
*/
@Benchmark
public Object compressedScratchOld() {
ByteBuffer tmp = ByteBuffer.allocateDirect(TMP_BUF_CAPACITY);
DirectMessageWriter w = new DirectMessageWriter(null);
w.setBuffer(tmp);
return w;
}
/**
* New compressed-path scratch allocation: reused writer + heap buffer per compressed field.
*
* @return Reused writer (returned to prevent dead-code elimination).
*/
@Benchmark
public Object compressedScratchNew() {
ByteBuffer tmp = ByteBuffer.allocate(TMP_BUF_CAPACITY);
DirectMessageWriter w = reusable;
w.reset();
w.setBuffer(tmp);
return w;
}
/**
* @param args Arguments.
* @throws Exception If failed.
*/
public static void main(String[] args) throws Exception {
JmhIdeBenchmarkRunner.create()
.forks(1)
.threads(1)
.warmupIterations(6)
.warmupTime(TimeValue.milliseconds(500))
.measurementIterations(10)
.measurementTime(TimeValue.milliseconds(500))
.benchmarks(JmhDirectMessageWriterBenchmark.class.getSimpleName())
.jvmArguments("-Xms2g", "-Xmx2g")
.profilers(GCProfiler.class)
.run();
}
} |
| ByteBuffer tmpBuf = ByteBuffer.allocateDirect(TMP_BUF_CAPACITY); | ||
| // Scratch buffer is only copied into a heap array by CompressedMessage before deflating, so a heap | ||
| // buffer is strictly cheaper than a direct one (no native alloc / Cleaner churn) on this per-field path. | ||
| ByteBuffer tmpBuf = ByteBuffer.allocate(TMP_BUF_CAPACITY); |
Contributor
There was a problem hiding this comment.
Now, when we have heap buffer, maybe it worth to use buf.array() in CompressedMessage.compress() instead of array copy?
5f3d62e to
ed75913
Compare
…essage allocations on the message serialization hot path Cache the current state item's stream (curStream) instead of re-resolving state.item().stream on every primitive write. In writeCompressedMessage() replace the per-field direct ByteBuffer with a reusable heap scratch buffer (CompressedMessage consumes it in the constructor, so it never escapes the method), reuse a thread-confined tmpWriter across fields, and grow the scratch buffer without an intermediate byte[] copy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ed75913 to
068634c
Compare
Contributor
Author
|
JMH benchmark used for the numbers in the PR description (kept out of the patch itself). Drop it into /*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.ignite.internal.benchmarks.jmh.direct;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import org.apache.ignite.internal.CoreMessagesProvider;
import org.apache.ignite.internal.benchmarks.jmh.runner.JmhIdeBenchmarkRunner;
import org.apache.ignite.internal.direct.DirectMessageWriter;
import org.apache.ignite.internal.managers.communication.IgniteMessageFactoryImpl;
import org.apache.ignite.internal.processors.affinity.AffinityTopologyVersion;
import org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsFullMessage;
import org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GroupPartitionIdPair;
import org.apache.ignite.internal.util.typedef.internal.U;
import org.apache.ignite.plugin.extensions.communication.MessageFactory;
import org.apache.ignite.plugin.extensions.communication.MessageFactoryProvider;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import static java.util.concurrent.TimeUnit.SECONDS;
import static org.apache.ignite.marshaller.Marshallers.jdk;
import static org.openjdk.jmh.annotations.Mode.Throughput;
import static org.openjdk.jmh.annotations.Scope.Thread;
/** Benchmarks the {@link DirectMessageWriter} hot path: per-field primitive writes and compressed map fields. */
@State(Thread)
@BenchmarkMode(Throughput)
@Warmup(iterations = 5, time = 3, timeUnit = SECONDS)
@Measurement(iterations = 5, time = 3, timeUnit = SECONDS)
@Fork(1)
public class JmhDirectMessageWriterBenchmark {
/** */
private DirectMessageWriter writer;
/** */
private ByteBuffer buf;
/** */
private UUID uuid;
/** */
public static void main(String[] args) throws Exception {
JmhIdeBenchmarkRunner.create()
.benchmarks(JmhDirectMessageWriterBenchmark.class.getName())
.run();
}
/** */
@Setup
public void setup() {
writer = new DirectMessageWriter(msgFactory());
buf = ByteBuffer.allocate(4 * 1024 * 1024);
uuid = UUID.randomUUID();
}
/** Mimics a generated serializer writing a message of mixed primitive fields. */
@Benchmark
public boolean primitiveFields() {
buf.clear();
writer.setBuffer(buf);
boolean res = writer.writeHeader((short)42);
for (int i = 0; i < 8; i++) {
res &= writer.writeInt(i);
res &= writer.writeLong(i);
res &= writer.writeByte((byte)i);
res &= writer.writeBoolean(true);
}
res &= writer.writeUuid(uuid);
writer.reset();
return res;
}
/** Exchange-style message with two compressed map fields. */
@Benchmark
public boolean compressedMapFields(CompressedPayload payload) {
buf.clear();
writer.setBuffer(buf);
boolean res = writer.writeMessage(payload.msg, true);
writer.reset();
return res;
}
/** */
private static MessageFactory msgFactory() {
return new IgniteMessageFactoryImpl(new MessageFactoryProvider[]{
new CoreMessagesProvider(jdk(), jdk(), U.gridClassLoader())});
}
/** Payload sized below ({@code 30}) and above ({@code 500}) the initial 10K scratch buffer. */
@State(Thread)
public static class CompressedPayload {
/** */
@Param({"30", "500"})
private int entries;
/** */
private GridDhtPartitionsFullMessage msg;
/** */
@Setup
public void setup() {
Map<UUID, Map<GroupPartitionIdPair, Long>> partHistSuppliers = new HashMap<>();
Map<UUID, Map<Integer, Set<Integer>>> partsToReload = new HashMap<>();
for (int i = 0; i < entries; i++) {
UUID nodeId = UUID.randomUUID();
partHistSuppliers.put(nodeId, Map.of(new GroupPartitionIdPair(i, i + 1), i + 2L));
partsToReload.put(nodeId, Map.of(i, Set.of(i + 1)));
}
msg = new GridDhtPartitionsFullMessage(null, null, new AffinityTopologyVersion(0), partHistSuppliers, partsToReload);
}
}
} |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
IGNITE-28836
https://issues.apache.org/jira/browse/IGNITE-28836
DirectMessageWriteris on the critical path of every outgoing message. This PR removes two inefficiencies:writeXxxre-resolvedstate.item().stream(array load + bounds check + field load) on every primitive write. The current stream is now cached incurStreamand refreshed only when the current item changes (setBuffer/beforeNestedWrite/afterNestedWrite).writeCompressedMessage()allocated a freshByteBuffer.allocateDirect(10KB)per compressed field (with a doubling re-allocation chain for payloads above 10KB) plus a brand-newDirectMessageWriter.CompressedMessageconsumes the scratch buffer right in its constructor (deflates into its ownbyte[]), so the buffer never escapeswriteCompressedMessage()and none of that churn is necessary. The writer now keeps one reusable heap scratch buffer (retained at the largest size seen) and a thread-confinedtmpWriter(reset()per use), and grows the buffer without an intermediatebyte[]copy.No wire-format change, no public API change.
Benchmark
JmhDirectMessageWriterBenchmark(source attached in a comment below):compressedMapFieldswrites an exchange-style message with two compressed map fields, sized below (30 entries) and above (500 entries) the initial 10KB scratch;primitiveFieldswrites a ~35-field primitive message. JDK 17,-prof gc, master vs patched:Note on the allocation columns: master's direct-buffer churn is invisible to
gc.alloc.rate.norm, but it shows up as GC time — Cleaner processing makes master's collections an order of magnitude more expensive at the same collection counts.Testing
DirectMarshallingMessagesTest— nested containers written through 16-byte chunks (multi-pass resume across buffer swaps, exercises thecurStreamrefresh points).CompressedMessageTest— >40KB compressed payload (exercises the scratch-buffer growth path), byte-for-byte writer→reader round-trip.🤖 Generated with Claude Code