[client] Avoid per-record AutoPartitionStrategy construction in WriterClient.doSend in log tables by binary-signal · Pull Request #3529 · apache/fluss

binary-signal · 2026-06-25T13:38:07Z

Purpose

Linked issue: close #3527

This PR removes the per-record AutoPartitionStrategy construction from the WriterClient.doSend hot path, which improves write throughput for Fluss log tables.

Brief change log

Two patches:

Skip the dynamic-partition check when the table has no partition keys (fluss-client/.../WriterClient.java):
```
if (!tableInfo.getPartitionKeys().isEmpty()) {
    dynamicPartitionCreator.checkAndCreatePartitionAsync(
            physicalTablePath,
            tableInfo.getPartitionKeys(),
            tableInfo.getTableConfig().getAutoPartitionStrategy());
}
```
DynamicPartitionCreator.checkAndCreatePartitionAsync already returns
immediately when partitionName == null. This just hoists the same check up
to the caller so the expensive argument never has to be evaluated. No
behaviour change.
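
The reason the callee's early return was not enough can be shown with a small stand-alone sketch: Java evaluates method arguments before the call, so an expensive argument is built even when the callee returns on its first line. All names below (`EagerArgumentDemo`, `buildStrategy`, `checkAndCreate`) are hypothetical stand-ins for illustration, not Fluss classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; these are hypothetical stand-ins, not Fluss code. */
final class EagerArgumentDemo {
    /** Counts how often the "expensive" argument is constructed. */
    static final AtomicInteger BUILDS = new AtomicInteger();

    /** Stand-in for an expensive factory such as AutoPartitionStrategy.from(config). */
    static String buildStrategy() {
        BUILDS.incrementAndGet();
        return "strategy";
    }

    /** Stand-in for the callee: returns immediately when there is no partition name. */
    static void checkAndCreate(String partitionName, String strategy) {
        if (partitionName == null) {
            return; // too late to save work: the caller already built 'strategy'
        }
        // partition creation would happen here
    }

    /** Sends n records with no caller-side guard; returns how many strategies were built. */
    static int sendUnguarded(int n) {
        BUILDS.set(0);
        for (int i = 0; i < n; i++) {
            checkAndCreate(null, buildStrategy()); // argument evaluated before the call
        }
        return BUILDS.get();
    }

    /** Sends n records with the guard hoisted to the caller; returns how many strategies were built. */
    static int sendGuarded(int n, boolean partitioned) {
        BUILDS.set(0);
        for (int i = 0; i < n; i++) {
            if (partitioned) {
                checkAndCreate("p", buildStrategy());
            }
        }
        return BUILDS.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded builds: " + sendUnguarded(1000));
        System.out.println("guarded builds:   " + sendGuarded(1000, false));
    }
}
```

In this sketch the unguarded loop builds one value per record, while the guarded loop builds none for a non-partitioned table.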

Memoise the strategy on TableConfig (fluss-common/.../config/TableConfig.java):
```
public class TableConfig {
    private final Configuration config;
    // Lazily initialised cache; null until the first call to the getter.
    private volatile AutoPartitionStrategy autoPartitionStrategy;

    public AutoPartitionStrategy getAutoPartitionStrategy() {
        AutoPartitionStrategy s = autoPartitionStrategy;
        if (s == null) {
            s = AutoPartitionStrategy.from(config);
            autoPartitionStrategy = s; // benign race: same value
        }
        return s;
    }
    ...
}
```

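TableConfig is immutable after construction, and so is AutoPartitionStrategy
(all of its fields are final). The cached instance can therefore be published
through a volatile field without locking; if two threads race, each builds an
equivalent instance.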
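
A minimal, self-contained sketch of this racy-but-benign lazy initialisation pattern follows. `Strategy` and `Config` are hypothetical stand-ins for AutoPartitionStrategy and TableConfig, and the `builds` counter exists only to make the behaviour observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; Strategy and Config are hypothetical stand-ins. */
final class LazyStrategyDemo {
    /** Immutable value object: all fields final, safe to share once published. */
    static final class Strategy {
        final String timeZoneId;

        Strategy(String timeZoneId) {
            this.timeZoneId = timeZoneId;
        }
    }

    static final class Config {
        /** Instrumentation for the demo: how many times the value was built. */
        final AtomicInteger builds = new AtomicInteger();
        private final String timeZoneId;
        /** Null until first use; volatile so a published instance is visible to all threads. */
        private volatile Strategy cached;

        Config(String timeZoneId) {
            this.timeZoneId = timeZoneId;
        }

        Strategy getStrategy() {
            Strategy s = cached; // one volatile read on the hot path
            if (s == null) {
                builds.incrementAndGet();
                s = new Strategy(timeZoneId);
                cached = s; // racing threads may each build once; results are equivalent
            }
            return s;
        }
    }

    public static void main(String[] args) {
        Config config = new Config("UTC");
        Strategy first = config.getStrategy();
        for (int i = 0; i < 999; i++) {
            config.getStrategy();
        }
        System.out.println("builds: " + config.builds.get());
        System.out.println("same instance: " + (first == config.getStrategy()));
    }
}
```

Note that the pattern only guarantees equivalent values, not a single instance under contention: two threads that both observe null may briefly hold different but equal-content objects. A test asserting instance identity is therefore only reliable when run single-threaded.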

Tests

No new unit tests. The change is observable only as a perf improvement.

Benchmark Performance Improvements

Fluss Flink Sink / Taskmanager

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Writer records/s, avg (sum across 16 subtasks) | 2,526,516 | 2,826,183 | +11.9% |
| Writer records/s, peak | 3,342,630 | 4,421,649 | +32.3% |
| Writer records/s, p95 | 3,082,199 | 4,198,872 | +36.2% |
| TaskManager JVM CPU load, avg | 51.5% | 44.1% | -7.4 pp |
| JVM GC time, ms per second, avg | 1,433 | 999 | -30.3% |
| Writer task busy time, ms/s, avg | 642.5 | 449.7 | -30.0% |
| Writer task idle time, ms/s, avg | 357.5 | 550.3 | +53.9% |
| Records per batch, avg | 40,971 | 53,531 | +30.7% |
| Bytes per batch, avg | 1,267,516 | 1,645,327 | +29.8% |

Per-record CPU cost

TaskManager level: cpu_load / throughput

Before: 0.515 / 2,526,516 = 204 ns of CPU per record
After: 0.441 / 2,826,183 = 156 ns of CPU per record
Down 23.5%

Writer-task level: busy_ms_per_s / throughput

Before: 642.5 / 2,526,516 = 254 ns per record
After: 449.7 / 2,826,183 = 159 ns per record
Down 37.4%
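
The arithmetic above can be reproduced with a few lines of Java. This is only a sketch of the unit conversion: the input figures are the ones from the table, and the helper names (`nanosPerRecord`, `percentDrop`) are made up for illustration.

```java
/** Illustrative sketch: re-derives the per-record figures quoted in the PR description. */
final class PerRecordCost {
    /** Converts "seconds of work per wall-clock second" and "records per second" into ns per record. */
    static double nanosPerRecord(double workSecondsPerSecond, double recordsPerSecond) {
        return workSecondsPerSecond / recordsPerSecond * 1e9;
    }

    /** Relative reduction from 'before' to 'after', in percent. */
    static double percentDrop(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // TaskManager level: CPU load as a fraction, divided by throughput.
        double tmBefore = nanosPerRecord(0.515, 2_526_516);
        double tmAfter = nanosPerRecord(0.441, 2_826_183);
        // Writer-task level: busy time in ms/s converted to s/s, divided by throughput.
        double writerBefore = nanosPerRecord(642.5 / 1000.0, 2_526_516);
        double writerAfter = nanosPerRecord(449.7 / 1000.0, 2_826_183);

        System.out.printf("TM:     %.0f ns -> %.0f ns (down %.1f%%)%n",
                tmBefore, tmAfter, percentDrop(tmBefore, tmAfter));
        System.out.printf("Writer: %.0f ns -> %.0f ns (down %.1f%%)%n",
                writerBefore, writerAfter, percentDrop(writerBefore, writerAfter));
    }
}
```

The percentages computed from unrounded intermediate values can differ in the last digit from those derived from the rounded nanosecond figures.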

The 37% drop at the writer-task level is consistent with the "before"
flamegraph, which attributed roughly half the writer CPU to
TimeZone.getTimeZone plus the AutoPartitionStrategy.from construction around
it. Removing both shaves about that much off the per-record cost. The remaining
work (Arrow encoding, ZSTD, serialisation) didn't change.

What got faster

Throughput is up at every percentile. Peak and p95 moved more than the mean,
which fits the picture of the writer no longer being the bottleneck.

CPU is down even though throughput is up: the TaskManager processes more
records with less work.

TM GC time dropped. The fix removes the per-record allocation of
AutoPartitionStrategy and the throwaway Configuration view, which is the
kind of short-lived garbage that drives young-gen churn.

Batches got about 30% bigger by both record count and bytes. With the writer
faster, the batch window naturally accumulates more records before the timeout
fires, so each RPC carries more payload.

Fluss Server-side

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| `fluss_tabletserver_table_messagesInPerSecond`, avg | 2,526,438 | 2,816,269 | +11.5% |
| `fluss_tabletserver_table_messagesInPerSecond`, peak | 3,346,781 | 4,464,978 | +33.4% |
| `fluss_tabletserver_table_messagesInPerSecond`, p95 | 3,077,221 | 4,198,341 | +36.4% |
| `fluss_tabletserver_table_bytesInPerSecond`, avg | 77.9 MB/s | 86.6 MB/s | +11.2% |
| `fluss_tabletserver_table_bytesInPerSecond`, peak | 103.2 MB/s | 138.1 MB/s | +33.8% |
| `fluss_tabletserver_table_bytesOutPerSecond`, avg | 99.7 MB/s | 90.0 MB/s | -9.7% (see note below) |
| `fluss_tabletserver_table_totalProduceLogRequestsPerSecond`, avg | 149.1 | 186.7 | +25.2% |
| `fluss_tabletserver_table_totalProduceLogRequestsPerSecond`, peak | 427.5 | 373.2 | -12.7% (fewer peak RPCs at higher throughput) |
| `fluss_tabletserver_table_failedProduceLogRequestsPerSecond` | 0 | 0 | |

Flink Task - Fluss Sink Flamegraph Before
[image]

Flink Task - Fluss Sink Flamegraph After
[image]
No trace of TimeZone.getTimeZone.

Flink
[image]
Note: the left-hand side is before and the right-hand side is after.

Fluss
[image]
Note: the left-hand side is before and the right-hand side is after.

API and Format

No API or storage format changes.

Documentation

None needed. No new feature, no user-visible behaviour change.

Building the AutoPartitionStrategy on every record needlessly cost CPU in the hot write path, even for tables without partition keys. Guard the dynamic partition creation call in WriterClient.doSend behind a partition-keys check so the strategy is only resolved for partitioned tables, and cache AutoPartitionStrategy lazily in TableConfig via a volatile field rather than rebuilding it on each access.

Signed-off-by: Evan <binary-signal@users.noreply.github.com>

binary-signal · 2026-06-25T13:57:20Z

PTAL @fresh-borzoni && @luoyuxia xD

morazow

Thanks @binary-signal, great improvements! 🚀

I have added a couple of suggestions.

Another question: do you have benchmarks / flamegraphs for autoPartitionStrategy caching only? E.g., with a partitioned table.

- test the strategy is memoized, so repeated calls return the same instance (testAutoPartitionStrategyIsCached)
- use tableInfo.isPartitioned instead of !tableInfo.getPartitionKeys().isEmpty()
- address comments

Signed-off-by: Evan <binary-signal@users.noreply.github.com>

binary-signal · 2026-06-25T17:09:02Z

> Thanks @binary-signal great improvements! 🚀
>
> I have added couple suggestions.
>
> Another question: Do you have benchmarks / flamegraphs for autoPartitionStrategy caching only? E.g, with partitioned table.

Good question, and no, I don't have benchmarks / flamegraphs for autoPartitionStrategy caching only.

The benchmark was run on a non-partitioned log table. On that path the new isPartitioned() guard short-circuits before getAutoPartitionStrategy() is ever called, so the memoisation (patch 2) contributes nothing to those results; the entire gain there comes from patch 1 skipping the per-record work.

The caching only matters for partitioned tables. There the guard passes, and since Java evaluates the argument eagerly, getAutoPartitionStrategy() runs on every record (WriterClient.java:182):

```
private void doSend(WriteRecord record, WriteCallback callback) {
    ...
    if (tableInfo.isPartitioned()) {
        dynamicPartitionCreator.checkAndCreatePartitionAsync(
                physicalTablePath,
                tableInfo.getPartitionKeys(),
                tableInfo.getTableConfig().getAutoPartitionStrategy()); // arg evaluated per record
    }
    ...
}
```
The cache turns that per-record call into a single volatile read (TableConfig.java:165):

```
public AutoPartitionStrategy getAutoPartitionStrategy() {
    AutoPartitionStrategy s = autoPartitionStrategy; // volatile read on the hot path
    if (s == null) {                                 // construct once
        s = AutoPartitionStrategy.from(config);
        autoPartitionStrategy = s;
    }
    return s;
}
```

Without it, each record paid a fresh AutoPartitionStrategy.from(config), which includes the TimeZone.getTimeZone() lookup that showed up in the flamegraph (AutoPartitionStrategy.java:57):

```
public static AutoPartitionStrategy from(Configuration conf) {
    return new AutoPartitionStrategy(
            conf.getBoolean(ConfigOptions.TABLE_AUTO_PARTITION_ENABLED),
            conf.getString(ConfigOptions.TABLE_AUTO_PARTITION_KEY),
            conf.get(ConfigOptions.TABLE_AUTO_PARTITION_TIME_UNIT),
            conf.getInt(ConfigOptions.TABLE_AUTO_PARTITION_NUM_PRECREATE),
            conf.getInt(ConfigOptions.TABLE_AUTO_PARTITION_NUM_RETENTION),
            TimeZone.getTimeZone(conf.getString(ConfigOptions.TABLE_AUTO_PARTITION_TIMEZONE))); // costly lookup
}
```

So I'd expect a similar flamegraph delta to the non-partitioned case, with getTimeZone() + AutoPartitionStrategy.from() disappearing from the writer hot path, but I haven't measured it yet.

morazow

Thanks for the explanation @binary-signal 🙏

Looks good from my side. Let's wait for PMC / contributors to review and merge 🤝

fresh-borzoni

@binary-signal Thank you for the PR, I left a suggestion, PTAL

Instead of memoizing the strategy on TableConfig, pass TableInfo into checkAndCreatePartitionAsync and resolve getAutoPartitionStrategy() inside the create branch. The strategy is only needed when a partition is actually created, so on the common "already exists" path it is never built. This removes the cached field.

- WriterClient: pass tableInfo instead of pre-computing the strategy
- DynamicPartitionCreator: resolve partitionKeys + strategy in the create branch only
- TableConfig: drop the cached volatile field and stale javadoc
- Remove the caching test added previously

Signed-off-by: Evan <binary-signal@users.noreply.github.com>
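
The idea in this last commit, resolving the strategy only when a partition actually has to be created, can be sketched in isolation as follows. This is a simplified, synchronous illustration under stated assumptions: `TableInfoLike`, `Creator`, and `resolveStrategy` are hypothetical names, and the real DynamicPartitionCreator is asynchronous and is not reproduced here.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; all types here are hypothetical stand-ins, not Fluss code. */
final class CreateBranchOnlyDemo {
    /** Stand-in for TableInfo: resolving the strategy is the expensive step. */
    static final class TableInfoLike {
        final AtomicInteger strategyBuilds = new AtomicInteger();

        String resolveStrategy() {
            strategyBuilds.incrementAndGet();
            return "strategy";
        }
    }

    /** Stand-in for the partition creator: remembers which partitions already exist. */
    static final class Creator {
        private final Set<String> known = ConcurrentHashMap.newKeySet();

        void checkAndCreate(String partitionName, TableInfoLike tableInfo) {
            if (partitionName == null || known.contains(partitionName)) {
                return; // common path: nothing to create, strategy never resolved
            }
            if (known.add(partitionName)) {
                // Create branch: the only place the strategy is needed.
                String strategy = tableInfo.resolveStrategy();
                System.out.println("creating " + partitionName + " with " + strategy);
            }
        }
    }

    /** Sends n records round-robin across the given partitions; returns strategy build count. */
    static int send(int n, String... partitions) {
        TableInfoLike tableInfo = new TableInfoLike();
        Creator creator = new Creator();
        for (int i = 0; i < n; i++) {
            creator.checkAndCreate(partitions[i % partitions.length], tableInfo);
        }
        return tableInfo.strategyBuilds.get();
    }

    public static void main(String[] args) {
        System.out.println("builds: " + send(1000, "p1", "p2", "p3"));
    }
}
```

In this sketch the number of strategy resolutions scales with the number of newly created partitions rather than with the number of records, which is why no cached field is needed.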

binary-signal · 2026-07-02T11:05:20Z

@fresh-borzoni PTAL 🙏

fresh-borzoni

@binary-signal LGTM, thank you

morazow suggested changes Jun 25, 2026

Comment thread fluss-client/src/main/java/org/apache/fluss/client/write/WriterClient.java Outdated

Comment thread fluss-common/src/main/java/org/apache/fluss/config/TableConfig.java

Comment thread fluss-common/src/main/java/org/apache/fluss/config/TableConfig.java

address comments

be9e5c4

binary-signal requested a review from morazow June 25, 2026 17:09

morazow approved these changes Jun 25, 2026

fresh-borzoni reviewed Jun 25, 2026

Comment thread fluss-client/src/main/java/org/apache/fluss/client/write/WriterClient.java Outdated

fresh-borzoni approved these changes Jul 3, 2026

fresh-borzoni merged commit 38fae34 into apache:main Jul 3, 2026
7 checks passed

fresh-borzoni merged 3 commits into apache:main from binary-signal:log-table-perf-auto-partition
