[client] Avoid per-record AutoPartitionStrategy construction in WriterClient.doSend in log tables #3529

Merged: fresh-borzoni merged 3 commits into apache:main from binary-signal:log-table-perf-auto-partition on Jul 3, 2026
Conversation

@binary-signal

Contributor

Purpose

Linked issue: close #3527

This PR removes the per-record cost of constructing AutoPartitionStrategy in WriterClient.doSend, which improves write throughput for Fluss log tables.

Brief change log

Two patches:

  1. Skip the dynamic-partition check when the table has no partition keys (
    fluss-client/.../WriterClient.java):

    if (!tableInfo.getPartitionKeys().isEmpty()) {
        dynamicPartitionCreator.checkAndCreatePartitionAsync(
                physicalTablePath,
                tableInfo.getPartitionKeys(),
                tableInfo.getTableConfig().getAutoPartitionStrategy());
    }

    DynamicPartitionCreator.checkAndCreatePartitionAsync already returns
    immediately when partitionName == null. This just hoists the same check up
    to the caller so the expensive argument never has to be evaluated. No
    behaviour change.

  2. Memoise the strategy on TableConfig (
    fluss-common/.../config/TableConfig.java):

    public class TableConfig {
        private final Configuration config;
        private volatile AutoPartitionStrategy autoPartitionStrategy;
    
        public AutoPartitionStrategy getAutoPartitionStrategy() {
            AutoPartitionStrategy s = autoPartitionStrategy;
            if (s == null) {
                s = AutoPartitionStrategy.from(config);
                autoPartitionStrategy = s; // benign race: same value
            }
            return s;
        }
        ...
    }

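    TableConfig is immutable after construction, and so is AutoPartitionStrategy,
    whose fields are all final. A volatile field is therefore enough: concurrent
    first calls may each build an instance, but the instances are equivalent.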
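
A minimal, self-contained sketch of the lazy volatile caching described in patch 2. The names `Strategy`, `CachedConfig` and `MemoDemo` are hypothetical stand-ins, not Fluss classes; the sketch only illustrates that the factory runs once on a single thread and that repeated calls return the cached instance.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for an immutable value such as AutoPartitionStrategy. */
final class Strategy {
    final String timeZoneId;

    Strategy(String timeZoneId) {
        this.timeZoneId = timeZoneId;
    }
}

/** Hypothetical stand-in for TableConfig: caches the value in a volatile field. */
final class CachedConfig {
    private final Supplier<Strategy> factory;
    private volatile Strategy cached;

    CachedConfig(Supplier<Strategy> factory) {
        this.factory = factory;
    }

    Strategy getStrategy() {
        Strategy s = cached;      // one volatile read on the fast path
        if (s == null) {
            s = factory.get();    // may run more than once if threads race here
            cached = s;           // last writer wins; all candidates are equivalent
        }
        return s;
    }
}

final class MemoDemo {
    /** Calls getStrategy() n times on one thread and returns how often the factory ran. */
    static int constructionsAfter(int n) {
        AtomicInteger built = new AtomicInteger();
        CachedConfig config = new CachedConfig(() -> {
            built.incrementAndGet();
            return new Strategy("UTC");
        });
        for (int i = 0; i < n; i++) {
            config.getStrategy();
        }
        return built.get();
    }

    /** Returns true when two consecutive calls hand back the same cached instance. */
    static boolean returnsSameInstance() {
        CachedConfig config = new CachedConfig(() -> new Strategy("UTC"));
        return config.getStrategy() == config.getStrategy();
    }

    public static void main(String[] args) {
        System.out.println("factory calls: " + constructionsAfter(1_000_000));
        System.out.println("same instance: " + returnsSameInstance());
    }
}
```

Under contention the factory can run more than once, which is acceptable only because the produced values are immutable and interchangeable.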

Tests

No new unit tests. The change is observable only as a perf improvement.

Benchmark Performance Improvements

Fluss Flink Sink / Taskmanager

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Writer records/s, avg (sum across 16 subtasks) | 2,526,516 | 2,826,183 | +11.9% |
| Writer records/s, peak | 3,342,630 | 4,421,649 | +32.3% |
| Writer records/s, p95 | 3,082,199 | 4,198,872 | +36.2% |
| TaskManager JVM CPU load, avg | 51.5% | 44.1% | -7.4 pp |
| JVM GC time, ms per second, avg | 1,433 | 999 | -30.3% |
| Writer task busy time, ms/s, avg | 642.5 | 449.7 | -30.0% |
| Writer task idle time, ms/s, avg | 357.5 | 550.3 | +53.9% |
| Records per batch, avg | 40,971 | 53,531 | +30.7% |
| Bytes per batch, avg | 1,267,516 | 1,645,327 | +29.8% |

Per-record CPU cost

TaskManager level: cpu_load / throughput

  • Before: 0.515 / 2,526,516 = 204 ns of CPU per record
  • After: 0.441 / 2,826,183 = 156 ns of CPU per record
  • Down 23.4%

Writer-task level: busy_ms_per_s / throughput

  • Before: 642.5 / 2,526,516 = 254 ns per record
  • After: 449.7 / 2,826,183 = 159 ns per record
  • Down 37.4%
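
The arithmetic above can be reproduced with a short sketch. The class and method names are hypothetical; the inputs are the averages from the benchmark table, with CPU load expressed as a fraction and busy time converted from ms/s to seconds of work per second.

```java
final class PerRecordCost {
    /** Seconds of work per wall-clock second, divided by records per second, in nanoseconds. */
    static double nanosPerRecord(double secondsOfWorkPerSecond, double recordsPerSecond) {
        return secondsOfWorkPerSecond / recordsPerSecond * 1e9;
    }

    /** Relative reduction from before to after, in percent. */
    static double percentDrop(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // TaskManager level: JVM CPU load as a fraction.
        double tmBefore = nanosPerRecord(0.515, 2_526_516);
        double tmAfter = nanosPerRecord(0.441, 2_826_183);
        // Writer-task level: busy ms per second, converted to seconds.
        double writerBefore = nanosPerRecord(642.5 / 1000.0, 2_526_516);
        double writerAfter = nanosPerRecord(449.7 / 1000.0, 2_826_183);

        System.out.printf("TM:     %.0f -> %.0f ns/record (%.1f%% lower)%n",
                tmBefore, tmAfter, percentDrop(tmBefore, tmAfter));
        System.out.printf("Writer: %.0f -> %.0f ns/record (%.1f%% lower)%n",
                writerBefore, writerAfter, percentDrop(writerBefore, writerAfter));
    }
}
```

The last digit of a percentage can differ slightly depending on whether it is computed from the rounded or the unrounded ns values.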

The 37% drop at the writer-task level is consistent with the "before"
flamegraph, which attributed roughly half the writer CPU to
TimeZone.getTimeZone plus the AutoPartitionStrategy.from construction around
it. Removing both shaves about that much off the per-record cost. The remaining
work (Arrow encoding, ZSTD, serialisation) didn't change.

What got faster

Throughput is up at every percentile. Peak and p95 moved more than the mean,
which fits the picture of the writer no longer being the bottleneck.

CPU is down even though throughput is up: the TM is processing more records
with less work.

TM GC time dropped. The fix removes the per-record allocation of
AutoPartitionStrategy and the throwaway Configuration view, which is the
kind of short-lived garbage that drives young-gen churn.

Batches got about 30% bigger by both record count and bytes. With the writer
faster, the batch window naturally accumulates more records before the timeout
fires, so each RPC carries more payload.

Fluss Server-side

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| fluss_tabletserver_table_messagesInPerSecond, avg | 2,526,438 | 2,816,269 | +11.5% |
| fluss_tabletserver_table_messagesInPerSecond, peak | 3,346,781 | 4,464,978 | +33.4% |
| fluss_tabletserver_table_messagesInPerSecond, p95 | 3,077,221 | 4,198,341 | +36.4% |
| fluss_tabletserver_table_bytesInPerSecond, avg | 77.9 MB/s | 86.6 MB/s | +11.2% |
| fluss_tabletserver_table_bytesInPerSecond, peak | 103.2 MB/s | 138.1 MB/s | +33.8% |
| fluss_tabletserver_table_bytesOutPerSecond, avg | 99.7 MB/s | 90.0 MB/s | -9.7% (see note below) |
| fluss_tabletserver_table_totalProduceLogRequestsPerSecond, avg | 149.1 | 186.7 | +25.2% |
| fluss_tabletserver_table_totalProduceLogRequestsPerSecond, peak | 427.5 | 373.2 | -12.7% (fewer peak RPCs at higher throughput) |
| fluss_tabletserver_table_failedProduceLogRequestsPerSecond | 0 | 0 | |

Flink Task - Fluss Sink Flamegraph Before
[Screenshot 2026-06-25 at 14 21 58]

Flink Task - Fluss Sink Flamegraph After
No trace of TimeZone.getTimeZone
[Screenshot 2026-06-24 at 23 24 43]

Flink
Note: the left-hand side is before and the right-hand side is after.

[Screenshots (6), 2026-06-25, 16 18 32 to 16 20 50]

Fluss
Note: the left-hand side is before and the right-hand side is after.

[Screenshots (5), 2026-06-25, 16 21 49 to 16 28 22]

API and Format

No API or storage format changes.

Documentation

None needed. No new feature, no user-visible behaviour change.

Building the AutoPartitionStrategy on every record needlessly cost CPU in the hot write path, even for tables without partition keys.

Guard the dynamic partition creation call in WriterClient.doSend behind a partition-keys check so the strategy is only resolved for partitioned tables, and cache AutoPartitionStrategy lazily in TableConfig via a volatile field rather than rebuilding it on each access.

Signed-off-by: Evan <binary-signal@users.noreply.github.com>
@binary-signal

Contributor Author

PTAL @fresh-borzoni && @luoyuxia xD

@morazow morazow left a comment

Contributor

Thanks @binary-signal great improvements! 🚀

I have added a couple of suggestions.

Another question: do you have benchmarks / flamegraphs for the autoPartitionStrategy caching only? E.g., with a partitioned table.

Comment thread fluss-client/src/main/java/org/apache/fluss/client/write/WriterClient.java Outdated
Comment thread fluss-common/src/main/java/org/apache/fluss/config/TableConfig.java
Comment thread fluss-common/src/main/java/org/apache/fluss/config/TableConfig.java
- test that the strategy is memoized, so repeated calls return the same instance (testAutoPartitionStrategyIsCached)

- use tableInfo.isPartitioned() instead of !tableInfo.getPartitionKeys().isEmpty()

- address comments

Signed-off-by: Evan <binary-signal@users.noreply.github.com>
@binary-signal

Contributor Author

> Thanks @binary-signal great improvements! 🚀
>
> I have added a couple of suggestions.
>
> Another question: do you have benchmarks / flamegraphs for the autoPartitionStrategy caching only? E.g., with a partitioned table.

Good question, and no, I don't have benchmarks or flamegraphs for the autoPartitionStrategy caching on its own.

The benchmark was run on a non-partitioned log table. On that path the new isPartitioned() guard short-circuits before getAutoPartitionStrategy() is ever called, so the memoisation (patch 2) contributes nothing to those results. The entire gain there comes from patch 1 skipping the per-record work.

The caching only matters for partitioned tables. There the guard passes, and since Java evaluates the argument eagerly, getAutoPartitionStrategy() runs on every record (WriterClient.java:182):

private void doSend(WriteRecord record, WriteCallback callback) {
    ...
    if (tableInfo.isPartitioned()) {
        dynamicPartitionCreator.checkAndCreatePartitionAsync(
                physicalTablePath,
                tableInfo.getPartitionKeys(),
                // Arguments are evaluated before the call, so this getter runs for every record.
                tableInfo.getTableConfig().getAutoPartitionStrategy());
    }
    ...
}

The cache turns that per-record call into a single volatile read (TableConfig.java:165):

public AutoPartitionStrategy getAutoPartitionStrategy() {
    AutoPartitionStrategy s = autoPartitionStrategy; // volatile read on the hot path
    if (s == null) {                                 // construct once
        s = AutoPartitionStrategy.from(config);
        autoPartitionStrategy = s;
    }
    return s;
}

Without it, each record paid a fresh AutoPartitionStrategy.from(config), which includes the TimeZone.getTimeZone() lookup that showed up in the flamegraph (AutoPartitionStrategy.java:57):

public static AutoPartitionStrategy from(Configuration conf) {
    return new AutoPartitionStrategy(
            conf.getBoolean(ConfigOptions.TABLE_AUTO_PARTITION_ENABLED),
            conf.getString(ConfigOptions.TABLE_AUTO_PARTITION_KEY),
            conf.get(ConfigOptions.TABLE_AUTO_PARTITION_TIME_UNIT),
            conf.getInt(ConfigOptions.TABLE_AUTO_PARTITION_NUM_PRECREATE),
            conf.getInt(ConfigOptions.TABLE_AUTO_PARTITION_NUM_RETENTION),
            TimeZone.getTimeZone(conf.getString(ConfigOptions.TABLE_AUTO_PARTITION_TIMEZONE))); // costly lookup
}

So I'd expect a flamegraph delta similar to the non-partitioned case, with getTimeZone() and AutoPartitionStrategy.from() disappearing from the writer hot path, but I haven't measured it yet.
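
The eager-evaluation point can be illustrated with a small standalone sketch. Everything here is hypothetical (`buildStrategy` stands in for AutoPartitionStrategy.from(config), `checkAndCreate` for a callee with an early return); it only shows that an early return inside the callee does not save the cost of building its arguments, while a guard at the call site does.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class EagerArgsDemo {
    static final AtomicInteger builds = new AtomicInteger();

    /** Hypothetical expensive factory; counts how often it is invoked. */
    static String buildStrategy() {
        builds.incrementAndGet();
        return "strategy";
    }

    /** Hypothetical callee that returns early and never touches its second argument. */
    static void checkAndCreate(String partitionName, String strategy) {
        if (partitionName == null) {
            return; // too late: the caller has already built `strategy`
        }
    }

    /** No guard at the call site: the argument is built for every record. */
    static int buildsWithoutGuard(int records) {
        builds.set(0);
        for (int i = 0; i < records; i++) {
            checkAndCreate(null, buildStrategy());
        }
        return builds.get();
    }

    /** Guard at the call site: the argument is built only when the guard passes. */
    static int buildsWithGuard(int records, boolean partitioned) {
        builds.set(0);
        for (int i = 0; i < records; i++) {
            if (partitioned) {
                checkAndCreate("p", buildStrategy());
            }
        }
        return builds.get();
    }

    public static void main(String[] args) {
        System.out.println("without guard: " + buildsWithoutGuard(5));
        System.out.println("with guard, non-partitioned: " + buildsWithGuard(5, false));
        System.out.println("with guard, partitioned: " + buildsWithGuard(5, true));
    }
}
```

For a partitioned table the guard alone does not help, which is why the per-record getter call still needs to be made cheap or avoided.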

@binary-signal binary-signal requested a review from morazow June 25, 2026 17:09

@morazow morazow left a comment

Contributor

Thanks for the explanation @binary-signal 🙏

Looks good from my side. Let's wait for PMC / contributors to review and merge 🤝

@fresh-borzoni fresh-borzoni left a comment

Member

@binary-signal Thank you for the PR. I left a suggestion, PTAL.

Comment thread fluss-client/src/main/java/org/apache/fluss/client/write/WriterClient.java Outdated
Instead of memoizing the strategy on TableConfig, pass TableInfo into
checkAndCreatePartitionAsync and resolve getAutoPartitionStrategy()
inside the create branch. The strategy is only needed when a partition
is actually created, so on the common "already exists" path it is never
built. This removes the cached field.

- WriterClient: pass tableInfo instead of pre-computing the strategy
- DynamicPartitionCreator: resolve partitionKeys + strategy in the create branch only
- TableConfig: drop the cached volatile field and stale javadoc
- Remove the caching test added previously

Signed-off-by: Evan <binary-signal@users.noreply.github.com>
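
The approach in this commit message can be sketched roughly as follows. This is not the merged Fluss code: `TableInfoStub` and `PartitionCreatorSketch` are hypothetical, synchronous stand-ins for TableInfo and DynamicPartitionCreator, used only to show that the strategy is resolved when a partition is first created and not on the "already exists" path.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for TableInfo; counts how often the strategy is resolved. */
final class TableInfoStub {
    int strategyBuilds = 0;

    String getAutoPartitionStrategy() {
        strategyBuilds++; // stands in for AutoPartitionStrategy.from(config)
        return "strategy";
    }
}

/** Hypothetical stand-in for DynamicPartitionCreator (synchronous for simplicity). */
final class PartitionCreatorSketch {
    private final Set<String> known = ConcurrentHashMap.newKeySet();

    void checkAndCreate(String partitionName, TableInfoStub tableInfo) {
        if (partitionName == null || known.contains(partitionName)) {
            return; // common path: nothing is resolved or allocated
        }
        // Create branch only: resolve the strategy at the point of use.
        create(partitionName, tableInfo.getAutoPartitionStrategy());
    }

    private void create(String partitionName, String strategy) {
        // A real implementation would issue an asynchronous create request here.
        known.add(partitionName);
    }
}

final class LazyResolveDemo {
    /** Sends one "record" per partition name and returns how often the strategy was built. */
    static int strategyBuildsFor(String... partitionNames) {
        TableInfoStub tableInfo = new TableInfoStub();
        PartitionCreatorSketch creator = new PartitionCreatorSketch();
        for (String name : partitionNames) {
            creator.checkAndCreate(name, tableInfo);
        }
        return tableInfo.strategyBuilds;
    }

    public static void main(String[] args) {
        System.out.println(strategyBuildsFor("p1", "p1", "p1", "p2"));
    }
}
```

Resolving the value at the point of use makes a cache unnecessary, at the price of rebuilding the strategy once per newly created partition.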
@binary-signal

Contributor Author

@fresh-borzoni PTAL 🙏

@fresh-borzoni fresh-borzoni left a comment

Member

@binary-signal LGTM, thank you

@fresh-borzoni fresh-borzoni merged commit 38fae34 into apache:main Jul 3, 2026
7 checks passed
Development

Successfully merging this pull request may close these issues.

[client] WriterClient.doSend rebuilds AutoPartitionStrategy per record on non partitioned tables

3 participants