
Metric reporting from native to JVM is a hot-path allocator / JNI bottleneck #4072

@andygrove

Description

Describe the problem

Profiling native execution with async-profiler shows that a significant amount of time is spent in metric reporting. This happens on the per-batch hot path, so the cost is paid for every batch returned from DataFusion back to the JVM.

Where the cost comes from

update_comet_metric is called from jni_api.rs:

  • Once for every batch returned to the JVM (executePlan fast path)
  • Periodically during ScanExec polling (every ~100 polls, gated by metrics_update_interval)
  • At stream end
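The gating described above can be sketched as follows. This is a hypothetical illustration, not the actual Comet code: the struct and method names are invented, and only `metrics_update_interval` (the ~100-poll gate) and the stream-end update come from the issue.

```rust
// Hypothetical sketch of the update cadence: one update per interval of
// polls, plus one final update at stream end. Names are illustrative only.
struct MetricReporter {
    poll_count: u64,
    update_interval: u64, // stands in for metrics_update_interval (~100 polls)
    updates_sent: u64,
}

impl MetricReporter {
    fn on_poll(&mut self) {
        self.poll_count += 1;
        if self.poll_count % self.update_interval == 0 {
            self.send_update();
        }
    }

    fn on_stream_end(&mut self) {
        self.send_update();
    }

    fn send_update(&mut self) {
        // In the real code this triggers the full serialize + JNI pipeline
        // described below; here we only count invocations.
        self.updates_sent += 1;
    }
}

fn main() {
    let mut r = MetricReporter { poll_count: 0, update_interval: 100, updates_sent: 0 };
    for _ in 0..250 {
        r.on_poll();
    }
    r.on_stream_end();
    println!("{}", r.updates_sent); // prints 3: polls 100 and 200, plus stream end
}
```

Note that the `executePlan` fast path reports once per batch, so there the effective interval is 1 and the full pipeline cost below is paid for every batch.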

For each call, the pipeline does the following:

Rust side (native/core/src/execution/metrics/utils.rs):

  1. to_native_metric_node recursively walks the SparkPlan tree.
  2. Per node: allocates a fresh HashMap<String, i64>, inserts each metric with name.to_string() (a per-metric allocation), and pushes the children into a new Vec.
  3. Serializes the resulting NativeMetricNode protobuf via prost's encode_to_vec(): allocates a Vec<u8>.
  4. byte_array_from_slice allocates a new Java byte[] and copies the protobuf bytes in.
  5. One JNI upcall to set_all_from_bytes([B)V.
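The allocation pattern in steps 1–4 can be sketched as below. This is a simplified stand-in, not the actual code in `utils.rs`: the struct definitions are illustrative, and the toy length-prefixed encoder merely stands in for prost's `encode_to_vec()` to show where each per-cycle allocation happens.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real types; fields are illustrative only.
struct SparkPlanNode {
    metrics: Vec<(&'static str, i64)>, // stable names, changing values
    children: Vec<SparkPlanNode>,
}

struct NativeMetricNode {
    metrics: HashMap<String, i64>,
    children: Vec<NativeMetricNode>,
}

// Steps 1-2: recursive walk allocating a fresh HashMap per node, a fresh
// String per metric, and a new Vec of children -- every cycle.
fn to_native_metric_node(plan: &SparkPlanNode) -> NativeMetricNode {
    let mut metrics = HashMap::new(); // per-node allocation
    for (name, value) in &plan.metrics {
        metrics.insert(name.to_string(), *value); // per-metric String allocation
    }
    let children = plan.children.iter().map(to_native_metric_node).collect();
    NativeMetricNode { metrics, children }
}

// Step 3: toy length-prefixed encoding standing in for prost's encode_to_vec().
fn encode(node: &NativeMetricNode, buf: &mut Vec<u8>) {
    buf.extend_from_slice(&(node.metrics.len() as u32).to_le_bytes());
    for (name, value) in &node.metrics {
        buf.extend_from_slice(&(name.len() as u32).to_le_bytes());
        buf.extend_from_slice(name.as_bytes());
        buf.extend_from_slice(&value.to_le_bytes());
    }
    buf.extend_from_slice(&(node.children.len() as u32).to_le_bytes());
    for child in &node.children {
        encode(child, buf);
    }
}

fn main() {
    let plan = SparkPlanNode {
        metrics: vec![("output_rows", 1024), ("elapsed_compute", 37)],
        children: vec![SparkPlanNode { metrics: vec![("output_rows", 1024)], children: vec![] }],
    };
    // One full cycle: tree rebuild + serialize + byte copy, repeated per update.
    let tree = to_native_metric_node(&plan);
    let mut proto_bytes = Vec::new(); // stands in for encode_to_vec()'s Vec<u8>
    encode(&tree, &mut proto_bytes);
    let java_copy = proto_bytes.clone(); // step 4: the Java byte[] alloc + copy
    println!("{} bytes serialized and copied", java_copy.len());
}
```

Even in this toy version, every cycle allocates one `HashMap` and one `Vec` per node, one `String` per metric, one serialization buffer, and one full copy of it, which mirrors the pattern the profile shows.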

JVM side (CometMetricNode.scala):

  1. Metric.NativeMetricNode.parseFrom(bytes): full protobuf tree re-allocation.
  2. Recursive walk pairing the deserialized tree with the local CometMetricNode tree via zip(children).
  3. Per metric: metrics.get(metricName) lookup on the node's Map[String, SQLMetric], then SQLMetric.set(v).
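The JVM-side walk is Scala (`CometMetricNode.scala`); a Rust analogue of the same pairing-and-lookup pattern, with invented type names, looks like this:

```rust
use std::collections::HashMap;

// Rust analogue of the Scala-side walk; types are illustrative. The real code
// pairs the deserialized protobuf tree with CometMetricNode and calls
// SQLMetric.set(v) after a metrics.get(metricName) lookup.
struct DeserializedNode {
    metrics: Vec<(String, i64)>,
    children: Vec<DeserializedNode>,
}

struct LocalMetricNode {
    metrics: HashMap<String, i64>, // stands in for Map[String, SQLMetric]
    children: Vec<LocalMetricNode>,
}

fn set_all(from: &DeserializedNode, to: &mut LocalMetricNode) {
    for (name, value) in &from.metrics {
        // Per-metric hash lookup on every update cycle.
        if let Some(m) = to.metrics.get_mut(name) {
            *m = *value; // stands in for SQLMetric.set(v)
        }
    }
    // zip(children): pair the two trees positionally and recurse.
    for (child_from, child_to) in from.children.iter().zip(to.children.iter_mut()) {
        set_all(child_from, child_to);
    }
}

fn main() {
    let from = DeserializedNode {
        metrics: vec![("output_rows".to_string(), 42)],
        children: vec![],
    };
    let mut to = LocalMetricNode {
        metrics: [("output_rows".to_string(), 0)].into_iter().collect(),
        children: vec![],
    };
    set_all(&from, &mut to);
    println!("output_rows = {}", to.metrics["output_rows"]);
}
```

The key point is that the pairing is positional and the per-metric lookup is by name, so both the deserialized tree and the hash lookups are repeated from scratch on every cycle.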

So every update cycle pays for: a tree of per-node HashMap allocations, per-metric String allocations, protobuf serialization, a Java byte-array allocation and copy, protobuf deserialization into a fresh tree, and a per-metric hash-map lookup on the JVM side. None of this state is carried across cycles, even though the tree shape and metric names are stable for the lifetime of the plan.
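To make the "stable across cycles" observation concrete, here is a sketch (not something this issue proposes, purely an illustration of what carrying state would mean) where the metric-name strings and the serialization buffer are allocated once and reused on every cycle:

```rust
// Illustrative only: per-plan state allocated once, reused each update cycle,
// exploiting the fact that metric names and tree shape are fixed per plan.
struct CachedReporter {
    names: Vec<String>, // metric names allocated once, not per cycle
    buf: Vec<u8>,       // serialization buffer reused across cycles
}

impl CachedReporter {
    fn new(names: &[&str]) -> Self {
        CachedReporter {
            names: names.iter().map(|n| n.to_string()).collect(),
            buf: Vec::with_capacity(1024),
        }
    }

    // Encode current values into the reused buffer: no per-cycle HashMap,
    // String, or Vec<u8> allocations once the buffer capacity is warm.
    fn encode_cycle(&mut self, values: &[i64]) -> &[u8] {
        self.buf.clear(); // keeps capacity; no reallocation
        for (name, value) in self.names.iter().zip(values) {
            self.buf.extend_from_slice(name.as_bytes());
            self.buf.extend_from_slice(&value.to_le_bytes());
        }
        &self.buf
    }
}

fn main() {
    let mut r = CachedReporter::new(&["output_rows", "elapsed_compute"]);
    for v in 0..3 {
        let bytes = r.encode_cycle(&[v, v * 2]);
        println!("cycle {v}: {} bytes", bytes.len());
    }
}
```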

Impact

The cost shows up in async-profiler flame graphs as a disproportionate share of native-thread CPU during query execution, especially for plans with many nodes and small-to-medium batch sizes, where update frequency is high relative to per-batch work.
