Long segment and Host names stored in Zk can cause high heap usage and impact performance #14931

@vrajat

Segment and host names stored in ZooKeeper as part of cluster and data metadata can be very long. Their length depends on the table names and host names in the cluster. A couple of examples:

pinot-controller-controller-0-0.pinot-pinot-controller-headless.cell-bzf7co-managed.svc.cluster.local_9000
nation_dm2_0_output_4341_csv_FileIngestionTask_1732618352252_3715

In a test setup with a table of 200K segments, there are 5 million String objects, which take up 247 MB of memory.
A couple of stack traces for these allocations:

↖{j.u.LinkedHashMap}.values
  ↖{j.u.TreeMap}.values
    ↖org.apache.helix.zookeeper.datamodel.ZNRecord.mapFields
      ↖org.apache.helix.model.CurrentState._record

↖{j.u.LinkedHashMap}.values
  ↖{j.u.TreeMap}.values
    ↖org.apache.helix.zookeeper.datamodel.ZNRecord.mapFields
      ↖org.apache.helix.model.ResourceConfig._record
        ↖{j.u.HashMap}.values
          ↖org.apache.helix.common.caches.PropertyCache._objMap
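
As a rough check on the figures above: 247 MB spread over 5 million String objects averages out to roughly 50 bytes per object. The sketch below estimates the heap footprint of a single name. It is not taken from Pinot or Helix; the class `NameFootprint` and its constants are hypothetical, and the object layout is an assumption (64-bit HotSpot with compressed oops and compact strings, JDK 9 or later). Real numbers should come from a heap dump or a tool such as JOL.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: estimates the heap cost of one String under an ASSUMED layout
// (64-bit HotSpot, compressed oops, compact strings). Not a Pinot or Helix API.
final class NameFootprint {
    // Assumed String object size: 12-byte header + 4-byte array reference
    // + 4-byte hash + 2 bytes of flags, padded to 24 bytes.
    static final int STRING_OBJECT_BYTES = 24;
    // Assumed byte[] header size: 12-byte object header + 4-byte length.
    static final int ARRAY_HEADER_BYTES = 16;

    // Round up to the next multiple of 8 (default object alignment).
    static long align8(long n) {
        return (n + 7) & ~7L;
    }

    // Estimated bytes for the String object plus its backing byte[].
    static long estimateBytes(String s) {
        // Compact strings store 1 byte per char when every char fits in Latin-1, else 2.
        boolean latin1 = StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
        long payload = latin1 ? s.length() : 2L * s.length();
        return STRING_OBJECT_BYTES + align8(ARRAY_HEADER_BYTES + payload);
    }

    public static void main(String[] args) {
        String[] names = {
            "pinot-controller-controller-0-0.pinot-pinot-controller-headless.cell-bzf7co-managed.svc.cluster.local_9000",
            "nation_dm2_0_output_4341_csv_FileIngestionTask_1732618352252_3715"
        };
        for (String name : names) {
            System.out.println(name.length() + " chars -> ~" + estimateBytes(name) + " bytes");
        }
    }
}
```

Under these assumptions, every distinct copy of a name pays this cost again, so the total grows with both the name length and the number of ZNRecord maps that hold their own copy.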

Long names also affect performance. Here is an example with a relatively short table name:

curl -s -S -n -H "Authorization: Bearer $SCALETEST_TOKEN" "https://$CONTROLLER_HOST:$CONTROLLER_PORT/segments/nation_OFFLINE" -o /dev/null -w "%{time_total},%{size_download},%{speed_download}\n" >> stats.log

❯ cat stats.log
8.461959,4698905,555297
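
The three columns in `stats.log` follow the `-w` format string in the curl command: total time in seconds, bytes downloaded, and average download speed in bytes per second. A minimal sketch for reading such a line and cross-checking the reported speed against bytes divided by seconds is shown below; the class name `CurlStats` is hypothetical and only illustrates how to interpret the measurement.

```java
// Hypothetical helper for one line of stats.log written by
// curl -w "%{time_total},%{size_download},%{speed_download}\n".
final class CurlStats {
    final double seconds;                 // %{time_total}
    final long bytes;                     // %{size_download}
    final double reportedBytesPerSecond;  // %{speed_download}

    CurlStats(double seconds, long bytes, double reportedBytesPerSecond) {
        this.seconds = seconds;
        this.bytes = bytes;
        this.reportedBytesPerSecond = reportedBytesPerSecond;
    }

    // Parse a comma-separated line with exactly three fields.
    static CurlStats parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected 3 fields, got " + fields.length);
        }
        return new CurlStats(
            Double.parseDouble(fields[0]),
            Long.parseLong(fields[1]),
            Double.parseDouble(fields[2]));
    }

    // Throughput recomputed from the first two columns.
    double computedBytesPerSecond() {
        return bytes / seconds;
    }

    public static void main(String[] args) {
        CurlStats s = parse("8.461959,4698905,555297");
        System.out.printf("%.1f MB in %.2f s, %.0f bytes/s computed, %.0f bytes/s reported%n",
            s.bytes / 1e6, s.seconds, s.computedBytesPerSecond(), s.reportedBytesPerSecond);
    }
}
```

For the line above, this reads as about 4.7 MB of segment names returned in about 8.5 seconds.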

Metadata

Labels: ingestion (Related to data ingestion pipeline), performance (Related to performance optimization)