Replace IntObjectHashMap with dense array for field related map to reduce heap usage #16201

Open
HUSTERGS wants to merge 1 commit into apache:main from HUSTERGS:opt/reduce_field_map

@HUSTERGS HUSTERGS commented Jun 4, 2026

Description

This change adds ReadOnlyDenseIntObjectMap, a compact read-only representation for IntObjectHashMap instances whose keys are non-negative dense integers. It stores values directly in an Object[] indexed by key, avoiding the separate int[] keys table used by IntObjectHashMap.

This is useful for codec metadata maps keyed by FieldInfo.number. These maps are built while reading segment metadata and then only queried afterwards.

The new representation is only selected through maybeWrap(...), and only when it removes meaningful table slack. By default, wrapping requires the dense array to need at least 30% fewer value slots than the hash table. If keys are sparse or negative, or if any value is null, the original IntObjectHashMap is kept.
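The selection rule can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: the class name `DenseFieldMapSketch`, the method `tryBuild`, and the use of `java.util.Map<Integer, V>` as a stand-in for IntObjectHashMap are all assumptions made so the example runs on the JDK alone.

```java
import java.util.Map;

/** Hypothetical sketch of a read-only, int-keyed map backed by a dense Object[]. */
final class DenseFieldMapSketch<V> {
  private final Object[] values; // index == key; a null slot means "no entry"
  private final int size;

  private DenseFieldMapSketch(Object[] values, int size) {
    this.values = values;
    this.size = size;
  }

  /** Returns the value for key, or null when the key is out of range or absent. */
  @SuppressWarnings("unchecked")
  V get(int key) {
    return key >= 0 && key < values.length ? (V) values[key] : null;
  }

  int size() {
    return size;
  }

  /**
   * Builds the dense form, or returns null when the caller should keep the original map.
   *
   * @param source stand-in for the hash map being converted (assumed to have no null keys)
   * @param hashSlots number of value slots in the original hash table (for example 1025)
   * @param minSavingRatio required fraction of slots saved (0.30 mirrors the stated default)
   */
  static <V> DenseFieldMapSketch<V> tryBuild(
      Map<Integer, V> source, int hashSlots, double minSavingRatio) {
    int maxKey = -1;
    for (Map.Entry<Integer, V> e : source.entrySet()) {
      // Negative keys cannot index an array, and null values would be
      // indistinguishable from empty slots, so both disqualify the map.
      if (e.getKey() < 0 || e.getValue() == null) {
        return null;
      }
      maxKey = Math.max(maxKey, e.getKey());
    }
    long denseSlots = (long) maxKey + 1; // long avoids overflow for huge keys
    // Sparse keys: the dense array would not save enough slots.
    if (denseSlots > hashSlots * (1.0 - minSavingRatio)) {
      return null;
    }
    Object[] values = new Object[(int) denseSlots];
    for (Map.Entry<Integer, V> e : source.entrySet()) {
      values[e.getKey()] = e.getValue();
    }
    return new DenseFieldMapSketch<>(values, source.size());
  }
}
```

In this sketch, 400 entries keyed 0..399 against a 1025-slot table are converted, while adding a single entry at key 5000 makes `tryBuild` return null so the original map would be kept.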

Motivation

Several codec readers keep per-field metadata maps keyed by FieldInfo.number. After earlier changes moved these from field-name-keyed maps to IntObjectHashMap, the maps no longer retain field name strings as keys, but they still keep an open-addressed hash table with both an int[] keys table and an Object[] values table. I've seen other PRs try to reduce the heap usage of these maps, such as #13961, #13327, and #13368.

This patch is motivated by a large production cluster. A single node can have around 20k open segments, and each segment has 400+ fields. Most of these fields are keyword-like fields, so they are both indexed and have doc values.

For 400+ fields, IntObjectHashMap typically allocates a 1024-slot table, plus the extra slot used for key 0, so both arrays have 1025 entries. If field numbers are dense enough that maxFieldNumber + 1 is around half of the hash table size, the dense read-only representation replaces:

  • int[1025]
  • Object[1025]

with approximately:

  • Object[512]

Assuming compressed object pointers, this is roughly:

  • int[1025]: ~4.0 KB
  • Object[1025]: ~4.0 KB
  • Object[512]: ~2.0 KB

So the saving is about 6 KB per converted map, excluding the referenced values themselves.
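The arithmetic behind these figures can be reproduced with a short sketch. The class `FieldMapHeapEstimate` and its methods are hypothetical, and the constants are assumptions rather than measurements: a 16-byte array header, 4-byte compressed references, and 8-byte object alignment (typical for 64-bit HotSpot with compressed oops), plus a 0.75 load factor with power-of-two table sizes.

```java
/** Hypothetical back-of-the-envelope estimator for the arrays discussed above. */
final class FieldMapHeapEstimate {
  // Assumptions: 64-bit HotSpot with compressed oops and compressed class pointers.
  static final int ARRAY_HEADER_BYTES = 16;
  static final int REF_BYTES = 4;
  static final int OBJECT_ALIGNMENT = 8;

  /** Smallest power-of-two table whose load-factor threshold still fits the entries. */
  static int hashSlots(int entries, double loadFactor) {
    int slots = 4;
    while (entries > slots * loadFactor) {
      slots <<= 1;
    }
    return slots;
  }

  /** Shallow size of one array, rounded up to the object alignment. */
  static long arrayBytes(int length, int elementBytes) {
    long raw = ARRAY_HEADER_BYTES + (long) length * elementBytes;
    return (raw + OBJECT_ALIGNMENT - 1) / OBJECT_ALIGNMENT * OBJECT_ALIGNMENT;
  }

  /** int[] keys plus Object[] values, each with one extra slot for key 0. */
  static long hashMapArraysBytes(int slots) {
    return arrayBytes(slots + 1, Integer.BYTES) + arrayBytes(slots + 1, REF_BYTES);
  }

  /** A single Object[] indexed directly by field number. */
  static long denseArrayBytes(int maxFieldNumber) {
    return arrayBytes(maxFieldNumber + 1, REF_BYTES);
  }
}
```

Under these assumptions, 400 entries land in a 1024-slot table, the two hash arrays take 8,240 bytes together, and a dense array for maxFieldNumber = 511 takes 2,064 bytes, so the difference is 6,176 bytes per map, consistent with the roughly 6 KB quoted above.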

In this workload, the main maps affected per segment are:

  • Lucene103BlockTreeTermsReader.fieldMap for indexed fields
  • PerFieldDocValuesFormat.FieldsReader.fields for doc-values fields
  • the populated Lucene90DocValuesProducer metadata map for the dominant doc-values type
    (for example sorted/sorted-set keyword fields)

This gives a rough estimate of about:

3 maps * 6 KB * 20,000 segments ~= 360,000 KB, or around 350 MB of heap reduction on such a node.

The exact saving depends on field-number density and on how fields are distributed across doc-values types. If field numbers are denser, the saving can be slightly higher; if fields are split across multiple smaller maps or field numbers are sparse, maybeWrap keeps the original IntObjectHashMap.

Signed-off-by: gesong.samuel <gesong.samuel@bytedance.com>