
Ingesting with high-cardinality dimensions has low performance #8456

Description

@quenlang

Hello, @himanshug

I hit a severe performance problem when ingesting with high-cardinality dimensions. The uri_id and host_ip dimensions have cardinalities in the range of 2,000,000,000 to 5,000,000,000.

With the above two dimensions included, a single task consumed about 20,000 rows/s. When I increased the task count to 18, the datasource consumed only 200,000 rows/s in total, and even after increasing the task count to 36, the datasource's ingestion throughput stayed at 200,000 rows/s.
one of the tasks log:
index_kafka_APP_NETWORK_DATA_MIN_JSON1_f25e9db0956e624_bebmfiod.txt

Then I removed uri_id and host_ip from the dimension list: 18 tasks consumed 370,000 rows/s, which is roughly 20,000 × 18 = 360,000. When I increased the task count to 36, throughput reached 600,000 rows/s.
one of the tasks log:
index_kafka_APP_NETWORK_DATA_MIN_JSON1_361e588173127a7_peommncl.txt

I don't understand why increasing the task count does not improve throughput when the high-cardinality dimensions are present.

The Kafka topic has 36 partitions. If I partition the data by hash(all dimensions) across the topic partitions, so that rows which roll up together always land in the same partition and each task sees a smaller dimension cardinality, would that improve performance?
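To illustrate the idea (this is only a sketch of the producer-side partition-selection logic, not a Druid or Kafka client API; `partition_for` and the record fields are hypothetical), hashing the dimension values deterministically would route identical dimension combinations to the same partition, so rollup candidates end up in the same task:

```python
import hashlib

NUM_PARTITIONS = 36  # matches the topic's partition count

def partition_for(record: dict, dimensions: list) -> int:
    # Build a stable key from the dimension values so that rows which
    # would roll up together always land in the same Kafka partition.
    key = "|".join(str(record.get(d, "")) for d in dimensions)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 8 digest bytes as an unsigned integer,
    # then take it modulo the partition count.
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Hypothetical row; uri_id and host_ip are the high-cardinality dimensions.
row = {"uri_id": 123456789, "host_ip": "10.0.0.1", "appid": "demo"}
dims = ["uri_id", "host_ip", "appid"]
p = partition_for(row, dims)
```

The value returned by `partition_for` would then be passed as the explicit partition when producing each record.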

Is there any way to resolve this problem? Can you give me some advice?
