Description
Hello, @himanshug
I hit a severe performance problem when ingesting data with high-cardinality dimensions. The uri_id and host_ip dimensions each have a cardinality in the range of 2,000,000,000 to 5,000,000,000.
With the above two dimensions, a single task consumed about 20,000 rows/s. When I increased the task count to 18, the datasource as a whole consumed only about 200,000 rows/s, and even after I increased the task count to 36, the ingestion throughput stayed at about 200,000 rows/s.
One of the task logs:
index_kafka_APP_NETWORK_DATA_MIN_JSON1_f25e9db0956e624_bebmfiod.txt
Then I removed uri_id and host_ip from the dimension list: 18 tasks consumed about 370,000 rows/s, close to the expected 20,000 × 18 = 360,000 rows/s. When I increased the task count to 36, throughput reached about 600,000 rows/s.
One of the task logs:
index_kafka_APP_NETWORK_DATA_MIN_JSON1_361e588173127a7_peommncl.txt
I don't understand why increasing the task count does not improve throughput when the high-cardinality dimensions are included.
The Kafka topic has 36 partitions. If I partition the data across topic partitions by hash(all dimensions), so that each task only has to roll up a smaller slice of the dimension key space, would that improve performance?
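To illustrate the partitioning idea: if the producer keys each record by a hash of its dimension values, all rows that share the same dimension combination land on the same Kafka partition, so each ingestion task sees a disjoint slice of the key space and its in-memory index holds fewer distinct rows. A minimal sketch of the key-to-partition mapping, assuming the two dimension names from this report and a 36-partition topic (the `partition_for` helper is hypothetical, not a Druid or Kafka API):

```python
import hashlib

NUM_PARTITIONS = 36  # matches the topic's partition count in this report

def partition_for(row, dims=("uri_id", "host_ip")):
    """Map a row to a Kafka partition by hashing its dimension values.

    Rows with identical dimension values always map to the same
    partition, so rollup for a given dimension combination happens
    inside a single task rather than being spread across all of them.
    """
    key = "|".join(str(row[d]) for d in dims)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Two rows with the same dimension values land on the same partition,
# so they can be rolled up into one row by the task that owns it.
a = {"uri_id": 12345, "host_ip": "10.0.0.1", "bytes": 100}
b = {"uri_id": 12345, "host_ip": "10.0.0.1", "bytes": 250}
assert partition_for(a) == partition_for(b)
```

In a real producer this would be done by setting the record key (or a custom partitioner) so the broker-side placement matches this function.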
Is there any way to resolve this problem? Could you give me some advice?