Description
Hello, @himanshug
I hit a severe performance problem when ingesting data with high-cardinality dimensions. The uri_id and host_ip dimensions each have a cardinality in the range of 2,000,000,000 to 5,000,000,000.
With the above two dimensions, a single task consumed about 20,000 rows/s. When I increased the task count to 18, the datasource as a whole consumed only about 200,000 rows/s, and even after I increased the task count to 36, the ingestion throughput stayed at about 200,000 rows/s.
One of the task logs:
index_kafka_APP_NETWORK_DATA_MIN_JSON1_f25e9db0956e624_bebmfiod.txt
Then I removed uri_id and host_ip from the dimension list: 18 tasks consumed about 370,000 rows/s, close to the expected 20,000 × 18 = 360,000 rows/s. When I increased the task count to 36, throughput reached about 600,000 rows/s.
One of the task logs:
index_kafka_APP_NETWORK_DATA_MIN_JSON1_361e588173127a7_peommncl.txt
I don't understand why increasing the task count does not improve throughput when the high-cardinality dimensions are included.
The Kafka topic has 36 partitions. If I partition the data across topic partitions by hash(all dimensions), so that each task only has to roll up a smaller slice of the dimension key space, would that improve performance?
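To illustrate the partitioning idea: if the producer keys each record by a hash of its dimension values, all rows that share the same dimension combination land on the same Kafka partition, so each ingestion task sees a disjoint slice of the key space and its in-memory index holds fewer distinct rows. A minimal sketch of the key-to-partition mapping, assuming the two dimension names from this report and a 36-partition topic (the `partition_for` helper is hypothetical, not a Druid or Kafka API):

```python
import hashlib

NUM_PARTITIONS = 36  # matches the topic's partition count in this report

def partition_for(row, dims=("uri_id", "host_ip")):
    """Map a row to a Kafka partition by hashing its dimension values.

    Rows with identical dimension values always map to the same
    partition, so rollup for a given dimension combination happens
    inside a single task rather than being spread across all of them.
    """
    key = "|".join(str(row[d]) for d in dims)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Two rows with the same dimension values land on the same partition,
# so they can be rolled up into one row by the task that owns it.
a = {"uri_id": 12345, "host_ip": "10.0.0.1", "bytes": 100}
b = {"uri_id": 12345, "host_ip": "10.0.0.1", "bytes": 250}
assert partition_for(a) == partition_for(b)
```

In a real producer this would be done by setting the record key (or a custom partitioner) so the broker-side placement matches this function.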
Is there any way to resolve this problem? Could you give me some advice?