Data ingestion CPU efficiency improvements #13319

Pinot ingests data from Kafka with one consumer thread per Kafka partition, so ingestion scales up only by increasing the number of partitions on the Kafka topic. However, because ingestion is compute-heavy, a Kafka broker can usually sustain far more traffic per partition than Pinot can.
For example, on the same type of hardware, Kafka can handle over 8 MB/s per partition, while Pinot, when doing complex transformation and index building (e.g. SchemaConformingTransformer and text index), can only handle less than 2 MB/s per partition. As a result, Kafka partition expansion cannot always stay in sync with Pinot's system load.
In practice, we observe that on a Pinot server with tens of cores, only 20% of the cores are busy with ingestion while the others are relatively idle.
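
To make the mismatch concrete, here is a small arithmetic sketch. The 8 MB/s and 2 MB/s figures are the rough per-partition bounds quoted above; the 32 MB/s topic throughput, the class name `PartitionMath`, and the helper `partitionsNeeded` are hypothetical and only serve the illustration.

```java
public final class PartitionMath {
    private PartitionMath() {}

    /** Minimum number of partitions so that no partition exceeds the given per-partition limit. */
    public static int partitionsNeeded(double topicMBps, double perPartitionLimitMBps) {
        if (topicMBps < 0 || perPartitionLimitMBps <= 0) {
            throw new IllegalArgumentException("throughput must be >= 0 and limit must be > 0");
        }
        return (int) Math.ceil(topicMBps / perPartitionLimitMBps);
    }

    public static void main(String[] args) {
        double topicMBps = 32.0;      // hypothetical total topic throughput
        double kafkaLimitMBps = 8.0;  // rough Kafka per-partition figure from the text
        double pinotLimitMBps = 2.0;  // rough Pinot per-partition figure from the text

        // Kafka is comfortable with 4 partitions, while Pinot needs 16 to keep up.
        System.out.println("Kafka needs " + partitionsNeeded(topicMBps, kafkaLimitMBps) + " partitions");
        System.out.println("Pinot needs " + partitionsNeeded(topicMBps, pinotLimitMBps) + " partitions");
    }
}
```

Under these assumed numbers the topic would need four times as many partitions for Pinot's sake as Kafka itself requires, which is the gap that per-partition parallelism is meant to close.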

Hence, there is a need to improve computation efficiency and to parallelize (at least part of) the message processing within a single partition.
![image](https://github.com/apache/pinot/assets/106711887/a6e6390b-ddfc-48f6-97b7-0959dad88bfc)
The attached picture shows a few components that can be improved:

- gzip compression -> switch to zstd with an appropriate compression level
- transformers -> use batch and parallel processing; some other OSS projects, such as [uForwarder](https://github.com/uber/uForwarder/blob/main/uforwarder-core/src/main/java/com/uber/data/kafka/datatransfer/worker/fetchers/kafka/AbstractKafkaFetcherThread.java#L237), already do batch message processing
- indexing -> TBD
- Kafka polling -> batch polling
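
As a rough illustration of the "batch and parallel" idea for transformers, the sketch below fans a polled batch from one partition out to a thread pool and collects the results in the original order, so per-partition ordering is kept for the indexing step. This is not Pinot's actual code: `BatchTransformSketch`, `transformBatch`, and the use of `String::toUpperCase` as a stand-in transformer are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public final class BatchTransformSketch {
    private BatchTransformSketch() {}

    /**
     * Applies {@code transform} to every message of a batch on the given pool.
     * Results are returned in the same order as the input batch.
     */
    public static <I, O> List<O> transformBatch(List<I> batch, Function<I, O> transform, ExecutorService pool)
            throws InterruptedException, ExecutionException {
        List<Future<O>> futures = new ArrayList<>(batch.size());
        for (I message : batch) {
            Callable<O> task = () -> transform.apply(message);
            futures.add(pool.submit(task));
        }
        // Collect in submission order: this preserves the partition's message order.
        List<O> results = new ArrayList<>(batch.size());
        for (Future<O> future : futures) {
            results.add(future.get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<String> polled = List.of("a", "b", "c"); // stand-in for one polled batch
            List<String> rows = transformBatch(polled, String::toUpperCase, pool);
            System.out.println(rows); // prints [A, B, C]
        } finally {
            pool.shutdown();
        }
    }
}
```

Submitting one task per message keeps the sketch short; with small messages it would likely be cheaper to split the batch into a few chunks per worker. The approach also assumes the transformation is stateless and thread-safe, which would need to be checked for each transformer.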
