Facilitate asynchronous realtime ingestion on decoding & transformation #13695
lnbest0707 wants to merge 2 commits into apache:master
Conversation
Codecov Report

Attention: Patch coverage is

```
@@             Coverage Diff              @@
##             master    #13695     +/-   ##
============================================
+ Coverage     61.75%    61.99%    +0.24%
+ Complexity      207       198        -9
============================================
  Files          2436      2555      +119
  Lines        133233    140750     +7517
  Branches      20636     21891     +1255
============================================
+ Hits          82274     87257     +4983
- Misses        44911     46850     +1939
- Partials       6048      6643      +595
```
Jackie-Jiang left a comment
How do you guarantee the message order when message batches are processed asynchronously?
```java
private AtomicInteger _numRowsConsumed = new AtomicInteger(0);
// Can be different from _numRowsConsumed when metrics update is enabled.
private AtomicInteger _numRowsIndexed = new AtomicInteger(0);
private AtomicInteger _numRowsErrored = new AtomicInteger(0);
```
```java
BlockingQueue<Pair<List<GenericRow>, Integer>> transformedQueue = new LinkedBlockingQueue<>();
AtomicInteger submittedMsgCount = new AtomicInteger(0);
// TODO: tune the number of threads
ExecutorService decodeAndTransformExecutor = Executors.newFixedThreadPool(1);
```
Starting a new executor per message batch can create big overhead. Consider creating an executor to be shared across different batches.
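A minimal sketch of the suggestion (class, field, and method names here are illustrative, not the PR's code): create one single-threaded executor when the partition consumer starts, reuse it for every batch, and shut it down on close, instead of paying thread-pool setup/teardown cost per batch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: share one executor across all message batches.
// A single thread also keeps batches in submission (offset) order.
public class SharedDecodeExecutorSketch {
  // Created once per partition consumer, not once per batch.
  private final ExecutorService _decodeAndTransformExecutor = Executors.newSingleThreadExecutor();
  private final List<Integer> _processed = Collections.synchronizedList(new ArrayList<>());

  public void submitBatch(List<Integer> batch) {
    // Stand-in for "decode and transform this batch" work.
    _decodeAndTransformExecutor.submit(() -> _processed.addAll(batch));
  }

  public List<Integer> closeAndGet() {
    _decodeAndTransformExecutor.shutdown();
    try {
      _decodeAndTransformExecutor.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return _processed;
  }

  public static void main(String[] args) {
    SharedDecodeExecutorSketch sketch = new SharedDecodeExecutorSketch();
    sketch.submitBatch(List.of(1, 2));
    sketch.submitBatch(List.of(3, 4));
    System.out.println(sketch.closeAndGet()); // [1, 2, 3, 4]
  }
}
```

Because the executor has exactly one worker thread, batches are processed in the order they were submitted, which matches the offset order of the consuming thread.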
```java
  }
});
indexingThread.start();
```
Who is executing this thread?
The consumer thread for the partition is the one kicking off this "indexingThread". I don't understand why we kick off a separate thread and then, on the next line, wait for it to finish. What's the difference from not spinning off a new thread and using the main (consuming) thread to do the indexing?
cc: @sajjad-moradi
Is it useful to create a sub-class of RealtimeSegmentDataManager that consumes asynchronously?
To ensure the ingestion order, we might be able to use a producer-consumer pattern, where the consumer thread creates
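The producer-consumer shape hinted at above can be sketched as follows (a minimal illustration under my own assumptions, not the PR's code): a single producer thread pushes "transformed" batches onto a FIFO `BlockingQueue` in offset order, and the indexing thread drains them in the same order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal producer-consumer illustration: one producer enqueues batches in
// offset order, one consumer "indexes" them. Single producer + FIFO queue
// means indexing order equals offset order.
public class ProducerConsumerSketch {
  private static final List<Integer> END = new ArrayList<>(); // poison pill

  static List<Integer> run(List<List<Integer>> batches) {
    BlockingQueue<List<Integer>> transformedQueue = new LinkedBlockingQueue<>();
    List<Integer> indexed = new ArrayList<>();

    Thread producer = new Thread(() -> {
      for (List<Integer> batch : batches) {
        transformedQueue.add(batch); // decode/transform happens here, in order
      }
      transformedQueue.add(END); // signal "no more batches"
    });

    Thread consumer = new Thread(() -> {
      try {
        List<Integer> batch;
        while ((batch = transformedQueue.take()) != END) {
          indexed.addAll(batch); // "index" rows in arrival order
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    producer.start();
    consumer.start();
    try {
      producer.join();
      consumer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return indexed;
  }

  public static void main(String[] args) {
    System.out.println(run(List.of(List.of(1, 2), List.of(3), List.of(4, 5))));
    // [1, 2, 3, 4, 5]
  }
}
```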
How would ingestion rate limiting work?
Thanks for the review. Right now it only uses one thread, to guarantee the order. For multiple threads, I am thinking about adapting the approach mentioned in Multi-topic ingestion support if it goes through review.
```java
BlockingQueue<Pair<List<GenericRow>, Integer>> transformedQueue = new LinkedBlockingQueue<>();
AtomicInteger submittedMsgCount = new AtomicInteger(0);
// TODO: tune the number of threads
```
We can't have more than one thread here; otherwise the order of indexed rows will be different, and that's something we can't tolerate.
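As a hypothetical aside (not what this PR does): transforms can run on multiple worker threads while indexing order still matches offset order, provided the indexing side consumes the `Future`s in the order the work was submitted. All names below are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: parallel "transform" with ordered draining of Futures,
// so the result order is deterministic regardless of thread scheduling.
public class OrderedFuturesSketch {
  static List<Integer> transformAndIndex(List<Integer> rows, int numThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    List<Future<Integer>> futures = new ArrayList<>();
    for (int row : rows) {
      futures.add(pool.submit(() -> row * 10)); // "transform" in parallel
    }
    List<Integer> indexed = new ArrayList<>();
    try {
      for (Future<Integer> f : futures) {
        indexed.add(f.get()); // drain in submission order => order preserved
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    }
    pool.shutdown();
    return indexed;
  }

  public static void main(String[] args) {
    System.out.println(transformAndIndex(List.of(1, 2, 3, 4), 3));
    // [10, 20, 30, 40] regardless of which thread ran which row
  }
}
```

The trade-off is head-of-line blocking: a slow batch delays indexing of all batches submitted after it, even if they finished earlier.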
Labels: ingestion, enhancement, feature request

Resolving issues mentioned in #13319.
Pinot ingestion currently uses strictly serial processing. It fetches a batch of messages from Kafka and then processes the messages in the batch one by one, in offset order, to decode, transform, and index them.
This provides the benefit of reusing objects created along the way for better memory efficiency, but it is not able to utilize all system resources. There are multiple solutions, each with pros and cons:
Comparing the CPU usage and consumption speed on the same server before and after (at 10:00) enabling "ASYNCHRONOUS":


The additional CPU usage contributes a ~10% ingestion speed increase.
Notes:
The new mode performs better on computation-heavy ingestion but cannot really help lightweight use cases. In light-computation use cases, the extra memory and GC overhead would offset the gains from async processing.