
Add InputStats to track bytes processed by a task#13520

Merged
kfaraz merged 17 commits into apache:master from kfaraz:add_processed_bytes
Dec 13, 2022

Conversation


@kfaraz kfaraz commented Dec 7, 2022

This is based on #12750 and #10407

Description

  • Track bytes processed by a task and expose in task reports along with row stats
  • Supported for classic batch and streaming ingestion
  • Not supported for MSQ ingestion in this PR
  • Not supported for FirehoseToInputSourceReaderAdaptor
  • processedBytes measures the uncompressed input bytes. e.g. for the sample "wikipedia" datasource, it measures the total size of the unzipped wikipedia.json file (and not of wikipedia.json.gz)
  • Records counted as unparseable, thrownAway or processedWithError also count towards processedBytes

Implementation

  • Add class InputStats to track processed bytes
  • Add method InputSourceReader.read(InputStats) to read input rows while counting bytes.

Since we need to count the bytes, a simple wrapper around InputSourceReader or InputEntityReader (the way CountableInputSourceReader wraps it) would not suffice: InputSourceReader deals only in InputRows, by which point the byte information has already been lost.

  • Classic batch: Use the new InputSourceReader.read(inputStats) in AbstractBatchIndexTask
  • Streaming: Increment processedBytes in StreamChunkParser. This does not use the new InputSourceReader.read(inputStats) method.
  • Extend InputStats with RowIngestionMeters so that bytes can be exposed in task reports
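The counting hook described in the bullets above can be sketched in isolation. The following is a simplified, hypothetical stand-in (the names mirror the PR, but the real Druid code differs): bytes are accumulated as each raw record is handed out, while its size is still known and before parsing discards it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

public class ByteCountingReaderSketch {
    // Simplified stand-in for the InputStats accumulator added in this PR.
    static class InputStats {
        private long processedBytes;

        void incrementProcessedBytes(long delta) {
            processedBytes += delta;
        }

        long getProcessedBytes() {
            return processedBytes;
        }
    }

    // Hypothetical read(InputStats): counts the uncompressed size of each
    // raw record while it is still available, before parsing into rows.
    static Iterator<String> read(List<String> rawRecords, InputStats stats) {
        Iterator<String> delegate = rawRecords.iterator();
        return new Iterator<String>() {
            @Override public boolean hasNext() { return delegate.hasNext(); }
            @Override public String next() {
                String record = delegate.next();
                stats.incrementProcessedBytes(record.getBytes(StandardCharsets.UTF_8).length);
                return record;
            }
        };
    }

    public static void main(String[] args) {
        InputStats stats = new InputStats();
        Iterator<String> rows = read(List.of("abc", "defgh"), stats);
        while (rows.hasNext()) {
            rows.next();
        }
        System.out.println(stats.getProcessedBytes()); // prints 8 (3 + 5 bytes)
    }
}
```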

Tests and refactors

  • Update tests to verify the value of processedBytes
    • HdfsInputSourceTest
    • S3InputSourceTest
    • GCSInputSourceTest
    • OssInputSourceTest
    • SqlInputSourceTest
    • DruidSegmentReaderTest
    • KinesisIndexTaskTest
    • KafkaIndexTaskTest
  • Rename MutableRowIngestionMeters to SimpleRowIngestionMeters and remove duplicate class
  • Replace CacheTestSegmentCacheManager with NoopSegmentCacheManager
  • Refactor KafkaIndexTaskTest and KinesisIndexTaskTest

Pending items

  • docs

Further work

  • Support for MSQ ingestion

Release note

Track bytes processed by a task and publish them in the task report along with the row stats.

Sample row stats in a task report:

"rowStats": {
    "buildSegments": {
        "processed": 24433,
        "processedBytes": 11956576,
        "processedWithError": 0,
        "thrownAway": 0,
        "unparseable": 0
    }
}
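To illustrate how the new counter relates to the existing row counters, here is a small sketch that aggregates the sample "buildSegments" numbers above. The derived figures (total record count, average uncompressed bytes per record) are our own illustration, not part of the task report.

```java
import java.util.Locale;
import java.util.Map;

public class RowStatsSummary {
    public static void main(String[] args) {
        // Values copied from the sample task report above.
        Map<String, Long> buildSegments = Map.of(
            "processed", 24433L,
            "processedBytes", 11956576L,
            "processedWithError", 0L,
            "thrownAway", 0L,
            "unparseable", 0L
        );

        // processedBytes covers all records, including unparseable,
        // thrownAway and processedWithError ones, so sum every counter.
        long totalRecords = buildSegments.get("processed")
            + buildSegments.get("processedWithError")
            + buildSegments.get("thrownAway")
            + buildSegments.get("unparseable");
        double avgBytesPerRecord =
            (double) buildSegments.get("processedBytes") / totalRecords;

        System.out.println(String.format(
            Locale.ROOT, "records=%d avgBytes=%.1f", totalRecords, avgBytesPerRecord));
    }
}
```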

Screenshots

Batch ingestion
(screenshot)

Streaming ingestion
(screenshot)


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Comment on lines 27 to 35

default void incrementProcessedBytes(long incrementByValue)
{
}

default long getProcessedBytes()
{
  return 0;
}
Contributor:
why are default methods needed here?

kfaraz (author):

RowIngestionMeters is marked as an extension point, which is where this is used. I can move the default impls there.

kfaraz (author):

I am not sure if anyone uses their own impl of RowIngestionMeters. I would have preferred not having these default impls altogether.

kfaraz (author):

Moved to RowIngestionMeters for now.
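The backward-compatibility concern behind those default implementations can be demonstrated with a stripped-down sketch. The interface below only mimics Druid's RowIngestionMeters extension point; the point is that an implementation written before this PR keeps compiling because the new byte methods default to no-ops.

```java
public class DefaultMethodSketch {
    // Stripped-down mimic of the RowIngestionMeters extension point,
    // showing only the byte-tracking additions from this PR.
    interface RowIngestionMeters {
        void incrementProcessed();
        long getProcessed();

        // New methods default to no-ops so implementations written
        // before this PR still compile unchanged.
        default void incrementProcessedBytes(long incrementByValue) {}
        default long getProcessedBytes() { return 0; }
    }

    // A "legacy" third-party implementation that predates the new methods.
    static class LegacyMeters implements RowIngestionMeters {
        private long processed;
        @Override public void incrementProcessed() { processed++; }
        @Override public long getProcessed() { return processed; }
    }

    public static void main(String[] args) {
        RowIngestionMeters meters = new LegacyMeters();
        meters.incrementProcessed();
        meters.incrementProcessedBytes(128); // silently a no-op here
        System.out.println(meters.getProcessed() + " " + meters.getProcessedBytes());
    }
}
```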

public interface InputSourceReader
{
-  CloseableIterator<InputRow> read() throws IOException;
+  default CloseableIterator<InputRow> read() throws IOException
kfaraz (author):

This method can now be removed as it is only used in tests.


  @Override
- public CloseableIterator<InputRow> read() throws IOException
+ public CloseableIterator<InputRow> read(InputStats inputStats) throws IOException
kfaraz (author):

Maybe throw UnsupportedOpException here?

@kfaraz kfaraz removed the WIP label Dec 12, 2022
@AmatyaAvadhanula (Contributor) left a comment:

Thank you @kfaraz! LGTM, +1 after builds pass


kfaraz commented Dec 13, 2022

Thanks a lot for the review, @AmatyaAvadhanula !
Thanks for doing the initial work on this, @somu-imply, @pjain1!

@kfaraz kfaraz merged commit 58a3acc into apache:master Dec 13, 2022
kfaraz added a commit that referenced this pull request Dec 15, 2022
Follow up to #13520

Bytes processed are currently tracked for intermediate stages in MSQ ingestion.
This patch adds the capability to track the bytes processed by an MSQ controller
task while reading from an external input source or a segment source.

Changes:
- Track `processedBytes` for every `InputSource` read in `ExternalInputSliceReader`
- Update `ChannelCounters` with the above obtained `processedBytes` when incrementing the input file count.
- Update task report structure in docs

The total input processed bytes can be obtained by summing the `processedBytes` as follows:

totalBytes = 0
for every root stage (i.e. a stage which does not have another stage as an input):
    for every worker in that stage:
        for every input channel: (i.e. channels with prefix "input", e.g. "input0", "input1", etc.)
            totalBytes += processedBytes
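That pseudocode could be realized as follows. The nested-map layout standing in for the per-stage counters is an assumed shape for illustration, not the actual MSQ report schema.

```java
import java.util.List;
import java.util.Map;

public class MsqInputBytesSketch {
    public static void main(String[] args) {
        // Assumed shape: root stage -> workers -> (channel name -> processedBytes).
        // Only channels whose name starts with "input" count, per the note above.
        Map<String, List<Map<String, Long>>> rootStages = Map.of(
            "stage0", List.of(
                Map.of("input0", 1000L, "input1", 2500L, "output", 900L),
                Map.of("input0", 4000L)
            )
        );

        long totalBytes = 0;
        for (List<Map<String, Long>> workers : rootStages.values()) {
            for (Map<String, Long> channels : workers) {
                for (Map.Entry<String, Long> e : channels.entrySet()) {
                    if (e.getKey().startsWith("input")) {
                        totalBytes += e.getValue();
                    }
                }
            }
        }
        System.out.println(totalBytes); // 1000 + 2500 + 4000 = 7500
    }
}
```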
@kfaraz kfaraz mentioned this pull request Jan 14, 2023
9 tasks
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
@kfaraz kfaraz mentioned this pull request May 5, 2023
5 tasks