Emit metrics for S3UploadThreadPool #16616

Merged: 12 commits into apache:master on Jun 21, 2024

Conversation

@Akshat-Jain (Contributor) commented on Jun 17, 2024:

Description

This pull request introduces the functionality to emit metrics for the S3UploadThreadPool.

This aims to provide better visibility into the behavior of the thread pool and the tasks submitted to it.

Test plan

I verified that the following metrics are emitted:

2024-06-19T14:59:35,652 INFO [S3UploadThreadPool-2] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","metric":"s3/upload/part/queueSize","service":"druid/indexer","host":"localhost:8091","version":"","value":0,"timestamp":"2024-06-19T14:59:35.652Z"}
2024-06-19T14:59:35,653 INFO [S3UploadThreadPool-2] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","uploadId":"R3JdipDCberDCyiL8GTDOriBR.16E1E3l9Y5of2PJss9DMTIm.V7KBh8eu39cgmBegXlH0X_mGzS9iNEVOwRUE1VxasSkmMVa_hEYI.IGVRPNURpm5e14R.D7VptE2Ds","metric":"s3/upload/part/queuedTime","service":"druid/indexer","host":"localhost:8091","partNumber":1,"version":"","value":0,"timestamp":"2024-06-19T14:59:35.653Z"}
2024-06-19T14:59:36,297 INFO [S3UploadThreadPool-2] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","uploadId":"R3JdipDCberDCyiL8GTDOriBR.16E1E3l9Y5of2PJss9DMTIm.V7KBh8eu39cgmBegXlH0X_mGzS9iNEVOwRUE1VxasSkmMVa_hEYI.IGVRPNURpm5e14R.D7VptE2Ds","metric":"s3/upload/part/time","service":"druid/indexer","host":"localhost:8091","partNumber":1,"version":"","value":644,"timestamp":"2024-06-19T14:59:36.297Z"}

2024-06-19T14:59:36,651 INFO [processing-4] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","uploadId":"R3JdipDCberDCyiL8GTDOriBR.16E1E3l9Y5of2PJss9DMTIm.V7KBh8eu39cgmBegXlH0X_mGzS9iNEVOwRUE1VxasSkmMVa_hEYI.IGVRPNURpm5e14R.D7VptE2Ds","metric":"s3/upload/total/time","service":"druid/indexer","host":"localhost:8091","version":"","value":2992,"timestamp":"2024-06-19T14:59:36.651Z"}
2024-06-19T14:59:36,652 INFO [processing-4] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","uploadId":"R3JdipDCberDCyiL8GTDOriBR.16E1E3l9Y5of2PJss9DMTIm.V7KBh8eu39cgmBegXlH0X_mGzS9iNEVOwRUE1VxasSkmMVa_hEYI.IGVRPNURpm5e14R.D7VptE2Ds","metric":"s3/upload/total/bytes","service":"druid/indexer","host":"localhost:8091","version":"","value":2048,"timestamp":"2024-06-19T14:59:36.652Z"}

Release note

Five new metrics have been added to provide better visibility into the behavior of the thread pool used for uploading parts of a multi-part upload to S3 when durable storage is enabled:

  1. s3/upload/part/queuedTime: Milliseconds a task spends in the queue before it starts uploading a part to S3.
  2. s3/upload/part/queueSize: Number of tasks currently queued and waiting to upload a part to S3.
  3. s3/upload/part/time: Milliseconds taken by a task to upload a part to S3.
  4. s3/upload/total/time: Total milliseconds taken to upload all parts of a file to S3.
  5. s3/upload/total/bytes: Total number of bytes uploaded across all parts of a file to S3.
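
For illustration, a rough sketch of how one of these metrics might be emitted with Druid's ServiceMetricEvent builder. The uploadId and partNumber dimensions match the emitted events shown in the test plan; the helper method name and its parameters are hypothetical, not the exact code in this PR:

    import org.apache.druid.java.util.emitter.service.ServiceEmitter;
    import org.apache.druid.java.util.emitter.service.ServiceMetricEvent;

    // Hypothetical helper: emits the per-part upload time with the dimensions
    // seen in the log lines above.
    private static void emitPartUploadTime(
        ServiceEmitter emitter,
        String uploadId,
        int partNumber,
        long uploadMillis
    )
    {
      emitter.emit(
          ServiceMetricEvent.builder()
                            .setDimension("uploadId", uploadId)
                            .setDimension("partNumber", String.valueOf(partNumber))
                            .build("s3/upload/part/time", uploadMillis)
      );
    }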

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz (Contributor) left a comment:

This is not the right method of emitting metrics. We shouldn't need to write a wrapper around ExecutorService for every new set of metrics that needs to be emitted.

There are 2 ways of emitting metrics in Druid:

  • Directly from the relevant code
  • Using a Monitor, e.g. TaskCountStatsMonitor. Monitors are typically useful when we want the metrics to be emitted optionally and not always.
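
For readers unfamiliar with Druid Monitors, a rough, purely illustrative sketch of the shape such a class takes. The class name, metric name, and constructor are hypothetical and not part of this PR; the AbstractMonitor/doMonitor pattern is the one Monitors in Druid generally follow:

    import java.util.concurrent.atomic.AtomicInteger;
    import org.apache.druid.java.util.emitter.service.ServiceEmitter;
    import org.apache.druid.java.util.emitter.service.ServiceMetricEvent;
    import org.apache.druid.java.util.metrics.AbstractMonitor;

    // Hypothetical monitor that reports a queue-size gauge each time it is polled.
    public class ExampleQueueSizeMonitor extends AbstractMonitor
    {
      private final AtomicInteger queueSize;

      public ExampleQueueSizeMonitor(AtomicInteger queueSize)
      {
        this.queueSize = queueSize;
      }

      @Override
      public boolean doMonitor(ServiceEmitter emitter)
      {
        emitter.emit(ServiceMetricEvent.builder().build("example/queue/size", queueSize.get()));
        return true; // keep this monitor scheduled
      }
    }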

Changes required in this PR

  • Remove the wrapper WaitTimeMonitoringExecutorService.
  • Just emit metrics directly from the S3UploadManager. If you face difficulties in emitting some metrics here, we can discuss that.
  • Use constant Strings for the metric names so that they are easy to identify.
  • Add documentation for the new metrics in metrics.md.
  • Add a heading "Release Note" in the PR description which describes the new metrics.

@Akshat-Jain (Contributor, Author) commented:

@kfaraz

> If you face difficulties in emitting some metrics here, we can discuss that.

I needed a way to get the time spent by a task in the queue before execution. Can you suggest how to achieve that without adding a wrapper?

@kfaraz (Contributor) commented on Jun 18, 2024:

> I needed a way to get the time spent by a task in the queue before execution.

Yes, I suspected this was the motivation 🙂.

> Can you suggest how to achieve that without adding a wrapper?

Do the same thing, just not in the wrapper, i.e. in S3UploadManager.queueChunkForUpload(), start a Stopwatch and inside the submitted lambda, determine the queued duration just before calling RetryUtils.retry().

To determine queue size, you can keep an AtomicInteger which is incremented in the first line of queueChunkForUpload() and decremented inside the submitted lambda.

Hope that helps!
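
A minimal sketch of the above suggestion, assuming Guava's Stopwatch (Druid also has its own Stopwatch utility). The class, executor, counter, and emitMetric helper names are placeholders for illustration, not the PR's actual code:

    import com.google.common.base.Stopwatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative sketch only; names and signatures are placeholders.
    public class S3UploadSketch
    {
      private final ExecutorService uploadExecutor = Executors.newFixedThreadPool(4); // placeholder pool
      private final AtomicInteger queuedChunkCount = new AtomicInteger(0);

      public Future<?> queueChunkForUpload(/* upload arguments elided */)
      {
        queuedChunkCount.incrementAndGet();                     // backs s3/upload/part/queueSize
        final Stopwatch queuedTime = Stopwatch.createStarted(); // started when the chunk is queued

        return uploadExecutor.submit(() -> {
          queuedChunkCount.decrementAndGet();
          // Time the task spent waiting in the queue before execution.
          emitMetric("s3/upload/part/queuedTime", queuedTime.elapsed(TimeUnit.MILLISECONDS));
          emitMetric("s3/upload/part/queueSize", queuedChunkCount.get());

          final Stopwatch uploadTime = Stopwatch.createStarted();
          // The actual part upload, wrapped in RetryUtils.retry(...) as in the existing code, would go here.
          emitMetric("s3/upload/part/time", uploadTime.elapsed(TimeUnit.MILLISECONDS));
        });
      }

      // Hypothetical helper; the real code would emit through Druid's ServiceEmitter.
      private void emitMetric(String metric, long value)
      {
      }
    }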

@Akshat-Jain (Contributor, Author) commented:

@kfaraz

> Do the same thing, just not in the wrapper, i.e. in S3UploadManager.queueChunkForUpload(), start a Stopwatch and inside the submitted lambda, determine the queued duration just before calling RetryUtils.retry().

Makes sense, let me try it out! Thanks!

> To determine queue size, you can keep an AtomicInteger which is incremented in the first line of queueChunkForUpload() and decremented inside the submitted lambda.

Why not use executor.getQueue().size(), as I was doing in WaitTimeMonitoringExecutorService#emitMetrics?

@kfaraz (Contributor) commented on Jun 18, 2024:

> Why not use executor.getQueue().size(), as I was doing in WaitTimeMonitoringExecutorService#emitMetrics?

To do this, you would need to update the interface to return a ThreadPoolExecutor instead of an ExecutorService (which I am not sure I really like 😛), or cast the returned ExecutorService to a ThreadPoolExecutor anyway (which would probably be worse).

Compared to this, maintaining the queue count in an integer is an easy workaround and not unclean either.

@Akshat-Jain (Contributor, Author) commented:

@kfaraz Fair enough, making the change! Thanks for the quick guidance on this!

@Akshat-Jain (Contributor, Author) commented:

@kfaraz I have made the suggested change and updated the PR description accordingly. Can you please take another look? Thanks!

@Akshat-Jain Akshat-Jain requested a review from kfaraz June 18, 2024 04:04
@Akshat-Jain Akshat-Jain requested a review from kfaraz June 18, 2024 05:02
@Akshat-Jain Akshat-Jain requested a review from kfaraz June 19, 2024 15:05
@kfaraz (Contributor) left a comment:

Final suggestions; the rest looks good.

@kfaraz (Contributor) commented on Jun 20, 2024:

@Akshat-Jain, some ITs have failed. The failures seem genuine but unrelated.

@findingrish, the failures are for centralized schema ITs. Have you seen these before?
https://github.com/apache/druid/actions/runs/9585328311/job/26433415630?pr=16616
https://github.com/apache/druid/actions/runs/9585328311/job/26433416324?pr=16616

@findingrish (Contributor) commented:

@kfaraz I am seeing the CDS test group failures for the first time. I see these failures on another PR as well: https://github.com/apache/druid/actions/runs/9567431225/job/26382234870?pr=16619

I don't see them failing for any of the merged PRs.

I will look into the failures.

@cryptoe (Contributor) commented on Jun 21, 2024:

The task failures are unrelated. Going ahead with merge.

@cryptoe merged commit cd438b1 into apache:master on Jun 21, 2024.
85 of 87 checks passed