Start working on custom metrics for Kafka #404
Conversation
src/metrics/mod.rs
Outdated
```rust
pub(crate) fn register(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<Counter>()?;
    m.add_class::<Gauge>()?;
    m.add_class::<Histogram>()?;
    Ok(())
}
```
Hmmm. This feels like committing to an uphill battle. Writing a Python metrics API could be an entire startup unto itself, and although yes, we are piggy-backing on the great work of OpenTelemetry, just maintaining the binding API seems not fun. A theme of my comments recently has been "re-use APIs". I do not think this is a reasonable long-term solution.
```python
    ["topic", "partition"],
)
# Labels to use when recording metrics
# TODO: It would be nice to have the step_id available here
```
I think I would update `build_part` to include a `step_id: str` argument which the input operator injects. If the partition itself wants to store it as an instance variable and use it, great; if not, it can ignore it. This would parallel the way we inject it into `UnaryLogic` by closing over it in the builder.
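A minimal sketch of that injection pattern; `_DemoSource` and `_DemoPartition` are illustrative stand-ins, not bytewax's real classes:

```python
# Hypothetical sketch: the input operator closes over its own step_id
# and injects it into build_part; the partition may store it or ignore it.

class _DemoPartition:
    def __init__(self, step_id: str, for_part: str):
        # Stored as an instance variable, e.g. for use as a metric label.
        self.step_id = step_id
        self.for_part = for_part


class _DemoSource:
    def build_part(self, step_id: str, for_part: str, resume_state=None):
        return _DemoPartition(step_id, for_part)


def _operator_build(step_id: str, source: _DemoSource):
    # The operator, not the user, supplies step_id, paralleling how it
    # is closed over in the UnaryLogic builder.
    return source.build_part(step_id, "0", None)


part = _operator_build("kafka_in", _DemoSource())
```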
src/metrics/mod.rs
Outdated
```diff
@@ -46,3 +167,10 @@ pub(crate) fn initialize_metrics() -> PyResult<()> {
     global::set_meter_provider(provider);
     Ok(())
 }
+
+pub(crate) fn register(_py: Python, m: &PyModule) -> PyResult<()> {
```
Although it might be more intricate, would the API surface of writing the "collector bridge" require us to shoulder the same problem of maintaining a binding API? Or would it be generic enough we could write it once?
My guess is it'd be a little better, but might not be the lowest cost.
src/metrics/mod.rs
Outdated
```diff
@@ -46,3 +167,10 @@ pub(crate) fn initialize_metrics() -> PyResult<()> {
     global::set_meter_provider(provider);
     Ok(())
 }
+
+pub(crate) fn register(_py: Python, m: &PyModule) -> PyResult<()> {
```
Another option? Could the Python side metrics be entirely separate for now while it's difficult to bridge? Is there a big penalty to setting up two metrics endpoints that external collectors have to hit on different ports?
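A rough sketch of that "entirely separate" option, assuming the `prometheus-client` package; the port number and metric names are arbitrary choices for illustration:

```python
# Sketch: a second, Python-only metrics endpoint served on its own port,
# which an external collector would scrape alongside the Rust /metrics.
from prometheus_client import Counter, start_http_server

PY_EVENTS = Counter("py_events", "Python-side events", ["step_id"])

# Second scrape target, separate from the Rust-side endpoint.
start_http_server(9093)

PY_EVENTS.labels(step_id="kafka_in").inc()
```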
Sorry, I somehow missed this PR. I agree with David that the way it's set up right now means we'll need to update and maintain the bindings just to offer a small subset of what the underlying library offers, depending on what we think might be useful to us or to users. But being able to expose custom metrics in our own connectors is surely useful, and to do that we need to access the metrics backend from Python.

Doing a second, separate setup of the metrics system in Python did introduce some problems when parallelizing, as most of the libraries I've worked with tend to rely on a global object, but maybe that's less of a problem now that the multiprocess shenanigans are relegated to the testing runner, so it could be something to explore again. Or maybe we could try to expose a less specific API, something like a …
Force-pushed from 5f95edd to 9e9b8d0.
Force-pushed from 9e9b8d0 to b3c754b.
@davidselassie @Psykopear I reworked this PR to use the Python side exporter to generate metrics from the …
I like it, it keeps the nice user-facing API without us having to map it manually.

```python
class PeriodicPartition(StatelessSourcePartition):
    def __init__(self, step_id: str, worker_index: int, frequency: timedelta):
        self.frequency = frequency
        self._next_awake = datetime.now(timezone.utc)
        self._gauge = Gauge(
            f"next_batch_delay_{worker_index}",
            "Calculated delay of when next batch was called in seconds",
            ["step_id", "partition"],
            unit="seconds",
        )
        self._counter = 0
        self._metric_labels = {"step_id": step_id, "partition": "0"}
```

But it could be enough to warn users of this caveat.
Great catch! I need to think carefully about the right way to do this. I think that the …
I think this is good enough for now, but it'd be good to re-visit if there are issues in the future.
```rust
// Remove trailing newline
rust_metrics.pop();

format!("{rust_metrics}{py_metrics}")
```
The most brilliant hack of all: string concatenation.
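The Python side of that hack can be sketched with `prometheus-client`; `rust_metrics` here is a stand-in string for whatever the Rust registry's encoder emits:

```python
# Sketch: both registries already produce Prometheus text exposition
# format, so serving a combined /metrics body is plain concatenation.
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()
Counter("py_events", "Python-side events", registry=registry).inc()

rust_metrics = "rust_items_total 3\n"  # stand-in for the Rust encoder output
py_metrics = generate_latest(registry).decode()

combined = rust_metrics + py_metrics  # one exposition-format document
```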
Force-pushed from 14aacd8 to 4c3516c.
- Removes Rust side wrappers.
- Adds a Python dependency on prometheus-client.
- Cleans up some old examples.
The Prometheus client library's REGISTRY is a global. When using a Gauge for Kafka, registering the same Gauge twice will cause an error. This change makes the declaration of Gauges shared between multiple partitions global as well. Labels should be used to collect metrics for different partitions or workers.
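A short sketch of the failure mode and the fix, assuming the `prometheus-client` package; the metric name and label values are illustrative:

```python
# The default REGISTRY is process-global, so constructing the same Gauge
# twice (e.g. once per partition) raises a duplicate-registration error.
from prometheus_client import REGISTRY, Gauge

# Declared once, at module level, shared by all partitions.
KAFKA_LAG = Gauge("kafka_lag", "Consumer lag", ["step_id", "partition"])

try:
    Gauge("kafka_lag", "Consumer lag", ["step_id", "partition"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

# Distinguish partitions/workers with labels on the shared Gauge instead:
KAFKA_LAG.labels(step_id="kafka_in", partition="0").set(42)
KAFKA_LAG.labels(step_id="kafka_in", partition="1").set(7)
```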
CodSpeed Performance Report: merging #404 will create unknown performance changes.
Looks good to me 👍 I added a couple of comments, but nothing major.
pysrc/bytewax/connectors/demo.py
Outdated
```diff
@@ -92,7 +92,9 @@ def list_parts(self) -> List[str]:
         return [self._metric_name]

     @override
-    def build_part(self, for_part: str, resume_state: Optional[_RandomMetricState]):
+    def build_part(
+        self, _step_id: str, for_part: str, resume_state: Optional[_RandomMetricState]
```
When overriding methods we should probably keep the original name of the kwarg, so `step_id` instead of `_step_id`, even if it's not used, so that if we pass explicitly named arguments to the subclass it still works. Same for all the other overrides in this PR.
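A tiny sketch of why the parameter name matters (class and argument names here are illustrative):

```python
# Callers that pass arguments by name break against an override that
# renamed the parameter, even though positional calls still work.
class Base:
    def build_part(self, step_id, for_part):
        return step_id, for_part


class Renamed(Base):
    def build_part(self, _step_id, for_part):  # renamed parameter
        return _step_id, for_part


base_result = Base().build_part(step_id="in", for_part="0")

try:
    Renamed().build_part(step_id="in", for_part="0")
    keyword_call_ok = True
except TypeError:  # unexpected keyword argument 'step_id'
    keyword_call_ok = False
```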
```python
):
    self._offset = starting_offset if resume_state is None else resume_state
    print(f"starting offset: {starting_offset}")
```
Maybe log rather than print here?
Forgot to remove this. Thanks for pointing it out!
- Remove print of offsets
- Change unused `_step_id` back to `step_id` to fix overrides.
Overview
The premise of this PR is to provide the ability for our users to add custom instrumentation to dataflows that will be reported from our `/metrics` endpoint.

I've tried a few different approaches, and I didn't feel as though any of the ideas I've tried was a clear winner, so I'm opening this PR as a strawman to solicit some ideas on how to proceed.
I'll start with a list of general issues I encountered:
Use of the Python Prometheus library, or OpenTelemetry client
I looked for a bit into using the Python OpenTelemetry library to collect metrics on the Python side, and register them as a Collector to the Rust-side registry. I didn't end up prototyping this approach as it felt overly complicated.
Which Rust crates to use for metrics?
Our use of the `opentelemetry` crate's `metrics` feature was initially a little problematic, as that crate does not yet offer a synchronous Gauge type. The callback pattern of metrics collection that is used by the `confluent_kafka` Python library doesn't seem amenable to using an async Gauge.

The OpenTelemetry Rust crate also requires the use of both the `prometheus` crate and the `opentelemetry_prometheus` crate. The `opentelemetry_prometheus` crate depends on the `prometheus` crate to provide the registry, and the encoder that is used to generate the text format for metrics that are scraped by Prometheus.

You can see in this strawman PR that, since we are already depending on the `prometheus` crate, I used the `Gauge`, `Counter`, and `Histogram` implementations that are provided there. I also included the `prometheus` crate's `process` feature, which includes some useful metrics around process memory usage and thread counts.

A smaller issue: where should custom metrics get the `step_id` to use as a label for metrics?

In the `KafkaSource` example in this PR, there wasn't an obvious way to reference the `step_id` to use as a label for metrics without changing the signature of `FixedPartitionSource`. Since this approach will probably change, I didn't make any changes to the API of input sources.
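The callback-vs-async mismatch above can be sketched as follows; `SyncGauge` and the `"consumer_lag"` key are illustrative stand-ins, not real library types:

```python
# confluent_kafka reports librdkafka statistics by invoking a stats_cb
# you register (driven by the statistics.interval.ms setting) with a JSON
# string, so the natural instrument is a synchronous gauge that the
# callback sets, not an async instrument that polls for a value.
import json


class SyncGauge:
    """Stand-in for a synchronous gauge that can be set imperatively."""

    def __init__(self):
        self.value = None

    def set(self, v):
        self.value = v


CONSUMER_LAG = SyncGauge()


def stats_cb(stats_json: str):
    # confluent_kafka would call this periodically with the stats payload.
    stats = json.loads(stats_json)
    CONSUMER_LAG.set(stats.get("consumer_lag", 0))


stats_cb('{"consumer_lag": 5}')
```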