EMQX-10534 integrate otel log handler #11921
Conversation
force-pushed from 8f33482 to 24a327a
lists:flatmap(fun(M) -> M:tr_handlers(Conf) end, tr_modules()).

tr_modules() ->
    [emqx_otel_log_handler].
This probably should be better decoupled from the main emqx app, but I tried to keep it simple for now.
can we install the handler at runtime? i.e. in emqx_opentelemetry application
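Installing a logger handler at runtime is straightforward with OTP's logger API. A hedged sketch of what doing this from the emqx_opentelemetry application start callback could look like (the module and supervisor names here are illustrative, not the actual PR code):

```erlang
-module(emqx_otel_app_sketch).
-behaviour(application).
-export([start/2, stop/1]).

start(_Type, _Args) ->
    %% Handlers added at runtime only receive events logged after this
    %% point; earlier boot logs will not reach the otel collector.
    ok = logger:add_handler(otel_log_handler, emqx_otel_log_handler,
                            #{level => info}),
    %% emqx_otel_sup is a placeholder for the app's top supervisor
    emqx_otel_sup:start_link().

stop(_State) ->
    _ = logger:remove_handler(otel_log_handler),
    ok.
```

This is exactly the trade-off discussed below: the handler works, but any log event emitted before the application starts is lost.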
@zmstone, yes, but it would mean that before the emqx_opentelemetry app is started, logs won't be sent to the external otel collector/backend. Unless we start emqx_opentelemetry earlier than any other app.
BTW, the current otel_log_handler implementation doesn't allow starting a handler at an early stage anyway (when the kernel logger starts). I think we have two options for now:
- change otel_log_handler to start under kernel's logger_sup: start accepting log event writes to the table, but postpone the 'full init' until kernel is fully started.
  Advantages: the handler can work similarly to native log handlers, with no log event losses.
  Drawbacks: we have to rely on some kernel internals (logger_sup), and a more complicated log_handler design is required.
- keep otel_log_handler as is and accept the fact that it can't be initialized at the early boot stage, meaning some logs will always be missing (probably up to the point when the emqx_conf app initializes).
Note: my previous comment is not entirely correct. There was actually a deadlock when trying to start the ssl app (an opentelemetry_exporter dependency) in the adding_handler/1 logger callback. I'm fixing that now in the opentelemetry lib, so starting a handler under kernel's logger_sup shouldn't be required.
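The deadlock described here is a known hazard: logger callbacks such as adding_handler/1 run inside logger's own configuration path, so synchronously starting another application from within them can block forever during boot. A minimal sketch of the workaround pattern, deferring the dependency start out of the callback (module name and policy are illustrative, not the actual opentelemetry fix):

```erlang
-module(deferred_init_sketch).
-export([adding_handler/1, log/2]).

%% The callback itself only validates/accepts the config; starting
%% dependencies (e.g. ssl) is pushed to a separate process so the
%% callback never blocks on the application controller.
adding_handler(Config) ->
    _ = spawn(fun() ->
            {ok, _Started} = application:ensure_all_started(ssl)
        end),
    {ok, Config}.

%% Events arriving before the deferred init completes could be
%% buffered or dropped, depending on the chosen policy.
log(_LogEvent, _Config) ->
    ok.
```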
I don't think it's that critical to have the boot logs sent to otel?
Moved everything to the emqx_opentelemetry app.
force-pushed from 24a327a to 4b8e9ab
@@ -29,18 +29,18 @@ roots() -> ["opentelemetry"].

 fields("opentelemetry") ->
     [
-        {exporter,
+        {"exporter",
keep using atoms for field names
Q: why is it better to use atoms for these?
Changed back to atoms
Sorry for missing this reply. I think we should try to use atoms for field names everywhere, so we are one step closer to avoiding atom generation.
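For illustration, the two styles side by side (a schema fragment only; the ?HOCON/?R_REF/?DESC macros come from the hocon/emqx schema headers, and the field options shown are placeholders):

```erlang
fields("opentelemetry") ->
    [
        %% preferred: atom field name, so no atom has to be generated
        %% when the config is loaded
        {exporter, ?HOCON(?R_REF("exporter"), #{desc => ?DESC(exporter)})}
        %% reverted string form:
        %% {"exporter", ?HOCON(?R_REF("exporter"), #{desc => ?DESC(exporter)})}
    ].
```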
force-pushed from 4b8e9ab to f3d687c
NOTE:
exporter.label: "Exporter"

max_queue_size.desc:
"""The maximum queue size. After the size is reached, OpenTelemetry signals are dropped."""
Q: What is the strategy for dropping the signals? Does it drop the new signals until the queue is drained, or discards the queued data as well?
It discards new signals. There is no need to discard the queued data, since the overload-protection (olp) mechanism generally ensures that the 'queue' stops growing once the limit is reached. The 'queue' is actually not a real queue data structure but an ETS table.
We keep using 'max_queue_size' because it is a term used in the OpenTelemetry specification, and our ETS-based implementation doesn't violate the spec.
Each insert into the ETS table is done after decrementing and getting the value of an atomic counter. Once the counter is decremented to 0 (it starts at max_queue_size), new events are dropped and the processor (a gen_statem process) is notified to start the next export cycle.
The actual implementation is slightly more complex, since there are two ETS tables that are swapped periodically, so that when the processor starts exporting, the other table becomes available for inserts.
https://github.com/emqx/opentelemetry-erlang/blob/main/apps/opentelemetry/src/otel_batch_olp.erl#L178
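The counting scheme described above can be sketched with Erlang's atomics module (module and function names are illustrative; the real logic lives in the otel_batch_olp module linked above, which also handles the two-table swap):

```erlang
-module(bounded_queue_sketch).
-export([new/1, insert/3]).

%% Create the 'queue': an ETS table plus an atomic counter initialized
%% to MaxQueueSize, i.e. the number of remaining free slots.
new(MaxQueueSize) ->
    Counter = atomics:new(1, []),
    ok = atomics:put(Counter, 1, MaxQueueSize),
    Tab = ets:new(otel_events, [public, duplicate_bag]),
    {Tab, Counter}.

%% Decrement-then-check: a hit of exactly 0 means the last free slot
%% was just taken, so the processor is notified to start the next
%% export cycle; negative values mean the event is dropped.
insert({Tab, Counter}, NotifyProcessor, Event) ->
    case atomics:sub_get(Counter, 1, 1) of
        N when N > 0 ->
            true = ets:insert(Tab, {erlang:unique_integer(), Event}),
            ok;
        0 ->
            true = ets:insert(Tab, {erlang:unique_integer(), Event}),
            NotifyProcessor(),
            ok;
        _Negative ->
            dropped
    end.
```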
Thanks. Perhaps it's worth mentioning in the documentation?
?HOCON(
    ?R_REF(emqx_schema, "ssl_client_opts"),
    #{
        default => #{<<"enable">> => false},
it seems ssl enable/disable is derived from the endpoint url scheme.
maybe create a new type here and call emqx_schema:common_ssl_opts_schema (export it) for the fields, i.e. no need for the enable field.
Fixed slightly differently, following similar cases in other schemas, e.g.: https://github.com/emqx/emqx/blob/master/apps/emqx_ldap/src/emqx_ldap.erl#L100
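For context, deriving the TLS on/off switch from the endpoint's URL scheme, as suggested above, could look roughly like this (module and function names are hypothetical):

```erlang
-module(ssl_from_scheme_sketch).
-export([ssl_enabled/1]).

%% "https://otel-collector:4318" -> true
%% "http://otel-collector:4318"  -> false
ssl_enabled(Endpoint) when is_list(Endpoint) ->
    case uri_string:parse(Endpoint) of
        #{scheme := "https"} -> true;
        _ -> false
    end.
```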
force-pushed from f3d687c to d9f95cd
Closes: https://emqx.atlassian.net/browse/EMQX-10534
Documentation follow-up ticket: https://emqx.atlassian.net/browse/EMQX-11385
Tests follow-up ticket: https://emqx.atlassian.net/browse/EMQX-11432
Summary
🤖 Generated by Copilot at 0cfb424
This pull request adds support for external log handlers using the OpenTelemetry protocol in EMQX. It updates the dependencies, the configuration schema, and the application version in mix.exs and emqx_opentelemetry.app.src. It adds a new behaviour emqx_external_log_handler and a new module emqx_otel_log_handler that implements it. It modifies emqx_config_logger to enable the external log handlers. It also adds new schemas otel_log_handler and otel_exporter to define the configuration options for the OpenTelemetry log handler and exporter.

PR Checklist
Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:
- changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md files

Checklist for CI (.github/workflows) changes
- changes/ dir for user-facing artifacts update