
Opentelemetry goes berserker mode #2380

Closed
bonomat opened this issue Jul 6, 2022 · 17 comments · Fixed by #2381

Comments

@bonomat
Collaborator

bonomat commented Jul 6, 2022

I had the taker and maker running locally overnight. Now I get spammed with these errors:

OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full

Two questions:

  • What does this mean?
  • Can we use our tracing library to log those errors? Since they are not JSON, we won't be able to parse our logs as JSON anymore.
@bonomat
Collaborator Author

bonomat commented Jul 6, 2022

cc @klochowicz / @Restioson

@Restioson
Collaborator

Restioson commented Jul 7, 2022

This is interesting. How long do the logs continue, and do they ever stop? This error seems to mean that spans are piling up in the batch processor's queue faster than they can be exported. Maybe this is just a case of adding an exporter timeout or making the queue bigger, but then again maybe not...
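
For reference, a rough sketch of the "make the queue bigger" idea, assuming the opentelemetry SDK's `BatchConfig` (around 0.17); the numbers are only illustrative, not recommendations:

```rust
use std::time::Duration;
use opentelemetry::sdk::trace::BatchConfig;

// Illustrative only: enlarge the batch span processor's queue and flush it more
// often. The spec default queue size is 2048 spans; the "channel is full" error
// above is the processor refusing new spans once that queue is full.
fn batch_config() -> BatchConfig {
    BatchConfig::default()
        .with_max_queue_size(8_192)
        .with_scheduled_delay(Duration::from_secs(2))
        .with_max_export_batch_size(512)
}
```

How this config gets handed to the pipeline depends on the opentelemetry-otlp version in use, so the wiring is omitted here.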

@Restioson
Collaborator

Can we use our tracing library to log those errors? Since they are not JSON, we won't be able to parse our logs as JSON anymore.

We can change set_error_handler to make them print as JSON. I think that using tracing to log it could cause a cycle of error -> log error -> error from logging the error -> error -> ...
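
A minimal sketch of that idea, assuming `opentelemetry::global::set_error_handler` and an added `serde_json` dependency; the field names are made up to mimic a JSON log line, not our actual format:

```rust
use opentelemetry::global;

fn install_json_error_handler() {
    // Print OpenTelemetry's internal errors as one JSON object per line instead
    // of the default plain-text "OpenTelemetry trace error occurred. ..." output.
    let _ = global::set_error_handler(|error| {
        eprintln!(
            "{}",
            serde_json::json!({
                "level": "ERROR",
                "target": "opentelemetry",
                "message": error.to_string(),
            })
        );
    });
}
```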

@klochowicz
Contributor

klochowicz commented Jul 7, 2022

As for the error cycle - wouldn't this be outside of any tracing span? I thought mere usage of tracing for logging doesn't automatically export data to the collector?

@Restioson
Collaborator

wouldn't this be outside of any tracing span

Perhaps, but I don't know if we can rely on knowing exactly where handle_error is called from.

I thought mere usage of tracing for logging doesn't automatically export data to the collector?

I think it will, given that the collector is part of the tracing subscriber.

@klochowicz
Contributor

klochowicz commented Jul 7, 2022

Indeed, currently there's no way other than overriding the error handler: open-telemetry/opentelemetry-rust#549

@klochowicz
Contributor

wouldn't this be outside of any tracing span

Perhaps, but I don't know if we can rely on knowing exactly where handle_error is called from.

I thought mere usage of tracing for logging doesn't automatically export data to the collector?

I think it will, given that the collector is part of the tracing subscriber.

I have tested this hypothesis locally; even when the collector channel is closed (during the shutdown), tracing logs from such a custom error handler still work as expected. I'll throw a PR in a sec.
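
For the record, a rough sketch of what such a custom handler could look like (not necessarily what the PR does), again assuming `opentelemetry::global::set_error_handler` plus the `tracing` macros:

```rust
use opentelemetry::global;

fn install_tracing_error_handler() {
    // Forward OpenTelemetry's internal errors through `tracing`, so they come out
    // as JSON like the rest of our logs. Per the local test above, this still works
    // even when the collector channel is closed during shutdown.
    let _ = global::set_error_handler(|error| {
        tracing::error!(target: "opentelemetry", error = %error, "OpenTelemetry error");
    });
}
```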

@klochowicz klochowicz linked a pull request Jul 7, 2022 that will close this issue
@bors bors bot closed this as completed in #2381 Jul 7, 2022
@bonomat
Collaborator Author

bonomat commented Jul 19, 2022

I'm seeing these errors again on testnet.

@bonomat bonomat reopened this Jul 19, 2022
@bonomat
Collaborator Author

bonomat commented Jul 19, 2022

[screenshot attached]

@bonomat
Collaborator Author

bonomat commented Jul 19, 2022

The reason for this is that the agent is not able to keep up with the volume of traces and gets rate-limited:

ts=2022-07-19T00:50:58Z level=info caller=exporterhelper/queued_retry.go:218 msg="Exporting failed. Will retry the request after interval." component=traces traces_config=default kind=exporter name=otlp/0 error="rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (166666 bytes) exceeded while adding 2114523 bytes" interval=8.688087835s
ts=2022-07-19T00:51:02Z level=error caller=exporterhelper/queued_retry_inmemory.go:109 msg="Exporting failed. No more retries left. Dropping data." component=traces traces_config=default kind=exporter name=otlp/0 error="max elapsed time expired rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (166666 bytes) exceeded while adding 2084234 bytes" dropped_items=512
ts=2022-07-19T00:51:09Z level=info caller=exporterhelper/queued_retry.go:218 msg="Exporting failed. Will retry the request after interval." component=traces traces_config=default kind=exporter name=otlp/0 error="rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (166666 bytes) exceeded while adding 2081270 bytes" interval=14.829523867s
ts=2022-07-19T00:51:13Z level=error caller=exporterhelper/queued_retry_inmemory.go:109 msg="Exporting failed. No more retries left. Dropping data." component=traces traces_config=default kind=exporter name=otlp/0 error="max elapsed time expired rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (166666 bytes) exceeded while adding 2114808 bytes" dropped_items=512
ts=2022-07-19T00:51:20Z level=info caller=exporterhelper/queued_retry.go:218 msg="Exporting failed. Will retry the request after interval." component=traces traces_config=default kind=exporter name=otlp/0 error="rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (166666 bytes) exceeded while adding 2558012 bytes" interval=13.571884121s
ts=2022-07-19T00:51:24Z level=error caller=exporterhelper/queued_retry_inmemory.go:109 msg="Exporting failed. No more retries left. Dropping data." component=traces traces_config=default kind=exporter name=otlp/0 error="max elapsed time expired rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (166666 bytes) exceeded while adding 2081270 bytes" dropped_items=512
ts=2022-07-19T00:51:32Z level=info caller=exporterhelper/queued_retry.go:218 msg="Exporting failed. Will retry the request after interval." component=traces traces_config=default kind=exporter name=otlp/0 error="rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (166666 bytes) exceeded while adding 3624990 bytes" interval=16.302738013s
ts=2022-07-19T00:51:49Z level=error caller=exporterhelper/queued_retry_inmemory.go:109 msg="Exporting failed. No more retries left. Dropping data." component=traces traces_config=default kind=exporter name=otlp/0 error="max elapsed time expired rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (166666 bytes) exceeded while adding 3624990 bytes" dropped_items=506

@bonomat
Collaborator Author

bonomat commented Jul 19, 2022

Can we change opentelemetry to a polling mode that automatically drops un-polled spans after a few seconds?

@bonomat
Collaborator Author

bonomat commented Jul 19, 2022

Note: this is not a blocker for 0.5.0. If we can't solve this, we just won't enable instrumentation on mainnet.

@Restioson
Collaborator

This is partially a symptom of another problem: at some point the logs get really spammy and we get rate-limited... I wonder what the underlying cause of that is?

@Restioson
Collaborator

Restioson commented Jul 19, 2022

The docs of opentelemetry_otlp say that we can configure the timeout - maybe this could help?
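
For context, a hedged sketch of that knob, assuming opentelemetry-otlp's tonic exporter and its `WithExportConfig::with_timeout`; the endpoint is a placeholder and the three-second value is arbitrary:

```rust
use std::time::Duration;
use opentelemetry::sdk::trace::Tracer;
use opentelemetry_otlp::WithExportConfig;

fn init_tracer() -> Result<Tracer, opentelemetry::trace::TraceError> {
    opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://localhost:4317") // placeholder collector address
                .with_timeout(Duration::from_secs(3)),  // give up on a single export after 3s
        )
        // Requires the opentelemetry "rt-tokio" feature.
        .install_batch(opentelemetry::runtime::Tokio)
}
```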

@bonomat
Collaborator Author

bonomat commented Jul 19, 2022

The docs of opentelemetry_otlp say that we can configure the timeout - maybe this could help?

How would the application behave if it can't send the trace to the collector due to a timeout?

@Restioson
Collaborator

Restioson commented Jul 19, 2022

I'm not sure - opentelemetry-otlp just notes that it is the "timeout to the collector". I already asked about this in the opentelemetry-rust Gitter room, so I'll add this question too.

@bonomat
Collaborator Author

bonomat commented Sep 2, 2022

I think we can close this.

@bonomat bonomat closed this as completed Sep 2, 2022