Contributor

@rubvs rubvs commented Oct 5, 2025

  • If ES is down, we want to return an Unavailable error code, along with a user-facing error: retryable server error.
  • If elasticsearchErr is not ok, the ES instance could not be reached, so the result is not a specific ES error but a code.Internal error. The failure happens at the TCP connection level.
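
The branching described in the two points above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this PR: `Code`, `esError`, `classify`, and `newDialError` are hypothetical names, the `Code` type stands in for whatever gRPC status codes the service actually uses, and the mapping of a genuine ES error to `Unauthenticated` is an assumption made only to give the second branch something to return.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// Code is a hypothetical stand-in for a gRPC status code.
type Code string

const (
	CodeUnauthenticated Code = "Unauthenticated"
	CodeUnavailable     Code = "Unavailable"
)

// esError is a hypothetical type for an error that Elasticsearch itself
// returned, as opposed to a failure to reach Elasticsearch at all.
type esError struct {
	Status int
}

func (e *esError) Error() string {
	return fmt.Sprintf("elasticsearch error: status %d", e.Status)
}

// classify maps a non-nil error from the privileges check to a code and a
// user-facing message. If the error chain contains no esError, the request
// never got a real ES response, so the failure is treated as retryable.
func classify(err error) (Code, string) {
	var esErr *esError
	if ok := errors.As(err, &esErr); !ok {
		return CodeUnavailable, "retryable server error"
	}
	// Assumption: a real ES error here means the credentials were rejected.
	return CodeUnauthenticated, "authentication failed"
}

// newDialError builds a wrapped TCP-level error similar in shape to the
// "connection refused" failure shown in the manual test.
func newDialError() error {
	dialErr := &net.OpError{
		Op:  "dial",
		Net: "tcp",
		Err: errors.New("connect: connection refused"),
	}
	return fmt.Errorf("an error happened during the HasPrivileges query execution: %w", dialErr)
}

func main() {
	code, msg := classify(newDialError())
	fmt.Println(code, msg)

	code, msg = classify(&esError{Status: 401})
	fmt.Println(code, msg)
}
```

Using `errors.As` rather than a direct type assertion lets the check see through wrapped errors, which matters when the dial failure is wrapped with query context as in the log output below.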

Manual Test

# Console 1: Spin up a cluster
> tilt up

# Console 2: Make ES unavailable
> kubectl scale statefulset elasticsearch-es-default --replicas=0

# Console 3: Ensure a retryable error is returned to the user when trying to auth to an unavailable ES.
> TEST_COUNT=10 make run-otelbench mode=local

2025-10-05T15:56:21.572Z  error  internal/queue_sender.go:51  Exporting failed. Dropping data.
{
  "resource": {
    "service.instance.id": "645a4319-43c4-4f8b-b97b-27f5adf6dc0c",
    "service.name": "/ko-app/loadgen",
    "service.version": "0.0.1"
  },
  "otelcol.component.id": "otlp",
  "otelcol.component.kind": "exporter",
  "otelcol.signal": "logs",
  "error": "interrupted due to shutdown: rpc error: code = Unavailable desc = rpc error: code = Unavailable desc = retryable server error \"o1voipUB6v8S06vJAX5J\": an error happened during the HasPrivileges query execution: dial tcp 10.96.217.236:9200: connect: connection refused",
  "dropped_items": 325
}

@rubvs rubvs requested review from a team as code owners October 5, 2025 15:32
@rubvs rubvs requested a review from vigneshshanmugam October 5, 2025 15:45
Member

@carsonip carsonip left a comment

Can we get a unit test for the new branch that you're adding?

Member

@carsonip carsonip left a comment

If we are talking about serverless ES that is unavailable, we might get a 502 from the proxy instead. Have you considered handling 502 and surfacing it as a retryable error to the OTel client?

Contributor

@isaacaflores2 isaacaflores2 left a comment

Changes look good. I still think we need to add a separate test case for when we get a non-Elasticsearch error, like @carson mentioned here.

Contributor Author

rubvs commented Oct 31, 2025

@carsonip any non-Elasticsearch error will return a retryable error. So if the proxy sitting between ingest and ES returns any error, a retryable error should be returned.

@isaacaflores2 I've added a basic unit test to mock the proxy returning a 502.
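
For readers following along, a test of this shape might look roughly like the sketch below. It is not the test added in this PR: `checkUpstream` and `newBadGatewayProxy` are hypothetical names, and treating a 502 status as retryable is an assumption taken from the discussion above. It uses only the standard library's `net/http/httptest` to stand in for the proxy.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// checkUpstream is a hypothetical helper that calls the upstream URL and
// reports whether a failure should be surfaced to the client as retryable.
func checkUpstream(client *http.Client, url string) (retryable bool, err error) {
	resp, err := client.Get(url)
	if err != nil {
		// Transport-level failure: the upstream could not be reached at all.
		return true, fmt.Errorf("retryable server error: %w", err)
	}
	defer resp.Body.Close()

	// Assumption: a 502 comes from the proxy, not from Elasticsearch itself.
	if resp.StatusCode == http.StatusBadGateway {
		return true, fmt.Errorf("retryable server error: upstream returned %d", resp.StatusCode)
	}
	return false, nil
}

// newBadGatewayProxy starts a local test server that answers every request
// with 502 Bad Gateway, mimicking a proxy in front of an unavailable ES.
func newBadGatewayProxy() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "bad gateway", http.StatusBadGateway)
	}))
}

func main() {
	proxy := newBadGatewayProxy()
	defer proxy.Close()

	retryable, err := checkUpstream(proxy.Client(), proxy.URL)
	fmt.Println(retryable, err)
}
```

In a real test file the body of `main` would sit inside a `func TestXxx(t *testing.T)` and assert on `retryable` instead of printing it.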

Contributor

@isaacaflores2 isaacaflores2 left a comment

Thanks for the updates!

Member

@carsonip carsonip left a comment

One nit and a non-blocking question on the e2e behavior in the collector: is the error sanitized enough for user consumption?

@rubvs rubvs merged commit 3e67dd0 into main Oct 31, 2025
8 of 10 checks passed
@rubvs rubvs deleted the update-apikey-error-code branch October 31, 2025 19:13
Contributor Author

rubvs commented Oct 31, 2025

How did this merge without passing CI first? Creating a fix quickly.

@rubvs rubvs mentioned this pull request Oct 31, 2025