
ECS EC2 Prometheus pipeline example - service replicas? #454

Closed

bmcclory opened this issue Apr 20, 2021 · 7 comments

@bmcclory commented Apr 20, 2021

The example at:

https://github.com/aws-observability/aws-otel-collector/blob/main/examples/ecs/prometheus-pipeline/ecs-ec2-task-def.json

deploys the OTEL collector as a sidecar with a static Prometheus scrape configuration pointing at the linked container (Docker bridge network).

While this technique seems to work fine for a single-instance ECS task, in a replicated service the static configuration causes AMP to reject the remote-write samples as duplicates: every replica produces series with identical labels.
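
For context, here is a minimal sketch of the kind of static sidecar pipeline the example uses (the scrape target, exporter name, and AMP endpoint below are illustrative placeholders, not copied from the task definition). Because every replica of the task ships this identical config, every sidecar emits series with exactly the same labels:

```yaml
# Hypothetical sidecar collector config -- identical in every task replica.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: ecs-app
          static_configs:
            - targets: ["app:9090"]   # linked app container on the bridge network

exporters:
  awsprometheusremotewrite:           # AMP remote-write exporter shipped with ADOT
    endpoint: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [awsprometheusremotewrite]
```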

I'm new to OTEL and Prometheus; is there a recommendation for overcoming this limitation?

If we stick with the sidecar approach, it seems like ECS users would need something similar to k8sprocessor for enriching metric data with ECS container/task metadata.

Is an agent (non-sidecar) configuration recommended for ECS users, e.g. something built atop Prometheus file-based SD? I see some progress being made upstream (and the ECS observer), but I have concerns about the scalability of a single receiver in a large cluster: scraping hundreds of targets and hammering AWS APIs for service discovery (and likely bumping into throttling limits).

@pingleig (Member) commented Apr 20, 2021

TL;DR: our recommendation is to use ecsobserver with replicas set to 1. If the cluster is too big for a single receiver, you can shard using hashmod or split the discovery config across different applications.

When deployed as a replica, the otel collector is the same as the cloudwatch agent: the replica count must be 1. If you need to distribute the workload, you can use hashmod (e.g. two configs that run the same discovery logic but with different mod values). There is no operator equivalent for ECS that can do autosharding (i.e. run discovery in one place and distribute targets to multiple collectors, e.g. open-telemetry/wg-prometheus#27).
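
For illustration, a hashmod shard is just a standard Prometheus relabel rule; a sketch for two collectors follows (the file_sd path is a placeholder for whatever the discovery step writes):

```yaml
# Shard 0 of 2 -- the second collector uses the same config with regex: "1".
scrape_configs:
  - job_name: ecs-app
    file_sd_configs:
      - files: ["/etc/ecs_sd_targets.yaml"]   # targets produced by discovery
    relabel_configs:
      # Hash each target address into one of two buckets...
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_hash
        action: hashmod
      # ...and keep only the targets that land in this collector's bucket.
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep
```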

The ECS API does not have list/watch like k8s. The ecs observer has partial in-memory caching for task definitions and EC2 instances (but there is no way to cache the task API), and the default polling interval is large, e.g. 1 min. That may not work well if you are running batch jobs on ECS, though I am not sure using prometheus for short-lived tasks is a good idea.
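
For reference, the ecsobserver extension in the PR is configured roughly along these lines (field names and matchers follow the work-in-progress PR and may change; the cluster name, region, and ARN pattern are placeholders). It periodically writes the discovered targets to a file that the Prometheus receiver reads via file_sd:

```yaml
extensions:
  ecs_observer:
    cluster_name: my-ecs-cluster          # placeholder
    cluster_region: us-west-2             # placeholder
    refresh_interval: 60s                 # the large default polling interval mentioned above
    result_file: /etc/ecs_sd_targets.yaml
    task_definitions:
      - arn_pattern: ".*my-app.*"         # hypothetical matcher
        metrics_ports: [9090]

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: ecs-sd
          file_sd_configs:
            - files: ["/etc/ecs_sd_targets.yaml"]

service:
  extensions: [ecs_observer]
```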

@bmcclory (Author) commented:

Thanks for the detailed response!

> our recommendation is to use ecsobserver with replicas set to 1

Doesn't look like that PR has merged in here yet, but I see it upstream. Looks pretty slick, I'll give it a shot!

> ... if the cluster is too big for a single receiver.

Some guidance from AWS about what is "too big" might help. I was doing Prometheus file-based ECS service discovery using the EC2/ECS APIs -- similar to what's implemented in ecsobserver -- and I was seeing occasional API request throttling errors. The ecsobserver caching strategy (and longer polling interval) might help with that; I guess I'll have to try it and find out.

> I am not sure using prometheus for short-lived tasks is a good idea.

That's another thing I've been wondering about. Prometheus has a Pushgateway component that is recommended for short-lived batch jobs. I can easily stand one up and include it as a target for the OTEL receiver. But if there's another way to accomplish this goal -- e.g. to have batch apps export their Prometheus metrics directly to an AMP remote-write endpoint -- that'd be cool.
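
If the Pushgateway route works out, wiring it in would presumably just be one more scrape job in the collector's Prometheus receiver (a sketch; the Pushgateway address is a placeholder). Setting honor_labels: true keeps the job/instance labels the batch jobs pushed instead of overwriting them with the Pushgateway's own:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: pushgateway
          honor_labels: true                         # preserve labels pushed by the batch jobs
          static_configs:
            - targets: ["pushgateway.internal:9091"] # placeholder address
```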

@pingleig (Member) commented Apr 21, 2021

I think #428 is not going to get merged for a while. The original plan was to include it in the aws distro repo for the incoming release (because it's too big for upstream to merge, and I was too lazy about splitting PRs). Now we plan to do it after the prometheus receiver is GA upstream, which might give us enough time to merge smaller PRs like open-telemetry/opentelemetry-collector-contrib#3133 and import the extension from the otel contrib repo directly.

You should be able to build it from source from the PR and have it up and running; I have a WIP doc in another PR.

> Some guidance from AWS about what is "too big" might help

Actually I don't have much data. The code and caching strategy come from the cloudwatch agent, which does support prometheus on ecs but can only send data to cloudwatch (as its name suggests XD). We had an external customer saying the cloudwatch agent couldn't handle their ecs cluster, but I wasn't able to find much information about it; some teammates mentioned the conclusion was that they decided to use hashmod. I should be able to run some scale tests when I have time; after all, I don't get charged much by aws, and this should be in the FAQ section of the doc.

> have batch apps export their Prometheus metrics directly to an AMP remote-write endpoint

I am not very familiar with the push mode of prometheus; I think there is no pushgateway-like receiver in the otel collector or contrib repo. I am not sure whether the otel sdk (in any language) can push in a protocol other than otel's own http/grpc protocol. There is also a downside to pushing to a remote API from a batch job: it's hard to do retry/jitter and buffering in a batch job, and you may get throttled by the AWS API. One alternative approach I've seen people use (not for prometheus) is writing metrics as EMF logs in lambda; the cloudwatch backend then extracts metrics from the logs.

@bmcclory (Author) commented:

Appreciate the insights! Last semi-related question:

My (perhaps invalid) assumption was that the ECS Container Metrics Receiver component requires a sidecar deployment in order to gather container and task metadata for the task it belongs to. That is, a single Collector configured with the ECS Container Metrics Receiver is not going to gather metadata for all tasks and containers in the cluster -- only its own. True?

If true, and I need an OTEL sidecar for each task, then it's a bit strange/inconvenient that I also need another standalone Collector to scrape the Prometheus metrics via an ecsobserver discovery mechanism. I mean, I may as well have the sidecars scrape their statically configured linked container targets (and avoid all these potential scalability/rate-limiting pitfalls).

It seems like the pieces are almost in place to make the ECS sidecar model work out-of-the-box; the only thing that's missing is a processor for enriching those static Prometheus targets with extra resource attributes (labels) so that metrics received from replicated tasks have unique dimensions before they're exported to AMP.

@pingleig (Member) commented:

> ECS Container Metrics Receiver is not going to gather metadata for all tasks and containers in the cluster -- only its own

Yes, it uses the task metadata endpoint when creating the client, so it has no way to go beyond the task the sidecar is running inside.
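
For reference, a minimal sidecar pipeline using that receiver might look like the sketch below (the collection interval and exporter choice are illustrative; the receiver reads the task metadata endpoint ECS injects into the task, so it only ever sees its own task):

```yaml
receivers:
  awsecscontainermetrics:
    collection_interval: 20s        # how often to poll the task metadata endpoint

exporters:
  awsemf:                           # e.g. ship to CloudWatch as EMF; exporter choice is illustrative

service:
  pipelines:
    metrics:
      receivers: [awsecscontainermetrics]
      exporters: [awsemf]
```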

> have the sidecars scrape their statically configured linked container targets

This should work. I guess the reason we have an extra ecs service for prometheus in otel could be inherited from the cloudwatch agent. For cloudwatch, container metrics on ecs are collected by other agents (that users don't need to deploy and can't configure) when you enable container insights (I guess it's the ecs agent). Since there is no existing sidecar, we just run one prometheus scraper for the entire cluster.

> the only thing that's missing is a processor for enriching those static Prometheus targets with extra resource attributes

I think the ecs part of the resource detection processor might work for you, though it's missing things like the service name, the underlying ec2 instance, etc. That information is not provided by the metadata endpoint and requires extra IAM policy for calling the ECS API (describe service, describe container instance).
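
A sketch of how that could slot into the sidecar pipeline from the original example (component names as in the contrib/ADOT distributions; the scrape target and AMP endpoint are placeholders, and I'm assuming the AWS remote-write exporter supports the same resource_to_telemetry_conversion option as the plain prometheusremotewrite exporter). The ecs detector attaches task-level resource attributes, which then become metric labels, so series from different replicas no longer collide in AMP:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: ecs-app
          static_configs:
            - targets: ["app:9090"]      # linked container, as in the sidecar example

processors:
  resourcedetection:
    detectors: [ecs]                     # adds ECS task/cluster metadata as resource attributes

exporters:
  awsprometheusremotewrite:
    endpoint: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write
    resource_to_telemetry_conversion:
      enabled: true                      # flatten resource attributes into metric labels

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [resourcedetection]
      exporters: [awsprometheusremotewrite]
```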

@bmcclory (Author) commented:

> I think the ecs part of the resource detection processor might work for you

Huzzah! That's exactly what I was looking for. Thanks!

It'd be cool if the ECS Prometheus examples/guides in this repository -- which demonstrate setting up both the ECS Container Metrics Receiver and the Prometheus Receiver with the AMP exporter -- discussed these different topologies (sidecar, agent, etc.). As it stands right now, the examples aren't viable for a real ECS workload with replicated tasks, and figuring out the right combination of components (and config) is a little intimidating for newcomers.

@pingleig (Member) commented:

I have a feature request in the doc repo to let people interactively find the config they need based on their environment: aws-otel/aws-otel.github.io#99. And we are still working on making the examples in this repo more organized and on demonstrating the usage of different components.

Though I think the main reason it's not in the examples is that early versions of the aws otel distro did not ship with the resource detection processor until #393.
