How to programmatically get event "unable to place a task because the resources could not be found" #121

Open
mattolenik opened this Issue Jan 30, 2018 · 18 comments

mattolenik commented Jan 30, 2018

I'm trying to get an event for when an ECS task fails to be placed, specifically when a task cannot be placed due to insufficient resources. I want this event to trigger a Lambda, which I can use to respond with scale-out actions.

I have tried listening to the ECS event stream, but no event at all was emitted for the task placement failure, so the Lambda trigger never fired. I also didn't see anything in the CloudWatch logs for ECS.

Is there any way to receive notification of this event? We are able to alert on it in DataDog, so how do they get it? Do we need to resort to polling?

aaithal (Member) commented Jan 30, 2018

Hi @mattolenik, ECS event stream notifications are sent either when the state of the instances in your cluster changes or when the state of the tasks in your cluster changes. If you're building a solution to automatically scale your cluster when it's running low on resources, you'd have to reconstruct your cluster state from these events. A sample implementation can be found here.

If you're looking to autoscale along the CPU/memory dimensions, it's much simpler to depend instead on the CPU/memory reservation metrics for your cluster. Here's a tutorial covering that approach. Please let us know if that helps with your current setup/use case.
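
For reference, a minimal sketch of that metric-based approach, assuming a pre-existing ASG scale-out policy; the alarm name, cluster name, threshold, and policy ARN are illustrative, not taken from the tutorial:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm on the cluster-wide CPU reservation metric that ECS publishes.
# 'my-cluster', the 75% threshold, and the policy ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName='ecs-cpu-reservation-high',
    Namespace='AWS/ECS',
    MetricName='CPUReservation',
    Dimensions=[{'Name': 'ClusterName', 'Value': 'my-cluster'}],
    Statistic='Average',
    Period=300,
    EvaluationPeriods=1,
    Threshold=75.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['<scale-out-policy-arn>'],
)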

Thanks,
Anirudh

devshorts commented Jan 30, 2018

Anirudh, we'd like to scale only when we receive placement pressure: not preemptively on low CPU or memory, but only in reaction to the event that a task cannot be placed. The reason is that it's not always clear what is and isn't low resources. If someone wants to place a heavyweight task while the cluster isn't under pressure, preemptive scaling won't catch that. If we instead scale when the cluster can't accept any more tasks, we are OK with a short delay while the cluster scales up and the tasks get placed.

Our general scaling idea is to scale up by N% on pressure, and steadily scale down by one box every X minutes. This gives us a saw-tooth pattern in our cluster size and automatically compensates for over-provisioning. However, to do that we need the event fired somewhere we can capture it. Without that event, or a recurring version of it, we can't realistically scale the cluster (a sketch of the scale-down half appears at the end of this comment).

The event clearly exists in the event logs in the Amazon UI, but we can’t find it in the event stream. That makes reacting to it impossible.

We’re asking whether the event is published anywhere, or whether we need to poll some other API to get at that data.
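
As a rough sketch of the scale-down half of that saw-tooth (the "one box every X minutes" part), a scheduled Lambda could decrement the ASG one instance at a time; the group name and floor handling are assumptions:

import boto3

ASG_NAME = 'my-ecs-asg'  # hypothetical Auto Scaling group name

def scale_down_handler(event, context):
    # Invoked every X minutes by a CloudWatch Events schedule rule.
    asg = boto3.client('autoscaling')
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME])['AutoScalingGroups'][0]
    # Drain over-provisioned capacity one box at a time, never below MinSize.
    if group['DesiredCapacity'] > group['MinSize']:
        asg.set_desired_capacity(
            AutoScalingGroupName=ASG_NAME,
            DesiredCapacity=group['DesiredCapacity'] - 1)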

aaithal (Member) commented Jan 30, 2018

The event clearly exists in the event logs in the amazon UI, but we can’t find this event in the event stream.

You're referring to the event in the service event messages, correct? If so, you're right that that particular message is not published to the event stream.

I have logged this internally as a feature request.

s-maj commented Jan 31, 2018

You can use a Lambda function and CloudWatch Events (to invoke it every minute) to bump the ASG when a certain event appears. Example code to catch the message from the scheduler:

import boto3

def check_insufficient_resources(cluster):
    """Return the latest 'unable to place a task' messages for services in the cluster."""
    client = boto3.client('ecs')

    # Page through the services in the cluster (up to 500).
    paginator = client.get_paginator('list_services')
    response_iterator = paginator.paginate(
        cluster=cluster,
        PaginationConfig={'MaxItems': 500},
    )

    insufficient_resources = []
    for page in response_iterator:
        service_arns = page['serviceArns']
        # describe_services accepts at most 10 services per call.
        for i in range(0, len(service_arns), 10):
            response = client.describe_services(
                cluster=cluster,
                services=service_arns[i:i + 10],
            )
            for service in response['services']:
                # Look only at the newest event for each service.
                sorted_events = sorted(service['events'], key=lambda k: k['createdAt'], reverse=True)
                if not sorted_events:
                    continue  # new services may have no events yet
                latest_message = sorted_events[0]['message']
                if 'unable to place a task because no container instance met all of its requirements' in latest_message:
                    insufficient_resources.append(latest_message)

    return insufficient_resources
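
A minimal handler sketch that acts on the function above; the cluster name, ASG name, and one-instance scale-out step are assumptions:

import boto3

CLUSTER = 'my-cluster'   # hypothetical cluster name
ASG_NAME = 'my-ecs-asg'  # hypothetical Auto Scaling group name

def handler(event, context):
    # Invoked every minute by a CloudWatch Events schedule rule.
    if check_insufficient_resources(CLUSTER):
        asg = boto3.client('autoscaling')
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[ASG_NAME])['AutoScalingGroups'][0]
        if group['DesiredCapacity'] < group['MaxSize']:
            # Add one instance so the stuck task can be placed.
            asg.set_desired_capacity(
                AutoScalingGroupName=ASG_NAME,
                DesiredCapacity=group['DesiredCapacity'] + 1)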

mattolenik (Author) commented Jan 31, 2018

s-maj, that does not work. I already tried it, and the event with the "unable to place a task" message never appears. It simply isn't something that can be captured with the event stream; it's not a start or stop event of any kind.

willthames commented Jan 31, 2018

My workaround is to have a scheduled Lambda for each cluster that looks for services where (running tasks + pending tasks) < desired tasks, and bumps the desired capacity in that case (sketched below).

But I would love for an event to fire when a task can't be placed, which could then trigger auto scaling without the intermediate Lambda.
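
A sketch of that check, under the same boto3 assumptions as the snippets above:

import boto3

def services_short_of_capacity(cluster):
    # A service whose running + pending counts fall short of desired
    # has tasks that could not be placed.
    ecs = boto3.client('ecs')
    short = []
    for page in ecs.get_paginator('list_services').paginate(cluster=cluster):
        arns = page['serviceArns']
        for i in range(0, len(arns), 10):  # describe_services caps at 10
            resp = ecs.describe_services(cluster=cluster, services=arns[i:i + 10])
            for svc in resp['services']:
                if svc['runningCount'] + svc['pendingCount'] < svc['desiredCount']:
                    short.append(svc['serviceName'])
    return short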

nullck commented Mar 1, 2018

+1
I really would love to have this event ("was unable to place a task because the resources could not be found") in CloudWatch Events; that would allow us to trigger a Lambda function to scale out our ECS cluster.

For the time being, something like what @mattolenik mentioned seems OK.

joeykhashab commented Apr 20, 2018

Until this is added, a workaround would be to have a Lambda, on a schedule, call the APIs to get the events and save them to CloudWatch.

e.g., in Python with boto3:

import boto3

ecs_client = boto3.client('ecs')

# cluster_arn and service_arn identify the service whose events we want
services_list = ecs_client.describe_services(cluster=cluster_arn, services=[service_arn])
for service in services_list['services']:
    print(service['events'])
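
One way to get those events "into CloudWatch" is a custom metric that an alarm can watch; the Custom/ECS namespace and metric name below are made up for illustration:

import boto3

ecs = boto3.client('ecs')
cloudwatch = boto3.client('cloudwatch')

# cluster_arn and service_arn as above
resp = ecs.describe_services(cluster=cluster_arn, services=[service_arn])
for service in resp['services']:
    failures = [e for e in service['events']
                if 'unable to place a task' in e['message']]
    # Publish a count of placement-failure events; a CloudWatch alarm on
    # this metric can then drive scaling or notifications.
    cloudwatch.put_metric_data(
        Namespace='Custom/ECS',
        MetricData=[{
            'MetricName': 'TaskPlacementFailures',
            'Dimensions': [{'Name': 'ServiceName',
                            'Value': service['serviceName']}],
            'Value': len(failures),
            'Unit': 'Count',
        }])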

tkersh commented Jul 18, 2018

Ideally, ECS should auto-scale for this case without requiring manually configured triggers, Lambdas, etc. Blue/green deployment is an important use case for ECS.

As long as you have defined enough headroom between allocated capacity and maxSize, that should be right in ECS's wheelhouse.

jonathonsim commented Jul 25, 2018

+1 for these events to appear in the event stream

I can confirm they do appear in the DescribeServices API output (in the 'events' attribute), so approaches like those @joeykhashab and @s-maj propose, which poll the API rather than relying on CloudWatch Events, will work.

Although it would be much simpler and more robust to be able to trigger on a CloudWatch event.

dsouzajude commented Sep 21, 2018

+1

The problem with writing Lambdas to poll the ECS API every minute is that it could cause throttling on the ECS API itself, which can then have even more destructive side effects on how other services integrate with ECS. Publishing events about failures to start a task, in particular these, would be really good. We could then act upon them, gather metrics, and generate alerts. It would be super useful.

Would appreciate it if this could be taken as a feature request.

hampsterx commented Oct 13, 2018

+1, this is a critical event that would be very useful to be able to act on.

coultn commented Oct 31, 2018

Thanks everyone for the feedback! Please be assured that we on the ECS team are aware of this issue, and that it is under active consideration. +1's and additional details on use cases are always appreciated and will help inform our work moving forward.

efenderbosch commented Nov 16, 2018

We are in the exact same scenario as several others in this thread. We currently have something similar to the tutorial setup, where the ECS cluster auto-scales if the CPU or memory headroom falls below a certain percentage. This doesn't work, however, when the sum of free CPU/memory across the cluster is still within the threshold but no single EC2 instance has enough headroom to launch a new task. We could have 10 EC2 instances, each with 400 CPU units free, and the alarm won't trigger, but a single task that requires 500 CPU units will fail to place and auto-scaling never happens.
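
That bin-packing gap can be detected directly by checking each instance's remaining resources instead of the aggregate. A sketch (clusters with more than 100 instances would need pagination):

import boto3

def can_place(cluster, cpu_needed, mem_needed):
    # Aggregate reservation metrics can look healthy while no single
    # instance has room; check per-instance headroom instead.
    ecs = boto3.client('ecs')
    arns = ecs.list_container_instances(cluster=cluster)['containerInstanceArns']
    if not arns:
        return False
    instances = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns)['containerInstances']
    for inst in instances:
        remaining = {r['name']: r.get('integerValue', 0)
                     for r in inst['remainingResources']}
        if (remaining.get('CPU', 0) >= cpu_needed
                and remaining.get('MEMORY', 0) >= mem_needed):
            return True
    return False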

toredash commented Dec 7, 2018

Same scenario here. A manual workaround is to use CloudWatch Events as described here:
https://stackoverflow.com/questions/42394656/how-to-listen-for-an-insufficient-cpu-memory-event-in-an-aws-ecs-service

This can trigger a Lambda, SNS, or any other mechanism so you are aware there are workloads unable to start.

siwyd commented Jan 16, 2019

I find it very strange that something so obviously useful still hasn't been made available. Using ECS's internal knowledge to scale the cluster is the only approach that actually makes sense.

mtsr commented Mar 12, 2019

@toredash Unfortunately, that SO solution only works if another AWS API call is made against the service. New API calls get logged and include the list of events that have happened on the service. But if you only ever call CreateService, the service will silently fail to be placed.

vimmis commented Mar 21, 2019

+1
Very useful, especially when you don't have the liberty to add as many instances as you like; it would also enable better CI/CD checks.
