
[ECS] [request]: Timeout for RunTask #291


Open
hwatts opened this issue May 14, 2019 · 19 comments
Assignees
Labels
ECS Amazon Elastic Container Service Under consideration

Comments

@hwatts

hwatts commented May 14, 2019

Tell us about your request
An optional timeout parameter for the RunTask API

Which service(s) is this request for?
Fargate, ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
As well as services, which are expected to always be in a running state, we also run scheduled tasks in ECS that are expected to complete various batch processes, then exit. On rare occasions, these jobs can hang or take an excessive amount of time to complete, incurring cost and potentially impacting future schedules of the task. An optional timeout parameter that's enforced by the ECS scheduler would help to manage these.

Are you currently working around this issue?
Only by manually calling the StopTask API when we spot long running tasks.
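For readers wanting to automate that manual workaround: a minimal boto3 sketch (cluster name, timeout value, and helper names are illustrative; standard AWS credentials are assumed) that stops any RUNNING task older than a given limit:

```python
from datetime import datetime, timedelta, timezone

def is_overdue(started_at, max_seconds, now=None):
    """Return True if a task started at `started_at` has run longer than `max_seconds`."""
    now = now or datetime.now(timezone.utc)
    return now - started_at > timedelta(seconds=max_seconds)

def stop_overdue_tasks(cluster, max_seconds):
    """List RUNNING tasks on `cluster` and stop any running longer than `max_seconds`."""
    import boto3  # imported here so the pure helper above works without boto3 installed
    ecs = boto3.client("ecs")
    task_arns = ecs.list_tasks(cluster=cluster, desiredStatus="RUNNING")["taskArns"]
    if not task_arns:
        return []
    stopped = []
    for task in ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]:
        # startedAt is only present once the task has actually started
        if "startedAt" in task and is_overdue(task["startedAt"], max_seconds):
            ecs.stop_task(cluster=cluster, task=task["taskArn"],
                          reason=f"exceeded {max_seconds}s timeout")
            stopped.append(task["taskArn"])
    return stopped
```

Something like this still has to run on a schedule (cron, or an EventBridge-triggered Lambda), which is exactly the plumbing a native RunTask timeout would remove.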

@opqpop

opqpop commented Sep 27, 2021

Hi, any updates on this? We're running into the same issue, where rogue Fargate tasks that have gone wrong somehow end up running forever.

We've tried including timeouts in the application code itself so the task ends, but for some reason this doesn't work and the tasks just keep running.
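For anyone debugging the in-process route: a timeout set inside the application only works if it can actually interrupt the running code. A minimal SIGALRM-based sketch (Unix only, main thread only; all names are illustrative):

```python
import signal

class Timeout(Exception):
    """Raised when the wrapped call exceeds its deadline."""

def run_with_timeout(func, seconds):
    """Run func() and raise Timeout if it takes longer than `seconds`.

    Unix only, and the alarm can only interrupt the main thread.
    """
    def handler(signum, frame):
        raise Timeout(f"exceeded {seconds}s")
    old = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        return func()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)
```

If the real work happens in a subprocess or another thread, SIGALRM will not interrupt it, which may explain tasks that keep running despite an in-code timeout.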

@Hc747

Hc747 commented Oct 28, 2021

Would also love to see this implemented!

@palharsh

Also running into a similar issue: the task doesn't stop even when the underlying process completes. It only happens occasionally (< 1% of executions), but it still matters, since otherwise we have to manually check for rogue tasks every time, or build yet another automation to stop them.

@jeroenhabets

To add to the scenario: "On rare occasions, these jobs can hang or take an excessive amount of time to complete, incurring cost and potentially impacting future schedules of the task. "
in our case it can also block one of our queues (as one of the few workers is blocked indefinitely)

@otavioribeiromedeiros

Facing same problem here.

@mikedorfman

I'm running into the same problem. Step Functions can submit ECS tasks but doesn't clean them up (even if a timeout is specified). I have to set up a relatively elaborate catch-and-cleanup in Step Functions to clean up jobs that hang indefinitely and would otherwise block further processing. It would be so much easier if we could just specify a stop-after-x-seconds value in ECS.

@ewascent

ewascent commented Mar 8, 2023

Me as well. I want this, plz.

@mreferre

I am wrapping this up in a short blog post to add more context, but in the meantime I built a Step Functions workflow that kicks off when a task starts, checks whether there is a TIMEOUT tag associated with the task, and if there is, waits n seconds before sending a StopTask (where n is the value of the TIMEOUT tag).

This is the CFN template that includes everything (the SF workflow, EB rules, IAM roles, etc.). There is nothing else to do: once the stack is deployed as-is, every task launched in the account/region with a TIMEOUT tag will be stopped after the specified number of seconds.

Resources:
  ecstaskrunning:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.ecs
        detail-type:
          - ECS Task State Change
        detail:
          lastStatus:
            - RUNNING
          desiredStatus:
            - RUNNING
      Targets:
        - Id: !GetAtt tasktimeoutstatemachine.Name
          Arn: !Ref tasktimeoutstatemachine
          RoleArn: !GetAtt ecstaskrunningTotasktimeoutstatemachine.Arn
  tasktimeoutstatemachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      Definition:
        Comment: State machine to stop ECS tasks after their TIMEOUT tag value elapses
        StartAt: ListTagsForResource
        States:
          ListTagsForResource:
            Type: Task
            Next: CheckTimeout
            Parameters:
              ResourceArn.$: $.resources[0]
            ResultPath: $.listTagsForResource
            Resource: arn:aws:states:::aws-sdk:ecs:listTagsForResource
          CheckTimeout:
            Type: Pass
            Parameters:
              timeoutexists.$: States.ArrayLength($.listTagsForResource.Tags[?(@.Key == 'TIMEOUT')])
            ResultPath: $.timeoutconfiguration
            Next: IsTimeoutSet
          IsTimeoutSet:
            Type: Choice
            Choices:
              - Variable: $.timeoutconfiguration.timeoutexists
                NumericEquals: 1
                Next: GetTimeoutValue
            Default: Success
          GetTimeoutValue:
            Type: Pass
            Parameters:
              timeoutvalue.$: States.ArrayGetItem($.listTagsForResource.Tags[?(@.Key == 'TIMEOUT')].Value, 0)
            ResultPath: $.timeoutconfiguration
            Next: Wait
          Success:
            Type: Succeed
          Wait:
            Type: Wait
            Next: StopTask
            SecondsPath: $.timeoutconfiguration.timeoutvalue
          StopTask:
            Type: Task
            Parameters:
              Task.$: $.resources[0]
              Cluster.$: $.detail.clusterArn
            Resource: arn:aws:states:::aws-sdk:ecs:stopTask
            End: true
      Logging:
        Level: ALL
        IncludeExecutionData: true
        Destinations:
          - CloudWatchLogsLogGroup:
              LogGroupArn: !GetAtt tasktimeoutstatemachineLogGroup.Arn
      Policies:
        - AWSXrayWriteOnlyAccess
        - Statement:
            - Effect: Allow
              Action:
                - ecs:ListTagsForResource
                - ecs:StopTask
                - logs:CreateLogDelivery
                - logs:GetLogDelivery
                - logs:UpdateLogDelivery
                - logs:DeleteLogDelivery
                - logs:ListLogDeliveries
                - logs:PutResourcePolicy
                - logs:DescribeResourcePolicies
                - logs:DescribeLogGroups
              Resource: '*'
      Tracing:
        Enabled: true
      Type: STANDARD
  tasktimeoutstatemachineLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub
        - /aws/vendedlogs/states/${AWS::StackName}-${ResourceId}-Logs
        - ResourceId: tasktimeoutstatemachine
  ecstaskrunningTotasktimeoutstatemachine:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          Effect: Allow
          Principal:
            Service: !Sub events.${AWS::URLSuffix}
          Action: sts:AssumeRole
          Condition:
            ArnLike:
              aws:SourceArn: !Sub
                - arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/${AWS::StackName}-${ResourceId}-*
                - ResourceId: ecstaskrunning
      Policies:
        - PolicyName: StartExecutionPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: states:StartExecution
                Resource: !Ref tasktimeoutstatemachine
Transform: AWS::Serverless-2016-10-31

I hear you that the ideal solution would be native support for this capability in ECS, but I am curious whether an approach like this would work. In addition to having to pay extra for Step Functions (I hear you, again), what are the other reasons this approach would not work vs. a timeout flag on RunTask?

@mreferre

This is the blog post that gets into more context: https://it20.info/2023/03/configuring-a-timeout-for-amazon-ecs-tasks/

@maishsk

maishsk commented Mar 20, 2023

This looks similar to #572

@jeroenhabets

@mreferre regarding your question

what are the other reasons this approach would not work vs. a timeout flag on RunTask?

Many turn to AWS and services like ECS to handle most of their hosting complexity so they can focus on where they can deliver the most value. So "it" may technically work (i.e. the Step Functions approach 🚀), but it introduces needless complexity (compared to a timeout flag), not only in implementation but also in maintenance and support.

With almost 250 👍 in total, let's hope the ECS team can deliver this (sub)feature sometime soon.

@mreferre

@jeroenhabets fair enough. Thanks!

@maishsk

maishsk commented May 17, 2023

@mreferre regarding your question

what are the other reasons this approach would not work vs. a timeout flag on RunTask?

Many turn to AWS and services like ECS to handle most of their hosting complexity so they can focus on where they can deliver the most value. So "it" may technically work (i.e. the Step Functions approach 🚀), but it introduces needless complexity (compared to a timeout flag), not only in implementation but also in maintenance and support.

With almost 250 👍 in total, let's hope the ECS team can deliver this (sub)feature sometime soon.

I would like to suggest another method, and that is using a sidecar container, all native inside ECS.
Add a small essential container to your task definition which runs a sleep command and exits after a defined amount of time.

{
  "family": "lifespan",
  "networkMode": "awsvpc",
  "requiresCompatibilities": [
    "EC2",
    "FARGATE"
  ],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "nginx",
      "image": "public.ecr.aws/nginx/nginx:mainline",
      "essential": true
    },
    {
      "name": "lifespan",
      "image": "public.ecr.aws/docker/library/busybox:stable",
      "essential": true,
      "command": [
        "sh",
        "-c",
        "sleep $TIMEOUT"
      ],
      "environment": [
        {
          "name": "TIMEOUT",
          "value": "60"
        }
      ]
    }
  ]
}

For a more detailed explanation I wrote this up

Also helps with #572

Would be interested in your feedback.

@calebplum

Please implement this

@jeroenhabets

@maishsk same feedback for your workaround :

Many turn to AWS and services like ECS to handle most of their hosting complexity so they can focus on where they can deliver the most value. So "it" may technically work (i.e. the Step Functions approach 🚀), but it introduces needless complexity (compared to a timeout flag), not only in implementation but also in maintenance and support.

With almost 250 👍 in total, let's hope the ECS team can deliver this (sub)feature sometime soon.

@plurch

plurch commented Jan 27, 2024

A potential workaround is to enforce the timeout in your own application code ENTRYPOINT or CMD calls.

For example, I am using the timeout command with an environment variable that I set like this:

timeout ${TIMEOUT} python my_script.py

@ldipotetjob

This feature could definitely be quite helpful; trying to do this now is challenging. BTW, in our use case we had to run tasks at different times, so services are not an option.

@vibhav-ag vibhav-ag self-assigned this Oct 23, 2024
@github-project-automation github-project-automation bot moved this to Researching in containers-roadmap Oct 23, 2024
@jenmlinaws jenmlinaws added Under consideration and removed Proposed Community submitted issue labels Oct 23, 2024
@trallnag

This is similar to activeDeadlineSeconds in Kubernetes.

Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds field of the Job to a number of seconds. The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.

https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
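For comparison, a minimal Job manifest using that field (illustrative names and values) looks like:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-with-deadline
spec:
  activeDeadlineSeconds: 600   # terminate all of the Job's Pods after 10 minutes
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: public.ecr.aws/docker/library/busybox:stable
          command: ["sh", "-c", "run-the-batch-job"]   # placeholder command
```

This is essentially the per-Job equivalent of the per-RunTask timeout requested here.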

@craigmcnamara

craigmcnamara commented Jan 30, 2025

I was able to implement this using timeout in my docker entrypoint.

timeout --foreground 12h "$@"
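Expanded into a complete entrypoint script (a sketch; TASK_TIMEOUT is an assumed environment variable, and timeout here is the GNU coreutils one, which exits with status 124 when the deadline fires):

```shell
#!/bin/sh
# entrypoint.sh: wrap the container command (passed as arguments) in a hard
# deadline. TASK_TIMEOUT is an assumed env var such as "12h" or "900s".
# --foreground lets the wrapped command receive signals when running as PID 1;
# on expiry, timeout exits with status 124, so the container stops with a
# non-zero exit code that ECS reports for the container.
if [ "$#" -gt 0 ]; then
  exec timeout --foreground "${TASK_TIMEOUT:-12h}" "$@"
fi
```

In the Dockerfile this would be set as `ENTRYPOINT ["/entrypoint.sh"]`, with the task's command passed through as the arguments.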
