Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EcsRunTaskOperator does not send logs if task times out #39026

Open
1 of 2 tasks
anneadb opened this issue Apr 15, 2024 · 6 comments
Open
1 of 2 tasks

EcsRunTaskOperator does not send logs if task times out #39026

anneadb opened this issue Apr 15, 2024 · 6 comments
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:amazon-aws AWS/Amazon - related issues

Comments

@anneadb
Copy link

anneadb commented Apr 15, 2024

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.7.1

Apache Airflow version

2.7.2

Operating System

Amazon Linux AMI

Deployment

Amazon (AWS) MWAA

Deployment details

We are deploying MWAA using CDK.
Dags are located in S3.

What happened

We are running several Airflow tasks that start ECS containers and wait for their completion before starting the following tasks. I noticed that logs only appear upon the completion of the task.
However, if a task runs into a TimeoutError in Airflow, no task logs are created at all in Cloudwatch.

I have been in contact with AWS Enterprise Support but they are saying that the issue is located in the Operator and not MWAA.

What you think should happen instead

ECS task logs are written to CloudWatch even if the Airflow task fails.
Even better: task logs are written continuously to CloudWatch.

How to reproduce

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

DEFAULT_ARGS = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2024, 1, 1),
}

with DAG(
    "sample",
    default_args=DEFAULT_ARGS,
    description="A test DAG to try out functionality",
    schedule=None,
) as dag:
    EcsRunTaskOperator(
        task_id="ecs_operator",
        dag=dag,
        execution_timeout=timedelta(minutes=2),
        retries=3,
        aws_conn_id="aws_default",
        cluster="ecs-airflow-cluster",
        task_definition="ecs-airflow-task:1",
        launch_type="FARGATE",
        overrides={"containerOverrides": [
                {
                "name": CONTAINER_NAME,
                "command": ["python", "-c", "import time; for i in range(30): print(i); time.sleep(10)"],
            },
        ]},
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxx", "subnet-xxx"],
                 "securityGroups": ["sg-xxxx"]
        }
        },
        awslogs_group="ecs-airflow",
        awslogs_region="eu-central-1",
        awslogs_stream_prefix=f"ecs/{CONTAINER_NAME}",
        awslogs_fetch_interval=timedelta(seconds=5)
    )

Anything else

Logs example:

ip-xxxxx.eu-central-1.compute.internal
*** Reading remote log from Cloudwatch log_group: airflow-my_env_name-Task log_stream: dag_id=my_dag_id/run_id=scheduled__2024-04-02T05_30_00+00_00/task_id=mytaskid/attempt=1.log.
[2024-04-03, 07:30:19 CEST] {{taskinstance.py:1159}} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: my_dag_id.mytaskid scheduled__2024-04-02T05:30:00+00:00 [queued]>
[2024-04-03, 07:30:20 CEST] {{taskinstance.py:1159}} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: my_dag_id.mytaskid scheduled__2024-04-02T05:30:00+00:00 [queued]>
[2024-04-03, 07:30:20 CEST] {{taskinstance.py:1361}} INFO - Starting attempt 1 of 2
[2024-04-03, 07:30:20 CEST] {{taskinstance.py:1382}} INFO - Executing <Task(EcsRunTaskOperator): mytaskid> on 2024-04-02 05:30:00+00:00
[2024-04-03, 07:30:20 CEST] {{standard_task_runner.py:57}} INFO - Started process 24629 to run task
[2024-04-03, 07:30:20 CEST] {{standard_task_runner.py:84}} INFO - Running: ['airflow', 'tasks', 'run', 'my_dag_id', 'mytaskid', 'scheduled__2024-04-02T05:30:00+00:00', '--job-id', '3107', '--raw', '--subdir', 'DAGS_FOLDER/my_dag_file.py', '--cfg-path', '/tmp/tmpw6wsfwwu']
[2024-04-03, 07:30:20 CEST] {{standard_task_runner.py:85}} INFO - Job 3107: Subtask mytaskid
[2024-04-03, 07:30:20 CEST] {{task_command.py:416}} INFO - Running <TaskInstance: my_dag_id.mytaskid scheduled__2024-04-02T05:30:00+00:00 [running]> on host ip-10-123-136-202.eu-central-1.compute.internal
[2024-04-03, 07:30:21 CEST] {{taskinstance.py:1662}} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='di' AIRFLOW_CTX_DAG_ID='my_dag_id' AIRFLOW_CTX_TASK_ID='mytaskid' AIRFLOW_CTX_EXECUTION_DATE='2024-04-02T05:30:00+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-04-02T05:30:00+00:00'
[2024-04-03, 07:30:21 CEST] {{ecs.py:531}} INFO - Running ECS Task - Task definition: my_task_definition_name - on cluster my-cluster-name
[2024-04-03, 07:30:21 CEST] {{ecs.py:534}} INFO - EcsOperator overrides: {'containerOverrides': []}
[2024-04-03, 07:30:22 CEST] {{base.py:73}} INFO - Using connection ID 'my_aws_account_connection_id' for task execution.
[2024-04-03, 07:30:24 CEST] {{ecs.py:644}} INFO - ECS Task started: {'tasks': [{'attachments': [{'id': 'eab65079-d61e-4c16-9984-cec51e01cae4', 'type': 'ElasticNetworkInterface', 'status': 'PRECREATED', 'details': [{'name': 'subnetId', 'value': 'subnet-12345'}]}], 'attributes': [{'name': 'ecs.cpu-architecture', 'value': 'x86_64'}], 'availabilityZone': 'eu-central-1b', 'clusterArn': 'arn:aws:ecs:eu-central-1:123456789:cluster/my-cluster-name', 'containers': [{'containerArn': 'arn:aws:ecs:eu-central-1:123456789:container/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e/10c5ac9b-9623-418b-afd7-10d2d0b762f6', 'taskArn': 'arn:aws:ecs:eu-central-1:123456789:task/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e', 'name': 'my_task_definition_name', 'image': '123456789.dkr.ecr.eu-central-1.amazonaws.com/my_task_definition_name:0.0.2', 'lastStatus': 'PENDING', 'networkInterfaces': [], 'cpu': '0'}], 'cpu': '512', 'createdAt': datetime.datetime(2024, 4, 3, 5, 30, 24, 868000, tzinfo=tzlocal()), 'desiredStatus': 'RUNNING', 'enableExecuteCommand': False, 'group': 'family:my_task_definition_name', 'lastStatus': 'PROVISIONING', 'launchType': 'FARGATE', 'memory': '1024', 'overrides': {'containerOverrides': [{'name': 'my_task_definition_name'}], 'inferenceAcceleratorOverrides': []}, 'platformVersion': '1.4.0', 'platformFamily': 'Linux', 'startedBy': 'di', 'tags': [], 'taskArn': 'arn:aws:ecs:eu-central-1:123456789:task/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e', 'taskDefinitionArn': 'arn:aws:ecs:eu-central-1:123456789:task-definition/my_task_definition_name:2', 'version': 1, 'ephemeralStorage': {'sizeInGiB': 20}}], 'failures': [], 'ResponseMetadata': {'RequestId': '02377efa-1c03-443e-aad8-c6d0d6157f06', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '02377efa-1c03-443e-aad8-c6d0d6157f06', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1692', 'date': 'Wed, 03 Apr 2024 05:30:24 GMT'}, 'RetryAttempts': 0}}
[2024-04-03, 07:30:24 CEST] {{ecs.py:647}} INFO - ECS task ID is: d6907e6e16c54ff0ba6bc477ef4f5c6e
[2024-04-03, 07:30:25 CEST] {{ecs.py:573}} INFO - Starting ECS Task Log Fetcher
[2024-04-03, 07:30:55 CEST] {{base.py:73}} INFO - Using connection ID 'my_aws_account_connection_id' for task execution.
[2024-04-03, 07:50:21 CEST] {{timeout.py:68}} ERROR - Process timed out, PID: 24629
[2024-04-03, 07:50:21 CEST] {{ecs.py:757}} INFO - {'task': {'attachments': [{'id': 'eab65079-d61e-4c16-9984-cec51e01cae4', 'type': 'ElasticNetworkInterface', 'status': 'ATTACHED', 'details': [{'name': 'subnetId', 'value': 'subnet-12345'}, {'name': 'networkInterfaceId', 'value': 'eni-08793a79328aa8a69'}, {'name': 'macAddress', 'value': '06:08:01:5a:69:fb'}, {'name': 'privateDnsName', 'value': 'ip-10-123-41-14.eu-central-1.compute.internal'}, {'name': 'privateIPv4Address', 'value': 'xxxxx'}]}], 'attributes': [{'name': 'ecs.cpu-architecture', 'value': 'x86_64'}], 'availabilityZone': 'eu-central-1b', 'clusterArn': 'arn:aws:ecs:eu-central-1:123456789:cluster/my-cluster-name', 'connectivity': 'CONNECTED', 'connectivityAt': datetime.datetime(2024, 4, 3, 5, 30, 28, 546000, tzinfo=tzlocal()), 'containers': [{'containerArn': 'arn:aws:ecs:eu-central-1:123456789:container/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e/10c5ac9b-9623-418b-afd7-10d2d0b762f6', 'taskArn': 'arn:aws:ecs:eu-central-1:123456789:task/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e', 'name': 'my_task_definition_name', 'image': '123456789.dkr.ecr.eu-central-1.amazonaws.com/my_task_definition_name:0.0.2', 'imageDigest': 'sha256:9d8af306540177ee5faef1dea1f2c72c3e16a02aedcb0c85fdd42cf96f8eb7d9', 'runtimeId': 'd6907e6e16c54ff0ba6bc477ef4f5c6e-873919162', 'lastStatus': 'RUNNING', 'networkBindings': [], 'networkInterfaces': [{'attachmentId': 'eab65079-d61e-4c16-9984-cec51e01cae4', 'privateIpv4Address': 'xxxxx'}], 'cpu': '0'}], 'cpu': '512', 'createdAt': datetime.datetime(2024, 4, 3, 5, 30, 24, 868000, tzinfo=tzlocal()), 'desiredStatus': 'STOPPED', 'enableExecuteCommand': False, 'group': 'family:my_task_definition_name', 'lastStatus': 'RUNNING', 'launchType': 'FARGATE', 'memory': '1024', 'overrides': {'containerOverrides': [{'name': 'my_task_definition_name'}], 'inferenceAcceleratorOverrides': []}, 'platformVersion': '1.4.0', 'platformFamily': 'Linux', 'pullStartedAt': datetime.datetime(2024, 4, 3, 5, 30, 37, 967000, tzinfo=tzlocal()), 'pullStoppedAt': datetime.datetime(2024, 4, 3, 5, 31, 9, 202000, tzinfo=tzlocal()), 'startedAt': datetime.datetime(2024, 4, 3, 5, 31, 9, 595000, tzinfo=tzlocal()), 'startedBy': 'di', 'stopCode': 'UserInitiated', 'stoppedReason': 'Task killed by the user', 'stoppingAt': datetime.datetime(2024, 4, 3, 5, 50, 21, 287000, tzinfo=tzlocal()), 'tags': [], 'taskArn': 'arn:aws:ecs:eu-central-1:123456789:task/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e', 'taskDefinitionArn': 'arn:aws:ecs:eu-central-1:123456789:task-definition/my_task_definition_name:2', 'version': 4, 'ephemeralStorage': {'sizeInGiB': 20}}, 'ResponseMetadata': {'RequestId': '67522c9e-82c3-45e9-b0b1-2c71a9ce1fc0', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '67522c9e-82c3-45e9-b0b1-2c71a9ce1fc0', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2470', 'date': 'Wed, 03 Apr 2024 05:50:20 GMT'}, 'RetryAttempts': 0}}
[2024-04-03, 07:50:21 CEST] {{taskinstance.py:1937}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 578, in execute
    self._wait_for_task_ended()
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 669, in _wait_for_task_ended
    waiter.wait(
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/botocore/waiter.py", line 393, in wait
    time.sleep(sleep_amount)
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/utils/timeout.py", line 69, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 24629
[2024-04-03, 07:50:21 CEST] {{taskinstance.py:1400}} INFO - Marking task as UP_FOR_RETRY. dag_id=my_dag_id, task_id=mytaskid, execution_date=20240402T053000, start_date=20240403T053019, end_date=20240403T055021
[2024-04-03, 07:50:21 CEST] {{standard_task_runner.py:104}} ERROR - Failed to execute job 3107 for task mytaskid (Timeout, PID: 24629; 24629)
[2024-04-03, 07:50:21 CEST] {{local_task_job_runner.py:228}} INFO - Task exited with return code 1
[2024-04-03, 07:50:21 CEST] {{taskinstance.py:2778}} INFO - 0 downstream tasks scheduled from follow-on schedule check

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@anneadb anneadb added area:providers kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Apr 15, 2024
@eladkal eladkal added provider:amazon-aws AWS/Amazon - related issues good first issue and removed needs-triage label for new issues that we didn't triage yet labels Apr 15, 2024
@slycyberguy
Copy link
Contributor

May I handle this issue?

@potiuk
Copy link
Member

potiuk commented Apr 16, 2024

Feel free.

@mateuslatrova
Copy link
Contributor

Hi @slycyberguy , are you still handling this issue? I am also interested in it, in case you need any help.

@slycyberguy
Copy link
Contributor

Hi @slycyberguy , are you still handling this issue? I am also interested in it, in case you need any help.

Hi. Yeah, I'm working on it. I should be done in a few more days. I'll let you know if anything changes. Thanks for offering.

@slycyberguy
Copy link
Contributor

Hi @slycyberguy , are you still handling this issue? I am also interested in it, in case you need any help.

Hi @mateuslatrova. It looks like I'm getting overwhelmed at work with a lot of things that require high priority at the moment. I think I might be free to work on it next week, but I don't want this sitting around for too long. So if you still want, you can take over the issue. I'm unassigning myself.

@slycyberguy slycyberguy removed their assignment May 20, 2024
@mateuslatrova
Copy link
Contributor

Hi @slycyberguy , are you still handling this issue? I am also interested in it, in case you need any help.

Hi @mateuslatrova. It looks like I'm getting overwhelmed at work with a lot of things that require high priority at the moment. I think I might be free to work on it next week, but I don't want this sitting around for too long. So if you still want, you can take over the issue. I'm unassigning myself.

Hi @slycyberguy !

Right now I am focused on another issue. If I finish it before anyone takes this one over, I will let everyone know that I will work on it.

Thanks for letting me know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

No branches or pull requests

5 participants