Exit code 255 on 2.24.0 release #348

Closed
ChristianAlexander opened this issue May 10, 2022 · 29 comments

Comments

@ChristianAlexander

Describe the question/issue

ECS task is not making it past the pending stage, with the fluent bit container exiting with a 255 status code.

This is only happening with 2.24.0, not 2.23.4.

Configuration

{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::xxxx:role/Execution-Role",
  "containerDefinitions": [
    {
      // Other container definitions here
    },
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": null,
      "entryPoint": null,
      "portMappings": [],
      "command": null,
      "linuxParameters": null,
      "cpu": 0,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "906394416424.dkr.ecr.us-east-1.amazonaws.com/aws-for-fluent-bit:latest",
      "startTimeout": null,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "config-file-type": "file",
          "enable-ecs-log-metadata": "true",
          "config-file-value": "/fluent-bit/configs/parse-json.conf"
        }
      },
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": "0",
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "log_router"
    }
  ],
  "placementConstraints": [],
  "memory": "2048",
  "taskRoleArn": null,
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:xxxx:task-definition/xxxx:123",
  "family": "xxxx",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.firelens.fluentbit"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.firelens.options.config.file"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.secrets.asm.environment-variables"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awsfirelens"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.secrets.asm.bootstrap.log-driver"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "runtimePlatform": null,
  "cpu": "1024",
  "revision": 123,
  "status": "INACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": [],
  "statusString": "(INACTIVE)"
}

Fluent Bit Log Output

I was unable to obtain logs from the container, as it crashed.

Fluent Bit Version Info

This has been an issue on latest and 2.24.0, but was not an issue with stable or 2.23.4.

Cluster Details

ECS Fargate, VPC endpoints, sidecar deployment.

Private network with API gateway to the outside world.

Application Details

At startup, the service produces ~10 logs in the first second or two.

Steps to reproduce issue

  • Deploy 2.24.0
  • Observe that the task is stuck in a pending state, with the fluent bit container exiting 255

I have observed a rollback to 2.23.4 successfully being deployed.

Related Issues

None that I could find

@stinney1103

We are seeing this as well.

@chester0

chester0 commented May 11, 2022

We got this as well; we switched from the latest tag to the stable tag to get it working again.
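
For anyone auditing their own task definitions after this, here is a minimal Python sketch that flags containers whose image uses a floating tag. It assumes `latest` and `stable` are the only floating tags of interest (both are mentioned in this thread). The helper names `image_tag` and `floating_images` are made up for illustration and are not part of any AWS tooling.

```python
# Tags assumed to "float" (move between releases); both appear in this thread.
FLOATING_TAGS = {"latest", "stable"}


def image_tag(image):
    """Return the tag of a container image reference, or None if pinned by digest."""
    # Drop the registry host and repository path (the host may contain ':port').
    last_segment = image.rsplit("/", 1)[-1]
    if "@" in last_segment:  # e.g. name@sha256:...
        return None
    _name, sep, tag = last_segment.partition(":")
    # Docker treats a missing tag as "latest".
    return tag if sep else "latest"


def floating_images(task_def):
    """List (container name, image) pairs whose image uses a floating tag."""
    return [
        (c["name"], c["image"])
        for c in task_def.get("containerDefinitions", [])
        if image_tag(c["image"]) in FLOATING_TAGS
    ]


# Illustrative task definition fragment; container names are hypothetical.
task_def = {
    "containerDefinitions": [
        {"name": "log_router", "image": "amazon/aws-for-fluent-bit:stable"},
        {"name": "pinned_router", "image": "amazon/aws-for-fluent-bit:2.23.4"},
    ]
}
print(floating_images(task_def))
```

Pinning to an explicit version such as 2.23.4 means a new release cannot change the running image until the task definition is updated on purpose.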

@nakulpathak3

I can confirm this is happening as well. It caused a lot of confusion and crashes across all our services yesterday 😅

@PettitWesley
Contributor

No one got any logs before the crash? Usually if FB starts up it will still print something. Can you all please share the configs that led to this crash?

@PettitWesley
Contributor

It should be noted that our release runs through two sets of tests before we push out images:

  1. Simple integ tests that use forward and each AWS plugin: https://github.com/aws/aws-for-fluent-bit/tree/mainline/integ
  2. Load tests which run real ECS Fargate tasks and the results of which go in our release notes: https://github.com/aws/aws-for-fluent-bit/releases

So the fairly simple use cases of just an AWS output and nothing else do not appear to crash for this version.

Once someone sends me a config that causes a crash I will repro it from the AWS side. @ChristianAlexander Your task def uses the built-in JSON parser config file. I will double check that, but I don't see the logConfiguration.options for any of your containers, so I don't know what output plugin you are using. Please add those details.

@bimp

bimp commented May 11, 2022

I can confirm my team saw the same issue. Upgrading to 2.25.0 still caused issues. Rolled back to 2.23.4 to fix things.

@ChristianAlexander
Author

My apologies! Here's the logging configuration section from the application container:

"logConfiguration": {
    "logDriver": "awsfirelens",
    "secretOptions": [
        {
            "valueFrom": "arn:aws:secretsmanager:us-east-1:xxxx:secret:xxxx/datadog_api_key-xxxx",
            "name": "apikey"
        }
    ],
    "options": {
        "dd_message_key": "log",
        "provider": "ecs",
        "dd_service": "xxxx",
        "Host": "http-intake.logs.datadoghq.com",
        "TLS": "on",
        "dd_source": "ecs",
        "dd_tags": "env:staging",
        "env": "staging",
        "Name": "datadog"
    }
}

@bimp

bimp commented May 11, 2022

Additional config info. FYI we use fluent-bit as part of a Fargate Firelens stack pushing to kinesis firehose --> elasticsearch.
Hope this helps

fluent-bit sidecar container task definition

{
      "essential": true,
      "image": "xxxxxxxxxxxxxxxxx.dkr.ecr.${region}.amazonaws.com/fluent-bit:latest",
      "name": "log_router",
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "config-file-type": "file",
          "config-file-value": "/fluentbit.conf"
        }
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "${aws_logs_group}",
          "awslogs-region": "${region}",
          "awslogs-stream-prefix": "${app}-${env}-${region}"
        }
      },
      "memoryReservation": 50,
      "environment": [{ "name": "BUILDNUMBER", "value": "${build_version}" }]
}

And the fluentbit.conf file:

[SERVICE]
    Parsers_File /fluent-bit/parsers/parsers.conf
    Flush 1
    Grace 30

[FILTER]
    Name parser
    Match *
    Key_Name log
    Parser json
    Reserve_Data True

[FILTER]
    Name record_modifier
    Match *
    Record build-number $${BUILDNUMBER}
    Reserve_Data True

@PettitWesley
Contributor

@ChristianAlexander I replicated the config firelens would generate from your task def locally (ref: https://github.com/aws-samples/amazon-ecs-firelens-under-the-hood/blob/mainline/generated-configs/fluent-bit/generated_by_firelens.conf)

And this is what I get:

[2022/05/12 00:02:25] [error] [config] datadog: unknown configuration property 'env'. The following properties are allowed: compress, apikey, dd_service, dd_source, dd_tags, proxy, include_tag_key, tag_key, dd_message_key, provider, and json_date_key.
[2022/05/12 00:02:25] [ help] try the command: /fluent-bit/bin/fluent-bit -o datadog -h

env is not a valid config option.

https://docs.fluentbit.io/manual/pipeline/outputs/datadog

@PettitWesley
Contributor

@bimp I think this might not be your full config since I do not see an output. Is that output defined in your app's logConfiguration.options?

@bimp

bimp commented May 12, 2022

@PettitWesley Our Fargate service log configuration is the following, which streams logs to Kinesis Firehose:

"logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "firehose",
          "region": "${region}",
          "delivery_stream": "${delivery_stream}",
          "time_key": "timestamp",
          "time_key_format": "%Y-%m-%dT%H:%M:%S%z"
        }
 }

@albertschwarzkopf

albertschwarzkopf commented May 12, 2022

Same here with version 2.24.0 and 2.25.0

My error output:

Fluent Bit v1.9.3
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/05/12 08:01:51] [ info] [fluent bit] version=1.9.3, commit=9eb4996b7d, pid=1
[2022/05/12 08:01:51] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/12 08:01:51] [ info] [cmetrics] version=0.3.1
[2022/05/12 08:01:51] [error] [config] systemd: unknown configuration property 'Parser'. The following properties are allowed: path, max_fields, max_entries, systemd_filter_type, systemd_filter, read_from_tail, lowercase, strip_unders
[2022/05/12 08:01:51] [error] [lib] backend failed
[2022/05/12 08:01:51] [ help] try the command: /fluent-bit/bin/fluent-bit -i systemd -h
[2022/05/12 08:01:51] [ info] [input] pausing tail.0

No problem with version 2.23.4 and the same config. Deployed as a DaemonSet in AWS EKS 1.22.

My Config as Configmap:


--- 
apiVersion: v1
data: 
  cloudwatch-output.conf: |
      [OUTPUT]
          Name cloudwatch
          Match   kube.*
          region eu-central-1
          log_group_name /eks/XXX/containers
          log_stream_prefix fluentbit.
          auto_create_group true
          log_retention_days 7
          
      [OUTPUT]
          Name cloudwatch
          Match   host.*
          region eu-central-1
          log_group_name /eks/XXX/kubelet
          log_stream_prefix fluentbit.
          auto_create_group true
          log_retention_days 7
  fluent-bit.conf: |
      [SERVICE]
          Parsers_File /fluent-bit/parsers/parsers.conf
      
      [INPUT]
          Name              tail
          Tag               kube.*
          Path              /var/log/containers/*.log
          Exclude_Path      /var/log/containers/*_starboard-system_*.log
          DB                /var/log/flb_kube.db
          Parser            docker
          Docker_Mode       On
          Mem_Buf_Limit     5MB
          Skip_Long_Lines   On
          Refresh_Interval  10
          
      [INPUT]
          Name            systemd
          Tag             host.*
          Systemd_Filter  _SYSTEMD_UNIT=kubelet.service
          Path              /var/log/journal
          Parser            syslog-rfc3164-local
          DB                /var/log/flb_kube.db
          Mem_Buf_Limit     5MB
          Skip_Long_Lines   On
          Refresh_Interval  10        
      
      [FILTER]
          Name                kubernetes
          Match               kube.*
          Kube_URL            https://kubernetes.default.svc.cluster.local:443
          Merge_Log           On
          Merge_Log_Key       data
          K8S-Logging.Parser  On
          K8S-Logging.Exclude On
          
      [FILTER]
          Name                record_modifier
          Match               host.*
          Record              containername ${HOSTNAME}
      
      @INCLUDE s3-output.conf
  s3-output.conf: |
      [OUTPUT]
          Name                s3
          Match               kube.*
          Bucket              XXX
          region              eu-central-1
          total_file_size     1M
          use_put_object      On
          upload_timeout      60s
          Compression         gzip
          s3_key_format       /eks/containers/$TAG[4]/$UUID.gz
      
      [OUTPUT]
          Name                s3
          Match               host.*
          Bucket              XXX
          region              eu-central-1
          total_file_size     1M
          use_put_object      On
          upload_timeout      60s
          Compression         gzip
          s3_key_format       /eks/kubelet/$UUID.gz
kind: ConfigMap
metadata: 
  creationTimestamp: "2022-05-11T09:29:55Z"
  labels: 
    app.kubernetes.io/name: fluentbit
    kustomize.toolkit.fluxcd.io/name: fluentbit
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: fluentbit-config
  namespace: kube-system
  resourceVersion: "4813"
  uid: 29f5a1ef-be5c-4d28-a71f-256677b5e1e2



@ChristianAlexander
Author

@PettitWesley, thanks for digging in! I'll give this a try without the env value and report back.

Did the 2.24.0 release add new validations or assertions that extraneous parameters should cause fluent bit to terminate?

@ChristianAlexander
Author

Confirmed, 2.24.0 didn't crash once the env was removed.

@PettitWesley
Contributor

@albertschwarzkopf This one is fun. It seems that in previous versions the config for the systemd input was not actually validated, so it was possible to set options that don't exist: fluent/fluent-bit@773581f

In previous versions I'm able to run that input with all sorts of random fake keys added.

https://docs.fluentbit.io/manual/pipeline/inputs/systemd

I think you need to use the filter parser with your parser to parse these logs: https://docs.fluentbit.io/manual/pipeline/filters/parser

@PettitWesley
Contributor

@bimp this is the issue I think:

[FILTER]
    Name record_modifier
    Match *
    Record build-number $${BUILDNUMBER}
    Reserve_Data True

Reserve_Data is a valid config key on filter parser, but not record modifier: https://docs.fluentbit.io/manual/pipeline/filters/record-modifier

@PettitWesley
Contributor

@stinney1103 @chester0 @nakulpathak3 Please see my comments above to see if you are facing the same config issue and please post your configurations.

@bimp

bimp commented May 12, 2022

@bimp this is the issue I think:

[FILTER]
    Name record_modifier
    Match *
    Record build-number $${BUILDNUMBER}
    Reserve_Data True

Reserve_Data is a valid config key on filter parser, but not record modifier: https://docs.fluentbit.io/manual/pipeline/filters/record-modifier

@PettitWesley thanks I'll try removing that. Would that explain the following Fargate task log error message I saw:
[image: screenshot of the Fargate task error message, not preserved]

@bimp

bimp commented May 13, 2022

@PettitWesley further log hunting revealed that you're probably right:

[2022/05/11 17:00:31] [ info] [fluent bit] version=1.9.3, commit=9eb4996b7d, pid=1
[2022/05/11 17:00:31] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/11 17:00:31] [ info] [cmetrics] version=0.3.1
[2022/05/11 17:00:31] [error] [lib] backend failed
[2022/05/11 17:00:31] [ info] [input:forward:forward.0] listening on unix:///var/run/fluent.sock
[2022/05/11 17:00:31] [ info] [input:forward:forward.1] listening on 127.0.0.1:24224
[2022/05/11 17:00:31] [ info] [input:tcp:tcp.2] listening on 127.0.0.1:8877
[2022/05/11 17:00:31] [error] [config] record_modifier: unknown configuration property 'Reserve_Data'. The following properties are allowed: record, remove_key, allowlist_key, and whitelist_key.
[2022/05/11 17:00:31] [ help] try the command: /fluent-bit/bin/fluent-bit -F record_modifier -h
[2022/05/11 17:00:31] [ info] [input] pausing forward.0
AWS for Fluent Bit Container Image Version 2.24.0

The question is: why did this happen now? I've had this incorrect option forever. It completely blocked the Fargate service from running, so I'm curious why it is now failing so catastrophically.

@PettitWesley
Contributor

@bimp Seems the answer is the same as for the systemd plugin (see farther back in the comment stream on this issue): config validation was missing previously and was only just added in fluent/fluent-bit@fbe829e
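
Since unknown keys now stop Fluent Bit at startup, a pre-deploy check can catch them earlier. Below is a rough Python sketch, not Fluent Bit's actual validator: it parses a classic-format config and flags keys outside an allow-list. The `record_modifier` list is copied from the error message earlier in this thread. The `GENERIC` set and the function names are assumptions for illustration, and plugins without a known list are skipped.

```python
# Allowed keys copied from the record_modifier error message in this thread.
# Plugins missing from this table are not checked.
ALLOWED = {
    "record_modifier": {"record", "remove_key", "allowlist_key", "whitelist_key"},
}
# Assumption: keys accepted by every filter. Real Fluent Bit accepts more.
GENERIC = {"name", "match"}


def parse_sections(text):
    """Split a classic-format config into (section type, [(key, value), ...])."""
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "@")):
            continue  # skip blanks, comments, and @INCLUDE-style directives
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1].upper(), []))
        elif sections:
            parts = line.split(None, 1)
            sections[-1][1].append((parts[0], parts[1] if len(parts) > 1 else ""))
    return sections


def unknown_keys(text):
    """Return (plugin, key) pairs for keys not in the plugin's allow-list."""
    problems = []
    for _section, pairs in parse_sections(text):
        plugin = next((v.lower() for k, v in pairs if k.lower() == "name"), None)
        if plugin not in ALLOWED:
            continue
        valid = ALLOWED[plugin] | GENERIC
        problems += [(plugin, k) for k, _ in pairs if k.lower() not in valid]
    return problems


CONFIG = """
[SERVICE]
    Parsers_File /fluent-bit/parsers/parsers.conf
    Flush 1
    Grace 30

[FILTER]
    Name parser
    Match *
    Key_Name log
    Parser json
    Reserve_Data True

[FILTER]
    Name record_modifier
    Match *
    Record build-number $${BUILDNUMBER}
    Reserve_Data True
"""
print(unknown_keys(CONFIG))
```

Run against the fluentbit.conf posted above, this should report only the `Reserve_Data` key on the `record_modifier` filter, matching the error in the container log.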

@hsg944

hsg944 commented May 13, 2022

Below are the valid keys for datadog:

Host, TLS, compress, apikey, Proxy, provider, json_date_key, include_tag_key, tag_key, dd_service, dd_source, dd_tags, dd_message_key. Any other key will result in an error. If you need to add env tags, they can be part of dd_tags.
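
As a rough illustration of the list above, this Python sketch checks a FireLens `logConfiguration` block against those keys before deploying. It makes some assumptions: keys are compared case-insensitively, `Name` is allowed because it selects the plugin, and `secretOptions` names are treated as output keys too. The function name `unknown_option_keys` is hypothetical.

```python
# Keys listed as valid for the datadog output in this thread, plus "name",
# which selects the plugin. Compared in lowercase (assumption: case-insensitive).
DATADOG_ALLOWED = {
    "name", "host", "tls", "compress", "apikey", "proxy", "provider",
    "json_date_key", "include_tag_key", "tag_key",
    "dd_service", "dd_source", "dd_tags", "dd_message_key",
}


def unknown_option_keys(log_configuration, allowed=DATADOG_ALLOWED):
    """Return option and secretOption names that are not in the allow-list."""
    keys = list(log_configuration.get("options", {}))
    keys += [s["name"] for s in log_configuration.get("secretOptions", [])]
    return sorted(k for k in keys if k.lower() not in allowed)


# The logConfiguration from the original report (secret ARN shortened).
log_configuration = {
    "logDriver": "awsfirelens",
    "secretOptions": [{"valueFrom": "arn:aws:secretsmanager:...", "name": "apikey"}],
    "options": {
        "dd_message_key": "log",
        "provider": "ecs",
        "dd_service": "xxxx",
        "Host": "http-intake.logs.datadoghq.com",
        "TLS": "on",
        "dd_source": "ecs",
        "dd_tags": "env:staging",
        "env": "staging",
        "Name": "datadog",
    },
}
print(unknown_option_keys(log_configuration))  # prints ['env']
```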

@kylenas

kylenas commented May 17, 2022

Ran into this issue today. I had to remove exclude-pattern and include-pattern from my log config to get the latest container to start. Both should still be valid config options.

@PettitWesley
Contributor

@kylenas Can you share the working vs not working task def and config please?

@nakulpathak3

@PettitWesley Not sure why the stable image was updated with this issue open. This is easily reproducible with the aws-for-fluent-bit sidecar container and has 28 comments, so something is clearly broken. Now it looks like the stable image has also been broken. Can you please revert at least the broken stable image?

@aashitvyas

aashitvyas commented Jun 20, 2022

Hi, we got hit with this issue today and our services suddenly stopped working.

We are currently using the stable version of the fluent-bit image.

Below is our logging configuration section.

  "logDriver" : "awsfirelens",
      "options" : {
        "Name" : "datadog",
        "Region" : data.aws_region.current.name,
        "Host" : "http-intake.logs.datadoghq.com",
        "dd_service" : "test",
        "dd_source" : "test",
        "dd_message_key" : "log",
        "dd_tags" : "Env:${var.env}",
        "TLS" : "on",
        "provider" : "ecs"
      },
      "secretOptions" : [
        { "name" : "apiKey", "valueFrom" : "${test}:api_key::" }
      ]
 {
      "essential" : true,
      "cpu" : 0,
      "environment" : [],
      "mountPoints" : [],
      "portMappings" : [],
      "volumesFrom" : [],
      "user" : "0",
      "image" : "amazon/aws-for-fluent-bit:stable",
      "name" : "log_router",
      "firelensConfiguration" : {
        "type" : "fluentbit",
        "options" : { "enable-ecs-log-metadata" : "true", "config-file-type" : "file",
        "config-file-value" : "/fluent-bit/configs/parse-json.conf" }
      }

Looks like I have the exact same log config as @ChristianAlexander without the env value, and it's still failing with exit code 255 on AWS Fargate.

@PettitWesley
Contributor

@aashitvyas I think Region is not a valid option for datadog.

@aashitvyas

@PettitWesley That worked! Thank you.

@jacob-gravie

Any updates here? We are seeing immediate 255 exit codes on both stable and latest versions of this image.

@PettitWesley
Contributor

@jacob-gravie Can you please open a new issue for your problem, and please check out #491?
