Latest and Stable versions failing due to invalid config options #371

Open
discovery-NukulSharma opened this issue Jun 17, 2022 · 18 comments

Comments

@discovery-NukulSharma

discovery-NukulSharma commented Jun 17, 2022

Since yesterday we have noticed that the stable version is failing. The previous version works fine if we pin to its specific tag.

Failed

  • public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
  • public.ecr.aws/aws-observability/aws-for-fluent-bit:latest

Passed - older version

  • public.ecr.aws/aws-observability/aws-for-fluent-bit:2.23.3
@discovery-NukulSharma discovery-NukulSharma changed the title Latest and Stable version have started failing Latest and Stable versions have started failing Jun 17, 2022
@JasonIamAUnixAdmin

Same here. Using the parameters provided by AWS causes the container to fail to start with a 255 error, taking the rest of the task with it. When past default values cause crashes, this should not be in "stable".

@zhonghui12
Contributor

@discovery-NukulSharma @JasonIamAUnixAdmin Thanks for the report. We are working on finding the root cause.

@zhonghui12
Contributor

Also @discovery-NukulSharma @JasonIamAUnixAdmin, could you please explain more about what "failing" means? And could either of you confirm whether our Docker Hub images have the same problem, or whether the issue only happens with public ECR? https://hub.docker.com/r/amazon/aws-for-fluent-bit/tags

@JasonIamAUnixAdmin

Failing: the log container exits with a 255 error code, and since it is tagged as essential it takes the whole task out.

Pinning to the old "stable" image, 2.23.3, fixes the issue; using the current stable has the same issue.
We use this to send logs to DataDog, and the error seems related to #348.

It seems that our requirement of passing in DD_LOGS_CONFIG_LOGS_DD_URL, which is needed to change the endpoint used to ship logs, might be the culprit.

@zhonghui12
Contributor

zhonghui12 commented Jun 17, 2022

@JasonIamAUnixAdmin I see, so the root cause is that the latest few versions, 2.25.1 and 2.26.0, have this unfixed bug in them, right? Yesterday we advanced our stable version, so I think that might be the problem.

@JasonIamAUnixAdmin

Correct. We changed to :stable, rather than the default of :latest provided by the tools, to get away from this kind of breaking change.

And since we are messing with logging, I understand why it is hard to get the "why" out to show us, but man, is this painful to debug on the user end.

@zhonghui12
Contributor

zhonghui12 commented Jun 17, 2022

@JasonIamAUnixAdmin so from #348 we can see that Fluent Bit 1.9 starts to do some input validation. Can you please share your config so that we can figure out which field is not allowed?

also @discovery-NukulSharma

@JasonIamAUnixAdmin

Here is one task:

 "logConfiguration": {
        "logDriver": "awsfirelens",
        "secretOptions": [
          {
            "valueFrom": "arn:aws:secretsmanager:us-east-1:XXX:secret:YYY",
            "name": "apikey"
          }
        ],
        "options": {
          "provider": "ecs",
          "dd_service": "peach_scheduled",
          "dd_source": "peach",
          "DD_LOGS_CONFIG_LOGS_DD_URL": "tcp-encrypted-intake.logs.datadoghq.com:10516",
          "Name": "datadog"
        }
      },

And another:

     "logConfiguration": {
        "logDriver": "awsfirelens",
        "secretOptions": [
          {
            "valueFrom": "arn:aws:secretsmanager:us-east-1:XXX:secret:YYY",
            "name": "apikey"
          }
        ],
        "options": {
          "DD_DJANGO_DATABASE_SERVICE_NAME_PREFIX": "health-",
          "exclude-pattern": "/health_check/",
          "provider": "ecs",
          "dd_service": "health-staging-web",
          "dd_source": "health-staging",
          "DD_LOGS_CONFIG_LOGS_DD_URL": "tcp-encrypted-intake.logs.datadoghq.com:10516",
          "Name": "datadog"
        }
      },

@zhonghui12
Contributor

zhonghui12 commented Jun 17, 2022

Thanks @JasonIamAUnixAdmin. According to the public docs at https://docs.fluentbit.io/manual/pipeline/outputs/datadog, not only DD_LOGS_CONFIG_LOGS_DD_URL but also DD_DJANGO_DATABASE_SERVICE_NAME_PREFIX and exclude-pattern are invalid, because of the new validation in the datadog output: fluent/fluent-bit@0e94b10. Those should be removed to make Fluent Bit work from Fluent Bit 1.9 onward.

@JasonIamAUnixAdmin

All the options in use came from DD in the past.

Looks like HOST replaces the logs URL.
I don't see anything like the exclude pattern, however.

@zhonghui12
Contributor

@JasonIamAUnixAdmin sorry, I was wrong. exclude-pattern and include-pattern are generated by FireLens, and I've verified that these two should still work: https://docs.aws.amazon.com/AmazonECS/latest/userguide/firelens-filtering-logs.html.

Thanks.

@nakulpathak3

nakulpathak3 commented Jun 20, 2022

Can you please revert the broken stable image while you're debugging? Our tasks cannot start, and we now have to roll out a new definition, pinned to an older tag from before this release.

@zhonghui12
Contributor

zhonghui12 commented Jun 20, 2022

@nakulpathak3 apologies, but so far we have not been able to conclude that the stable version is broken. The root cause here is an invalid task definition, because Fluent Bit 1.9 adds more input validation. Users control their task definition configs and should be able to make them work.

To be more specific, a revert cannot resolve the issue because Fluent Bit will not revert its validation, which means that from Fluent Bit 1.9 onward the restriction will always be there. I recommend updating your task definition config; if you have any problems, you can post them here and we are always willing to help.

Thanks

@james-skinner-deltatre

We were also hit by this. Logs confirm it's a config validation issue:

2022-06-21 11:56:37AWS for Fluent Bit Container Image Version 2.25.1
...
2022-06-21 11:56:37[2022/06/21 09:56:37] [error] [config] datadog: unknown configuration property 'dd_env'. The following properties are allowed: compress, apikey, dd_service, dd_source, dd_tags, proxy, include_tag_key, tag_key, dd_message_key, provider, and json_date_key.
2022-06-21 11:56:37[2022/06/21 09:56:37] [ help] try the command: /fluent-bit/bin/fluent-bit -o datadog -h
2022-06-21 11:56:37[2022/06/21 09:56:37] [ info] [fluent bit] version=1.9.3, commit=eb4e2e770f, pid=1
2022-06-21 11:56:37[2022/06/21 09:56:37] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
2022-06-21 11:56:37[2022/06/21 09:56:37] [ info] [cmetrics] version=0.3.1

As said, reverting the :stable tag does not make sense, but this bump could be counted as a major version change, since moving to this version breaks things with no other config changes.
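
For anyone who wants to catch this before deploying, here is a minimal sketch of a pre-flight check. The allowed set is copied from the 1.9.3 error message above, so it may differ in other Fluent Bit versions. The `GENERIC_KEYS` set and the case-insensitive comparison are my assumptions about how Fluent Bit treats common output keys, and `unknown_datadog_options` is a hypothetical helper, not part of any AWS or Fluent Bit tooling.

```python
# Properties listed as allowed in the Fluent Bit 1.9.3 error message above.
DATADOG_ALLOWED = {
    "compress", "apikey", "dd_service", "dd_source", "dd_tags", "proxy",
    "include_tag_key", "tag_key", "dd_message_key", "provider", "json_date_key",
}

# Assumption: generic output keys that Fluent Bit handles outside the plugin list.
GENERIC_KEYS = {"name", "match", "host", "port", "tls"}

# Handled by FireLens itself, per the ECS documentation linked earlier in the thread.
FIRELENS_KEYS = {"exclude-pattern", "include-pattern"}


def unknown_datadog_options(log_configuration):
    """Return the option names the datadog output would likely reject, sorted."""
    names = list(log_configuration.get("options", {}))
    # secretOptions entries are passed as options too, so check their names as well.
    names += [s["name"] for s in log_configuration.get("secretOptions", [])]
    known = DATADOG_ALLOWED | GENERIC_KEYS | FIRELENS_KEYS
    # Assumption: property names are compared case-insensitively.
    return sorted(n for n in names if n.lower() not in known)


# Example using the shape of the task definitions posted earlier in the thread.
example = {
    "logDriver": "awsfirelens",
    "secretOptions": [{"valueFrom": "arn:aws:secretsmanager:...", "name": "apikey"}],
    "options": {
        "provider": "ecs",
        "dd_service": "my-service",
        "dd_env": "prod",
        "exclude-pattern": "/health_check/",
        "Name": "datadog",
    },
}
print(unknown_datadog_options(example))
```

Running something like this against each `logConfiguration` block in CI would flag a key such as `dd_env` before the container exits with 255 at startup.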

@PettitWesley
Contributor

My perspective on this recent issue is that I fully understand the pain, and it's unfortunate that folks were impacted by the addition of config validation. However, fixing config validation is a good thing, and it's not a backwards-breaking change. The config options that all of you had in your configs never had any effect, so no features were removed. I think of the Fluent Bit config options as an API contract. Imagine you were calling a service API without adhering to its contract: failing to validate and reject those requests would be a bug, IMO.

@PettitWesley PettitWesley changed the title Latest and Stable versions have started failing Latest and Stable versions failing due to invalid config options Jun 21, 2022
@james-skinner-deltatre

Well, it's absolutely a contract, but I don't agree that it is a non-breaking change to that contract. If I made a similar change to a REST API without a major version bump and brought down production, that's on me, not the consumer of the API.

Even so, it is a positive change to make, and the real problem here is people (us included) not pinning the image version and expecting nothing to ever change under them.
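
As a rough illustration of the pinning point, a small check like the one below could flag image references that float. It is only a sketch: `is_floating` and `FLOATING_TAGS` are hypothetical names, and the tag set covers just the two tags named in this issue.

```python
FLOATING_TAGS = {"latest", "stable"}  # the two floating tags named in this issue


def is_floating(image):
    """Return True if the image reference has no tag or uses a floating tag."""
    if "@" in image:
        return False  # pinned by digest
    # Only look after the last "/" so a registry port is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True  # no tag given, so Docker resolves it to "latest"
    return last_segment.rsplit(":", 1)[1] in FLOATING_TAGS


for ref in (
    "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
    "public.ecr.aws/aws-observability/aws-for-fluent-bit:2.23.3",
):
    print(ref, "floating" if is_floating(ref) else "pinned")
```

A check like this only tells you that a tag can move; choosing which specific version to pin to is still a human decision.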

@PettitWesley
Contributor

@james-skinner-deltatre has a good point at the end of his comment that I strongly agree with and that folks should check out: it is best to lock to a specific version tag. Rather than locking to our latest or stable, have a human check these files and then pick that version:

@JasonIamAUnixAdmin

For folks coming here having issues with DataDog: the Host directive is no longer needed for people with HIPAA requirements. They just didn't tell us. Details are here: https://docs.datadoghq.com/data_security/logs/#configuration-requirements-for-hipaa-enabled-customers

6 participants