Fluent Bit 1.6 - ES Plugin: Failed to source credential on Amazon EKS IAM Roles for Service Account #2714

Closed
yantk-hk opened this issue Oct 21, 2020 · 14 comments
Labels: AWS - Issues with AWS plugins or experienced by users running on AWS

@yantk-hk

Bug Report

Describe the bug
Fluent Bit 1.6 - ES plugin: keeps sourcing credentials from the EC2 instance profile rather than from the IAM Role for Service Accounts (IRSA) on the Amazon EKS worker node.

To Reproduce

  1. Create an Amazon Elasticsearch domain version 7.7 with open access
  2. Create a service account in the EKS cluster with IAM Roles for Service Accounts (IRSA) and the corresponding AWS IAM policies (e.g. es:*)
  3. Upgrade Fluent Bit from 1.5 to 1.6 and keep using the existing configuration in the EKS ConfigMap:

```
    Name            es
    Match           kube.*
    Host            amazon-es-domain.ap-southeast-1.es.amazonaws.com
    Port            443
    TLS             On
    Logstash_Format On
    Logstash_Prefix eks-cluster-1
    Retry_Limit     10
    AWS_Auth        On
    AWS_Region      ap-southeast-1
    Generate_ID     On
    Replace_Dots    On
```

  4. The following error is shown in the Fluent Bit stdout/logs in EKS. The error messages keep appearing and show that Fluent Bit keeps sourcing AWS credentials from the underlying EKS worker node (EC2 instance) rather than from the annotated EKS IAM Role for Service Accounts (IRSA).

```
[2020/10/16 09:52:24] [error] [output:es:es.3] HTTP status=403 URI=/_bulk, response: {"error":{"root_cause":[{"type":"security_exception","reason":"no permissions for [indices:data/write/bulk] and User [name=arn:aws:iam::XXX873347XXX:role/eksctl-cluster-1-nodegroup-ng-al1-NodeInstanceRole-7GZZR0O6HRQS, backend_roles=[arn:aws:iam::XXX873347XXX:role/eksctl-cluster-1-nodegroup-ng-al1-NodeInstanceRole-7GZZR0O6HRQS], requestedTenant=null]"}],"type":"security_exception","reason":"no permissions for [indices:data/write/bulk] and User [name=arn:aws:iam::XXX873347XXX:role/eksctl-cluster-1-nodegroup-ng-al1-NodeInstanceRole-7GZZR0O6HRQS, backend_roles=[arn:aws:iam::XXX873347XXX:role/eksctl-cluster-1-nodegroup-ng-al1-NodeInstanceRole-7GZZR0O6HRQS], requestedTenant=null]"},"status":403}
```


**Expected behavior**
Source AWS credentials using the EKS IAM Role for Service Accounts (IRSA), not the underlying worker node's instance role.

@Angry-Potato

Seeing the exact same problem. I installed using the Helm chart from the Google stable repo; here are the manifests that end up in the cluster:

```
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::my-acc:role/fluent-bit
    meta.helm.sh/release-name: fluent-bit
    meta.helm.sh/release-namespace: logging
  labels:
    app: fluent-bit
    app.kubernetes.io/managed-by: Helm
    chart: fluent-bit-2.10.1
    heritage: Helm
    release: fluent-bit
  name: fluent-bit
  namespace: logging
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  labels:
    app: fluent-bit
    controller-revision-hash: 7d55f48cd8
    pod-template-generation: "1"
    release: fluent-bit
  name: fluent-bit-2g698
  namespace: logging
spec:
  containers:
  - env:
    - name: AWS_DEFAULT_REGION
      value: eu-west-1
    - name: HOSTNAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::my-acc:role/fluent-bit
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: fluent/fluent-bit:1.6-debug
    imagePullPolicy: Always
    name: fluent-bit
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/log
      name: varlog
    - mountPath: /var/lib/docker/containers
      name: varlibdockercontainers
      readOnly: true
    - mountPath: /fluent-bit/etc/fluent-bit.conf
      name: config
      subPath: fluent-bit.conf
    - mountPath: /fluent-bit/etc/fluent-bit-service.conf
      name: config
      subPath: fluent-bit-service.conf
    - mountPath: /fluent-bit/etc/fluent-bit-input.conf
      name: config
      subPath: fluent-bit-input.conf
    - mountPath: /fluent-bit/etc/fluent-bit-filter.conf
      name: config
      subPath: fluent-bit-filter.conf
    - mountPath: /fluent-bit/etc/fluent-bit-output.conf
      name: config
      subPath: fluent-bit-output.conf
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: fluent-bit-token-65zvs
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: fluent-bit
  serviceAccountName: fluent-bit
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  - hostPath:
      path: /var/log
      type: ""
    name: varlog
  - hostPath:
      path: /var/lib/docker/containers
      type: ""
    name: varlibdockercontainers
  - configMap:
      defaultMode: 420
      name: fluent-bit-config
    name: config
  - name: fluent-bit-token-65zvs
    secret:
      defaultMode: 420
      secretName: fluent-bit-token-65zvs
```

and the fluent-bit config:

```
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_Tag_Prefix     kube.var.log.containers.
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token

    Merge_Log           On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

[OUTPUT]
    Name  es
    Match *
    Host  my-es-domain.eu-west-1.es.amazonaws.com
    Port  443
    Logstash_Format On
    Retry_Limit False
    Type  _doc
    Time_Key @timestamp
    Replace_Dots On
    Logstash_Prefix my-domain
    AWS_Auth On
    AWS_Region eu-west-1
    tls On
```

and the fluent-bit logs:

```
Fluent Bit v1.6.1
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/10/21 13:51:00] [ info] [engine] started (pid=1)
[2020/10/21 13:51:00] [ info] [storage] version=1.0.6, initializing...
[2020/10/21 13:51:00] [ info] [storage] in-memory
[2020/10/21 13:51:00] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/10/21 13:51:00] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2020/10/21 13:51:00] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2020/10/21 13:51:00] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2020/10/21 13:51:00] [ info] [filter:kubernetes:kubernetes.0] API server connectivity OK
[2020/10/21 13:51:00] [ warn] net_tcp_fd_connect: getaddrinfo(host=''): Name or service not known
[2020/10/21 13:51:00] [error] [io] connection #41 failed to: :443
[2020/10/21 13:51:00] [ info] [sp] stream processor started
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=11550335 watch_fd=1 name=/var/log/containers/alertmanager-kube-prometheus-stack-alertmanager-0_monitoring_alertmanager-bbd78510252b994ae61670696334be15c0f531b091829a450137a319e88a4178.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=9452023 watch_fd=2 name=/var/log/containers/alertmanager-kube-prometheus-stack-alertmanager-0_monitoring_config-reloader-2c0eca22b0b789b324075ec787fea4e4c9cc4e05ec902702155dce64ac2315b5.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2146617 watch_fd=3 name=/var/log/containers/aws-node-jnh68_kube-system_aws-node-f5d7f8524182bd4e58639665e66c4bef88c1d341147002b6277115a8287c0fd7.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=44106514 watch_fd=4 name=/var/log/containers/aws-node-termination-handler-r228s_kube-system_aws-node-termination-handler-0406067f9f3385355f8aabfb09cd2ae4f1f8050b40118e37dda7cfc424a748d8.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=6564993 watch_fd=5 name=/var/log/containers/cert-manager-6f657bd884-qzz8b_cert-manager_cert-manager-ca6753d0b6f36ff3d1296ecbd8dbb98e4a73f36c7a0fcbcbaa7813874fcb9e35.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=19931777 watch_fd=6 name=/var/log/containers/cert-manager-webhook-cdb5c8884-fm4ll_cert-manager_cert-manager-d7cf10a24e970c39f31f559ba5597d03cd9bd7cde26b855762a488a8e3e706e4.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=22025337 watch_fd=7 name=/var/log/containers/cluster-autoscaler-chart-aws-cluster-autoscaler-chart-56776q4lj_kube-system_aws-cluster-autoscaler-chart-b1176a841aef00ed3607c8b5b3d281174f2aa5882fd325f1bbf0a7ef299fbc6d.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=27283613 watch_fd=8 name=/var/log/containers/external-dns-6bcd486cbb-mfnc9_default_external-dns-cec5f935de2f6b33e58990fa2fcac283c802a06029ebef84955a78700fe0a8e8.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=25177292 watch_fd=9 name=/var/log/containers/ingress-nginx-controller-66dc9984d8-lvgbl_default_controller-3c2f1518500ad04125eb4ddae521736835d5954593ea0225a1914fa7e71cd68f.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=37753150 watch_fd=10 name=/var/log/containers/kube-prometheus-stack-kube-state-metrics-66789f8885-55p7v_monitoring_kube-state-metrics-c652ec07303dab30431dacb8af3ee662b262d4c6f0d7c74d2b5c0208630a6009.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=6563946 watch_fd=11 name=/var/log/containers/kube-prometheus-stack-operator-f4c99ffb7-7kcqg_monitoring_kube-prometheus-stack-517db77d62039dc269d655441517e58af09590490547eb874ae5ad4ba4d44fa5.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=39868468 watch_fd=12 name=/var/log/containers/kube-prometheus-stack-operator-f4c99ffb7-7kcqg_monitoring_tls-proxy-b3730cf541be60158c2c7f82f013a3b9a44582ae4f8fc5c43e52c5143630fc30.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=13651330 watch_fd=13 name=/var/log/containers/kube-prometheus-stack-prometheus-node-exporter-nthzc_monitoring_node-exporter-fef57c689fd7c0bd5771eeb0f4b6dcb626707e202afdf852aaedfd2416df9a0d.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=20973204 watch_fd=14 name=/var/log/containers/kube-proxy-hshn9_kube-system_kube-proxy-ceab9231e552b168ea86a73c00a1433b1be0e8696698acf0a65311690e0a51d8.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=30578779 watch_fd=15 name=/var/log/containers/node-problem-detector-6vxjl_kube-system_node-problem-detector-f57b63acf906730858ee06d2fead9cffe5dfabc2472ed4219ec9307a2423e1ea.log
[2020/10/21 13:51:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=17841555 watch_fd=16 name=/var/log/containers/fluent-bit-6bwdd_logging_fluent-bit-afed86fbe22f5309d7754a263daadcd8eca6677b57548849a822a66c6718904b.log
[2020/10/21 13:51:01] [error] [output:es:es.0] HTTP status=403 URI=/_bulk, response:
{"Message":"User: arn:aws:sts::my-acc:assumed-role/my-cluster20200904094215498700000007/i-0bc48f7cbce18e8e6 is not authorized to perform: es:ESHttpPost"}

The role mentioned in that last log statement is the instance / node profile, the same issue described by the OP.
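
To spell out the fallback: AWS credential resolution is a chain of providers tried in order, so when the web-identity/STS (IRSA) step fails, the chain falls through to the EC2 instance profile, which is why the node role shows up in the 403 above. Below is a minimal, self-contained C sketch of that chain idea; it is illustrative only, not the actual Fluent Bit provider code, and the provider names and ordering are assumptions made for the example.

```
#include <stddef.h>
#include <stdio.h>

/* Illustrative credential-provider chain: each provider either yields a
 * role name or fails (returns NULL). The chain returns the first success,
 * so a broken STS/IRSA provider silently falls through to the EC2
 * instance profile. */
static const char *env_provider(void)      { return NULL; }  /* no static keys set */
static const char *profile_provider(void)  { return NULL; }  /* no ~/.aws/credentials */
static const char *irsa_sts_provider(void) { return NULL; }  /* STS call fails */
static const char *ec2_imds_provider(void) { return "node-instance-role"; }

typedef const char *(*provider_fn)(void);

static const char *resolve_credentials(void)
{
    provider_fn chain[] = { env_provider, profile_provider,
                            irsa_sts_provider, ec2_imds_provider };
    for (size_t i = 0; i < sizeof(chain) / sizeof(chain[0]); i++) {
        const char *creds = chain[i]();
        if (creds != NULL) {
            return creds;
        }
    }
    return NULL;
}

int main(void)
{
    /* Prints "resolved: node-instance-role": the chain fell back to the
     * instance metadata provider because the IRSA/STS step returned nothing. */
    printf("resolved: %s\n", resolve_credentials());
    return 0;
}
```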

@Angry-Potato

I have made progress on the issue by following this advice to block access to the node role; now the fluent-bit logs read:

```
[2020/10/21 14:08:12] [error] [aws_credentials] Could not read shared credentials file /root/.aws/credentials
[2020/10/21 14:08:12] [error] [aws_credentials] Failed to retrieve credentials for AWS Profile default
[2020/10/21 14:08:12] [ warn] net_tcp_fd_connect: getaddrinfo(host=''): Name or service not known
[2020/10/21 14:08:12] [error] [io] connection #54 failed to: :443
[2020/10/21 14:08:12] [error] [aws_client] connection initialization error
[2020/10/21 14:08:12] [error] [aws_credentials] STS assume role request failed
[2020/10/21 14:08:12] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2020/10/21 14:08:12] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2020/10/21 14:08:12] [error] [signv4] Provider returned no credentials, service=es
[2020/10/21 14:08:12] [error] [output:es:es.0] could not sign request with sigv4
[2020/10/21 14:08:12] [ warn] [engine] failed to flush chunk '1-1603288814.10956238.flb', retry in 583 seconds: task_id=91, input=tail.0 > output=es.0
```

@Angry-Potato

A noble stranger on the provider-aws Kubernetes Slack channel gave me a workaround that fixes this issue for both of us: specify AWS_STS_Endpoint in the [OUTPUT] config:

```
[OUTPUT]
    Name  es
    Match *
    Host  my-es-domain.eu-west-1.es.amazonaws.com
    Port  443
    Logstash_Format On
    Retry_Limit False
    Type  _doc
    Time_Key @timestamp
    Replace_Dots On
    Logstash_Prefix my-domain
    AWS_Auth On
    AWS_Region eu-west-1
    # <-- note: the region in the STS endpoint may be different for you
    AWS_STS_Endpoint https://sts.eu-west-1.amazonaws.com
    tls On
```

@donomur

donomur commented Oct 21, 2020

Hi there! I am the 'noble stranger' mentioned above. 😅 Apologies for not filing the bug beforehand; I thought it was just something weird with the AWS account I was using.

Anyhow, I see that no one has posted debug logs for this yet, so I'll post this snippet from when I ran into this issue last week, since that's what led me down the STS endpoint config path:

```
[2020/10/15 00:50:39] [debug] [aws_credentials] Init called on the EKS provider
[2020/10/15 00:50:39] [debug] [aws_credentials] Calling STS..
[2020/10/15 00:50:39] [ warn] net_tcp_fd_connect: getaddrinfo(host=''): Name or service not known
[2020/10/15 00:50:39] [error] [io] connection #39 failed to: :443
[2020/10/15 00:50:39] [debug] [upstream] connection #39 failed to :443
[2020/10/15 00:50:39] [debug] [aws_client] connection initialization error
[2020/10/15 00:50:39] [debug] [aws_credentials] STS assume role request failed
```

It may also be worth noting that I am using the amazon/aws-for-fluent-bit image that uses Fluent Bit 1.6

PettitWesley self-assigned this Oct 21, 2020
PettitWesley added the AWS label Oct 21, 2020
@PettitWesley
Contributor

I think this is probably a bug... IAM Roles for SA calls STS... we made a change to the STS endpoint code to enable custom endpoints.

I bet there's a bug there...

@hoegertn

hoegertn commented Oct 22, 2020

I can confirm that this also happens on ECS/Fargate with FireLens.

Setting AWS_STS_Endpoint helps.

@PettitWesley
Contributor

@hoegertn Are you specifying an IAM role with the aws_role_arn parameter?

I'm about to put up a PR to fix this... basically calling STS is broken (which happens if you use EKS IRSA or a custom role).

@hoegertn

Yes, I am assuming a role that has ES permissions. As you mentioned, the STS call is broken because it does not know the hostname to contact.

@PettitWesley
Contributor

PettitWesley commented Oct 22, 2020

Yeah, basically it's because the config map sets "" as the default for aws_sts_endpoint instead of NULL. This leads the code to incorrectly think that there is a custom STS endpoint, and then Fluent Bit tries to make a request to "".

https://github.com/fluent/fluent-bit/blob/master/plugins/out_es/es.c#L804

At least that's what I'm testing right now..
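
For anyone following along, here is a minimal, self-contained C sketch of the failure mode described above. It is not the actual patch, just an illustration of why an empty-string default behaves differently from NULL when deciding whether a custom STS endpoint was configured; the helper names are made up for this example.

```
#include <stdio.h>

/* Buggy check: a NULL-only test treats the "" default as a user-supplied
 * custom endpoint, so the client ends up connecting to host "", which is
 * the getaddrinfo(host='') error in the logs above. */
static void print_endpoint_buggy(const char *configured, const char *region)
{
    if (configured != NULL) {
        printf("custom endpoint: '%s'\n", configured);
    }
    else {
        printf("default endpoint: https://sts.%s.amazonaws.com\n", region);
    }
}

/* Fixed check: treat NULL and "" the same, i.e. "no custom endpoint". */
static void print_endpoint_fixed(const char *configured, const char *region)
{
    if (configured != NULL && configured[0] != '\0') {
        printf("custom endpoint: '%s'\n", configured);
    }
    else {
        printf("default endpoint: https://sts.%s.amazonaws.com\n", region);
    }
}

int main(void)
{
    print_endpoint_buggy("", "eu-west-1");  /* prints: custom endpoint: '' */
    print_endpoint_fixed("", "eu-west-1");  /* prints the regional default */
    return 0;
}
```

This is also why explicitly setting AWS_STS_Endpoint works around the problem: the configured value is then genuinely non-empty.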

PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Oct 22, 2020
edsiper pushed a commit that referenced this issue Oct 22, 2020
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Oct 22, 2020
edsiper pushed a commit that referenced this issue Oct 23, 2020
@PettitWesley
Contributor

This was fixed in 1.6.2

AWS for Fluent Bit has not been updated yet since we are still trying to fix #2715

@ypicard

ypicard commented Oct 21, 2021

Is this really fixed?

@PettitWesley
Contributor

@ypicard Yes. Please open a new issue if you are having credential issues: https://github.com/aws/aws-for-fluent-bit

@tejarora

2 years have passed, and the issue still exists.
Installed 2.1.8 using Helm.
fluent-bit is unable to use the role in the service account (IRSA) and defaults to the node role.
Uninstalled and re-installed with AWS_STS_Endpoint in the [OUTPUT] es section, and that made no difference at all.

```
[2023/08/23 06:18:30] [error] [output:es:es.0] HTTP status=403 URI=/_bulk, response: {"Message":"User: arn:aws:sts::xxxx:assumed-role/xxxx/i-084a7914b30b44399 is not authorized to perform: es:ESHttpPost because no identity-based policy allows the es:ESHttpPost action"}
```

@bgarcial

@tejarora It is also somehow happening for me: I am getting STS assume role request failed because the pod looks for a different token file path. I've described it in the link above.
