DNS resolution timeout/failure in 1.8.9 #4260

Closed
stevenarvar opened this issue Oct 29, 2021 · 28 comments

@stevenarvar

Bug Report

Describe the bug
Hi, I am seeing DNS resolution timeouts/failures on 1.8.9 when sending logs to Stackdriver.

To Reproduce

  • upgrade from 1.8.3 to 1.8.9
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [engine] chunk '1-1635532763.608049522.flb' cannot be retried: task_id=41, input=standard_log_emitted > output=stackdriver.1
[2021/10/29 18:43:41] [ warn] [input] emitter.8 paused (mem buf overlimit)
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01

Your Environment

  • Version used: 1.8.9
  • Configuration: stackdriver output and tail input
  • Environment name and version (e.g. Kubernetes? What version?): K8S 1.19
  • Filters and plugins: stackdriver & tail

Additional context
Some fluent-bit pods eventually logged errors such as "Resource temporarily unavailable" and gave up:

[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [src/flb_http_client.c:1172 errno=11] Resource temporarily unavailable
[2021/10/29 18:43:58] [ warn] [output:stackdriver:stackdriver.1] http_do=-1
[2021/10/29 18:43:58] [error] [src/flb_http_client.c:1172 errno=11] Resource temporarily unavailable
[2021/10/29 18:43:58] [ warn] [output:stackdriver:stackdriver.1] http_do=-1 
@stevenarvar
Author

Issue could be related to #4050

@matthewfala
Contributor

Please also see: #4257

@sdwerwed

sdwerwed commented Nov 15, 2021

I have a similar issue using fluent/fluent-bit:1.8.9-debug: fluent-bit cannot resolve the headless service in AKS to forward the logs to the fluentd StatefulSet. The fluent/fluent-bit:1.8.4-debug image does not give those errors, so unfortunately I have to downgrade until there is a fix for this bug. I tested nslookup inside the fluent-bit container and it still cannot resolve the name; nslookup from an ubuntu image works. is there any chance

fluentbit logs:

[2021/11/15 15:14:49] [ warn] [engine] failed to flush chunk '1-1636989288.537725817.flb', retry in 10 seconds: task_id=1, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:50] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:50] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:50] [ warn] [engine] failed to flush chunk '1-1636989289.606633533.flb', retry in 8 seconds: task_id=3, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.988513419.flb', retry in 8 seconds: task_id=8, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.511459650.flb', retry in 7 seconds: task_id=4, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.704430196.flb', retry in 7 seconds: task_id=6, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.1] no upstream connections available

@bensta

bensta commented Nov 22, 2021

I get the same error with the Elasticsearch output when configuring it with the Cloud_ID and Cloud_Auth config in both Minikube and AKS.
I tried with multiple versions:
1.8.x series: I get the Domain Not Found error described above
1.7.9: I get an Unknown error.

The exact error (with v1.8.10) is:

[2021/11/22 20:03:17] [ warn] [net] getaddrinfo(host='***redacted***.azure.elastic-cloud.com:9243', err=4): Domain name not found
[2021/11/22 20:03:17] [debug] [upstream] connection #-1 failed to **redacted***.azure.elastic-cloud.com:9243:443

So I have two observations:

  1. Maybe Fluent Bit is doing a DNS lookup using the host AND the port? The correct lookup would use only the host, not the port concatenated to it.
  2. It tries to connect to the host using two ports: The correct one and 443. I tried to set the port manually in addition to the Cloud_ID setting, but I get the same result.
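The first observation can be illustrated with a minimal Python sketch (not Fluent Bit's actual C code): split a decoded `host[:port]` endpoint into its parts so that only the host is handed to the resolver. The function name `split_host_port` and the `default_port` parameter are hypothetical, and the sketch does not handle IPv6 literals.

```python
from typing import Tuple


def split_host_port(endpoint: str, default_port: int = 443) -> Tuple[str, int]:
    """Split 'host[:port]' into (host, port).

    Hypothetical helper for illustration only. If no numeric port is
    present, the endpoint is returned unchanged with default_port.
    """
    host, sep, port = endpoint.rpartition(":")
    if sep and port.isdigit():
        # A trailing ':<digits>' was found: resolve only the host part.
        return host, int(port)
    return endpoint, default_port


print(split_host_port("example.azure.elastic-cloud.com:9243"))
print(split_host_port("logging.googleapis.com"))
```

Passing the whole `host:9243` string to the resolver, as the log line above suggests, would make the lookup fail even though the host itself resolves.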

@urpyLLIKa

Also reproduced sporadically on 1.8.10:
[input:emitter:emitter_for_rewrite_tag.6] error registering chunk with tag:

@ehelvacikoylu

I tried both the Helm chart and a manual installation, but I have the same problem. Is there any solution?

  • fluent-bit version 1.8.9
  • running on AKS.

[2022/02/07 20:21:18] [ warn] [net] getaddrinfo(host='*****.westeurope.azure.elastic-cloud.com:9243', err=4): Domain name not found
[2022/02/07 20:21:18] [ warn] [engine] failed to flush chunk '1-1644265237.698532458.flb', retry in 16 seconds: task_id=363, input=tail.0 > output=es.0 (out_id=0)

@jcamu

jcamu commented Mar 2, 2022

Hello,

I had the same issue.
Is there a resolution available?
Will this problem be fixed?

Regards,

vijay-veeranki added a commit to ministryofjustice/cloud-platform-terraform-logging that referenced this issue Mar 3, 2022
As we see this issue in the latest version
fluent/fluent-bit#4260
vijay-veeranki added a commit to ministryofjustice/cloud-platform-terraform-logging that referenced this issue Mar 3, 2022
* Update fluent-bit to the latest version

As the stable chart is not supported, used:
https://github.com/fluent/helm-charts/blob/main/charts/fluent-bit/Chart.yaml

* Pinned app version to "1.8.4"

As we see this issue in the latest version
fluent/fluent-bit#4260

* Add Unit tests and Documentation actions
vijay-veeranki added a commit to ministryofjustice/cloud-platform-infrastructure that referenced this issue Mar 3, 2022
Update fluent-bit to the latest version, as the stable chart is not supported, used:
https://github.com/fluent/helm-charts/blob/main/charts/fluent-bit/Chart.yaml

Pinned app version to "1.8.3", as we see this issue in the latest version
fluent/fluent-bit#4260
vijay-veeranki added a commit to ministryofjustice/cloud-platform-infrastructure that referenced this issue Mar 7, 2022
Update fluent-bit to the latest version, as the stable chart is not supported, used:
https://github.com/fluent/helm-charts/blob/main/charts/fluent-bit/Chart.yaml

Pinned app version to "1.8.4", as we see this issue in the latest version
fluent/fluent-bit#4260
vijay-veeranki added a commit to ministryofjustice/cloud-platform-infrastructure that referenced this issue Mar 7, 2022
* Upgrade logging to use latest version

Update fluent-bit to the latest version, as the stable chart is not supported, used:
https://github.com/fluent/helm-charts/blob/main/charts/fluent-bit/Chart.yaml

Pinned app version to "1.8.4", as we see this issue in the latest version
fluent/fluent-bit#4260
@patrick-stephens
Contributor

Can you retest with the latest 1.8 (1.8.13 currently) or 1.9.0 release? There have been various fixes around DNS.

@030
Contributor

030 commented May 13, 2022

@patrick-stephens Issue also occurs in 1.9.3.

[2022/05/13 09:13:27] [ warn] [net] getaddrinfo(host='xyz.westeurope.azure.elastic-cloud.com:9243', err=4): Domain name not found

@bensta I think you are right. When I run curl inside the fluent-debug container, a response is returned. If the issue were related to kube-dns or name resolution, curl should fail to resolve the name as well.

@030
Contributor

030 commented May 13, 2022

030 added a commit to 030/fluent-bit that referenced this issue May 16, 2022
…t using cloud_id.

Signed-off-by: 030 <chocolatey030@gmail.com>
@leonardo-albertovich
Collaborator

Which fluent-bit version are they running?

@PettitWesley
Contributor

@leonardo-albertovich 1.8.9, same as reported in this issue.

@nkinkade

We are also seeing this same issue with fluent-bit v1.9.3. We are seeing repeated log messages like the following, and fluent-bit does not upload logs using the Stackdriver output plugin:

[2022/07/25 21:59:25] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers

I have not yet tried net.dns.mode=TCP, but am in the process of trying that. I will also note that inside our fluent-bit containers the default resolver setting of ndots:5 is in place, and we are actively testing setting that to ndots:2. With ndots:5, we saw that fluent-bit was issuing 4 DNS queries for every external name resolution (e.g., "logging.googleapis.com"), which is not optimal at all. We have not yet rolled ndots:2 out to our production platform, but I will report back here on whether net.dns.mode=TCP in conjunction with ndots:2 helps us with this issue.

Side note: our platform is very geographically dispersed (all over the globe). Anecdotally, we are seeing these DNS issues with fluent-bit instances running on nodes that are very far away geographically from the VMs where CoreDNS is running in our clusters. We have been hypothesizing that the repeated failing DNS queries caused by ndots:5 on top of UDP packets traversing large distances, many hops and many networks, might be at least partially to blame for the DNS issues we are seeing with fluent-bit. Again, I will report back here once I have more details about how these mitigations help us, or not.
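The "4 DNS queries" effect described above can be sketched in Python as a simplified model of how a glibc-style stub resolver orders its candidate names under `ndots`. The search list below is an assumption (a typical Kubernetes pod `resolv.conf` with a hypothetical `logging` namespace), and `candidate_queries` is an illustrative name, not a Fluent Bit function.

```python
from typing import List

# Assumed search list; the namespace "logging" is hypothetical.
SEARCH = ["logging.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidate_queries(name: str, search: List[str], ndots: int) -> List[str]:
    """Return the names a stub resolver would try, in order.

    Resolution stops at the first candidate that returns an answer.
    """
    if name.endswith("."):
        return [name]  # already fully qualified: search list is skipped
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: search domains first


for ndots in (5, 2):
    print(ndots, candidate_queries("logging.googleapis.com", SEARCH, ndots))
```

Under these assumptions, `ndots:5` puts the real name last, after three search-domain lookups that are expected to come back NXDOMAIN, while `ndots:2` tries it first. A resolver may also issue separate A and AAAA queries per candidate, which this sketch does not model.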

030 added a commit to 030/fluent-bit that referenced this issue Aug 8, 2022
…s a port.

[fluentGH-4260] Resolve domain name not found by adding code that is capable
of extracting the port if it exists. If not then the default 443 will be
used.

Signed-off-by: 030 <chocolatey030@gmail.com>
@leonardo-albertovich
Collaborator

@nkinkade I had to test stackdriver in GCE yesterday, and to get it to work properly there you need to set net.dns.prefer_ipv4 on. That makes fluent-bit prioritize IPv4 results when querying the nameserver, which works around the underlying issue with GCE not allowing IPv6 connections to that address from those networks (I don't remember the exact cause because it's been a long time).

@PettitWesley
Contributor

@leonardo-albertovich How come all DNS settings are not documented? https://github.com/fluent/fluent-bit/blob/master/src/flb_upstream.c#L43

@leonardo-albertovich
Collaborator

I think some of those settings catered to some very specific corner cases and weren't meant for general usage.

@PettitWesley
Contributor

PettitWesley commented Aug 12, 2022

@leonardo-albertovich I will bring this up next time we have some sort of community or other meeting, IMO we should not have special hidden settings that only some maintainers understand and know about. If a setting needs a warning attached to it or some caveats, sure, that makes sense, but anything that exists should be documented IMO.

@nkinkade

nkinkade commented Aug 12, 2022

@leonardo-albertovich: Thanks for the tip. Some small parts of our cluster run in GCP, but the overwhelming majority is composed of globally distributed bare-metal machines. To be sure I understand the option: does it simply mean that if a DNS query returns both a v4 and a v6 address for a name, fluent-bit will always choose the v4 address over the v6 address? If so, I'm not sure how that would help us.

In my previous post I said I would report back on what the net.dns.mode=TCP and ndots:2 configuration did for us. They seem to have helped: since we implemented them we haven't seen any other failures that appear on the surface to be related to failing DNS queries or timeouts. It's not clear exactly how, or whether, net.dns.mode=TCP is working for us. It seems to have increased the coredns_dns_request_duration_seconds metric by around 100%, but that is okay because it was already only a couple of milliseconds. Changing to ndots:2 made a huge difference, at least as far as the overall number of DNS requests and their results. The request rate to CoreDNS dropped by around 75%, and NXDOMAIN response codes went to almost nothing.

I still suspect there is some sort of bug in fluent-bit that causes a deadlock or something similar after certain network timeouts or failures.

@leonardo-albertovich
Collaborator

Yes @nkinkade, that setting causes fluent-bit to prefer IPv4 records any time both IPv4 and IPv6 records are available. It doesn't mean it will stop using IPv6 if that's the only record type available; it's just about ordering. It's meant to address a very specific issue in GCE and it's not useful outside of that environment.
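The "ordering, not filtering" behaviour described here can be sketched in Python. This is an illustrative model only, not Fluent Bit's implementation; the `prefer_ipv4` function and the sample addresses (taken from documentation ranges) are assumptions.

```python
import socket
from typing import List, Tuple

Addr = Tuple[int, str]  # (address family, address string)


def prefer_ipv4(results: List[Addr]) -> List[Addr]:
    """Move IPv4 entries to the front without dropping anything.

    sorted() is stable, so entries keep their relative order within
    each group; IPv6-only result sets come back unchanged.
    """
    return sorted(results, key=lambda r: r[0] != socket.AF_INET)


mixed = [(socket.AF_INET6, "2001:db8::1"), (socket.AF_INET, "192.0.2.10")]
v6_only = [(socket.AF_INET6, "2001:db8::1")]
print(prefer_ipv4(mixed))
print(prefer_ipv4(v6_only))
```

Because nothing is removed from the list, a host with only IPv6 records is still reachable, which matches the explanation above.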

@edsiper
Member

edsiper commented Aug 12, 2022

@PettitWesley I don't think there is any intention to hide configuration options; the binary's help output actually lists them:

~/coding/fluent-bit/build (master) » bin/fluent-bit -o stackdriver -h
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

HELP
stackdriver output plugin

DESCRIPTION
Send events to Google Stackdriver Logging

OPTIONS
google_service_credentials     Set the path for the google service credentials file
                               > default: default, type: string

...redacted...

custom_k8s_regex               Set a custom kubernetes regex filter
                               > default: (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$, type: string

resource_labels                Set the resource labels
                               > default: default, type: multiple comma delimited strings


NETWORKING
net.dns.mode                   Select the primary DNS connection type (TCP or UDP)
                               > default: default, type: string

net.dns.resolver               Select the primary DNS resolver type (LEGACY or ASYNC)
                               > default: default, type: string

net.dns.prefer_ipv4            Prioritize IPv4 DNS results when trying to establish a
                               connection
                               > default: false, type: boolean

net.keepalive                  Enable or disable Keepalive support
                               > default: true, type: boolean

net.keepalive_idle_timeout     Set maximum time allowed for an idle Keepalive connection
                               > default: 30s, type: time

net.connect_timeout            Set maximum time allowed to establish a connection, this
                               time includes the TLS handshake
                               > default: 10s, type: time

net.connect_timeout_log_error  On connection timeout, specify if it should log an error.
                               When disabled, the timeout is logged as a debug message
                               > default: true, type: boolean

net.source_address             Specify network address to bind for data traffic
                               > default: default, type: string

net.keepalive_max_recycle      Set maximum number of times a keepalive connection can be
                               used before it is retired.
                               > default: 2000, type: integer

We will make sure to update the web docs with the same info. But again, there are no "special hidden settings"; they are just undocumented on the web, and that will be fixed soon.

030 added a commit to 030/fluent-bit that referenced this issue Aug 30, 2022
…s a port.

[fluentGH-4260] Resolve domain name not found by adding code that is capable
of extracting the port if it exists. If not then the default 443 will be
used.

Signed-off-by: 030 <chocolatey030@gmail.com>
030 added a commit to 030/fluent-bit that referenced this issue Sep 2, 2022
…s a port.

[fluentGH-4260] Resolve domain name not found by adding code that is capable
of extracting the port if it exists. If not then the default 443 will be
used.

Signed-off-by: 030 <chocolatey030@gmail.com>
030 added a commit to 030/fluent-bit that referenced this issue Sep 3, 2022
…s a port.

[fluentGH-4260] Resolve domain name not found by adding code that is capable
of extracting the port if it exists. If not then the default 443 will be
used.

Signed-off-by: 030 <chocolatey030@gmail.com>
030 added a commit to 030/fluent-bit that referenced this issue Sep 5, 2022
…s a port.

[fluentGH-4260] Resolve domain name not found by adding code that is capable
of extracting the port if it exists. If not then the default 443 will be
used.

Signed-off-by: 030 <chocolatey030@gmail.com>
030 added a commit to 030/fluent-bit that referenced this issue Sep 7, 2022
…s a port.

[fluentGH-4260] Resolve domain name not found by adding code that is capable
of extracting the port if it exists. If not then the default 443 will be
used.

Signed-off-by: 030 <chocolatey030@gmail.com>
@030
Contributor

030 commented Sep 12, 2022

@edsiper Could you check the PR?

030 added a commit to 030/fluent-bit that referenced this issue Sep 14, 2022
…s a port.

[fluentGH-4260] Resolve domain name not found by adding code that is
capable of extracting the port if it exists. If not then the
default 443 will be used.

Signed-off-by: 030 <chocolatey030@gmail.com>
edsiper pushed a commit that referenced this issue Sep 15, 2022
…s a port. (#5458)

[GH-4260] Resolve domain name not found by adding code that is
capable of extracting the port if it exists. If not then the
default 443 will be used.

Signed-off-by: 030 <chocolatey030@gmail.com>

Signed-off-by: 030 <chocolatey030@gmail.com>
mgeriesa pushed a commit to mgeriesa/fluent-bit that referenced this issue Oct 25, 2022
…s a port. (fluent#5458)

[fluentGH-4260] Resolve domain name not found by adding code that is
capable of extracting the port if it exists. If not then the
default 443 will be used.

Signed-off-by: 030 <chocolatey030@gmail.com>

Signed-off-by: 030 <chocolatey030@gmail.com>
Signed-off-by: Manal Geries <mgeriesa@gmail.com>
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Dec 12, 2022
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Dec 18, 2022
sumitd2 pushed a commit to sumitd2/fluent-bit that referenced this issue Feb 8, 2023
…s a port. (fluent#5458)

[fluentGH-4260] Resolve domain name not found by adding code that is
capable of extracting the port if it exists. If not then the
default 443 will be used.

Signed-off-by: 030 <chocolatey030@gmail.com>

Signed-off-by: 030 <chocolatey030@gmail.com>
Signed-off-by: root <root@sumit-acs.novalocal>