DNS resolution timeout/failure in 1.8.9 #4260

stevenarvar · 2021-10-29T19:54:57Z

Bug Report

Describe the bug
Hi, I am facing DNS resolution timeouts/failures using 1.8.9 when forwarding logs to Stackdriver.

To Reproduce

upgrade from 1.8.3 to 1.8.9

[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [engine] chunk '1-1635532763.608049522.flb' cannot be retried: task_id=41, input=standard_log_emitted > output=stackdriver.1
[2021/10/29 18:43:41] [ warn] [input] emitter.8 paused (mem buf overlimit)
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01

Your Environment

Version used: 1.8.9
Configuration: stackdriver output and tail input
Environment name and version (e.g. Kubernetes? What version?): K8S 1.19
Filters and plugins: stackdriver & tail

Additional context
Some fluent-bit pods eventually output logs such as Resource temporarily unavailable and gave up:

[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [src/flb_http_client.c:1172 errno=11] Resource temporarily unavailable
[2021/10/29 18:43:58] [ warn] [output:stackdriver:stackdriver.1] http_do=-1
[2021/10/29 18:43:58] [error] [src/flb_http_client.c:1172 errno=11] Resource temporarily unavailable
[2021/10/29 18:43:58] [ warn] [output:stackdriver:stackdriver.1] http_do=-1

stevenarvar · 2021-10-29T19:55:25Z

Issue could be related to #4050

matthewfala · 2021-10-29T20:30:55Z

Please also see: #4257

sdwerwed · 2021-11-15T14:55:26Z

I have a similar issue using fluent/fluent-bit:1.8.9-debug: fluent-bit cannot resolve the headless service in AKS to forward the logs to the fluentd StatefulSet. The fluent/fluent-bit:1.8.4-debug image does not give those errors, so unfortunately I have to downgrade until there is a fix for this bug. I tested nslookup in the fluent-bit container and it still cannot resolve the name; I tested nslookup with an ubuntu image and it works. Is there any chance

fluentbit logs:

[2021/11/15 15:14:49] [ warn] [engine] failed to flush chunk '1-1636989288.537725817.flb', retry in 10 seconds: task_id=1, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:50] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:50] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:50] [ warn] [engine] failed to flush chunk '1-1636989289.606633533.flb', retry in 8 seconds: task_id=3, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.988513419.flb', retry in 8 seconds: task_id=8, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.511459650.flb', retry in 7 seconds: task_id=4, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.704430196.flb', retry in 7 seconds: task_id=6, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.1] no upstream connections available

bensta · 2021-11-22T19:50:05Z

I get the same error with the Elasticsearch output when configuring it with the Cloud_ID and Cloud_Auth config in both Minikube and AKS.
I tried with multiple versions:
1.8.x series: I get the Domain Not Found error described above
1.7.9: I get an Unknown error.

The exact error (with v1.8.10) is:

[2021/11/22 20:03:17] [ warn] [net] getaddrinfo(host='***redacted***.azure.elastic-cloud.com:9243', err=4): Domain name not found
[2021/11/22 20:03:17] [debug] [upstream] connection #-1 failed to **redacted***.azure.elastic-cloud.com:9243:443

So I have two observations:

Maybe Fluent Bit is doing a DNS lookup using the host AND the port? The correct lookup would use only the host, not the port concatenated to it.
It tries to connect to the host using two ports: The correct one and 443. I tried to set the port manually in addition to the Cloud_ID setting, but I get the same result.
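The first observation can be illustrated with a minimal Python sketch of separating an optional port from a host string before the name is handed to the resolver. This is not Fluent Bit's code (which is written in C); the helper name `split_host_port` and the fallback port of 443 are assumptions made for illustration, and IPv6 literals are deliberately not handled.

```python
def split_host_port(cloud_host, default_port=443):
    """Split 'host[:port]' into (host, port).

    Hypothetical helper: only the host part should be passed to the
    DNS lookup; the port is kept separately for the connection.
    """
    host, sep, port = cloud_host.rpartition(":")
    if sep and port.isdigit():
        # A numeric suffix after the last colon is treated as the port.
        return host, int(port)
    # No port present: keep the whole string as the host (assumed default port).
    return cloud_host, default_port


print(split_host_port("example.azure.elastic-cloud.com:9243"))
print(split_host_port("example.azure.elastic-cloud.com"))
```

With a split like this, the lookup would be issued for `example.azure.elastic-cloud.com` alone rather than for the string with `:9243` attached.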

urpyLLIKa · 2021-11-23T06:18:25Z

Also reproduced sporadically on 1.8.10:
[input:emitter:emitter_for_rewrite_tag.6] error registering chunk with tag:

ehelvacikoylu · 2022-02-07T20:32:42Z

I tried both the helm chart and a manual installation, but I have the same problem. Is there any solution?

fluent-bit version 1.8.9
running on AKS.

[2022/02/07 20:21:18] [ warn] [net] getaddrinfo(host='*****.westeurope.azure.elastic-cloud.com:9243', err=4): Domain name not found
[2022/02/07 20:21:18] [ warn] [engine] failed to flush chunk '1-1644265237.698532458.flb', retry in 16 seconds: task_id=363, input=tail.0 > output=es.0 (out_id=0)

jcamu · 2022-03-02T16:53:37Z

Hello,

I had the same issue.
Is there a resolution available?
Will this problem be fixed?

Regards,

patrick-stephens · 2022-03-14T21:00:22Z

Can you retest with the latest 1.8 (1.8.13 currently) or 1.9.0 release? There have been various fixes around DNS.

030 · 2022-05-13T09:14:49Z

@patrick-stephens Issue also occurs in 1.9.3.

[2022/05/13 09:13:27] [ warn] [net] getaddrinfo(host='xyz.westeurope.azure.elastic-cloud.com:9243', err=4): Domain name not found

@bensta I think you are right. When I run curl inside the fluent-debug container, a response is returned. If the issue were related to kube-dns or name resolution in general, curl should return a resolution error as well.

030 · 2022-05-13T09:55:55Z

Related: https://stackoverflow.com/q/69405701/2777965

leonardo-albertovich · 2022-07-19T19:09:51Z

Which fluent-bit version are they running?

PettitWesley · 2022-07-19T21:31:11Z

@leonardo-albertovich 1.8.9, same as reported in this issue.

nkinkade · 2022-07-27T21:52:25Z

We are also seeing this same issue with fluent-bit v1.9.3. We are seeing repeated log messages like the following, and fluent-bit does not upload logs using the Stackdriver output plugin:

[2022/07/25 21:59:25] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers

I have not yet tried net.dns.mode=TCP, but am in the process of trying that. I will also note that inside our fluent-bit containers the default resolver setting of ndots:5 is in place, and we are actively testing setting that to ndots:2. With ndots:5, we saw that fluent-bit was issuing 4 DNS queries for every external name resolution (e.g., "logging.googleapis.com"), which is far from optimal. We have not yet rolled ndots:2 out to our production platform, but I will report back here on whether net.dns.mode=TCP in conjunction with ndots:2 helps with this issue.

Side note: our platform is very geographically dispersed (all over the globe). Anecdotally, we are seeing these DNS issues with fluent-bit instances running on nodes that are very far away geographically from the VMs where CoreDNS is running in our clusters. We have been hypothesizing that the repeated failing DNS queries caused by ndots:5 on top of UDP packets traversing large distances, many hops and many networks, might be at least partially to blame for the DNS issues we are seeing with fluent-bit. Again, I will report back here once I have more details about how these mitigations help us, or not.
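The count of 4 queries per external name is consistent with how a conventional stub resolver applies the resolv.conf search list. The following Python sketch models that ordering; it is not Fluent Bit's resolver, and the three search domains are an assumption based on a typical Kubernetes pod (the thread does not show the actual resolv.conf). It counts names tried, not individual A/AAAA packets.

```python
def candidate_queries(name, search_domains, ndots):
    """Names a resolv.conf-style stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search list applied
    searched = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + searched  # enough dots: try the name as-is first
    return searched + [absolute]      # too few dots: search list first


def queries_until_answer(name, search_domains, ndots, existing):
    """Return the names tried up to and including the first that exists."""
    tried = []
    for candidate in candidate_queries(name, search_domains, ndots):
        tried.append(candidate)
        if candidate in existing:
            break
    return tried


# Assumed search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
EXISTING = {"logging.googleapis.com."}

for ndots in (5, 2):
    tried = queries_until_answer("logging.googleapis.com", SEARCH, ndots, EXISTING)
    print(f"ndots:{ndots} -> {len(tried)} name(s) tried")
```

Under these assumptions, ndots:5 tries three search-list names that fail before the real one, while ndots:2 tries the real name first.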

…s a port. [fluentGH-4260] Resolve domain name not found by adding code that is capable of extracting the port if it exists. If not then the default 443 will be used. Signed-off-by: 030 <chocolatey030@gmail.com>

leonardo-albertovich · 2022-08-12T19:00:18Z

@nkinkade I had to test stackdriver in GCE yesterday, and to get it to work properly there you need to add the option net.dns.prefer_ipv4 on. That makes fluent-bit prioritize IPv4 results when querying the nameserver, which works around the underlying issue with GCE not allowing IPv6 connections to that address from those networks (I don't remember the exact cause because it's been a long time).

PettitWesley · 2022-08-12T19:16:18Z

@leonardo-albertovich How come all DNS settings are not documented? https://github.com/fluent/fluent-bit/blob/master/src/flb_upstream.c#L43

leonardo-albertovich · 2022-08-12T19:22:51Z

I think some of those settings catered to very specific corner cases and weren't meant for general usage.

PettitWesley · 2022-08-12T19:39:55Z

@leonardo-albertovich I will bring this up next time we have some sort of community or other meeting, IMO we should not have special hidden settings that only some maintainers understand and know about. If a setting needs a warning attached to it or some caveats, sure, that makes sense, but anything that exists should be documented IMO.

nkinkade · 2022-08-12T19:41:45Z

@leonardo-albertovich: Thanks for the tip. Some small parts of our cluster run in GCP, but the overwhelming majority consists of globally distributed bare-metal machines. To be sure I understand the option, does it simply mean that if a DNS query returns both a v4 and a v6 address for a name, fluent-bit will always choose the v4 address over the v6 address? If so, I'm not sure how that would help us.

In my previous post I said I would report back on what the net.dns.mode=TCP and ndots:2 configuration did for us. They seem to have helped. Since we implemented them we haven't seen any other failures that appear on the surface to be related to failing DNS queries or timeouts. It's not clear exactly how or if net.dns.mode=TCP is working for us. It seems to have increased the coredns_dns_request_duration_seconds metric by around 100%, but that is okay because it was already only a couple of milliseconds. Changing to ndots:2 made a huge difference, at least as far as the overall number of DNS requests and their results. The request rate to CoreDNS dropped by around 75%, and NXDOMAIN response codes went to almost nothing.

I still suspect there is some sort of bug in fluent-bit that causes a deadlock or something similar after certain network timeouts or failures.

leonardo-albertovich · 2022-08-12T19:46:07Z

Yes @nkinkade, that setting causes fluent-bit to prefer IPv4 records any time both IPv4 and IPv6 records are available. It doesn't mean it will stop using IPv6 if that's the only record type available. It's just about ordering; it's meant to address a very specific issue in GCE and it's not useful outside of that environment.
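The "just about ordering" behaviour described above can be sketched in Python as a stable sort over resolver results. This is only an illustration of the idea, not Fluent Bit's implementation; the function name `prefer_ipv4` is hypothetical, and the sample addresses come from the documentation ranges.

```python
import socket


def prefer_ipv4(addrinfos):
    """Reorder getaddrinfo-style tuples so IPv4 entries come first.

    sorted() is stable, so entries of the same family keep their
    relative order, and IPv6-only results are returned unchanged.
    """
    return sorted(addrinfos, key=lambda info: info[0] != socket.AF_INET)


# Sample results in the (family, type, proto, canonname, sockaddr) layout.
results = [
    (socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("2001:db8::1", 443, 0, 0)),
    (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", 443)),
]

for family, _, _, _, sockaddr in prefer_ipv4(results):
    print(family.name, sockaddr[0])
```

Nothing is filtered out: if only IPv6 records are returned, they are still used, which matches the description above.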

edsiper · 2022-08-12T21:41:36Z

@PettitWesley I think there is no intention to hide configuration options; the binary's help output actually lists them:

~/coding/fluent-bit/build (master) » bin/fluent-bit -o stackdriver -h
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

HELP
stackdriver output plugin

DESCRIPTION
Send events to Google Stackdriver Logging

OPTIONS
google_service_credentials     Set the path for the google service credentials file
                               > default: default, type: string

...redacted...

custom_k8s_regex               Set a custom kubernetes regex filter
                               > default: (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$, type: string

resource_labels                Set the resource labels
                               > default: default, type: multiple comma delimited strings


NETWORKING
net.dns.mode                   Select the primary DNS connection type (TCP or UDP)
                               > default: default, type: string

net.dns.resolver               Select the primary DNS resolver type (LEGACY or ASYNC)
                               > default: default, type: string

net.dns.prefer_ipv4            Prioritize IPv4 DNS results when trying to establish a
                               connection
                               > default: false, type: boolean

net.keepalive                  Enable or disable Keepalive support
                               > default: true, type: boolean

net.keepalive_idle_timeout     Set maximum time allowed for an idle Keepalive connection
                               > default: 30s, type: time

net.connect_timeout            Set maximum time allowed to establish a connection, this
                               time includes the TLS handshake
                               > default: 10s, type: time

net.connect_timeout_log_error  On connection timeout, specify if it should log an error.
                               When disabled, the timeout is logged as a debug message
                               > default: true, type: boolean

net.source_address             Specify network address to bind for data traffic
                               > default: default, type: string

net.keepalive_max_recycle      Set maximum number of times a keepalive connection can be
                               used before it is retired.
                               > default: 2000, type: integer

We will make sure to update the web docs with the same info. But again, there are no "special hidden settings", just options that are undocumented on the web. It will be fixed soon.

030 · 2022-09-12T08:02:01Z

@edsiper Could you check the PR?

github-actions · 2022-12-12T02:04:07Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions · 2022-12-18T01:58:53Z

This issue was closed because it has been stalled for 5 days with no activity.

vijay-veeranki added a commit to ministryofjustice/cloud-platform-terraform-logging that referenced this issue Mar 3, 2022

Pinned app version to "1.8.3"

3090064

As we see this issue in the latest version fluent/fluent-bit#4260

vijay-veeranki mentioned this issue Mar 3, 2022

Fluent bit upgrade ministryofjustice/cloud-platform-terraform-logging#44

Merged

vijay-veeranki mentioned this issue Mar 3, 2022

Update logging module ministryofjustice/cloud-platform-infrastructure#1493

Closed

vijay-veeranki mentioned this issue Mar 3, 2022

Upgrade logging to use latest version ministryofjustice/cloud-platform-infrastructure#1494

Merged

matthewfala mentioned this issue Mar 5, 2022

Elastic Cloud. Domain name found. Connection failed to domain.com:9243:443 aws/aws-for-fluent-bit#306

Closed

vijay-veeranki mentioned this issue Mar 8, 2022

Increased buffer size to 1MB, default is 32k ministryofjustice/cloud-platform-terraform-logging#46

Merged

030 mentioned this issue May 13, 2022

out_es: read port if present in cloud_host #3923

Open

030 added a commit to 030/fluent-bit that referenced this issue May 16, 2022

[fluentGH-4260] Resolve domain name not found in conjunction with por…

0e5577a

…t using cloud_id.

030 added a commit to 030/fluent-bit that referenced this issue May 16, 2022

[fluentGH-4260] Resolve domain name not found in conjunction with por…

32dd7e3

…t using cloud_id.

030 added a commit to 030/fluent-bit that referenced this issue May 16, 2022

[fluentGH-4260] Resolve domain name not found in conjunction with por…

c2abc9f

…t using cloud_id.

030 added a commit to 030/fluent-bit that referenced this issue May 16, 2022

[fluentGH-4260] Resolve domain name not found in conjunction with por…

fea66c0

…t using cloud_id.

030 added a commit to 030/fluent-bit that referenced this issue May 16, 2022

[fluentGH-4260] Resolve domain name not found in conjunction with por…

94b58b0

…t using cloud_id.

030 mentioned this issue May 16, 2022

out_es: resolve domain name not found issue in cloud_id as it contains a port. #5458

Merged

1 task

030 added a commit to 030/fluent-bit that referenced this issue May 16, 2022

[fluentGH-4260] Resolve domain name not found in conjunction with por…

cf63572

…t using cloud_id. Signed-off-by: 030 <chocolatey030@gmail.com>

030 added a commit to 030/fluent-bit that referenced this issue May 16, 2022

[fluentGH-4260] Resolve domain name not found in conjunction with por…

6199bfe

…t using cloud_id. Signed-off-by: 030 <chocolatey030@gmail.com>

nkinkade mentioned this issue Jul 27, 2022

Sets ndots=2 in dnsPolicy for fluentbit and disco DaemonSets + fix flannel m-lab/k8s-support#707

Merged

github-actions bot added the Stale label Dec 12, 2022

github-actions bot closed this as not planned Dec 18, 2022

gavenkoa mentioned this issue Apr 2, 2023

New ASYNC net.dns.resolver fails with getaddrinfo(err=12): Timeout while contacting DNS servers with Elasticsearch shipper #7105

Closed
