
Rejected connection in ETCD logs from localhost every 10 seconds #9285

Closed
KashifSaadat opened this issue Feb 6, 2018 · 22 comments

@KashifSaadat

commented Feb 6, 2018

Hi,

I've recently been testing out a Kubernetes Cluster deployment with ETCD v3.3.0, and noticed the following messages in the logs:

2018-02-06 12:41:06.905234 I | embed: rejected connection from "127.0.0.1:35574" (error "EOF", ServerName "")
2018-02-06 12:41:07.211189 I | embed: rejected connection from "127.0.0.1:35578" (error "EOF", ServerName "")
2018-02-06 12:41:16.905274 I | embed: rejected connection from "127.0.0.1:35602" (error "EOF", ServerName "")
2018-02-06 12:41:17.211320 I | embed: rejected connection from "127.0.0.1:35606" (error "EOF", ServerName "")

There are two of these log lines generated every 10 seconds. The cluster itself is operating without any issues: I can create and delete resources, and I haven't noticed or experienced any problems so far. Just curious whether anyone has an idea what may be making the requests, or if it's even anything of concern?

Additional notes:

  • I haven't managed to recreate the issue with ETCD v3.2.15.
  • There are no additional deployments in the cluster, aside from what kops may create on a standard install.

Versions:
ETCD Version: v3.3.0
Kubernetes Version: v1.8.7
Kops Version: master + PR
CoreOS Version: v1632.2.1

@hexfusion

Member

commented Feb 6, 2018

Hi @KashifSaadat, I assume you are using TLS? Can we see your full etcd startup flags?

@hexfusion

Member

commented Feb 6, 2018

I haven't managed to recreate the issue with ETCD v3.2.15.

So are you saying you can recreate the issue with v3.3.0? Can you provide full steps?

@KashifSaadat

Author

commented Feb 6, 2018

Hi @hexfusion, apologies, I forgot to include that info! ETCD is started up with the following env vars (pulled from one of the hosts):

ETCD_ADVERTISE_CLIENT_URLS='https://etcd-a2.internal.<KOPS-CLUSTER-NAME>:4001'
ETCD_CERT_FILE='/srv/kubernetes/etcd.pem'
ETCD_DATA_DIR='/var/etcd/data'
ETCD_INITIAL_ADVERTISE_PEER_URLS='https://etcd-a2.internal.<KOPS-CLUSTER-NAME>:2380'
ETCD_INITIAL_CLUSTER='etcd-a1=https://etcd-a1.internal.<KOPS-CLUSTER-NAME>:2380,etcd-a2=https://etcd-a2.internal.<KOPS-CLUSTER-NAME>:2380,etcd-b1=https://etcd-b1.internal.<KOPS-CLUSTER-NAME>:2380,etcd-b2=https://etcd-b2.internal.<KOPS-CLUSTER-NAME>:2380,etcd-c1=https://etcd-c1.internal.<KOPS-CLUSTER-NAME>:2380'
ETCD_INITIAL_CLUSTER_STATE='new'
ETCD_INITIAL_CLUSTER_TOKEN='etcd-cluster-token-etcd'
ETCD_KEY_FILE='/srv/kubernetes/etcd-key.pem'
ETCD_LISTEN_CLIENT_URLS='https://0.0.0.0:4001'
ETCD_LISTEN_PEER_URLS='https://0.0.0.0:2380'
ETCD_NAME='etcd-a2'
ETCD_PEER_CERT_FILE='/srv/kubernetes/etcd.pem'
ETCD_PEER_KEY_FILE='/srv/kubernetes/etcd-key.pem'
ETCD_PEER_TRUSTED_CA_FILE='/srv/kubernetes/ca.crt'
ETCD_TRUSTED_CA_FILE='/srv/kubernetes/ca.crt'

I've attached an example Kops ClusterSpec you can use for testing:
example-clusterSpec.yaml

^ This will need some tweaking, such as setting the kops state bucket and cluster name, tightening down Kube API & SSH access, etc. There's a general guide on this here: https://github.com/kubernetes/kops/blob/master/docs/aws.md#creating-your-first-cluster

As mentioned in the initial post, I've built kops from source to make use of particular contributions. A guide on building kops from source is here: https://github.com/kubernetes/kops/blob/master/docs/development/building.md

In particular, you would need to build new versions of kops, nodeup & protokube. This can be done by running something similar to the following:

go get k8s.io/kops
cd go/src/k8s.io/kops
git remote add appvia git@github.com:appvia/kops.git
git fetch appvia
git checkout appvia/bootstrapscript-etcdClusterSpec
make upload S3_BUCKET=s3://<KOPS-S3-BUCKET>

Then to build a cluster using kops (make sure you use the kops binary that is built in the above steps, located in go/src/k8s.io/kops/.build/upload/kops/1.8.0/):

export NODEUP_URL=https://s3.<REGION>.amazonaws.com/<KOPS-S3-BUCKET>/<PATH-TO-NODEUP>
export PROTOKUBE_IMAGE=https://s3.<REGION>.amazonaws.com/<KOPS-S3-BUCKET>/<PATH-TO-PROTOKUBE>
kops create -f clusterSpec.yaml # based off the attached file with your amendments
kops create secret --name <KOPS-CLUSTER-NAME> sshpublickey admin -i ~/.ssh/id_rsa.pub
kops update cluster <KOPS-CLUSTER-NAME> --yes

Once the cluster has started, you'll be able to ssh into one of the master instances using the default user (core for the CoreOS AMI I've used in the attached ClusterSpec), and the etcd logs are located at:

  • /var/log/etcd.log
  • /var/log/etcd-events.log
@hexfusion

Member

commented Feb 6, 2018

@KashifSaadat great details, thanks for that.

There are two of these log lines generated every 10 seconds.

I wonder if this is some kind of health check failing to connect? Also, can you provide more logging?

@hexfusion

Member

commented Feb 6, 2018

Actually, the 10 seconds could be TLSHandshakeTimeout: 10 * time.Second. We will probably want logs for all peers. You should also verify your certs; it seems they should perhaps also include 127.0.0.1 in the SAN?

openssl x509 -in $cert_name.pem -text -noout

https://github.com/coreos/etcd/blob/master/Documentation/op-guide/security.md
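
As an alternative to eyeballing the openssl output, a minimal Go sketch along these lines could print the SANs directly (the certificate path is taken from the ETCD_CERT_FILE flag posted above; everything else here is illustrative only):

package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"io/ioutil"
)

func main() {
	// Path taken from ETCD_CERT_FILE in the flags posted above.
	data, err := ioutil.ReadFile("/srv/kubernetes/etcd.pem")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(data)
	if block == nil {
		panic("no PEM block found in certificate file")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	// Print the Subject Alternative Names presented by the serving cert,
	// to check whether 127.0.0.1 (or the expected hostnames) are included.
	fmt.Println("DNS SANs:", cert.DNSNames)
	fmt.Println("IP SANs: ", cert.IPAddresses)
}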

@gyuho

Member

commented Feb 6, 2018

Probably there are other client requests coming in with the wrong certs and thus being closed.
Still, the error message is not descriptive, which seems to be addressed in https://go-review.googlesource.com/c/go/+/86715/7/src/crypto/tls/tls.go.

@KashifSaadat

Author

commented Feb 9, 2018

Hey, apologies for the late update!

I've been doing some debugging and found the following:

  • It appears that the calls are originating from kubelet. When I stop the kubelet service and leave all other services operating, the rejected connection logs in etcd immediately stop.
  • When running Kubernetes(/kubelet) v1.8.7 and ETCD v3.2.15, I do not see the logs.
  • When running Kubernetes(/kubelet) v1.8.7 and ETCD v3.3.0, I see the logs.
  • I bumped up the verbosity for kubelet and noticed various HTTP Probe logs, so I tested them out:
kubelet[26610]: I0209 14:42:52.556308   26610 prober.go:160] HTTP-Probe Host: http://127.0.0.1, Port: 8080, Path: /healthz
kubelet[26610]: I0209 14:42:52.556338   26610 prober.go:163] HTTP-Probe Headers: map[]
kubelet[26610]: I0209 14:42:52.557460   26610 http.go:96] Probe succeeded for http://127.0.0.1:8080/healthz, Response: {200 OK 200 HTTP/1.1 1 1 map[Date:[Fri, 09 Feb 2018 14:42:52 GMT] Content-Length:[2] Content-Type:[text/plain; charset=utf-8]] 0xc4206a8da0 2 [] true false map[] 0xc42197dc00 <nil>}
kubelet[26610]: I0209 14:42:52.557499   26610 prober.go:113] Liveness probe for "kube-apiserver-ip-10-100-0-134.eu-west-2.compute.internal_kube-system(1ab242f3aecc4235bc25d68db8832ba5):kube-apiserver" succeeded

With both ETCD v3.2.15 and ETCD v3.3.0 clusters I ran curl http://127.0.0.1:8080/healthz on a master node. On both, I get an HTTP 200 OK response and the response content ok is returned. By making this curl request, ETCD v3.3.0 returns the logs:

2018-02-09 14:42:52.557170 I | embed: rejected connection from "127.0.0.1:38132" (error "EOF", ServerName "")

So it's caused when kubelet is performing liveness probes against different services and hits the kube-apiserver unauthenticated port. I'm now lost from here :P

@gyuho

Member

commented Feb 9, 2018

I ran curl http://127.0.0.1:8080/healthz on a master node. On both, I get a HTTP 200 OK response and the response content ok is returned. By making this curl request, ETCD v3.3.0 returns the logs:

@xiang90 Do the API server probes ping etcd?

@KashifSaadat

Author

commented Feb 14, 2018

I've been doing a little more debugging, and it appears to be this call here:

My kops master nodes start kube-apiserver with the flag --etcd-servers=https://127.0.0.1:4001. Trawling through the code, the request ends up as in my example below:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.DialTimeout("tcp", "127.0.0.1:4001", 1*time.Second)
	if err != nil {
		fmt.Printf("failed, %v", err)
	} else {
		conn.Close()
		fmt.Printf("successful\n") // I return successfully here, no errors.
	}
}

I can execute that to replicate the behaviour and the etcd logs exactly: the above code gets a successful response, however etcd reports the rejected connection.
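
For what it's worth, the "EOF" in etcd's log appears to come from the server-side TLS handshake: the dial above opens a TCP connection, never sends a ClientHello, and closes, so etcd's TLS listener reads EOF and logs the rejection. As a rough contrast (a sketch only, assuming client-cert auth isn't enforced, since ETCD_CLIENT_CERT_AUTH isn't set in the flags above), completing the handshake before closing should not produce the rejection log:

package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	dialer := &net.Dialer{Timeout: 1 * time.Second}
	// Unlike the plain net.DialTimeout above, this completes a TLS handshake
	// before closing, so there is nothing for etcd's listener to reject.
	// InsecureSkipVerify only keeps the sketch independent of the cert's SANs;
	// a real check would set RootCAs (e.g. from /srv/kubernetes/ca.crt) and a
	// matching ServerName instead.
	conn, err := tls.DialWithDialer(dialer, "tcp", "127.0.0.1:4001", &tls.Config{
		InsecureSkipVerify: true,
	})
	if err != nil {
		fmt.Printf("failed, %v\n", err)
		return
	}
	conn.Close()
	fmt.Println("successful")
}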

@yangunang

commented Mar 6, 2018

I run etcd 3.3.1 and have the same problem in the log: etcd: rejected connection from "localIP:54124" (error "EOF", ServerName "")

@jpiper

commented Mar 14, 2018

I've noticed that this happens when you hit the "/healthz" endpoint on the Kubernetes API server for some reason. It doesn't seem to be an issue, but when you're trying to debug something it wastes a few hours of your time, ha.

@JinsYin

commented May 29, 2018

Solved.

I replaced the server certificate with the peer certificate, although it seems unreasonable.

#9785

@eugene-chow

commented Jun 7, 2018

I can confirm that the errors occur on Kubernetes v1.10.3 with etcd v3.3.5. The error appears when kubelet calls the apiserver's healthz endpoint, as @KashifSaadat said. Manually calling curl -k https://localhost:6443/healthz triggers this as well.

Does anyone know why this is happening?

@hectorj2f

commented Jun 12, 2018

We see the same log entries here with Kubernetes 1.10.4 and etcd v3.3.7.

@ttyS0

commented Jun 24, 2018

I'm still seeing this with Kubernetes 1.10.5 and etcd v3.3.8.

@cflee

commented Jun 26, 2018

See kubernetes/kubernetes#63316 and kubernetes/kubernetes#64909.

It looks like the PR kubernetes/kubernetes#65027 "Use actual etcd client for /healthz/etcd checks" will replace apiserver's TCP dial with an actual etcd client trying to list etcd members, so I guess this will be resolved in Kubernetes v1.12.
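
For anyone curious what that change amounts to in practice, a rough sketch of a client-based health check using the coreos/etcd clientv3 package might look like the following. This is not the apiserver's actual code; the endpoint and certificate paths are simply reused from the flags earlier in this thread, and reusing the serving cert/key as client credentials is an assumption:

package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Certificate paths taken from the etcd flags earlier in this thread.
	cert, err := tls.LoadX509KeyPair("/srv/kubernetes/etcd.pem", "/srv/kubernetes/etcd-key.pem")
	if err != nil {
		panic(err)
	}
	caPEM, err := ioutil.ReadFile("/srv/kubernetes/ca.crt")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:4001"},
		DialTimeout: 2 * time.Second,
		TLS: &tls.Config{
			Certificates: []tls.Certificate{cert},
			RootCAs:      pool,
		},
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Listing members exercises a real, authenticated etcd request instead of
	// a bare TCP dial, so etcd no longer sees half-open connections.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if _, err := cli.MemberList(ctx); err != nil {
		fmt.Println("etcd health check failed:", err)
		return
	}
	fmt.Println("etcd health check ok")
}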

@akutz

commented Sep 27, 2018

Thank you for this issue. I've been toiling for two days to understand why kubectl exec fails on my 1.10 deployment (it works on 1.11, 1.12, 1.13...). I kept seeing the EOF errors regarding health checks. It's really bad when an AWS ALB is fronting your cluster, hitting the health checks every few seconds. If this is a false positive, it's a doozy! It put me on a deep-dive this morning that it sounds like I can ignore. Back to the other issue :)

@n0made1

commented Oct 4, 2018

I also see this with Kubernetes 1.11.1 and etcd 3.3.9.
Is there any known workaround?

@KashifSaadat

Author

commented Oct 4, 2018

No worries @akutz, glad it helped!

@n0made1 the above Kubernetes PR was released and is available in v1.12.0 (changelog); I've not yet got around to testing it.

Edit: Tested and resolved in Kubernetes v1.12

@wenjiaswe

Contributor

commented Oct 23, 2018

Agree with @cflee that kubernetes/kubernetes#65027 should have addressed the issue. I am looking into whether it's feasible to backport it to k8s 1.10 and 1.11.

/assign

@wenjiaswe

Contributor

commented Oct 23, 2018

I just talked to @liggitt, the author of the fix mentioned. As we need to balance the severity of the issue against the risk of backporting, we are not planning to backport it to k8s 1.10 and 1.11. If you have a strong use case that really needs the backport in k8s, please open an issue in kubernetes so we can have more experts evaluate it. Otherwise I am closing this issue, since it's not an etcd issue.

@wenjiaswe closed this Oct 23, 2018
