Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: unexpected EOF, http: unexpected EOF reading trailer, write: connection reset by peer #1449

Closed
AdamDorwart opened this issue Aug 7, 2017 · 3 comments

Comments

@AdamDorwart
Copy link

AdamDorwart commented Aug 7, 2017

Bug Report

I've been seeing an intermittent issue where my pipeline starts erroring out on jobs and than resolves itself. On 8/7 starting at 11:40a on a pipeline a bunch of jobs that were triggered in close succession all failed with similar errors.

(don't mind the streak of red failures following up to the event)
image

Build # | Errors
93 | Put /volumes/42615995-7b9d-48af-75ea-c9c61b524a3c/stream-in?path=.: unexpected EOF
94 | Put /volumes/fb3c1dde-a9df-4292-41b2-6dbd94c90562/stream-in?path=.: http: unexpected EOF reading trailer
95 | Put /volumes/be66f2b7-82f3-49db-7dc3-f8abdce5380e/stream-in?path=.: unexpected EOF
96 | Put /volumes/26f1514f-8a49-4049-4e0f-b106edcdd8e9/stream-in?path=.: http: unexpected EOF reading trailer
97 | Put /volumes/0356294f-165c-413a-49f9-a96af4568cc0/stream-in?path=.: http: unexpected EOF reading trailer
99 | Put /volumes/bf63b10d-8a94-4ec5-6ad1-13f287051adb/stream-in?path=.: write tcp 127.0.0.1:49000->127.0.0.1:33561: write: connection reset by peer

~12:02p @ worker-1

{"timestamp":"1502107349.425066948","source":"guardian","message":"guardian.api.garden-server.destroy.destroyed","log_level":1,"data":{"handle":"5b371a5f-9d33-49ba-6d39-ea55453b76e8","session":"3.1.111336"}}
{"timestamp":"1502107354.649356604","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-out.done","log_level":0,"data":{"session":"4.1.91795","volume":"ae831976-e1f8-4961-7593-6c6377159c97"}}
{"timestamp":"1502107357.741293907","source":"guardian","message":"guardian.api.garden-server.run.exited","log_level":1,"data":{"handle":"dabd5591-79c5-4f00-7114-f459c50eef64","id":"task","session":"3.1.111335","status":1}}
{"timestamp":"1502107370.631527662","source":"worker","message":"worker.beacon.keepalive.failed","log_level":2,"data":{"error":"EOF","session":"4.9"}}
{"timestamp":"1502107370.631785393","source":"worker","message":"worker.beacon.restarting","log_level":2,"data":{"error":"wait: remote command exited without exit status or exit signal","session":"4"}}
{"timestamp":"1502107375.768595457","source":"tsa","message":"tsa.connection.channel.forward-worker.register.start","log_level":1,"data":{"remote":"192.168.8.1:42946","session":"121.1.1.5","worker-address":"127.0.0.1:40437","worker-platform":"linux","worker-tags":""}}
{"timestamp":"1502107375.771958351","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"79247"}}

~11:49a @ worker-2

{"timestamp":"1502131778.239357471","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"0b4b62b2-f5a8-46b1-4a72-c81aaaba96ee","session":"16919"}}
{"timestamp":"1502131778.239391804","source":"guardian","message":"guardian.api.garden-server.destroy.destroyed","log_level":1,"data":{"handle":"0b4b62b2-f5a8-46b1-4a72-c81aaaba96ee","session":"3.1.17088"}}
{"timestamp":"1502131785.931120634","source":"worker","message":"worker.beacon.restarting","log_level":2,"data":{"error":"wait: remote command exited without exit status or exit signal","session":"4"}}
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
{"timestamp":"1502131785.932104349","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.bad-stream-payload","log_level":1,"data":{"error":"exit status 2","session":"4.1.17910","volume":"bf63b10d-8a94-4ec5-6ad1-13f287051adb"}}
{"timestamp":"1502131785.932196379","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.done","log_level":0,"data":{"session":"4.1.17910","volume":"bf63b10d-8a94-4ec5-6ad1-13f287051adb"}}

~11:48a @ worker-3

{"timestamp":"1502131709.111960411","source":"guardian","message":"guardian.destroy.start","log_level":1,"data":{"handle":"51d6edca-04e2-4042-5ccf-2cbb7d9639df","session":"27912"}}
{"timestamp":"1502131709.112079144","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"51d6edca-04e2-4042-5ccf-2cbb7d9639df","session":"27912"}}
{"timestamp":"1502131709.112115145","source":"guardian","message":"guardian.api.garden-server.destroy.failed","log_level":2,"data":{"error":"unknown handle: 51d6edca-04e2-4042-5ccf-2cbb7d9639df","handle":"51d6edca-04e2-4042-5ccf-2cbb7d9639df","session":"3.1.28059"}}
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
{"timestamp":"1502131734.684698820","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.bad-stream-payload","log_level":1,"data":{"error":"exit status 2","session":"4.1.17860","volume":"42615995-7b9d-48af-75ea-c9c61b524a3c"}}
{"timestamp":"1502131734.684844494","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.done","log_level":0,"data":{"session":"4.1.17860","volume":"42615995-7b9d-48af-75ea-c9c61b524a3c"}}

~11:51 @ worker-3

{"timestamp":"1502131866.838005304","source":"guardian","message":"guardian.run.finished","log_level":1,"data":{"handle":"f6e0af69-2491-49da-77d2-6d68094336e5","path":"code/pikachu/ci/clean_build_and_unit_test","session":"27929"}}
{"timestamp":"1502131866.838015795","source":"guardian","message":"guardian.api.garden-server.run.spawned","log_level":1,"data":{"handle":"f6e0af69-2491-49da-77d2-6d68094336e5","id":"task","session":"3.1.28080","spec":{"Path":"code/pikachu/ci/clean_build_and_unit_test","Dir":"/tmp/build/eae1f940","User":"root","Limits":{},"TTY":{}}}}
{"timestamp":"1502131880.435690165","source":"worker","message":"worker.beacon.restarting","log_level":2,"data":{"error":"wait: remote command exited without exit status or exit signal","session":"4"}}
{"timestamp":"1502131880.435745001","source":"worker","message":"worker.beacon.keepalive.failed","log_level":2,"data":{"error":"EOF","session":"4.15"}}
{"timestamp":"1502131885.488456726","source":"tsa","message":"tsa.connection.channel.forward-worker.register.start","log_level":1,"data":{"remote":"192.168.8.1:53120","session":"132.1.1.5","worker-address":"127.0.0.1:36753","worker-platform":"linux","worker-tags":"compiler,demo_sim"}}
{"timestamp":"1502131885.488703012","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"27930"}}
{"timestamp":"1502131885.488749743","source":"guardian","message":"guardian.list-containers.finished","log_level":1,"data":{"session":"27930"}}

~11:49a @ worker-4

{"timestamp":"1502131720.101345539","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.start","log_level":0,"data":{"session":"4.1.18126","volume":"0356294f-165c-413a-49f9-a96af4568cc0"}}
{"timestamp":"1502131720.101536751","source":"baggageclaim","message":"baggageclaim.repository.stream-in.namespace.start","log_level":0,"data":{"full-path":"/home/ci/concourse-ci/data/volumes/live/0356294f-165c-413a-49f9-a96af4568cc0/volume","path":"/home/ci/concourse-ci/data/volumes/live/0356294f-165c-413a-49f9-a96af4568cc0/volume","session":"3.391262.1","sub-path":".","volume":"0356294f-165c-413a-49f9-a96af4568cc0"}}
{"timestamp":"1502131720.101616621","source":"baggageclaim","message":"baggageclaim.repository.stream-in.namespace.done","log_level":0,"data":{"full-path":"/home/ci/concourse-ci/data/volumes/live/0356294f-165c-413a-49f9-a96af4568cc0/volume","path":"/home/ci/concourse-ci/data/volumes/live/0356294f-165c-413a-49f9-a96af4568cc0/volume","session":"3.391262.1","sub-path":".","volume":"0356294f-165c-413a-49f9-a96af4568cc0"}}
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
{"timestamp":"1502131734.682876825","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.bad-stream-payload","log_level":1,"data":{"error":"exit status 2","session":"4.1.18126","volume":"0356294f-165c-413a-49f9-a96af4568cc0"}}
{"timestamp":"1502131734.682912588","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.done","log_level":0,"data":{"session":"4.1.18126","volume":"0356294f-165c-413a-49f9-a96af4568cc0"}}
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
{"timestamp":"1502131734.684317589","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.bad-stream-payload","log_level":1,"data":{"error":"exit status 2","session":"4.1.18106","volume":"be66f2b7-82f3-49db-7dc3-f8abdce5380e"}}
{"timestamp":"1502131734.684339523","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.done","log_level":0,"data":{"session":"4.1.18106","volume":"be66f2b7-82f3-49db-7dc3-f8abdce5380e"}}
{"timestamp":"1502131748.612316370","source":"worker","message":"worker.beacon.keepalive.failed","log_level":2,"data":{"error":"EOF","session":"4.11"}}
{"timestamp":"1502131748.612370968","source":"worker","message":"worker.beacon.restarting","log_level":2,"data":{"error":"wait: remote command exited without exit status or exit signal","session":"4"}}
{"timestamp":"1502131753.759077072","source":"tsa","message":"tsa.connection.channel.forward-worker.register.start","log_level":1,"data":{"remote":"192.168.8.1:53996","session":"130.1.1.5","worker-address":"127.0.0.1:41389","worker-platform":"linux","worker-tags":"compiler,dockerfile"}}
{"timestamp":"1502131753.759771585","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"16949"}}

~11:49 @ web-atc-tsa

{"log":"{\"timestamp\":\"1502131785.933176041\",\"source\":\"atc\",\"message\":\"atc.syncer.master:scheduler.tick.scheduling.try-start-next-pending-build.do.task.failed-to-create-container-in-garden\",\"log_level\":2,\"data\":{\"build-id\":228,\"build-name\":\"99\",\"container\":\"7f390717-384b-40b6-7eb8-048632981963\",\"error\":\"Put /volumes/bf63b10d-8a94-4ec5-6ad1-13f287051adb/stream-in?path=.: write tcp 127.0.0.1:49000-\\u003e127.0.0.1:33561: write: connection reset by peer\",\"session\":\"16.4.42886.3.5.4.2\"}}\n","stream":"stdout","time":"2017-08-07T18:49:45.93414368Z"}
{"log":"{\"timestamp\":\"1502131785.933877468\",\"source\":\"atc\",\"message\":\"atc.syncer.master:scheduler.tick.scheduling.try-start-next-pending-build.do.task.failed-to-run-action\",\"log_level\":2,\"data\":{\"build-id\":228,\"build-name\":\"99\",\"error\":\"Put /volumes/bf63b10d-8a94-4ec5-6ad1-13f287051adb/stream-in?path=.: write tcp 127.0.0.1:49000-\\u003e127.0.0.1:33561: write: connection reset by peer\",\"session\":\"16.4.42886.3.5.4.2\"}}\n","stream":"stdout","time":"2017-08-07T18:49:45.934521425Z"}
{"log":"{\"timestamp\":\"1502131785.932796955\",\"source\":\"tsa\",\"message\":\"tsa.connection.tcpip-forward.interrupted\",\"log_level\":1,\"data\":{\"remote\":\"192.168.8.1:32986\",\"session\":\"127.2\"}}\n","stream":"stdout","time":"2017-08-07T18:49:45.934542911Z"}
{"log":"{\"timestamp\":\"1502131785.934935570\",\"source\":\"tsa\",\"message\":\"tsa.connection.channel.forward-worker.wait-for-process.failed-to-close-channel\",\"log_level\":2,\"data\":{\"error\":\"EOF\",\"remote\":\"192.168.8.1:32986\",\"session\":\"127.1.1.4\"}}\n","stream":"stdout","time":"2017-08-07T18:49:45.936218943Z"}
{"log":"{\"timestamp\":\"1502131785.933032990\",\"source\":\"tsa\",\"message\":\"tsa.connection.tcpip-forward.failed-to-accept\",\"log_level\":2,\"data\":{\"error\":\"accept tcp [::]:33561: use of closed network connection\",\"remote\":\"192.168.8.1:32986\",\"session\":\"127.3\"}}\n","stream":"stdout","time":"2017-08-07T18:49:45.936513946Z"}
{"log":"{\"timestamp\":\"1502131785.937350035\",\"source\":\"atc\",\"message\":\"atc.syncer.master:scheduler.tick.scheduling.try-start-next-pending-build.do.task.errored\",\"log_level\":1,\"data\":{\"build-id\":228,\"build-name\":\"99\",\"error\":\"Put /volumes/bf63b10d-8a94-4ec5-6ad1-13f287051adb/stream-in?path=.: write tcp 127.0.0.1:49000-\\u003e127.0.0.1:33561: write: connection reset by peer\",\"session\":\"16.4.42886.3.5.4.2\"}}\n","stream":"stdout","time":"2017-08-07T18:49:45.93771808Z"}
{"log":"{\"timestamp\":\"1502131785.948861122\",\"source\":\"atc\",\"message\":\"atc.syncer.master:scheduler.tick.scheduling.try-start-next-pending-build.finish.errored\",\"log_level\":1,\"data\":{\"build-id\":228,\"build-name\":\"99\",\"error\":\"Put /volumes/bf63b10d-8a94-4ec5-6ad1-13f287051adb/stream-in?path=.: write tcp 127.0.0.1:49000-\\u003e127.0.0.1:33561: write: connection reset by peer\",\"session\":\"16.4.42886.3.5.5\"}}\n","stream":"stdout","time":"2017-08-07T18:49:45.949145862Z"}
  • Concourse version: 3.3.4
  • Deployment type (BOSH/Docker/binary): Binary
  • Infrastructure/IaaS: On-site. Mixture of VMs and bare metal. All Ubuntu 16.04, 4.10.0-30. All machines exist together in our LAN on the same switch.
  • Browser (if applicable): n/a
  • Did this used to work? This is an intermittent issue that's been happening for a while (hard to say when it started)
@AdamDorwart
Copy link
Author

AdamDorwart commented Aug 7, 2017

Looking into the code, it appears worker.beacon.keepalive.failed which appears in a few of the workers logs at the incident is caused by a heartbeat request failing. More specifically it's a SendRequest returning EOF for golangs ssh library.

@vito vito added this to Backlog in Runtime Aug 14, 2017
@vito vito moved this from Backlog to Needs Validation in Runtime Aug 14, 2017
@topherbullock
Copy link
Member

@AdamDorwart Looking to validate this as a Concourse specific issue or a deployment specific issue. Is it possible that there is some network traffic / congestion / other network related jitter which could be causing the ATC to lose ability to talk to workers or vice-versa?

@AdamDorwart
Copy link
Author

I discovered that this issue was related to iptables rules were being set to drop packets on some of the problematic workers.

@topherbullock topherbullock moved this from Needs Validation to Accepted in Runtime Oct 24, 2017
@jama22 jama22 modified the milestone: patch Oct 25, 2017
@jama22 jama22 added this to the v3.6.0 milestone Oct 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Runtime
Accepted
Development

No branches or pull requests

3 participants