Sporadic 502 response only when running through traefik #3237

Open
koliyo opened this issue Apr 26, 2018 · 44 comments

Comments

@koliyo commented Apr 26, 2018

Do you want to request a feature or report a bug?

Bug

What did you do?

I have a GraphQL API running on Node.js using Apollo and Express, with Traefik in front.

When proxying through Traefik I get sporadic 502 responses that I have not been able to resolve.

It never happens when I bypass the proxy and connect directly to the backend Node server.

I am running all tests locally on my dev machine.

My first attempt to force the error was load testing with the Locust framework. However, even when sending large numbers of requests through the proxy I was unable to replicate the error. It only happens when I use the frontend application in the browser.

After reading this oxy issue I started suspecting cancelled connections.

I added a custom HTTP header with a UUID to be able to trace all requests, which I print on the backend.

  app.use((req, res, next) => {
    // Log the tracing header attached by the client so requests can be matched
    const id = req.headers['x-request-id'];
    if (id) {
      console.log(`Request id: ${id}`);
    }
    next();
  });
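
For completeness, here is a sketch of how such a header could be attached on the client side, assuming a browser fetch-based client (the UUID generator below is just a simple stand-in for illustration):

  // Hypothetical client-side helper: attach a fresh x-request-id to every call.
  function requestId() {
    return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, (c) => {
      const r = (Math.random() * 16) | 0;
      return (c === 'x' ? r : (r & 0x3) | 0x8).toString(16);
    });
  }

  function tracedFetch(url, options = {}) {
    // Merge the tracing header into whatever headers the caller supplied.
    const headers = { ...(options.headers || {}), 'x-request-id': requestId() };
    return fetch(url, { ...options, headers });
  }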

Then I also added the following event listener to the Express server to detect cancelled requests:

  app.use((req, res, next) => {
    // The 'close' event fires when the underlying connection closes,
    // including when the client or the proxy aborts the request.
    req.connection.on('close', () => {
      const id = req.headers['x-request-id'];
      console.log(`Cancelled request ${id}`);
    });

    next();
  });

What I can see is that I do get cancelled requests when running the application in the browser, and at some point I get a 502 response from Traefik. The Traefik log shows:

DEBU[2018-04-26T13:43:51+02:00] vulcand/oxy/forward/http: Round trip: http://localhost:6543, code: 502, Length: 11, duration: 66.352475ms 

The Node.js backend log looks something like this:

...
Request id: 7455804b-490a-4361-98e5-43d12bf4aca8
Request id: 737f8d9d-3300-461b-858b-07006582a937
POST /graphql 200 83.542 ms - 310
POST /graphql 200 16.441 ms - 682
Request id: 096e0e39-90e6-475c-b8ad-0aa2dfd2e345
POST /graphql 200 5.338 ms - 163
Request id: 69f17cb2-cdf1-4db5-a9f5-08e46d795892
Request id: 50d3aec6-5cda-4e8b-ac0e-a30a57fa94c9
POST /graphql 200 58.596 ms - 310
POST /graphql 200 15.526 ms - 682
Request id: 1d051f3a-7d80-464b-b50f-6d8e733d1940
<------------- Here I get the 502
Cancelled request 2e0a8e14-9880-46e7-8e51-ad528d55a81d
Cancelled request b9489e71-7fc5-4f1c-b30a-668aac4652f9
Cancelled request 249c529c-b9cb-4b48-a491-8e38a7ee50d8
Cancelled request a5be4a66-9d43-4e30-a92d-862b355399a0
Cancelled request 3721fe71-fe18-4389-812a-a90cc2f4f0f1
Cancelled request 71b74750-8078-471e-91b8-a8119e5db797
Cancelled request 34fb6b91-9fa5-4d68-92da-c267089f5910
Cancelled request 692770b1-61c3-49c2-8309-8e7be629dca1
Cancelled request 05790579-8290-4787-a7b7-82596ad24520
Cancelled request c8edcc39-30c7-4812-941c-a1899298acf7
Cancelled request 2ba9e715-ab7c-48ee-9d35-b5609179de6e
Cancelled request b34f4725-665f-4b27-b3e1-cefec20c2ade
Cancelled request 04bd3718-f6aa-4318-a469-fa3e17f54a20
Cancelled request 4aedc60c-269a-420c-b083-1ea8f2e3243e
Cancelled request 25be7334-43f9-4135-9537-36b0e36e698c
Cancelled request 47bc1f9f-55c7-4f31-9957-7f0ad4285314
Cancelled request bae3237c-efc8-4831-8260-6edbcedef28f
Cancelled request 54685ecb-4d34-4698-b956-d0602b74a2e4
Cancelled request 965f6ff2-da91-423c-a8e4-c2f4252f25fc
Cancelled request 95c77d5c-230d-4875-8b25-fc0673c8e595
Cancelled request 01658960-4627-42f8-a496-d29408a9579b
Cancelled request 38221ac3-47ed-42f2-a56e-31deacdbfd62
Cancelled request e73bec6b-744c-47bc-b001-0d914f03e976
Cancelled request 73fade75-a943-45df-8b21-f8c50a480170
Cancelled request 02688ad9-e947-415f-b70c-3cda16c50cf2
Cancelled request 5d7d26c2-8c69-4083-a2d3-f0e1ae23bd0f
Cancelled request f81a0258-085d-462f-9fcb-8a8b47918d04
...

The failed request that gets a 502 response in the browser never reaches the Node backend.

I get a whole bunch of cancelled requests after the 502 occurs. These request IDs had already been successfully served by the Node.js application at an earlier point.

The cancelling of the requests seems to indicate some kind of connection leak? Or maybe it is just a side effect of having the Chrome developer tools open?

In any case, I never get any error responses when bypassing the Traefik instance.

As the oxy issue describes, if I could just get some response other than 502 for cancelled requests, I could handle this better on the client side.
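
In the meantime, one possible stopgap on the client side, assuming the request is safe to resend, would be a small retry wrapper along these lines (a sketch, not what I currently use):

  // Hypothetical retry wrapper: resend a request once if the proxy answers 502.
  // Only reasonable for requests that can be repeated without side effects.
  async function postWithRetry(url, body, retries = 1) {
    const response = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
    if (response.status === 502 && retries > 0) {
      return postWithRetry(url, body, retries - 1);
    }
    return response;
  }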

Output of traefik version: (What version of Traefik are you using?)

I get the problem with the Docker release as well as with my Homebrew install.

Homebrew traefik version:

Version:      dev
Codename:     cheddar
Go version:   go1.10
Built:        I don't remember exactly
OS/Arch:      darwin/amd64

Docker traefik version:

Version:      v1.5.2
Codename:     cancoillotte
Go version:   go1.9.4
Built:        2018-02-12_10:56:31AM
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

debug = true

logLevel = "DEBUG"
defaultEntryPoints = ["http"]

[entryPoints]
  [entryPoints.ping]
  address = ":8082"

  [entryPoints.api]
  address = ":8080"

  [entryPoints.http]
  address = ":80"

[retry]

[ping]
entryPoint = "ping"

[api]
entryPoint = "api"
  [api.statistics]

[file]
[backends]
  [backends.bct]
    [backends.bct.servers]
      [backends.bct.servers.server0]
        # url = "http://docker.for.mac.host.internal:6543"
        url = "http://localhost:6543"


[frontends]
  [frontends.bct]
    entryPoints = ["http"]
    backend = "bct"

[docker]
  endpoint = "unix:///var/run/docker.sock"
  # domain = "docker.for.mac.host.internal"
  domain = "localhost"
  watch = true
  exposedbydefault = false
@poVoq commented Apr 28, 2018

Might be unrelated, but I have similar seemingly random 502 responses on my server running Nextcloud in docker (the Apache image) since upgrading to Traefik 1.6rc4-6.

@fabiocmazzo commented Jul 7, 2018

I get a 502 error using Traefik with Keycloak on port 8443:


time="2018-07-07T02:07:01Z" level=debug msg="vulcand/oxy/forward/http: begin ServeHttp on request" Request="{\"Method\":\"GET\",\"URL\":{\"Scheme\":\"http\",\"Opaque\":\"\",\"User\":null,\"Host\":\"10.0.1.145:8443\",\"Path\":\"\",\"RawPath\":\"\",\"ForceQuery\":false,\"RawQuery\":\"\",\"Fragment\":\"\"},\"Proto\":\"HTTP/2.0\",\"ProtoMajor\":2,\"ProtoMinor\":0,\"Header\":{\"Accept\":[\"application/json\"],\"Accept-Encoding\":[\"gzip, deflate, br\"],\"Accept-Language\":[\"en-US,en;q=0.9\"],\"Cookie\":[\"__zlcmid=n8hRxJL8qzyg92; __zlcprivacy=1; _ga=GA1.2.136602004.1530667337\"],\"Referer\":[\"https://users.myserver.com/auth/admin/master/console/\"],\"User-Agent\":[\"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36\"]},\"ContentLength\":0,\"TransferEncoding\":null,\"Host\":\"users.myserver.com\",\"Form\":null,\"PostForm\":null,\"MultipartForm\":null,\"Trailer\":null,\"RemoteAddr\":\"10.255.0.2:59804\",\"RequestURI\":\"/auth/admin/master/console/config\",\"TLS\":null}"

time="2018-07-07T02:07:01Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://10.0.1.145:8443, code: 502, Length: 11, duration: 689.711µs tls:version: 303, tls:resume:false, tls:csuite:c02f, tls:server:users.myserver.com"

@juliens (Member) commented Jul 11, 2018

Now, with the latest 1.6.5 and the new 1.7-rc1, you will get more logs about 502s (at log level DEBUG). Could you provide those logs?
Thanks

@l0b0 commented Jul 17, 2018

I am also seeing an issue like this with both the 1.6.5 and 1.7-rc1 official Docker images. I believe it started happening after updating traefik locally yesterday, especially because colleagues and CI running the same code do not see the same issue.

The relevant lines from the Docker logs:

time="2018-07-17T03:57:54Z" level=debug msg="'502 Bad Gateway' caused by: EOF"
time="2018-07-17T03:57:54Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://172.18.0.5:8000, code: 502, Length: 11, duration: 297.756355ms tls:version: 303, tls:resume:false, tls:csuite:c02f, tls:server:projectname.localhost"

A tight while curl --fail loop talking to the proxy reliably fails on the second request, but the same loop talking directly to the backend doesn't fail. Adding sleep 1 inside the (proxy) loop doesn't help, but adding sleep 5 does make it reliable.
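
For reference, a rough Node equivalent of that reproduction loop (a sketch only; the hostname and path are placeholders, and certificate verification is disabled to accept a local self-signed certificate):

  // Keep hitting the proxied endpoint until the first non-2xx response,
  // roughly mirroring `while curl --fail ...`.
  const https = require('https');

  function hit(n) {
    https.get(
      // Placeholder host/path; rejectUnauthorized accepts the self-signed cert.
      { hostname: 'projectname.localhost', path: '/api/', rejectUnauthorized: false },
      (res) => {
        console.log(`request ${n}: ${res.statusCode}`);
        res.resume(); // drain the response body so the socket is released
        if (res.statusCode < 400) hit(n + 1); // stop on the first failure
      }
    ).on('error', (err) => console.error(`request ${n} failed: ${err.message}`));
  }

  hit(1);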

@juliens (Member) commented Jul 17, 2018

@l0b0 Thanks for your interest in the project.

I can't reproduce it with just while curl --fail (I tried with httpbin as the service). Could you provide your configuration, and maybe more information about your service?

@l0b0 commented Jul 17, 2018

@juliens The URL I'm hitting is a Django OAuth2 API endpoint, part of the same Docker Compose configuration as traefik. Following is the relevant (anonymised) configuration.

Dockerfile
FROM traefik

EXPOSE 80
EXPOSE 8080

ADD traefik.dev.toml traefik.toml
traefik.dev.toml
logLevel = "DEBUG"

defaultEntryPoints = ["https","http"]

[entryPoints]
    [entryPoints.http]
        address = ":80"
        [entryPoints.http.redirect]
            entryPoint = "https"
    [entryPoints.https]
        address = ":443"
        [entryPoints.https.tls]
            [[entryPoints.https.tls.certificates]]
                certFile = "/certs/servicename.localhost.crt"
                keyFile = "/certs/servicename.localhost.key"

[web]

[docker]
    endpoint = "unix:///var/run/docker.sock"
    domain = "docker.localhost"
    watch = true
    exposedByDefault = false
Docker Compose configuration
version: "3.5"

services:
  reverse-proxy:
    ports:
      - 80:80
      - 443:443
      - 8080:8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./tls:/certs

  backend:
    expose:
      - 8000
    labels:
      traefik.backend: backend
      traefik.enable: true
      traefik.frontend.rule: "Host:${HOSTNAME}; PathPrefix: /api/,/admin/"
@cgoeller commented Jul 27, 2018

I have the same issues when running Portainer 1.18.1 behind Traefik 1.5.4.
After the first 502, it takes approx. 5 seconds to recover.

The issues came up after upgrading to portainer 1.18.1 (with agent containers).

When connecting to the Portainer UI directly (by using a port mapping in Docker), everything works fine.

Env: docker swarm mode (18.03 CE) on RHEL 7.4

@Xachman commented Aug 4, 2018

I'm running 1.6.4 and I get a bad gateway in a browser that was logged into a WordPress site, probably after the session expired or something. After I clear the cookies it works fine. If I open the site in another browser it shows as normal as well.

@arnulfojr commented Aug 4, 2018

I'm also getting these seemingly random 502s:

time="2018-08-04T15:25:04Z" level=debug msg="vulcand/oxy/roundrobin/rr: Forwarding this request to URL" Request="{\"Method\":\"GET\",\"URL\":{\"Scheme\":\"\",\"Opaque\":\"\",\"User\":null,\"Host\":\"\",\"Path\":\"/ar-studio-payment-api/graphql/\",\"RawPath\":\"\",\"ForceQuery\":false,\"RawQuery\":\"\",\"Fragment\":\"\"},\"Proto\":\"HTTP/2.0\",\"ProtoMajor\":2,\"ProtoMinor\":0,\"Header\":{\"Accept\":[\"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\"],\"Accept-Encoding\":[\"gzip, deflate, br\"],\"Accept-Language\":[\"en-US,en;q=0.9,es;q=0.8\"],\"Cache-Control\":[\"max-age=0\"],\"Cookie\":[\"uslk_e=MDI2NTgxNzEtZjViZS1jMTNjLWQ2Y2ItZWIzYjgwMGQ1NGRk~~~~~~~1~; ajs_group_id=null; ajs_anonymous_id=%22f3e6b5d3-9708-469d-a479-89aea1bb82c3%22; ajs_user_id=%22MDI2NTgxNzEtZjViZS1jMTNjLWQ2Y2ItZWIzYjgwMGQ1NGRk%22; uslk_s=Idle%3B0~~0~0~0~~\"],\"Upgrade-Insecure-Requests\":[\"1\"],\"User-Agent\":[\"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36\"]},\"ContentLength\":0,\"TransferEncoding\":null,\"Host\":\"my.server.com\",\"Form\":null,\"PostForm\":null,\"MultipartForm\":null,\"Trailer\":null,\"RemoteAddr\":\"172.18.0.1:49730\",\"RequestURI\":\"/prefixPath/graphql/\",\"TLS\":null}" ForwardURL="http://172.18.0.4:4000"
time="2018-08-04T15:25:04Z" level=debug msg="vulcand/oxy/forward: begin ServeHttp on request" Request="{\"Method\":\"GET\",\"URL\":{\"Scheme\":\"http\",\"Opaque\":\"\",\"User\":null,\"Host\":\"172.18.0.4:4000\",\"Path\":\"\",\"RawPath\":\"\",\"ForceQuery\":false,\"RawQuery\":\"\",\"Fragment\":\"\"},\"Proto\":\"HTTP/2.0\",\"ProtoMajor\":2,\"ProtoMinor\":0,\"Header\":{\"Accept\":[\"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\"],\"Accept-Encoding\":[\"gzip, deflate, br\"],\"Accept-Language\":[\"en-US,en;q=0.9,es;q=0.8\"],\"Cache-Control\":[\"max-age=0\"],\"Cookie\":[\"uslk_e=MDI2NTgxNzEtZjViZS1jMTNjLWQ2Y2ItZWIzYjgwMGQ1NGRk~~~~~~~1~; ajs_group_id=null; ajs_anonymous_id=%22f3e6b5d3-9708-469d-a479-89aea1bb82c3%22; ajs_user_id=%22MDI2NTgxNzEtZjViZS1jMTNjLWQ2Y2ItZWIzYjgwMGQ1NGRk%22; uslk_s=Idle%3B0~~0~0~0~~\"],\"Upgrade-Insecure-Requests\":[\"1\"],\"User-Agent\":[\"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36\"]},\"ContentLength\":0,\"TransferEncoding\":null,\"Host\":\"my.server.com\",\"Form\":null,\"PostForm\":null,\"MultipartForm\":null,\"Trailer\":null,\"RemoteAddr\":\"172.18.0.1:49730\",\"RequestURI\":\"/prefixPath/graphql/\",\"TLS\":null}"
time="2018-08-04T15:25:04Z" level=debug msg="vulcand/oxy/forward/http: begin ServeHttp on request" Request="{\"Method\":\"GET\",\"URL\":{\"Scheme\":\"http\",\"Opaque\":\"\",\"User\":null,\"Host\":\"172.18.0.4:4000\",\"Path\":\"\",\"RawPath\":\"\",\"ForceQuery\":false,\"RawQuery\":\"\",\"Fragment\":\"\"},\"Proto\":\"HTTP/2.0\",\"ProtoMajor\":2,\"ProtoMinor\":0,\"Header\":{\"Accept\":[\"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\"],\"Accept-Encoding\":[\"gzip, deflate, br\"],\"Accept-Language\":[\"en-US,en;q=0.9,es;q=0.8\"],\"Cache-Control\":[\"max-age=0\"],\"Cookie\":[\"uslk_e=MDI2NTgxNzEtZjViZS1jMTNjLWQ2Y2ItZWIzYjgwMGQ1NGRk~~~~~~~1~; ajs_group_id=null; ajs_anonymous_id=%22f3e6b5d3-9708-469d-a479-89aea1bb82c3%22; ajs_user_id=%22MDI2NTgxNzEtZjViZS1jMTNjLWQ2Y2ItZWIzYjgwMGQ1NGRk%22; uslk_s=Idle%3B0~~0~0~0~~\"],\"Upgrade-Insecure-Requests\":[\"1\"],\"User-Agent\":[\"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36\"]},\"ContentLength\":0,\"TransferEncoding\":null,\"Host\":\"my.server.com\",\"Form\":null,\"PostForm\":null,\"MultipartForm\":null,\"Trailer\":null,\"RemoteAddr\":\"172.18.0.1:49730\",\"RequestURI\":\"/prefixPath/graphql/\",\"TLS\":null}"
time="2018-08-04T15:21:59Z" level=debug msg="Upstream ResponseWriter of type *pipelining.writerWithoutCloseNotify does not implement http.CloseNotifier. Returning dummy channel."
time="2018-08-04T15:21:59Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 172.18.0.4:4000: connect: connection refused"
time="2018-08-04T15:21:59Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://172.18.0.4:4000, code: 502, Length: 11, duration: 1.205886ms tls:version: 303, tls:resume:true, tls:csuite:c02f, tls:server:my.server.com"

(I'm using signed certificates, but only the GraphQL service shows this behaviour, and I have a redirect from HTTP to HTTPS.)

My setup uses Traefik's Docker container, with the labels specified on a user-created network (the GraphQL service is also a container).

# docker-compose labels
    labels:
      traefik.frontend.rule: "PathPrefix:/prefixPath"
      traefik.docker.network: "my-network"
      traefik.port: "4000"
    expose:
      - 4000
    networks:
      - my-network
      - db

If you need more logs or configuration files, I'll be happy to provide them, just let me know

@fewi commented Aug 11, 2018

Having the same issues with the traefik:1.6.5-alpine Docker image and many backend Docker containers running Node.js and Express.

Here are my logs related to the 502s:


time="2018-08-11T08:39:31Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://172.18.0.11:3000, code: 502, Length: 11, duration: 18.567411875s tls:version: 303, tls:resume:false, tls:csuite:c02f, tls:server:test.srv01.mydomain.com",
time="2018-08-11T08:39:17Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://172.18.0.9:3000, code: 502, Length: 11, duration: 18.359443477s tls:version: 303, tls:resume:false, tls:csuite:c02f, tls:server:test.srv01.mydomain.com",
time="2018-08-11T08:39:31Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 172.18.0.11:3000: connect: no route to host",
time="2018-08-11T08:39:17Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 172.18.0.9:3000: connect: no route to host"
@egilli commented Aug 16, 2018

I was having the same issue with traefik:1.5.2-alpine. Stopping/removing all containers using Traefik, removing the Traefik Docker network, and re-creating everything solved my problem, and the issue hasn't reappeared yet (I did this a week ago).

I know it doesn't really help with the real issue, and it probably won't work for everybody, but I put this here in case it helps some people...

@Laughlete commented Aug 20, 2018

I'm having the same issue with traefik:1.6 running on top of my docker swarm.

When this happened I ran docker stats on each node in my swarm to check the containers' CPU usage and noticed a certain container using a lot of CPU (up to 100% at times). I killed this container and I haven't seen the issue occur since.

@l0b0 commented Aug 22, 2018

My colleagues managed to fix this issue in our project. In essence Traefik doesn't support HTTP keep-alive, which causes errors when the client tries to reuse connections. This looks like the same issue as #615 and #1046.

@juliens (Member) commented Oct 9, 2018

@l0b0 We fixed some keep-alive problems in #3740; do you still have the problem in the latest version?

@mhumeSF commented Oct 24, 2018

Is this issue still ongoing in 1.7.3? I was experiencing similar sporadic 502 behavior; I learned it was due to a separate issue.

@leodouglas commented Nov 1, 2018

Hey guys, I'm still having this problem.

time="2018-11-01T02:57:16Z" level=debug msg="'502 Bad Gateway' caused by: dial tcp 10.0.0.15:8083: connect: no route to host"
time="2018-11-01T02:57:16Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://10.0.0.15:8083, code: 502, Length: 11, duration: 2.993158666s"
time="2018-11-01T02:57:16Z" level=debug msg="vulcand/oxy/forward/http: completed ServeHttp on request" Request="{\"Method\":\"OPTIONS\",\"URL\":{\"Scheme\":\"http\",\"Opaque\":\"\",\"User\":null,\"Host\":\"10.0.0.15:8083\",\"Path\":\"\",\"RawPath\":\"\",\"ForceQuery\":false,\"RawQuery\":\"\",\"Fragment\":\"\"},\"Proto\":\"HTTP/1.1\",\"ProtoMajor\":1,\"ProtoMinor\":1,\"Header\":{\"Accept\":[\"*/*\"],\"Accept-Encoding\":[\"gzip\"],\"Accept-Language\":[\"pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7,es;q=0.6,ru;q=0.5,de;q=0.4\"],\"Access-Control-Request-Headers\":[\"authorization,content-type\"],\"Access-Control-Request-Method\":[\"GET\"],\"Cf-Connecting-Ip\":[\"186.86.30.62\"],\"Cf-Ipcountry\":[\"CO\"],\"Cf-Ray\":[\"472b17b83b23b590-BOG\"],\"Cf-Visitor\":[\"{\\\"scheme\\\":\\\"https\\\"}\"],\"Connection\":[\"Keep-Alive\"],\"Origin\":[\"https://quejateaqui.com.co\"],\"Referer\":[\"https://quejateaqui.com.co/\"],\"User-Agent\":[\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36\"],\"X-Forwarded-For\":[\"186.86.30.62\"],\"X-Forwarded-Proto\":[\"https\"]},\"ContentLength\":0,\"TransferEncoding\":null,\"Host\":\"api.quejateaqui.com.co\",\"Form\":null,\"PostForm\":null,\"MultipartForm\":null,\"Trailer\":null,\"RemoteAddr\":\"10.255.0.2:54504\",\"RequestURI\":\"/api/v1/complaint/top-complaints\",\"TLS\":null}"

I am using traefik V1.7.4 / MAROILLES.

@deric commented Nov 15, 2018

We're having the same issue with Traefik v1.7.4 (same problem appeared in previous versions we've tried). Traefik is "consistently" throwing 502 for approximately 0.5%-2% of requests. The issue becomes more apparent with a lower number of backend instances.

I wasn't able to reproduce the issue with simple GET requests. With a more complex payload, 30 concurrent connections against 4 Traefik instances with 2 backends, and 1000 requests in total:

Average request time: 0.8031490905199999
0. percentile: 0.0697
10. percentile: 0.0987
20. percentile: 0.1109
30. percentile: 0.1275
40. percentile: 0.5045
50. percentile: 0.8564
60. percentile: 1.0508
70. percentile: 1.0612
80. percentile: 1.0687
90. percentile: 1.7269
100. percentile: 3.0648
Errors: 20/1000 2.00%

The same experiment with HAProxy:

Average request time: 0.9934568203530013
0. percentile: 0.0741
10. percentile: 0.1035
20. percentile: 0.1366
30. percentile: 0.4125
40. percentile: 0.9118
50. percentile: 1.0552
60. percentile: 1.0616
70. percentile: 1.0667
80. percentile: 1.0824
90. percentile: 2.1481
100. percentile: 3.0761
Errors: 0/1000 0.00%

The performance is somewhat comparable, but HAProxy is not randomly throwing 502s.

With more backend instances there are fewer errors, but it still hits at least 0.5% of traffic, which is a huge problem when you are targeting availability starting at 99.9%.

@leodouglas commented Nov 15, 2018

I fixed my problem by updating the Docker version.
I was using 18.03.1; the problem was solved with version 18.09 CE.

@huihuiyang commented Dec 2, 2018

+1
We hit the same issue here: 502s.
I upgraded Traefik to v1.7.4; now less than 1% of requests throw 502 Bad Gateway.

I have tried recreating all the Traefik pods and downgrading to v1.6.4; it still doesn't work.
Recreating the k8s cluster doesn't work either.

No idea what's going on. Luckily it happens in our dev cluster, so we still have a chance to fix it or stop using Traefik.

@amclay commented Dec 4, 2018

+1 we've had a similar issue. Happens when we have high request load, resulting in ~ 1% 500s (but no backend's reporting 500s themselves)

@huihuiyang commented Dec 4, 2018

+1 we've had a similar issue. Happens when we have high request load, resulting in ~ 1% 500s (but no backend's reporting 500s themselves)

Hi, @amclay

Did you upgrade to V1.7.4?

@jhinklew commented Dec 5, 2018

We're seeing the same thing @huihuiyang and @amclay described: the K8s Traefik ingress throwing 502s.

Upgrading to 1.7.5 didn't seem to help.

K8s v1.10.3 on AWS, installed via Kops.

@mhumeSF commented Dec 5, 2018

@jhinklew Are you utilizing traefik health checks on the endpoint?

@huihuiyang commented Dec 5, 2018

@jhinklew Same here. We use K8s v1.9.4 on AWS. We may have resolved this issue yesterday.
What we did:

    • Recreate the entire k8s cluster
    • Redeploy all the traefik with the following parameters
      --retry.attempts=10
      --kubernetes.namespaces=xxx
      --idletimeout=60s
      --forwardingtimeouts.responseheadertimeout=60s

After that, all the 500s and 502s disappeared. We are still working on another issue at the moment.

@CumpsD commented Dec 6, 2018

We are having the same problem with 1.7.5 (we had the problem on 1.6.6 and then upgraded).

We tried the timeout settings mentioned above, without any success.

Our logs look like this:

time="2018-12-06T12:45:44Z" level=debug msg="'502 Bad Gateway' caused by: EOF"
time="2018-12-06T12:45:44Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://xxx:26607, code: 502, Length: 11, duration: 1.023706805s tls:version: 303, tls:resume:true, tls:csuite:c02f, tls:server:crab-import.postinfo.staging-xxxx.vlaanderen"

No k8s involved, we are running on EC2 with Nomad as scheduler, using all Docker containers.

@GarethLewin commented Dec 6, 2018

@CumpsD what version of docker are you running? There was a report earlier in this thread that a newer docker helps but I can't validate it.

@CumpsD commented Dec 6, 2018

@CumpsD what version of docker are you running? There was a report earlier in this thread that a newer docker helps but I can't validate it.

18.09. I read the report earlier in this thread, but it didn't make a difference; I was already on the latest.

@huihuiyang commented Dec 6, 2018

We have fixed it, though we have no idea what the actual resolution was.

We reverted all Traefik instances to v1.6.4 and isolated Traefik per namespace (--kubernetes.namespaces=xxx). We also rolled the k8s cluster to fix a potential network-proxy issue and added "--retry.attempts=10".

  • Upgrading to v1.6.6 fixed the 500 errors, but still 502s
  • Upgrading to v1.7.4 or v1.7.5 still throws 502s
  • Even a fresh deployment didn't help

Hope this may help someone.

BTW: Our Docker version is 18.06 on k8s. We didn't touch that, as it has worked for a long time.

@nicolas-geniteau commented Dec 27, 2018

We had exactly the same problem. In our case, it was between Traefik v1.7.6 and multiple Node.js instances:

  • 502 in the Traefik logs
  • no trace of these requests in the Node instances' logs

tcpdump shows that the connection carrying this request is closed by a TCP RST:

15:55:01.107956 IP proxy.38934 > node.20101: Flags [P.], seq 3508916203:3508916609, ack 1893329136, win 547, length 406
E.....@.@.:...d...d...N..%..p...P..#!...POST /someroute HTTP/1.1
Host: example.com
User-Agent: curl/7.54.0
Content-Length: 49
Accept: */*
Accept-Encoding: deflate, gzip
Content-Type: application/json
Counter: 1779114827
X-Forwarded-For: 1.2.3.4
X-Forwarded-Host: example.com
X-Forwarded-Port: 443
X-Forwarded-Proto: https
X-Forwarded-Server: proxy
X-Real-Ip: 1.2.3.4

15:55:01.108095 IP node.20101 > proxy.38934: Flags [.], ack 3508916609, win 555, length 0
E..(..@.@./...d...d.N...p....%..P..+.[........
15:55:01.108331 IP proxy.38934 > node.20101: Flags [P.], seq 3508916609:3508916658, ack 1893329136, win 547, length 49
E..Y..@.@.<M..d...d...N..%..p...P..# ...{"token": "token"}
15:55:01.108449 IP node.20101 > proxy.38934: Flags [.], ack 3508916658, win 555, length 0
E..(..@.@./...d...d.N...p....%..P..+.*........
15:55:01.471175 IP node.20101 > proxy.38933: Flags [F.], seq 2831110990, ack 34277271, win 530, length 0
E..(UO@.@..@..d...d.N.....SN....P....6........
15:55:01.471369 IP proxy.38933 > node.20101: Flags [F.], seq 34277271, ack 2831110991, win 530, length 0
E..(.7@.@..X..d...d...N.......SOP... [..
15:55:01.471471 IP node.20101 > proxy.38933: Flags [.], ack 34277272, win 530, length 0
E..(UP@.@..?..d...d.N.....SO....P....5........
15:55:01.478389 IP node.20101 > proxy.38934: Flags [R.], seq 1893329136, ack 3508916658, win 555, length 0
E..(..@.@./...d...d.N...p....%..P..+.&........

The connection (port 38934, from the proxy) is already open (keep-alive). The request arrives at Node, is ACKed by Node (with the correct seq), and is reset just afterwards. We can also see a second connection (port 38933), which has no problem.

I suspected a race condition when a request arrives at Node at exactly the moment it decides to close the connection after the keep-alive timeout (5s on Node >= 8):

import requests
import time

# Sweep the delay between requests across Node's default 5s keep-alive timeout
start = 4.982
end = 5.00
step = 0.0001

session = requests.Session()  # reuse the same connection (keep-alive)
payload = '{"token": "token"}'

sleep_seconds = start
while sleep_seconds < end:
    r = session.post("https://example.com/someroute", data=payload)
    print(r.status_code)
    sleep_seconds += step
    print(f'wait {sleep_seconds}')
    time.sleep(sleep_seconds)

I tested this script in an environment with only one Node backend and only these requests (to have one request every ~5s).

Bingo! I got many 502 errors.

The solution was to set the Node keep-alive timeout higher than Traefik's (90s, not configurable in Traefik, #727). In this case, Traefik will close the connection before Node does, and the race condition is no longer possible.

I am not sure, but I think this is more of a Node.js bug (it should not ACK the packet and then reset the connection just afterwards). However, it would be nice to be able to configure Traefik's keep-alive timeout.
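
For reference, a minimal sketch of that workaround on the Node side, assuming a plain Express app (the port is a placeholder, and the headersTimeout line is an assumption about newer Node versions, where it should also exceed keepAliveTimeout):

  // Keep Node's idle (keep-alive) timeout above Traefik's fixed 90s, so that
  // Traefik always closes the upstream connection before Node does.
  const express = require('express');

  const app = express();
  // ... routes ...

  const server = app.listen(3000); // placeholder port
  server.keepAliveTimeout = 95 * 1000; // > Traefik's 90s idle timeout
  // Assumption: on newer Node versions, headersTimeout should also be larger
  // than keepAliveTimeout, or sockets can still be dropped early.
  server.headersTimeout = 100 * 1000;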

@Jitsusama commented Feb 14, 2019

As an FYI for others stumbling upon this ticket: I ran into a similar but different issue. I was getting the intermittent 502 errors like everyone else, and they appeared to be consistent for a number of seconds before clearing. The back-end Docker Swarm services that Traefik forwards traffic to have multiple network interfaces defined. I had the wrong label applied to them to pin them to a certain Docker network. Intermittently, Traefik would set up the back-end route for a network it did not have connectivity to, which would cause the following debug log entry in Traefik:

msg="'502 Bad Gateway' caused by: dial tcp 10.0.1.38:8000: connect: network is unreachable"

By passing the --docker.network=<my-network-name> parameter to Traefik upon launching, Traefik is now consistently sending traffic to the right network on the back-end services and the 502 errors have gone away.

@RamaSuda commented Feb 14, 2019

Same 502 Bad Gateway issue using Traefik v1.7.7. I am using Kubernetes.

@aimanparvaiz commented Feb 22, 2019

Using Traefik 1.7.9 with EKS 1.11 and facing the same issue - 502 bad gateway for almost 50% of the requests.

@tigrankhachikyan commented Feb 28, 2019

Looks like I had the same issue.
Using traefik in a Swarm.
I have a backend service configured as

backend:
    labels:
      - traefik.frontend.rule=PathPrefixStrip:/api/

So the call to localhost/api/cities/all was failing on every second request (502).

I updated my Traefik Docker container from v1.6 to the latest 1.7.9-alpine, rebuilt everything, and now everything seems to be working.

tigrankhachikyan added a commit to tigrankhachikyan/swarm-strapi-nuxt-stack that referenced this issue Feb 28, 2019
@DanielJoyce commented May 23, 2019

On the Node side, setting keepAliveTimeout to 0 (disabled) drastically reduced 502s, but we still get them.

@DanielJoyce commented May 24, 2019

We're now dealing with 502s likely caused by the Google Load Balancer having an Idle Connection Timeout of 10 minutes.

Traefik has a setting for this:

[respondingTimeouts]

  # idleTimeout is the maximum duration an idle (keep-alive) connection will remain idle before closing itself.
  # This needs to be set longer than the GCP load balancer timeout
  idleTimeout = "620s"

EDIT

Okay, it's really unclear, but it seems respondingTimeouts.idleTimeout is the same as the old freestanding "idleTimeout"; it has just been moved under respondingTimeouts?

@natlibfi-arlehiko commented Jun 11, 2019

The solution was to set node keep alive timeout higher than traefik (90s, not configurable on traefik #727 ).

@nicolas-geniteau Where is that 90s timeout set? Do you mean that's Traefik's timeout?

@nicolas-geniteau commented Jun 11, 2019

@natlibfi-arlehiko https://github.com/containous/traefik/blob/master/pkg/server/roundtripper.go#L50 (on master) and https://github.com/containous/traefik/blob/v1.7/server/server_loadbalancer.go#L248 (on v1.7)

90s is the non-configurable timeout after which Traefik closes the connection to the backend. You have to set a higher value on your backend to be sure that Traefik closes the connection before the backend does.

@glebsam commented Jul 23, 2019

This may be useful: to disable keep-alives on the Traefik side, just set this in the Traefik config:

MaxIdleConnsPerHost = -1

I was unable to find any documentation regarding negative values for this parameter, but this link seems to explain why it works: https://github.com/golang/go/blob/release-branch.go1.12/src/net/http/transport.go#L743

This is how I fixed the integration with Sentry (it uses uWSGI in its official Docker image). The parameter works with Traefik v1.7.12 as well as v2.0.0-beta1.
By the way, I could not fix the issue just by decreasing idleConnTimeout (from #4983).

@zq-david-wang commented Sep 30, 2019

I am experiencing a similar issue, and there are a lot of insert_failed counts on the host running Traefik, while there are none on another node without Traefik. (My kernel is 4.4.190-1.el7.elrepo.x86_64.)

I have a theory that this may have something to do with tcp_tw_reuse (and tcp_tw_recycle on old kernels); I am running tests to confirm this.

Could you guys help check the following on your Traefik host:

  1. the insert_failed count in conntrack -S
  2. whether tcp_tw_reuse is enabled: sysctl -a | grep _tw_

[edit]
Ignore my comments; I still get 502 responses according to the Traefik log...

@dduportal (Contributor) commented Sep 30, 2019

Hi @koliyo, do you have a reproduction case with the latest Traefik v2.0.1?

We are removing the "P1" priority as this issue is old and not really reproducible.

@BenjaminBeck commented Oct 19, 2019

The update from 1.7.6 to 1.7.18 did not solve this issue for me, but the MaxIdleConnsPerHost = -1 config did. It happened with many (>100) DELETE requests at the same time.

@jachzen commented Nov 2, 2019

I had the same issue with Traefik 2.0.4. After setting --providers.docker.network=XXXX the issue didn't come up anymore.
