
Performance drop when using HTTPS together with compression in Kubernetes #1908

Closed
miadabrin opened this issue Aug 1, 2017 · 33 comments
Labels
area/provider/k8s/ingress, kind/bug/possible (a possible bug that needs analysis before it is confirmed or fixed), status/5-frozen-due-to-age
Milestone
1.4

Comments

@miadabrin

miadabrin commented Aug 1, 2017

Do you want to request a feature or report a bug?

Report a bug

What did you do?

I have been using version 1.3.4 with Kubernetes as an ingress controller and compression enabled.
Performance becomes very bad after 3 or 4 hours.

What did you expect to see?

I expected the performance to remain the same.

What did you see instead?

Requests take too long to respond, especially static resources like JS files. The issue goes away when I restart the pod.

Output of traefik version: (What version of Traefik are you using?)

1.3.4

What is your environment & configuration (arguments, toml, provider, platform, ...)?

defaultEntryPoints = ["http"]
logLevel = "WARN"
MaxIdleConnsPerHost = 1000

[entryPoints]
  [entryPoints.http]
  address = ":80"
  compress = true
@juliens added the kind/bug/possible and area/provider/k8s/ingress labels on Aug 1, 2017
@ldez
Member

ldez commented Aug 1, 2017

Could you provide more information about CPU and RAM consumption?

@miadabrin
Author

I have been running the pod on a machine with 4 CPU cores and 8 GB of RAM. The CPU usage is not high most of the time but surges from time to time. RAM is around 6 GB used and 2 GB free most of the time. I have no resource limits on the Traefik pod.

@timoreimann
Contributor

@miadabrin thanks for reporting.

Could you provide metrics on Traefik, especially CPU and memory consumption over time?

Are you possibly able to send the same kind of requests directly towards the proxy? If so, do you see any difference in performance behavior? What happens if you swap out Traefik with Nginx or HAProxy?

@miadabrin
Author

Unfortunately the metrics are gone after restarts, but I will see if I can provide some. I haven't tried Nginx or HAProxy on this workload; I might try it and post the results here. I can't send the same workload directly to the proxy since all traffic is ingress-based, so it would not be an accurate comparison.
Whatever the problem is, it might develop as time passes, or it might happen because Traefik is starved of CPU for a number of seconds at some point.

@ConorNevin

I've also seen similar behaviour since updating to 1.3.x. Traefik behaves as expected for several hours/days handling around 30k requests per minute, but during a load test that pushed it to 120k, all the Traefik pods became unresponsive and the requests all timed out. I'll dig out the metrics when I get back into work.

@kachkaev
Contributor

kachkaev commented Sep 10, 2017

I might also be facing something similar on a single-machine test cluster (KVM, 2 CPUs/2 GB RAM). Traefik works fine, but consistently enters some odd state when I load one specific static page through it. There is nothing special about the page itself, just an HTML page with a few images served by nginx (about 2 MB total).

Serving that page via a port (without ingress) does not show any anomalies; serving it via Traefik, however, makes the netdata CPU charts look crazy. Here is the CPU load after the page has been hit (the load is normally around 10% since no one except me is visiting the server):

[screenshot: netdata CPU chart spiking after the page hit]

The peak sometimes lasts over one minute, and only killing the Traefik pod brings things back to normal. The CPU seems to be so busy because of excessive IO operations (iotop on the host):

[screenshot: iotop output on the host showing heavy IO]

RAM and swap don't look scarce and don't change much during the peaks:

[screenshot: RAM and swap charts during the peaks]

Saw this both with 1.3 and 1.4-rc2. Ingress controller flavour: official helm chart.

@ldez added the priority/P0 (needs hot fix) label on Sep 10, 2017
@timoreimann
Contributor

@kachkaev thanks a lot for the detailed analysis, very much appreciated. 👏

You mention that "only killing traefik pod brings things back to normal". Does that also hold when you stop letting Traefik serve any traffic at all? Or would the CPU utilization level off as soon as Traefik is given time to recover?

Do you possibly have a chance to send a SIGQUIT to one locked up Traefik pod and paste the output? It should give an overview of all running goroutines. I'm kind of speculating we might end up in a situation where goroutines pile up like crazy and never terminate (i.e., goroutine leaking).
Note that a side effect of sending SIGQUIT to Traefik (or any Go program in general) is that the process will die hard with an exit code of 2. If you can't afford that, you may want to take one pod out of rotation first (which is doable by adjusting the selector labels on one pod if your Traefik pods are fronted by a Kubernetes Service).
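
For illustration, the whole sequence could look roughly like this (the pod name and label are hypothetical, adjust to your setup):

# Take the pod out of the Service by changing the label its selector matches
kubectl label pod traefik-abc123 app=traefik-debug --overwrite

# On the node running the pod, send SIGQUIT via Docker; the process dumps
# all goroutine stacks to stderr and exits with code 2
docker kill -s SIGQUIT <traefik-container-id>

# The dump ends up in the container logs
docker logs <traefik-container-id> > traefik-sigquit.txt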

Thanks again -- I'm super keen on figuring out what's going wrong here.

@kachkaev
Contributor

kachkaev commented Sep 10, 2017

Hi @timoreimann, really glad you're keen to investigate this! I'm running traefik on a single-VM cluster for personal projects, so any experiments are easy to run.

Regarding "killing traefik pod": it seems that once Traefik has started producing a lot of iowait, stopping serving things via it does not help, at least not immediately. The CPU chart in netdata was still pink 3 minutes after closing the browser tab (i.e. cancelling all requests). Although I was not contacting Traefik myself, there was a small chance of the usual hits from health checkers and bots, but those should be very rare and go to different URIs, which never triggered that much load.

Going to k8s dashboard and clicking on "delete traefik pod" button restarted traefik and reduced the CPU load to 5-10 percent immediately:

[screenshot: CPU load dropping immediately after the pod restart]

Interestingly, RAM and swap charts look like there's been nothing happening at all:

[screenshot: flat RAM and swap charts]

Killing processes with signals is a bit advanced for me, but it looks like I've managed to do it :–) Running docker kill -s SIGQUIT ${CONTAINER_ID} and then docker logs ${CONTAINER_ID} gave me 4K lines of stack traces after the mention of SIGQUIT: quit. See traefik-sigquit.txt. How does this fit your hypothesis?

One other thing I tried earlier today was manually adding a Docker backend to the toml and linking a container via this backend (i.e. using the same Traefik instance, but bypassing k8s). I did this when I was not sure whether the load was to do with Traefik or with k8s logging, Docker layers, etc. Unfortunately, the same odd state kept being triggered when I was opening my static page via the Docker backend, just the same as in k8s. So the bug might not be ingress-specific. Bypassing Traefik entirely, either by exposing the Docker port or by switching the k8s Service to NodePort, does not produce any peaks in the netdata charts.

The most recent tests were done with traefik 1.4.0-rc2. What other data would you be interested to dig out?

@timoreimann
Contributor

Thanks for the stack traces; I'll try to find some time next week to analyze the dump.

One more thing: I suppose that web page of yours is non-public. Do you think it's possible to generalize/simplify/anonymize it so that it can be used by others (i.e., me) while still being able to reproduce the bug? That'd make things tremendously easier.

@timoreimann
Contributor

timoreimann commented Sep 10, 2017

@kachkaev also, have you possibly set the MaxIdleConnsPerHost Traefik option to anything else but the default? If so, which value did you choose?

@kachkaev
Contributor

kachkaev commented Sep 10, 2017

Great!

The page is somewhat private, yes. I'm not sure how much time it will take me to anonymise it and at what point it will stop producing the bug, but if you want, I can share the image with you via a private channel. Feel free to email me or DM on twitter and ask for the details.

Re MaxIdleConnsPerHost: not sure what this parameter is :–) Where should I check it?

@timoreimann
Contributor

@kachkaev if you use the TOML file for configuring Traefik and you have customized the parameter, then you should find it in your configuration file with something like this:

MaxIdleConnsPerHost = 150

If you can't find the option (or you can only find it commented out), then you're running off of the default setting (which would be good to know too).

I'll reach out to you via email.

@kachkaev
Contributor

kachkaev commented Sep 10, 2017

MaxIdleConnsPerHost is not part of the helm chart I've been using, and I also haven't found it in the generated configmap, so I assume this parameter is not set.

Thanks for the email!

@timoreimann
Contributor

The dump seems to confirm the iowait problem: goroutines sitting at internal/poll.runtime_pollWait() are leading the board with 28 out of 150 goroutines in total.

I'm wondering if the exact number of 150 is just a coincidence or indicates some artificial limit. @kachkaev, could you SIGQUIT another stuck Traefik pod and check the number of goroutines? (If it's easier for you, a grep goroutine <your dump> | wc -l is all I need.)

@timoreimann
Contributor

timoreimann commented Sep 11, 2017

Goroutine count run-down per grep -A1 -E ^goroutine traefik-sigquit.txt |grep -v goroutine | grep -v \-| sed 's/\(.*\)([a-fx0-9, .]*)$/\1()/' | sort | uniq -c | sort -nr (removes the value addresses in order to enable grouping):

     28 internal/poll.runtime_pollWait()
     15 net/http.(*http2serverConn).writeHeaders()
     12 net/http.(*persistConn).writeLoop()
     10 net/http.(*persistConn).roundTrip()
      9 net/http.(*http2serverConn).writeDataFromHandler()
      7 sync.runtime_notifyListWait()
      2 sync.runtime_Semacquire()
      2 sync.(*Mutex).Unlock()
      2 reflect.Value.Field()
      2 net/textproto.canonicalMIMEHeaderKey()
      2 net/http.(*persistConn).roundTrip.func1()
      2 net/http.(*persistConn).readLoop()
      2 net/http.(*http2serverConn).readFrames()
      2 net/http.(*Transport).getConn.func2.1()
      1 vendor/golang_org/x/net/http2/hpack.(*Decoder).parseHeaderFieldRepr()
      1 unicode.SimpleFold()
      1 time.Time.Add()
      1 time.Sleep()
      1 syscall.Read()
      1 sync.runtime_SemacquireMutex()
      1 sync.(*Map).Load()
      1 runtime.usleep()
      1 runtime.gopark()
      1 reflect.valueInterface()
      1 os/signal.signal_recv()
      1 net/textproto.MIMEHeader.Set()
      1 net/http.http2cloneHeader()
      1 net/http.fixPragmaCacheControl()
      1 net/http.Header.Del()
      1 net/http.(*transferWriter).probeRequestBody()
      1 net/http.(*http2sorter).Keys()
      1 net/http.(*http2serverConn).serve()
      1 net/http.(*http2serverConn).newWriterAndRequestNoBody()
      1 net/http.(*http2priorityWriteScheduler).CloseStream()
      1 net/http.(*Transport).getConn.func4()
      1 net/http.(*Transport).dialConn()
      1 net/http.(*Request).write()
      1 net.IP.To4()
      1 net.(*netFD).init()
      1 net.(*Resolver).goLookupIPCNAMEOrder()
      1 main.run.func2()
      1 github.com/containous/traefik/vendor/github.com/vulcand/oxy/forward.(*websocketForwarder).serveHTTP()
      1 github.com/containous/traefik/vendor/github.com/thoas/stats.(*Stats).Data()
      1 github.com/containous/traefik/vendor/github.com/golang/glog.(*loggingT).flushDaemon()
      1 github.com/containous/traefik/vendor/github.com/eapache/channels.(*InfiniteChannel).infiniteBuffer()
      1 github.com/containous/traefik/types.CanonicalDomain()
      1 github.com/containous/traefik/server.(*Server).listenSignals()
      1 github.com/containous/traefik/server.(*Server).listenProviders()
      1 github.com/containous/traefik/server.(*Server).listenConfigurations()
      1 github.com/containous/traefik/server.(*Server).Wait()
      1 github.com/containous/traefik/provider/kubernetes.(*clientImpl).WatchAll.func1()
      1 github.com/containous/traefik/acme.(*ACME).CreateLocalConfig.func1()
      1 fmt.Fprintf()
      1 encoding/json.stringEncoder()
      1 encoding/json.intEncoder()
      1 context.WithCancel()
      1 context.(*valueCtx).Err()
      1 compress/flate.(*huffmanBitWriter).writeCode()
      1 bytes.(*Buffer).Write()

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

I just killed the Traefik container two more times with docker kill -s SIGQUIT ${CONTAINER_ID} and got these new dumps:

Each new test was a bit faster than the previous one (I'm getting skilled at this 😃), and the number of goroutine occurrences went down too. Might be a hint at something, but might also be just a coincidence.

Will be happy to run more tests if needed!

@emilevauge
Member

emilevauge commented Sep 11, 2017

@kachkaev ouch, that's unfortunate :/

Another way to debug goroutines and memory in Traefik: enable debug mode with the --debug arg, and the web provider (on port 8080, for example).
You will then be able to access goroutine and memory stats at http://localhost:8080/debug/vars.
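
In TOML terms, that's something along these lines (a minimal sketch; the port is just an example):

# enable extra debug endpoints
debug = true

[web]
address = ":8080"
readonly = true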

You can also use https://github.com/divan/expvarmon to follow these stats over time: expvarmon -ports="8080" -vars="Goroutines,mem:memstats.Alloc":

[screenshot: expvarmon monitoring goroutines and memory allocation]

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

Thanks @emilevauge! I added debug = true to the configmap toml and recreated the pod. Now JSON gets returned when I go to https://traefik.example.com/debug/vars.
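
For reference, the goroutine count can be pulled straight out of that JSON with something like this (assuming jq is installed and the field is named Goroutines, as in the expvarmon example above):

# print just the current goroutine count from the expvar endpoint
curl -s https://traefik.example.com/debug/vars | jq '.Goroutines'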

Originally the reported number of Goroutines was between 45 and 55. I hit the 'magic page' three times and although the iowait went crazy again, the CPU load stabilized after a short while:

[screenshot: CPU chart during the three page hits]

Not sure this is related to debug mode, because I saw this behaviour previously too (the pod did not always freeze completely).

Here are the numbers of Goroutines I observed during the test by refreshing the url:

1st hit: 179 151 134
2nd hit: 180 178 132
3rd hit: 178 137 130 
afterwards: 122, then after some time 46

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

I just commented out these lines in toml and restarted the pod:

# [entryPoints.http.redirect]
      # entryPoint = "https"

The results are interesting: hitting the page via the HTTP Traefik endpoint for the first time does not produce any iowait artefacts. Then hitting the same URL with HTTPS creates a peak (not too long in this specific pod run, but something like this has happened before too). Then using HTTP immediately afterwards does create a peak, but a smaller one:

[screenshot: annotated CPU chart showing HTTP vs HTTPS hits]

After a short while HTTP requests stop causing much IO, but once an HTTPS request hits that page again, HTTP ones become impactful again.

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

One more insight: removing compress = true from the [entryPoints.https] toml section solved the problem! No more iowait in the CPU charts until this option is turned on again!

Could the issue be related to gzipping JPEGs? The demo page I privately shared with @timoreimann was a static HTML page with a photo gallery. The container's nginx is configured to gzip a few file types, JPEGs excluded. Perhaps Traefik decides not to double-gzip the text files, but tries gzipping the JPEGs and finds that hard, especially when combined with TLS?

FYI compress = true is on by default in the official helm chart via the gzip.enabled option (default true). Details are in the configmap.yml template.
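
If anyone wants to switch it off while testing, the chart value can be overridden along these lines (assuming the stable/traefik chart; double-check the exact key against your chart version):

# values.yaml
gzip:
  enabled: false

# or on the command line
helm upgrade traefik stable/traefik --set gzip.enabled=false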

@timoreimann
Contributor

timoreimann commented Sep 11, 2017

@kachkaev interesting finding! I'm still struggling to repro the problem locally even with compression enabled and running over TLS. My Traefik configuration file looks like this:

logLevel = "DEBUG"
checknewversion = false
graceTimeOut = 0

defaultEntryPoints = ["http", "https"]

[entryPoints]
    [entryPoints.http]
    address = ":8000"
    compress = true
    [entryPoints.https]
    address = ":8443"
    compress = true
        [entryPoints.https.tls]
            [[entryPoints.https.tls.certificates]]
            CertFile = "server.crt"
            KeyFile = "server.key"

[web]
address = ":8080"
readonly = true

[file]
filename = "rules.toml"

Not sure if either the file provider or the fact that I'm using a self-signed certificate makes a difference.
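
For reference, rules.toml is just a minimal file-provider fragment along these lines (the backend URL and rule here are placeholders, Traefik 1.x syntax):

[backends]
  [backends.backend1]
    # single upstream server for testing
    [backends.backend1.servers.server1]
    url = "http://127.0.0.1:9000"

[frontends]
  [frontends.frontend1]
  backend = "backend1"
    [frontends.frontend1.routes.route1]
    rule = "Host:localhost"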

@timoreimann
Contributor

/cc @ldez who has worked on the compression middleware some time ago.

@emilevauge
Member

@kachkaev Well done 👏! Could you test with compression enabled and v1.3.8?

@timoreimann
Contributor

@emilevauge FWIW, the problem is also present on 1.4 (see #1908 (comment)).

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

Good point @emilevauge! I originally discovered this issue both in 1.3.8 and 1.4.0-rc.2, but was only testing 1.4.0-rc.2 afterwards.

The peaks in 1.3.8 are confirmed. I manually changed the deployment image to traefik:1.3.8, which recreated the pod. Hitting the same page with JPEGs over SSL did generate iowait, similarly to 1.4.0-rc.2. Commenting out compress = true in Traefik's configmap and then manually restarting the pod made the CPU charts look flat too!

Shame that I don't know Go and so can't look through the code, but I hope my findings can still help!

@ldez
Member

ldez commented Sep 12, 2017

I think the main problem is the compression of the JPEG files.

Have you tested without HTTPS?

@timoreimann
Contributor

timoreimann commented Sep 12, 2017

@ldez HTTP seems to work fine; it's only the HTTPS + compression combo that is problematic.

@timoreimann changed the title from "Performance drop after using with kubernetes as ingress controller" to "Performance drop when using HTTPS together with compression" on Sep 12, 2017
@kachkaev
Contributor

kachkaev commented Sep 12, 2017

I believe I have a solution! After looking at my Traefik Deployment yaml once again, I noticed that it had CPU and memory limits automatically configured by the helm chart. These were 200m CPU and 30 MB of RAM. My previously shallow knowledge of Docker suggested to me that if Traefik exceeded 30 MB it would simply be killed, but I replaced 30 MB with 100 MB just to check. And TLS finally worked despite compress = true! I don't know why, but instead of killing Traefik, Docker just started moving its pages to swap, and this is what created all the iowait. When the limit was set to 100 MB, there was enough room for both the TLS handler and gzip, so the demo container I privately shared with @timoreimann loaded with no peaks! I replaced 100 MB with 30 MB again and saw my peaks as before, so it's unlikely to be anything else!
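
In Deployment terms, the change boils down to this kind of resources block (values approximated from the numbers above; the chart may express them slightly differently):

resources:
  limits:
    cpu: 200m
    memory: 100Mi   # was 30Mi, which is too little for TLS + gzip of large responses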

This memory limit and swap theory seems to explain everything I've been observing so far, including why the iowait periods were very long after two or three successive page load attempts and why @timoreimann could not reproduce the bug. @miadabrin and @ConorNevin might also be facing the same memory limit problem, just at a higher load level.

Although it turned out not to be a bug in Traefik's code itself, the situation is probably still worth mentioning somewhere in the docs. I'll also speak to the official k8s chart community to make sure Helm users do not end up with suboptimal defaults. They should be able to smoothly load photo galleries via TLS with gzip :–)

Wish you guys all the best in preparing 1.4.0, about which I'm very excited!

@ldez
Member

ldez commented Sep 12, 2017

@kachkaev great job !!! 👏 👏 👏

@timoreimann
Contributor

@kachkaev thanks for the details -- your explanation makes perfect sense. I even discovered the same problem (limits too strict) with another user a few weeks ago but totally forgot about it again. :(

I'll make sure we leave a remark in our Kubernetes guide. Keeping the issue open until the PR is ready.

To complete the analysis: Docker does kill containers exceeding their memory limits (or, rather, the OOM killer does). However, by default both memory and swap are taken into account. There's an open issue somewhere on the Kubernetes repository asking for swap to be taken into account in the resource specification. However, the official advice (especially from Google) is to disable swapping completely: once your applications start to swap, all bets are off performance-wise. Hence, it's usually better to know when an application runs out of (pure) memory than to let it go into the limbo state that swapping implies.
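
On the node itself, disabling swap is the usual two-liner (standard Linux administration, nothing Traefik-specific):

# turn swap off right away
sudo swapoff -a
# and keep it off across reboots by commenting out the swap entry in /etc/fstab
# (adjust the pattern if your fstab uses tabs between fields, or edit it by hand)
sudo sed -i '/ swap / s/^/#/' /etc/fstab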

Thanks once more! 🎉

@timoreimann
Contributor

PR is out. ^

@kachkaev
Contributor

I guess this issue can be closed now 🎉

@ldez removed the priority/P0 (needs hot fix) label on Sep 12, 2017
@timoreimann
Contributor

Hmm, strange, the PR should have closed the issue but didn't. Oh well, closing it myself then.

@ldez added this to the 1.4 milestone on Sep 13, 2017
@timoreimann changed the title from "Performance drop when using HTTPS together with compression" to "Performance drop when using HTTPS together with compression in Kubernetes" on Sep 13, 2017
@traefik locked and limited conversation to collaborators on Sep 1, 2019