
Performance drop when using HTTPS together with compression in Kubernetes #1908

Closed
miadabrin opened this issue Aug 1, 2017 · 33 comments
Labels
area/provider/k8s/ingress, kind/bug/possible (a possible bug that needs analysis before it is confirmed or fixed), status/5-frozen-due-to-age
Milestone
1.4

Comments

@miadabrin

miadabrin commented Aug 1, 2017

Do you want to request a feature or report a bug?

Report a bug

What did you do?

I have been using version 1.3.4 with Kubernetes as an ingress controller and compression enabled.
Performance becomes very bad after 3 or 4 hours.

What did you expect to see?

I expected the performance to remain the same.

What did you see instead?

Requests take too long to respond, especially static resources like JS files. The issue goes away when I restart the pod.

Output of traefik version: (What version of Traefik are you using?)

1.3.4

What is your environment & configuration (arguments, toml, provider, platform, ...)?

defaultEntryPoints = ["http"]
logLevel = "WARN"
MaxIdleConnsPerHost = 1000

[entryPoints]
  [entryPoints.http]
  address = ":80"
  compress = true
@juliens added the kind/bug/possible and area/provider/k8s/ingress labels on Aug 1, 2017
@ldez
Member

ldez commented Aug 1, 2017

Could you provide more information about CPU and RAM consumption?

@miadabrin
Author

I have been running the pod on a machine with 4 CPU cores and 8 GB of RAM. The CPU usage is not high most of the time but surges from time to time. RAM is around 6 GB used and 2 GB free most of the time. I have no resource limits on the Traefik pod.

@timoreimann
Contributor

@miadabrin thanks for reporting.

Could you provide metrics on Traefik, especially CPU and memory consumption over time?

Are you possibly able to send the same kind of requests directly towards the proxy? If so, do you see any difference in performance behavior? What happens if you swap out Traefik with Nginx or HAProxy?

@miadabrin
Author

Unfortunately the metrics are gone after restarts, but I will see if I can provide some. I haven't tried Nginx or HAProxy on this workload; I might try it and post the results here. I can't send the same workload directly to the proxy since all traffic is ingress-based, so it would not be an accurate comparison.
Whatever the problem is, it might develop as time passes, or it might happen because Traefik is starved of CPU for a number of seconds at some point.

@ConorNevin

I've also seen similar behaviour since updating to 1.3.x. Traefik behaves as expected for several hours/days handling around 30k requests per minute, but during a load test that pushed it to 120k, all the Traefik pods became unresponsive and the requests all timed out. I'll dig out the metrics when I get back into work.

@kachkaev
Contributor

kachkaev commented Sep 10, 2017

I might also be facing something similar on a single-machine test cluster (KVM, 2 CPUs/2 GB RAM). Traefik works fine, but consistently enters some odd state when I load one specific static page through it. There is nothing special about the page itself, just an HTML page with a few images served by nginx (about 2 MB total).

Serving that page via a port (without ingress) does not show any anomalies; serving it via Traefik, however, makes the netdata CPU charts look crazy. Here is the CPU load after the page has been hit (the load is normally around 10% since no one except me is visiting the server):

[screenshot: netdata CPU chart spiking after the page hit]

The peak sometimes lasts over one minute, and only killing the Traefik pod brings things back to normal. The CPU seems to be so busy because of excessive IO operations (iotop on the host):

[screenshot: iotop output on the host showing heavy IO]

RAM and swap don't look scarce and don't change much during the peaks:

[screenshot: RAM and swap charts during the peaks]

Saw this both with 1.3 and 1.4-rc2. Ingress controller flavour: official helm chart.

@ldez added the priority/P0 (needs hot fix) label on Sep 10, 2017
@timoreimann
Contributor

@kachkaev thanks a lot for the detailed analysis, very much appreciated. 👏

You mention that "only killing traefik pod brings things back to normal". Does that also hold when you stop letting Traefik serve any traffic at all? Or would the CPU utilization level off as soon as Traefik is given time to recover?

Do you possibly have a chance to send a SIGQUIT to one locked up Traefik pod and paste the output? It should give an overview of all running goroutines. I'm kind of speculating we might end up in a situation where goroutines pile up like crazy and never terminate (i.e., goroutine leaking).
Note that a side effect of sending SIGQUIT to Traefik (or any Go program in general) is that the process will die hard with an exit code of 2. If you can't afford that, you may want to take one pod out of rotation first (which is doable by adjusting the selector labels on one pod if your Traefik pods are fronted by a Kubernetes Service).
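
For illustration, the whole sequence could look roughly like this (the pod name and label are hypothetical, adjust to your setup):

# Take the pod out of the Service by changing the label its selector matches
kubectl label pod traefik-abc123 app=traefik-debug --overwrite

# On the node running the pod, send SIGQUIT via Docker; the process dumps
# all goroutine stacks to stderr and exits with code 2
docker kill -s SIGQUIT <traefik-container-id>

# The dump ends up in the container logs
docker logs <traefik-container-id> > traefik-sigquit.txt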

Thanks again -- I'm super keen on figuring out what's going wrong here.

@kachkaev
Contributor

kachkaev commented Sep 10, 2017

Hi @timoreimann, really glad you're keen to investigate this! I'm running traefik on a single-VM cluster for personal projects, so any experiments are easy to run.

Regarding "killing traefik pod": it seems that once Traefik has started producing a lot of iowait, stopping serving things via it does not help, at least not immediately. The CPU chart in netdata was still pink 3 minutes after closing the browser tab (i.e. cancelling all requests). Although I was not contacting Traefik myself, there was a small chance of the usual hits from health checkers and bots, but those should be very rare and go to different URIs, which never triggered that much load.

Going to k8s dashboard and clicking on "delete traefik pod" button restarted traefik and reduced the CPU load to 5-10 percent immediately:

[screenshot: CPU load dropping immediately after the pod restart]

Interestingly, RAM and swap charts look like there's been nothing happening at all:

[screenshot: flat RAM and swap charts]

Killing processes with signals is a bit advanced for me, but it looks like I've managed to do it :–) Running docker kill -s SIGQUIT ${CONTAINER_ID} and then docker logs ${CONTAINER_ID} gave me 4K lines of stack traces after the mention of SIGQUIT: quit. See traefik-sigquit.txt. How does this fit your hypothesis?

One other thing I tried earlier today was manually adding a Docker backend to the toml and linking a container via this backend (i.e. using the same Traefik instance, but bypassing k8s). I did this when I was not sure whether the load was to do with Traefik or with k8s logging, Docker layers, etc. Unfortunately, the same odd state kept being triggered when I was opening my static page via the Docker backend, just the same as in k8s. So the bug might not be ingress-specific. Bypassing Traefik entirely, either by exposing the Docker port or by switching the k8s Service to NodePort, does not produce any peaks in the netdata charts.

The most recent tests were done with traefik 1.4.0-rc2. What other data would you be interested to dig out?

@timoreimann
Contributor

Thanks for the stack traces; I'll try to find some time next week to analyze the dump.

One more thing: I suppose that web page of yours is non-public. Do you think it's possible to generalize/simplify/anonymize it so that it can be used by others (i.e., me) while still being able to reproduce the bug? That'd make things tremendously easier.

@timoreimann
Contributor

timoreimann commented Sep 10, 2017

@kachkaev also, have you possibly set the MaxIdleConnsPerHost Traefik option to anything else but the default? If so, which value did you choose?

@kachkaev
Contributor

kachkaev commented Sep 10, 2017

Great!

The page is somewhat private, yes. I'm not sure how much time it will take me to anonymise it and at what point it will stop producing the bug, but if you want, I can share the image with you via a private channel. Feel free to email me or DM on twitter and ask for the details.

Re MaxIdleConnsPerHost: not sure what this parameter is :–) Where should I check it?

@timoreimann
Contributor

@kachkaev if you use the TOML file for configuring Traefik and you have customized the parameter, then you should find it in your configuration file with something like this:

MaxIdleConnsPerHost = 150

If you can't find the option (or you can only find it commented out), then you're running off of the default setting (which would be good to know too).

I'll reach out to you via email.

@kachkaev
Contributor

kachkaev commented Sep 10, 2017

MaxIdleConnsPerHost is not part of the helm chart I've been using, and I also haven't found it in the generated configmap, so I assume this parameter is not set.

Thanks for the email!

@timoreimann
Contributor

The dump seems to confirm the iowait problem: goroutines sitting at internal/poll.runtime_pollWait() are leading the board with 28 out of 150 goroutines in total.

I'm wondering if the exact number of 150 is just a coincidence or indicates some artificial limit. @kachkaev, could you SIGQUIT another stuck Traefik pod and check the number of goroutines? (If it's easier for you, a grep goroutine <your dump> | wc -l is all I need.)

@timoreimann
Contributor

timoreimann commented Sep 11, 2017

Goroutine count run-down per grep -A1 -E ^goroutine traefik-sigquit.txt |grep -v goroutine | grep -v \-| sed 's/\(.*\)([a-fx0-9, .]*)$/\1()/' | sort | uniq -c | sort -nr (removes the value addresses in order to enable grouping):

     28 internal/poll.runtime_pollWait()
     15 net/http.(*http2serverConn).writeHeaders()
     12 net/http.(*persistConn).writeLoop()
     10 net/http.(*persistConn).roundTrip()
      9 net/http.(*http2serverConn).writeDataFromHandler()
      7 sync.runtime_notifyListWait()
      2 sync.runtime_Semacquire()
      2 sync.(*Mutex).Unlock()
      2 reflect.Value.Field()
      2 net/textproto.canonicalMIMEHeaderKey()
      2 net/http.(*persistConn).roundTrip.func1()
      2 net/http.(*persistConn).readLoop()
      2 net/http.(*http2serverConn).readFrames()
      2 net/http.(*Transport).getConn.func2.1()
      1 vendor/golang_org/x/net/http2/hpack.(*Decoder).parseHeaderFieldRepr()
      1 unicode.SimpleFold()
      1 time.Time.Add()
      1 time.Sleep()
      1 syscall.Read()
      1 sync.runtime_SemacquireMutex()
      1 sync.(*Map).Load()
      1 runtime.usleep()
      1 runtime.gopark()
      1 reflect.valueInterface()
      1 os/signal.signal_recv()
      1 net/textproto.MIMEHeader.Set()
      1 net/http.http2cloneHeader()
      1 net/http.fixPragmaCacheControl()
      1 net/http.Header.Del()
      1 net/http.(*transferWriter).probeRequestBody()
      1 net/http.(*http2sorter).Keys()
      1 net/http.(*http2serverConn).serve()
      1 net/http.(*http2serverConn).newWriterAndRequestNoBody()
      1 net/http.(*http2priorityWriteScheduler).CloseStream()
      1 net/http.(*Transport).getConn.func4()
      1 net/http.(*Transport).dialConn()
      1 net/http.(*Request).write()
      1 net.IP.To4()
      1 net.(*netFD).init()
      1 net.(*Resolver).goLookupIPCNAMEOrder()
      1 main.run.func2()
      1 github.com/containous/traefik/vendor/github.com/vulcand/oxy/forward.(*websocketForwarder).serveHTTP()
      1 github.com/containous/traefik/vendor/github.com/thoas/stats.(*Stats).Data()
      1 github.com/containous/traefik/vendor/github.com/golang/glog.(*loggingT).flushDaemon()
      1 github.com/containous/traefik/vendor/github.com/eapache/channels.(*InfiniteChannel).infiniteBuffer()
      1 github.com/containous/traefik/types.CanonicalDomain()
      1 github.com/containous/traefik/server.(*Server).listenSignals()
      1 github.com/containous/traefik/server.(*Server).listenProviders()
      1 github.com/containous/traefik/server.(*Server).listenConfigurations()
      1 github.com/containous/traefik/server.(*Server).Wait()
      1 github.com/containous/traefik/provider/kubernetes.(*clientImpl).WatchAll.func1()
      1 github.com/containous/traefik/acme.(*ACME).CreateLocalConfig.func1()
      1 fmt.Fprintf()
      1 encoding/json.stringEncoder()
      1 encoding/json.intEncoder()
      1 context.WithCancel()
      1 context.(*valueCtx).Err()
      1 compress/flate.(*huffmanBitWriter).writeCode()
      1 bytes.(*Buffer).Write()

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

I just killed the Traefik container two more times with docker kill -s SIGQUIT ${CONTAINER_ID} and got these new dumps:

Each new test was a bit faster than the previous one (I'm getting skilled at this 😃), and the number of goroutine occurrences went down too. Might be a hint at something, but might also be just a coincidence.

Will be happy to run more tests if needed!

@emilevauge
Member

emilevauge commented Sep 11, 2017

@kachkaev ouch, that's unfortunate :/

Another way to debug goroutines and memory in Traefik: enable debug mode with the --debug arg, and the web provider (on port 8080, for example).
You will then be able to access goroutine and memory stats at http://localhost:8080/debug/vars.
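
In TOML terms, that's something along these lines (a minimal sketch; the port is just an example):

# enable extra debug endpoints
debug = true

[web]
address = ":8080"
readonly = true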

You can also use https://github.com/divan/expvarmon to follow these stats over time: expvarmon -ports="8080" -vars="Goroutines,mem:memstats.Alloc":

[screenshot: expvarmon monitoring goroutines and memory allocation]

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

Thanks @emilevauge! I added debug = true to the configmap toml and recreated the pod. Now JSON gets returned when I go to https://traefik.example.com/debug/vars.
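
For reference, the goroutine count can be pulled straight out of that JSON with something like this (assuming jq is installed and the field is named Goroutines, as in the expvarmon example above):

# print just the current goroutine count from the expvar endpoint
curl -s https://traefik.example.com/debug/vars | jq '.Goroutines'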

Originally the reported number of Goroutines was between 45 and 55. I hit the 'magic page' three times and although the iowait went crazy again, the CPU load stabilized after a short while:

[screenshot: CPU chart during the three page hits]

Not sure this is related to debug mode, because I saw this behaviour previously too (the pod did not always freeze completely).

Here are the numbers of Goroutines I observed during the test by refreshing the url:

1st hit: 179 151 134
2nd hit: 180 178 132
3rd hit: 178 137 130 
afterwards: 122, then after some time 46

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

I just commented out these lines in toml and restarted the pod:

# [entryPoints.http.redirect]
      # entryPoint = "https"

The results are interesting: hitting the page via the HTTP Traefik endpoint for the first time does not produce any iowait artefacts. Then hitting the same URL with HTTPS creates a peak (not too long in this specific pod run, but something like this has happened before too). Then using HTTP immediately afterwards does create a peak, but a smaller one:

[screenshot: annotated CPU chart showing HTTP vs HTTPS hits]

After a short while HTTP requests stop causing much IO, but once an HTTPS request hits that page again, HTTP ones become impactful again.

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

One more insight: removing compress = true from the [entryPoints.https] toml section solved the problem! No more iowait in the CPU charts until this option is turned on again!

Could the issue be related to gzipping JPEGs? The demo page I privately shared with @timoreimann was a static HTML page with a photo gallery. The container's nginx is configured to gzip a few file types, JPEGs excluded. Perhaps Traefik decides not to double-gzip the text files, but tries gzipping the JPEGs and finds that hard, especially when combined with TLS?

FYI compress = true is on by default in the official helm chart via the gzip.enabled option (default true). Details are in the configmap.yml template.
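
If anyone wants to switch it off while testing, the chart value can be overridden along these lines (assuming the stable/traefik chart; double-check the exact key against your chart version):

# values.yaml
gzip:
  enabled: false

# or on the command line
helm upgrade traefik stable/traefik --set gzip.enabled=false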

@timoreimann
Contributor

timoreimann commented Sep 11, 2017

@kachkaev interesting finding! I'm still struggling to repro the problem locally even with compression enabled and running over TLS. My Traefik configuration file looks like this:

logLevel = "DEBUG"
checknewversion = false
graceTimeOut = 0

defaultEntryPoints = ["http", "https"]

[entryPoints]
    [entryPoints.http]
    address = ":8000"
    compress = true
    [entryPoints.https]
    address = ":8443"
    compress = true
        [entryPoints.https.tls]
            [[entryPoints.https.tls.certificates]]
            CertFile = "server.crt"
            KeyFile = "server.key"

[web]
address = ":8080"
readonly = true

[file]
filename = "rules.toml"

Not sure if either the file provider or the fact that I'm using a self-signed certificate makes a difference.
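
For reference, rules.toml is just a minimal file-provider fragment along these lines (the backend URL and rule here are placeholders, Traefik 1.x syntax):

[backends]
  [backends.backend1]
    # single upstream server for testing
    [backends.backend1.servers.server1]
    url = "http://127.0.0.1:9000"

[frontends]
  [frontends.frontend1]
  backend = "backend1"
    [frontends.frontend1.routes.route1]
    rule = "Host:localhost"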

@timoreimann
Contributor

/cc @ldez who has worked on the compression middleware some time ago.

@emilevauge
Member

@kachkaev Well done 👏! Could you test with compression enabled and v1.3.8?

@timoreimann
Contributor

@emilevauge FWIW, the problem is also present on 1.4 (see #1908 (comment)).

@kachkaev
Contributor

kachkaev commented Sep 11, 2017

Good point @emilevauge! I originally discovered this issue both in 1.3.8 and 1.4.0-rc.2, but was only testing 1.4.0-rc.2 afterwards.

The peaks in 1.3.8 are confirmed. I manually changed the deployment image to traefik:1.3.8, which recreated the pod. Hitting the same page with JPEGs over SSL did generate iowait, similarly to 1.4.0-rc.2. Commenting out compress = true in Traefik's configmap and then manually restarting the pod made the CPU charts look flat too!

Shame that I don't know Go and so can't look through the code, but I hope my findings can still help!

@ldez
Member

ldez commented Sep 12, 2017

I think the main problem is the compression of the JPEG files.

Have you tested without HTTPS?

@timoreimann
Contributor

timoreimann commented Sep 12, 2017

@ldez HTTP seems to work fine; it's only the HTTPS + compression combo that is problematic.

@timoreimann changed the title from "Performance drop after using with kubernetes as ingress controller" to "Performance drop when using HTTPS together with compression" on Sep 12, 2017
@kachkaev
Contributor

kachkaev commented Sep 12, 2017

I believe I have a solution! After looking at my Traefik Deployment yaml once again, I noticed that it had CPU and memory limits automatically configured by the helm chart. These were 200m CPU and 30 MB of RAM. My previously shallow knowledge of Docker suggested to me that if Traefik exceeded 30 MB it would simply be killed, but I replaced 30 MB with 100 MB just to check. And TLS finally worked despite compress = true! I don't know why, but instead of killing Traefik, Docker just started moving its pages to swap, and this is what created all the iowait. When the limit was set to 100 MB, there was enough room for both the TLS handler and gzip, so the demo container I privately shared with @timoreimann loaded with no peaks! I replaced 100 MB with 30 MB again and saw my peaks as before, so it's unlikely to be anything else!
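
In Deployment terms, the change boils down to this kind of resources block (values approximated from the numbers above; the chart may express them slightly differently):

resources:
  limits:
    cpu: 200m
    memory: 100Mi   # was 30Mi, which is too little for TLS + gzip of large responses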

This memory limit and swap theory seems to explain everything I've been observing so far, including why the iowait periods were very long after two or three successive page load attempts and why @timoreimann could not reproduce the bug. @miadabrin and @ConorNevin might also be facing the same memory limit problem, just at a higher load level.

Although it turned out not to be a bug in Traefik's code itself, the situation is probably still worth mentioning somewhere in the docs. I'll also speak to the official k8s chart community to make sure Helm users do not end up with suboptimal defaults. They should be able to smoothly load photo galleries via TLS with gzip :–)

Wish you guys all the best in preparing 1.4.0, about which I'm very excited!

@ldez
Member

ldez commented Sep 12, 2017

@kachkaev great job !!! 👏 👏 👏

@timoreimann
Contributor

@kachkaev thanks for the details -- your explanation makes perfect sense. I even discovered the same problem (limits too strict) with another user a few weeks ago but totally forgot about it again. :(

I'll make sure we leave a remark in our Kubernetes guide. Keeping the issue open until the PR is ready.

To complete the analysis: Docker does kill containers exceeding their memory limits (or, rather, the OOM killer does). However, by default both memory and swap are taken into account. There's an open issue somewhere on the Kubernetes repository asking for swap to be taken into account in the resource specification. However, the official advice (especially from Google) is to disable swapping completely: once your applications start to swap, all bets are off performance-wise. Hence, it's usually better to know when an application runs out of (pure) memory than to let it go into the limbo state that swapping implies.
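
On the node itself, disabling swap is the usual two-liner (standard Linux administration, nothing Traefik-specific):

# turn swap off right away
sudo swapoff -a
# and keep it off across reboots by commenting out the swap entry in /etc/fstab
# (adjust the pattern if your fstab uses tabs between fields, or edit it by hand)
sudo sed -i '/ swap / s/^/#/' /etc/fstab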

Thanks once more! 🎉

@timoreimann
Contributor

PR is out. ^

@kachkaev
Contributor

I guess this issue can be closed now 🎉

@ldez removed the priority/P0 (needs hot fix) label on Sep 12, 2017
@timoreimann
Contributor

Hmm, strange, the PR should have closed the issue but didn't. Oh well, closing it myself then.

@ldez added this to the 1.4 milestone on Sep 13, 2017
@timoreimann changed the title from "Performance drop when using HTTPS together with compression" to "Performance drop when using HTTPS together with compression in Kubernetes" on Sep 13, 2017
@traefik locked and limited conversation to collaborators on Sep 1, 2019