3.4.0 (official hhvm-3.4.0~trusty package) eats all memory+swap #4268

Closed
tat opened this issue Nov 18, 2014 · 75 comments
@tat

tat commented Nov 18, 2014

I upgraded my aws instances (c3.large) to 3.4.0 (official packages got from http://dl.hhvm.com/ubuntu) and all of them get killed by oom-killer after eating all RAM and swap in about 5 minutes (getting about 300 requests per minute).

Is there anything I can check to track down the issue?

My server.ini:
pid = /var/run/hhvm/pid
hhvm.server.port = 9000
hhvm.server.type = fastcgi
hhvm.server.default_document = index.php
hhvm.log.use_log_file = true
hhvm.log.file = /var/log/hhvm/error.log
hhvm.repo.central.path = /var/run/hhvm/hhvm.hhbc
hhvm.resource_limit.max_socket = 10000
hhvm.log.header = true

Thanks,
stefano

@tat tat changed the title 3.4.0 (official hhvm-3.4.0~trusty package) memory 3.4.0 (official hhvm-3.4.0~trusty package) eats all memory+swap Nov 18, 2014
@mklooss

mklooss commented Nov 18, 2014

I can confirm the same scenario here, but also on HHVM 3.3.
We have to restart the hhvm process every 6 hours to keep the server online.
We are using a dedicated server.

@tat
Author

tat commented Nov 18, 2014

In my case hhvm eats RAM+swap (about 5 GB in total) in about 5 to 7 minutes.

@mklooss

mklooss commented Nov 18, 2014

yesterday we had the same on 64 GB RAM and 8 GB SWAP in about 6 hours :/

@jwatzman
Contributor

@tat, this is an increase from 3.3 to 3.4? That's interesting. Can you get a heap profile for us? The process is unfortunately somewhat involved.

cc @paulbiss

@mklooss, what you're experiencing is unfortunately somewhat expected, and is a long-term issue we've been slowly looking into. It's not indicative of server instability, we just haven't optimized for a super-long-running server very much, since FB pushes twice a day. (Though 6 hours is still quite short.)

@fredemmott
Contributor

The admin server speaks FastCGI now, not HTTP - you'll also need to configure your webserver to give you access to it.

@tat
Author

tat commented Nov 18, 2014

Thanks for the feedback, I've got the admin interface working but I'm getting an error from the activate command:
Error 2 in mallctl("prof.active", ...)

Do you know what the issue is? Where is the file supposed to be written to? /tmp?

Here's the jemalloc-stats output I captured from the admin interface: http://pastebin.com/vBvfPiP5

btw @jwatzman 3.3 is working fine for me, RAM usage is stable at about 350MB; it has been running for days without restarts.

@mklooss

mklooss commented Nov 19, 2014

jemalloc Stats: https://gist.github.com/mklooss/8091e48c4551f40d05c8
Currently the HHVM process eats ~10 GB RAM; the process has been running for ~2 hours.

@frankh

frankh commented Nov 20, 2014

I'm getting the same problem running a large WordPress site on HHVM. Memory usage starts at ~450 MB and climbs to 1.2 GB before restarting (not 100% sure yet whether it's OOM-killed or crashing) every ~2 hours.

This is HHVM 3.4.0 on ubuntu/trusty

@jwatzman
Contributor

I just cherry-picked a memory leak fix into the 3.4 branch -- can someone who's experiencing this build that branch and report back? If it fixes it, we can roll a 3.4.1 release. The issue is that if you are passing invalid arguments to some builtin functions, such that the builtin raises a warning, we leak a small amount of memory each time -- and it looks new in 3.4. If your PHP app generates a lot of warnings from builtins, then this could easily be your bug :)
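[Editor's note: a hypothetical sketch of the kind of code that would trigger this class of leak -- a hot path repeatedly passing a bad argument to a builtin so that it raises a warning on every call. The specific builtin and values here are illustrative, not taken from anyone's app.]

```php
<?php
// Sketch of a leak trigger, per the description above: each iteration
// passes an out-of-range offset to a builtin, which raises a warning.
// On an affected 3.4.0 build, a small amount of memory would be leaked
// per warning, so a hot endpoint doing this could OOM quickly.
for ($i = 0; $i < 1000000; $i++) {
  strpos("short", "s", 100); // Warning: offset not contained in string
}
```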

> Thanks for the feedback, I've got the admin interface working but I'm getting an error from the activate command:
> Error 2 in mallctl("prof.active", ...)
>
> Do you know what the issue is? Where is the file supposed to be written to? /tmp?

I don't, sorry -- @fredemmott, @paulbiss, can either of you advise better?

@jwatzman
Contributor

> can someone who's experiencing this build that branch and report back? If it fixes it, we can roll a 3.4.1 release.

I went ahead and built a deb for trusty with this patch: http://dl.hhvm.com/ubuntu/hhvm_3.4.1-devtest~trusty_amd64.deb You can manually install that so you don't have to build HHVM yourself; let me know if it works better.

@denji
Contributor

denji commented Nov 21, 2014

configure:

./configure -DENABLE_SSP=ON -DDEBUG_MEMORY_LEAK=ON -DDEBUG_APC_LEAK=ON

  -DDEBUG_APC_LEAK=ON|OFF : Allow easier debugging of apc leaks : Default: OFF
  -DDEBUG_MEMORY_LEAK=ON|OFF : Allow easier debugging of memory leaks : Default: OFF
  -DENABLE_SSP=ON|OFF : Enabled GCC/LLVM stack-smashing protection : Default: OFF

@levixie

levixie commented Nov 21, 2014

@jwatzman which change did you cherry-pick?
I only see a doc update.
Thanks

@jwatzman
Contributor

edf53c1 is the relevant cherry-pick. It does look like only a doc update, but AIUI we have a script that parses that file (in particular, lines of the form of the one changed) to generate a bunch of data about opcode semantics, and the change is thus relevant. It confused me as well until it was explained to me this morning :-P

@levixie

levixie commented Nov 21, 2014

Thank you! We are building hhvm ourselves because we need a specific version of a library. I will pick the change and try it out to see how it goes.

@frankh

frankh commented Nov 21, 2014

Thanks for the patch and build, I'm trying it out now but unfortunately it looks like it's still leaking memory.

There are no warnings/errors in my hhvm log, so it doesn't look like this was the cause of the leak for me.

@staabm
Contributor

staabm commented Nov 21, 2014

maybe you are using create_function ? it seems this one is leaky, too - #4250

@paulbiss
Contributor

@staabm: that's been leaky for a while; we're looking for a leak that was recently introduced.

@jwatzman
Contributor

Spent most of the morning looking at this. I wasn't able to reproduce it with the "representative WordPress" install from https://github.com/hhvm/oss-performance, unfortunately. However, I was able to reproduce the heap profiling failure, and can help you get us a heap profile. It's a little messy.

  • The reason that turning profiling on and off is failing is that the default Ubuntu jemalloc lib doesn't have profiling enabled. I built one for you: download http://dl.hhvm.com/ubuntu/libjemalloc.so.1 and replace /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 with that file. (You probably want to move the system one out of the way instead of overwriting it so you can easily restore it later.) (If anyone reading this isn't on Ubuntu Trusty, the important thing is to build jemalloc with ./configure --enable-prof.)
  • That's still not enough to work; you need to make sure HHVM runs with the following environment variable: MALLOC_CONF="prof:true,prof_active:false"
  • Then, set up the admin server as detailed above. My local install just passes -v AdminServer.Port=8093 to the HHVM command line, but you can put that in the config too. You also need nginx in front of that; I did it with something like this below the normal server stanza:
  server {
    listen 8091 default_server;
    access_log            /dev/shm/hhvm-nginxCnchqi/admin-access.log main;
    client_body_temp_path /dev/shm/hhvm-nginxCnchqi/admin-client_temp;
    proxy_temp_path       /dev/shm/hhvm-nginxCnchqi/admin-proxy_temp;
    fastcgi_temp_path     /dev/shm/hhvm-nginxCnchqi/admin-fastcgi_temp;
    uwsgi_temp_path       /dev/shm/hhvm-nginxCnchqi/admin-uwsgi_temp;
    scgi_temp_path        /dev/shm/hhvm-nginxCnchqi/admin-scgi_temp;

    location / {
      fastcgi_pass 127.0.0.1:8093;
      include fastcgi_params;
    }
  }
  • Start up and warm up your server.
  • curl localhost:8091/jemalloc-prof-activate to turn on profiling.
  • curl localhost:8091/jemalloc-prof-dump?file=/tmp/dump1 to get an initial dump.
  • Let your server run for a little while. You say it takes about 6 hours to fall over? Let it run for 3-4 or so.
  • curl localhost:8091/jemalloc-prof-dump?file=/tmp/dump2 to get a second dump.
  • Let it run until it almost falls over.
  • curl localhost:8091/jemalloc-prof-dump?file=/tmp/dump3 to get a final dump.
  • Send me those three files out of /tmp, along with both which version of HHVM you installed (i.e., the Trusty 3.4.0 package, which I think you're using) and the output of hhvm --version. Feel free to post it here, or in case you think there's anything sensitive, my email address is my GitHub username at fb.com.
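[Editor's note: the steps above, condensed into a shell transcript for reference. The ports (8091 admin via nginx, 8093 FastCGI) and the config path are taken from jwatzman's example; adjust them to your setup. This assumes the profiling-enabled libjemalloc is already in place.]

```
# HHVM must start with profiling compiled in but initially inactive:
export MALLOC_CONF="prof:true,prof_active:false"
hhvm -m server -c /etc/hhvm/server.ini -v AdminServer.Port=8093 &

# After warming up the server:
curl localhost:8091/jemalloc-prof-activate
curl 'localhost:8091/jemalloc-prof-dump?file=/tmp/dump1'
# ...let the server run a few hours...
curl 'localhost:8091/jemalloc-prof-dump?file=/tmp/dump2'
# ...let it run until it's almost falling over...
curl 'localhost:8091/jemalloc-prof-dump?file=/tmp/dump3'
```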

@SiebelsTim
Contributor

@jwatzman Put this in the wiki or somewhere! 👍

@jwatzman
Contributor

Yeah, good idea, will do if this ends up producing useful results :)

@jwatzman
Contributor

jwatzman commented Dec 2, 2014

Have any of you that are experiencing this been able to get any more info? Just confirming that the 3.4.1-devtest deb linked above does or does not help would be useful -- and if it doesn't help, a heap dump as above would be even more useful. This is going to eventually hit human timeout which would be unfortunate, since it seems to be a real issue -- but since we can't repro it, we need more info to track it down :(

@liayn

liayn commented Dec 2, 2014

I'll install the devtest build on the live server now. Let's hope.

@liayn

liayn commented Dec 2, 2014

hm, apt-get keeps nagging me, telling me a newer version is available... how can I avoid that?

@jwatzman
Contributor

jwatzman commented Dec 2, 2014

You can directly download the deb and then sudo dpkg --install path/to/deb.

@liayn

liayn commented Dec 2, 2014

That's what I did. It replaced the installed hhvm, but now apt-get reports that updates are available, which triggers reporting systems, which triggers mails....

@jwatzman
Contributor

jwatzman commented Dec 2, 2014

Can you just silence that for a little while? The package is deliberately built out-of-band, since it's unclear whether it will help. (Though it's signed with the same GPG key as the official ones, so you can tell it does come from us.) I'm not sure what reporting system you're using, so I can't tell you exactly how to shut it up; you might try just commenting out the HHVM repo in /etc/apt/sources.list or /etc/apt/sources.list.d/, wherever it is.
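[Editor's note: if the nagging is driven by apt itself, a standard way to keep the manually installed deb without uninstalling the repo is to put the package on hold; this is a generic dpkg/apt feature, not something HHVM-specific suggested in the thread.]

```
# Stop apt from offering the repo version over the hand-installed deb:
sudo apt-mark hold hhvm

# Later, to resume tracking the repo again:
sudo apt-mark unhold hhvm
```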

@tat
Author

tat commented Dec 10, 2014

Debug files sent to jwatzman. Let me know if you find anything, thanks!

@swtaarrs
Contributor

Thanks, that log file was very helpful. It looks like the JIT is just trying to compile an incredibly large chunk of code in a function with an abnormally large number of locals, and we're using a lot of memory as a result. There are a few things you can do that should help. The problem is that you have a large amount of code in a pseudomain (code that isn't in any function, just at the top level of a file), and the way we compile those is pretty suboptimal. The quickest fix will be disabling compilation of those with the hhvm.jit_pseudomain = 0 ini option. That will negatively impact performance but should reduce the crazy memory usage.

A better fix would be putting all of that code inside a function, rather than leaving it at the top level. I can't tell which file it was, but it looks like there are at least 395 local variables in it, and some of the functions it calls are array_map, trc, convert_height_to_text, convert_size_to_text, and convert_weight_to_text. If that's not easily possible because you make heavy use of global variables, or if neither of these helps, your best bet is probably the hhvm.jit_max_region_instrs option I mentioned in a previous comment.
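[Editor's note: a minimal sketch of the suggested refactor. Moving top-level code into a function turns pseudomain locals into ordinary function locals, which the JIT compiles far more cheaply. The function name `main` and the globals shown are hypothetical, not from the actual file.]

```php
<?php
// Before: hundreds of locals sitting at the top level of the file
// (a pseudomain), which the JIT compiles as one huge unit.

// After: the same code wrapped in a function and invoked once.
function main() {
  // Any globals the top-level code relied on must now be imported:
  global $config, $db; // hypothetical globals
  $height_text = convert_height_to_text($config['height']);
  // ...rest of the former top-level code...
}
main();
```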

It's of course possible that there's a real leak somewhere, but so far all signs are pointing to the massive compilation unit being the problem.

@jwatzman
Contributor

@tat and this bit of code started consuming more memory in 3.4, which is why this probably just now hit you. I hear the memory usage will be somewhat improved in 3.5 or 3.6, but no promises -- and that clearly doesn't help you now :)

@swtaarrs is there any tweaking of defaults we could do, or anything like that, which would make this failure at least more debuggable, or hopefully go away, for external folks? This seems like something folks will hit from time to time; ideally they shouldn't, or at the very least it shouldn't be this hard to debug.

@liayn

liayn commented Dec 10, 2014

@swtaarrs: I just deployed the hhvm.jit_pseudomain = 0 option to our server and restarted hhvm.
Unfortunately, memory usage keeps building. Not as fast, but we're not in rush hour currently. :-(

@liayn

liayn commented Dec 10, 2014

I checked our WordPress installation and couldn't find any of the aforementioned functions in the code, so this is not part of the WordPress core.

@tat
Author

tat commented Dec 10, 2014

@jwatzman @swtaarrs we definitely have a jumbo file with about 4k lines, most of them top level.

I tried hhvm.jit_pseudomain=0 and memory consumption peaked at 21% (~600 MB), but of course that's not useful for us, since the jumbo file is the one that gets 95% of requests and it isn't compiled with that setting.

I'm now trying hhvm.jit_max_region_instrs=500; it hasn't been OOM'd yet, but it's consuming 78% of memory at the moment (2.8 GB). The memory growth seems slower but still there, IMHO (we also don't have much traffic at this hour).

So it seems to be JIT-related, but does it sound possible to you that it was using ~500 MB on 3.3 and can't run with 3.5 GB on 3.4?

@jwatzman
Contributor

@liayn it's extremely likely that your issue is unrelated to what we're tracking down with @tat, so it's unsurprising that the options mentioned didn't help. HHVM has some very longstanding slow memory leaks; a particularly bad regression went into 3.4 and was fixed in the 3.4.1-devtest I posted above. I'll be rolling 3.4.1 with that fix and some other fixes I'm waiting on being finished up, probably in the next couple of weeks. But most of the leaks have been there well before 3.4, for years, and what you've described unfortunately sounds in line with them. That's just to say it's expected, not that it's good -- I think some folks are going to try to go after those leaks some time after the holidays.

> so it seems to be JIT-related, but does it sound possible to you that it was using ~500 MB on 3.3 and can't run with 3.5 GB on 3.4?

@bertmaher @swtaarrs @alexmalyshev any of you have any idea how much the RAM usage of FrameState increased from 3.3 to 3.4? I know someone said that it did, but 5-10x seems like a lot.

@jwatzman
Contributor

Or maybe something that isn't a "leak" per se -- the memory will eventually get cleaned up, but sticks around somewhat longer than before, causing an earlier OOM when combined with the FrameState size increase?

@bertmaher
Contributor

@jwatzman, @tat: something that's still confusing to me is that FrameState should not be long-lived; we should allocate them and either (a) OOM immediately, which is what we think is happening here, or (b) finish translating and free the FrameStates. So something still feels weird here, unless this server never stops translating until it OOMs...

@jwatzman
Contributor

Yeah, something weird is going on. https://gist.github.com/jwatzman/5c25aa6732e849df13e2 manages to reproduce the issue -- run the output of that on 3.4, refresh 12 times until we JIT, then watch RAM spike -- up to 1G on my machine. An idle heap dump looks very similar to what @tat sent -- with things that should never be running simultaneously, and shouldn't be running at all when idle. Looks either like a leak or our heap profiling is lying to us :)

The issue actually looks much worse on master, though it could be a separate issue. (We peg the CPU and keep consuming RAM until we OOM with my above script -- we're at least stable at 1G on 3.4.)

@swtaarrs and @bertmaher are continuing to look into this.

@jwatzman
Contributor

We found it!

It appears to be a bug in boost flat_map in versions before 1.55 (trusty has 1.54). One of these two, not clear which:

This was triggered by the new usage of flat_map in 4a8ee81, which is a rev that is new in 3.4.

I'll write a change tomorrow to work around this for old versions of boost, and get that merged into 3.4.1.

@tat
Author

tat commented Dec 12, 2014

Great!!! if you can upload a new .deb I'll test it tonight.

On a side note, it would be nice if the apt Packages file kept old versions listed, so we could stay on a version that we know works properly and upgrade manually after testing new versions (really needed when autoscaling is used).

Thank you!

@dmytroleonenko

Hey,
I'm also experiencing memory leaks on Ubuntu 14.10 with hhvm 3.5 (and nightly as well) when running vBulletin 3 forum software.
Have sent dumps to jwatzman. Please advise whether I should open a separate issue.

@jwatzman
Contributor

Some notes:

  • A change is up for review to fix this -- https://reviews.facebook.net/D30183 if anyone wants to follow along. I'll merge it into the 3.4 branch and roll 3.4.1 packages once it's in.
  • The bug is actually in flat_set, I misspoke above. The upstream issue is likely to be https://svn.boost.org/trac/boost/ticket/9166 but it's not totally clear.
  • It was fixed in boost 1.55, which ubuntu 14.10 ships with, so unfortunately if you're running 14.10 this won't help you. We are separately tracking some other memory issues with 3.5-dev, we'll see if they're related.
  • If you installed my libjemalloc above to get a heap dump, you should revert to the system version when you're done with the tests. There's a very slight incompatibility between it and the version of folly in 3.4, which can trigger a very rare issue that leads to memory corruption. It's so rare that it's probably fine, but to be totally safe I'd revert it.

> On a side note it would be nice if the apt Packages file would keep old versions listed

Unfortunately reprepro, which we use to manage the repo, doesn't do this. The debs are all still up on dl.hhvm.com (specifically ubuntu lives here) if you want to manually install something. But yeah, I'd definitely recommend installing new versions onto a single development machine and testing before upgrading your production server(s). (Not just for HHVM, but for any upgrade of anything :))

> I'm also experiencing memory leaks on Ubuntu 14.10 with hhvm 3.5 (and nightly as well) when running vBulletin 3 forum software.

This is likely to be a separate issue, since I'm pretty sure 14.10 has a fixed boost library. HHVM is known to leak small amounts of memory over a long period of time (i.e., needing to restart the HHVM process every couple of days isn't unexpected). If the leak is worse than that, please open a separate issue -- this one is specifically about a regression from 3.3 to 3.4.

@jwatzman
Contributor

@dmytroleonenko the heap dump you sent me was pretty clearly not from 14.10 -- are you sure you're not on 14.04? It looks a lot like you are, in which case you are likely hitting the same or a similar leak. Try the new nightly tonight (2014.12.13 or newer).

@jwatzman
Contributor

Will tag and roll 3.4.1 shortly. I'll make sure to build 14.04 first; it should be available in a few hours. Nightly builds won't have the fix until the 2014.12.13 builds, as noted above.

jwatzman added a commit that referenced this issue Dec 12, 2014
Summary: This is very likely to be the memory leak reported to be new in
HHVM 3.4. See code comment and linked GitHub issue for full explanation.

Fixes #4268

{sync, type="child", parent="internal", parentrevid="1736543", parentrevfbid="1584241511789266", parentdiffid="5935230"}

Reviewed By: @paulbiss

Differential Revision: D1736543
@jwatzman
Contributor

The release version of 3.4.1 for trusty (14.04) is up, which was the hardest hit and what I think everyone on this thread was using. Building debug version, and the other OSes, over the next day or two.

Thanks for the info from everyone about this! @dmytroleonenko if you're still hitting problems after trying the 2014.12.13 nightly (which won't exist for another 12 hours or so) please file a new issue.

@tat
Author

tat commented Dec 13, 2014

great, thank you guys! I'll try it out and report back if I find any issues.

@pjv

pjv commented Dec 13, 2014

3.4.1 on ubuntu trusty is looking good for me. 9 hours of stable memory use by HHVM.


@liayn

liayn commented Dec 13, 2014

Looks good with us as well. Initial memory usage already lower. So far memory seems stable. Thanks a lot to everyone!

@dmytroleonenko

It looks much better. I'll keep an eye on whether the issue is still there and file a new bug if so.

@HumanWorks

I had the same problem with HHVM + WordPress; the server was crashing every time we posted something new. It turns out that a simple workaround is to disable the "Try to automatically compress the sitemap if the requesting client supports it" setting in the XML Sitemaps plugin.
Hope this helps (I know it's not a real fix for HHVM, but at least it will work for many WordPress installations out there).

@paulbiss
Contributor

paulbiss commented Mar 5, 2015

@HumanWorks the problem we were tracking here turned out to be a memory leak in boost that was being triggered in the JIT, it's been fixed since 3.4.1 and isn't present in 3.3. If you're seeing a different leak I would suggest opening a new issue.

@csdougliss
Contributor

@paulbiss Any idea when 3.6.0 will make its way out of the door?

@paulbiss
Contributor

paulbiss commented Mar 5, 2015

@craigcarnell I think the plan is to start rolling packages today or tomorrow (our packager isn't the fastest box...); I've got one more cherry-pick I need to push. It's been a busy week for everyone and we haven't had a chance to update the packaging system to push a second LTS.
