daily crashes on hhvm 3.7.2 (and previous versions, at least 3.7.0) #5528
hhvm crashes regularly on a high-load webapp (/www.XXX/), once or twice a day, and considerably more often since switching another app (/other.XXX/), which shares the same database and controller/model, from php-cgi to hhvm as well. There's no way to reproduce this: it affects PDF files generated by mPDF, normal pages, and GET and POST requests on random pages. Since the daemon is monitored and automatically restarted, that's not so bad. Not sure if this relates to bug #2721. But sometimes, like once or twice a week, the hhvm daemon just becomes unresponsive for 10, 15, 20 minutes until manually restarted, without consuming any additional CPU or memory. Clients receive a 502 error after the connection to the hhvm backend times out (nginx 1.9.2 behind nghttp2 1.0.4 as TLS offloader).
Let me know if you need more information.
kern.log & stacktrace from the last few days:
As it just happened for the second time today, an update on the (likely separate) "unresponsive" issue. I just noticed that the hhvm backend could not be reached for about 15 minutes before I manually restarted it, and it created a stacktrace.log in /tmp, but without any further information. There were also no kern.log or hhvm/error.log entries. So basically it segfaulted without terminating the running process. I'm not entirely sure whether this is instead related to nghttp2's stream implementation (an upgrade from 1.0.2 to 1.0.4, which fixes nghttp2/nghttp2#264, was the only change made today)?
I think the calls to the ini functions in the trace are ancillary to the problem here. The ini request shutdown function is called at the end of every request, and any ini setting that was set via `ini_set()` is restored to its default there.
So, it looks to me like someone called `ini_set()` during the request.
Well, that's interesting. `ini_set` is used:
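(The original snippet didn't survive in the thread; the following is a hypothetical illustration of the kind of per-request override being discussed, with made-up values.)

```php
<?php
// Hypothetical illustration -- the actual ini_set() calls from this comment
// were lost. A typical per-request override before heavy work such as PDF
// generation looks like this:
$previous = ini_get('max_execution_time');
if (ini_set('max_execution_time', '30') === false) {
    error_log('ini_set(max_execution_time) failed');
}

// ... run the expensive part of the request ...

// At request shutdown HHVM restores any value changed via ini_set() to its
// configured default, so restoring $previous manually is optional.
ini_set('max_execution_time', $previous);
```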
error_log is watched with swatch, but in production, where those crashes happen, error.log is always empty; errors only occur in the dev environment when I make typos :-)
The thing is, the app has a frontend and a backend. The frontend running on hhvm crashes just once, maybe twice a day, some days not at all. I tried pen-testing the dev environment, but had no luck reproducing this. The setup is exactly the same; the only variable missing is user (mis)behaviour.
Any ideas for ini tweaks or for reproducing the crashes? ab/h2load/siege seem quite useless for simulating user behaviour.
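For what it's worth, the closest I can think of is replaying real production URLs in random order instead of hammering a fixed set (a sketch; log path and combined log format assumed):

```sh
# Build a URL list from real traffic (combined log format assumed; $7 = request path)
awk '{ print "https://www.XXX" $7 }' /var/log/nginx/access.log | sort -u > urls.txt

# Replay in random order: -i picks URLs randomly from the file,
# -c sets concurrency, -t the test duration (10 minutes)
siege -i -f urls.txt -c 16 -t 10M
```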
When I think about why the backend crashes more often than the frontend, this might be related: in the frontend, MariaDB qps is insignificant, since most of the data is stored indefinitely in redis until a dataset is changed in the backend, while the backend itself has a high qps. For a few months now, starting with mariadb 10.0.x, the mysqloptimize/mysqlcheck/mysqlanalyze cronjobs have been marking random tables as crashed, because a tmp table cannot be created (this relates to the latest mysqlatfacebook rant about "sheer incompetence", mysql bug #77439): it already exists, created by well-ordered queries running at the same time. This happens on all machines running hhvm with mariadb/ariadb/mysqli. I never reported that, though, because: unreproducible.
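For context, the cronjobs in question are plain maintenance runs, roughly along these lines (exact invocation assumed; mysqloptimize/mysqlanalyze are just mysqlcheck in different modes):

```sh
# Nightly maintenance (invocation approximate)
mysqlcheck --all-databases --analyze
mysqlcheck --all-databases --optimize   # this is what randomly reports tables as crashed
```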
I was able to pinpoint this to the mPDF library (mpdf1.com, v6.0; unrelated to previous mPDF issues), which explains why HHVM crashed more often on the backend (heavy usage of mPDF). To be fair, this affects < 0.1% of all mPDF requests, which makes it even harder to understand.
At first, I reduced the mPDF timeout from 480s (?!) to 30s, so I could sleep again. Then I went through all related log files. One entry in the HHVM error.log showed that mPDF was not able to load a 300dpi (~150k) image from a CDN host (resolver timeout), so I wondered if there are libc/HHVM issues with getaddrinfo(), since I'm on debian testing with hhvm 3.7.2 for jessie, and added the CDN hosts to /etc/hosts (nsswitch prefers files over dns). The resolver timeout didn't happen again, but the crashes remain (and never occurred with php-cgi up to 5.6.9).
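Concretely, the workaround was (CDN hostname anonymized like the rest, IP illustrative):

```
# /etc/nsswitch.conf -- "files" is consulted before "dns"
hosts: files dns

# /etc/hosts -- pin the CDN host so image loading doesn't depend on the resolver
203.0.113.10    cdn.XXX
```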
And the stacktrace for the error.log entry above:
When I restart HHVM while it is unresponsive, and before the script times out, the stacktrace is empty after DebuggerCount.
What I don't understand:
Regardless of the yet-unknown mPDF code issue: if one client locks up HHVM with an mPDF request, why are all other requests to HHVM through nginx stalled (and 503'ed after a timeout)? Shouldn't HHVM be threaded, so that only the one thread running the crashing script dies? Did I miss something in server.ini?
@64616E69656C Hi. I am on vacation right now, but a quick glance: you have the server thread count at 16, so that should be fine. I looked at an issue with similar errors to yours....
There are some timeout settings there; I am not sure that is the problem, but maybe.
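For reference, the knobs I mean look like this in server.ini (setting names are real, values are purely illustrative, not a recommendation):

```ini
; server.ini -- thread and timeout related settings (values illustrative)
hhvm.server.thread_count = 16
hhvm.server.request_timeout_seconds = 30    ; abort requests that run too long
hhvm.server.connection_timeout_seconds = 30
```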
Doing a google search ... https://www.google.com/search?q=%22entire+web+request+took+longer%22
That might provide some hints.
For now I've settled for slightly decreased performance and switched the site in question back to php-cgi, due to increased agitation and panic attacks from management.
There are 3 issues for me here, which might or might not be related to each other.
And then, there is
//EDIT: I would very much appreciate deb-src entries for apt, so one could easily rebuild the package with afl and more custom debugging options.
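What I mean, roughly (repo URL assumed from the binary packages; no deb-src line actually exists today, which is the point):

```sh
# Hypothetical, once deb-src entries are published:
echo 'deb-src http://dl.hhvm.com/debian jessie main' >> /etc/apt/sources.list
apt-get update
apt-get source hhvm   # then rebuild with afl instrumentation / extra debug flags
```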
@64616E69656C Hi. I am back from vacation now. HHVM 3.8 is out. I am curious: could you upgrade to 3.8 and check whether you still see the same issues?
@Nikerabbit It seems odd to me that a call to
After upgrading to 3.10.0 today, it's happening again, now with crashes every few minutes.