
Restart worker after configurable number of requests #280

Closed
mathieudz opened this issue Oct 30, 2023 · 25 comments · Fixed by #281

@mathieudz
Contributor

As PHP was not designed for long-running processes, memory leaks are not uncommon, since they usually aren't an issue. When long-running processes are introduced with tools like FrankenPHP, those leaks become problematic.
In my case, my application leaks around 1 GB per hour. As mitigation, I have scheduled a zero-downtime restart of the application every 8 hours. However, it would be nice if FrankenPHP provided such functionality out of the box. PHP-PM (from which I migrated) provides such a setting.
[screenshot of the corresponding php-pm setting]

@dunglas
Owner

dunglas commented Oct 30, 2023

This can usually be handled directly in the worker script. Here is an example with Laravel Octane: https://github.com/laravel/octane/pull/764/files#diff-5891b628d054a2240185344524af788b4e613e0d24b82ee3b5a52d174ac0daf7R30

As this can easily be done in userland, I would prefer to add this to the documentation instead of "bloating" the C or Go code. WDYT?
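
For illustration, here is a minimal sketch of such a worker script that exits after a configurable number of requests, after which FrankenPHP starts a fresh worker. The MAX_REQUESTS variable name is only an example, not an official setting:

```php
<?php
// Minimal worker-script sketch: stop after MAX_REQUESTS requests so that
// FrankenPHP spawns a fresh worker process (the variable name is illustrative).

ignore_user_abort(true);

$maxRequests = (int) ($_SERVER['MAX_REQUESTS'] ?? 0);

$handler = static function () {
    // Handle one HTTP request here, e.g. by dispatching to your framework's kernel.
    echo 'Hello from the worker';
};

for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);

    // Run PHP's cycle collector between requests to limit leak growth.
    gc_collect_cycles();

    if (!$keepRunning) {
        break;
    }
}
```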

@mathieudz
Contributor Author

There's an argument to be made for either solution. On the one hand, you want to keep FrankenPHP minimal and most PHP developers don't know Go; on the other hand, most applications will probably need something to mitigate memory leaks.
Since you prefer deferring it to the worker scripts, I'll have a look at implementing a similar solution in https://github.com/php-runtime/runtime/blob/main/src/frankenphp-symfony/src/Runner.php .

@mathieudz
Contributor Author

I'm testing this change: php-runtime/frankenphp-symfony@main...mathieudz:frankenphp-symfony:max-requests
…but memory still seems to be growing.
Any idea how I can check whether the environment variable is read correctly and the workers are regularly restarted?

@dunglas
Owner

dunglas commented Oct 30, 2023

If the debug mode is enabled (set the global debug Caddy directive), you should see this in the logs every time the service restarts: https://github.com/dunglas/frankenphp/blob/main/worker.go#L101

@dunglas
Owner

dunglas commented Oct 30, 2023

@mathieudz I just tested your code locally and it appears to work (I added the env var in the .env file of my Symfony demo project). Keep in mind that by default FrankenPHP starts many workers, and the count is per worker.

@dunglas
Owner

dunglas commented Oct 30, 2023

By the way, you should probably use a runtime option instead of a raw env var.
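
For instance (a sketch of the idea, not the actual frankenphp-symfony API), the option can still be fed from the environment in the front controller, so it stays configurable without rebuilding the image; the max_requests option name and the FRANKENPHP_MAX_REQUESTS variable are hypothetical:

```php
<?php
// public/index.php — sketch: pass a runtime option (hypothetical "max_requests"
// name) instead of reading a raw env var inside the runner.

use App\Kernel;

$_SERVER['APP_RUNTIME_OPTIONS'] = [
    // Still driven by the environment, so no image rebuild is needed to tweak it.
    'max_requests' => (int) ($_SERVER['FRANKENPHP_MAX_REQUESTS'] ?? 0),
];

require_once dirname(__DIR__).'/vendor/autoload_runtime.php';

return function (array $context) {
    return new Kernel($context['APP_ENV'], (bool) $context['APP_DEBUG']);
};
```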

@mathieudz
Contributor Author

mathieudz commented Oct 30, 2023

  1. OK, my problem was that I provided an environment variable to my Docker container instead of adding a var to .env; it works fine now.
  2. These "runtime options" are located in index.php and are thus hardcoded? I'd like to keep this configurable, to avoid having to create new images when tweaking settings.
  3. I didn't have to enable debug mode, as the message is logged at the error level.
  4. It would be nice to be able to differentiate between intended worker termination (like this) and unintended termination. The current message, "unexpected termination", sounds like something went wrong.
  5. Should I create a merge request for the main runtime git repository?

@mathieudz
Contributor Author

The workers get restarted regularly (e.g. every 100 requests, running with 16 workers), but the memory usage is still increasing over time ☹️

@EvanEdwards

Here are two cents from a different perspective, setting technical considerations aside for a moment:

I realize that memory leaks are not the only reason to restart, but a maximum memory setting for the worker would be PHP-ish. It would be in line with the various similar settings for maximum POST size, memory usage, timeout, etc.

A timeout and reset-on-max-memory-use setting "feels PHP" to me. Ideally resetting after the current workers finish, which probably points to having a threshold to terminate and a second, higher threshold to hard kill.

@mathieudz
Contributor Author

@EvanEdwards I guess that we could check the PHP memory usage with memory_get_usage after the explicit garbage collection against a configurable threshold at https://github.com/php-runtime/runtime/blob/1c6ac3dad61b35e126541190232b497ed934c170/src/frankenphp-symfony/src/Runner.php#L45
Makes me think: what value should you expect here if there are no memory leaks? Almost zero?
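
Sketched out, such a check inside the worker loop could look roughly like this (the MAX_MEMORY_BYTES setting is hypothetical; this is not the php-runtime implementation):

```php
<?php
// Sketch of a memory-based restart in a FrankenPHP worker loop;
// the MAX_MEMORY_BYTES setting is hypothetical.

$maxMemory = (int) ($_SERVER['MAX_MEMORY_BYTES'] ?? 0);

do {
    $keepRunning = \frankenphp_handle_request(static function () {
        // Handle the request with your framework here.
    });

    gc_collect_cycles();

    // After garbage collection, memory_get_usage(false) reports what PHP still
    // holds; without leaks it should stay roughly flat (not zero, since the
    // booted framework itself keeps memory) from one request to the next.
    if ($maxMemory > 0 && memory_get_usage(false) > $maxMemory) {
        break; // exit so that FrankenPHP starts a fresh worker
    }
} while ($keepRunning);
```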

@dunglas
Owner

dunglas commented Oct 30, 2023

@mathieudz I'm not able to reproduce locally.

Could you try the following things:

  1. taking several profiles at different points in time would help a lot. To do so you can follow these steps: https://mercure.rocks/docs/hub/debug (the environment variable is CADDY_GLOBAL_OPTIONS instead of GLOBAL_OPTIONS, and if you're using Docker you'll also have to set CADDY_ADMIN to :2019)
  2. testing if the problem also occurs in non-worker mode

Also, do you get OOM errors? If that's not the case, it may be normal behavior: Go, like PHP, is a garbage-collected language, but unlike PHP it only gives memory back to the system when the system needs it (Chrome has a similar behavior, for instance). A profile will help identify whether it's Go or PHP that holds the memory, and whether that is intended or not.

If you need help profiling, I'll be glad to assist; don't hesitate to contact me privately (mail, Symfony Slack, Twitter DM...).

@dunglas dunglas reopened this Oct 30, 2023
@mathieudz
Contributor Author

@dunglas Thanks, I'll look into it when I have some time.
In the meantime, my FrankenPHP process, which runs 16 workers with at most 500 MB of memory each, uses almost 20 GB after running for 19 hours. If this is really the normal behavior for Go, then it is really bad, as it will end up hogging all the memory, leaving nothing for the disk cache, which is important for the PostgreSQL instance on the same server.
Also, I don't want to experiment with OOM, as this is running in production. I haven't reproduced this in other environments yet.
This is total memory usage over 19 hours, 18 hours since the last restart of FrankenPHP:
[graph: total memory usage over the last 19 hours]

@withinboredom
Collaborator

I'm running two instances. I restarted one to see what the memory difference was. Fresh-ish: 23 MB (already up to 178 MB serving very low traffic); running 4 days: 350 MB. I'm not sure how to debug an already running instance of frankenphp. From what I can tell by dumping memory from /proc/PID/mem, inspecting it, and looking at actual memory usage... I think we might have a memory leak caused by not cleaning something up properly somewhere.

Ran a simple script to look for non-zero bytes in /proc/PID/mem, based on non-file mapped allocations in /proc/PID/maps:

54/178 mb used (fresh)

106/485 mb used (old) -- yes it grew by over a hundred megs in the last 20 mins...

Something fishy is going on.
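
For reference, here is a rough sketch of that kind of /proc-based check (my own approximation, not the exact script used above): it walks the non file-backed mappings listed in /proc/PID/maps and counts non-zero bytes in /proc/PID/mem. Reading another process's memory requires root or ptrace permission.

```php
<?php
// Approximate sketch: count non-zero bytes in non file-backed mappings of a
// process. Usage: php memscan.php <pid> (needs root/ptrace access).

$pid = (int) ($argv[1] ?? 0);
$maps = file("/proc/$pid/maps", FILE_IGNORE_NEW_LINES) ?: [];
$mem = fopen("/proc/$pid/mem", 'rb');

$total = 0;
$used = 0;

foreach ($maps as $line) {
    $fields = preg_split('/\s+/', trim($line));
    $path = $fields[5] ?? '';

    // Skip file-backed mappings; keep anonymous ones plus [heap]/[stack].
    if ($path !== '' && $path[0] === '/') {
        continue;
    }

    [$start, $end] = array_map('hexdec', explode('-', $fields[0]));

    // hexdec() returns a float above PHP_INT_MAX (e.g. [vsyscall]); skip those,
    // as well as regions we cannot seek to.
    if (!is_int($start) || !is_int($end) || $end <= $start || fseek($mem, $start) !== 0) {
        continue;
    }

    // Read the region in 1 MiB chunks to keep our own memory usage bounded.
    for ($offset = $start; $offset < $end; ) {
        $chunk = fread($mem, min(1 << 20, $end - $offset));
        if ($chunk === false || $chunk === '') {
            break;
        }
        $offset += strlen($chunk);
        $total += strlen($chunk);
        $used += strlen($chunk) - substr_count($chunk, "\0");
    }
}

fclose($mem);
printf("%d/%d MB used\n", intdiv($used, 1 << 20), intdiv($total, 1 << 20));
```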

@dunglas
Owner

dunglas commented Nov 1, 2023

@withinboredom would you be able to do a profile (see instructions above)? Also, are you using the worker mode?

@dunglas
Owner

dunglas commented Nov 1, 2023

If the problem is in C, we can do something like this to see what's going on: https://kirshatrov.com/posts/finding-memory-leak-in-cgo

@mathieudz
Contributor Author

@dunglas See attached zip file for the results of http://localhost:2019/debug/pprof/heap of my production environment.
heap.zip

  • After the start
  • After 15 minutes
  • After 30 minutes

@KennedyTedesco
Contributor

KennedyTedesco commented Nov 1, 2023

Just a side note: I feel that cgi.populateEnv() might be a good place to look for a bottleneck. Sadly, I don't have any Go experience, but I'll see if I can find anything there.

@withinboredom
Collaborator

@dunglas -- these are docker containers, not worker mode. I'll see if I can get one running in debug mode. FWIW, I will probably be able to change the env while running. I assume Go knows that env vars are mutable and non-thread-safe... but there's one way to find out :evil-laugh:

@dunglas
Owner

dunglas commented Nov 16, 2023

We could try this to detect the leak, if it hasn't been fixed in RC3: https://clang.llvm.org/docs/LeakSanitizer.html

@mathieudz
Contributor Author

mathieudz commented Nov 18, 2023

Since the memory is still leaking with the memory patches from a few days ago (it crashes when hitting the ulimit limits), I'm attaching the pprof dumps plus a history of memory usage (vmem). It jumps up and down, but a very clear linear long-term trend is still visible.
FrankenPHP runs in worker mode with 16 workers. Workers are terminated after 10,000 requests.
heap20231118.zip
[graph: memory usage (vmem) history]

@dunglas
Owner

dunglas commented Nov 18, 2023

Thanks for the profiles. Could you also sample the C heap using https://github.com/iovisor/bcc/blob/master/tools/memleak.py please?

So we'll see where the memory is used.

@dunglas
Owner

dunglas commented Nov 19, 2023

@mathieudz According to the profiles, you're still running a version that is not the latest RC. The populateEnv() function doesn't exist anymore.
[profile excerpt showing populateEnv()]

@mathieudz
Contributor Author

mathieudz commented Nov 22, 2023

I have updated my docker image again.
Some observations after running for ~2 hours:
Memory (sys_bytes) according to the Prometheus metrics:
[graph: sys_bytes over ~2 hours]
It seems to correlate with the memory usage reported by top (vmem). The difference fluctuates around 3.6 GB (top shows on average 3.6 GB more than Prometheus), so there must be some memory that is not accounted for by the Go metrics in Prometheus, maybe the memory allocated by PHP. Anyway, since the difference is rather stable, it means that the memory leak is fully accounted for by the metrics shown above.
I have also run memleak.py for those two hours. The "leaks" go up and down, but never exceed 1 GB. A few minutes after starting I saw 800 MB, but the last time (see attachment) it didn't even exceed 400 MB. So no memory leak measured there?
Lastly, there is the pprof profile, which only shows a memory usage of ~800 MB, only half of what the Prometheus metrics show. This time there is no populateEnv anymore, so I should be on a more recent version.
memleak-20231122-2.log
pprof.adhoc.20231122.zip

@dunglas dunglas mentioned this issue Dec 11, 2023
Nyholm pushed a commit to php-runtime/runtime that referenced this issue Dec 12, 2023
Nyholm pushed a commit to php-runtime/frankenphp-symfony that referenced this issue Dec 12, 2023
@dunglas dunglas added the bug and documentation labels Dec 14, 2023
@mathieudz
Contributor Author

@dunglas It's OK for me to close this one, as the subject of the issue is now solved in the runtime, and to keep the discussion about the memory leaks in #366. Let me know if you want more info from my side.

@dunglas
Owner

dunglas commented Dec 15, 2023

Let's do that, then. Thank you very much for your help; it has been invaluable, and I hope that we'll soon fix the leak too.

@dunglas dunglas closed this as completed Dec 15, 2023