(deleted) tmp files left around #1327

Closed
bsergean opened this Issue Aug 8, 2016 · 32 comments

Comments

5 participants
@bsergean
Contributor

bsergean commented Aug 8, 2016

Env:
gunicorn==19.6.0
OS = CentOS
Running under systemd.

A bunch of temp files are still held open by my gunicorn worker processes. The number of temp files varies across workers: one worker has 45 left around, another has 67, and some appear to be shared. The naming of those temp files suggests they are used/owned by gunicorn. I ran into resource problems (max number of open files > 1024) and one of our Ops guys noticed that gunicorn had many open files.

Thanks for any hints on debugging this !!

I start/stop the service through systemd, and I have not configured how systemd should stop and reload the app. My upgrade process makes it stop and start. For the record, I tried to implement a custom restart by sending USR2, then WINCH and TERM to the old process, but systemd ignored that, detected it as an unhandled workflow, and restarted gunicorn on its own.

gunicorn: 3378 crashzone    7u      REG                8,3         0 269227083 /tmp/wgunicorn-WUef6V (deleted)
gunicorn: 3378 crashzone    8u      REG                8,3         0 269227084 /tmp/wgunicorn-z78nQi (deleted)
gunicorn: 3378 crashzone    9u      REG                8,3         0 269227085 /tmp/wgunicorn-yBPZFy (deleted)
gunicorn: 3378 crashzone   10u      REG                8,3         0 269227116 /tmp/wgunicorn-IZ8yLv (deleted)
gunicorn: 3378 crashzone   11u      REG                8,3         0 269227118 /tmp/wgunicorn-b7RadI (deleted)
gunicorn: 3378 crashzone   12u      REG                8,3         0 269231113 /tmp/wgunicorn-QTigaF (deleted)
gunicorn: 3378 crashzone   13u      REG                8,3         0 269231114 /tmp/wgunicorn-2Vthud (deleted)
gunicorn: 3378 crashzone   14u      REG                8,3         0 269231116 /tmp/wgunicorn-8CHx_z (deleted)
gunicorn: 3378 crashzone   15u      REG                8,3         0 269231144 /tmp/wgunicorn-OIXCeX (deleted)
gunicorn: 3378 crashzone   16u      REG                8,3         0 269231145 /tmp/wgunicorn-saa7Vq (deleted)
gunicorn: 3378 crashzone   17u      REG                8,3         0 269231146 /tmp/wgunicorn-jX9a4S (deleted)
gunicorn: 3378 crashzone   18u      REG                8,3         0 269231154 /tmp/wgunicorn-Dm1KAE (deleted)
gunicorn: 3378 crashzone   19u      REG                8,3         0 269231155 /tmp/wgunicorn-RBV92D (deleted)
gunicorn: 3378 crashzone   20u      REG                8,3         0 269231157 /tmp/wgunicorn-VUD025 (deleted)
gunicorn: 3378 crashzone   21u      REG                8,3         0 269231158 /tmp/wgunicorn-VxoyId (deleted)
gunicorn: 3378 crashzone   22u      REG                8,3         0 269231159 /tmp/wgunicorn-h65JSk (deleted)
gunicorn: 3378 crashzone   23u      REG                8,3         0 269231160 /tmp/wgunicorn-yPNANx (deleted)
gunicorn: 3378 crashzone   24u      REG                8,3         0 269231161 /tmp/wgunicorn-P0YYso (deleted)
gunicorn: 3378 crashzone   25u      REG                8,3         0 269231162 /tmp/wgunicorn-ykwUxS (deleted)
gunicorn: 3378 crashzone   26u      REG                8,3         0 269231163 /tmp/wgunicorn-GhL7Yl (deleted)
gunicorn: 3378 crashzone   27u      REG                8,3         0 269231164 /tmp/wgunicorn-gV2KLt (deleted)

I have a wrapper for running gunicorn, which I run entirely through Python:

'''Start the gunicorn HTTP server
'''

import gunicorn.app.base
from gunicorn.six import iteritems


class StandaloneApplication(gunicorn.app.base.BaseApplication):
    '''A gunicorn application'''

    def __init__(self, app, options=None):
        '''Construct a gunicorn standalone app'''

        self.options = options or {}
        self.application = app
        super(StandaloneApplication, self).__init__()

    def init(self, parser, opts, args):
        '''init the Gunicorn app (the base class method is not implemented)'''

        super(StandaloneApplication, self).init(parser, opts, args)

    def load_config(self):
        '''Load our config file'''

        config = dict([(key, value) for key, value in iteritems(self.options)
                       if key in self.cfg.settings and value is not None])
        for key, value in iteritems(config):
            self.cfg.set(key.lower(), value)

    def load(self):
        '''Load our application'''

        return self.application

I run the application like that:

StandaloneApplication(app, options).run()
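
For reference, app and options look roughly like this (the handler and values below are simplified placeholders, not the real application):

def app(environ, start_response):
    '''Placeholder WSGI handler standing in for the real application'''
    body = b'OK\n'
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]

options = {
    'bind': '127.0.0.1:8000',   # placeholder address
    'workers': 20,              # one worker per core
    'max_requests': 100,
}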
@benoitc

Owner

benoitc commented Aug 9, 2016

Right now I don't see how a worker can have more than one temp file. It's only created when the worker instance is created and deleted when the worker crashes, so in theory the file should have disappeared along with the pid. Can you provide some kind of scenario that would help me reproduce this issue?

@xxwolfsrainxx

xxwolfsrainxx commented Nov 19, 2016

I am having this same issue. I don't see any updates on this.

@bsergean

Contributor

bsergean commented Nov 19, 2016

I did a very poor job of trying to come up with repro steps for this issue ... @xxwolfsrainxx, do you have any idea how to answer @benoitc's questions?

@xxwolfsrainxx

xxwolfsrainxx commented Nov 19, 2016

I too am running a CustomApplication that extends gunicorn.app.base.BaseApplication and is started via a systemd service. The application's number of open files just keeps growing due to an increasing number of /tmp/wgunicorn* files being created. The files show as "deleted" under lsof, yet they are still held open by many processes. I believe this is because each file keeps being opened by new gunicorn processes. For example, a temp file /tmp/wgunicorn-_fzofg23 is created and then held by many gunicorn worker processes (69129, 69130, 69131, etc.). I think this is preventing the OS from cleaning up the actual file. Maybe my understanding is incorrect; please correct me if so. Is it possible that newly spawned processes reuse other processes' temp files?

@bsergean

Contributor

bsergean commented Nov 19, 2016

In my case, I don't get that many requests per second, so I have set 'max_requests': 100.
At some point I ran into very high virtual memory usage (my WSGI handler writes files to disk), but that is not the case anymore.

But that's a different problem ... I think that creating a repro app would help a lot to figure out what's going on here. And maybe see if running outside of systemd changes something.

(running lsof on a gunicorn worker)
gunicorn: 27263 crashzone   22u      REG                8,3         0  271906700 /tmp/wgunicorn-nTV8qF (deleted)
gunicorn: 27263 crashzone   23u      REG                8,3         0  270268620 /tmp/wgunicorn-_hCfWy (deleted)
gunicorn: 27263 crashzone   24u      REG                8,3         0  270299152 /tmp/wgunicorn-zxYa4E (deleted)
gunicorn: 27263 crashzone   25u      REG                8,3         0  270284860 /tmp/wgunicorn-PSCD3b (deleted)
gunicorn: 27263 crashzone   26u      REG                8,3         0  270295094 /tmp/wgunicorn-SyAm4x (deleted)
gunicorn: 27263 crashzone   27u      REG                8,3         0  270299147 /tmp/wgunicorn-7R0GAq (deleted)
$ lsof -p 27263 | grep tmp | wc -l
21
@xxwolfsrainxx

xxwolfsrainxx commented Nov 19, 2016

Now that you mention it, my process is using an insane amount of virtual memory as well. I thought this was because of the way the OS reclaims memory from Python. It seems you and I have similar issues. What did you do about the virtual memory issue? I will implement your max_requests 100 as well.

@benoitc

Owner

benoitc commented Nov 21, 2016

How can I reproduce it?

@tilgovi

Collaborator

tilgovi commented Dec 20, 2016

#1310 refactors the way socket inheritance and activation works. Since that might affect systemd, can anyone try it?

@bsergean

Contributor

bsergean commented Dec 21, 2016

Is #1310 part of a released version (as in available from PyPI) of gunicorn?

@tilgovi

Collaborator

tilgovi commented Dec 21, 2016

@bsergean no, but I would like to merge it at some point and it helps to have people using systemd test it. I also hope that maybe it fixes your issue. Is it possible for you to test?

@tilgovi

Collaborator

tilgovi commented Dec 21, 2016

@bsergean actually, I apologize. I don't think #1310 is related to this issue at all. This issue is some problem with worker temp files, not socket inheritance. Please forgive the noise.

@tilgovi

Collaborator

tilgovi commented Dec 21, 2016

I easily reproduce this with SIGHUP. I think the workers are inheriting the arbiter's copy of old workers' tmp files.
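
A rough way to see it (a throwaway script; the app module, worker count, and sleeps are arbitrary, and it assumes Linux's /proc):

import glob, os, signal, subprocess, time

# Start gunicorn with a few workers ("myapp:app" is a placeholder module)
arbiter = subprocess.Popen(['gunicorn', '-w', '4', 'myapp:app'])
time.sleep(2)

def count_tmp_fds(pid):
    '''Count wgunicorn temp-file descriptors a process holds, via /proc'''
    count = 0
    for fd in glob.glob('/proc/%d/fd/*' % pid):
        try:
            if 'wgunicorn-' in os.readlink(fd):
                count += 1
        except OSError:
            pass
    return count

# Each HUP gracefully replaces the workers
for _ in range(3):
    arbiter.send_signal(signal.SIGHUP)
    time.sleep(2)

# Without the fix, a freshly forked worker also carries the temp fds of the
# workers that were alive in the arbiter at the moment it was forked.
workers = subprocess.check_output(['pgrep', '-P', str(arbiter.pid)]).split()
for pid in workers:
    print(int(pid), count_tmp_fds(int(pid)))

arbiter.terminate()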

@tilgovi

Collaborator

tilgovi commented Dec 21, 2016

Here's a patch to fix it. I will try to write a test and open a PR as soon as possible:

diff --git a/gunicorn/arbiter.py b/gunicorn/arbiter.py
index ff5d698..eb5c0dc 100644
--- a/gunicorn/arbiter.py
+++ b/gunicorn/arbiter.py
@@ -548,6 +548,10 @@ class Arbiter(object):
             self.WORKERS[pid] = worker
             return pid
 
+        # Do not inherit the temporary files of other workers
+        for sibling in self.WORKERS.values():
+            sibling.tmp.close()
+
         # Process Child
         worker_pid = os.getpid()
         try:
@bsergean

Contributor

bsergean commented Dec 21, 2016

oh oh ... great findings :) !

@bsergean

Contributor

bsergean commented Jan 11, 2017

Hey @tilgovi / I just tried your small patch and it works like a charm. I only see one temp file now (but I just restarted my web service).

@tilgovi

Collaborator

tilgovi commented Jan 11, 2017

We can commit the patch, but I worry about whether there's another problem you are experiencing. My patch only makes sure that workers don't inherit their siblings' temp files, but if you're seeing 60+ temp files open in each worker then something is very strange. Are you restarting workers very quickly? Are you running on PyPy (different file closing semantics, maybe)? Are your workers crashing a lot? Are you running with many workers?

@bsergean

Contributor

bsergean commented Jan 11, 2017

  1. Not running on PyPy
  2. Restarting every 100 requests (max_requests: 100)
  3. I don't think that my workers crash a lot.

I was looking at a process before the patch, and it had 21 tmp files open.

gunicorn: 33132 crashzone    7u      REG                8,3         0  270390154 /tmp/wgunicorn-jkMMo2 (deleted)
gunicorn: 33132 crashzone    8u      REG                8,3         0  270375749 /tmp/wgunicorn-1BrUH_ (deleted)
gunicorn: 33132 crashzone    9u      REG                8,3         0  270387246 /tmp/wgunicorn-cSzlVn (deleted)
gunicorn: 33132 crashzone   10u      REG                8,3         0  270387240 /tmp/wgunicorn-pE0LfJ (deleted)
gunicorn: 33132 crashzone   11u      REG                8,3         0  270387225 /tmp/wgunicorn-lyTDAS (deleted)
gunicorn: 33132 crashzone   12u      REG                8,3         0  270390155 /tmp/wgunicorn-VczLhk (deleted)
gunicorn: 33132 crashzone   13u      REG                8,3         0  270390148 /tmp/wgunicorn-5uaV4r (deleted)
gunicorn: 33132 crashzone   14u      REG                8,3         0  270375756 /tmp/wgunicorn-ePV2Gs (deleted)
gunicorn: 33132 crashzone   15u      REG                8,3         0  270390156 /tmp/wgunicorn-O331_Y (deleted)
gunicorn: 33132 crashzone   16u      REG                8,3         0  270390160 /tmp/wgunicorn-4pB_LX (deleted)
gunicorn: 33132 crashzone   17u      REG                8,3         0  270390150 /tmp/wgunicorn-y9Vk6M (deleted)
gunicorn: 33132 crashzone   18u      REG                8,3         0  270390157 /tmp/wgunicorn-DQwSgD (deleted)
gunicorn: 33132 crashzone   19u      REG                8,3         0  270375770 /tmp/wgunicorn-wjK9N_ (deleted)
gunicorn: 33132 crashzone   20u      REG                8,3         0  270390153 /tmp/wgunicorn-Eyt9ea (deleted)
gunicorn: 33132 crashzone   21u      REG                8,3         0  270375752 /tmp/wgunicorn-ScHxO2 (deleted)
gunicorn: 33132 crashzone   22u      REG                8,3         0  270375760 /tmp/wgunicorn-Q5HdNp (deleted)
gunicorn: 33132 crashzone   23u      REG                8,3         0  270375771 /tmp/wgunicorn-LbpTqY (deleted)
gunicorn: 33132 crashzone   24u      REG                8,3         0  270390161 /tmp/wgunicorn-Qm9qeB (deleted)
gunicorn: 33132 crashzone   25u      REG                8,3         0  270390146 /tmp/wgunicorn-bKEY9z (deleted)
gunicorn: 33132 crashzone   26u      REG                8,3         0  270375754 /tmp/wgunicorn-ad2nC5 (deleted)
gunicorn: 33132 crashzone   27u      REG                8,3         0  270390168 /tmp/wgunicorn-z1giqH (deleted)
@bsergean

Contributor

bsergean commented Jan 11, 2017

FYI

gunicorn$ python -V
Python 2.7.5
gunicorn$ cat /etc/redhat-release 
CentOS Linux release 7.1.1503 (Core) 
@tilgovi

Collaborator

tilgovi commented Jan 11, 2017

@bsergean how many workers do you run with?

@bsergean

Contributor

bsergean commented Jan 11, 2017

Either 20 or 40 (one worker per core, and sometimes our machines are configured with hyper-threading so it looks like they have 40 cores).

@tilgovi

Collaborator

tilgovi commented Jan 11, 2017

I'm surprised you managed to get so many temp files left around, although with graceful shutdowns after max requests and a high number of requests per second I could imagine there being as many temp files as you see.

If you would like to alter your patch to set the close-on-exec flag (util.py has a close_on_exec function) after the worker is forked, you can do that in the parent process inside spawn_worker in the arbiter. I think that would be a safer and more elegant fix than the original patch I posted, which you have in the PR.
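
Concretely, I mean something like this in the parent branch of spawn_worker (an untested sketch; it assumes util.close_on_exec, the existing helper in util.py, and worker.tmp.fileno() for the temp fd):

        if pid != 0:
            self.WORKERS[pid] = worker
            # Parent: mark the new worker's temp fd close-on-exec so it is not
            # carried across any later exec (e.g. the arbiter re-exec on USR2)
            util.close_on_exec(worker.tmp.fileno())
            return pid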

@benoitc

Owner

benoitc commented Jan 11, 2017

Hi, could you give more details?

  • what are the permissions on the tmp folder? Is there any sticky bit set?
  • did you try with the debug log enabled? I would be interested to see whether anything is displayed
  • did you try with gunicorn standalone?
@tilgovi

Collaborator

tilgovi commented Jan 11, 2017

@benoitc did you mean to comment on this issue? It seems as though we have the root problem and a fix identified. Temp file descriptors for the workers are inherited by child processes (of course, because each worker needs to inherit its own from the arbiter), but then every worker also inherits the open file descriptors intended for the other workers. The solution is to have the arbiter set the close-on-exec flag after forking, so that the new worker's temp file descriptor isn't inherited by any other worker or other process.

@tilgovi

Collaborator

tilgovi commented Jan 11, 2017

Okay. I'm sorry if I came off rude. I just didn't see the connection.

Yes, that line is there. When worker A executes that line it ensures that no other process inherits its socket or temp file, but the arbiter still has a copy of the temp file. When the arbiter spawns worker B, worker B inherits both its own temp file and worker A's temp file.

@tilgovi

Collaborator

tilgovi commented Jan 11, 2017

The arbiter also needs to set close-on-exec after passing the fd to a worker, so that no other worker will inherit it.

@benoitc

Owner

benoitc commented Jan 11, 2017

@tilgovi I mean that sometime back in 2010 a fix was added for exactly this kind of error. Also we have:

https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/workertmp.py#L30-L36

That should unlink the file and do exactly what the close was about. We haven't had the error since, so I'm not sure why we have it now. I just want to make sure we don't miss anything.

@tilgovi

Collaborator

tilgovi commented Jan 11, 2017

Still, the problem is slightly different and exactly as I describe. This issue is not about leaking temp files, but about leaking file descriptors to other workers. The files are gone once all the workers that inherited them quit.

You can see the behavior by spawning two workers and listing the open files of the second one. It will have both file descriptors for the temp files. With my patch, it doesn't.

I don't see any lines to prevent this exact problem, which is only an issue if you have many workers and maybe graceful restarts and max requests, such that there are a great number of worker processes tracked by the arbiter.

But instead of my patch, I think setting close on exec in the arbiter process after the fork is the right fix.

@benoitc

Owner

benoitc commented Jan 11, 2017

IRC?

@tilgovi

Collaborator

tilgovi commented Jan 12, 2017

As I commented just now on the PR #1428, close_on_exec is not sufficient because the workers don't exec, they just fork and continue.

I want to do more thinking on this. Please continue with the release without this.
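
To illustrate the distinction (a standalone demo, nothing gunicorn-specific): FD_CLOEXEC only closes a descriptor across exec*, so a plain fork still duplicates it into the child.

import fcntl, os, tempfile

fd, path = tempfile.mkstemp(prefix='cloexec-demo-')
os.unlink(path)                      # same unlink-after-create pattern as the worker tmp files
flags = fcntl.fcntl(fd, fcntl.F_GETFD)
fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

pid = os.fork()
if pid == 0:
    # Child: no exec has happened, so the fd is still open and usable here
    os.write(fd, b'still open after fork\n')
    os._exit(0)
else:
    os.waitpid(pid, 0)
    os.lseek(fd, 0, os.SEEK_SET)
    print(os.read(fd, 64))           # the child's write is visible: fork alone did not close it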

@benoitc benoitc added this to Acknowledged in Bugs Feb 26, 2017

@xxwolfsrainxx

xxwolfsrainxx commented Mar 13, 2017

I would like to note that this is still not fixed, and I am getting tons of files that are deleted but still considered open. lsof shows 228577 deleted files that Python still holds open. This is making it so that I cannot log in as this service account.
lsof shows results such as:
python3 1816 test_user 11u REG 253,0 0 2494528 /tmp/wgunicorn-2gq1xq_i (deleted)
python3 1816 test_user 12u REG 253,0 0 2494531 /tmp/wgunicorn-7h7rr535 (deleted)
python3 1816 test_user 13u REG 253,0 0 2494547 /tmp/wgunicorn-9d0n81g2 (deleted)
python3 1816 test_user 14u REG 253,0 0 2494549 /tmp/wgunicorn-s681max8 (deleted)
python3 1816 test_user 15u REG 253,0 0 2494550 /tmp/wgunicorn-sau57lfm (deleted)

@jkemp101

Contributor

jkemp101 commented Jul 15, 2017

I believe PR #1428 is the right fix for this. It is always the responsibility of the child process to close the file descriptors of the parent process that it does not need. In fact, Popen has close_fds which makes this easier for subprocesses that just want to close all of the parent's handles. But of course that doesn't help here.
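
The same principle in miniature (a generic sketch, not gunicorn code): after fork(), the child closes the inherited descriptors it has no reason to hold.

import os, tempfile

# The parent holds one temp fd per "worker" it manages
sibling_fds = [tempfile.mkstemp(prefix='worker-tmp-')[0] for _ in range(3)]

pid = os.fork()
if pid == 0:
    # Child: drop the descriptors that belong to the other workers;
    # the parent's copies are unaffected
    for fd in sibling_fds:
        os.close(fd)
    os._exit(0)

os.waitpid(pid, 0)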

Here are test results from running gunicorn (gevent) on Ubuntu 14.04 with 5 workers. This is ps and lsof output after a clean restart of gunicorn.

root     21890  1123  4 13:40 ?        00:00:01 python gunicorn -c config.py example.gunicorn:app
root     21947 21890 38 13:40 ?        00:00:08 python gunicorn -c config.py example.gunicorn:app
root     21948 21890 39 13:40 ?        00:00:09 python gunicorn -c config.py example.gunicorn:app
root     21949 21890 41 13:40 ?        00:00:09 python gunicorn -c config.py example.gunicorn:app
root     21950 21890 39 13:40 ?        00:00:09 python gunicorn -c config.py example.gunicorn:app
root     21951 21890 39 13:40 ?        00:00:09 python gunicorn -c config.py example.gunicorn:app
gunicorn  21890             root    6u      REG              202,1        0         57 /tmp/wgunicorn-Q7XEDK (deleted)
gunicorn  21890             root    7u      REG              202,1        0        339 /tmp/wgunicorn-5aRR70 (deleted)
gunicorn  21890             root    9u      REG              202,1        0       2762 /tmp/wgunicorn-Y1KSMn (deleted)
gunicorn  21890             root   10u      REG              202,1        0       2780 /tmp/wgunicorn-d2oaKx (deleted)
gunicorn  21890             root   11u      REG              202,1        0       2789 /tmp/wgunicorn-IxFK2w (deleted)
gunicorn  21947             root    6u      REG              202,1        0         57 /tmp/wgunicorn-Q7XEDK (deleted)
gunicorn  21948             root    6u      REG              202,1        0         57 /tmp/wgunicorn-Q7XEDK (deleted)
gunicorn  21948             root    7u      REG              202,1        0        339 /tmp/wgunicorn-5aRR70 (deleted)
gunicorn  21949             root    6u      REG              202,1        0         57 /tmp/wgunicorn-Q7XEDK (deleted)
gunicorn  21949             root    7u      REG              202,1        0        339 /tmp/wgunicorn-5aRR70 (deleted)
gunicorn  21949             root    9u      REG              202,1        0       2762 /tmp/wgunicorn-Y1KSMn (deleted)
gunicorn  21950             root    6u      REG              202,1        0         57 /tmp/wgunicorn-Q7XEDK (deleted)
gunicorn  21950             root    7u      REG              202,1        0        339 /tmp/wgunicorn-5aRR70 (deleted)
gunicorn  21950             root    9u      REG              202,1        0       2762 /tmp/wgunicorn-Y1KSMn (deleted)
gunicorn  21950             root   10u      REG              202,1        0       2780 /tmp/wgunicorn-d2oaKx (deleted)
gunicorn  21951             root    6u      REG              202,1        0         57 /tmp/wgunicorn-Q7XEDK (deleted)
gunicorn  21951             root    7u      REG              202,1        0        339 /tmp/wgunicorn-5aRR70 (deleted)
gunicorn  21951             root    9u      REG              202,1        0       2762 /tmp/wgunicorn-Y1KSMn (deleted)
gunicorn  21951             root   10u      REG              202,1        0       2780 /tmp/wgunicorn-d2oaKx (deleted)
gunicorn  21951             root   11u      REG              202,1        0       2789 /tmp/wgunicorn-IxFK2w (deleted)

The master process 21890 has FDs for the 5 worker tmp files, as expected. The first child forked, 21947, has an FD for its own file wgunicorn-Q7XEDK, which is also expected. But look at the next child forked, 21948: it has an FD for its own tmp file wgunicorn-5aRR70 as well as for the previous child's wgunicorn-Q7XEDK. This pattern continues for each subsequent forked child.

PEPs 433 and 446 address some of this behavior. I might change the comment in the PR from:

# Do not inherit the temporary files of other workers

to

# Close inherited temporary files of other workers