
Some files are never closed #60

Closed
bochecha opened this issue May 17, 2012 · 28 comments

Comments

@bochecha

I'm working on a script using GitPython to loop over lots of Git repos, pull remote changes, etc...

At some point, I noticed that doing repo.is_dirty() was opening some files which never got closed until the process exits, which in turn causes the tool to crash with "too many open files" after iterating over enough repos:

import os

import git

# Put here whatever Git repos you might have in the current folder
# Listing more than one makes the problem more visible
repos = ["ipset", "libhtp"]

raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
                                                            os.getcwd()))

for name in repos:
    repo = git.Repo(name)
    repo.is_dirty()

    del repo

raw_input("Check open files again")                 # files are still open

I tried digging deeper down the GitPython stack, but couldn't find the actual cause.

In case that's helpful, below is the same as the previous snippet, but directly using the lowest-level gitdb objects I could find to open the files:

import os

import git
from gitdb.util import hex_to_bin

# Put here whatever Git repos you might have in the current folder
# Listing more than one makes the problem more visible
repos = ["ipset", "libhtp"]

raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
                                                            os.getcwd()))

raw_input("Check open files again")                 # files are still open

for name in repos:
    repo = git.Repo(name)
    sha = hex_to_bin("71acab3ca115b9ec200e440188181f6878e26f08")

    for db in repo.odb._dbs:
        try:
            for item in db._entities:               # db._entities opens *.pack files
                item[2](sha)                        # opens *.idx files

        except AttributeError as e:
            # Not all DBs have this attribute
            pass

    del repo

raw_input("Check open files again")                 # files are still open

But that's just where the files are opened (well, in fact they are opened even lower down, but I didn't manage to go deeper), not where they are supposed to be closed.

@Byron
Member

Byron commented May 17, 2012

Thanks for reporting this.
This is probably related to the smmap library, which opens memory mappings of the files it handles. Apparently there is a chance of them not being closed once the access is done.

I suffered from similar problems in the Windows tests (where open handles prevented the test cases from cleaning themselves up properly), and this definitely should be looked into.

Until then, you might use the GitCmdObjectDB backend for your repository, which uses a different implementation.
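
For reference, a minimal sketch of that switch; the odbt keyword and the GitCmdObjectDB class are the ones that come up again further down in this thread:

import git
from git.db import GitCmdObjectDB  # also reachable as git.GitCmdObjectDB

# Object-database reads now go through a persistent `git cat-file` child
# process instead of in-process memory maps of *.pack/*.idx files.
repo = git.Repo("ipset", odbt=GitCmdObjectDB)
repo.is_dirty()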

@mscherer

I can confirm the issue on Fedora 17: *.pack and *.idx files are left open, both as plain files and as memory mappings.

@Byron
Member

Byron commented May 17, 2012

I think the lsof checks will have to go into the test suite too, in order to protect against regressions.
It will be a lesson for me, as I simply didn't verify that I indeed close the handles (or that there are no dependency cycles preventing objects from dying).

Especially with something as vital (and limited) as system resources, I shouldn't leave anything to chance.

Currently I have no ETA on this, but I am planning to redirect development time to git-python once again.

@bochecha
Author

So, learning my way through the code with pdb, I found where the file is actually opened.

In smmap/util.py, in the constructor of the class MapRegion, there's the following:

                if isinstance(path_or_fd, int):
                        fd = path_or_fd
                else:
                        fd = os.open(path_or_fd, os.O_RDONLY|getattr(os, 'O_BINARY', 0)|flags)
                #END handle fd

In the case of this bug report, path_or_fd is a path, so the fd gets opened. At this point, it is file descriptor 3.

Then later on, in the same function, I can see:

                finally:
                        if isinstance(path_or_fd, basestring):
                                os.close(fd)
                        #END only close it if we opened it

And here, the file descriptor 3 is closed. So everything is good, right?

No, because in between, there is this:

self._mf = mmap(fd, actual_size, **kwargs)

This maps the content of the file pointed to by file descriptor 3 into memory... and it also opens the file again, as file descriptor 4!

It is this one which never gets closed.

Now let's consider the following:

import mmap
import os


raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
                                                            os.getcwd()))

f = open("hello", "r+b")
raw_input("Check open files again (opened 'hello' as fd %s)" % f.fileno())

mf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
raw_input("Check open files again (mmaped fd %s)" % f.fileno())

f.close()
raw_input("Check open files again (closed 'hello')")

mf.close()
raw_input("Check open files again (closed the mmap)")

This reproduces exactly the same behaviour: a file is opened as fd 3; mmap-ing its content opens the same file again as fd 4; closing the file object releases fd 3; and finally, closing the map releases fd 4.

So what is missing from smmap is to close the memory map when done with it.
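
To illustrate the missing piece, here is a minimal sketch of a wrapper that guarantees both descriptors are released (the mapped() helper is illustrative only, not smmap's actual API):

import contextlib
import mmap
import os

@contextlib.contextmanager
def mapped(path):
    """Yield a read-only map of `path`, releasing both descriptors on exit."""
    fd = os.open(path, os.O_RDONLY | getattr(os, 'O_BINARY', 0))
    try:
        mf = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # closes fd 3; CPython's mmap keeps its own dup (fd 4)
    try:
        yield mf
    finally:
        mf.close()    # the close that smmap never performs: releases fd 4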

@Byron
Member

Byron commented May 18, 2012

Thank you for making the effort to point out the cause of the issue so clearly.
As this affects all git-python installations, I should definitely be on it with high priority!

In any case, the memory map needs to remain open until the last user of the respective memory region fades away, and apparently this doesn't happen, maybe because someone keeps it alive by holding a reference to it; this at least would be my guess.

AFAIK, when the mmap instance goes out of scope, it will close automatically; maybe I am relying on this. Also, smmap apparently has a known issue with reference counting in some Python versions, which may also lead to the issue we see here.

@bochecha
Author

As this affects all git-python installations, I should definitely be on it with high priority!

I'd certainly appreciate it. :)

Especially since it's not even possible to keep track of those files (by monkey-patching Python's open() and file()) and close them by hand: the mmap module is written in C and doesn't use either open() or file().

In any case, the memory map needs to remain open until the last user of the respective memory region fades away,

I tried spending some time yesterday figuring out where that should happen, but it's just too much for me.

and apparently this doesn't happen, maybe because someone keeps it alive by holding a reference to it; this at least would be my guess.

Well, I tried destroying the repo object, but that didn't help. If a reference is kept on it, I really wonder where. :-/

Also, smmap apparently has a known issue with reference counting in some Python versions, which may also lead to the issue we see here.

I can confirm the issue on RHEL 6 (Python 2.6.6) and Fedora 16 (Python 2.7.2).

And mscherer above confirmed it on Fedora 17, which has Python 2.7.3.

So I'm not sure it's related to a Python version, or else we are extremely unlucky. :D

@Byron
Member

Byron commented Jun 7, 2012

When I implemented the reference counting, I decided to try using Python's internal one, via sys.getrefcount(). However, I realized that I had to put some magic numbers into the code to account for references to my object that I didn't really have on my radar. In that moment, I might have opened the gates for plenty of bugs.

Or I have a cyclic relation somewhere, even though I was very conscious about strong references and about who keeps pointers to whom.

Anyway, it's buggy, and a real problem, as GitPython uses this engine heavily under the hood, and by default.
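
A tiny example of why those offsets are needed: sys.getrefcount() counts its own argument too, and every name, parameter, and container slot adds one, so any hard-coded correction silently breaks when a reference appears or disappears (standard CPython behaviour, nothing smmap-specific):

import sys

obj = object()
print(sys.getrefcount(obj))  # 2: the name `obj` plus getrefcount's own argument

holder = [obj]
print(sys.getrefcount(obj))  # 3: the list slot adds one more

def probe(x):
    return sys.getrefcount(x)

print(probe(obj))            # 4: `obj`, the list slot, the parameter `x`, plus the call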

@bochecha
Author

bochecha commented Jul 9, 2012

Any news on this issue?

@Byron
Member

Byron commented Jul 11, 2012

No, sorry. Also I cannot currently assign any time to the project.

@virgiliu

Any news? I'm getting the same error with Python 2.6.6 on Debian.

@Byron
Member

Byron commented Aug 29, 2013

Wow, exactly one year ago I said I had no time, and sadly that still hasn't changed. However, next year I could possibly get to it; there are many things I want to improve and do with git-python.

@virgiliu

That sounds great.
I've managed to work around the function that caused the "Too many open files" error by using git commands directly.

Here's a simplified version of the code that caused the error:

import git

def last_commit(repo, rev, path):
    commits = repo.iter_commits(rev, path, max_count=1)
    return commits.next()

repo = git.Repo('/path/to/repo')

for i in range(1000):
    print last_commit(repo, 'master', '/path/to/some/dir/inside/repo')

After about 500 iterations it crashes with:

OSError: [Errno 24] Too many open files

Is there any way of getting the latest commit for a directory or a file on a certain branch or tag besides using iter_commits? I'd rather use objects instead of parsing raw git output and building my own dictionary with the values.

@bochecha
Author

@virgiliu My personal workaround is to run the GitPython calls (at least the ones which trigger this bug) in separate processes. :-/

@virgiliu

virgiliu commented Sep 2, 2013

@bochecha could you show me an example of that?

@bochecha
Author

bochecha commented Sep 2, 2013

@virgiliu: sure. I just hope @Byron doesn't mind that we're using his bug tracker to discuss like this. :)

So, in one of my projects I'm using the following solution. I haven't pushed it yet (mostly because I'm not entirely satisfied with it), so I can't just give you a link to the code. You can assume that the code below is MIT-licensed.

So first, I have the following decorator:

class subprocessed(object):
    """Decorator to run a function in a subprocess

    We can't use GitPython directly because it leaks memory and file
    descriptors:
        https://github.com/gitpython-developers/GitPython/issues/60

    Until it is fixed, we have to use multiple processes :(
    """
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, *args, **kwargs):
        self.instance = instance
        return self

    def __call__(self, *args):
        from multiprocessing import Process, Queue
        q = Queue()

        def wrapper(queue, func, *args):
            queue.put(func(*args))

        p = Process(target=wrapper,
                    args=(q, self.func, self.instance) + args)
        p.start()
        p.join()

        return q.get()

Then, you just need to decorate a method with it, and it gets run in a subprocess!

In my case, I have a class Foobar which does stuff on various Git repositories:

class Foobar(object):
    def __init__(self):
        self.git_rooturl = "git://git.example.com"
        self.workdir = "/path/to/where/the/repo/will/be/cloned"

    def do_stuff_on_repo(self, reponame):
        curdir = os.getcwd()

        # Clone the module
        repo = git.Repo.clone_from("%s/%s" % (self.git_rooturl, reponame),
                                   os.path.join(self.workdir, reponame))
        os.chdir(repo.working_tree_dir)

        # Do your stuff here
        result = ...

        os.chdir(curdir)
        shutil.rmtree(repo.working_tree_dir)
        del repo

        return result

Unfortunately, the do_stuff_on_repo method goes through the affected code path, and I call it repeatedly (even on different repositories), so it leaks memory and fds.

As a result, I'm just defining it in this way:

    @subprocessed
    def do_stuff_on_repo(self, reponame):
        ...

All I changed is that I decorated it, and now each call to the function will be run in its own process, so the leaking memory and fds get cleared when the process exits.

The only tricky part not handled above is what to do in case of an error: if do_stuff_on_repo raises an exception, for example, the subprocess just exits without affecting the main process where your application is running.

How to handle that properly is left as an exercise to the reader (because that's the part I'm not very happy with :P).
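
For what it's worth, here is a sketch of one way to plug that hole, assuming the raised exception is picklable: ship either the result or the exception through the queue, and re-raise it on the parent side (names and structure mirror the decorator above, but this variant is untested):

from multiprocessing import Process, Queue

def _wrapper(queue, func, *args):
    # runs in the child: ship back either the result or the exception
    try:
        queue.put((True, func(*args)))
    except Exception as e:
        queue.put((False, e))  # the exception instance must be picklable

class subprocessed(object):
    """Variant of the decorator above that re-raises child exceptions."""
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, *args, **kwargs):
        self.instance = instance
        return self

    def __call__(self, *args):
        q = Queue()
        p = Process(target=_wrapper,
                    args=(q, self.func, self.instance) + args)
        p.start()
        ok, value = q.get()  # get() before join() avoids a full-pipe deadlock
        p.join()
        if not ok:
            raise value      # re-raise in the parent process
        return value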

@virgiliu

virgiliu commented Sep 2, 2013

@bochecha Great stuff. 👍
Thanks a lot! I'll give it a try if/when I run into this problem again (for now I've circumvented it by using raw git commands) :)
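
For completeness, one shape that raw-command route can take: a sketch only (last_commit_via_log below is an illustrative helper, not a GitPython API) that shells out via repo.git for the history walk and only touches the object database once, by exact sha:

import git

def last_commit_via_log(repo, rev, path):
    # equivalent to: git log -1 --format=%H <rev> -- <path>
    sha = repo.git.log('-1', '--format=%H', rev, '--', path)
    return repo.commit(sha)  # still hands back a Commit object to work with

repo = git.Repo('/path/to/repo')
print(last_commit_via_log(repo, 'master', 'some/dir/inside/repo'))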

@dirkgr

dirkgr commented May 27, 2014

Bumping this. I just spent a ton of time tracking this down. It's still there :-)

@abesto

abesto commented Jun 9, 2014

+1

@mricon

mricon commented Jul 9, 2014

This is definitely making GitPython unusable for projects like grokmirror, which must be able to operate on thousands of repositories. As this bug is now 2+ years old, it doesn't look like it'll be fixed at this point. :(

@AnthonyBrunasso

Ahhh, I am having the same issue. Would be nice to see a fix. :)

@smn

smn commented Oct 13, 2014

Is anyone working on this bug?

@Byron Byron added this to the v0.3.5 - bugfixes milestone Nov 14, 2014
@Byron Byron self-assigned this Jan 7, 2015
@Byron
Member

Byron commented Jan 7, 2015

The issue presented by @virgiliu seems to be fixed now (with the latest version of smmap, 0.9.0), but by artificially constraining the file limit to 64, I can still see that pipes are leaked.
For some reason, repository instances are never freed, and I will have to figure out why that happens.
[screenshot from 2015-01-07 showing the leaked processes]

@Byron
Member

Byron commented Jan 7, 2015

As __del__() is not reliably called in Python 3, the tests leak heavily: no repository that was ever used gets cleaned up, leaving plenty of the processes shown in the screenshot above. Each of them comes with three pipes, quickly using up the process's resources.

For example, it's possible to run all tests with ulimit -n 64 in Python 2, but it takes ulimit -n 140 in Python 3.

@Byron Byron closed this as completed in 36cdfd3 Jan 7, 2015
@Byron
Member

Byron commented Jan 7, 2015

It turned out I forgot to add a tear-down super class call, which was responsible for most of the test leakage.
On the plus side, Travis now checks for this as well.
An improved leak test was added, which clearly shows that repositories dispose of their file handles correctly.
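
For readers who want to verify their own setup, the gist of such a leak test can be as small as this (a Linux-specific sketch; the actual test in the repository may differ):

import gc
import os

import git

def open_fds():
    # each entry in /proc/self/fd is one open descriptor of this process
    return len(os.listdir('/proc/self/fd'))

before = open_fds()
repo = git.Repo('.')
repo.is_dirty()
del repo
gc.collect()  # give any reference cycles holding maps alive a chance to go
assert open_fds() == before, "file handles leaked"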

@mricon

mricon commented Jan 7, 2015

Fantastic, thanks!

@smithsp

smithsp commented Oct 12, 2015

I am still seeing this problem in version 1.0.1. Here is a little test script:

import git

repo = git.Repo('./')
for k in repo.iter_commits(max_count=25):
    print k

At the same time, look at the number of open files with lsof | grep ".git". I see that the number of open files increases by about 25, and the only way to bring it back down is to close the Python session.
(I should point out that I am running on Mac OS X, with a mostly MacPorts Python installation.)

@Byron
Member

Byron commented Oct 16, 2015

I think it will work if you use the latest HEAD of this repository, which would be v1.0.2 if I hadn't lost access to my PyPI account.

In v1.0.1, you should be able to instantiate the Repo type with git.Repo(path, odbt=GitCmdObjectDB) to alleviate the issue.

If this works for you, I'd be glad if you could let me know here as well.

@earwig

earwig commented Oct 29, 2015

I have some reports of the same issue in earwig/git-repo-updater#9, I think. Will ask people if GitPython HEAD fixes it.

bavaria95 added a commit to bavaria95/jens that referenced this issue Jul 13, 2016
using GitCmdObjectDB when initiating repo(suggestion of library creator; gitpython-developers/GitPython#60 (comment) ; for the future may be better to use newer version of the library)