
randomstats with many iterations opens too many files #38

Closed
jakebiesinger opened this issue Sep 9, 2011 · 12 comments

Comments

@jakebiesinger
Contributor

I have two fairly small bed files (~1600 features in each, overlapping by ~500 features). I'd like to calculate the empirical overlap p-value but to get a decently small p-value, I need a lot of shuffles.

pybedtools is apparently opening a new file for each iteration (not sure how, since the files are clearly unlinked and the Python objects are deleted as soon as the intersect results are done; relevant code).

Any way around this? I can increase the user limit for open files on this machine, but I doubt it will suffice for tens of millions of overlaps... Looks like it's crapping out at only 4541 files on this machine.

In [1]: import pybedtools
In [2]: chromsizes = {'chr1':(1,1000)}
In [3]: a = pybedtools.example_bedtool('a.bed').set_chromsizes(chromsizes)
In [4]: b = pybedtools.example_bedtool('b.bed')
In [5]: results = a.randomstats(b, 1000000, debug=True)

<type 'exceptions.OSError'>: Too many open files
The command was:

        shuffleBed -i /usr/local/lib/python2.7/dist-packages/pybedtools/test/data/a.bed -seed 4540 -g /tmp/pybedtools.DsnJBh.tmp

Things to check:

        * Too many files open -- are you creating lots of BedTool objects without calling del() on them?

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)

/home/wbiesing/<ipython console> in <module>()

/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in randomstats(self, other, iterations, **kwargs)
   1543         distribution = self.randomintersection(other, iterations=iterations,
   1544                                                **kwargs)
-> 1545         distribution = np.array(list(distribution))
   1546
   1547         # Median of distribution


/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in randomintersection(self, other, iterations, intersect_kwargs, shuffle_kwargs, debug)
   1624             if debug:
   1625                 shuffle_kwargs['seed'] = i
-> 1626             tmp = self.shuffle(**shuffle_kwargs)
   1627             tmp2 = tmp.intersect(other, **intersect_kwargs)
   1628

/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in decorated(self, *args, **kwargs)
    370             # this calls the actual method in the first place; *result* is

    371             # whatever you get back

--> 372             result = method(self, *args, **kwargs)
    373
    374             # add appropriate tags


/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in wrapped(self, *args, **kwargs)
    174             # Do the actual call

    175             stream = call_bedtools(cmds, tmp, stdin=stdin,
--> 176                                    check_stderr=check_stderr)
    177             result = BedTool(stream)
    178

/usr/local/lib/python2.7/dist-packages/pybedtools/helpers.pyc in call_bedtools(cmds, tmpfn, stdin, check_stderr)
    299         print 'Things to check:'
    300         print '\n\t' + '\n\t'.join(problems[err.errno])
--> 301         raise OSError('See above for commands that gave the error')
    302
    303     return output

OSError: See above for commands that gave the error

@daler
Owner

daler commented Sep 9, 2011

Thanks for reporting this.

Setting the user file limit is an ugly, ugly solution, and I agree it's unlikely to be sufficient.

I suspect this bug has something to do with the Cython-wrapped IntervalFile not getting cleaned up properly. I tried some quick fixes but no luck yet -- I'll continue working on this and post here when it's fixed.

@daler
Owner

daler commented Sep 10, 2011

Wow, nefarious bug. It seems to have been caused by stdin/stdout/stderr filehandles not getting cleaned up by the subprocess.Popen instances created every time a BEDTools command is called.

This is now fixed in 703f7e2 but only for Python 2.7 (see the commit comment at the bottom of that page for details).

For now I'm keeping this in a separate branch until I figure out another way that works for both 2.6 and 2.7. If you're stuck on 2.6, you can redirect stderr to avoid seeing the crazy number of Popen.__del__ errors.
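For context, the leak is easy to reproduce in isolation: a subprocess.Popen created with piped handles holds open file descriptors until those pipes are drained and closed, and with hundreds of thousands of short-lived calls the garbage collector cannot keep up. A sketch of the explicit-cleanup pattern (run_bedtool_cmd is an illustrative name, not the pybedtools API):

```python
import subprocess
import sys

def run_bedtool_cmd(cmd):
    """Run one command with piped output and release its file
    descriptors immediately. communicate() drains and closes the
    stdout/stderr pipes and waits on the child, so nothing is left
    hanging for the garbage collector to clean up later."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    return out, p.returncode

out, rc = run_bedtool_cmd([sys.executable, "-c", "print('ok')"])
```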

@jakebiesinger
Contributor Author

Hrmm. I'm getting a gcc: pybedtools/cbedtools.cpp: No such file or directory after git clone and python setup.py install. The file isn't in the git repo; is it supposed to be generated via pyrex?

Sorry for the close...

@daler
Owner

daler commented Sep 13, 2011

Yeah, if you install from pip or easy_install, the .cpp files are already compiled to avoid a dependency on Cython. But working with the development code is a little different.

First do this to get the new filehandle branch:

git clone git@github.com:daler/pybedtools.git
git fetch origin filehandle-patch:filehandle-patch
git checkout filehandle-patch

Then do this, which ought to have Cython generate the .cpp file:

python build.py

And finally you can do:

python setup.py install

Does that work?

Also . . . I think I'd rather keep this issue open, since it isn't as elegant a solution as I'd like.

@jakebiesinger
Contributor Author

Ah, works perfectly now. I didn't mean to close the issue-- just GitHub's "Comment & Close" button placement...

Makes sense about the Cython dependency. I haven't seen the setup.py and build.py separation before.

@jakebiesinger
Contributor Author

The build works fine, but I'm getting a strange OSError:

In [6]: a.randomstats(b, 100000)
<type 'exceptions.OSError'>: Cannot allocate memory
The command was:

    intersectBed -a stdin -b /home/wbiesing/src/pybedtools/pybedtools/test/data/b.bed -u

Things to check:

/home/wbiesing/src/pybedtools/pybedtools/helpers.pyc in call_bedtools(cmds, tmpfn, stdin, check_stderr)
    299 
    300         print 'Things to check:'
--> 301         print '\n\t' + '\n\t'.join(problems[err.errno])
    302         raise OSError('See above for commands that gave the error')
    303 

KeyError: 12

I've also seen this show up as:

<type 'exceptions.OSError'>: Cannot allocate memory
The command was:

    shuffleBed -i /home/wbiesing/src/pybedtools/pybedtools/test/data/a.bed -g /tmp/pybedtools.JPuSgj.tmp

RAM is not the issue here-- usage is very low and not near the 12gb limit on this machine. Removing your try/catch in call_bedtools gives:

OSError                                   Traceback (most recent call last)

/home/wbiesing/<ipython console> in <module>()

/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in randomstats(self, other, iterations, **kwargs)
   1575         distribution = self.randomintersection(other, iterations=iterations,
   1576                                                **kwargs)
-> 1577         distribution = np.array(list(distribution))
   1578 
   1579         # Median of distribution


/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in randomintersection(self, other, iterations, intersect_kwargs, shuffle_kwargs, debug, report_iterations)
   1660                     sys.stderr.write('\r%s' % i)
   1661                     sys.stderr.flush()
-> 1662             tmp = self.shuffle(stream=True, **shuffle_kwargs)
   1663             tmp2 = tmp.intersect(other, stream=True, **intersect_kwargs)
   1664 

/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in decorated(self, *args, **kwargs)
    372             # this calls the actual method in the first place; *result* is

    373             # whatever you get back

--> 374             result = method(self, *args, **kwargs)
    375 
    376             # add appropriate tags

/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in wrapped(self, *args, **kwargs)
    174             # Do the actual call

    175             process, stream = call_bedtools(cmds, tmp, stdin=stdin,
--> 176                                    check_stderr=check_stderr)
    177             result = BedTool(stream)
    178             result.process = process

/home/wbiesing/src/pybedtools/pybedtools/helpers.py in call_bedtools(cmds, tmpfn, stdin, check_stderr)
    255                                  stdout=subprocess.PIPE,
    256                                  stderr=subprocess.PIPE,
--> 257                                  bufsize=1)
    258             output = p.stdout
    259             stderr = None

/usr/lib/python2.7/subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
    670                             p2cread, p2cwrite,
    671                             c2pread, c2pwrite,
--> 672                             errread, errwrite)
    673 
    674         if mswindows:

/usr/lib/python2.7/subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
   1113                     gc.disable()
   1114                     try:
-> 1115                         self.pid = os.fork()
   1116                     except:
   1117                         if gc_was_enabled:

OSError: [Errno 12] Cannot allocate memory

and suddenly the terminal that ran this process can't fork any more jobs

wbiesing@cross:~$ ls /tmp/ | wc -l
bash: fork: Cannot allocate memory

which makes it look like too many processes have been forked without cleanup...?
Also, it would be nice if the temporary files created by pybedtools were cleaned up--

wbiesing@cross:~$ ls /tmp/ | wc -l
43386
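One plausible mechanism for the fork failure above: a child that has exited but was never wait()ed on remains a zombie, keeping its process-table slot and its pipe descriptors until the parent reaps it, so fork() eventually fails. A minimal illustration of explicit reaping (not pybedtools code):

```python
import subprocess
import sys

# Spawn a short-lived child. Until the parent calls wait(), the exited
# child lingers as a zombie, counting against process and fd limits.
p = subprocess.Popen([sys.executable, "-c", ""])
p.wait()  # reaps the child and records its exit status
```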

@daler
Owner

daler commented Sep 13, 2011

Hmm. Seems like more subprocess annoyances . . . this may take some time to debug. I'm worried pybedtools may be pushing subprocess in ways it wasn't designed for, at least for the randomintersection use-case.

As for the cleanup, things should automatically get cleaned up if everything exits normally (thanks to atexit.register(cleanup) in helpers.py). Not sure how to gracefully handle post-crash cleanup though -- I've found it useful for debugging to have the tempfiles stick around after a crash. And I'm worried that deleting tempfiles before exit could result in data loss. Any ideas?

@daler
Owner

daler commented Sep 14, 2011

Still having trouble with the memory problem.

However, while debugging I found that in Python 2.6, if I use a processes value > 1 when calling randomstats(), the Popen.__del__ warning messages go away, providing an inadvertent fix. Sweet.

I'm not fluent with multiprocessing. I tried removing the if processes > 1 conditional, hoping to just run any call to randomstats() through a Pool albeit with one worker. Doing so results in my new favorite error:

AssertionError: daemonic processes are not allowed to have children

(probably good universal advice in addition to an assertion error)

Do you know of a way of letting a Pool run with a single process? If so, this could fix the py2.6 problem so these changes could be merged into the master branch

edit: never mind about the merging . . . everything would have to be run through a Pool to get this benefit.

@jakebiesinger
Contributor Author

You'll get infinite recursion if you remove the conditional completely. The excellent error message is catching the Pool'ed process trying to create its own Pool (which would then create its own, etc). Better to do something like:

diff --git a/pybedtools/bedtool.py b/pybedtools/bedtool.py
index 33ec207..c2d9bf6 100644
--- a/pybedtools/bedtool.py
+++ b/pybedtools/bedtool.py
@@ -1648,7 +1648,7 @@ class BedTool(object):
             [2, 2, 2, 0, 2, 3, 2, 1, 2, 3]

         """
-        if processes > 1:
+        if processes is not None:
             p = multiprocessing.Pool(processes)
             iterations_each = [iterations / processes] * processes
             iterations_each[-1] += iterations % processes
diff --git a/pybedtools/helpers.py b/pybedtools/helpers.py
index b0f8db2..e181ff9 100644
--- a/pybedtools/helpers.py
+++ b/pybedtools/helpers.py
@@ -377,4 +377,4 @@ def _call_randomintersect(_self, other, iterations, intersect_kwargs,
                                         intersect_kwargs=intersect_kwargs,
                                         shuffle_kwargs=shuffle_kwargs,
                                         report_iterations=report_iterations,
-                                        debug=False, processes=1))
+                                        debug=False, processes=None))
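The failure mode behind that AssertionError can be reproduced in isolation: Pool workers are daemon processes, and multiprocessing forbids a daemon process from starting children of its own. A small demonstration (illustrative names, not pybedtools code):

```python
import multiprocessing

def nested(_):
    # A Pool worker is a daemon process; starting children from it
    # trips the assertion quoted above.
    multiprocessing.Pool(1)

def demo():
    pool = multiprocessing.Pool(1)
    try:
        pool.apply(nested, (0,))
        return None  # would mean no error was raised
    except AssertionError as exc:
        return str(exc)  # re-raised in the parent by apply()
    finally:
        pool.close()
        pool.join()

msg = demo()
```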

@daler
Owner

daler commented Sep 14, 2011

Ah, right. Thanks for the info.

Since the memory issue, and general subprocess module woes, are proving problematic, I'm thinking of using a naive os.system call inside randomintersect rather than relying on the pybedtools subprocess mechanism -- essentially special-casing the code that will be run hundreds of thousands of times.

@daler
Owner

daler commented Sep 17, 2011

Short answer: I think all of these issues are fixed as of 1ddb49c. Can you please test?

Long answer and notes to self:
randomintersection() calls __len__, which calls count, which calls __iter__ which dispatches to the correct iterator based on the kind of BedTool it is.

For file-based BedTools, the iterator is a Cython IntervalFile. This appears to leave an open file hanging somewhere, even when the BedTool is deleted (circular references somewhere?). This still needs to be addressed, but there's a way around it: using a stream-based BedTool.

For stream-based BedTools, calling __iter__ returns a Cython IntervalIterator. The underlying BedTool.fn, though, is the stdout of the subprocess.Popen call, rather than the filename passed to IntervalFile for file-based BedTools. That means the fn can be specifically closed, freeing up the open file. So the simple fix can be seen here: 1ddb49c#L0R1693.

However, adding logic to BedTool.__del__, like:

def __del__(self):
    if isinstance(self.fn, file):
        self.fn.close()

does not fix the problem. My guess is that this is due to circular references that need to be tracked down and broken (i.e., search object.__del__ on http://docs.python.org/reference/datamodel.html).

So: the issue is fixed, but it will be nice to have __del__ do all the work so that users creating hundreds of thousands of BedTools outside of the randomintersection method won't run into this issue.
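For reference, in Python 2 of this era an object with __del__ that sat in a reference cycle was never collected at all (the whole cycle landed in gc.garbage), which is exactly why the __del__ approach above can't work. Since Python 3.4 that restriction is gone, and weakref.finalize offers a cleaner hook either way. A sketch with an illustrative class name, not the real pybedtools API:

```python
import gc
import os
import tempfile
import weakref

class StreamedBedTool:
    """Illustrative stand-in for a stream-based BedTool."""
    def __init__(self, fn):
        self.fn = open(fn)
        # weakref.finalize (Python 3.4+) runs when the object is
        # collected, even if the object is part of a reference cycle
        # where relying on __del__ would be fragile.
        self._finalizer = weakref.finalize(self, self.fn.close)

# Demo: put the object in a cycle, drop it, and force collection.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
obj = StreamedBedTool(tmp.name)
f = obj.fn
obj.cycle = obj   # reference cycle
del obj
gc.collect()      # cycle collector fires the finalizer
os.unlink(tmp.name)
```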

@jakebiesinger
Contributor Author

Yeah the fix looks good to me, even on my larger files and 100's of thousands of iterations and several processes. Thanks for the update!

@daler daler closed this as completed Sep 27, 2011