Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eventlet with pypy : Bad File Descriptor Error #318

Closed
ysingh7 opened this issue May 19, 2016 · 8 comments
Closed

Eventlet with pypy : Bad File Descriptor Error #318

ysingh7 opened this issue May 19, 2016 · 8 comments
Milestone

Comments

@ysingh7
Copy link
Contributor

ysingh7 commented May 19, 2016

Hi

I am running Open Stack Swift application, which is a cloud storage solution from the Open Stack application suite. The application runs fine using Cpython but when I try to run it using pypy I run into the following error :

ERROR with Object server 192.168.0.10:6000/d2 re: Trying to get final status of PUT to /v1/AUTH_e51c08553a6e4333aeac3f943577ca76/ssbench_000013/small_093888:
Traceback (most recent call last):
File "/home/intel/Workspace/swift-pypy/pypy-5.1.1-linux64/site-packages/eventlet/greenio/base.py", line 334, in _recv_loop
    return recv_meth(*args)
File "/home/intel/Workspace/swift-pypy/pypy-5.1.1-linux64/lib-python/2.7/socket.py", line 232, in recv
    return self._sock.recv(buffersize, flags)
error: [Errno 9] Bad file descriptor (txn: txd507dbb5dd5b4d66aeeec-00573af568) (client_ip: 192.168.0.210)

I suspect that the implementation of the GreenSocket makefile function might have something to do with it.
Pypy maintains reference count for the the real socket object _sock and uses function reuse : to increment the reference count and drop to decrement the reference count. If the reference count reaches 0, the underline _sock (real socket) is closed as described in comment in socket.py :

"You make use (or a library you are using makes use) of the internal
classes '_socketobject' and '_fileobject' in socket.py, initializing
them with custom objects. On PyPy, these custom objects need two
extra methods, _reuse() and _drop(), that maintain an explicit
reference counter. When _drop() has been called as many times as
_reuse(), then the object should be freed."

I will like to discuss the implementation of GreenSocket makefile function in /eventlet/greenio/base.py :

    if six.PY3:
        def makefile(self, *args, **kwargs):
            return _original_socket.makefile(self, *args, **kwargs)
    else:
        def makefile(self, *args, **kwargs):
            dupped = self.dup()
            res = _python2_fileobject(dupped, *args, **kwargs)
            if hasattr(dupped, "_drop"):
                 dupped._drop()  # no socket gets dropped after commented this out
            return res

Drop function is explicitly called on the dupped object, so the reference count for the underline real socket _sock will be decremented by 1. The dupped variable created is a temporary variable and will go out of scope after this function. So when the Garbage Collector kicks in, it will claim this object by running its destructor. dupped is of type GreenSocket so it will call its destructor :

    def __del__(self):
        # This is in case self.close is not assigned yet (currently the constructor does it)
        close = getattr(self, 'close', None)
        if close is not None:
            close()

The destructor of dupped will call the close function which is reference to fd.close, that points to close function of the underlying socketobject.

    def close(self):
        try:
            if self._sock:
                self.flush()
        finally:
            s = self._sock
            self._sock = None
            if s is not None:
                if self._close:
                    s.close()
                else:
                    _do_reuse_or_drop(s, '_drop')

The close function of the socketobject calls drop again on the same socket _sock decrementing it further by 1. Since the reference held by dupped was already decremented once by explicit call to drop in makefile function, the second call is decrementing the refcount further leading to closing of the socket when garbage collector comes into the picture. This is causing the Bad File Descriptor error.

If the reference count is already being decremented by the destructor, then there is no need of explicit call to drop. If I run with this code, it runs fine for me :

_if six.PY3:
        def makefile(self, *args, **kwargs):
            return _original_socket.makefile(self, *args, **kwargs)
    else:
        def makefile(self, *args, **kwargs):
            dupped = self.dup()
            res = _python2_fileobject(dupped, *args, **kwargs)
            #if hasattr(dupped, "_drop"):
             #    dupped._drop()  # no socket gets dropped after commented this out
            return res

Can anyone please help me with this issue.

@temoto
Copy link
Member

temoto commented May 20, 2016

@ysingh7 I've tried this synthetic example to reproduce the problem, but it didn't manifest. Do you have a better idea how to test it?

import eventlet
import sys

host = 'google.com'
if len(sys.argv) > 1:
  host = sys.argv[1]
s1 = eventlet.connect((host, 80))
f = s1.makefile()
f.write(b'GET / HTTP/1.0\r\nConnection: close\r\n\r\n')
r = f.read(8<<10)
f.close()
print(r)
s1.close()

import gc
gc.collect()

@ysingh7
Copy link
Contributor Author

ysingh7 commented May 24, 2016

@temoto I have modified the above test case to reproduce the error :

import eventlet
import sys
import time

host = 'localhost'
if len(sys.argv) > 1:
  host = sys.argv[1]
s1 = eventlet.connect((host, 8888))
print("Socket and File Descriptor beneath is : ",s1.fd._sock,s1.fd.fileno())
f = s1.makefile()
f.write(b'GET / HTTP/1.0\r\nConnection: close\r\n\r\n')
f.close()
import gc
print("Garbage Collector Kicking In")
gc.collect()
print("Socket and File Descriptor beneath is : ",s1.fd._sock,s1.fd.fileno())
f = s1.makefile()
f.write(b'GET / HTTP/1.0\r\nConnection: close\r\n\r\n')
r = f.read(8<<10)
f.close()
print("No Error")
s1.close()

This issue is similar to one reported in Issue #213.

@ysingh7
Copy link
Contributor Author

ysingh7 commented Jun 2, 2016

We further investigated the issue and came up with this explaination :

Problem: exception leading to application to crash While running under PyPy, and driven by Swift benchmark ssbench/or a customized script simple_put_test.py, OpenStack Swift (the proxy processes on the proxy server) suffered unhandled exception, “Bad file descriptor”, leading to application failure. This happens only running PyPy as the Python interpreter, and appears to be in some random fashion. Same code runs with no issue under CPython.

Root cause found related to app code, not PyPy, but PyPy is not entirely out of the loop (it would be nice somehow PyPy could take care of this ref count, but that may be unlikely) It’s advised by PyPy developers that additional application codes must be written specifically for PyPy to deal with networking socket objects (customized ref counting). Eventlet, a standard Python library, was found handling the ref count incorrectly, thus leading to the exception.

Solution is to patch the eventlet library code A patch has been created to patch eventlib (base.py) for PyPy. The code change is not expected to impact any application to run under CPython.

Issue related to the way networking socket is used When lower level networking socket was shut down unexpectedly earlier, the Python application OpenStack Swift attempting to write to or read from a networking socket will crash (exception).

CPython and PyPy handles object memory management differently:
CPython uses ref count to clean up objects that are no longer being used, and re-claim the memory back. PyPy uses Garbage Collection (GC) to perform the same task. Fundamental difference: CPython tracks the object usage (references to it) and re-claim the memory back as soon as the object is out of scope (no longer has any reference to it). PyPy’s GC may kick in at any moment after the object is out of scope, i.e., an un-deterministic time, or delayed compared with CPython. This delay is actually an issue for apps like Swift. If the an environment variable, PYPY_GC_MIN value is set large enough such that the app never exceed that value, GC may never kicks in. This is OK in all except in one known case already, when networking socket is being used.

Closing socket requires two OS API calls in sequence:
To clean up a socket, two methods must be called, close() and shutdown() to the OS level via OS API. Every “close” call will remove 1 reference count to that socket, and you would think the socket would be “closed” when zero reference count is reached. However, the socket is not “closed” from OS point of view. Unless shutdown is called prior to close, that socket will remain OPEN until eventually timeout. This will appear as “socket leaking”, or with number of open connections to the peers keeping growing, eventually may slow down the peer or reaching the maximum number of connection allowed on the peer and resulting a peer denying any further connection to be made. This is exactly the case with PyPy, due to GC delay discussed above.

Applications such as OpenStack Swift make multiple abstraction of the underlying OS socket or with multiple Python object holding reference to the same underlying OS socket. Making a “close” call from one Python object does not or should not have any impact to other objects that are still holding the reference. Ideally, however, as soon as all Python objects holding the socket are out of scope, the socket should be shut down and closed immediately, which is the way CPython handles socket. PyPy can’t do the same way due to its delayed GC.

Solution proposed from PyPy developers is that app owner should track their socket usage, by making a _reuse() call from the application code any time an reference to a socket is made, and _drop() call any time an reference is removed. These two methods are implemented internally by PyPy interpreter and these calls are synchronous. As soon as the last _drop() is made leading to an ref count of zero (ref count internally maintained by PyPy), PyPy will call OS API shutdown and close, effectively close the connection. At the lower level, the socket.c (OS implementation) will flash out any remaining data in the outgoing buffer, and send a “FIN” message to the peer, leading to peer to close its side of the connection, while closing/shutting down the local socket.

With all being said and according to the PyPy documentation, the application is responsible to ensure the accounting is done correctly (equal number of _reuse and _drop are called).

In further details to give an example where the error occurs, a “conn” object holds a ref to the socket. A duplicate conn is made for converting to a different type of Python object, and then the code creates a new Python object “res” which is returned in “makefile”. The intermediate “duppped” (or a temporary variable) made a _drop call before being out of scope and then “dupped “can be reclaimed. When GC kicks in (any time), the destructor for “dupped” will be called, and a “close” call is made on behalf of the “dupped”. This triggers a call to _drop to the PyPy interpreter by the socket.py library. At this point, the socket is shut down and closed. The “res” object, however, holding a ref to the same socket, is still trying to read/or write from that same socket, and would suffer exception. Our trace indicates “fd=-1” when the exception occurs, which means the Python object using the socket has no idea the underlying socket was already closed.

The patch is setting dupp.close = None, meaning the dupped has no method called “close”, thus preventing a “close” call (triggers another _drop()) from happening when it’s destructor is called by PyPy GC. This corrects the ref accounting error, and make sure ref count would not reach zero until the right time.

--- ./site-packages/eventlet/greenio/base.py.orig       2016-05-12 16:35:00.561756000 -0700
+++ ./site-packages/eventlet/greenio/base.py    2016-05-20 17:16:20.965869009 -0700
@@ -299,6 +299,8 @@ class GreenSocket(object):
             res = _python2_fileobject(dupped, *args, **kwargs)
             if hasattr(dupped, "_drop"):
                 dupped._drop()
+            dupped.close = None
             return res

     def makeGreenFile(self, *args, **kw):

@notmyname
Copy link

@ysingh7 can you reformat the patch as a pull request? Might make it easier for people to review and then merge.

@ysingh7
Copy link
Contributor Author

ysingh7 commented Jun 10, 2016

@notmyname I have created a pull request for this issue : #326
@temoto Can you have a look at this pull request? Initially I was getting random failure in build test cases, now it is passing without failure.

@clayg
Copy link
Contributor

clayg commented Jul 7, 2016

@temoto I dug around through revision history as best I could trying to track down some sort of explanation as to why the eventlet implementation of makefile calls dup and got lost. I left some comments on the #326.

@ysingh7 Could you test pypy + swift with out calling dup in makefile and report the results? You seemed to indicate that not calling _drop on the dupped handle also worked - but was there not a problem then running out of file descriptors under load if the pypy gc was being lazy?

Maybe if just getting rid of dup works and looks more obviously correct perhaps we can loose the call to _drop there too and simplify all our lives?

@temoto
Copy link
Member

temoto commented Jul 7, 2016

@ysingh7 thank you very much for very detailed explanation and putting much work into it.

@notmyname patches are just fine, if not better, compared to pull requests.

@clayg maybe dup is redundant, I also don't know why it was required. If it works without dup, I'd be only happy to remove it.

temoto pushed a commit that referenced this issue Jul 7, 2016
On PyPy `[Errno 9] Bad file descriptor` because `makefile dup/_drop` dance leaves socket with PyPy special reference counter zero, so it may be garbage collected.

#318

(@temoto) maybe proper fix is to remove `dup` in `makefile()`.
@temoto
Copy link
Member

temoto commented Jul 7, 2016

I've merged the .close = None fix. If everything works without dup, please send new patch.

Thanks.

@temoto temoto closed this as completed Jul 7, 2016
@temoto temoto added this to the v0.20 milestone Jul 7, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants