We encountered a race condition in the FileSession class, one that causes our application to discard session data prematurely when a non-trivial load is generated against the application.
The issue is that on UNIX, locks are associated with open file objects rather than pathnames, so the following code in the LockFile constructor — which is even commented as racy:
```python
# Open lockfile for writing without truncation:
self.fp = open(path, 'r+')
```

```python
# If the file doesn't exist, IOError is raised; Use a+ instead.
# Note that there may be a race here. Multiple processes
# could fail on the r+ open and open the file a+, but only
# one will get the the lock and write a pid.
self.fp = open(path, 'a+')
```
can actually produce two different file objects for the same pathname: one file is unlinked and superseded by the other, yet it still occupies allocated space. Such files are marked as (deleted) in lsof output.
With these two handles held in two separate threads, each thread successfully acquires a lock — but on a different file object — so in reality the critical section runs in parallel.
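The mechanism is easy to reproduce. Below is a minimal POSIX-only sketch (my own illustration, not CherryPy code) that simulates the two workers inside a single process for brevity; the point it demonstrates is that after the unlink, the two file descriptors refer to different inodes, so locks taken on them can never exclude each other:

```python
import fcntl
import os
import tempfile

# Simulate the race in one process (in the real bug the two opens
# happen in two separate workers).
path = os.path.join(tempfile.mkdtemp(), "session.lock")
open(path, "a+").close()          # lockfile left over from a previous request

fp1 = open(path, "r+")            # worker 1: the r+ branch succeeds
fcntl.lockf(fp1, fcntl.LOCK_EX | fcntl.LOCK_NB)

os.remove(path)                   # release_lock() unlinks the lockfile
fp2 = open(path, "a+")            # worker 2: r+ would fail, a+ recreates it
fcntl.lockf(fp2, fcntl.LOCK_EX | fcntl.LOCK_NB)

# The two fds now refer to *different inodes*, so their locks cannot
# conflict; fp1's inode is the one lsof shows as (deleted).
print(os.fstat(fp1.fileno()).st_ino != os.fstat(fp2.fileno()).st_ino)  # True
```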
A quick workaround is to avoid removing the lock file altogether, so that only the very first request can hit the race:
```diff
--- a/cherrypy/lib/sessions.py	2015-11-30 14:00:08.011323375 +0100
+++ b/cherrypy/lib/sessions.py	2015-11-30 12:41:37.864640838 +0100
@@ -554,7 +554,7 @@
     def release_lock(self, path=None):
         """Release the lock on the currently-loaded session data."""
-        self.lock.remove()
+        #self.lock.remove()
         self.locked = False
```
This solves the race because the lock file is never removed, so only the first branch in the LockFile constructor executes and all racing threads open the same, already-existing file. A small race could still occur on the very first request, while the lock file does not yet exist.
I see various alternatives to resolve this problem:
- actually place the lock on the session file itself instead of on a separate lockfile
- always create the lockfile and never remove it
- after locking, recheck the pid in the file to see whether we lost the race (reopening the file by pathname after locking would return the file object that won the race); in that case we could release our stale fd and lock the winning one
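The third alternative could look roughly like the sketch below. This is a hypothetical helper of my own, not existing CherryPy or LockFile API; it compares the inode of the fd we locked against the inode currently reachable at the pathname, and retries when they differ:

```python
import fcntl
import os

def acquire_session_lock(path):
    """Hypothetical 'recheck after locking' helper: lock, then verify by
    pathname that the fd we locked is still the live lockfile."""
    while True:
        try:
            fp = open(path, "r+")
        except IOError:
            fp = open(path, "a+")
        fcntl.lockf(fp, fcntl.LOCK_EX)        # block until the lock is granted
        try:
            ino_on_disk = os.stat(path).st_ino
        except OSError:
            ino_on_disk = None                # lockfile was removed meanwhile
        if ino_on_disk == os.fstat(fp.fileno()).st_ino:
            return fp                         # we locked the file that won
        fp.close()                            # stale (deleted) inode: retry
```

The cost is one extra stat() per acquisition, but it closes the window without having to change when the lockfile is removed.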
But I do have a couple of issues with how this lock works anyway:
- we are writing a pid, but in a threaded application it is always the same (ever since NPTL took over from LinuxThreads, so for 7-8 years at least on Linux)
- I don't see much advantage in writing the pid in the first place: it generates I/O on every request (OK, I know it is probably absorbed by the cache, but anyway), and since the pid is always the same it gives the user no clue about what is blocking the session
- I don't see much advantage in permitting lock_timeout either: it just causes waiting threads to spin, and it is far from fair. Blocking on fcntl() would probably be fairer (although I didn't check what kind of mechanism the kernel uses for synchronizing waiters)
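To make the spinning point concrete, here is a sketch of the two acquisition styles. The spinning variant is only illustrative of what a lock_timeout-style loop does, not the actual CherryPy implementation:

```python
import fcntl
import time

def wait_blocking(fp):
    # The waiter is queued in the kernel and sleeps until the lock
    # is granted; no userspace polling at all.
    fcntl.lockf(fp, fcntl.LOCK_EX)

def wait_spinning(fp, timeout, interval=0.1):
    # Illustrative lock_timeout-style loop: poll with LOCK_NB and sleep.
    # Each waiter wakes up every `interval` seconds, and whichever waiter
    # happens to poll first wins -- arrival order is not respected.
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.lockf(fp, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return
        except OSError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```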
This is a pretty serious and difficult-to-find bug, so I'd consider it critical. I am not yet sure how to work around it in our internal code base; I'd like to avoid having to patch CherryPy in our production environment.