test stability #737

Closed
tarekziade opened this Issue Feb 24, 2014 · 12 comments

3 participants

@tarekziade
Collaborator

we have a lot of unstable tests that are annoying us right now. We need to make sure we have a stable test suite

@tarekziade
Collaborator

I added the --randomize flag to the test suite so the tests run in a random order - that should help us find the leaks
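To replay one particular ordering outside the runner, seeded shuffling can be sketched with plain unittest. This is only an illustration, not how the --randomize flag is implemented; `flatten` and `shuffled_suite` are hypothetical helper names.

```python
import random
import unittest


def flatten(suite):
    """Yield individual test cases from a possibly nested TestSuite."""
    for item in suite:
        if isinstance(item, unittest.TestSuite):
            yield from flatten(item)
        else:
            yield item


def shuffled_suite(suite, seed):
    """Return a new suite with the same tests in a seed-determined order."""
    tests = list(flatten(suite))
    # A private Random instance keeps the shuffle reproducible and
    # independent of any other use of the global random state.
    random.Random(seed).shuffle(tests)
    return unittest.TestSuite(tests)
```

Logging the seed next to each run means a failing order can be reproduced by passing the same seed again.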

@tarekziade
Collaborator

I am currently working on a leak script to track all the socket/file leaks. Will keep people informed here
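The leak script itself is not shown in this thread. As a rough idea of what such a check can look like, here is a minimal sketch that snapshots open file descriptors before and after a piece of code, using only the standard library. It assumes a Unix-like system; `open_fds`, `leaked_fds` and the `max_fd` cap of 1024 are hypothetical choices for illustration.

```python
import os
import stat


def open_fds(max_fd=1024):
    """Return {fd: kind} for every open descriptor below max_fd."""
    found = {}
    for fd in range(max_fd):
        try:
            mode = os.fstat(fd).st_mode
        except OSError:
            continue  # this descriptor number is not open
        if stat.S_ISSOCK(mode):
            kind = "socket"
        elif stat.S_ISREG(mode):
            kind = "file"
        else:
            kind = "other"
        found[fd] = kind
    return found


def leaked_fds(before, after):
    """Descriptors present in the second snapshot but not in the first."""
    return {fd: kind for fd, kind in after.items() if fd not in before}
```

Taking one snapshot in setUp and another in tearDown would point at the test that left a socket or file open. A descriptor that is closed and then reused under the same number between the two snapshots would be missed by this comparison.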

@scottkmaxwell
Collaborator

FYI, we are seeing a pretty nasty leak in our deployment. Circusd ends up with 2G of virtual memory and stops responding to commands because it is out of memory. Very unpredictable. We have some systems running for weeks stable at 300M and other nearly identical systems going RAM crazy. We tried to track down the issue using the IPython shell I built into circusd, but the claimed sizes of the objects that the gc tracks add up to only about 20M. So we are guessing that some .so is the culprit. Working on running under valgrind to try to get more information.
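For reference, a number like the one above can be approximated with a few lines run from a shell inside the process. This is a sketch, not the code that was actually used; `gc_tracked_bytes` is a hypothetical name.

```python
import gc
import sys
from collections import Counter


def gc_tracked_bytes():
    """Sum the shallow sizes of every object the collector tracks.

    Returns (total_bytes, Counter mapping type name -> bytes).
    """
    total = 0
    per_type = Counter()
    for obj in gc.get_objects():
        try:
            size = sys.getsizeof(obj)
        except Exception:
            continue  # skip objects with a misbehaving __sizeof__
        total += size
        per_type[type(obj).__name__] += size
    return total, per_type
```

`sys.getsizeof` reports shallow sizes only, and `gc.get_objects()` lists only objects tracked by the collector, so memory allocated directly by a C extension does not appear in the total. A large gap between this figure and the process size is therefore compatible with a leak in native code, although it does not prove one.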

@Natim
Collaborator
Natim commented Mar 19, 2014

I had this problem once and had to restart circus; it never happened again :(

@tarekziade
Collaborator

@scottkmaxwell do you run plugins?

@scottkmaxwell
Collaborator

Yes. Running the flapping plugin and a monitor plugin of my own. Do you think that could be a factor? Separate processes so I wasn't looking there.

@tarekziade
Collaborator

no you are right.

another option would be to try tracemalloc snapshots to see if you have any luck there
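A minimal sketch of the snapshot approach, assuming an interpreter where the `tracemalloc` module is available (it is in the standard library from Python 3.4). `top_growth` and its `workload` argument are hypothetical names for illustration.

```python
import tracemalloc


def top_growth(workload, limit=5):
    """Run workload() between two snapshots and report the biggest diffs.

    Returns (workload result, list of tracemalloc.StatisticDiff).
    """
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = workload()  # keep the result alive until the second snapshot
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Group allocations by source line and sort by how much each one changed.
    stats = after.compare_to(before, "lineno")
    return result, stats[:limit]
```

In a long-running daemon the two snapshots would be taken some time apart rather than around a single call. tracemalloc only sees allocations made through Python's memory allocators, so memory that a C library allocates on its own will not show up here either.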

@scottkmaxwell
Collaborator

Good idea. Going to run valgrind first to see if the memory is going through PyMem_Malloc or not. Should have some info in a few days.

@tarekziade
Collaborator

welp - randomize breaks all the things. there are definitely some issues left in the test suite

https://travis-ci.org/mozilla-services/circus/builds/21106977

@tarekziade
Collaborator

This one is pretty recurrent

I am guessing we don't verify somewhere that the event pub socket is closed before starting a new test.

ERROR: test_before_spawn (circus.tests.test_watcher.TestWatcherHooks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/testing.py", line 427, in wrapper
    functools.partial(f, self), timeout=timeout)
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/ioloop.py", line 389, in run_sync
    return future_cell[0].result()
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 227, in wrapper
    runner.run()
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/travis/build/mozilla-services/circus/circus/tests/test_watcher.py", line 516, in test_before_spawn
    yield self._test_hooks(hook_name='before_spawn')
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 557, in run
    self.yield_point.start(self)
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 399, in start
    self.result = self.future.result()
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 227, in wrapper
    runner.run()
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 529, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/travis/build/mozilla-services/circus/circus/tests/test_watcher.py", line 389, in _test_hooks
    yield arbiter.start()
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 557, in run
    self.yield_point.start(self)
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 399, in start
    self.result = self.future.result()
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 227, in wrapper
    runner.run()
  File "/home/travis/build/mozilla-services/circus/.tox/py27/local/lib/python2.7/site-packages/tornado/gen.py", line 531, in run
    yielded = self.gen.send(next)
  File "/home/travis/build/mozilla-services/circus/circus/arbiter.py", line 504, in start
    self.initialize()
  File "/home/travis/build/mozilla-services/circus/circus/util.py", line 398, in _log
    return func(self, *args, **kw)
  File "/home/travis/build/mozilla-services/circus/circus/arbiter.py", line 469, in initialize
    self.evpub_socket.bind(self.pubsub_endpoint)
  File "socket.pyx", line 448, in zmq.backend.cython.socket.Socket.bind (zmq/backend/cython/socket.c:4126)
  File "checkrc.pxd", line 21, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/socket.c:6174)
ZMQError: Address already in use

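
The same failure mode can be reproduced without zmq. The sketch below uses plain TCP sockets from the standard library as a stand-in for the event pub socket, which is an assumption made to keep the example self-contained; `try_bind` is a hypothetical helper, not part of circus.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind a TCP socket to (host, port).

    Returns the bound socket, or None if the address is already in use.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None  # a previous owner has not released the endpoint
        raise
    return sock
```

If an earlier test still holds the endpoint, `try_bind` returns None; once that socket is closed the same call succeeds. In a test suite, closing the socket in tearDown (or registering the close with `addCleanup`) keeps one test from leaking the endpoint into the next, whatever order they run in.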
@Natim
Collaborator
Natim commented Mar 20, 2014

Did some tests with current master.

With psutil<2.0.0 everything works fine here.
With psutil==2.0.0 I get 21 errors like these:

tornado.application: ERROR: Exception in callback None
Traceback (most recent call last):
  File "mozilla/circus/local/lib/python2.7/site-packages/tornado-3.2-py2.7-linux-x86_64.egg/tornado/ioloop.py", line 688, in start
    self._handlers[fd](fd, events)
KeyError: 1819044218
tornado.application: ERROR: Exception in callback None
Traceback (most recent call last):
  File "mozilla/circus/local/lib/python2.7/site-packages/tornado-3.2-py2.7-linux-x86_64.egg/tornado/ioloop.py", line 688, in start
    self._handlers[fd](fd, events)
KeyError: 1441

@tarekziade added a commit that referenced this issue Mar 20, 2014
more fixing for #737 de65a18

@tarekziade
Collaborator

we're looking way better - I guess I can close this one for now. We should track any failure and fix it immediately under Travis

@tarekziade closed this Mar 22, 2014

@tarekziade added a commit that referenced this issue Mar 24, 2014
more cleanup - refs #737 1a4953e