OSError: [Errno 24] Too many open files after running the UI Server for a long time #44

Closed
kinow opened this issue May 24, 2019 · 15 comments · Fixed by #57

Comments

kinow (Member) commented May 24, 2019

I had been running a Cylc suite from the old dev examples (cylc run --no-detach --verbose --debug --hold complex), with cylc-uiserver (i.e. jupyterhub + a spawned cylc_singleuser), since around 9:30 AM.

Then I started looking at Vue.js code, had lunch, and came back to find my console full of:

ERROR:asyncio:Exception in callback BaseAsyncIOLoop._handle_events(11, 1)
handle: <Handle BaseAsyncIOLoop._handle_events(11, 1)>
Traceback (most recent call last):
  File "/home/kinow/Development/python/anaconda3/lib/python3.7/asyncio/events.py", line 88, in _run
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 138, in _handle_events
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/tornado/netutil.py", line 260, in accept_handler
  File "/home/kinow/Development/python/anaconda3/lib/python3.7/socket.py", line 212, in accept
OSError: [Errno 24] Too many open files

It appears to be a known issue, reported against jupyter/notebook (jupyter/notebook#3748), with a sibling issue in pyzmq (zeromq/pyzmq#1170).

kinow added the bug label May 24, 2019
pdeitel commented Jun 1, 2019

I am getting this for as few as 10 files open at once.

kinow (Member, Author) commented Jun 1, 2019

Hi @pdeitel

Are you using cylc-uiserver, or do you have another separate project with pyzmq?

I had a quick read of the linked issues when I created this one, and from what I understood, it appeared that we are suffering from it here in cylc-uiserver - my guess is that we are not closing the sockets, and so end up with multiple sockets open... maybe in multiple co-routines started by Tornado.

I think one of the linked issues had some comments about the need to call del or .close() somewhere to make sure the file descriptors were freed. But if you have a simpler example I can try your code and see if I can reproduce it in my environment too.
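
For reference, the close-the-socket pattern described in those issues looks roughly like the sketch below (this is not cylc-flow's actual client code; the endpoint and helper name are made up for illustration):

import zmq

CONTEXT = zmq.Context()

def request(endpoint, message, timeout_ms=5000):
    # Hypothetical helper: open a REQ socket, send one request, and always
    # close the socket so its file descriptor is released immediately,
    # rather than waiting for garbage collection.
    socket = CONTEXT.socket(zmq.REQ)
    try:
        socket.connect(endpoint)
        socket.send_string(message)
        if socket.poll(timeout_ms):
            return socket.recv_string()
        return None
    finally:
        socket.close()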

pdeitel commented Jun 1, 2019

Sorry, I actually commented on the wrong project's issues tab. I am having this issue in Jupyter notebooks.

sadielbartholomew (Collaborator) commented Jul 6, 2019

I've just run into this one, after having the server up for ~1.5 hours.

For one, I get the exact same traceback reported by @kinow in #44 (comment) (except with different file paths, of course), & it was spamming my terminal so fast I could not copy it until I killed the relevant processes to stop the output.

For the record, amongst those error messages I also get the following, which might be helpful to anyone who tackles this, as it points to the specific Cylc modules involved:

ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f17f6b15470>>, <Task finished coro=<WorkflowsManager.gather_workflows() done, defined at /net/home/h06/sbarth/cylc-uiserver/workflows_mgr.py:83> exception=ZMQError('Too many open files')>)
Traceback (most recent call last):
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/net/home/h06/sbarth/cylc-uiserver/workflows_mgr.py", line 94, in gather_workflows
    items = await asyncio.gather(*gathers)
  File "/net/home/h06/sbarth/cylc-uiserver/workflows_mgr.py", line 69, in est_workflow
    client = SuiteRuntimeClient(reg, host=host, port=port, timeout=timeout)
  File "/net/home/h06/sbarth/cylc.git/cylc/flow/network/client.py", line 254, in __init__
    timeout_handler=partial(self._timeout_handler, suite, host, port)
  File "/net/home/h06/sbarth/cylc.git/cylc/flow/network/client.py", line 104, in __init__
    self.socket = CONTEXT.socket(zmq.REQ)
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/zmq/sugar/context.py", line 146, in socket
    s = self._socket_class(self, socket_type, **kwargs)
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/zmq/_future.py", line 134, in __init__
    super(_AsyncSocket, self).__init__(context, socket_type, **kwargs)
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/zmq/sugar/socket.py", line 59, in __init__
    super(Socket, self).__init__(*a, **kw)
  File "zmq/backend/cython/socket.pyx", line 328, in zmq.backend.cython.socket.Socket.__init__
zmq.error.ZMQError: Too many open files

This, when occurring, had a fatal impact on the suite listing page, which said 'No data available' despite the fact I had suites running:

[screenshot: the suite listing (Suites View) showing 'No data available']

If I killed one of my suite cylc run processes, the error would cease for a bit, & the listing would come back, but eventually I would hit it again.

kinow (Member, Author) commented Jul 8, 2019

Exactly the same issue, @sadielbartholomew.

One interesting thing occurred to me just now... I've been running the proxy + jupyterhub the whole day (it's 1pm now), and haven't seen that problem.

One thing I am doing differently now is that I started the suite with a cylc run --hold ... command and haven't released it, as the issues I am working on don't need the tree view.

Actually, I have only looked at the tree view once or twice today, and spent most of my day looking at the table with the list of suites. An interesting thing about the list of suites (Suites View) is that it contains a GraphQL query to the backend, which is executed every 5 seconds.

@sadielbartholomew did you have a cylc suite running by any chance before the error occurred?

sadielbartholomew (Collaborator) commented Jul 8, 2019

Hi @kinow, interesting! Yes, I did have a suite running (just a very minimal one which has a task which sleeps for effectively an infinite amount of time) alongside a suite which was held on the command line (cylc run <suite> --hold as you also did).

That seems to fit with the fact that when I killed one suite process (I can't remember which of the two, sadly, but I can always check again) the errors immediately stopped emerging, though only for a bit.

As a further note, it would be very helpful if we could soon find a workaround, or further clues on how to minimise the frequency of this problem, as it is quite distracting for development (though obviously we want to fix this properly before users get anywhere near the new UI!). So I will make notes on the context & frequency with which these errors crop up, and if I notice anything, I'll report it here too.

kinow (Member, Author) commented Jul 8, 2019

Agreed. I am getting used to restarting the jupyterhub process, which is really bad.

matthewrmshin (Contributor) commented:

From the traceback, it looks like we are:

  • Opening new UIS<->WFS connections without closing the old ones; and/or
  • Not re-using the existing UIS<->WFS zeromq connection?

dwsutherland (Member) commented Jul 8, 2019

From the traceback, it looks like we are:

  • Opening new UIS<->WFS connections without closing the old ones; and/or
  • Not re-using the existing UIS<->WFS zeromq connection?

This was fixed in #53.
It reuses the existing ZeroMQ client instance; the client is just deleted if the WFS isn't in the new scan.
(BTW - the Client is imported from cylc-flow.)

Not sure what the WFS client does with new commands/requests; need to have a closer look...

dwsutherland (Member) commented:

Although the scan instantiates a client and then uses it, that's no different from the CLI...

kinow (Member, Author) commented Jul 8, 2019

I think this has been happening on master for cylc-flow & cylc-uiserver combo, which should include the changes from #53. I feel like @matthewrmshin's suggestion is close to what's going on. We probably have some connection/socket leaking somewhere.

dwsutherland (Member) commented Jul 8, 2019

I think this has been happening on master for cylc-flow & cylc-uiserver combo, which should include the changes from #53. I feel like @matthewrmshin's suggestion is close to what's going on. We probably have some connection/socket leaking somewhere.

Yes, there's a leak somewhere... My point was that we are reusing the client, and using it in the same way as the CLI... However, we need to look closer, because we're definitely not manually closing the connection in the UIS (we just delete the client objects and let Python drop them).
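
The distinction being discussed can be illustrated with a small sketch (hypothetical names and endpoint, not the UIS code): simply dropping the client object leaves its file descriptor to be released only when Python actually collects the object (which may be delayed if anything else still references it), whereas closing the socket releases it straight away.

import zmq

class WorkflowClient:
    # Hypothetical stand-in for the suite runtime client discussed above.
    def __init__(self, context, endpoint):
        self.socket = context.socket(zmq.REQ)
        self.socket.connect(endpoint)

    def close(self):
        # Release the underlying file descriptor explicitly.
        self.socket.close()

context = zmq.Context()
# Hypothetical endpoint; a REQ socket may connect before the server exists.
active = {'five': WorkflowClient(context, 'tcp://localhost:43001')}

# Relying on Python dropping the object (as described above):
# del active['five']        # FD freed only once the object is collected

# Closing explicitly when a workflow disappears from the scan:
client = active.pop('five')
client.close()              # FD freed immediately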

kinow (Member, Author) commented Jul 9, 2019

Quick experiment, with a poor-man's monitoring approach.

Started Cylc 8 (ui+uiserver+etc), and then looked at the number of open files:

kinow@kinow-VirtualBox:~$ while true; do date | tr '\n' ' ' && lsof -a -p $(ps aux  | grep [c]ylc-singleuser | awk '{ print $2 }') | wc -l && sleep 5; done
Tue Jul  9 00:53:54 NZST 2019 91
Tue Jul  9 00:53:59 NZST 2019 91
Tue Jul  9 00:54:04 NZST 2019 91
Tue Jul  9 00:54:09 NZST 2019 91
Tue Jul  9 00:54:14 NZST 2019 91

Now notice what happens when I run cylc run --no-detach --hold five for just a few seconds before stopping it.

Tue Jul  9 00:54:14 NZST 2019 91
Tue Jul  9 00:54:19 NZST 2019 91
Tue Jul  9 00:54:24 NZST 2019 100
Tue Jul  9 00:54:30 NZST 2019 99
Tue Jul  9 00:54:35 NZST 2019 100
Tue Jul  9 00:54:40 NZST 2019 99
Tue Jul  9 00:54:45 NZST 2019 99
Tue Jul  9 00:54:50 NZST 2019 99
Tue Jul  9 00:54:56 NZST 2019 99

If I later start the suite again, it goes up to 102, 105, 108, etc. Even without my touching anything, the number of used file descriptors keeps increasing.

Tue Jul  9 01:04:09 NZST 2019 99
Tue Jul  9 01:04:14 NZST 2019 99
Tue Jul  9 01:04:20 NZST 2019 99
Tue Jul  9 01:04:25 NZST 2019 99
Tue Jul  9 01:04:30 NZST 2019 99
Tue Jul  9 01:04:35 NZST 2019 102
Tue Jul  9 01:04:40 NZST 2019 102
Tue Jul  9 01:04:45 NZST 2019 102
Tue Jul  9 01:04:51 NZST 2019 105
Tue Jul  9 01:04:56 NZST 2019 105
Tue Jul  9 01:05:01 NZST 2019 105
Tue Jul  9 01:05:06 NZST 2019 105
Tue Jul  9 01:05:11 NZST 2019 108
Tue Jul  9 01:05:16 NZST 2019 108
Tue Jul  9 01:05:21 NZST 2019 108
Tue Jul  9 01:05:26 NZST 2019 108
Tue Jul  9 01:05:32 NZST 2019 111
Tue Jul  9 01:05:37 NZST 2019 111
Tue Jul  9 01:05:42 NZST 2019 111
Tue Jul  9 01:05:47 NZST 2019 111
Tue Jul  9 01:05:52 NZST 2019 114
Tue Jul  9 01:05:57 NZST 2019 114
Tue Jul  9 01:06:03 NZST 2019 114
Tue Jul  9 01:06:08 NZST 2019 117
Tue Jul  9 01:06:13 NZST 2019 117
Tue Jul  9 01:06:18 NZST 2019 117
Tue Jul  9 01:06:23 NZST 2019 117
Tue Jul  9 01:06:28 NZST 2019 120
Tue Jul  9 01:06:34 NZST 2019 120
Tue Jul  9 01:06:39 NZST 2019 120
Tue Jul  9 01:06:44 NZST 2019 120
Tue Jul  9 01:06:49 NZST 2019 123
Tue Jul  9 01:06:54 NZST 2019 123
Tue Jul  9 01:06:59 NZST 2019 123
Tue Jul  9 01:07:05 NZST 2019 126
Tue Jul  9 01:07:10 NZST 2019 126
Tue Jul  9 01:07:15 NZST 2019 126
Tue Jul  9 01:07:20 NZST 2019 126
Tue Jul  9 01:07:25 NZST 2019 129
Tue Jul  9 01:07:30 NZST 2019 129

Then closing the suite stabilizes the number again.
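
For anyone without lsof to hand, a rough Python equivalent of that monitoring loop is sketched below (assuming a Linux /proc filesystem; the PID is a placeholder for the cylc-singleuser process ID reported by ps):

import os
import time

def count_open_fds(pid):
    # Count the file descriptors currently held by a process (Linux /proc only).
    return len(os.listdir(f'/proc/{pid}/fd'))

PID = 12345  # placeholder: substitute the cylc-singleuser PID

while True:
    print(time.strftime('%c'), count_open_fds(PID))
    time.sleep(5)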

dwsutherland (Member) commented Jul 10, 2019

@kinow : Great 👍

From the 5-second interval, we've narrowed it down to the data store request (with the data manager)... If we can reproduce it in the CLI, then we may need to look at flow/network/client.py or something.
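
One way to try reproducing this outside the UIS might be to instantiate the client in a loop and watch this process's file-descriptor count (a rough sketch only, assuming the SuiteRuntimeClient constructor shown in the traceback earlier; the suite name, host, port and timeout are placeholders for a running suite's values):

import os

from cylc.flow.network.client import SuiteRuntimeClient

def fd_count():
    # File descriptors currently held by this process (Linux /proc only).
    return len(os.listdir(f'/proc/{os.getpid()}/fd'))

# Placeholder values: use the registration, host and port of a suite that is running.
for i in range(50):
    client = SuiteRuntimeClient('five', host='localhost', port=43001, timeout=5)
    print(i, fd_count())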

sadielbartholomew (Collaborator) commented:

Ah, excellent detective work @kinow! That's a really nice systematic approach to see what is going on. (I've noted to myself to explore the lsof command, which I haven't used directly on the command line before, at least.)

kinow added this to the 0.1 milestone Sep 10, 2019