OSError: [Errno 24] Too many open files after running the UI Server for a long time #44
I am getting this for as few as 10 files open at once.
Hi @pdeitel, are you using
I had a quick read of the linked issue when I created this one, and from what I understood, it appeared to me that we are suffering from it here in
I think one of the linked issues had some comments about the need to call
Sorry, I actually commented on the wrong issue's tab. I am having this issue in Jupyter notebooks.
I've just run into this one, after having the server up for ~1.5 hours. For one, I get the exact same traceback reported by @kinow in #44 (comment) (except with different file paths, of course), and it was spamming my terminal so fast I could not copy it until I killed the relevant processes to stop the output. For the record, amongst those error messages I also get the following, which might be helpful to anyone who tackles this, as it points to the specific Cylc modules involved:

ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f17f6b15470>>, <Task finished coro=<WorkflowsManager.gather_workflows() done, defined at /net/home/h06/sbarth/cylc-uiserver/workflows_mgr.py:83> exception=ZMQError('Too many open files')>)
Traceback (most recent call last):
File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/net/home/h06/sbarth/cylc-uiserver/workflows_mgr.py", line 94, in gather_workflows
items = await asyncio.gather(*gathers)
File "/net/home/h06/sbarth/cylc-uiserver/workflows_mgr.py", line 69, in est_workflow
client = SuiteRuntimeClient(reg, host=host, port=port, timeout=timeout)
File "/net/home/h06/sbarth/cylc.git/cylc/flow/network/client.py", line 254, in __init__
timeout_handler=partial(self._timeout_handler, suite, host, port)
File "/net/home/h06/sbarth/cylc.git/cylc/flow/network/client.py", line 104, in __init__
self.socket = CONTEXT.socket(zmq.REQ)
File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/zmq/sugar/context.py", line 146, in socket
s = self._socket_class(self, socket_type, **kwargs)
File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/zmq/_future.py", line 134, in __init__
super(_AsyncSocket, self).__init__(context, socket_type, **kwargs)
File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/zmq/sugar/socket.py", line 59, in __init__
super(Socket, self).__init__(*a, **kw)
File "zmq/backend/cython/socket.pyx", line 328, in zmq.backend.cython.socket.Socket.__init__
zmq.error.ZMQError: Too many open files

This, when occurring, had a fatal impact on the suite listing page, which said 'No data available' despite the fact I had suites running. If I killed one of my suite
Exactly the same issue, @sadielbartholomew. Now, one interesting thing occurred to me just now... I've been running the proxy+jupyterhub the whole day (1 pm now), and haven't seen that problem. One thing I am doing differently now is that I have a
Actually, I have looked at the tree view only once or twice today, and spent most of my day looking at the table with the list of suites. An interesting thing about the list of suites (Suites View) is that it contains a GraphQL query to the backend that is executed every 5 seconds. @sadielbartholomew, did you have a Cylc suite running by any chance before the error occurred?
Hi @kinow, interesting! Yes, I did have a suite running (just a very minimal one with a task that sleeps for effectively an infinite amount of time) alongside a suite which was held on the command line ( That seems to fit with the fact that when I killed one suite process (I can't remember which of the two, sadly, but I can always check again) the errors immediately stopped emerging, though only for a bit. As a further note, it would be very helpful if we could soon find a workaround or further clues on how to minimise the frequency of this problem, as it is quite distracting for development (though obviously we want to fix this properly before users get anywhere near the new UI!). So I will make notes any time I spot something about the context and frequency in which these errors crop up. If I notice anything, I'll report here too.
Agreed. I am getting used to restarting the jupyterhub process, which is really bad.
From the traceback, it looks like we are:
This was fixed in
Not sure what the WS client does with new commands/requests; need to have a closer look...
Although the scan instantiates a client and then uses it, that's no different from the CLI...
I think this has been happening on
Yes, there's a leak somewhere... My point was, we are reusing the client, and using it in the same way as the CLI... However, we need to look closer, because we're definitely not manually closing the connection in the UIS (just deleting the client objects and letting Python drop them).
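The distinction between explicitly closing a connection and just deleting the client object can be sketched with plain stdlib sockets standing in for the ZMQ ones (an illustration, not Cylc code): close() frees the file descriptor immediately, whereas a merely dereferenced object keeps its descriptor until the garbage collector runs, which a reference cycle can delay indefinitely.

```python
import gc
import socket

# Explicitly closing a socket releases its file descriptor
# immediately and deterministically.
s = socket.socket()
assert s.fileno() >= 0    # an OS file descriptor is held
s.close()
assert s.fileno() == -1   # released as soon as close() returns

# Merely dropping the reference delegates cleanup to the garbage
# collector; a reference cycle can postpone it indefinitely.
s = socket.socket()
holder = {"sock": s}
holder["cycle"] = holder  # cycle keeps the dict (and socket) alive
del s, holder             # no names left, but the cycle survives
gc.collect()              # only now is the socket reclaimed
```

If descriptors are being created faster than CPython's collector reclaims them, the process hits the file-descriptor limit even though no object is "leaked" in the strict sense, which is consistent with the symptoms above.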
Quick experiment, with a poor man's monitoring approach. I started Cylc 8 (UI + UI server + etc.), and then looked at the number of open files:

kinow@kinow-VirtualBox:~$ while true; do date | tr '\n' ' ' && lsof -a -p $(ps aux | grep [c]ylc-singleuser | awk '{ print $2 }') | wc -l && sleep 5; done
Tue Jul 9 00:53:54 NZST 2019 91
Tue Jul 9 00:53:59 NZST 2019 91
Tue Jul 9 00:54:04 NZST 2019 91
Tue Jul 9 00:54:09 NZST 2019 91
Tue Jul 9 00:54:14 NZST 2019 91

Now notice what happens when I run

Tue Jul 9 00:54:14 NZST 2019 91
Tue Jul 9 00:54:19 NZST 2019 91
Tue Jul 9 00:54:24 NZST 2019 100
Tue Jul 9 00:54:30 NZST 2019 99
Tue Jul 9 00:54:35 NZST 2019 100
Tue Jul 9 00:54:40 NZST 2019 99
Tue Jul 9 00:54:45 NZST 2019 99
Tue Jul 9 00:54:50 NZST 2019 99
Tue Jul 9 00:54:56 NZST 2019 99

If later I start the suite again, it goes up to 102, 105, 108, etc. Even without touching anything, the number of used file descriptors keeps increasing.

Tue Jul 9 01:04:09 NZST 2019 99
Tue Jul 9 01:04:14 NZST 2019 99
Tue Jul 9 01:04:20 NZST 2019 99
Tue Jul 9 01:04:25 NZST 2019 99
Tue Jul 9 01:04:30 NZST 2019 99
Tue Jul 9 01:04:35 NZST 2019 102
Tue Jul 9 01:04:40 NZST 2019 102
Tue Jul 9 01:04:45 NZST 2019 102
Tue Jul 9 01:04:51 NZST 2019 105
Tue Jul 9 01:04:56 NZST 2019 105
Tue Jul 9 01:05:01 NZST 2019 105
Tue Jul 9 01:05:06 NZST 2019 105
Tue Jul 9 01:05:11 NZST 2019 108
Tue Jul 9 01:05:16 NZST 2019 108
Tue Jul 9 01:05:21 NZST 2019 108
Tue Jul 9 01:05:26 NZST 2019 108
Tue Jul 9 01:05:32 NZST 2019 111
Tue Jul 9 01:05:37 NZST 2019 111
Tue Jul 9 01:05:42 NZST 2019 111
Tue Jul 9 01:05:47 NZST 2019 111
Tue Jul 9 01:05:52 NZST 2019 114
Tue Jul 9 01:05:57 NZST 2019 114
Tue Jul 9 01:06:03 NZST 2019 114
Tue Jul 9 01:06:08 NZST 2019 117
Tue Jul 9 01:06:13 NZST 2019 117
Tue Jul 9 01:06:18 NZST 2019 117
Tue Jul 9 01:06:23 NZST 2019 117
Tue Jul 9 01:06:28 NZST 2019 120
Tue Jul 9 01:06:34 NZST 2019 120
Tue Jul 9 01:06:39 NZST 2019 120
Tue Jul 9 01:06:44 NZST 2019 120
Tue Jul 9 01:06:49 NZST 2019 123
Tue Jul 9 01:06:54 NZST 2019 123
Tue Jul 9 01:06:59 NZST 2019 123
Tue Jul 9 01:07:05 NZST 2019 126
Tue Jul 9 01:07:10 NZST 2019 126
Tue Jul 9 01:07:15 NZST 2019 126
Tue Jul 9 01:07:20 NZST 2019 126
Tue Jul 9 01:07:25 NZST 2019 129
Tue Jul 9 01:07:30 NZST 2019 129

Then closing the suite stabilizes the number again.
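The same counting can also be done from inside a Python process on Linux by listing /proc/self/fd, which avoids the lsof/ps pipeline. This is a standalone sketch (the open_fd_count helper is made up, not part of Cylc) showing how unclosed sockets drive the count up exactly as lsof showed:

```python
import os
import socket

def open_fd_count() -> int:
    """Count this process's open file descriptors (Linux-only)."""
    return len(os.listdir("/proc/self/fd"))

baseline = open_fd_count()

# Opening sockets without closing them drives the count up,
# mirroring what lsof reported for the UI server process.
leaked = [socket.socket() for _ in range(10)]
assert open_fd_count() >= baseline + 10

# Closing them brings the count back to (roughly) the baseline,
# just as shutting the suite down stabilized the lsof numbers.
for s in leaked:
    s.close()
assert open_fd_count() <= baseline + 1
```

Dropping a call to such a counter into the suspected code path (e.g. around each gather cycle) would show exactly which request leaks descriptors.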
@kinow: Great 👍 From the 5-second interval, we've narrowed it down to the data store request (with the data manager)... If we can reproduce it in the CLI, then we may need to look at flow/network/client.py or something.
Ah, excellent detective work, @kinow! That's a really nice systematic approach to see what is going on. (I've noted to myself to explore the
I was running a Cylc suite from the old dev examples (cylc run --no-detach --verbose --debug --hold complex), with cylc-uiserver (i.e. jupyterhub + spawned cylc_singleuser) since around 9:30 AM. Then I started looking at Vue.JS code, had lunch, and came back to find my console full of:
It appears to be a known issue, reported against jupyter/notebook: jupyter/notebook#3748, with a sibling issue in pyzmq: zeromq/pyzmq#1170.