OSError: [Errno 24] Too many open files after running the UI Server for a long time #44

Closed
kinow opened this issue May 24, 2019 · 15 comments · Fixed by #57

Comments

kinow (Member) commented May 24, 2019

I had been running a Cylc suite from the old dev examples (cylc run --no-detach --verbose --debug --hold complex), with cylc-uiserver (i.e. jupyterhub + a spawned cylc_singleuser), since around 9:30 AM.

Then I started looking at Vue.js code, had lunch, and came back to find my console full of:

ERROR:asyncio:Exception in callback BaseAsyncIOLoop._handle_events(11, 1)
handle: <Handle BaseAsyncIOLoop._handle_events(11, 1)>
Traceback (most recent call last):
  File "/home/kinow/Development/python/anaconda3/lib/python3.7/asyncio/events.py", line 88, in _run
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 138, in _handle_events
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/tornado/netutil.py", line 260, in accept_handler
  File "/home/kinow/Development/python/anaconda3/lib/python3.7/socket.py", line 212, in accept
OSError: [Errno 24] Too many open files

It appears to be a known issue, reported against jupyter/notebook (jupyter/notebook#3748), with a sibling issue in pyzmq (zeromq/pyzmq#1170).

kinow added the bug label May 24, 2019
pdeitel commented Jun 1, 2019

I am getting this for as few as 10 files open at once.

kinow (Member, Author) commented Jun 1, 2019

Hi @pdeitel

Are you using cylc-uiserver, or do you have another separate project with pyzmq?

I had a quick read of the linked issues when I created this one, and from what I understood, it appeared that we are suffering from it here in cylc-uiserver - my guess is that we are not closing the sockets, and so end up with multiple sockets open... maybe in multiple co-routines started by Tornado.

I think one of the linked issues had some comments about the need to call del or .close() somewhere to make sure the file descriptors were freed. But if you have a simpler example I can try your code and see if I can reproduce it in my environment too.
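
For reference, the close-the-socket pattern described in those issues looks roughly like the sketch below (this is not cylc-flow's actual client code; the endpoint and helper name are made up for illustration):

import zmq

CONTEXT = zmq.Context()

def request(endpoint, message, timeout_ms=5000):
    # Hypothetical helper: open a REQ socket, send one request, and always
    # close the socket so its file descriptor is released immediately,
    # rather than waiting for garbage collection.
    socket = CONTEXT.socket(zmq.REQ)
    try:
        socket.connect(endpoint)
        socket.send_string(message)
        if socket.poll(timeout_ms):
            return socket.recv_string()
        return None
    finally:
        socket.close()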

pdeitel commented Jun 1, 2019

Sorry, I actually commented on the wrong project's issues tab. I am having this issue in Jupyter notebooks.

sadielbartholomew (Collaborator) commented Jul 6, 2019

I've just run into this one, after having the server up for ~1.5 hours.

For one, I get the exact same traceback reported by @kinow in #44 (comment) (except with different file paths, of course), & it was spamming my terminal so fast I could not copy it until I killed the relevant processes to stop the output.

For the record, amongst those error messages I also get the following, which might be helpful to anyone who tackles this, as it points to the specific Cylc modules involved:

ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f17f6b15470>>, <Task finished coro=<WorkflowsManager.gather_workflows() done, defined at /net/home/h06/sbarth/cylc-uiserver/workflows_mgr.py:83> exception=ZMQError('Too many open files')>)
Traceback (most recent call last):
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/net/home/h06/sbarth/cylc-uiserver/workflows_mgr.py", line 94, in gather_workflows
    items = await asyncio.gather(*gathers)
  File "/net/home/h06/sbarth/cylc-uiserver/workflows_mgr.py", line 69, in est_workflow
    client = SuiteRuntimeClient(reg, host=host, port=port, timeout=timeout)
  File "/net/home/h06/sbarth/cylc.git/cylc/flow/network/client.py", line 254, in __init__
    timeout_handler=partial(self._timeout_handler, suite, host, port)
  File "/net/home/h06/sbarth/cylc.git/cylc/flow/network/client.py", line 104, in __init__
    self.socket = CONTEXT.socket(zmq.REQ)
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/zmq/sugar/context.py", line 146, in socket
    s = self._socket_class(self, socket_type, **kwargs)
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/zmq/_future.py", line 134, in __init__
    super(_AsyncSocket, self).__init__(context, socket_type, **kwargs)
  File "/home/h06/sbarth/miniconda3/envs/py3env/lib/python3.7/site-packages/zmq/sugar/socket.py", line 59, in __init__
    super(Socket, self).__init__(*a, **kw)
  File "zmq/backend/cython/socket.pyx", line 328, in zmq.backend.cython.socket.Socket.__init__
zmq.error.ZMQError: Too many open files

This, when occurring, had a fatal impact on the suite listing page, which said 'No data available' despite the fact I had suites running:

[screenshot: the suite listing (Suites View) showing 'No data available']

If I killed one of my suite cylc run processes, the error would cease for a bit, & the listing would come back, but eventually I would hit it again.

kinow (Member, Author) commented Jul 8, 2019

Exactly the same issue, @sadielbartholomew.

One interesting thing occurred to me just now... I've been running the proxy + jupyterhub the whole day (it's 1pm now), and haven't seen that problem.

One thing I am doing differently now is that I started the suite with a cylc run --hold ... command and haven't released it, as the issues I am working on don't need the tree view.

Actually, I have only looked at the tree view once or twice today, and spent most of my day looking at the table with the list of suites. An interesting thing about the list of suites (Suites View) is that it contains a GraphQL query to the backend, which is executed every 5 seconds.

@sadielbartholomew did you have a cylc suite running by any chance before the error occurred?

sadielbartholomew (Collaborator) commented Jul 8, 2019

Hi @kinow, interesting! Yes, I did have a suite running (just a very minimal one which has a task which sleeps for effectively an infinite amount of time) alongside a suite which was held on the command line (cylc run <suite> --hold as you also did).

That seems to fit with the fact that when I killed one suite process (I can't remember which of the two, sadly, but I can always check again) the errors immediately stopped emerging, though only for a bit.

As a further note, it would be very helpful if we could soon find a workaround, or further clues on how to minimise the frequency of this problem, as it is quite distracting for development (though obviously we want to fix this properly before users get anywhere near the new UI!). So I will make notes on the context & frequency with which these errors crop up, and if I notice anything, I'll report it here too.

kinow (Member, Author) commented Jul 8, 2019

Agreed. I am getting used to restarting the jupyterhub process, which is really bad.

matthewrmshin (Contributor) commented:

From the traceback, it looks like we are:

  • Opening new UIS<->WFS connections without closing the old ones; and/or
  • Not re-using the existing UIS<->WFS zeromq connection?

dwsutherland (Member) commented Jul 8, 2019

From the traceback, it looks like we are:

  • Opening new UIS<->WFS connections without closing the old ones; and/or
  • Not re-using the existing UIS<->WFS zeromq connection?

This was fixed in #53.
It reuses the existing ZeroMQ client instance; the client is just deleted if the WFS isn't in the new scan.
(BTW - the Client is imported from cylc-flow.)

Not sure what the WFS client does with new commands/requests; need to have a closer look...

dwsutherland (Member) commented:

Although the scan instantiates a client and then uses it, that's no different from the CLI...

kinow (Member, Author) commented Jul 8, 2019

I think this has been happening on master for cylc-flow & cylc-uiserver combo, which should include the changes from #53. I feel like @matthewrmshin's suggestion is close to what's going on. We probably have some connection/socket leaking somewhere.

dwsutherland (Member) commented Jul 8, 2019

I think this has been happening on master for cylc-flow & cylc-uiserver combo, which should include the changes from #53. I feel like @matthewrmshin's suggestion is close to what's going on. We probably have some connection/socket leaking somewhere.

Yes, there's a leak somewhere... My point was that we are reusing the client, and using it in the same way as the CLI... However, we need to look closer, because we're definitely not manually closing the connection in the UIS (we just delete the client objects and let Python drop them).
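
The distinction being discussed can be illustrated with a small sketch (hypothetical names and endpoint, not the UIS code): simply dropping the client object leaves its file descriptor to be released only when Python actually collects the object (which may be delayed if anything else still references it), whereas closing the socket releases it straight away.

import zmq

class WorkflowClient:
    # Hypothetical stand-in for the suite runtime client discussed above.
    def __init__(self, context, endpoint):
        self.socket = context.socket(zmq.REQ)
        self.socket.connect(endpoint)

    def close(self):
        # Release the underlying file descriptor explicitly.
        self.socket.close()

context = zmq.Context()
# Hypothetical endpoint; a REQ socket may connect before the server exists.
active = {'five': WorkflowClient(context, 'tcp://localhost:43001')}

# Relying on Python dropping the object (as described above):
# del active['five']        # FD freed only once the object is collected

# Closing explicitly when a workflow disappears from the scan:
client = active.pop('five')
client.close()              # FD freed immediately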

kinow (Member, Author) commented Jul 9, 2019

Quick experiment, with a poor-man's monitoring approach.

Started Cylc 8 (ui+uiserver+etc), and then looked at the number of open files:

kinow@kinow-VirtualBox:~$ while true; do date | tr '\n' ' ' && lsof -a -p $(ps aux  | grep [c]ylc-singleuser | awk '{ print $2 }') | wc -l && sleep 5; done
Tue Jul  9 00:53:54 NZST 2019 91
Tue Jul  9 00:53:59 NZST 2019 91
Tue Jul  9 00:54:04 NZST 2019 91
Tue Jul  9 00:54:09 NZST 2019 91
Tue Jul  9 00:54:14 NZST 2019 91

Now notice what happens when I run cylc run --no-detach --hold five for just a few seconds before stopping it.

Tue Jul  9 00:54:14 NZST 2019 91
Tue Jul  9 00:54:19 NZST 2019 91
Tue Jul  9 00:54:24 NZST 2019 100
Tue Jul  9 00:54:30 NZST 2019 99
Tue Jul  9 00:54:35 NZST 2019 100
Tue Jul  9 00:54:40 NZST 2019 99
Tue Jul  9 00:54:45 NZST 2019 99
Tue Jul  9 00:54:50 NZST 2019 99
Tue Jul  9 00:54:56 NZST 2019 99

If I later start the suite again, it goes up to 102, 105, 108, etc. Even without my touching anything, the number of used file descriptors keeps increasing.

Tue Jul  9 01:04:09 NZST 2019 99
Tue Jul  9 01:04:14 NZST 2019 99
Tue Jul  9 01:04:20 NZST 2019 99
Tue Jul  9 01:04:25 NZST 2019 99
Tue Jul  9 01:04:30 NZST 2019 99
Tue Jul  9 01:04:35 NZST 2019 102
Tue Jul  9 01:04:40 NZST 2019 102
Tue Jul  9 01:04:45 NZST 2019 102
Tue Jul  9 01:04:51 NZST 2019 105
Tue Jul  9 01:04:56 NZST 2019 105
Tue Jul  9 01:05:01 NZST 2019 105
Tue Jul  9 01:05:06 NZST 2019 105
Tue Jul  9 01:05:11 NZST 2019 108
Tue Jul  9 01:05:16 NZST 2019 108
Tue Jul  9 01:05:21 NZST 2019 108
Tue Jul  9 01:05:26 NZST 2019 108
Tue Jul  9 01:05:32 NZST 2019 111
Tue Jul  9 01:05:37 NZST 2019 111
Tue Jul  9 01:05:42 NZST 2019 111
Tue Jul  9 01:05:47 NZST 2019 111
Tue Jul  9 01:05:52 NZST 2019 114
Tue Jul  9 01:05:57 NZST 2019 114
Tue Jul  9 01:06:03 NZST 2019 114
Tue Jul  9 01:06:08 NZST 2019 117
Tue Jul  9 01:06:13 NZST 2019 117
Tue Jul  9 01:06:18 NZST 2019 117
Tue Jul  9 01:06:23 NZST 2019 117
Tue Jul  9 01:06:28 NZST 2019 120
Tue Jul  9 01:06:34 NZST 2019 120
Tue Jul  9 01:06:39 NZST 2019 120
Tue Jul  9 01:06:44 NZST 2019 120
Tue Jul  9 01:06:49 NZST 2019 123
Tue Jul  9 01:06:54 NZST 2019 123
Tue Jul  9 01:06:59 NZST 2019 123
Tue Jul  9 01:07:05 NZST 2019 126
Tue Jul  9 01:07:10 NZST 2019 126
Tue Jul  9 01:07:15 NZST 2019 126
Tue Jul  9 01:07:20 NZST 2019 126
Tue Jul  9 01:07:25 NZST 2019 129
Tue Jul  9 01:07:30 NZST 2019 129

Then closing the suite stabilizes the number again.
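
For anyone without lsof to hand, a rough Python equivalent of that monitoring loop is sketched below (assuming a Linux /proc filesystem; the PID is a placeholder for the cylc-singleuser process ID reported by ps):

import os
import time

def count_open_fds(pid):
    # Count the file descriptors currently held by a process (Linux /proc only).
    return len(os.listdir(f'/proc/{pid}/fd'))

PID = 12345  # placeholder: substitute the cylc-singleuser PID

while True:
    print(time.strftime('%c'), count_open_fds(PID))
    time.sleep(5)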

dwsutherland (Member) commented Jul 10, 2019

@kinow : Great 👍

From the 5-second interval, we've narrowed it down to the data store request (with the data manager)... If we can reproduce it in the CLI, then we may need to look at flow/network/client.py or something.
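
One way to try reproducing this outside the UIS might be to instantiate the client in a loop and watch this process's file-descriptor count (a rough sketch only, assuming the SuiteRuntimeClient constructor shown in the traceback earlier; the suite name, host, port and timeout are placeholders for a running suite's values):

import os

from cylc.flow.network.client import SuiteRuntimeClient

def fd_count():
    # File descriptors currently held by this process (Linux /proc only).
    return len(os.listdir(f'/proc/{os.getpid()}/fd'))

# Placeholder values: use the registration, host and port of a suite that is running.
for i in range(50):
    client = SuiteRuntimeClient('five', host='localhost', port=43001, timeout=5)
    print(i, fd_count())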

sadielbartholomew (Collaborator) commented:

Ah, excellent detective work @kinow! That's a really nice systematic approach to see what is going on. (I've noted to myself to explore the lsof command, which I haven't used directly on the command line before, at least.)

kinow added this to the 0.1 milestone Sep 10, 2019