
Exhaustion of ephemeral ports when running as a remote #449

Closed
fra967 opened this issue Oct 29, 2018 · 8 comments
fra967 commented Oct 29, 2018

We are running out of sockets on CDN 3.1.0, used as a remote, under a not particularly high load.

CDN stops responding, and systemd[10153]: Reached target Sockets. is logged in /var/log/syslog.

The current range of ephemeral ports is

# cat /proc/sys/net/ipv4/ip_local_port_range
32768   60999

And TCP connections should be reused, since net.ipv4.tcp_tw_reuse = 1 is set in /etc/sysctl.conf
(see https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux).

Nevertheless we end up with a huge number of sockets stuck in TIME_WAIT:

# netstat -an | grep TIME_WAIT | wc -l
32773

Could it be that the library used for remote connections is not closing or reusing TCP connections?

As a temporary workaround we have increased the range of ephemeral ports.
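
(For reference, connection reuse on the Node side normally comes from a keep-alive http.Agent. The sketch below is only an illustration of that mechanism, not the code path CDN actually uses, and example.com is just a placeholder:)

    const http = require('http')

    // With keepAlive the agent pools sockets and reuses them across requests,
    // so repeated calls do not burn a new ephemeral port each time.
    const agent = new http.Agent({ keepAlive: true, maxSockets: 10 })

    http.get({ host: 'example.com', path: '/', agent }, res => {
      // the response must be drained for the socket to go back into the pool
      res.resume()
    })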


fra967 commented Nov 11, 2018

Unfortunately I had to reboot before investigating further, but the error below was reported on two CDN instances suffering from a similar issue, although not exactly the same: in this case file descriptors had been exhausted. The limit had already been increased from the default 1024 to 8192, which is already quite high and should not be reached unless we have a socket leak.

I believe the number of sockets opened by Node is unlimited these days (see https://nodejs.org/api/http.html#http_agent_maxsockets).
Having a hard limit is unlikely to help: if we have a socket leak (or connections not being closed), it will just fill up memory while keeping all the requests in the queue, and we will end up with memory exhaustion instead. But possibly we would be able to reproduce the problem more quickly?

[ERROR]  Trace
    at process.on (/dadi/cdn/node_modules/@dadi/cdn/dadi/lib/index.js:25:13)
    at emitTwo (events.js:131:20)
    at process.emit (events.js:214:7)
    at emitPendingUnhandledRejections (internal/process/promises.js:108:22)
    at runMicrotasksCallback (internal/process/next_tick.js:124:9)
    at _combinedTickCallback (internal/process/next_tick.js:131:7)
    at process._tickDomainCallback (internal/process/next_tick.js:218:9)
[2018-11-11 10:40:48.752] [ERROR]  { Error: spawn /dadi/cdn/node_modules/jpegtran-bin/vendor/jpegtran EMFILE
    at _errnoException (util.js:1022:11)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:190:19)
    at onErrorNT (internal/child_process.js:372:16)
    at _combinedTickCallback (internal/process/next_tick.js:138:11)
    at process._tickDomainCallback (internal/process/next_tick.js:218:9)
  code: 'EMFILE',
  errno: 'EMFILE',
  syscall: 'spawn /dadi/cdn/node_modules/jpegtran-bin/vendor/jpegtran',
  path: '/dadi/cdn/node_modules/jpegtran-bin/vendor/jpegtran',
  spawnargs: 
   [ '-copy',
     'none',
     '-progressive',
     '-optimize',
     '-outfile',
     '/tmp/d1cbc9a7-1c78-4d36-b4ec-e0d5b070feb6',
     '/tmp/1e53be54-eb5c-4925-a72f-08460aadea68' ],
  killed: false,
  stdout: null,
  stderr: null,
  failed: true,
  signal: null,
  cmd: '/dadi/cdn/node_modules/jpegtran-bin/vendor/jpegtran -copy none -progressive -optimize -outfile /tmp/d1cbc9a7-1c78-4d36-b4ec-e0d5b070feb6 /tmp/1e53be54-eb5c-4925-a72f-08460aadea68',
  timedOut: false } 'Unhandled Rejection at Promise' Promise {
  <rejected> { Error: spawn /dadi/cdn/node_modules/jpegtran-bin/vendor/jpegtran EMFILE
    at _errnoException (util.js:1022:11)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:190:19)
    at onErrorNT (internal/child_process.js:372:16)
    at _combinedTickCallback (internal/process/next_tick.js:138:11)
    at process._tickDomainCallback (internal/process/next_tick.js:218:9)
  code: 'EMFILE',
  errno: 'EMFILE',
  syscall: 'spawn /dadi/cdn/node_modules/jpegtran-bin/vendor/jpegtran',
  path: '/dadi/cdn/node_modules/jpegtran-bin/vendor/jpegtran',
  spawnargs: 
   [ '-copy',
     'none',
     '-progressive',
     '-optimize',
     '-outfile',
     '/tmp/d1cbc9a7-1c78-4d36-b4ec-e0d5b070feb6',
     '/tmp/1e53be54-eb5c-4925-a72f-08460aadea68' ],
  killed: false,
  stdout: null,
  stderr: null,
  failed: true,
  signal: null,
  cmd: '/dadi/cdn/node_modules/jpegtran-bin/vendor/jpegtran -copy none -progressive -optimize -outfile /tmp/d1cbc9a7-1c78-4d36-b4ec-e0d5b070feb6 /tmp/1e53be54-eb5c-4925-a72f-08460aadea68',
  timedOut: false } }
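
(For completeness, and only as a sketch of the maxSockets point above rather than anything CDN configures: a hard cap can be set on the default agent, but as said it would just queue requests instead of fixing a leak.)

    const http = require('http')

    // Cap concurrent sockets per origin on the global agent; with a leak,
    // requests would pile up in the agent queue instead of hitting EMFILE.
    http.globalAgent.maxSockets = 256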

jimlambie (Contributor) commented

@fra967 it seems like the jpegtran dependency isn't properly installed. Can you try the following in the project root?

npm rebuild jpegtran-bin


fra967 commented Nov 12, 2018

OK, thank you @jimlambie, jpegtran is fixed; we are still monitoring the port usage.


fra967 commented Nov 20, 2018

It seems to be fixed by adding socket.destroy();, as done here for the node-tunnel package:
koichik/node-tunnel@b772cd5#diff-7cb25e6733372586a7d0530feba6f8ae

i.e. it should be fixed in node-tunnel 0.0.6 (we are pulling 0.0.2).

See also discussion in request/request#2440
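
(Roughly, the change boils down to tearing the socket down explicitly once it is no longer needed instead of leaving it to linger. A generic illustration only, not node-tunnel's actual code:)

    const http = require('http')

    const req = http.request({ host: 'example.com', path: '/' }, res => {
      res.resume()
      res.on('end', () => {
        // explicitly destroy the socket instead of leaving it half-open;
        // this is the essence of the node-tunnel patch linked above
        req.socket.destroy()
      })
    })
    req.end()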


fra967 commented Nov 21, 2018

The tunnel patch gave good results, but tunnel is not used by CDN any more (it was part of wget-improved).

After removing tunnel, the problem persists with tunnel-agent.

The upgrade of jpegtran to 6.0.0 comes with tunnel-agent 0.6.0 instead of 0.4.3.
It seems better, although it still leaks some sockets.
tunnel-agent 0.6.0 does not yet include the CLOSE_WAIT patch.

The socket leak seems to be correlated with 404 errors from the remote, but not strictly (i.e. the numbers do not match).

jimlambie (Contributor) commented

@fra967 I believe this may solve the issue: #463


fra967 commented Nov 21, 2018

Thank you @jimlambie, I can confirm it works very well: not a single socket leaked in 500 requests.
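
(A rough way to repeat this check — the port and path below are placeholders, not CDN's actual configuration — is to fire a batch of requests and compare the TIME_WAIT count from netstat before and after:)

    const http = require('http')

    // fire 500 sequential requests against a CDN route, then compare
    // `netstat -an | grep TIME_WAIT | wc -l` before and after
    const hit = n => {
      if (n === 0) return console.log('done')
      http.get('http://localhost:8001/some-image.jpg', res => {
        res.resume()
        res.on('end', () => hit(n - 1))
      }).on('error', () => hit(n - 1))
    }

    hit(500)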

jimlambie (Contributor) commented

Closed by release https://github.com/dadi/cdn/releases/tag/v3.4.3
