You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am seeing a Socket Hung up issues occurring between the Gateway and the Microservices, when using the ApolloServer in both Gateway and Microservices components, within Apollo Federation 2.
It appears to be a race condition but is happening frequently enough as to de-stabilize the connection between underlying microservice and the gateway.
The issue is reported in the gateway as;
{"level":50,"time":1671446399521,"pid":16,"hostname":"xxxxx",
"name":"service-gateway@2.0.1",
"message":"request to https://microservice.apps.domain.net/graphql failed, reason: socket hang up",
"extensions":{"code":"INTERNAL_SERVER_ERROR",
"stacktrace":["FetchError: request to https://microservice.apps.domain.net/graphql failed, reason: socket hang up",
" at ClientRequest.<anonymous> (/app/node_modules/minipass-fetch/lib/index.js:130:14)",
" at ClientRequest.emit (node:events:513:28)",
" at ClientRequest.emit (node:domain:489:12)",
" at TLSSocket.socketOnEnd (node:_http_client:518:9)",
" at TLSSocket.emit (node:events:525:35)",
" at TLSSocket.emit (node:domain:489:12)",
" at endReadableNT (node:internal/streams/readable:1358:12)",
" at processTicksAndRejections (node:internal/process/task_queues:83:21)"]}}
The issues seems to be related to the "freeSocketTimeout" implemented within the 'agentkeepalive' agent (which is used by 'make-fetch-happen'). In the 'make-fetch-happen', freeSocketTimeout is hard coded to 15000, (see https://github.com/npm/make-fetch-happen/blob/main/lib/agent.js#L80) but by default the server side socket keep alive timeout is set to just 5000 (see https://nodejs.org/docs/latest-v16.x/api/http.html#serverkeepalivetimeout). So there is a window of opportunity for the server to close the socket but the client still consider it as a usable socket.
This results in the “socket hang up” or ECONNRESET errors being reported at about 5000ms of ideal time on the socket.
The version of NPM and NODEJS which I am using is;
npm: '8.19.3',
node: '16.19.0',
The following is an output of the attached reproducer app, which simply uses ApolloServer as the server and then calls make-fetch-happen to handle the calls into the server. As you can see these fail about 50% of the time (when using a 4999ms delay)
Migrating this to federation as this is primarily about Gateway, though it is possible that one response might include improving documentation to encourage Apollo Server subgraph developers to increase their keepAliveTimeout. More thoughts to come later once I've checked with some folks who might have good ideas. Really great reproduction — thanks! One note is that IIUC make-fetch-happen's retry feature (which we do disable) does retry ECONNRESET by default... but not for POSTs. I suspect that we should just retry ECONNRESET for non-mutations in Gateway. It's also quite possible that Apollo Router handles this better than Node/make-fetch-happen does!
I think the root of this issue is with the make-fetch-happen library which has hardcoded freeSocketTimeout, therefore I've raised the following bug report at npm/make-fetch-happen#199
Yeah, though I could understand if their response was "the proper way to configure HttpAgent parameters is to override the agent, not to copy all of HttpAgent's options into make-fetch-happen". Though... the mechanisms they give to override the agent only let you set one fixed agent rather than use their per-host agent logic, right?
Issue Description
I am seeing a Socket Hung up issues occurring between the Gateway and the Microservices, when using the ApolloServer in both Gateway and Microservices components, within Apollo Federation 2.
It appears to be a race condition but is happening frequently enough as to de-stabilize the connection between underlying microservice and the gateway.
The issue is reported in the gateway as;
On investigation the issues appears to be similar to the one reported in https://medium.com/ssense-tech/reduce-networking-errors-in-nodejs-23b4eb9f2d83, where there is a race condition between the client and the server closing the connection.
The issues seems to be related to the "freeSocketTimeout" implemented within the 'agentkeepalive' agent (which is used by 'make-fetch-happen'). In the 'make-fetch-happen', freeSocketTimeout is hard coded to 15000, (see https://github.com/npm/make-fetch-happen/blob/main/lib/agent.js#L80) but by default the server side socket keep alive timeout is set to just 5000 (see https://nodejs.org/docs/latest-v16.x/api/http.html#serverkeepalivetimeout). So there is a window of opportunity for the server to close the socket but the client still consider it as a usable socket.
This results in the “socket hang up” or ECONNRESET errors being reported at about 5000ms of ideal time on the socket.
The version of NPM and NODEJS which I am using is;
The following is an output of the attached reproducer app, which simply uses ApolloServer as the server and then calls make-fetch-happen to handle the calls into the server. As you can see these fail about 50% of the time (when using a 4999ms delay)
I’ve also included two possible workarounds;
I suspect the real fix should be to get 'make-fetch-happen' to make freeSocketTimeout configurable and then set it to an appropriate value
Please advise what the long term fix should be for this issue.
Link to Reproduction
https://github.com/pcoulso1/connection-test
Reproduction Steps
Clone the repo and run the following command lines
The text was updated successfully, but these errors were encountered: