Apollo Server socket connection issue when used with Apollo Federation (and make-fetch-happen) #2313

pcoulso1 · 2023-01-04T12:30:03Z

Issue Description

I am seeing a Socket Hung up issues occurring between the Gateway and the Microservices, when using the ApolloServer in both Gateway and Microservices components, within Apollo Federation 2.

It appears to be a race condition but is happening frequently enough as to de-stabilize the connection between underlying microservice and the gateway.

The issue is reported in the gateway as;

{"level":50,"time":1671446399521,"pid":16,"hostname":"xxxxx",
"name":"service-gateway@2.0.1",
"message":"request to https://microservice.apps.domain.net/graphql failed, reason: socket hang up",
"extensions":{"code":"INTERNAL_SERVER_ERROR",
"stacktrace":["FetchError: request to https://microservice.apps.domain.net/graphql failed, reason: socket hang up",
"    at ClientRequest.<anonymous> (/app/node_modules/minipass-fetch/lib/index.js:130:14)",
"    at ClientRequest.emit (node:events:513:28)",
"    at ClientRequest.emit (node:domain:489:12)",
"    at TLSSocket.socketOnEnd (node:_http_client:518:9)",
"    at TLSSocket.emit (node:events:525:35)",
"    at TLSSocket.emit (node:domain:489:12)",
"    at endReadableNT (node:internal/streams/readable:1358:12)",
"    at processTicksAndRejections (node:internal/process/task_queues:83:21)"]}}

On investigation the issues appears to be similar to the one reported in https://medium.com/ssense-tech/reduce-networking-errors-in-nodejs-23b4eb9f2d83, where there is a race condition between the client and the server closing the connection.

The issues seems to be related to the "freeSocketTimeout" implemented within the 'agentkeepalive' agent (which is used by 'make-fetch-happen'). In the 'make-fetch-happen', freeSocketTimeout is hard coded to 15000, (see https://github.com/npm/make-fetch-happen/blob/main/lib/agent.js#L80) but by default the server side socket keep alive timeout is set to just 5000 (see https://nodejs.org/docs/latest-v16.x/api/http.html#serverkeepalivetimeout). So there is a window of opportunity for the server to close the socket but the client still consider it as a usable socket.

This results in the “socket hang up” or ECONNRESET errors being reported at about 5000ms of ideal time on the socket.

The version of NPM and NODEJS which I am using is;

npm: '8.19.3',
node: '16.19.0',

The following is an output of the attached reproducer app, which simply uses ApolloServer as the server and then calls make-fetch-happen to handle the calls into the server. As you can see these fail about 50% of the time (when using a 4999ms delay)

{
  successes: 26,
  failures: {
    'request to https://localhost:4001/graphql failed, reason: read ECONNRESET': 24
  }
}

I’ve also included two possible workarounds;

Option 1 : Override the agent and reduce the freeSocketTimeout on the client side
Option 2 : Increase the keepAliveTimeout to over 15000 on the server side

I suspect the real fix should be to get 'make-fetch-happen' to make freeSocketTimeout configurable and then set it to an appropriate value

Please advise what the long term fix should be for this issue.

Link to Reproduction

https://github.com/pcoulso1/connection-test

Reproduction Steps

Clone the repo and run the following command lines

npm install
npm run start:dev

The text was updated successfully, but these errors were encountered:

glasser · 2023-01-04T20:22:02Z

Migrating this to federation as this is primarily about Gateway, though it is possible that one response might include improving documentation to encourage Apollo Server subgraph developers to increase their keepAliveTimeout. More thoughts to come later once I've checked with some folks who might have good ideas. Really great reproduction — thanks! One note is that IIUC make-fetch-happen's retry feature (which we do disable) does retry ECONNRESET by default... but not for POSTs. I suspect that we should just retry ECONNRESET for non-mutations in Gateway. It's also quite possible that Apollo Router handles this better than Node/make-fetch-happen does!

pcoulso1 · 2023-01-06T10:40:46Z

I think the root of this issue is with the make-fetch-happen library which has hardcoded freeSocketTimeout, therefore I've raised the following bug report at npm/make-fetch-happen#199

glasser · 2023-01-06T15:49:53Z

Yeah, though I could understand if their response was "the proper way to configure HttpAgent parameters is to override the agent, not to copy all of HttpAgent's options into make-fetch-happen". Though... the mechanisms they give to override the agent only let you set one fixed agent rather than use their per-host agent logic, right?

glasser transferred this issue from apollographql/apollo-server Jan 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apollo Server socket connection issue when used with Apollo Federation (and make-fetch-happen) #2313

Apollo Server socket connection issue when used with Apollo Federation (and make-fetch-happen) #2313

pcoulso1 commented Jan 4, 2023

glasser commented Jan 4, 2023

pcoulso1 commented Jan 6, 2023

glasser commented Jan 6, 2023

Apollo Server socket connection issue when used with Apollo Federation (and make-fetch-happen) #2313

Apollo Server socket connection issue when used with Apollo Federation (and make-fetch-happen) #2313

Comments

pcoulso1 commented Jan 4, 2023

Issue Description

Link to Reproduction

Reproduction Steps

glasser commented Jan 4, 2023

pcoulso1 commented Jan 6, 2023

glasser commented Jan 6, 2023