Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apollo Server socket connection issue when used with Apollo Federation (and make-fetch-happen) #2313

Open
pcoulso1 opened this issue Jan 4, 2023 · 3 comments

Comments

@pcoulso1
Copy link

pcoulso1 commented Jan 4, 2023

Issue Description

I am seeing a Socket Hung up issues occurring between the Gateway and the Microservices, when using the ApolloServer in both Gateway and Microservices components, within Apollo Federation 2.

It appears to be a race condition but is happening frequently enough as to de-stabilize the connection between underlying microservice and the gateway.

The issue is reported in the gateway as;

{"level":50,"time":1671446399521,"pid":16,"hostname":"xxxxx",
"name":"service-gateway@2.0.1",
"message":"request to https://microservice.apps.domain.net/graphql failed, reason: socket hang up",
"extensions":{"code":"INTERNAL_SERVER_ERROR",
"stacktrace":["FetchError: request to https://microservice.apps.domain.net/graphql failed, reason: socket hang up",
"    at ClientRequest.<anonymous> (/app/node_modules/minipass-fetch/lib/index.js:130:14)",
"    at ClientRequest.emit (node:events:513:28)",
"    at ClientRequest.emit (node:domain:489:12)",
"    at TLSSocket.socketOnEnd (node:_http_client:518:9)",
"    at TLSSocket.emit (node:events:525:35)",
"    at TLSSocket.emit (node:domain:489:12)",
"    at endReadableNT (node:internal/streams/readable:1358:12)",
"    at processTicksAndRejections (node:internal/process/task_queues:83:21)"]}}

On investigation the issues appears to be similar to the one reported in https://medium.com/ssense-tech/reduce-networking-errors-in-nodejs-23b4eb9f2d83, where there is a race condition between the client and the server closing the connection.

The issues seems to be related to the "freeSocketTimeout" implemented within the 'agentkeepalive' agent (which is used by 'make-fetch-happen'). In the 'make-fetch-happen', freeSocketTimeout is hard coded to 15000, (see https://github.com/npm/make-fetch-happen/blob/main/lib/agent.js#L80) but by default the server side socket keep alive timeout is set to just 5000 (see https://nodejs.org/docs/latest-v16.x/api/http.html#serverkeepalivetimeout). So there is a window of opportunity for the server to close the socket but the client still consider it as a usable socket.

This results in the “socket hang up” or ECONNRESET errors being reported at about 5000ms of ideal time on the socket.

The version of NPM and NODEJS which I am using is;

  • npm: '8.19.3',
  • node: '16.19.0',

The following is an output of the attached reproducer app, which simply uses ApolloServer as the server and then calls make-fetch-happen to handle the calls into the server. As you can see these fail about 50% of the time (when using a 4999ms delay)

{
  successes: 26,
  failures: {
    'request to https://localhost:4001/graphql failed, reason: read ECONNRESET': 24
  }
}

I’ve also included two possible workarounds;

  • Option 1 : Override the agent and reduce the freeSocketTimeout on the client side
  • Option 2 : Increase the keepAliveTimeout to over 15000 on the server side

I suspect the real fix should be to get 'make-fetch-happen' to make freeSocketTimeout configurable and then set it to an appropriate value

Please advise what the long term fix should be for this issue.

Link to Reproduction

https://github.com/pcoulso1/connection-test

Reproduction Steps

Clone the repo and run the following command lines

npm install
npm run start:dev
@glasser glasser transferred this issue from apollographql/apollo-server Jan 4, 2023
@glasser
Copy link
Member

glasser commented Jan 4, 2023

Migrating this to federation as this is primarily about Gateway, though it is possible that one response might include improving documentation to encourage Apollo Server subgraph developers to increase their keepAliveTimeout. More thoughts to come later once I've checked with some folks who might have good ideas. Really great reproduction — thanks! One note is that IIUC make-fetch-happen's retry feature (which we do disable) does retry ECONNRESET by default... but not for POSTs. I suspect that we should just retry ECONNRESET for non-mutations in Gateway. It's also quite possible that Apollo Router handles this better than Node/make-fetch-happen does!

@pcoulso1
Copy link
Author

pcoulso1 commented Jan 6, 2023

I think the root of this issue is with the make-fetch-happen library which has hardcoded freeSocketTimeout, therefore I've raised the following bug report at npm/make-fetch-happen#199

@glasser
Copy link
Member

glasser commented Jan 6, 2023

Yeah, though I could understand if their response was "the proper way to configure HttpAgent parameters is to override the agent, not to copy all of HttpAgent's options into make-fetch-happen". Though... the mechanisms they give to override the agent only let you set one fixed agent rather than use their per-host agent logic, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants