
All firestore queries hang for almost exactly 1 hour #2495

Closed
beckjake opened this issue Mar 14, 2024 · 25 comments

beckjake commented Mar 14, 2024

[REQUIRED] Step 2: Describe your environment

  • Operating System version: Alpine 3.16
  • Firebase SDK version: 12.0.0
  • Firebase Product: database (we also use auth)
  • Node.js version: v18.11.0, also observed on v20.11.1
  • NPM version: 8.19.2

[REQUIRED] Step 3: Describe the problem

Periodically, reads from Firebase using the Admin SDK will hang for an extended period. We've isolated the hang to reading a single, fairly small document out of a particular collection, which happens on every request. The document simply doesn't return for an extended period. Once a process gets into this state, it remains "stuck" for almost exactly either 55 minutes or an hour. Firebase auth succeeds, but subsequent reads from Firestore hang. Then, with no user intervention required, it just starts working properly again.

Steps to reproduce:

We've been unable to identify consistent reproduction steps, but we believe the trigger is simply doing concurrent auths, database reads, and potentially a small number of database writes (to different docs) for multiple users at the same time (sketched below).

We were able to reproduce this with the following environment variables set (though we still don't know exactly what triggered it):

GRPC_VERBOSITY=debug
GRPC_TRACE=all
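
For what it's worth, a rough sketch of the kind of concurrent load we believe triggers it looks like this (the collection names match our app; the user IDs here are placeholders, and the real flow also exchanges a refresh token and calls verifyIdToken per request, as shown under "Relevant Code" below):

const admin = require("firebase-admin");
admin.initializeApp({ credential: admin.credential.applicationDefault() });

async function simulateRequest(uid) {
  // the small per-user read that ends up hanging
  const userDoc = await admin.firestore().collection("users").doc(uid).get();
  // an occasional concurrent write to a different collection
  const orgRef = admin.firestore().collection("orgs").doc();
  await orgRef.set({ createdBy: uid, createdAt: Date.now() });
  return userDoc.exists;
}

// several overlapping requests for multiple users at once
Promise.all(["user-a", "user-b", "user-c"].map(simulateRequest))
  .then(results => console.log("done", results))
  .catch(err => console.error("failed", err));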

I notice that in the bad state, all queries end up logging load_balancing_call | [8682] Pick result: QUEUE subchannel: null status: undefined undefined

whereas in the happy state, we see Pick result: COMPLETE, a subchannel ID, and an IP address.

I've uploaded two log files: one runs from shortly before the bad state started to shortly after it started, and the other from shortly before things started to recover to shortly after. I found the timings very suspicious; exactly one hour seems too good to be true. There's a lot of logging going on, and pulling it all from GCP and removing our app logs isn't a great experience, so I didn't upload all our output for the hour, but these log files should be fairly representative.

Last successful read: 2024-03-13T20:42:24.614Z (first file)
First hanging read: 2024-03-13T20:42:25.901Z (first file)
Change in behavior: 2024-03-13T21:37:12.490Z (second file) - a bunch of what looks like connection cleanup happens here
Things fully resolve: ~2024-03-13T21:42:???? (we observed the application acting healthy by this point, but it could have been up to 5 minutes earlier, as we'd stopped paying careful attention by then)
bad-times-begin.txt
bad-times-end.txt

Another thing we've noticed, and which I have a hard time believing is more than a coincidence, is that this generally happens at roughly 45 minutes past the hour.

When this occurs, we see a DEADLINE_EXCEEDED from any writes we're doing concurrently (creating a doc in a collection).

Relevant Code:

This is with some abstractions removed, but the flow of it is:

// once, at startup
const admin = require("firebase-admin");
admin.initializeApp({ credential: admin.credential.applicationDefault() });

// every request
// (Post is our HTTP helper; it exchanges the user's refresh token for an ID token)
const idToken = await Post(`https://securetoken.googleapis.com/v1/token?key=${apiKey}`, {
  refresh_token,
  grant_type: "refresh_token",
}).then(response => response.data.id_token);
const decodedToken = await admin.auth().verifyIdToken(idToken);
// hangs here
const userDoc = await admin.firestore().collection("users").doc(decodedToken.uid).get()
  .then(doc => doc.data());

When it's in the hanging state, this will trigger a DEADLINE_EXCEEDED:

const doc = admin.firestore().collection("orgs").doc();
// triggers an error with {"code":4,"details":"Deadline exceeded","metadata":{},"note":"Exception occurred in retry method that was not classified as transient"}
const result = await doc.set({/* some data */}).then(() => doc.id);
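
For completeness, catching that write failure looks roughly like this (4 is the gRPC status code for DEADLINE_EXCEEDED; the handling shown here is just a sketch, not anything SDK-specific):

try {
  await doc.set({ /* some data */ });
} catch (err) {
  // err.code 4 maps to gRPC DEADLINE_EXCEEDED
  if (err.code === 4) {
    console.error("Firestore write deadline exceeded:", err.details);
  }
  throw err;
}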

ehsannas commented Mar 14, 2024

Hey @beckjake,

(1) I presume your code snippets meant to use admin.firestore().collection... rather than admin.collection...?

(2) What kind of environment are you running in? As far as I remember, the application default credentials are valid for 1 hour, so I think something might be wrong with the token. Have you tried following the instructions here: https://firebase.google.com/docs/admin/setup#initialize-sdk ?
Depending on your environment, you may just need to call initializeApp().
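
For example (assuming Application Default Credentials are available in your environment, e.g. via GOOGLE_APPLICATION_CREDENTIALS or the GCP metadata server), this is all the initialization that's needed:

const admin = require("firebase-admin");
// with no arguments, the SDK falls back to Application Default Credentials
admin.initializeApp();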

ehsannas self-assigned this Mar 14, 2024
beckjake (Author) commented:

> Hey @beckjake,
>
> (1) I presume your code snippets meant to use admin.firestore().collection... rather than admin.collection...?

Sorry, yes, that got lost in translation: the result of admin.firestore() is something we pass between a couple of functions. I'll update the original post.

> (2) What kind of environment are you running in? As far as I remember, the application default credentials are valid for 1 hour, so I think something might be wrong with the token. Have you tried following the instructions here: https://firebase.google.com/docs/admin/setup#initialize-sdk ? Depending on your environment, you may just need to call initializeApp().

We're running in a pod within Google Kubernetes Engine. We call admin.initializeApp({ credential: admin.credential.applicationDefault() }); when our application starts up. We set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which points at a service account JSON file on disk.

The application works for long periods of time before one pod (or multiple pods) in our cluster periodically gets stuck in this state. Today we've been trying to reproduce the issue and have only been able to reproduce ~1-minute drop-outs that have a lot of messages like this:

D 2024-03-14T21:02:40.816Z | transport | [object Object] connection failed with error connect EADDRNOTAVAIL 2607:f8b0:4001:c2b::5f:443 - Local (:::0)
D 2024-03-14T21:02:40.816Z | subchannel | (216) 2607:f8b0:4001:c2b::5f:443 CONNECTING -> TRANSIENT_FAILURE
D 2024-03-14T21:02:40.816Z | subchannel | (216) 2607:f8b0:4001:c2b::5f:443 TRANSIENT_FAILURE -> IDLE
D 2024-03-14T21:02:40.816Z | subchannel | (218) 2607:f8b0:4001:c18::5f:443 TRANSIENT_FAILURE -> CONNECTING

but there are also times when we see those messages and don't experience any service interruptions, so I'm not sure how much to read into the TRANSIENT_FAILUREs.


beckjake commented Mar 14, 2024

We're experiencing a real hang right this moment (Pick result: QUEUE) on one of our pods. One thing I find in the stuck pod's logs, but not in the others, is the state=TRANSPARENT_ONLY handling status with progress PROCESSED from child line:

D 2024-03-14T23:08:08.381Z | v1.10.2 1 | resolving_call | [180] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | retrying_call | [181] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | retrying_call | [181] ended with status: code=4 details="Deadline exceeded"
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | load_balancing_call | [182] cancelWithStatus code: 4 details: "Deadline exceeded"
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | load_balancing_call | [182] ended with status: code=4 details="Deadline exceeded"
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | retrying_call | [181] Received status from child [182]
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | retrying_call | [181] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [182] in state ACTIVE
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | retrying_call | [181] ended with status: code=4 details="Deadline exceeded"
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | resolving_call | [180] ended with status: code=4 details="Deadline exceeded"
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | resolving_call | [180] Received status
D 2024-03-14T23:08:08.381Z | v1.10.2 1 | resolving_call | [180] Received status
D 2024-03-14T23:08:08.382Z | v1.10.2 1 | resolving_call | [180] cancelWithStatus code: 1 details: "Cancelled on client"
D 2024-03-14T23:08:08.382Z | v1.10.2 1 | retrying_call | [181] cancelWithStatus code: 1 details: "Cancelled on client"
D 2024-03-14T23:08:08.384Z | v1.10.2 1 | retrying_call | [181] ended with status: code=1 details="Cancelled on client"
D 2024-03-14T23:08:08.384Z | v1.10.2 1 | load_balancing_call | [182] cancelWithStatus code: 1 details: "Cancelled on client"
D 2024-03-14T23:08:08.384Z | v1.10.2 1 | resolving_call | [180] Received status

The D 2024-03-14T23:08:08.381Z | v1.10.2 1 | retrying_call | [181] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [182] in state ACTIVE log line repeats with different numbers in brackets, and it begins exactly 5 minutes after we first detected the outage (30,000 ms is a deadline value that I see logged a lot).

This hang lasted more than an hour. We attached a debugger and the process died, so unfortunately we don't know whether it would have eventually self-corrected.


ehsannas commented Mar 14, 2024

Hmmm... I see a lot of "Deadline exceeded" errors. What is your usage pattern like? Are you reaching the maximum writes per second?

Also please check: this and this.

beckjake (Author) commented:

We're definitely not making API requests >10 MB in the cases that are hanging. I suppose it is possible, albeit unlikely, that something else in our infrastructure is making such requests at the same time - would that trigger this behavior on all subsequent reads?

We shouldn't have any transactions running longer than 270 seconds, but we can check on that. Even if we do, they wouldn't be on the machine in question that's hitting the deadlines.

beckjake (Author) commented:

One thing that I think is different about our situation vs. the linked ones is that we don't see DEADLINE_EXCEEDED on reads; they just hang forever. Is that expected? Is there some way we can at least detect this from the read side instead of having our application just hang?
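
In the meantime, is a client-side timeout wrapper along these lines a reasonable stopgap? (This is just a sketch of our own helper idea, not something the SDK provides; the 10-second value is arbitrary.)

function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // whichever settles first wins; clear the timer so it doesn't keep the process alive
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

const userDoc = await withTimeout(
  admin.firestore().collection("users").doc(decodedToken.uid).get(),
  10000,
  "users doc read"
).then(doc => doc.data());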


michaelAtCoalesce commented Mar 15, 2024

Just adding to Jacob's comments above: is there a better way to pinpoint what may be hitting a platform limit? Some kind of telemetry or log?

We are able to recreate this every few hours, but on only one of our 6 Kubernetes replicas.

Also, there are some reasons I don't think it's related to the other issues you mentioned, Ehsan:

  1. We should get at least some response instead of a hang of an hour or more; I don't think anyone in the threads you linked has reported an hour-plus hang on reads like we are seeing.
  2. If it were load- or limit-related, we would likely see it across all of our replicas rather than just one.
  3. It isn't reliably recreatable; if it were a load issue, one would expect it to recur more consistently than once every couple of hours.

Any further ideas?


michaelAtCoalesce commented Mar 15, 2024

I'd like to bump the priority of this; it is blocking for our team. Can we get some help on this?

michaelAtCoalesce commented:

Closing, as it was related to the ticket linked above.


lox commented Mar 27, 2024

We are seeing the same issue. How did you resolve it, @michaelAtCoalesce?


lox commented Mar 27, 2024

@ehsannas could you re-open this please? I believe it's an active issue.

michaelAtCoalesce commented:

The gRPC library (@grpc/grpc-js) version 1.10.2 is broken; make sure that your dependencies are using any version other than that.
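
To see which version your tree actually resolves, npm ls @grpc/grpc-js works (see the output lox posted below). If a transitive dependency pulls in 1.10.2, one option (assuming npm 8.3+) is an overrides entry in package.json; the version shown here is just an example:

{
  "overrides": {
    "@grpc/grpc-js": "1.10.3"
  }
}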


lox commented Mar 27, 2024

Strange, my Firestore was exhibiting the same problems with 1.10.1:

@google-cloud/firestore 7.3.0
└─┬ google-gax 4.0.5
  └── @grpc/grpc-js 1.9.13
@google-cloud/kms 4.1.0
└─┬ google-gax 4.3.1
  └── @grpc/grpc-js 1.10.1
@google-cloud/pubsub 4.3.2
└─┬ google-gax 4.3.1
  └── @grpc/grpc-js 1.10.1
@google-cloud/secret-manager 5.1.0
└─┬ google-gax 4.3.1
  └── @grpc/grpc-js 1.10.1
firebase-admin 12.0.0
└─┬ @google-cloud/firestore 7.3.0
  └─┬ google-gax 4.0.5
    └── @grpc/grpc-js 1.9.13
firebase-functions 4.7.0
└─┬ firebase-admin 12.0.0 peer
  └─┬ @google-cloud/firestore 7.3.0
    └─┬ google-gax 4.0.5
      └── @grpc/grpc-js 1.9.13
langchain 0.1.17
└─┬ @langchain/community 0.0.27
  └─┬ firebase-admin 12.0.0 peer
    └─┬ @google-cloud/firestore 7.3.0
      └─┬ google-gax 4.0.5
        └── @grpc/grpc-js 1.9.13

michaelAtCoalesce commented:

I don't know when the issue was introduced, but 1.10.3 worked for us.

ehsannas commented:

googleapis/gax-nodejs#1576 is getting updated. Once that's released, we'll update nodejs-firestore and firebase-admin-node.


lox commented Mar 27, 2024

Then shouldn't this issue be open until resolved?


ehsannas commented Mar 27, 2024

Yes! (also internally tracked b/331664537)


ehsannas commented May 3, 2024

The gax-nodejs release (googleapis/gax-nodejs#1581) is in progress.


lox commented May 3, 2024

> The gax-nodejs release (googleapis/gax-nodejs#1581) is in progress.

Is there reason to believe that release will address the issue?


ehsannas commented May 8, 2024

Hi folks, please update your npm dependencies and make sure you're getting @google-cloud/firestore version 7.7.0 and google-gax version 4.3.3 or above. Please let me know if you still experience any issues.
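
After updating, one way to confirm which versions your dependency tree resolves to is:

npm ls @google-cloud/firestore google-gax @grpc/grpc-js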


ehsannas commented May 8, 2024

> Is there reason to believe that release will address the issue?

Yes. We depend on the gRPC library, which has had issues in the older versions (such as grpc/grpc-node#2690). The Firestore SDK does attempt to retry requests where possible, so that might have manifested as slow requests/responses.


lox commented May 8, 2024

@ehsannas is there a way we could verify that with debugging / logging that shows the retries?


ehsannas commented May 9, 2024

Hey @lox, I believe setting the following environment variables will provide additional gRPC debug logs:

export GRPC_VERBOSITY=debug
export GRPC_TRACE=all
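
If changing the shell or pod spec is inconvenient, setting them from code before firebase-admin (and therefore @grpc/grpc-js) is first required should have the same effect, since grpc reads them when it is loaded (CommonJS shown; with ESM the environment route is safer because imports are hoisted):

// must run before the first require of firebase-admin / @grpc/grpc-js
process.env.GRPC_VERBOSITY = "debug";
process.env.GRPC_TRACE = "all";
const admin = require("firebase-admin");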

ehsannas removed their assignment May 17, 2024
ehsannas commented:

The dependencies have been updated in the latest release. Thanks @lahirumaramba.

ehsannas commented:

@lox @beckjake, please let us know if you have any issues after upgrading to the latest release. Thank you.
