Exception occurred in retry method that was not classified as transient #536

Closed

lamstutz opened this issue Jul 23, 2019 · 167 comments

@lamstutz

Related issues

#522

[REQUIRED] Version info

  "dependencies": {
    "@google-cloud/firestore": "^2.2.1",
    "firebase-admin": "^8.2.0",
    "firebase-functions": "^3.0.1",
  },
  "engines": {
    "node": "8"
  }

node: 8

firebase-functions: 3.0.1

firebase-tools: 7.0.1

firebase-admin: 8.2.0

Steps to reproduce

import * as admin from 'firebase-admin';

admin.initializeApp();
const db = admin.firestore();
db.settings({ timestampsInSnapshots: true });


const users = db.collection('users');

users.doc('myUserId').update({ fieldToUpdate: 'newValue' })

The update method throws this error:

{ Error
at Http2CallStream.call.on (/srv/node_modules/@grpc/grpc-js/build/src/client.js:101:45)
at emitOne (events.js:121:20)
at Http2CallStream.emit (events.js:211:7)
at process.nextTick (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:71:22)
at _combinedTickCallback (internal/process/next_tick.js:132:7)
at process._tickDomainCallback (internal/process/next_tick.js:219:9)
code: 13,
details: '',
metadata: Metadata { options: undefined, internalRepr: Map {} },
note: 'Exception occurred in retry method that was not classified as transient' }

Were you able to successfully deploy your functions?

The deployment displays no errors.

@google-oss-bot
Collaborator

I found a few problems with this issue:

  • I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.
  • This issue does not seem to follow the issue template. Make sure you provide all the required information.

@thechenky
Contributor

Hi @lamstutz does this occur on every function invocation? Would you mind pasting more logs? Also, does this happen when updating other collections?

@thechenky thechenky added the api: firestore and Needs: Author Feedback labels and removed the needs-triage label Jul 24, 2019
@thechenky thechenky self-assigned this Jul 24, 2019
@audkar

audkar commented Jul 25, 2019

Having the same error.

does this occur on every function invocation?

No. [The following statement might be wrong:] the error seems to happen only when an idle function is invoked after a longer period of time.

@thechenky
Contributor

From the stack trace and the fact that it happens on an idle function, this looks like a gRPC-related error happening on a cold start. Does this happen on other collection updates? Are you properly handling promises in this function? Because the stack trace mentions retries, it may be happening because the value you're trying to write into Firestore is somehow problematic - are you able to replicate this with another collection with a very simple interface, something like updating a single string field?

@lamstutz
Author

The error occurs regularly but not systematically. It happens only with the update() call (whether with a simple object or an increment), and only with triggers, not with an HTTP function. Promises are properly handled and caught.

ex:

const categoryIds = ['a', 'b'];
return users.doc(userId).update({ categoryIds });

Or

return statistics.doc('users').update({ users: gcFieldValue.increment(-1) });

@google-oss-bot google-oss-bot added the Needs: Attention label and removed the Needs: Author Feedback label Jul 26, 2019
@thechenky
Contributor

Hmm, it seems that something is going wrong with that particular Firestore update call - @schmidt-sebastian @hiranya911 I wonder if you have ideas on what could be going wrong here?

@McGroover-Bottleneck

I am having the exact same problem

@spoxies

spoxies commented Jul 29, 2019

I have to add that, in my experience, the problem is also present with HTTP calls, not only with triggers. It seems especially present when an instance/function has gone to sleep and is indeed doing a cold start.

@stshelton

I'm having the same problem. I'm receiving this error.

Error
at Http2CallStream.call.on (/srv/node_modules/@grpc/grpc-js/build/src/client.js:96:45)
at emitOne (events.js:121:20)
at Http2CallStream.emit (events.js:211:7)
at process.nextTick (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:71:22)
at _combinedTickCallback (internal/process/next_tick.js:132:7)
at process._tickDomainCallback (internal/process/next_tick.js:219:9)

Mine seems to happen randomly in onCreate and onWrite functions. The affected functions are triggered daily and the error has only occurred once; they have fired multiple times since and the error has yet to return. The errors started appearing once I updated firebase-functions from version 2.3.0 to 3.0.1 and firebase-admin from 7.0.0 to 8.0.0.

@jaycosaur

jaycosaur commented Jul 30, 2019

Just thought I'd give some input here. My first instance of this error was on the 15th of July and I now get it regularly (but not consistently) across all our functions.

We have a logging system implemented on our functions that essentially tells us whether a function was cold started (we ping them every minute to keep them warm). Prior to the 15th of July, i.e. when these errors started happening to me (and I have logs on this going back to 2017), cloud functions would delete themselves at approx. 3-5 minute intervals from first creation, making the next invocation a cold start. Since the 15th of July this has increased substantially to greater than 5 hours(!), and today we have seen a function stay warm for 28 hours (causing a lot of issues for our caching). My guess would be that a previously short-lived connection is now having to cope with these much, much longer alive periods.

Unfortunately we do not ping over the weekends, for cost-reduction reasons, but on the 12th (and for the last 12+ months) it was cold starting every 3-5 minutes, and since the 15th it doesn't cold start for 5+ hours. If this is a new 'feature' of cloud functions it is amazing, by the way! It almost means functions never have to hit cold starts if the keep-warm invocations are done right.

@McGroover-Bottleneck

I am getting it with Pub/Sub functions. I also think it is related to cold starts

@damienix

Also started getting these recently :(

@schmidt-sebastian

Sorry for all the trouble this is causing! While we are currently looking into this, we don't have a strong lead as to what is going on. Please do bear with us.

@schmidt-sebastian

@damienix, @bottleneck-admin, @jaycosaur, @spoxies, @lamstutz:

Would you mind sending your project IDs and an approximate time window for these errors (including your timezone) to samstern@google.com? Thanks

@steren

steren commented Jul 31, 2019

This also affects Cloud Run.

For Googlers on this thread, 138705198 is the internal issue.

@McGroover-Bottleneck

How do I private-message you, @schmidt-sebastian?

@schmidt-sebastian

Thanks for sending us your project info. Our backend team will look into the errors. While they do, can you quickly confirm whether your GCF instances are all running in Europe and where your Firestore project is located?

Thanks!

@bottleneck-admin If you need to send project-specific or confidential data to us for issue triage, the recommended way is to open a Support ticket via https://support.google.com/

@McGroover-Bottleneck

Mine are in europe-west2, and the project's Google Cloud Platform (GCP) resource location is europe-west2.

@ltomes

ltomes commented Jul 31, 2019

@steren @schmidt-sebastian

I came across this issue via a Google search.

I am seeing the same behavior surface from gRPC calls made by @grpc/grpc-js, which is a dependency of @google-cloud/storage in my case.

We see intermittent failures in k8s pods when a large number of files are being streamed to storage objects via @google-cloud/storage.

If those logs would be useful, let me know and I can provide a project ID.

{ Error: The caller does not have permission
    at Http2CallStream.call.on (/home/app/node_modules/@grpc/grpc-js/build/src/client.js:101:45)
    at Http2CallStream.emit (events.js:194:15)
    at process.nextTick (/home/app/node_modules/@grpc/grpc-js/build/src/call-stream.js:71:22)
    at process._tickCallback (internal/process/next_tick.js:61:11)
  code: 7,
  details: 'The caller does not have permission',
  metadata: Metadata { options: undefined, internalRepr: Map {} },
  note: 'Exception occurred in retry method that was not classified as transient' }

Note that we receive code 7 instead of 13 like @lamstutz.

@jaycosaur

Our functions and project are both in us-central1.

@schmidt-sebastian

Our backend team believes that they know what the root cause is, but it might take quite a while for the issue to be fixed in all production environments.

@damienix

damienix commented Aug 7, 2019

Any progress on that? This is a really severe issue ;/

As per docs https://firebase.google.com/docs/functions/retries

Cloud Functions guarantees at-least-once execution of a background function for each event emitted by an event source.

Which is no longer true. On my backend, this leads to more and more inconsistent data, as I'm getting random errors from triggered functions that would normally run without any problems :(

Has anyone tried enabling retries on a function to defend against this error? Will that work for system-level errors?

@schmidt-sebastian

With errors like these, your request will likely succeed if you retry. Our client only retries in a couple of cases where we know it is safe (we can always retry a get() request, but we cannot retry writes as we don't know whether there are any side effects). If you know that you can always retry (based on your data model), then I would recommend that as a solution.

You could also wrap your writes in a transaction, which the client retries.
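
To make that concrete, here is a minimal sketch (not an official snippet; the collection and field names are taken from the original report) of wrapping the update in a transaction so the Firestore client can retry it:

import * as admin from 'firebase-admin';

admin.initializeApp();
const db = admin.firestore();

// Wrapping the single-document update in a transaction; the client retries
// the whole transaction on transient failures.
async function updateUser(userId: string): Promise<void> {
  const ref = db.collection('users').doc(userId);
  await db.runTransaction(async (tx) => {
    const snap = await tx.get(ref); // reads must come before writes in a transaction
    if (!snap.exists) {
      return; // nothing to update
    }
    tx.update(ref, { fieldToUpdate: 'newValue' });
  });
}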

@schmidt-sebastian

schmidt-sebastian commented Jun 23, 2020

@bcoe for visibility.

@dzoba

dzoba commented Jul 1, 2020

I started hitting this error today. Not even in a function, just in a Node.js script using firebase-admin that I was running against Firestore to do some DB updates.

@samstr

samstr commented Jul 1, 2020

For me, I was seeing this error as well as "No connection" and "Deadline exceeded" when performing large numbers of individual writes. Refactoring to use batched writes solved it for me.
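
As an illustration of that refactor, a minimal sketch assuming firebase-admin; the collection name and the 500-operation chunking are my own choices (500 is Firestore's per-batch limit):

import * as admin from 'firebase-admin';

admin.initializeApp();
const db = admin.firestore();

// Commit writes in chunks of up to 500 operations (Firestore's per-batch
// limit) instead of issuing one RPC per document.
async function writeInBatches(
  docs: Array<{ id: string; data: { [field: string]: any } }>
): Promise<void> {
  const chunkSize = 500;
  for (let i = 0; i < docs.length; i += chunkSize) {
    const batch = db.batch();
    for (const { id, data } of docs.slice(i, i + chunkSize)) {
      batch.set(db.collection('users').doc(id), data, { merge: true });
    }
    await batch.commit(); // one RPC per chunk
  }
}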

@manwithsteelnerves

manwithsteelnerves commented Jul 13, 2020

We got this error today:
"Exception occurred in retry method that was not classified as transient"

Update:
This error seems to happen with the Cloud Pub/Sub and @grpc/grpc-js libraries.
Fix: unfortunately, I had to add the Pub/Sub Editor role to my Firebase Admin SDK service account, and then it started working again. I'm not sure why the role is suddenly required, as it was not the case earlier. Did they fix the earlier issue, leading to this stricter check, or is this a new issue? @thechenky @mdietz94

@damienromito

damienromito commented Jul 27, 2020

I have the same problem on different endpoints.
Here are the logs in case they help:
(two screenshots of the error logs attached, taken 2020-07-27 at 16:26 and 16:30)

@UnJavaScripter

UnJavaScripter commented Aug 3, 2020

Everything was working fine with my functions on the emulator for several hours, and suddenly I got that same error:

{
  code: 2,
  details: '',
  metadata: Metadata {
    internalRepr: Map(1) { 'content-type' => [Array] },
    options: {}
  },
  note: 'Exception occurred in retry method that was not classified as transient'
}

I'm using Node 14 without any problems. I tried Node 10 and nothing changed.

I've been using the Functions emulator with Firestore triggers. The data gets written regardless, but no logs are shown (and the error we're all having is thrown).

EDIT:

I found the origin of the problem in my particular case: a trailing / in the document path:

functions.firestore.document('users/{userId}/')

I figured it out while trying to deploy the function, after I had given up and was going to test it live. I ended up on this thread, which led me to the solution: https://stackoverflow.com/questions/46818082/error-http-error-400-the-request-has-errors-firebase-firestore-cloud-function
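
For comparison, the same document path without the trailing slash (the implied fix):

functions.firestore.document('users/{userId}')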

@kythin

kythin commented Aug 9, 2020

This error was driving me nuts, but it turned out to be some form of permission issue for me. After updating the service account to have 'Project -> Editor' access, the Firestore writes started working again.

Obviously not ideal to give the service account such wide access, but it's a start!

@jonrandahl

jonrandahl commented Aug 14, 2020

Just adding my own two pence as I've landed on this thread too many times now not to ...

I can concur with @manwithsteelnerves: adding the Pub/Sub Editor role to the service account has somehow stopped the DEADLINE_EXCEEDED errors occurring in one of our cloud functions on one of our instances.

However, on another of our instances, where the service account does not have the Pub/Sub Editor role, the same cloud function works fine both via the local emulator and when deployed to that instance (the same non-local instance the emulator was connected to), and it does not error there.

Thanks to everyone nonetheless for all your hard work to resolve this issue, and for all the comments that have helped others including myself previously!

Please let me know if you would like further information. 🙏

@fabiank0

fabiank0 commented Nov 6, 2020

Got it today while handling files that were over 3 MB in total. No triggers are used.

Error: 14 UNAVAILABLE: No connection established
    at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:175:52)
    at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:341:141)
    at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:304:181)
    at Http2CallStream.outputStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:116:74)
    at Http2CallStream.maybeOutputStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:155:22)
    at Http2CallStream.endCall (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:141:18)
    at Http2CallStream.cancelWithStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:457:14)
    at ChannelImplementation.tryPick (/workspace/node_modules/@grpc/grpc-js/build/src/channel.js:237:32)
    at Object.updateState (/workspace/node_modules/@grpc/grpc-js/build/src/channel.js:106:26)

Node 10,
"@google-cloud/storage": "^4.3.1",
"firebase-admin": "^8.9.2",
"firebase-functions": "^3.8.0",

@thechenky thechenky removed their assignment Feb 22, 2021
@Pixelwelder

I cannot tell you how many days I burned on this issue. In my case, I left a curly bracket out of a document address.

.document('a/{docA}/chat/{docB') // missing final bracket

The error was occurring in a totally different function.

@MattGoldwater

Is anyone else still getting this error?

@Dric0

Dric0 commented Apr 1, 2021

Is anyone else still getting this error?

I did, yesterday. But I only get these once every 2 months

@MattGoldwater

Is anyone else still getting this error?

I did, yesterday. But I only get these once every 2 months

Thanks for answering. Yeah it turned out I got it because I had a syntax error I wasn't aware of.

@calclavia

I'm getting the same issue when I have to write a lot of documents in many async calls (500+ async writes).

@inlined
Member

inlined commented May 28, 2021

Hi there, it seems like the original bug has been fixed and we're getting more reports of possibly more than one bug. Collecting them in the wrong old issue can hurt our ability to triage and get your issues taken care of, so I'm going to close it and let you open new bugs that can be resolved with more specific conversations.

The original bug was due to a networking issue that can happen when a server is idle: the connection gets reset and the next request may fail. Originally this was a problem with the gRPC library because it wasn't handling a clean connection reset. This problem also happens more generally when the FIN packet isn't sent across the internet due to a number of reasons involving performance and security. If the library isn't already aware the connection is invalid (e.g. the FIN packet was dropped or the library isn't handling FIN correctly) the next request will fail. Thanks to the Two Generals Problem, it's impossible to know if a request failed before or after the server got your request. The library can retry if it knows the request is idempotent (e.g. GET) but it can't necessarily retry if the request isn't (e.g. POST). Fortunately, you might know that your code is idempotent. In fact, our guidance is that all cloud functions should be idempotent because you may get more than one invocation. So a retry at the application level should be safe.

Normally you can retry with a simple try/catch. Diving through some of this bug and the internal ones as well, it looks like you couldn't always catch an error in the gRPC library. If that's (still) the case, it's an issue and someone should file a new bug against the gRPC repo or possibly gax-nodejs that exceptions cannot be caught. A foolproof way to handle exceptions anywhere in your codebase is to turn on retries in your functions. This adds a risk that a crash loop will cause indefinite executions, so you'll need to find some way to drop events-of-death.
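
As a sketch of that application-level retry (assuming your write is idempotent; the helper name, retry count, and backoff values below are made up for illustration, and if I recall correctly firebase-functions also exposes the built-in retry option mentioned above via runWith({ failurePolicy: true }) on background functions):

// Hypothetical helper: retry an idempotent operation a few times with
// exponential backoff before giving up. A real implementation would also
// inspect the gRPC status code (e.g. 13 INTERNAL, 14 UNAVAILABLE).
async function withRetries<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // back off 250 ms, 500 ms, 1 s, ...
      await new Promise((resolve) => setTimeout(resolve, 250 * 2 ** attempt));
    }
  }
  throw lastError;
}

// Usage inside a background function, assuming the update itself is idempotent:
// await withRetries(() => users.doc(userId).update({ fieldToUpdate: 'newValue' }));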

I can guarantee you that the event type you're listening to has no impact on this issue. It's happening because your function was idle this whole time and we didn't garbage collect the container so that you could avoid a cold start. Crashes popping up in the Firestore/Datastore library should probably be filed against those SDKs (nodejs-firestore and nodejs-datastore). If you get an obvious networking error, you could also consider filing a bug against the gRPC library instead. You can of course file a bug against this repo as a starting point, but you just might have a slower response as we find the right people and move your bug to the right location. You're our customers and we care about your experience; this repo just isn't where the exception lies, so it's not where the fix will come from.

@inlined inlined closed this as completed May 28, 2021
@MNorgren

I cannot tell you how many days I burned on this issue. In my case, I left a curly bracket out of a document address.

.document('a/{docA}/chat/{docB') // missing final bracket

The error was occurring in a totally different function.

Wow! Thanks. I doubt I would have found this if you hadn't mentioned it! In my case, I was using a dollar sign in my path that caused this error... seriously.

.document('a/{docA}/chat/${docB'})

@radhikadeo

I am still getting this same issue. Can someone help?

@Pixelwelder

@radhikadeo Just to verify, you've checked all your paths?

@moritzmorgenroth

moritzmorgenroth commented Oct 12, 2023

I ran into this problem when working with Firestore Point-in-time recovery (PITR) (which is an awesome beta feature! 🎉).

For me, the solution was to specify a readTime in the transaction that resolves to exactly a whole hour, i.e.

const q = firestore.collectionGroup("trips");
const querySnapshot = await firestore.runTransaction(
  (t) => t.get(q),
  { readOnly: true, readTime: new Timestamp(1696827600, 0) }
);

✅ works, but

const q = firestore.collectionGroup("trips");
const querySnapshot = await firestore.runTransaction(
  (t) => t.get(q),
  { readOnly: true, readTime: new Timestamp(1696827601, 0) }
);

❌ fails.

This is not mentioned in the docs; I will leave a comment there. ⛑️
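
For reference, a small sketch of snapping the readTime to the most recent whole hour, based on the observation above (the helper name is my own; it assumes the Timestamp class from @google-cloud/firestore):

import { Timestamp } from '@google-cloud/firestore';

// Round a Date down to the previous whole hour so the resulting readTime
// matches the whole-hour snapshots that the PITR read accepted above.
function wholeHourReadTime(date: Date = new Date()): Timestamp {
  const seconds = Math.floor(date.getTime() / 1000);
  return new Timestamp(seconds - (seconds % 3600), 0); // nanoseconds must be 0
}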

@hkchakladar

The issue arises randomly for me. I'm seeing it once or twice a month (out of ~10k in a month).

@vikasdduc

I'm getting this error when trying to delete a topic and subscription with the delete method. Can anyone help?
code: 7,
details: 'User not authorized to perform this action.',
metadata: Metadata { internalRepr: Map(0) {}, options: {} },
note: 'Exception occurred in retry method that was not classified as transient'

@sandhya1349

@vikasdduc, can you please let us know the module and Node versions?

@trevor-rex

@taeold Could we please re-open this issue? It appears to still be occurring for users. It happened to me yesterday on node 18 and firebase-functions@4.6.0

@olboghgc

olboghgc commented Apr 4, 2024

@taeold Could we please re-open this issue? It appears to still be occurring for users. It happened to me yesterday on node 18 and firebase-functions@4.6.0

same here

@nojaf

nojaf commented Jun 1, 2024

Got this today with node v20.11.0 and firebase-functions 5.0.1

at async runHTTPS (/home/nojaf/.bun/install/global/node_modules/firebase-tools/lib/emulator/functionsEmulatorRuntime.js:531:5)
    at async /home/nojaf/.bun/install/global/node_modules/firebase-tools/lib/emulator/functionsEmulatorRuntime.js:694:21 {
  code: 2,
  details: '',
  metadata: Metadata {
    internalRepr: Map(1) { 'content-type' => [Array] },
    options: {}
  },
  note: 'Exception occurred in retry method that was not classified as transient'
}

@haayhappen

Still happening to me as well cc @schmidt-sebastian

@dschnare

dschnare commented Jun 27, 2024

We are getting this error as well. It happened during a WriteBatch.commit while we were updating one document. We have been using @google-cloud/firestore@6.6.0.

EDIT
We did receive the error below beforehand. We use exponential backoff as a retry strategy when errors like this are caught, so we retried our update call, but then the "Exception occurred in retry method that was not classified as transient" error occurred.

firestoreRetryable retrying due to 13 INTERNAL Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)
