Request: Automatically retry failed function deploys that fail due to "You have exceeded your deployment quota" #2606

Closed
ValentinFunk opened this issue Sep 9, 2020 · 30 comments · Fixed by #3246
@ValentinFunk
Contributor

ValentinFunk commented Sep 9, 2020

Problem & Background

We have about 70 functions and deploy them all at once as part of CI. Whenever we deploy, a random subset of functions fails with "You have exceeded your deployment quota". This used to block us completely a few weeks ago, when there was still a quota on build time. Since builds moved to Cloud Build, as far as I can tell there are no more build-time limits and you simply pay for your build time. Which is awesome, no more blocking the build due to quotas!

Unfortunately there still seem to be other quotas or rate limits that cause the deploy to fail. From what I can tell, the cause might simply be too many write requests against the Cloud Functions API (or perhaps Cloud Build?).

The error message suggests deploying with --only. Unfortunately, it is not easy for us to split the deploy into separately deployed functions. It is also impossible when a dependency is updated or when we change a data model or utility library that is used by several functions. We cannot automatically determine which functions changed and have to be redeployed, so that analysis has to be done manually. This brings new pains, since automated deploys after code review become impossible.

Right now this means our pipelines all fail regularly, and we manually retry them until every function has deployed successfully once.

Suggestion

Since retrying the failed functions right after always works, my suggestion would be to retry failed deploys that hit code 8 here:

if (op.error.code === 8) {
  logger.debug(op.error.message);
  logger.info(
    "You have exceeded your deployment quota, please deploy your functions in batches by using the --only flag, " +
      "and wait a few minutes before deploying again. Go to " +
      clc.underline("https://firebase.google.com/docs/cli/#deploy_specific_functions") +
      " to learn more."
  );
} else {

I suppose this could be implemented here:

return op.retryFunction().then(function(res) {

If this would be a way forward I'm happy to create a PR for this.
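
A minimal sketch of the retry idea above, written in Python for illustration rather than in the CLI's own code. The names `deploy_with_retry` and `start_deploy` are hypothetical, and the operation is assumed to be a plain dict with an optional `error` entry; code 8 is the gRPC status RESOURCE_EXHAUSTED, the same code the snippet above checks.

```python
import time
from typing import Callable, Dict

# gRPC status code 8 is RESOURCE_EXHAUSTED, the code checked in the snippet above.
QUOTA_EXCEEDED_CODE = 8


def deploy_with_retry(
    start_deploy: Callable[[], Dict],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call start_deploy until it succeeds or fails for a non-quota reason.

    start_deploy is a hypothetical callable returning an operation-like dict,
    for example {"done": True} or {"error": {"code": 8, "message": "..."}}.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        op = start_deploy()
        error = op.get("error")
        if error is None:
            return op  # success
        if error.get("code") != QUOTA_EXCEEDED_CODE:
            return op  # a different failure: surface it without retrying
        if attempt < max_attempts - 1:
            # Exponential backoff between attempts: 1s, 2s, 4s, ...
            sleep(base_delay_s * 2 ** attempt)
    return op  # still rate limited after max_attempts
```

The `sleep` parameter is only there so the backoff can be replaced in tests; a real implementation would probably also add jitter so that many functions do not retry at the same moment.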

@samtstern
Contributor

@joehan is there something we could do in deploy to rate limit or retry in these cases?

@KyleFoleyImajion

I have 122 functions that I deploy, and after updating firebase-tools to v8.10.0, the CLI no longer tells me which functions failed to deploy because of the rate limiting. Having something that retries the failed functions would be fantastic.

@SiarheiBokuts

Agree, nice idea

@dzmitrynik

Definitely it would be awesome to have it. We have the same problem (we have 74 functions) and have to redeploy them manually. It's a big pain and stops us from doing automatic deployments for firebase functions.

@mvergarair

I have the same problem with approximately 90 functions. Not being able to rely on CI deploys is very limiting.

@ValentinFunk
Contributor Author

@joehan friendly ping, did you get a chance to take a look yet? I'm happy to take a stab at this in an MR if it has a chance of being accepted.

@joehan
Contributor

joehan commented Sep 28, 2020

Hey @kamshak, thanks for the reminder, this one slipped past me! With the recent switch to using Cloud Build, I don't see any reason we shouldn't add retries here. I would be more than happy to review a PR for this.

@KyleFoleyImajion

Not sure if this is a bug or not, but did anyone else notice that sometimes the CLI tells you which functions failed to deploy due to rate limiting, but then other times the CLI doesn't tell you (even though some failed due to rate limiting)?

@ValentinFunk
Contributor Author

@KyleFoleyImajion how many functions do you have in your project currently?

I've started looking into this and there are 2 cases when you can run into rate limits:

  1. Write API to update/create functions (done in the beginning of each deploy)
    This one was easy to solve without changing much of the code.
  2. Even if 1) succeeds you can get rate limits from the Cloud Build API
    A bit more tricky. You need to do a new wait request and poll the new long running task for status.
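
The second case can be sketched as a polling loop. This is only an illustration in Python: `wait_for_operation` and `get_operation` are hypothetical names, and the operation is assumed to be a dict shaped loosely like a long-running operation, with a `done` flag and an optional `error` carrying a numeric `code`.

```python
import time
from typing import Callable, Dict


def wait_for_operation(
    get_operation: Callable[[], Dict],
    poll_interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the operation finishes.

    Returns 'success', 'rate_limited', 'failed' or 'timeout'.
    get_operation is a hypothetical callable that fetches the current state.
    """
    for _ in range(max_polls):
        op = get_operation()
        if op.get("done"):
            error = op.get("error")
            if error is None:
                return "success"
            # Code 8 (RESOURCE_EXHAUSTED) is the quota error, which is worth retrying.
            return "rate_limited" if error.get("code") == 8 else "failed"
        sleep(poll_interval_s)
    return "timeout"
```

Returning a status string instead of logging inside the loop leaves the decision to retry, and what to print, to the caller.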

The solution is therefore a bit more complicated than I thought originally. The code right now is also written in a way that makes it a bit difficult to add retries. I've run into two problems:

  1. Logging / Messages: The logging is done in the API itself. For example, there is currently no way to update a function without having the error logged, even if you retry after that error and the error therefore doesn't need to be shown.
  2. Mutable State: A lot of the API methods mutate the objects you pass in, which means that, for example, an "error" property might stay after a retry, since the retried operation object only merges in new fields but doesn't delete old ones.
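
The mutable-state problem can be reproduced in a few lines. This is a Python illustration of the pattern, not the CLI's actual code; `merge_in_place` and `replace_state` are hypothetical helpers.

```python
from typing import Dict


def merge_in_place(operation: Dict, update: Dict) -> Dict:
    """Mimics the problematic pattern: new fields are merged in, old ones are never removed."""
    operation.update(update)
    return operation


def replace_state(operation: Dict, update: Dict) -> Dict:
    """Alternative: build a fresh object from the identity field plus the latest response."""
    return {"name": operation["name"], **update}


first_attempt = {"name": "fn-a", "error": {"code": 8}}
retry_response = {"done": True}

stale = merge_in_place(dict(first_attempt), retry_response)
fresh = replace_state(first_attempt, retry_response)

print("error" in stale)  # True: the error from the first attempt survived the retry
print("error" in fresh)  # False
```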

@joehan I'm not sure how far you are OK with larger changes to the codebase here. Currently I'm thinking the best solution would be to refactor it a little bit. My idea right now would be:

  • Support deploys for at least 200 functions. Print a warning when deploying more than 60 functions to inform users that using --only would be better.
  • Have no hard limit on how many functions can be deployed. Simply take longer.
  • Log when a build for a function has been started in cloud build and when it succeeds, so that users can see deploy progress. (similar to now - except that with concurrency limit you would get the log only when the update was started).
  • Log a deploy summary at the end (same as it is now)
  1. Treat a single function deploy as one operation that wraps everything needed to perform a deploy. Specifically:
    1. Do the create/update request against the cloud functions API to get an operation
    2. Poll the operation for error / success / rate limit
  2. Move logging out of the API client, and move the retry logic for errors that happen at the request level into the API client.
  3. Have a coordinator class/function that uses a Queue/Throttler that is used in other parts of the API to control concurrency and retries for deploys and long running operations. Concurrency would be set to 60 (build requests per minute for Cloud Build API).
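
A rough sketch of the coordinator from step 3, in Python and using a thread pool instead of the CLI's existing Queue/Throttler. `deploy_all` and `deploy_one` are hypothetical names; `deploy_one` stands for the single wrapped operation from step 1 and is assumed to return 'success', 'rate_limited' or 'failed'.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def deploy_all(
    function_names: Iterable[str],
    deploy_one: Callable[[str], str],
    concurrency: int = 60,
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Deploy every function with at most `concurrency` deploys in flight.

    Functions that come back 'rate_limited' are retried in a later round,
    up to max_attempts rounds in total. Returns the last outcome per function.
    """
    results: Dict[str, str] = {}
    pending: List[str] = list(function_names)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(max_attempts):
            if not pending:
                break
            outcomes = list(pool.map(deploy_one, pending))
            results.update(zip(pending, outcomes))
            # Only rate-limited functions go into the next round.
            pending = [n for n, o in zip(pending, outcomes) if o == "rate_limited"]
    return results
```

The returned dict is enough to print the deploy summary at the end, which keeps logging out of the deploy logic itself. Note that a pool size caps concurrency, not requests per minute, so a real implementation would likely also wait between rounds.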

@KyleFoleyImajion

Currently at 119 functions, and I deploy using the firebase deploy --only functions. We don't have it automated at this point, but will be moving to that in the future.

Thanks for taking a look at this issue!

@joehan
Contributor

joehan commented Oct 12, 2020

@kamshak I'm open to larger changes in this part of the codebase; it's one of our older and less healthy flows at this point. Fair warning: I am planning on making some large refactors to this path later this year that might make these suggested changes obsolete. I totally understand if that changes your appetite for refactoring this.

To your specific suggestions:

Logging / Messages: The logging is done in the API itself. For example, there is currently no way to update a function without having the error logged, even if you retry after that error and the error therefore doesn't need to be shown.
Mutable State: A lot of the API methods mutate the objects you pass in, which means that, for example, an "error" property might stay after a retry, since the retried operation object only merges in new fields but doesn't delete old ones.

Agreed that these, along with the heavy reliance on long promise chains, are the main problems with this code.

Log when a build for a function has been started in cloud build and when it succeeds, so that users can see deploy progress. (similar to now - except that with concurrency limit you would get the log only when the update was started).

I'm not sure if you'll be able to see when the Cloud Build itself starts from the CRUDFunction and GetOperation calls. In the interest of being as clear as possible, I think the logging here should be more along the lines of "Function deploy started" as opposed to "cloud build started". We also still support Node8 deploys, which don't use Cloud Build in the same way.

The rest of the design sounds good to me at a high level, particularly reusing the Queue/Throttler code.

@spencerwhyte
Contributor

We have the same problem with only about 30 functions. In our case we deploy several times per day, or even several times per hour if we are really moving. We also use Cloud Build for other things (Cloud Run, Google App Engine), which perhaps makes the problem worse.

@emadalam

We have been facing the same issue lately. With around 60 functions that we deploy automatically from a CI/CD pipeline, the deployment fails randomly for 1-3 functions. However, checking the limits in GCP clearly shows that the rate limit was never hit for any of the deployments, so I'm not sure what's going on here.

We have 2 identical projects, one for staging and one for production. So far this issue seems to happen only on the staging project, and everything seems to work fine on the production setup. It's a bit cryptic to debug and resolve this issue without knowing where the underlying problem could be. We have had issues with Cloud Functions in the past, and most of the time it turned out to be an issue on the GCP side itself, which was confirmed by the Firebase Support Team. Not sure whether we should already reach out to the Firebase Support Team or what 🤷

@SamLoy

SamLoy commented Dec 30, 2020

Has anyone explored the idea of querying the Functions API for a checksum for the function? Then using the function's hash, calculate if the local version has changed and only upload modified functions.

If a hash isn't available, then perhaps the Functions API has something similar?

I know this wouldn't fix the main issue raised in the ticket here, but for me I think a process which only publishes updated code would solve 90% of the quota issues that I experience.
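
A minimal sketch of the comparison step, in Python. Whether the Functions API exposes a usable checksum is not established in this thread, so the sketch assumes the hashes of the deployed versions are stored somewhere you control (for example a file written after each successful deploy). `source_hash` and `changed_functions` are hypothetical helpers.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def source_hash(source_dir: str) -> str:
    """Hash the relative path and contents of every file under source_dir, in sorted order."""
    digest = hashlib.sha256()
    root = Path(source_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


def changed_functions(local: Dict[str, str], deployed: Dict[str, str]) -> List[str]:
    """Return the names whose local hash differs from, or is missing in, the deployed hashes."""
    return sorted(name for name, h in local.items() if deployed.get(name) != h)
```

For this to be safe, the files hashed for each function have to include everything that function depends on, such as shared utility code and the dependency lockfile. Otherwise a change in a shared library would not mark the functions that use it as modified.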

@rhodgkins
Contributor

Despite using the new update, I'm still getting this issue on 40%-60% of deployments... :(

@astefer

astefer commented Oct 29, 2021

This is still happening to me @joehan

⚠  functions: got "Quota Exceeded" error while trying to update [redacted]. Waiting to retry...
Error: There was an error deploying functions:
- Error Failed to update function [redacted] in region europe-west1

@KyleFoleyImajion

This just started happening to me as well again, was working perfectly for a while.
firebase --version 9.21.0

@sgilbert

I am having similar issues and it doesn't seem to be retrying anymore.

@antonstefer

This also has some weird side effects. Normally, when I deploy an onCall function for the first time, it has allow unauthenticated = true by default. Now, when a function fails to deploy because of this quota exceeded issue and is then redeployed manually, this is not the case.
Could you please look into this or assign someone again? @samtstern
I think this should be reopened.

@mvergarair

I'm having the same problem. This was fixed but now it's happening again.

@42ae

42ae commented Dec 17, 2021

I am having the same issue as @antonstefer. When deploying about 70 GCF, some of them fail with functions: got "Quota Exceeded" error while trying to update, and the permission principal allUsers with role Cloud Functions Invoker disappears from those Cloud Functions. This side effect causes CORS errors in the client app when it tries to call the functions that failed during deployment.

@oande

oande commented Jan 18, 2022

Still looks like an issue... Does Google/Firebase team have any recommendations here? Never deploy all functions at the same time?

Deploying 55 functions as part of our CI/CD, some functions randomly fail to deploy with
functions: got "Quota Exceeded" error while trying to update

@julschwarz

Running into the same issue with about 130 GCFs. For function groups, the official recommendation says not to deploy more than 10 at a time. Deploying in batches fixes the CI deploy, but the pipeline takes forever.

@SamLoy

SamLoy commented Feb 24, 2022

This is what I do. I love shipping my software 10% at a time

@dtran320

We're running into this issue consistently when deploying 72 functions using firebase-functions 3.15.7 from GitHub Actions. We will try bumping the version to 3.22.0 and then try deploying in batches, but it would be nice if we didn't have to slow down the deploy even more.

@timminata

We also have this issue when deploying about 90 functions, currently on 4.1.0 of firebase-functions. Would love a solution or recommendation.

@charlespsdowd

This issue has been going on since 2020.
As Google Cloud and Firebase team are aware, the --only suggestion is unusable for most production systems.
We will happily PAY for the build and reply lists to be raised but there is nowhere to up the quota.
Any help on how to expand this quota would be very helpful.

@MatteoAntolini

MatteoAntolini commented Mar 3, 2023

Here is a simple Python script to deploy all of your functions in batches of 10:

import datetime
import os
import re
import subprocess

BATCH_SIZE = 10
LOG_DIR = 'deploy_logs'

# Read functions/index.js file
with open('./functions/index.js', 'r') as f:
    index_file = f.read()

# Scrape function names from lines such as "exports.myFn = functions. ..."
function_names = re.findall(r'exports\.([^\s]+)\s*=\s*functions\.', index_file)

# Deploy Firebase functions in batches of 10
def deploy_functions():
    # Make sure the log directory exists before opening log files in it
    os.makedirs(LOG_DIR, exist_ok=True)
    batches = [function_names[i:i+BATCH_SIZE] for i in range(0, len(function_names), BATCH_SIZE)]
    for batch in batches:
        functions_str = ','.join([f'functions:{fn}' for fn in batch])
        command = f'firebase deploy --only "{functions_str}"'
        print(f'Deploying functions: {", ".join(batch)}...')
        print(command)

        # Name the log file after the current timestamp; seconds are included
        # so that two batches finishing within the same minute do not share a file
        timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
        log_file = os.path.join(LOG_DIR, f'{timestamp}.log')
        try:
            # Execute command and write output to log file
            with open(log_file, 'w') as f:
                subprocess.run(command, shell=True, check=True, stdout=f, stderr=subprocess.STDOUT)

            print(f'Functions: {", ".join(batch)} deployed successfully. Log file: {log_file}')
        except subprocess.CalledProcessError as error:
            # A failed batch is reported and the script moves on to the next one
            print(f'Error deploying functions: {batch}. Log file: {log_file}')
            print(f'{error}')

deploy_functions()

It took 25 minutes to deploy 125 functions

@radarcontact

Having the same problem with 300 functions

@ser60

ser60 commented Feb 15, 2024

Initially the 'retry' update fixed this for us, but the errors have recently resurfaced despite the retries. Deploying around 100 functions
