Downtime when deploying new release #191
Comments
EDIT 2020: CapRover now has zero downtime: #661 (comment)

CaptainDuckDuck/Docker does not stop your application while it's building (compiling) the new version. You are seeing 502 when the previous version is stopped and the new version is booting up. The same 502 can happen when your application crashes: right after the crash, a new instance of your application gets created and deployed, and during that transition 502 errors will be seen. This can be reduced by creating multiple instances of your app (preferably on different worker nodes).

The issue you're describing and the solution you're suggesting are not doable at the CaptainDuckDuck level. There are many applications that should never run as multiple instances, because they might corrupt data. For example, consider a database with a persistent directory. If a new version of the DB gets deployed and started before the old version is killed, both access the same files, and that will irrecoverably corrupt your data.

On top of this, there are DNS issues with the proposed solution. CaptainDuckDuck uses the Docker local DNS server to resolve the IP addresses of containers. Hardcoding IP addresses in nginx would make nginx lose the connection once your container gets restarted due to an update or system reboot.

Also, "success state" is a relative term. Success is when your webapp responds with 200 to a request, but what counts as success differs from app to app.

To sum up, what you are suggesting is surely doable, but not with Docker Swarm + CaptainDuckDuck. It requires a more complex system, with a container registry, customized health/success state checks, etc. If having "zero" downtime is essential to your business, CaptainDuckDuck is not the right product for you; I suggest you explore enterprise-level deployment tools such as Octopus or Puppet/Chef.

Note that with the current implementation of CaptainDuckDuck, your downtime between updates should be less than 10s. If you need less than that, you can run multiple instances of your services and use rolling updates (search for "Docker rolling updates"). Also, you can use a customized 502 error page to make it look much better.
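For reference, the "multiple instances + rolling updates" route looks roughly like this with the plain Docker CLI. This is only a sketch; the service name is a placeholder (CapRover-managed services are normally named `srv-captain--<appname>`):

```
# Run two replicas so one stays up while the other is replaced,
# and replace them one at a time, starting the new task before stopping the old one.
docker service update \
  --replicas 2 \
  --update-parallelism 1 \
  --update-order start-first \
  srv-captain--my-app
```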
I feel like something similar to Dokku's zero-downtime checks should require some opt-in configuration, but it doesn't feel out of scope. I'm interested in implementing this, but I need to get more familiar with the codebase first as I move over from Dokku.
CHECKS operates on a different level. CapRover uses Swarm. It also uses the Docker DNS server to resolve container IPs. Refactoring CapRover so that it doesn't rely on Docker DNS for IP resolution is just a bad idea: we'd lose lots of features that we get for free from Docker DNS, and what we'd gain is a small win that isn't even visible in most cases.
I'm okay with a fraction of a second of downtime, but my experience so far has been 10-30 seconds, so perhaps I have something else misconfigured. I'll investigate more and report back.
Does your app use "persistent data" by any chance?
Apps that are flagged as persistent use a stop-first update strategy. Again, this is intentional, because otherwise we could have two instances accessing the same filesystem, and that would cause data corruption.
@githubsaturn yep, that was it. It was an app designed to use a shared FS, and it makes sense that stop-first would be the default for that case.
Sorry to bring up an old thread, but I am also seeing about 10-30 seconds of downtime (my start command takes < 2 seconds to run). I don't have persistent storage, but I do have clustering set up. I noticed that sometimes I get the correct response and sometimes I get 502s (presumably from the cluster nodes starting up). After about 30 seconds, I consistently get the correct response. If I had to guess, it's that the cluster servers are downloading and starting the new Docker image, but for some reason they aren't serving the old instance while that is happening. Any idea what I could do to fix this? Or is there additional info I can give to help solve this?
@nahtnam can you share your Docker service output for the app when this happens?
Sorry for the late response, I ended up switching to Dokku. But I re-created the cluster and deployed it again. Here is what the service command output looks like while the URL is sometimes returning the correct response and sometimes a 502:
Note: I had initially deployed it with 3 nodes and that didn't have the issue. Then in the settings I upped it to 5 and started seeing it, which is why the output for the first 3 looks different from the last two.
Interesting, I don't see any weirdness in your Docker output. Basically, if we take out the services that are shut down, everything else is just healthy:
My guess is that this is an issue with Swarm DNS. During the update, when a request comes in, Swarm DNS provides 5 IP addresses, and some of those IP addresses are going to die within a few seconds, during the lifespan of the request. If the container gets updated before the request finishes processing, you'd get a 502 error. But again, 30 seconds is just way too long. I can see that the containers on all nodes were updated 13-18 seconds before, so the max downtime should be 5 seconds or so.
Hmmm, is there any solution for this? Maybe a healthcheck to ensure that it's accepting connections before killing the old containers? I'm not very familiar with Docker Swarm, so I don't know.
Docker Swarm has checks and rolling update delays. When you push an update, you can have it wait for N minutes/hours after the first task before redeploying the others; it monitors during that window, and if an error appears, it aborts the update process and reverts. After success, there is another per-instance update delay. Shouldn't we be able to take advantage of all of this automatically?
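To make that concrete, here is a minimal sketch of those knobs with the plain Docker CLI; the image and service names are placeholders:

```
# Roll out one task at a time, wait 30s between tasks, watch each new task
# for 1 minute, and automatically roll back if it fails during that window.
docker service update \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-monitor 1m \
  --update-failure-action rollback \
  --image registry.example.com/my-app:v2 \
  my-app
```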
I don't think it helps with this scenario: if the request is already being processed by a container and the container dies mid-request, that request is lost regardless. Regarding the checks in Docker Swarm: yes, CapRover respects that. You can either manually set a service override, or add a Docker healthcheck to your image.
Hmm, I'm not sure that makes sense. I kept spamming refresh and I'd keep getting 502s randomly. If it just happened for one request I wouldn't mind, but it seems that the container is being killed before the DNS update happens, so traffic is being sent to a dead container.
What doesn't make sense? The Docker healthcheck? Or the service override? As a test, can you try reducing the TTL for DNS in your nginx config to something really small, like 1s? Go to Apps > Your App > EDIT NGINX CONFIG and change the time from 10s to 1s (the default comes from caprover/template/server-block-conf.ejs, line 53 at cc8559b).
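The directive in question looks roughly like this (a sketch, not the exact template line):

```
# Docker's embedded DNS lives at 127.0.0.11; "valid" is how long nginx
# caches the resolved container IP before asking again.
resolver 127.0.0.11 valid=1s;
```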
Sorry, I meant that "if the request is already being processed by a container and it dies while it's processing the request" doesn't make sense as the explanation, because multiple new requests also die. (I'm not really experienced in this kind of stuff, so I don't know, to be honest.) Anyways, I did a bit more digging (I didn't end up testing the nginx config thing). I installed LogDNA and made sure to include the logs from when the server starts as well. There is about a 1-second delay between when the server sends the logs and when they show up in LogDNA. When I deploy a new app, it takes about 10-15 seconds to even see the "starting server" log, so I have a feeling it has nothing to do with the app starting; it has to do with the servers pulling the container from the registry (I'm using the built-in CapRover one) and running the image. In those 10-15 seconds I'm seeing 502 errors instead of getting responses from the old containers. Does this information help in any way?
Just tried it with a small app and it doesn't happen. I think to reproduce it, you need a huge app (size-wise). To reproduce, maybe install a bunch of random deps like gatsby, webpack, etc.
Adding more dependencies only increases the build time; after that, things should be relatively the same. The only other explanation is that your application just takes a long time to start up. In that case, you can simply use a Docker healthcheck in your Dockerfile (example), which signals to Docker when to consider your container alive.
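A minimal sketch of such a healthcheck in a Dockerfile; the port and path are placeholders, and it assumes `curl` is available in the image:

```
# Consider the container healthy only once the app answers HTTP requests;
# Swarm waits for this before treating the new task as running.
HEALTHCHECK --interval=5s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:3000/ || exit 1
```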
Doesn't a larger Docker image take longer to download and start? My server takes 6 seconds to start, pulled from LogDNA:
However, it takes about 20 seconds for those logs to show up, so I think something else is the issue. 5 seconds of downtime isn't bad, but it's generally a lot longer than that. I will try the healthcheck and see if that helps.
I'm looking at CapRover again, and I was wondering if the following is supported: https://medium.com/cherrychain/zero-downtime-deployment-with-docker-swarm-61b2cc3d4ae5 More specifically, can I do this somehow?
Yes, you can definitely do this. You can use Service Override to specifically override these values in your service. The Docker API schema that you should use is here: https://docs.docker.com/engine/api/v1.40/#operation/ServiceUpdate
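As a sketch, this is the kind of snippet you could paste into the Service Update Override for that. Field names follow the Docker service spec; `Delay` and `Monitor` are expressed in nanoseconds in the API, so double-check the values against the schema linked above:

```
{
  "UpdateConfig": {
    "Parallelism": 1,
    "Delay": 10000000000,
    "Monitor": 15000000000,
    "Order": "start-first",
    "FailureAction": "rollback"
  }
}
```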
Awesome! I'll give that a shot. The other (harder) question is: is there a way to get notified if the deploy fails and rolls back? I believe the other servers in the cluster download the image after the deploy command exits. Additionally, if I scale to 5 and one instance fails for some reason, does the whole thing roll back to the previous image, or just the one that failed?
No, the failures happen asynchronously as you said.
The whole thing (if by that you mean all instances of the same service). Note that this is all Docker behavior and there is nothing specific to CapRover here. If you find an application that monitors your Docker services, it will work just fine with CapRover.
Problem:
After deploying a new release to any app, there is downtime while the new version is compiling and the Docker container is restarting, so a 502 Bad Gateway is shown.
Suggested solution:
Implementation of Blue-Green deployment.
Scenario:
Application is running on version 1.
Changes to the app are made and a new version is deployed. Before shutting down container v1, CaptainDuckDuck builds the new container v2 and waits for a success state.
Once the success state is reached, routing is switched to container v2.
Now we could shut down container v1. The image is kept, since it was the latest stable version? So maybe it could be nice to have a button on the Dashboard to trigger a "quick rollback" in case something went wrong and we want to bring container v1 back again?
If everything went correctly, container v1 should be shut down, and the new container v2 should be up and running without any downtime.
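To illustrate the routing swap I mean, here is a rough sketch in nginx terms. The names are made up and this is not how CaptainDuckDuck generates its config today; it's just the idea:

```
# "Blue" (v1) keeps serving until "green" (v2) passes its checks;
# then the upstream is switched to point at v2 and v1 can be stopped.
upstream myapp_live {
    server myapp-v2:80;   # previously myapp-v1:80, swapped once v2 is healthy
}

server {
    listen 80;
    location / {
        proxy_pass http://myapp_live;
    }
}
```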
Any suggestions? Maybe there might be a better solution for this?