Downtime when deploying new release #191
Comments
EDIT 2020: CapRover now has zero downtime: #661 (comment)

CaptainDuckDuck/Docker does not stop your application while it's building (compiling) the new version. You are seeing 502 when the previous version is stopped and the new version is booting up. The same 502 can happen when your application crashes: right after the crash, a new instance of your application gets created and deployed, and during that transition 502 errors will be seen. This can be reduced by creating multiple instances of your app (preferably on different worker nodes).

The issue you're describing and the solution you're suggesting are not doable at the CaptainDuckDuck level. There are many applications that should never run as multiple instances, because they might corrupt data. For example, consider a database with a persistent directory. If a new version of the DB gets deployed and started before the old version is killed, both access the same files, and that will irrecoverably corrupt your data.

On top of this, there are DNS issues with the proposed solution. CaptainDuckDuck uses the Docker local DNS server to resolve the IP addresses of containers. Hardcoding IP addresses in nginx would make nginx lose the connection once your container gets restarted due to an update or system reboot.

Also, "success state" is a relative term. Success is when your webapp responds with 200 to a request, but what counts as success differs from app to app.

To sum up, what you are suggesting is surely doable, but not with Docker Swarm + CaptainDuckDuck. It requires a more complex system, with a container registry, customized health/success state checks, etc. If having "zero" downtime is essential to your business, CaptainDuckDuck is not the right product for you; I suggest you explore enterprise-level deployment tools such as Octopus or Puppet/Chef.

Note that with the current implementation of CaptainDuckDuck, your downtime between updates should be less than 10s. If you need less than that, you can run multiple instances of your services and use rolling updates (search for "Docker rolling updates"). Also, you can use a customized 502 error page to make it look much better.
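For reference, the "multiple instances + rolling updates" route looks roughly like this with the plain Docker CLI. This is only a sketch; the service name is a placeholder (CapRover-managed services are normally named `srv-captain--<appname>`):

```
# Run two replicas so one stays up while the other is replaced,
# and replace them one at a time, starting the new task before stopping the old one.
docker service update \
  --replicas 2 \
  --update-parallelism 1 \
  --update-order start-first \
  srv-captain--my-app
```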
I feel like something similar to Dokku's zero-downtime checks should require some opt-in configuration, but it doesn't feel out of scope. I'm interested in implementing this, but I need to get more familiar with the codebase first as I move over from Dokku.
CHECKS operates on a different level. CapRover uses Swarm. It also uses the Docker DNS server to resolve container IPs. Refactoring CapRover so that it doesn't rely on Docker DNS for IP resolution is just a bad idea: we'd lose lots of features that we get for free from Docker DNS, and what we'd gain is a small win that isn't even visible in most cases.
I'm okay with a fraction of a second of downtime, but my experience so far has been 10-30 seconds, so perhaps I have something else misconfigured. I'll investigate more and report back.
Does your app use "persistent data" by any chance?
Apps that are flagged as persistent use a stop-first update strategy. Again, this is intentional, because otherwise we could have two instances accessing the same filesystem, and that would cause data corruption.
@githubsaturn yep, that was it. It was an app designed to use a shared FS, and it makes sense that stop-first would be the default for that case.
Sorry to bring up an old thread, but I am also seeing about 10-30 seconds of downtime (my start command takes < 2 seconds to run). I don't have persistent storage, but I do have clustering set up. I noticed that sometimes I get the correct response and sometimes I get 502s (presumably from the cluster nodes starting up). After about 30 seconds, I consistently get the correct response. If I had to guess, it's that the cluster servers are downloading and starting the new Docker image, but for some reason they aren't serving the old instance while that is happening. Any idea what I could do to fix this? Or is there additional info I can give to help solve this?
@nahtnam can you share your Docker service output for the app when this happens?
Sorry for the late response, I ended up switching to Dokku. But I re-created the cluster and deployed it again. Here is what the service command output looks like while the URL is sometimes returning the correct response and sometimes a 502:
Note: I had initially deployed it with 3 nodes and that didn't have the issue. Then in the settings I upped it to 5 and started seeing it, which is why the output for the first 3 looks different from the last two.
Interesting, I don't see any weirdness in your Docker output. Basically, if we take out the services that are shut down, everything else is just healthy:
My guess is that this is an issue with Swarm DNS. During the update, when a request comes in, Swarm DNS provides 5 IP addresses, and some of those IP addresses are going to die within a few seconds, during the lifespan of the request. If the container gets updated before the request finishes processing, you'd get a 502 error. But again, 30 seconds is just way too long. I can see that the containers on all nodes were updated 13-18 seconds before, so the max downtime should be 5 seconds or so.
Hmmm, is there any solution for this? Maybe a healthcheck to ensure that it's accepting connections before killing the old containers? I'm not very familiar with Docker Swarm, so I don't know.
Docker Swarm has checks and rolling update delays. When you push an update, you can have it wait for N minutes/hours after the first task before redeploying the others; it monitors during that window, and if an error appears, it aborts the update process and reverts. After success, there is another per-instance update delay. Shouldn't we be able to take advantage of all of this automatically?
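To make that concrete, here is a minimal sketch of those knobs with the plain Docker CLI; the image and service names are placeholders:

```
# Roll out one task at a time, wait 30s between tasks, watch each new task
# for 1 minute, and automatically roll back if it fails during that window.
docker service update \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-monitor 1m \
  --update-failure-action rollback \
  --image registry.example.com/my-app:v2 \
  my-app
```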
I don't think it helps with this scenario: if the request is already being processed by a container and the container dies mid-request, that request is lost regardless. Regarding the checks in Docker Swarm: yes, CapRover respects that. You can either manually set a service override, or add a Docker healthcheck to your image.
Hmm, I'm not sure that makes sense. I kept spamming refresh and I'd keep getting 502s randomly. If it just happened for one request I wouldn't mind, but it seems that the container is being killed before the DNS update happens, so traffic is being sent to a dead container.
What doesn't make sense? The Docker healthcheck? Or the service override? As a test, can you try reducing the TTL for DNS in your nginx config to something really small, like 1s? Go to Apps > Your App > EDIT NGINX CONFIG and change the time from 10s to 1s (the default comes from caprover/template/server-block-conf.ejs, line 53 at cc8559b).
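The directive in question looks roughly like this (a sketch, not the exact template line):

```
# Docker's embedded DNS lives at 127.0.0.11; "valid" is how long nginx
# caches the resolved container IP before asking again.
resolver 127.0.0.11 valid=1s;
```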
Sorry, I meant that "if the request is already being processed by a container and it dies while it's processing the request" doesn't make sense as the explanation, because multiple new requests also die. (I'm not really experienced in this kind of stuff, so I don't know, to be honest.) Anyways, I did a bit more digging (I didn't end up testing the nginx config thing). I installed LogDNA and made sure to include the logs from when the server starts as well. There is about a 1-second delay between when the server sends the logs and when they show up in LogDNA. When I deploy a new app, it takes about 10-15 seconds to even see the "starting server" log, so I have a feeling it has nothing to do with the app starting; it has to do with the servers pulling the container from the registry (I'm using the built-in CapRover one) and running the image. In those 10-15 seconds I'm seeing 502 errors instead of getting responses from the old containers. Does this information help in any way?
Just tried it with a small app and it doesn't happen. I think to reproduce it, you need a huge app (size-wise). To reproduce, maybe install a bunch of random deps like gatsby, webpack, etc.
Adding more dependencies only increases the build time; after that, things should be relatively the same. The only other explanation is that your application just takes a long time to start up. In that case, you can simply use a Docker healthcheck in your Dockerfile (example), which signals to Docker when to consider your container alive.
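A minimal sketch of such a healthcheck in a Dockerfile; the port and path are placeholders, and it assumes `curl` is available in the image:

```
# Consider the container healthy only once the app answers HTTP requests;
# Swarm waits for this before treating the new task as running.
HEALTHCHECK --interval=5s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:3000/ || exit 1
```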
Doesn't a larger Docker image take longer to download and start? My server takes 6 seconds to start, pulled from LogDNA:
However, it takes about 20 seconds for those logs to show up, so I think something else is the issue. 5 seconds of downtime isn't bad, but it's generally a lot longer than that. I will try the healthcheck and see if that helps.
I'm looking at CapRover again, and I was wondering if the following is supported: https://medium.com/cherrychain/zero-downtime-deployment-with-docker-swarm-61b2cc3d4ae5 More specifically, can I do this somehow?
Yes, you can definitely do this. You can use Service Override to specifically override these values in your service. The Docker API schema that you should use is here: https://docs.docker.com/engine/api/v1.40/#operation/ServiceUpdate
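As a sketch, this is the kind of snippet you could paste into the Service Update Override for that. Field names follow the Docker service spec; `Delay` and `Monitor` are expressed in nanoseconds in the API, so double-check the values against the schema linked above:

```
{
  "UpdateConfig": {
    "Parallelism": 1,
    "Delay": 10000000000,
    "Monitor": 15000000000,
    "Order": "start-first",
    "FailureAction": "rollback"
  }
}
```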
Awesome! I'll give that a shot. The other (harder) question is: is there a way to get notified if the deploy fails and rolls back? I believe the other servers in the cluster download the image after the deploy command exits. Additionally, if I scale to 5 and one instance fails for some reason, does the whole thing roll back to the previous image, or just the one that failed?
No, the failures happen asynchronously as you said.
The whole thing (if by that you mean all instances of the same service). Note that this is all Docker behavior and there is nothing specific to CapRover here. If you find an application that monitors your Docker services, it will work just fine with CapRover.
Problem:
After deploying a new release to any app, there is downtime while the new version is compiling and the Docker container is restarting, so a 502 Bad Gateway is shown.
Suggested solution:
Implementation of Blue-Green deployment.
Scenario:
Application is running on version 1.
Changes to the app are made and a new version is deployed. Before shutting down container v1, CaptainDuckDuck builds the new container v2 and waits for a success state.
Once the success state is reached, routing is switched to container v2.
Now we could shut down container v1. The image is kept, since it was the latest stable version? So maybe it could be nice to have a button on the Dashboard to trigger a "quick rollback" in case something went wrong and we want to bring container v1 back again?
If everything went correctly, container v1 should be shut down, and the new container v2 should be up and running without any downtime.
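To illustrate the routing swap I mean, here is a rough sketch in nginx terms. The names are made up and this is not how CaptainDuckDuck generates its config today; it's just the idea:

```
# "Blue" (v1) keeps serving until "green" (v2) passes its checks;
# then the upstream is switched to point at v2 and v1 can be stopped.
upstream myapp_live {
    server myapp-v2:80;   # previously myapp-v1:80, swapped once v2 is healthy
}

server {
    listen 80;
    location / {
        proxy_pass http://myapp_live;
    }
}
```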
Any suggestions? Maybe there might be a better solution for this?