
ps:scale implementation #1117

Closed
michaelshobbs opened this issue Apr 17, 2015 · 20 comments · Fixed by #1118
@michaelshobbs
Member

I believe this initially started with wanting to provide a core method of scaling processes on the same machine. #298 was the first attempt at this implementation but was not accepted, as we let it sit around for too long (I think).

Some questions come up in the new world of dokku, including things like zero-downtime checks and dockerfile deployments.

Issue/Question 1: Currently the zero-downtime checks rely on a web service being accessible for dokku to 'ping'. Therefore, if we run each Procfile entry in its own container, things like workers can't be checked in the same manner.

There are two potential ways to go here. The first would be to scale at the container level, somehow detect that a listener is running, and then run checks; otherwise, fall back to the default container 'uptime' check we run when an app does not have a CHECKS file. The second would be to follow a pattern similar to the dokku-supervisor plugin and use a process manager in a single container to handle scaling.
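
For reference, the process-manager route boils down to baking something like this into the single app container (program names and commands here are illustrative, not taken from the plugin):

```shell
# Sketch of the dokku-supervisor pattern: one container, with supervisord
# fanning out to the Procfile-style processes. Names/commands are examples.
cat > /etc/supervisor/conf.d/app.conf <<'EOF'
[program:web]
command=node server.js
autorestart=true

[program:worker]
command=node worker.js
autorestart=true
EOF
exec supervisord --nodaemon
```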

Issue/Question 2: How should we scale dockerfile deployments?
I don't use this method today and am not sure I have a full grasp of the implementation details. However, I think these can 'easily' be scaled using the multiple-container approach.
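
For illustration, the multiple-container approach would presumably end up looking something like this on the CLI, with one container per Procfile entry and N containers per process type (the exact syntax is what this issue needs to settle):

```shell
# hypothetical ps:scale invocation; app name, counts, and syntax are a sketch
dokku ps:scale myapp web=2 worker=1
```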

refs: #733

ping @progrium @josegonzalez @joshco

@joshco
Contributor

joshco commented Apr 17, 2015

Here are my initial, somewhat rambling thoughts.
My current inclination is to avoid moving too quickly here. I've only just run into the issue, and it's a complicated one with multiple facets.

IMHO, the urgent issue is that there is no way to start the non-web processes.

That's a real blocker, as many common types of applications just can't be deployed with dokku without it.
However, the supervisord solution seems really good. I'm going to use that and study/observe to gain insight.

While it might not be our preferred long term strategic approach, it's a good, solid solution people can use immediately and it gives us time to reason about a dokku solution. Are there any pressing downsides that require urgency?

As @michaelshobbs mentions, there are a number of issues that tie into this, like CHECKS and the choice of a scaling model (container vs. process).

Re Question 1
Yes, multi-process apps are even more difficult for dokku to generically assess for availability. Our current model, where the developer decides what the right checks are for their application's process set within the container, seems like the pragmatic approach.
I don't need dokku assessing the state of every process in the container.

As far as CHECKS go, I think we want to be careful to avoid being so CHECKY that a dev can't deploy an app that isn't working yet.
E.g., I'm pushing a new app that is still in early development, and I need to set up the database or other services in arbitrary order. I want dokku to just ignore any problems and accept the push without resisting.
Call this the "I am ok with downtime" deployment mode.

I worry that if we go overboard on multiprocess CHECKS, the checks will be very fragile; the slightest problem, or more likely the learning curve for dokku users, will lead to mistakes, and pushing an app to dokku will become a huge hassle. (Which seems to me to be the opposite of what we're after.)

I'm curious to see how the CHECKS are received by other users. For one thing, I'm sure we're going to get repeated versions of this question:

Why does Dokku wait for 35 seconds #1113

On that note, the default check could probably benefit from some logging pointing users towards more efficient checks, e.g. "You are using the default and likely inefficient availability CHECK. See the documentation to learn how to tune the CHECKS for your app." (If folks agree, I will take care of this.)

@michaelshobbs
Member Author

Yeah, I don't think we're in any rush at all, actually. I do think, however, it's been on the back burner for quite a long time, so having the discussion again seems prudent.

Regarding checks and early-stage dev, I think we cover this case well: if a dev does not include a CHECKS file, we just make sure the container stays up. Would you agree?

My thinking on non-web process containers is that we would just use the default check to start. Any suggestions on implementing custom checks would be great too, but probably not necessary in the initial pass. The issue here is: if we do have a CHECKS file, how do we know which container to execute against?

Also, I think printing out some message with the URL to the checks docs would be great in the case of no CHECKS file.

@michaelshobbs
Member Author

So in doing some rubber-duck "debugging", I had the thought that we might just adjust the CHECKS file format to include the process name, which would map to a container that we'd test against.
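
Something like this, maybe (the format, paths, and expected content below are all hypothetical at this point):

```shell
# hypothetical CHECKS format: prefix each check with the Procfile process
# name so we know which container to run it against
#   <process> <path> <expected content>
web /          Welcome
web /healthz   ok
```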

@josegonzalez
Member

Are there any services on heroku with a "checks" type functionality? I'm thinking about how we might want to abstract this so that it would work regardless of application type.

@josegonzalez
Member

Another thing: Can we have multiple http processes? How do we handle exposing multiple ports properly?

@josegonzalez
Member

I think a 10-second wait rather than 35 is a much better default (or even 5). @joshco feel free to PR the logging output change.

@michaelshobbs
Member Author

  1. This? https://devcenter.heroku.com/articles/preboot
  2. I think there are a few different scenarios (i.e. are we binding externally, and are we using nginx-vhosts? see the nginx sketch below).
  3. I'm down with changing the default naive check to 10 seconds. I was also thinking we may want to implement retries in the default case, so something like a 5-second timeout with 6 retries... or 3 and 10, maybe? Thoughts?
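
For the nginx-vhosts scenario in point 2, I'd imagine multiple containers for a single web process being pooled into one upstream; a rough sketch with made-up addresses (dokku's actual template handling may differ):

```shell
# sketch: pool several web containers behind a single vhost
cat > /etc/nginx/conf.d/myapp.conf <<'EOF'
upstream myapp { server 172.17.0.2:5000; server 172.17.0.3:5000; }
server {
  listen 80;
  server_name myapp.example.com;
  location / { proxy_pass http://myapp; }
}
EOF
```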

@michaelshobbs
Member Author

Derp, scratch that last one. Not sure what I was thinking. It's either going to stay up or not. 10 seconds is good with me. I'll include it in the forthcoming PR.

@michaelshobbs
Member Author

Next question: How do we return data from dokku logs <app>?

@josegonzalez
Member

Are the logs aggregated on heroku? I think we need to provide both aggregated and non-aggregated versions of log output.

@michaelshobbs
Member Author

Yes they are aggregated.
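
For reference, Heroku's aggregated stream interleaves every process type into one feed, tagged by source; roughly (contents invented for illustration):

```
2015-04-17T21:59:12+00:00 app[web.1]: GET / 200
2015-04-17T21:59:13+00:00 app[worker.1]: processing job 42
```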

@joshco
Contributor

joshco commented Apr 17, 2015

Agree on the retries for the default check.
I can do that along with the logging notice about using the tuned checks.


@michaelshobbs
Member Author

@joshco retries on the default check don't make much sense, because we're not attempting to connect to the container, just checking that it exists. So a retry won't do anything useful.

@joshco
Contributor

joshco commented Apr 17, 2015

Using a retry will let us reduce the wait time for the default check. Right now any deployment blocks for 35 seconds, and the majority of that time is unneeded.
With a retry we can check every 5 seconds (or less), which will work most of the time.
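
Roughly something like this (interval, retry count, and variable names are illustrative only):

```shell
# one reading of the retry proposal: instead of a single long sleep, poll
# every WAIT seconds up to RETRIES times; pass as soon as the container is
# found alive after an interval, retry otherwise
WAIT=5; RETRIES=6
for ((i = 0; i < RETRIES; i++)); do
  sleep "$WAIT"
  docker ps -q --no-trunc | grep -q "$CONTAINER_ID" && exit 0
done
exit 1
```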


@progrium
Contributor

Btw, I'm working on a generic health check utility for Docker containers that might help out with some of this.

@joshco
Contributor

joshco commented Apr 17, 2015

Workflow question: I've made the logging changes to guide the user towards the checks examples for tuning CHECKS.
Do you prefer that I update my existing branch and PR, or should I open a new branch and new PR?

@michaelshobbs
Member Author

@joshco I think we're perhaps referring to two different execution paths.
I'm specifically referring to this:
https://github.com/progrium/dokku/blob/5224a6c6783bcfba5c90839e2365f6a491cfb637/plugins/checks/check-deploy#L80-L92

In this block we sit for x number of seconds, check whether the container is still around, and then exit 0 if it's still there or exit 1 if it's dead. We never drop into the retry section, because we're never actually waiting on anything to bind to a port; we're just making sure the container didn't die.

I think the only change that's probably necessary is to drop the default timeout to 10 seconds as previously discussed.
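
In other words, the default path is essentially this (a paraphrased sketch, not the verbatim source; variable names are approximate):

```shell
# sleep once, then verify the container still exists; there is nothing
# listening to retry against
sleep "$WAIT"   # currently 35s by default; the proposal is 10s
if docker ps -q --no-trunc | grep -q "$CONTAINER_ID"; then
  exit 0        # container survived the window
else
  echo "App container failed to start" && exit 1
fi
```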

@michaelshobbs
Member Author

A new PR would be best, I think.

@joshco
Contributor

joshco commented Apr 17, 2015

Done: #1119

@josegonzalez
Member

Closing as there is an open PR - #1118 - where we should contain any future discussion.
