
ps:scale implementation #1117

Closed
michaelshobbs opened this issue Apr 17, 2015 · 20 comments · Fixed by #1118
@michaelshobbs
Member

I believe this initially started with wanting to provide a core method of scaling processes on the same machine. #298 was the first attempt at this implementation but was not accepted, as we let it sit around for too long (I think).

Some questions come up in the new world of dokku, including things like zero-downtime checks and dockerfile deployments.

Issue/Question 1: Currently the zero-downtime checks rely on a web service being accessible for dokku to 'ping'. Therefore, if we run each Procfile entry in its own container, things like workers can't be checked in the same manner.

There are two potential ways to go here. The first would be to scale at the container level, somehow detect that a listener is running, and then run checks; otherwise, fall back to the default container 'uptime' check we run when an app does not have a CHECKS file. The second would be to follow a pattern similar to the dokku-supervisor plugin and use a process manager in a single container to handle scaling.
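
For reference, the process-manager route boils down to baking something like this into the single app container (program names and commands here are illustrative, not taken from the plugin):

```shell
# Sketch of the dokku-supervisor pattern: one container, with supervisord
# fanning out to the Procfile-style processes. Names/commands are examples.
cat > /etc/supervisor/conf.d/app.conf <<'EOF'
[program:web]
command=node server.js
autorestart=true

[program:worker]
command=node worker.js
autorestart=true
EOF
exec supervisord --nodaemon
```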

Issue/Question 2: How should we scale dockerfile deployments?
I don't use this method today and am not sure I have a full grasp of the implementation details. However, I think these can 'easily' be scaled using the multiple-container approach.
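
For illustration, the multiple-container approach would presumably end up looking something like this on the CLI, with one container per Procfile entry and N containers per process type (the exact syntax is what this issue needs to settle):

```shell
# hypothetical ps:scale invocation; app name, counts, and syntax are a sketch
dokku ps:scale myapp web=2 worker=1
```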

refs: #733

ping @progrium @josegonzalez @joshco

@joshco
Contributor

joshco commented Apr 17, 2015

Here are my initial, somewhat rambling thoughts.
My current inclination is to avoid moving too quickly here. I've only just run into the issue, and it's a complicated one with multiple facets.

IMHO, the urgent issue is that there is no way to start the non-web processes.

That's a real blocker, as many common types of applications just can't be deployed with dokku without it.
However, the supervisord solution seems really good. I'm going to use that and study/observe to gain insight.

While it might not be our preferred long term strategic approach, it's a good, solid solution people can use immediately and it gives us time to reason about a dokku solution. Are there any pressing downsides that require urgency?

As @michaelshobbs mentions, there are a number of issues that tie into this, like CHECKS and the choice of a scaling model (container vs. process).

Re Question 1
Yes, multi-process apps are even more difficult for dokku to generically assess for availability. Our current model, where the developer decides what the right checks are for their application's process set within the container, seems like the pragmatic approach.
I don't need dokku assessing the state of every process in the container.

As far as CHECKS go, I think we want to be careful to avoid being so CHECKY that a dev can't deploy an app that isn't working yet.
E.g., I'm pushing a new app that is still in early development, and I need to set up the database or other services in arbitrary order. I want dokku to just ignore any problems and accept the push without resisting.
Call this the "I am ok with downtime" deployment mode.

I worry that if we go overboard on multiprocess CHECKS, the checks will be very fragile; the slightest problem, or more likely the learning curve for dokku users, will lead to mistakes, and pushing an app to dokku will become a huge hassle. (Which seems to me to be the opposite of what we're after.)

I'm curious to see how the CHECKS are received by other users. For one thing, I'm sure we're going to get repeated versions of this question:

Why does Dokku wait for 35 seconds #1113

On that note, the default check could probably benefit from some logging pointing users towards more efficient checks, e.g. "You are using the default and likely inefficient availability CHECK. See the documentation to learn how to tune the CHECKS for your app." (If folks agree, I will take care of this.)

@michaelshobbs
Member Author

Yeah, I don't think we're in any rush at all, actually. I do think, however, it's been on the back burner for quite a long time, so having the discussion again seems prudent.

Regarding checks and early-stage dev, I think we cover this case well: if a dev does not include a CHECKS file, we just make sure the container stays up. Would you agree?

My thinking on non-web process containers is that we would just use the default check to start. Any suggestions on implementing custom checks would be great too, but probably not necessary in the initial pass. The issue here is: if we do have a CHECKS file, how do we know which container to execute against?

Also, I think printing out some message with the URL to the checks docs would be great in the case of no CHECKS file.

@michaelshobbs
Member Author

So in doing some rubber-duck "debugging", I had the thought that we might just adjust the CHECKS file format to include the process name, which would map to a container that we'd test against.
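
Something like this, maybe (the format, paths, and expected content below are all hypothetical at this point):

```shell
# hypothetical CHECKS format: prefix each check with the Procfile process
# name so we know which container to run it against
#   <process> <path> <expected content>
web /          Welcome
web /healthz   ok
```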

@josegonzalez
Member

Are there any services on heroku with a "checks" type functionality? I'm thinking about how we might want to abstract this so that it would work regardless of application type.

@josegonzalez
Member

Another thing: Can we have multiple http processes? How do we handle exposing multiple ports properly?

@josegonzalez
Member

I think a 10-second wait rather than 35 is a much better default (or even 5). @joshco feel free to PR the logging output change.

@michaelshobbs
Member Author

  1. This? https://devcenter.heroku.com/articles/preboot
  2. I think there are a few different scenarios (i.e. are we binding externally, and are we using nginx-vhosts? see the nginx sketch below).
  3. I'm down with changing the default naive check to 10 seconds. I was also thinking we may want to implement retries in the default case, so something like a 5-second timeout with 6 retries... or 3 and 10, maybe? Thoughts?
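
For the nginx-vhosts scenario in point 2, I'd imagine multiple containers for a single web process being pooled into one upstream; a rough sketch with made-up addresses (dokku's actual template handling may differ):

```shell
# sketch: pool several web containers behind a single vhost
cat > /etc/nginx/conf.d/myapp.conf <<'EOF'
upstream myapp { server 172.17.0.2:5000; server 172.17.0.3:5000; }
server {
  listen 80;
  server_name myapp.example.com;
  location / { proxy_pass http://myapp; }
}
EOF
```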

@michaelshobbs
Member Author

Derp, scratch that last one. Not sure what I was thinking. It's either going to stay up or not. 10 seconds is good with me. I'll include it in the forthcoming PR.

@michaelshobbs
Member Author

Next question: How do we return data from dokku logs <app>?

@josegonzalez
Member

Are the logs aggregated on heroku? I think we need to provide both aggregated and non-aggregated versions of log output.

@michaelshobbs
Member Author

Yes they are aggregated.
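
For reference, Heroku's aggregated stream interleaves every process type into one feed, tagged by source; roughly (contents invented for illustration):

```
2015-04-17T21:59:12+00:00 app[web.1]: GET / 200
2015-04-17T21:59:13+00:00 app[worker.1]: processing job 42
```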

@joshco
Contributor

joshco commented Apr 17, 2015

Agree on the retries for the default check.
I can do that along with the logging notice about using the tuned checks.


@michaelshobbs
Member Author

@joshco retries on the default check don't make much sense, because we're not attempting to connect to the container, just checking that it exists. So a retry won't do anything useful.

@joshco
Contributor

joshco commented Apr 17, 2015

Using a retry will let us reduce the wait time for the default check. Right now any deployment blocks for 35 seconds, and the majority of that time is unneeded.
With a retry we can check every 5 seconds (or less), which will work most of the time.
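
Roughly something like this (interval, retry count, and variable names are illustrative only):

```shell
# one reading of the retry proposal: instead of a single long sleep, poll
# every WAIT seconds up to RETRIES times; pass as soon as the container is
# found alive after an interval, retry otherwise
WAIT=5; RETRIES=6
for ((i = 0; i < RETRIES; i++)); do
  sleep "$WAIT"
  docker ps -q --no-trunc | grep -q "$CONTAINER_ID" && exit 0
done
exit 1
```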


@progrium
Contributor

Btw, I'm working on a generic health check utility for Docker containers that might help out with some of this.

@joshco
Contributor

joshco commented Apr 17, 2015

Workflow question: I've made the logging changes to guide the user towards the checks examples for tuning CHECKS.
Do you prefer that I update my existing branch and PR, or should I open a new branch and new PR?

@michaelshobbs
Member Author

@joshco I think we're perhaps referring to two different execution paths.
I'm specifically referring to this:
https://github.com/progrium/dokku/blob/5224a6c6783bcfba5c90839e2365f6a491cfb637/plugins/checks/check-deploy#L80-L92

In this block we sit for x number of seconds, check whether the container is still around, and then exit 0 if it's still there or exit 1 if it's dead. We never drop into the retry section, because we're never actually waiting on anything to bind to a port; we're just making sure the container didn't die.

I think the only change that's probably necessary is to drop the default timeout to 10 seconds as previously discussed.
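
In other words, the default path is essentially this (a paraphrased sketch, not the verbatim source; variable names are approximate):

```shell
# sleep once, then verify the container still exists; there is nothing
# listening to retry against
sleep "$WAIT"   # currently 35s by default; the proposal is 10s
if docker ps -q --no-trunc | grep -q "$CONTAINER_ID"; then
  exit 0        # container survived the window
else
  echo "App container failed to start" && exit 1
fi
```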

@michaelshobbs
Member Author

A new PR would be best, I think.

@joshco
Contributor

joshco commented Apr 17, 2015

Done: #1119

@josegonzalez
Member

Closing as there is an open PR - #1118 - where we should contain any future discussion.
