This repository has been archived by the owner on Nov 30, 2021. It is now read-only.

[meta] Deis v2 Beta #4809

Closed
50 of 63 tasks
slack opened this issue Dec 21, 2015 · 15 comments

@slack
Member

slack commented Dec 21, 2015

Meta issue for Deis v2 Beta; as specific issues are filed in other repos, we will link them back to this issue.

Beta Issues Search

Beta Goals

The Beta period is focused on stabilizing backing services and reaching availability and recoverability targets. Technology choices for each of the backing services should be well understood for each of our blessed configurations.

e2e testing infrastructure for the v2 platform should be in place. The long-running/soak tests are a requirement to leave beta (see test target below).

Beta should also exit with a strong opinion about production and staging configurations and how best to go from zero to Kubernetes.

Project-level

  • Documentation: [meta] V2 Beta Documentation Requirements #4889
    • Services that support TLS should be deployed with TLS by default
  • Availability target:
    • Components should be recoverable with manual intervention and documentation
    • Self-healing components where possible
  • Recoverability target:
    • Given a backup of the platform, restoring to a known-good k8s cluster should take < 1 hour
  • Infrastructure targets:
    • Documentation for standup on AWS via coreos-kubernetes
    • Documentation for standup on GKE
  • Installation target:
    • Automated via Helm
  • Test target:
    • workflow-e2e fully automated
    • workflow-e2e green
  • Debugging target:
    • Deis Workflow Manager

Meta

Release Checklist

  • update documentation from alpha to beta
  • update workflow charts

Components

Client

  • Client v2 master builds pushed to bintray

Controller deis/workflow

Database deis/postgres

Etcd deis/etcd

  • Availability: HA configuration deployed by default
  • Metrics: expose operational metrics to platform monitor component
  • Recoverability: documented backup + restore
  • Testing: single-node and multi-node failure, recovery

Registry deis/registry

Object storage

Router deis/router

  • Router configuration and sub-system documented
  • basic http routing to services
  • Terminate SSL for workflow/controller
  • Terminate SSL for customer applications
  • Metrics: expose operational metrics to monitor component
  • Availability
    • Strategy: RC=1 out of the box (gives us SSL session resumption)

Builder deis/builder

Slugbuilder deis/slugbuilder

  • Every demo app builds correctly
  • Metrics: expose operational metrics to monitor component

Slugrunner deis/slugrunner

  • Metrics: expose operational metrics to monitor component

Logger deis/logger

  • Logging sub-system internals documented
  • Functioning deis logs for all apps
  • Platform log-draining + documentation
  • Availability:

Monitoring deis/monitor

  • Monitor sub-system internals documented
  • Default monitors for workflow components: availability/health only
@slack slack self-assigned this Dec 21, 2015
@slack slack added this to the v2.0-beta1 milestone Dec 21, 2015
@arschles
Member

two additional items to consider:

  • easing router startup config - it would be nice, from a user & setup perspective, to not have to edit the incoming domain in the router manifest
  • port non-Go components of builder to Go

@slack
Member Author

slack commented Dec 22, 2015

I agree that the user-experience for folks first setting up the platform would be greatly aided by something which isn't raw k8s wrangling. I feel there is certainly a space for a deisctl2 type of tool, but I'm on the fence as to what that should be.

Builder can certainly benefit from a non-bash life; however, one of my motivating priorities is working code now, versus the thing we'd like to have. A builder Go rewrite could easily be delivered in a 2.0 point release, which is what I'm leaning toward at the moment. The big caveat: if builder is consistently melting our faces through beta, we can reconsider.

@arschles
Member

@slack ok, fair enough. RE the deisctl2-like tool, not sure what that'd be for, maybe the aforementioned router config?

RE Go in the builder - it's actually not as big a task as a rewrite, since Builder is mostly written in Go already. There are a few face-melting shell scripts, however, that do a lot of important things and are untested. Those need to be ported and tests written for them. deis/builder#29 explains a bit more, and I'm happy to go into further detail. Not sure if this all needs to be in the list - IMO it easily fits under a stability bullet point.

@arschles
Member

also, @slack we should add persistent storage for minio, under 'object storage'

@slack
Member Author

slack commented Dec 28, 2015

Storage added!

@rimusz

rimusz commented Dec 28, 2015

👍

@rimusz

rimusz commented Dec 28, 2015

minio persistent storage example:

          volumeMounts:
            - name: home-minio
              mountPath: /home/minio
      volumes:
        - name: home-minio
          hostPath:
            path: /data/deis/minio

minio's boot should run chown minio:minio /home/minio after mkdir /home/minio

@slack slack modified the milestones: v2.0-alpha1, v2.0-beta1 Dec 29, 2015
@smothiki
Contributor

smothiki commented Jan 6, 2016

Do we need to expose slugbuilder metrics? After all, it's an intermittent job that runs, produces a slug, and dies.
I think the buildpack-app functionality on the builder side is done.

@krancour
Contributor

@slack question about the following that was stated WRT router:

Metrics: expose operational metrics to monitor component

Are any more solid requirements available on what sort of metrics are desired?

@slack
Member Author

slack commented Jan 13, 2016

I haven't had the time to pull out specific sets of stats for each of the components. Since we are using Prometheus to scrape metrics, a survey of existing Prometheus exporters is a great place to start.

In the absence (or even presence) of existing exporters, what we want to enable is operators of the platform to be able to understand the health of the components through these metrics, which then become potential sources for alerts.

To your question @smothiki, emitting metrics for all the platform components lets us answer questions and debug problems with numbers rather than guts. So the more the 📈 📉 the better.

@jchauncey
Member

Just an update on the metrics side of things. I have a working telegraf/influx setup. So I'm trying to determine what I should do next. If we have the VTS plugin back in the router component then I can take a stab at building a plugin that will scrape those metrics and send them over to our sink.

For other components we should figure out what is most important and how we want to get that data to influx.

For example, should we have a /stats endpoint in the controller that prints out metrics in json that we can scrape like we do for router? Or should it just send the metrics to influx as it calculates them?

@sgoings
Member

sgoings commented Feb 16, 2016

I'm worried about some of the project level requirements such as:

  • Test target:
    • load test: running for 2 days (24/7) with < 5% dropped requests
    • load test: 1 deploy/minute across 5 independent applications
    • workflow-e2e fully automated
  • Dogfooding target:
    • two Deis internal services, long-running
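A small checker can make the numbers in the test target above concrete. This Go sketch rests on several assumptions. First, "1 deploy/minute across 5 independent applications" is read as one deploy per minute, rotated round-robin over the five apps; another reading is one deploy per minute per app. Second, a run passes only when strictly fewer than 5% of requests are dropped. The type and function names are hypothetical.

```go
package main

import "fmt"

// LoadResult summarizes a hypothetical soak-test run.
type LoadResult struct {
	Sent    int64
	Dropped int64
}

// dropRate returns dropped requests as a percentage of requests sent.
func dropRate(r LoadResult) float64 {
	if r.Sent == 0 {
		return 0
	}
	return 100 * float64(r.Dropped) / float64(r.Sent)
}

// passes applies the "< 5% dropped requests" criterion; an empty run fails.
func passes(r LoadResult) bool { return r.Sent > 0 && dropRate(r) < 5 }

// deployTarget picks the app that receives the deploy in a given minute,
// assuming a round-robin reading of the deploy target.
func deployTarget(minute int, apps []string) string { return apps[minute%len(apps)] }

func main() {
	apps := []string{"app-1", "app-2", "app-3", "app-4", "app-5"}
	// Two days at one deploy per minute is 2*24*60 = 2880 deploys in total.
	const minutes = 2 * 24 * 60
	counts := map[string]int{}
	for m := 0; m < minutes; m++ {
		counts[deployTarget(m, apps)]++
	}
	fmt.Println(counts["app-1"]) // 576

	r := LoadResult{Sent: 1000000, Dropped: 42000}
	fmt.Printf("drop rate %.1f%%, pass=%v\n", dropRate(r), passes(r)) // drop rate 4.2%, pass=true
}
```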

I know there has been some chatter re: dogfooding, but I have yet to see long running tests of applications and performance/load testing. That historically has fallen into my purview but I'm currently working on improving the release process (mutable -> immutable Docker tags).

@jchauncey
Member

@slack if this is the official document for releasing beta should we remove the metrics requirements from it?

@slack
Member Author

slack commented Mar 1, 2016

Yeah, everything but router. Will update. Done!

@sgoings
Member

sgoings commented Mar 15, 2016

The final push for v2 beta is visible on #4962


8 participants