
Create Internal helm-deployed Concourse Instance #2876

Closed
scottietremendous opened this issue Nov 27, 2018 · 8 comments

scottietremendous (Contributor) commented Nov 27, 2018

Summary

In order to test a helm-deployed Concourse instance at a small, but realistic scale, we'd like to create a new, internal-facing Concourse deployment using the helm charts.

For now, this deployment will exist as a type of "mock-Wings". We believe this deployment can be used for acceptance testing pipelines going forward. In order to accomplish this, we'll need to perform a few actions:

  • Compare all the bits from the current Wings BOSH deployment and decide whether we want to keep what we currently have or sub in Google equivalents
  • Set up a Grafana dashboard via Prometheus and monitor the same SLIs as Wings
  • Create a public address

cirocosta (Member) commented Nov 28, 2018

Hey,

Here's a table that @YoussB and I created describing the comparison between prod, wings, and hush-house (the k8s deployment):

| feature | wings | prod | hush-house |
| --- | --- | --- | --- |
| logs | papertrail | papertrail | stackdriver |
| metrics from the host | telegraf (host metrics) | telegraf (host metrics) | cadvisor (container metrics) + node exporter (node metrics) |
| tls | yep | yep | yep |
| db encryption | yep | yep | yep |
| token signing key | yep | yep | yep |
| container_placement_strategy | random | default | default |
| default task limits | nope | 1024 (shares?) CPU; 5Gi RAM | nope |
| github auth | yep | yep | yep |
| metric emitter configured in atc | influx | influx | prometheus |
| log level | debug | debug | debug |
| team authorized keys (external workers that take load) | yep | yep | nope |
| postgres | GCP sql | bosh deployed | k8s deployed |
| secrets backends | nope | vault (bosh deployed) | k8s secrets (needs testing) |
| worker drain timeout | nope | 10m | nope |
| versions | 4.2.1 | latest | latest |
| worker ephemerality | nope | nope | yep |
| worker baggageclaim driver | btrfs | btrfs | overlay |
| rebalancing | nope | nope | 10m |

Also, some details that @xtreme-sameer-vohra and I outlined (that are required to achieve those 3 high-level goals):

concourse

  • change the main team's test:test password (configure GitHub auth)
  • set up TLS (PR submitted: helm/charts#9665)
  • make use of DB encryption
  • set up the token signing key so we don't use the default one
  • make use of our own certificates & private keys for TSA instead of relying on the default ones generated by the Helm chart
  • make use of latest image
  • make use of SIGUSR2 for gracefully terminating the worker
  • configure worker drain timeout to 1m or something like that
  • update workers to use overlay instead of naive
  • worker ATC rebalancing
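
To make some of those items concrete, here's a rough sketch of what the values overrides could look like. The key names below are illustrative assumptions only and need to be checked against the chart's values.yaml for the version actually deployed:

```yaml
# Illustrative values override - key names are assumptions, check them
# against the concourse chart's values.yaml for the deployed version.
concourse:
  worker:
    baggageclaim:
      driver: overlay          # switch away from the naive driver
    rebalanceInterval: 2h      # hypothetical key: worker ATC rebalancing
    drainTimeout: 1m           # hypothetical key: worker drain timeout
web:
  tls:
    enabled: true              # terminate TLS at the web nodes
secrets:
  # a real deployment should pull these from a secrets store rather than plain values
  localUsers: "admin:not-test-test"               # replace the default test:test user
  encryptionKey: "<32-byte-key>"                  # enables DB encryption
  sessionSigningKey: "<our-own-RSA-private-key>"  # instead of the chart-generated default
  hostKey: "<our-own-TSA-host-key>"
  workerKey: "<our-own-worker-key>"
```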

grafana

  • set up GitHub auth
  • make use of latest metrics
    • http responses for requests that failed (500)
  • gather disk metrics that reflect the use of disk under the work-dir mount
    • given that it makes use of a volume, cadvisor can't really tell us about the disk usage. We can either make use of the kubelet PV metrics (see google/cadvisor#1702 (comment)) or have a sidecar that looks at the mounted filesystems and exposes those metrics for us on a per-pod basis (I'm starting to think that this is the best way to go - see the sketch after this list)
  • set up TLS
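
One possible shape for that sidecar idea (purely illustrative - not part of the current chart): a second container in the worker pod shares the work-dir emptyDir and runs an exporter against it. Note that a plain node-exporter only reports statfs-level numbers for the underlying node disk at that mountpoint, so a small custom exporter that walks the mount (du-style) would be needed for true per-emptyDir usage.

```yaml
# Illustrative worker pod-spec fragment - not part of the current chart.
# A sidecar shares the worker's work-dir volume so filesystem usage can be
# scraped by Prometheus on a per-pod basis.
containers:
  - name: concourse-worker
    image: concourse/concourse
    volumeMounts:
      - name: concourse-work-dir
        mountPath: /concourse-work-dir
  - name: disk-metrics
    image: quay.io/prometheus/node-exporter   # or a small custom du-based exporter
    ports:
      - name: metrics
        containerPort: 9100
    volumeMounts:
      - name: concourse-work-dir
        mountPath: /concourse-work-dir
        readOnly: true
volumes:
  - name: concourse-work-dir
    emptyDir: {}
```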

generic

  • move hush-house to pivotal-cf?
  • keep track of the versions of the Helm dependencies we have (prometheus & grafana); the deployments now declare their versions
  • research whether having a single chart w/ dependencies OR having multiple releases is the best for managing a deployment that consists of multiple charts
    • we ended up going with a middle ground: separating metrics into its own deployment, but keeping the dependencies declared under metrics
  • monitor prometheus persistent disk concourse/hush-house#1
    • we do that for both workers (emptyDir) and postgresql; we might just as well do that for Prometheus (eventually have some metrics from prometheus itself?)
      • eventually this can be something we just take as "pod metrics" like we're doing for Go?
  • add some more interesting workloads to it
  • produce concourse/concourse-rc in the main pipeline
  • automatically deploy hush-house from the main pipeline after k8s-topgun
    • automatic deployments
      • there's an ongoing issue w/ Helm where, if it tries to deploy a given revision that creates some resources but then fails for some reason, helm leaves those resources hanging around in k8s and further upgrades fail (see helm/helm#1193). There's a pretty good PR (helm/helm#4871) that seems to do the right thing, but it's not released yet.
    • as a step, we might want to wire in kubeval to verify whether what's been generated by helm is really valid for the k8s distro (see the sketch below) concourse/hush-house#2
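
For the kubeval item above, the check could be as small as rendering the chart locally and piping the result through kubeval. A rough sketch of such a task follows - the image and repo paths are assumptions, not something that exists today:

```yaml
# Illustrative Concourse task - the image and repo layout are assumptions.
platform: linux
image_resource:
  type: registry-image
  source:
    repository: some-registry/helm-and-kubeval   # hypothetical image with both helm and kubeval
inputs:
  - name: charts   # git checkout of our charts fork
run:
  path: sh
  args:
    - -ec
    - |
      # render the chart with our values and validate every generated manifest
      helm template charts/stable/concourse \
        --values charts/stable/concourse/values.yaml \
        | kubeval --strict
```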

thx!


Some observed issues:


cirocosta (Member) commented Dec 10, 2018

Hey,

I'm marking this issue as "paused" now, as we're pretty confident that we've achieved a reliable Helm-deployed Concourse environment.

As a result of the efforts put into it, we submitted a good number of changes upstream (see the charts PRs):

Alongside those, we also have some changes that are not in PR form yet, given that they depend on changes that will only land after 5.x:

As I see it, the next step is to take that knowledge back into our internal topgun test suite (see https://github.com/concourse/concourse/tree/master/topgun/k8s) so that we can later continuously deploy hush-house with full confidence - see the topgun epic.

Please let me know if you have any remarks / questions about it.

Thanks!

cirocosta added the paused label Dec 10, 2018

cirocosta (Member) commented Dec 21, 2018

Update on disk metrics:

  • I tried making use of ephemeral-storage reservation + limits; although that indeed works well (if you get to the limit, the pod gets evicted after some interval), we're still unable to properly capture a decent metric from Kubernetes that would tell us the exact disk usage of such an emptyDir (the way something like xfs quotas would).
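
For reference, the reservation + limits mentioned above boil down to ephemeral-storage resources on the worker container - the sizes here are made up. An emptyDir-backed work dir counts against this, and the kubelet evicts the pod once the limit is exceeded:

```yaml
# Illustrative container resources - the sizes are made up.
resources:
  requests:
    ephemeral-storage: 50Gi
  limits:
    ephemeral-storage: 100Gi
```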

For now, I added a panel to our Grafana configuration that ties each worker to the filesystem usage of the node it lands on, so we can get an idea of which workers are likely to run into trouble (and alert based on that).

```promql
(
  kube_pod_container_info{pod=~"$worker"}
)
  * on (kubernetes_node) group_left(device)
(
  1 - (node_filesystem_avail_bytes{mountpoint="/etc/hostname"} / node_filesystem_size_bytes)
)
```

[screenshot: workers-disk-metric Grafana panel]

ps.: if you're wondering why we make use of emptyDir: it's because we're then able to mount overlayfs - something that isn't possible if the underlying filesystem is already an overlay filesystem (see https://lkml.org/lkml/2018/1/8/81).

pps.: if you're deploying 1 worker per machine, then I guess that's certainly good enough


voor commented Jan 14, 2019

While we wait for all of these PRs to get merged into the stable chart repo, is there a helm repo or git repo we can reference to get the "best" chart?

cirocosta removed the paused label Jan 14, 2019

cirocosta (Member) commented Jan 14, 2019

Hey @voor ,

Thanks for taking a look at it!


Right now we have all of the changes lying under our fork of helm/charts: concourse/charts.

There you can find three branches:

  • maintenance: contains the automation to push a test Helm repository containing the changes we're pushing to upstream and some others that we're still holding;
  • merged: the representation of all of our PRs and other branches merged (continuously updated); and
  • gh-pages: a branch that serves merged as a Helm repository.

Right now, that's all experimental, and I would not endorse it as something to use for production at all - those branches are for our internal testing only - some of the changes (not yet submitted as PRs to helm/charts) rely on behaviors that are only in 5.x.x.

In regards to submitting those changes upstream - they're coming! We got in contact with @william-tran (one of the maintainers) and we'll be working towards having the team's contributions there very soon!

tl;dr: helm/charts is still considered "the best" for people to use, but the Concourse team is now committed to maintaining what previously was just a community effort.


Btw, feel free to discuss that with us on Discord! There's a `#kubernetes` channel there now 😁

Thanks!


paulczar commented Jan 14, 2019

Hey folks, I'm a Pivot and also a helm/charts maintainer - reach out and hit me up if you need stuff reviewed in the helm chart repo ... the best way is on Slack, either the Kubernetes or Pivotal Slack. I'm working to shepherd the above PRs through now.


paulczar commented Jan 14, 2019

If y'all want to take over managing the concourse helm charts I can help with that. Now that the chart hub supports external repos, there's no reason we couldn't move the concourse chart to its own thing.

see https://github.com/helm/hub#distributed-search


cirocosta (Member) commented Feb 4, 2019

Thanks so much for helping us out Paul!

Regarding this issue, I'll be closing it for now in favor of the issues created under https://github.com/concourse/hush-house, where the deployment lives.

I'll soon create a new issue where we can aggregate the set of issues that need to be tackled before we have a "wings"-style deployment where other teams within Pivotal can push their workloads, so we can get this path even more battle-tested.

Thanks to everyone who helped!

cirocosta closed this Feb 4, 2019