116 changes: 107 additions & 9 deletions HOW_THE_PIPELINE_WORKS.md
@@ -11,7 +11,7 @@ The following is a high-level outline of the various stages of our
build and test pipeline. It is not complete but lists the major
steps. For a more complete diagram see RELEASE_PROCESS.md:

```
```text
+----------------+
+----------+ | Buildkite | +-----------+ +-----------+ +----------+ +-------------+ +---------------+
| Pull | | Verify | | Pull | | User and | | Habitat | | Packages | | deploy/dev |
@@ -84,9 +84,9 @@ and start actual Automate Habitat packages.

These tests come in two different flavors:

* Studio Tests:
#### Studio Tests

```
```yaml
- label: "[unit] secrets-service"
command:
- hab studio run "source .studiorc && go_component_unit secrets-service && go_component_static_tests secrets-service"
@@ -107,9 +107,9 @@ To understand these tests, you need to read the related studio
functions. The dev studio is defined in a series of shell functions in
`.studiorc` and the `.studio` directory.

* "Makefile" tests:
#### "Makefile" tests

```
```yaml
- label: "[unit] automate lib"
command:
- scripts/install_golang
@@ -134,7 +134,7 @@ integration tests run. In the first phase of this pipeline, we build
all of the packages that changed as part of the pull request. This
happens in the build step:

```
```yaml
- label: build
command:
- scripts/verify_build
@@ -165,7 +165,7 @@ The tests in this pipeline come in two primary flavors:

These tests use the Habitat dev studio. They look like this:

```
```yaml
- label: "[integration] ingest-service"
command:
- . scripts/verify_setup
@@ -191,7 +191,7 @@ and then any related code that those functions may call.

These tests use our integration test framework. They look like this:

```
```yaml
- label: "[integration] deep upgrades"
command:
- integration/run_test integration/tests/deep_upgrade.sh
@@ -305,7 +305,9 @@ This pipeline builds any packages that have changed in the most recent
commit. The buildkite pipeline definition itself is autogenerated
using the data found in `.bldr.toml`. `.bldr.toml` is generated by

```shell
go run tools/bldr-config-gen
```

These packages are then uploaded to the Habitat depot and promoted
into the dev channel.
@@ -319,7 +321,7 @@ The manifest is created using the `.expeditor/create-manifest.rb`
script.

Expeditor posts the outcome of this build and upload process into
#a2-notify. In the case of failure, Expeditor will
`#a2-notify`. In the case of failure, Expeditor will
include a link to the log file that contains the full output of the
action that failed.

@@ -356,3 +358,99 @@ deploys; however, failures here are still taken seriously and
investigated.

Our nightly pipeline is defined in `.expeditor/nightly.pipeline.yml`

## Pipeline Monitoring

A handful of pipelines work together to form our continuous integration environment,
covering everything from validating a pull request to releasing code to customers.
These pipelines need to be monitored on a DAILY basis for any anomalies.

### Monitoring for FAILED Runs

There are two dashboards that show these pipelines in the browser.
The rightmost bar in each spark chart indicates the latest status.
Any that are red indicate a failure.
Some of these failures, though, are routine and require no action from someone monitoring them.
For example, the verify/private pipeline (https://buildkite.com/chef/chef-automate-master-verify-private) will report failing unit tests on a branch for an individual pull request, indicating only that the author has a bit more work to finish their PR.
But if that same pipeline fails on `master`, that may indicate a problem elsewhere that needs remediation.
At the time of writing, only one pipeline is under "chef-oss"; all the rest are under "chef":

- Chef pipelines: https://buildkite.com/chef?filter=chef%2Fautomate%3A
- Chef OSS pipelines: https://buildkite.com/chef-oss?filter=chef%2Fautomate

### Monitoring for BLOCKED Runs

The dashboards mentioned above indicate the success or failure of each run; they do NOT indicate any pipelines that are blocked.
Thus, you must either check builds one by one or use the `pipelines` shell function to check them all.
Run `pipelines --blocked` after sourcing `scripts/pipeline_info.sh` and satisfying its prerequisites.

Example output:

```json
{"branch":"Amol/project_and_roles_changes","created_at":"2021-01-28T09:38:49.345Z","state":"passed","pr":"4393","pipeline":"[chef/automate:master] verify_private","message":"Added Id testcase for policy.\n\nSigned-off-by: samshinde <ashinde@chef.io>"}
{"branch":"Himanshi/client_UI_changes","created_at":"2021-01-28T08:42:51.369Z","state":"passed","pr":"4661","pipeline":"[chef/automate:master] verify_private","message":"Addedd CSS changes in client details.\n\nSigned-off-by: Himanshi Chhabra <himanshi.chhabra@msystechnologies.com>"}
```

Occasionally you will see a false positive if someone resolved the blockage by just starting a new build instead of unblocking the original blocked build.
Example: https://buildkite.com/chef/chef-automate-master-verify-private/builds?branch=Himanshi/client_UI_changes
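
The prerequisites mentioned above (the `gq`, `jq`, and `curl` tools plus the `BUILDKITE_TOKEN` environment variable) can be checked before calling `pipelines`. The following is a minimal sketch only; `check_pipeline_prereqs` is a hypothetical helper and is not part of `scripts/pipeline_info.sh`.

```shell
# Hypothetical helper (not part of scripts/pipeline_info.sh): verify the
# prerequisites that the script's header lists before calling `pipelines`.
check_pipeline_prereqs() {
  local missing=0 tool
  # The script shells out to gq, jq, and curl.
  for tool in gq jq curl; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done
  # The Buildkite API calls need a token in the environment.
  if [ -z "${BUILDKITE_TOKEN:-}" ]; then
    echo "BUILDKITE_TOKEN is not set" >&2
    missing=1
  fi
  return "$missing"
}
```

A typical session might then be `check_pipeline_prereqs && source scripts/pipeline_info.sh && pipelines --blocked`.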

### Monitoring for STALE Runs

The dashboards mentioned above indicate the success or failure of each run; they do NOT indicate any pipelines that have stopped running.
Thus, you must either check the date of each pipeline's last build one by one or use the `pipelines` shell function to check them all.
Run `pipelines --stale` after sourcing `scripts/pipeline_info.sh` and satisfying its prerequisites.
The output lists every pipeline whose most recent build was created before today.
That does not, however, mean each has a problem; some pipelines only need to run occasionally.

Example output:

```json
{"name":"[chef/automate:master] post-promote","createdAt":"2021-02-01T09:00:52.228Z"}
{"name":"[chef/automate:master] deploy/ui-library","createdAt":"2020-12-23T04:04:33.035Z"}
{"name":"[chef/automate:master] habitat/build","createdAt":"2021-02-01T19:46:53.664Z"}
{"name":"[chef/automate:master] deploy/acceptance","createdAt":"2021-01-30T16:58:23.482Z"}
```

`post-promote` and `habitat/build`, for instance, only run after PRs that create a new deliverable.
PRs that touch only internal docs, build files, and the like do not trigger them.
So if there are no PRs at all on a given day, or only PRs that update e.g. internal docs, then those pipelines will not run that day.
Another example: `deploy/acceptance` will only run on days that we promote code to acceptance.

So this list is a starting point of things you might need to check; it is a subset of the full list (run `pipelines` with no arguments).
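
To cut down the noise from pipelines that are expected to be idle on some days, the `--stale` output can be piped through a filter. This is a sketch under assumptions: `filter_expected_stale` is a hypothetical helper, and its exclusion list is simply the three pipelines used as examples above, not an agreed-upon list.

```shell
# Hypothetical helper: drop pipelines that are expected to be idle on some
# days. It reads the JSON lines printed by `pipelines --stale` on stdin.
# The names are the examples discussed above; adjust the list as needed.
filter_expected_stale() {
  grep -v -F \
    -e '] post-promote"' \
    -e '] habitat/build"' \
    -e '] deploy/acceptance"'
}
```

Usage would be `pipelines --stale | filter_expected_stale`. Note that the filter hides those pipelines no matter how long they have been idle, so the unfiltered list is still worth a look from time to time.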

### Monitoring Other Pipeline Issues

Some issues will show up as notifications in the `#a2-notify` Slack channel.

- Problems reported by Buildkite itself will show a "thumbs-down" indicator when they need attention.
- Other problems, like those reported by Netlify, will require reading the text, e.g. "Deploy did not complete for chef-automate (deploy preview 4676)".
- Still others, like those from Semgrep, report only problems, so each one requires attention. (Well, in the Semgrep case, it depends: Semgrep reports issues on master, yes, but it also reports issues in open PRs; the latter are up to the developer to resolve).

## Ancillary Monitoring

Hand-in-hand with pipeline monitoring is monitoring of other crucial aspects of the health of the codebase.

### Dependabot

GitHub provides built-in support for Dependabot security updates.
Prerequisites to have a repository scanned are detailed [here](https://docs.github.com/en/github/managing-security-vulnerabilities/configuring-dependabot-security-updates#supported-repositories).
The Chef Automate repository satisfies the prerequisites and has Dependabot updates enabled; configuration settings are available [here](https://github.com/chef/automate/settings/security_analysis), though only admins have access to that link.
You can also view [past alerts](https://github.com/chef/automate/security/dependabot) on the dashboard as well as [current and past PRs](https://github.com/chef/automate/pulls?q=is%3Apr+dependabot+) automatically generated by Dependabot.

Periodically you will see pull requests automatically generated by Dependabot--these need to be reviewed and merged promptly, like any other PR.

### Semgrep

Semgrep (short for _Semantic Grep_) is a static code analyzer for both security issues and best-practice issues.
This is integrated into the Chef Automate workflows and runs:

- a differential scan for every PR when opened (a task in the [verify_private pipeline](https://buildkite.com/chef/chef-automate-master-verify-private))
- a differential scan for every PR when merged (same task)
- a full code base scan every day (a task in the [nightly pipeline](https://buildkite.com/chef/chef-automate-master-nightly))

Semgrep findings are detailed in the output log of those Buildkite tasks every time they run.
Most significantly, any new findings generate Slack notifications in the `#a2-notify` channel, so the channel should be monitored closely for any such reports.
Semgrep also provides a [dashboard](https://semgrep.dev/manage/findings?repo=chef%2Fautomate-nightly&ref_type=branch&ref=master&state_type=new&tab=findings) where you can see not only all current issues on master, but trends and past history on all branches.
Note that, at the time of writing, there are about 21 open findings, i.e. issues that need to be resolved. But it would be counter-productive to generate Slack notifications for each of those every day--no one would pay any attention. That is why notifications occur only for new findings.

Semgrep can also be run locally (through several Makefile targets), either on the whole code base or on individual components within the code base.
More details on this will be documented in the near future.
77 changes: 77 additions & 0 deletions HOW_TO_MAINTAIN_A2.md
@@ -0,0 +1,77 @@
# Maintaining Automate

The Automate code base needs continual maintenance to remain robust and fresh.

Who owns this task? YOU.
Every engineer should own this, taking the initiative to either fix an issue when it arises or at least report it to those who can.

There are two main aspects to this:

- Proactive: keeping up with third-party library releases -- covered in this document.
- Reactive: addressing system or pipeline issues that come up due to system changes, library changes, environment changes, etc. -- this is covered in the separate HOW_THE_PIPELINE_WORKS.md document.

## Third-party Dependency Update Monitoring

Proactively monitor releases of third-party libraries and re-sync our code base with them as soon as scheduling allows.
Automate uses two main languages, Go (back-end) and TypeScript (front-end), each with its own complex ecosystem.

### Back-end Updates

There is no central place to check on all dependencies.
The following, though, lists some of the key Golang dependencies--this is NOT a complete list by any means.
One can set up a "watch" on each of these repositories to be notified automatically of new releases, which can then be scheduled as technical debt items to be done in a timely fashion.

- https://github.com/golang/go/releases
- https://github.com/grpc/grpc-go/releases
- https://github.com/golang/protobuf/releases
- https://github.com/open-policy-agent/opa/releases
- https://github.com/grpc-ecosystem/grpc-gateway/releases
- https://github.com/golang/protobuf/protoc-gen-go/releases
- https://github.com/stretchr/testify/releases
- https://github.com/lib/pq/releases
- https://github.com/dexidp/dex/releases
- https://github.com/nginx/nginx/releases

Note that some of these may be pinned for various reasons; those reasons should be re-examined periodically to see if the issue requiring the pinning can be or has been resolved. Example: at the time of writing, `dex` is pinned at 2.19.0 as indicated here: https://github.com/chef/automate/blob/4b6d53641d3687bcf0f15303eed5b1dcd7eb251c/go.mod#L132-L133
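
Although there is no central dashboard, the Go toolchain itself can report which modules in `go.mod` have newer releases: `go list -m -u all` prints each module and, where an upgrade exists, the newer version in square brackets. The following is a rough sketch; `outdated_modules` is a hypothetical helper that keeps only the lines with an upgrade available.

```shell
# Hypothetical helper: read the output of `go list -m -u all` on stdin and
# keep only modules for which a newer version is reported. Such lines look
# like "<module path> <current version> [<newer version>]".
outdated_modules() {
  awk '$3 ~ /^\[.*\]$/ { print $1, $2, "->", substr($3, 2, length($3) - 2) }'
}

# Intended usage, from the repository root (requires the Go toolchain):
#   go list -m -u all | outdated_modules
```

The list includes indirect dependencies as well, so it is best read as a starting point for scheduling updates rather than a to-do list.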

### Front-end Updates

All of the following occurs in the context of components/automate-ui.
One can quickly identify any outdated packages using `npm outdated` on the command-line (from the automate-ui directory).
There are some packages that we deliberately pin (in automate-ui/package.json) for compatibility reasons.
These are documented in the "Dependency Management" section of automate-ui/README.md.
Be sure to review and update that document whenever updating any dependencies.

Notable in front-end updates are two things:

- Angular -- follow the releases at https://update.angular.io/ and follow the methodology outlined in https://github.com/chef/automate/pull/3082.
- Node -- follow the releases at https://nodejs.org/en/about/releases/ and follow the methodology outlined in https://github.com/chef/automate/pull/2579.

Apart from version numbers going out of date, either updating a package or NOT updating a package may still cause problems to appear.
Some such problems generate warnings rather than errors (when you run unit tests or lint, for example).
But warnings do not fail the build, so these would often go unresolved or even completely unnoticed for quite some time.
We therefore added a mechanism to treat warnings as errors and thus fail the build on any warnings.
This is in place for running both unit tests and lint in the UI (look for instances of build.sh in components/automate-ui/package.json).
But, alas, there are other problems that are not even reported as warnings (deprecations and such), and do NOT cause the build to fail.
Because of that, it is necessary for certain tasks to be checked periodically, specifically `make lint`, `make unit`, and `make e2e`.
At the time of writing, all of these are reporting issues that need to be resolved:

```text
make lint:
- CURRENT PROBLEM--does not fail build:
TSLint's support is discontinued and we're deprecating its support in Angular CLI.
To opt-in using the community driven ESLint builder, see: https://github.com/angular-eslint/angular-eslint#migrating-from-codelyzer-and-tslint.

make unit:
- CURRENT PROBLEM--does not fail build:
'karma-coverage-istanbul-reporter' usage has been deprecated since version 11.
Please install 'karma-coverage' and update 'karma.conf.js.' For more info, see https://github.com/karma-runner/karma-coverage/blob/master/README.md

make e2e:
- CURRENT PROBLEM--fails build when run locally, but not in Buildkite:
spawn Unknown system error -86
References:
https://stackoverflow.com/questions/65618558/osx-fix-selenium-chromedriver-launch-error-spawn-unknown-system-error-86-bad-cp
https://github.com/angular/webdriver-manager/issues/476

```
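
The idea of treating warnings as errors, mentioned above, can be illustrated with a small wrapper. This is only a sketch of the general technique and NOT the repository's actual `build.sh`; the function name `fail_on_warnings` and the choice to match the word "warning" case-insensitively are assumptions made for illustration.

```shell
# Hypothetical wrapper, NOT the repository's build.sh: run a command, echo
# its combined output, and fail if the command itself fails or if any
# output line contains the word "warning" (case-insensitive).
fail_on_warnings() {
  local output status
  output=$("$@" 2>&1)
  status=$?
  printf '%s\n' "$output"
  if [ "$status" -ne 0 ]; then
    return "$status"
  fi
  if printf '%s\n' "$output" | grep -qi 'warning'; then
    echo "warnings found; treating them as errors" >&2
    return 1
  fi
  return 0
}
```

For example, `fail_on_warnings npm run lint` would fail the step when lint prints a warning. As the surrounding text notes, deprecation notices that are not worded as warnings would still slip through such a check.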
1 change: 1 addition & 0 deletions cspell.json
@@ -119,6 +119,7 @@
"dehaze",
"deletable",
"denormalize",
"dependabot",
"dequeuer",
"dequeues",
"deregister",
40 changes: 40 additions & 0 deletions scripts/pipeline_info.sh
@@ -0,0 +1,40 @@
#!/usr/bin/env bash

# Pre-requisites:
#   gq from https://github.com/hasura/graphqurl
#   jq from https://stedolan.github.io/jq/
#   env var $BUILDKITE_TOKEN populated from https://buildkite.com/user/api-access-tokens

# Usage:
#   pipelines --blocked   # list blocked builds for chef/automate
#   pipelines --stale     # list pipelines whose latest build predates today
#   pipelines             # list every pipeline with the date of its latest build

pipelines() {
  # Keep variables local so sourcing this file does not pollute the caller's shell.
  local mode=$1
  # GraphQL query: the most recent build of each chef/automate master pipeline.
  local gql='query {
    organization(slug: "chef") {
      pipelines(last: 100, search: "[chef/automate:master]") {
        edges {
          node {
            name
            builds(first: 1) {
              edges {
                node {
                  branch state message createdAt
    } } } } } } } }'
  if [[ $mode == "--blocked" ]]; then
    # See https://buildkite.com/docs/apis/rest-api/builds for buildkite API
    # The URL is quoted so the shell never treats "?" as a glob character.
    curl -sH "Authorization: Bearer $BUILDKITE_TOKEN" "https://api.buildkite.com/v2/builds?state=blocked" | \
      jq -c '.[] | select(.pipeline.provider.settings.repository == "chef/automate") | { branch, created_at, state, pr:.pull_request.id, pipeline:.pipeline.name, message }'
  elif [[ $mode == "--stale" ]]; then
    # See https://buildkite.com/user/graphql/documentation for buildkite GraphQL API
    echo "Note: any reported pipelines are NOT necessarily problems"
    # createdAt is reported in UTC, so compare it against today's UTC date.
    gq https://graphql.buildkite.com/v1 -H "Authorization: Bearer $BUILDKITE_TOKEN" --query="$gql" | \
      jq -c --arg today "$(date -u +"%Y-%m-%d")" \
        '.data.organization.pipelines.edges[].node | .builds.edges[].node.createdAt as $createdAt | select($createdAt < $today) | { name:.name, createdAt:$createdAt }'
  else
    gq https://graphql.buildkite.com/v1 -H "Authorization: Bearer $BUILDKITE_TOKEN" --query="$gql" | \
      jq -c '.data.organization.pipelines.edges[].node | .builds.edges[].node.createdAt as $createdAt | { name:.name, createdAt:$createdAt }'
  fi
}