chef · msorens · Feb 6, 2021 · Feb 2, 2021 · Feb 2, 2021 · Feb 4, 2021
diff --git a/HOW_THE_PIPELINE_WORKS.md b/HOW_THE_PIPELINE_WORKS.md
@@ -11,7 +11,7 @@ The following is a high-level outline of the various stages of our
 build and test pipeline.  It is not complete but lists the major
 steps. For a more complete diagram see RELEASE_PROCESS.md:
 
-```
+```text
                           +----------------+
        +----------+       | Buildkite      |       +-----------+       +-----------+      +----------+    +-------------+     +---------------+
        | Pull     |       |   Verify       |       |   Pull    |       | User and  |      | Habitat  |    | Packages    |     | deploy/dev    |
@@ -84,9 +84,9 @@ and start actual Automate Habitat packages.
 
 These tests comes in two different flavors:
 
-* Studio Tests:
+#### Studio Tests
 
-```
+```yaml
   - label: "[unit] secrets-service"
     command:
       - hab studio run "source .studiorc && go_component_unit secrets-service && go_component_static_tests secrets-service"
@@ -107,9 +107,9 @@ To understand these tests, you need to read the related studio
 functions. The dev studio is defined in a series of shell functions in
 `.studiorc` and the `.studio` directory.
 
-* "Makefile" tests:
+#### "Makefile" tests
 
-```
+```yaml
   - label: "[unit] automate lib"
     command:
       - scripts/install_golang
@@ -134,7 +134,7 @@ integration tests run. In the first phase of this pipeline, we build
 all of the packages that changed as part of the pull request. This
 happens in the build step:
 
-```
+```yaml
   - label: build
     command:
       - scripts/verify_build
@@ -165,7 +165,7 @@ The tests in this pipeline come in two primary flavors:
 
 These tests use the Habitat dev studio. They look like this:
 
-```
+```yaml
   - label: "[integration] ingest-service"
     command:
       - . scripts/verify_setup
@@ -191,7 +191,7 @@ and then any related code that those functions may call.
 
 These tests use our integration test framework. They look like this:
 
-```
+```yaml
   - label: "[integration] deep upgrades"
     command:
       - integration/run_test integration/tests/deep_upgrade.sh
@@ -305,7 +305,9 @@ This pipeline builds any packages that have changed in the most recent
 commit. The buildkite pipeline definition itself is autogenerated
 using the data found in `.bldr.toml`. `.bldr.toml` is generated by
 
+```shell
     go run tools/bldr-config-gen
+```
 
 These packages are then be uploaded to the Habitat depot and promoted
 into the dev channel.
@@ -319,7 +321,7 @@ The manifest is created using the `.expeditor/create-manifest.rb`
 script.
 
 Expeditor posts the outcome of this build and upload process into
-#a2-notify. In the case of failure, Expeditor will
+`#a2-notify`. In the case of failure, Expeditor will
 include a link to the log file that contains the full output of the
 action that failed.
 
@@ -356,3 +358,99 @@ deploys; however, failures here are still taken seriously and
 investigated.
 
 Our nightly pipeline is defined in `.expeditor/nightly.pipeline.yml`
+
+## Pipeline Monitoring
+
+There are a handful of processing pipelines that work together to achieve our continuous integration environment,
+covering everything from validating a pull request to releasing code to customers.
+These pipelines need to be monitored on a DAILY basis for any anomalies.
+
+### Monitoring for FAILED Runs
+
+There are two dashboards that show these pipelines in the browser.
+The rightmost bar in each spark chart indicates the latest status.
+Any that are red indicate a failure.
+Some of these failures, though, are routine and require no action from someone monitoring them.
+For example, the verify/private pipeline (https://buildkite.com/chef/chef-automate-master-verify-private) will report failing unit tests on a branch for an individual pull request, indicating only that the author has a bit more work to finish their PR.
+But if that same pipeline fails on `master`, that may indicate a problem elsewhere that needs remediation.
+At the time of writing, only one pipeline is under "chef-oss"; all the rest are under "chef":
+
+- Chef pipelines: https://buildkite.com/chef?filter=chef%2Fautomate%3A
+- Chef OSS pipelines: https://buildkite.com/chef-oss?filter=chef%2Fautomate
+
+### Monitoring for BLOCKED Runs
+
+The dashboards mentioned above indicate the success or failure of each run; they do NOT indicate any pipelines that are blocked.
+Thus, you either have to check builds one by one or run the following shell function to check them all.
+Run `pipelines --blocked` after sourcing `scripts/pipeline_info.sh` and satisfying its prerequisites.
+
+Example output:
+
+```json
+{"branch":"Amol/project_and_roles_changes","created_at":"2021-01-28T09:38:49.345Z","state":"passed","pr":"4393","pipeline":"[chef/automate:master] verify_private","message":"Added Id testcase for policy.\n\nSigned-off-by: samshinde <ashinde@chef.io>"}
+{"branch":"Himanshi/client_UI_changes","created_at":"2021-01-28T08:42:51.369Z","state":"passed","pr":"4661","pipeline":"[chef/automate:master] verify_private","message":"Addedd CSS changes in client details.\n\nSigned-off-by: Himanshi Chhabra <himanshi.chhabra@msystechnologies.com>"}
+```
+
+Occasionally you will see a false positive if someone resolved the blockage by just starting a new build instead of unblocking the original blocked build.
+Example: https://buildkite.com/chef/chef-automate-master-verify-private/builds?branch=Himanshi/client_UI_changes
+
+### Monitoring for STALE Runs
+
+The dashboards mentioned above indicate the success or failure of each run; they do NOT indicate any pipelines that have stopped running.
+Thus, you either have to check the date on each pipeline's last build one by one or run the following shell function to check them all.
+Run `pipelines --stale` after sourcing `scripts/pipeline_info.sh` and satisfying its prerequisites.
+The output itemizes all pipelines whose most recent build was created before today.
+That does not, however, mean each has a problem; some pipelines simply run less frequently.
+
+Example output:
+
+```json
+{"name":"[chef/automate:master] post-promote","createdAt":"2021-02-01T09:00:52.228Z"}
+{"name":"[chef/automate:master] deploy/ui-library","createdAt":"2020-12-23T04:04:33.035Z"}
+{"name":"[chef/automate:master] habitat/build","createdAt":"2021-02-01T19:46:53.664Z"}
+{"name":"[chef/automate:master] deploy/acceptance","createdAt":"2021-01-30T16:58:23.482Z"}
+```
+
+`post-promote` and `habitat/build`, for instance, only run after PRs that create a new deliverable.
+That excludes PRs that touch only internal docs, build files, and the like.
+So if there are no PRs at all on a given day, or only PRs that update e.g. internal docs, then those pipelines will not run that day.
+Another example: `deploy/acceptance` will only run on days that we promote code to acceptance.
+
+So this list is a starting point of things you might need to check, compared to the whole list (run `pipelines` with no arguments).
+
+### Monitoring Other Pipeline Issues
+
+Some issues will show up as notifications in the `#a2-notify` Slack channel.
+
+- Problems reported by Buildkite itself show a "thumbs-down" indicator when they need attention.
+- Other problems, like those reported by Netlify, will require reading the text, e.g. "Deploy did not complete for chef-automate (deploy preview 4676)".
+- Still others, like those from Semgrep, report only problems, so each one requires attention. (In the Semgrep case, it depends: Semgrep reports issues on master, but it also reports issues in open PRs; the latter are up to the developer to resolve.)
+
+## Ancillary Monitoring
+
+Hand-in-hand with pipeline monitoring is monitoring of other crucial aspects of the health of the codebase.
+
+### Dependabot
+
+GitHub provides built-in support for Dependabot security updates.
+Prerequisites to have a repository scanned are detailed [here](https://docs.github.com/en/github/managing-security-vulnerabilities/configuring-dependabot-security-updates#supported-repositories).
+The Chef Automate repository satisfies the prerequisites and has Dependabot updates enabled; configuration settings are available [here](https://github.com/chef/automate/settings/security_analysis), though only admins have access to that link.
+You can also view [past alerts](https://github.com/chef/automate/security/dependabot) on the dashboard as well as [current and past PRs](https://github.com/chef/automate/pulls?q=is%3Apr+dependabot+) automatically generated by Dependabot.
+
+Periodically you will see pull requests automatically generated by Dependabot--these need to be reviewed and merged like any other PR, in a timely manner.
+
+### Semgrep
+
+Semgrep (short for _Semantic Grep_) is a static code analyzer that checks for both security issues and best-practice issues.
+This is integrated into the Chef Automate workflows and runs:
+(a) a differential scan for every PR when opened (a task in the [verify_private pipeline](https://buildkite.com/chef/chef-automate-master-verify-private))
+(b) a differential scan for every PR when merged (same task)
+(c) a full code base scan every day (a task in the [nightly pipeline](https://buildkite.com/chef/chef-automate-master-nightly))
+
+Semgrep findings are detailed in the output log of those Buildkite tasks every time they run.
+Most significantly, any new findings generate Slack notifications in the `#a2-notify` channel, so the channel should be monitored closely for any such reports.
+Semgrep also provides a [dashboard](https://semgrep.dev/manage/findings?repo=chef%2Fautomate-nightly&ref_type=branch&ref=master&state_type=new&tab=findings) where you can see not only all current issues on master, but trends and past history on all branches.
+Note that, at the time of writing, there are about 21 open findings, all of them issues that need to be resolved. But it would be counter-productive to generate Slack notifications for each of those every day--no one would pay any attention. That is why notifications occur only for new findings.
+
+Semgrep can also be run locally (through several Makefile entries), either on the whole code base or on individual components within the code base.
+More details on this will be documented in the near future.
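
The `pipelines --stale` check described under "Monitoring for STALE Runs" compares each build's `createdAt` timestamp with today's date as plain strings. That works because ISO 8601 fields run from most to least significant, so string order matches chronological order. The following is a minimal sketch of the same comparison in plain bash, assuming both values are expressed in UTC. `is_stale` is a hypothetical helper and is not part of `scripts/pipeline_info.sh`; the second timestamp in the loop is made up for illustration.

```shell
#!/usr/bin/env bash

# is_stale CREATED_AT TODAY
# Hypothetical helper (not part of scripts/pipeline_info.sh).
# Succeeds when the ISO 8601 timestamp CREATED_AT sorts before the
# YYYY-MM-DD date TODAY. Both values are compared as plain strings.
is_stale() {
  local created_at=$1 today=$2
  [[ "$created_at" < "$today" ]]
}

# The first timestamp comes from the sample output in the doc; the second
# is invented. The check is assumed to run on 2021-02-02.
today="2021-02-02"
for created_at in "2021-02-01T19:46:53.664Z" "2021-02-02T08:15:00.000Z"; do
  if is_stale "$created_at" "$today"; then
    echo "$created_at stale"
  else
    echo "$created_at fresh"
  fi
done
```

A timestamp from earlier on the same day is never reported, because it begins with today's date and is longer, so it sorts after the bare date.
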
diff --git a/HOW_TO_MAINTAIN_A2.md b/HOW_TO_MAINTAIN_A2.md
@@ -0,0 +1,77 @@
+# Maintaining Automate
+
+The Automate code base needs continual maintenance to remain robust and fresh.
+
+Who owns this task? YOU.
+Every engineer should own this, taking the initiative to either fix an issue or at least report it to those who can when issues arise.
+
+There are two main aspects to this:
+
+- Proactive: keeping up with third-party library releases -- covered in this document.
+- Reactive: addressing system or pipeline issues that come up due to system changes, library changes, environment changes, etc. -- covered in the separate HOW_THE_PIPELINE_WORKS.md document.
+
+## Third-party Dependency Update Monitoring
+
+One should proactively monitor releases of third-party libraries and re-sync our code base with them as soon as scheduling allows.
+Automate uses two main languages, Go (back-end) and TypeScript (front-end), each with its own complex ecosystem.
+
+### Back-end Updates
+
+There is no central place to check on all dependencies.
+The following, though, lists some of the key Golang dependencies--this is NOT a complete list by any means.
+One can set up a "watch" on each of these repositories to be automatically notified of new releases that can then be scheduled as technical debt items to be done in a timely fashion.
+
+- https://github.com/golang/go/releases
+- https://github.com/grpc/grpc-go/releases
+- https://github.com/golang/protobuf/releases
+- https://github.com/open-policy-agent/opa/releases
+- https://github.com/grpc-ecosystem/grpc-gateway/releases
+- https://github.com/golang/protobuf/protoc-gen-go/releases
+- https://github.com/stretchr/testify/releases
+- https://github.com/lib/pq/releases
+- https://github.com/dexidp/dex/releases
+- https://github.com/nginx/nginx/releases
+
+Note that some of these may be pinned for various reasons; those reasons should be re-examined periodically to see if the issue requiring the pinning can be or has been resolved. Example: at the time of writing, `dex` is pinned at 2.19.0 as indicated here: https://github.com/chef/automate/blob/4b6d53641d3687bcf0f15303eed5b1dcd7eb251c/go.mod#L132-L133
+
+### Front-end Updates
+
+All of the following occurs in the context of components/automate-ui.
+One can quickly identify any outdated packages using `npm outdated` on the command-line (from the automate-ui directory).
+There are some packages that we deliberately pin (in automate-ui/package.json) for compatibility reasons.
+These are documented in the "Dependency Management" section of automate-ui/README.md.
+Be sure to review and update that document whenever updating any dependencies.
+
+Notable in front-end updates are two things:
+
+- Angular -- follow the releases at https://update.angular.io/ and follow the methodology outlined in https://github.com/chef/automate/pull/3082.
+- Node -- follow the releases at https://nodejs.org/en/about/releases/ and follow the methodology outlined in https://github.com/chef/automate/pull/2579.
+
+Separate from watching for out-of-date version numbers, either updating a package or NOT updating a package may still cause problems to appear.
+Some such problems generate warnings rather than errors (when you run unit tests or lint, for example).
+But warnings do not fail the build, so these would often go unresolved or even completely unnoticed for quite some time.
+We therefore added a mechanism to treat warnings as errors and thus fail the build on any warnings.
+This is in place for running both unit tests and lint in the UI (look for instances of build.sh in components/automate-ui/package.json).
+But, alas, there are other problems that are not even reported as warnings (deprecations and such), and do NOT cause the build to fail.
+Because of that, it is necessary for certain tasks to be checked periodically, specifically `make lint`, `make unit`, and `make e2e`.
+At the time of writing, all of these are reporting issues that need to be resolved:
+
+```text
+make lint:
+  - CURRENT PROBLEM--does not fail build:
+      TSLint's support is discontinued and we're deprecating its support in Angular CLI.
+      To opt-in using the community driven ESLint builder, see: https://github.com/angular-eslint/angular-eslint#migrating-from-codelyzer-and-tslint.
+
+make unit:
+  - CURRENT PROBLEM--does not fail build:
+      'karma-coverage-istanbul-reporter' usage has been deprecated since version 11.
+      Please install 'karma-coverage' and update 'karma.conf.js.' For more info, see https://github.com/karma-runner/karma-coverage/blob/master/README.md
+
+make e2e:
+  - CURRENT PROBLEM--fails build when run locally, but not in Buildkite:
+      spawn Unknown system error -86
+      References:
+        https://stackoverflow.com/questions/65618558/osx-fix-selenium-chromedriver-launch-error-spawn-unknown-system-error-86-bad-cp
+        https://github.com/angular/webdriver-manager/issues/476
+
+```
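
The "treat warnings as errors" mechanism mentioned in the front-end section is implemented by the project's `build.sh`, which is not shown in this diff. As a rough illustration of the idea only, the sketch below wraps an arbitrary command and fails when its output contains the word "warning". `fail_on_warnings` is a hypothetical name, and the case-insensitive match is an assumption; the real script may detect warnings differently.

```shell
#!/usr/bin/env bash

# fail_on_warnings COMMAND [ARGS...]
# Hypothetical wrapper: runs COMMAND, echoes its combined stdout/stderr,
# and returns non-zero if the command itself failed OR if any output line
# contains "warning" (case-insensitive).
fail_on_warnings() {
  local output status
  output=$("$@" 2>&1)
  status=$?
  printf '%s\n' "$output"
  if (( status != 0 )); then
    return "$status"
  fi
  if grep -qi 'warning' <<<"$output"; then
    echo "warnings found; treating as failure" >&2
    return 1
  fi
  return 0
}
```

It could be invoked as, for example, `fail_on_warnings npm run lint`. As the doc notes, deprecation notices that are not labeled as warnings would still slip through a check like this, which is why the periodic manual runs remain necessary.
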
diff --git a/cspell.json b/cspell.json
@@ -119,6 +119,7 @@
         "dehaze",
         "deletable",
         "denormalize",
+        "dependabot",
         "dequeuer",
         "dequeues",
         "deregister",

diff --git a/scripts/pipeline_info.sh b/scripts/pipeline_info.sh
@@ -0,0 +1,40 @@
+#!/usr/bin/env bash
+
+# Pre-requisites:
+# gq from https://github.com/hasura/graphqurl
+# jq from https://stedolan.github.io/jq/
+# env var $BUILDKITE_TOKEN populated from https://buildkite.com/user/api-access-tokens
+
+# Usage:
+# pipelines --blocked
+# pipelines --stale
+# pipelines
+
+pipelines() {
+  local mode=$1 gql
+  gql='query {
+    organization(slug: "chef") {
+      pipelines(last: 100, search: "[chef/automate:master]") {
+        edges {
+          node {
+            name
+            builds(first: 1) {
+              edges {
+                node {
+                  branch state message createdAt
+                } } } } } } } }'
+  if [[ $mode == "--blocked" ]]; then
+    # See https://buildkite.com/docs/apis/rest-api/builds for buildkite API
+    curl -sH "Authorization: Bearer $BUILDKITE_TOKEN" "https://api.buildkite.com/v2/builds?state=blocked" | \
+      jq -c '.[] | select(.pipeline.provider.settings.repository == "chef/automate") | { branch, created_at, state, pr:.pull_request.id, pipeline:.pipeline.name, message }'
+  elif [[ $mode == "--stale" ]]; then
+    # See https://buildkite.com/user/graphql/documentation for buildkite GraphQL API
+    echo "Note: any reported pipelines are NOT necessarily problems"
+    gq https://graphql.buildkite.com/v1 -H "Authorization: Bearer $BUILDKITE_TOKEN" --query="$gql" | \
+      jq -c --arg today "$(date -u +"%Y-%m-%d")" \
+        '.data.organization.pipelines.edges[].node | .builds.edges[].node.createdAt as $createdAt | select($createdAt < $today) | { name:.name, createdAt:$createdAt }'
+  else
+    gq https://graphql.buildkite.com/v1 -H "Authorization: Bearer $BUILDKITE_TOKEN" --query="$gql" | \
+      jq -c '.data.organization.pipelines.edges[].node | .builds.edges[].node.createdAt as $createdAt | { name:.name, createdAt:$createdAt }'
+  fi
+}
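
The script lists its prerequisites only in comments, so a missing tool or an unset token surfaces as an opaque error from `gq`, `jq`, or `curl`. The sketch below shows one way such a guard might look. Both `missing_tools` and `check_pipeline_prereqs` are hypothetical names and are not part of the script above.

```shell
#!/usr/bin/env bash

# missing_tools TOOL...
# Hypothetical helper: prints the name of each TOOL not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# check_pipeline_prereqs
# Hypothetical guard for the prerequisites named in the script header:
# gq, jq, and curl on PATH, plus a non-empty BUILDKITE_TOKEN.
check_pipeline_prereqs() {
  local missing
  missing=$(missing_tools gq jq curl)
  if [[ -n $missing ]]; then
    echo "missing tools: ${missing//$'\n'/ }" >&2
    return 1
  fi
  if [[ -z ${BUILDKITE_TOKEN:-} ]]; then
    echo "BUILDKITE_TOKEN is not set" >&2
    return 1
  fi
}
```

Calling `check_pipeline_prereqs || return 1` as the first line of `pipelines` would stop early with a readable message.
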