Add troubleshooting docs #638

Draft · wants to merge 7 commits into base: master
6 changes: 3 additions & 3 deletions docs/Development.md
@@ -15,7 +15,7 @@ This track is focused around the development of custom [Prometheus exporters](ht

We use [Helm](https://helm.sh) to provide an automated deployment and configuration experience for Pelorus. We are always doing work to cover more and more complex use cases with our helm charts. In order to be able to effectively contribute to these charts, you'll need a cluster that satisfies all of the installation prerequisites for Pelorus.

See the [Install guide](Install.md) for more details on that.
See the [installation guide](GettingStarted.md#installation) for more details.

Currently we have two charts:

@@ -292,7 +292,7 @@ Checkout the PR on top of your fork.

1. [Checkout](#checkout) the PR on top of your fork.

2. [Install Pelorus](Install.md) from checked out fork/branch.
2. [Install Pelorus](GettingStarted.md) from checked out fork/branch.

**NOTE:**

@@ -341,7 +341,7 @@ Each PR runs exporter tests in the CI systems, however those changes can be test

### Helm Install changes

For testing changes to the helm chart, you should just follow the [standard install process](Install.md), then verify that:
For testing changes to the helm chart, you should just follow the [standard install process](GettingStarted.md), then verify that:

* All expected pods are running and healthy
* Any expected behavior changes mentioned in the PR can be observed.
48 changes: 41 additions & 7 deletions docs/Install.md → docs/GettingStarted.md
@@ -1,9 +1,29 @@
# Getting Started

# Installation
## Basic Concepts

Pelorus presents various [_measures_](dashboards/SoftwareDeliveryPerformance.md#measures) to you, such as Lead Time for Change (how long it takes for a commit to wind up in production).

These measures are calculated from various _metrics_.
For example, the Lead Time for Change measure is calculated as the difference between the time of a deployment including that commit (`deploy_time`) and the time that commit was made (`commit_time`). A commit made at 09:00 and first deployed at 15:00 has a lead time of six hours.

These metrics are collected by _exporters_, which gather information from various sources.
For example, the `deploytime` exporter looks for running pods in OpenShift. The `committime` exporter looks for OpenShift Builds and correlates them with git commit information from various _providers_, such as GitHub or Bitbucket.
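
As a preview, here is a sketch of how exporter instances are declared in the exporters chart's values file. The structure shown is an assumption for illustration; the [Configuration Guide](Configuration.md) is authoritative.

```yaml
# Hypothetical values-file sketch declaring two exporter instances;
# verify the key names against the Pelorus exporters chart before use.
exporters:
  instances:
  - app_name: deploytime-exporter
    exporter_type: deploytime
  - app_name: committime-exporter
    exporter_type: committime
```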

### Preparing Your Data

<!-- TODO: explain how app names work in the first place -->

To properly collect various metrics, Pelorus needs to find certain metadata. In common cases, this metadata may already be there! If not, you will need to adjust how the relevant resources are created in OpenShift.

For now, we'll focus on deploy time: to capture deployments, `Pod`s and their `ReplicationController`s must be labeled with the _app name_. This is `app.kubernetes.io/name` by default, but can be [customized](Configuration.md#labels).

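For example, here is a minimal sketch of a `ReplicationController` whose pods carry the default app label (the name `my-app` and the image are hypothetical):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: my-app-1
  labels:
    app.kubernetes.io/name: my-app   # the app name on the ReplicationController
spec:
  replicas: 1
  selector:
    app.kubernetes.io/name: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-app   # ...and on the Pods it creates
    spec:
      containers:
      - name: my-app
        image: quay.io/example/my-app:latest
```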

## Installation

The following steps walk through deploying Pelorus.

## Prerequisites
### Prerequisites

Before deploying the tooling, you must have the following prepared:

@@ -14,7 +34,7 @@ Before deploying the tooling, you must have the following prepared
* jq
* git

## Initial Deployment
### Initial Deployment

Pelorus is installed via three Helm charts. The first deploys the operators on which Pelorus depends, the second deploys the core Pelorus stack, and the third deploys the exporters that gather the data. By default, the instructions below install into a namespace called `pelorus`, but you can choose any name you wish.

@@ -40,9 +60,7 @@ In a few seconds, you will see a number of resourced get created. The above comm
* The following exporters:
* Deploy Time

From here, some additional configuration is required in order to deploy other exporters, and make the Pelorus

See the [Configuration Guide](Configuration.md) for more information on exporters.
From here, some additional [configuration](Configuration.md) and [data preparation](#preparing-your-data-details) are required in order to deploy other exporters.

You may additionally want to enable other features for the core stack. Read on to understand those options.

@@ -107,7 +125,7 @@ If you don't have an object storage provider, we recommend [NooBaa](https://www.

By default, this tool will pull in data from the cluster in which it is running. The tool also supports collecting data across multiple OpenShift clusters: the Thanos sidecar can be configured to read from a shared S3 bucket across clusters. See [Pelorus Multi-Cluster Architecture](Architecture.md) for details. You define exporters for the desired metrics in each of the clusters whose metrics will be evaluated. The main cluster's Grafana dashboard will then display a combined view of the metrics collected in the shared S3 bucket via Thanos.

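As a rough sketch, the storage-related values for such a setup look like the following. The key names here are assumptions for illustration; check the [Configuration Guide](Configuration.md) for the authoritative names.

```yaml
# Hypothetical sketch of shared-bucket values for multi-cluster collection;
# verify the key names against the Pelorus chart before use.
thanos_bucket_name: <bucket name>
bucket_access_point: <S3 endpoint, shared across clusters>
bucket_access_key: <s3 access key>
bucket_secret_access_key: <s3 secret access key>
```
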
#### Configure Production Cluster.
### Configure Production Cluster

The following production configuration example uses one deploytime exporter, an AWS S3 bucket, and an AWS volume for Prometheus, and tracks deployments to production:

@@ -135,7 +153,23 @@ exporters:
    - pelorus-config
    - deploytime-config
```

## Preparing Your Data: Details

#### Commit Time

`Build`s must have a commit hash and repository URL associated with them.

The commit hash comes from either the build's `spec.revision.git.commit` (populated in Source to Image builds), or falls back to the [annotation](./Configuration.md#annotations-and-local-build-support) `io.openshift.build.commit.id`.

The repository URL comes from either the build's `spec.source.git.uri` (populated in Source to Image builds), or falls back to the [annotation](./Configuration.md#annotations-and-local-build-support) `io.openshift.build.source-location`.

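For example, a minimal sketch of a `Build` carrying this metadata through the fallback annotations (all names other than the two annotation keys are hypothetical):

```yaml
apiVersion: build.openshift.io/v1
kind: Build
metadata:
  name: my-app-build-1
  labels:
    app.kubernetes.io/name: my-app
  annotations:
    # used when spec.revision.git.commit is not populated:
    io.openshift.build.commit.id: cae392a
    # used when spec.source.git.uri is not populated:
    io.openshift.build.source-location: https://github.com/example/my-app
```
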
The commit time exporter(s) must be [configured](./Configuration.md#commit-time-exporter) to point to the proper git provider(s).

<!-- TODO: info about image exporter -->

#### Failure Time

Metadata from issue trackers is provider-specific. See [Failure Time Exporter Configuration](./Configuration.md#failure-time-exporter) for details.

## Uninstalling

4 changes: 2 additions & 2 deletions docs/Noobaa.md
@@ -2,7 +2,7 @@

NooBaa is a software-driven data service that provides an S3 object-storage interface, which we use for testing and development of the Pelorus project.

The following is a walkthrough for deploying NooBaa Operator on OpenShift and then configuring Pelorus to consume it as a [Long Term Storage](Install.md#configure-long-term-storage-recommended) solution.
The following is a walkthrough for deploying NooBaa Operator on OpenShift and then configuring Pelorus to consume it as a [Long Term Storage](GettingStarted.md#configure-long-term-storage-recommended) solution.

## Install NooBaa Operator CLI

@@ -105,7 +105,7 @@ noobaa bucket status thanos --namespace pelorus

## Update Pelorus Configuration

To update our Pelorus stack, follow the instructions provided in the [Long Term Storage](Install.md#configure-long-term-storage-recommended).
To update our Pelorus stack, follow the instructions provided in the [Long Term Storage](GettingStarted.md#configure-long-term-storage-recommended) section.

Ensure that the `<s3 access key>`, `<s3 secret access key>`, and `<bucket name>` from the [Deploy NooBaa](#deploy-noobaa) step are used, and that `s3.pelorus.svc:443` (the `S3 InternalDNS Address` from the `noobaa status --namespace pelorus` command) is used as the bucket access point, as in the example:
10 changes: 10 additions & 0 deletions docs/Troubleshooting.md
@@ -0,0 +1,10 @@
# Troubleshooting

## Information not showing up in dashboard

We've included a troubleshooting script to check whether your data is labeled correctly,
as required for [the deploy time exporter](GettingStarted.md#preparing-your-data)
and [the other exporters](GettingStarted.md#preparing-your-data-details).

With a [local dev environment](Development.md#dev-environment-setup) set up,
run `./scripts/troubleshooting/missing_labels -h` for information about how to use it.
4 changes: 3 additions & 1 deletion docs/dashboards/SoftwareDeliveryPerformance.md
@@ -2,6 +2,9 @@

_Software Delivery Performance_ is a measure of an organization's ability to effectively deliver software-based products they have built to their customers. It comprises four _measures_ that provide a balanced perspective, taking both speed to market and stability into account. Tracking _Software Delivery Performance_ over time provides IT organizations with data they can use to make smarter investments in their internal tools and processes, optimizing their delivery processes based on the types of products they are delivering. This outcome provides a bridge between development, operations, and leadership, allowing them to better communicate about whether proposed work on infrastructure improvements or process development is in line with the overall vision and financial goals of the organization at large.


## Measures

![Software Delivery Performance dashboard](../img/sdp-dashboard.png)

The Pelorus _Software Delivery Performance_ dashboard tracks the four primary measures of software delivery:
@@ -13,7 +16,6 @@ The Pelorus _Software Delivery Performance_ dashboard tracks the four primary me

For more information about Software Delivery Performance, check out the book [Accelerate](https://itrevolution.com/book/accelerate/) by Forsgren, Kim and Humble.

## Measures

![Exporter relationship diagram](../img/exporter-relationship-diagram.png)

5 changes: 4 additions & 1 deletion mkdocs.yml
@@ -4,16 +4,19 @@ theme: readthedocs
markdown_extensions:
- def_list
- tables
- toc:
    permalink: True
nav:
- Introduction:
  - Welcome to Pelorus: index.md
  - Our Philosophy: Philosophy.md
- Using Pelorus:
  - Getting Started: GettingStarted.md
  - Architecture: Architecture.md
  - Installation: Install.md
  - Demo: Demo.md
  - Configuration: Configuration.md
  - NooBaa for Long Term Storage: Noobaa.md
  - Troubleshooting: Troubleshooting.md
- Dashboards:
  - Dashboard Summary: Dashboards.md
  - Software Delivery Performance: dashboards/SoftwareDeliveryPerformance.md
20 changes: 20 additions & 0 deletions scripts/troubleshooting/missing_labels.py
@@ -15,6 +15,10 @@
import pelorus.utils
from pelorus.utils import paginate_resource

DOCS_BASE_URL = "https://pelorus.readthedocs.io/en/stable/"
DEPLOYTIME_PREPARE_DATA_URL = DOCS_BASE_URL + "GettingStarted#preparing-your-data"
COMMITTIME_PREPARE_DATA_URL = DOCS_BASE_URL + "GettingStarted#commit-time"

# A NOTE ON TERMINOLOGY:
# what you might call a "resource" in openshift is called a ResourceInstance by the client.
# to the client, a Resource is its "type definition".
@@ -197,6 +201,10 @@ class DeploytimeTroubleshootingReport:
    pods_missing_app_label: list[PodId]
    replicators_missing_app_label: dict[ReplicatorId, OwnedPods]

    @property
    def anything_to_report(self):
        return self.pods_missing_app_label or self.replicators_missing_app_label

    def _print_pods(self):
        if not self.pods_missing_app_label:
            print("No pods were missing the app label", self.app_label)
@@ -216,10 +224,16 @@ def _print_replicators(self):
        for replicator in self.replicators_missing_app_label:
            print(" ", replicator.kind_, replicator.name)

    def _print_suggestion(self):
        print(f"Add the label {self.app_label}.")
        print("See", DEPLOYTIME_PREPARE_DATA_URL)

    def print_human_readable(self):
        self._print_pods()
        print()
        self._print_replicators()
        if self.anything_to_report:
            self._print_suggestion()

    def to_json(self) -> dict:
        pods_missing_label = [pod.name for pod in self.pods_missing_app_label]
@@ -253,6 +267,10 @@ class CommittimeTroubleshootingReport:

    builds_missing_app_label: list[BuildId]

    @property
    def anything_to_report(self):
        return bool(self.builds_missing_app_label)

    def print_human_readable(self):
        if not self.builds_missing_app_label:
            print("No builds were missing the app label", self.app_label)
@@ -262,6 +280,8 @@ def print_human_readable(self):
        for build in self.builds_missing_app_label:
            print(build.name)

    # TODO: app label committime docs?

    def to_json(self) -> dict:
        return dict(
            namespace=namespace,