feat: add monitor to operator, fix monitoring setup (#256)
## Description

This adds a new section to the `Package` spec, `monitor`. This functionality is briefly documented
[here](https://github.com/defenseunicorns/uds-core/blob/operator-monitor-magic/docs/MONITOR.md).
Also included is a mutation for all service monitors to handle them in the expected Istio mTLS way.
This PR also fixes some missing dashboards and enables all default dashboards from the upstream
kube-prometheus-stack chart (this may be excessive?).

Currently monitored:
- Prometheus stack (operator, self, alertmanager)
- Loki
- kube-system things (kubelet, coredns, apiserver)
- Promtail
- Metrics-server
- Velero
- Keycloak
- Grafana
- All Istio Envoy proxies (via a PodMonitor)

Not added here:
- NeuVector: Currently has limited config options with regards to auth,
keeping this disabled in anticipation of SSO on NeuVector, with a desire
to contribute upstream to enable our use case of SSO-only auth.

In addition, this PR switches single-package testing to use/add Istio.
This appears to add around 1 minute to each pipeline run, but:
- allows us to test the Istio VirtualService endpoints in single-package checks
(there are some quirks with this in the current state due to SSO redirects, but it will
allow us to do e2e testing in those pipelines in the future)
- allows us to always assume Istio and not build bespoke Pepr code for
the specific no-Istio scenario (e.g. Prometheus can assume certs are present)

This would still allow someone to run locally without Istio for some packages, un-inject,
experiment with mTLS, etc., which may still be useful for identifying where Istio is causing
problems. This just switches our CI posture so that we always assume Istio.

## Related Issue

Fixes #17

## Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide Steps](https://github.com/defenseunicorns/uds-template-capability/blob/main/CONTRIBUTING.md#submitting-a-pull-request) followed

---------

Co-authored-by: Tristan Holaday <40547442+TristanHoladay@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
3 people committed Apr 26, 2024
1 parent 1822bca commit bf67722
Showing 38 changed files with 1,420 additions and 139 deletions.
1 change: 1 addition & 0 deletions .yamllint
@@ -8,6 +8,7 @@ ignore:
  - '**/.terraform/**'
  - '**/chart/templates**'
  - 'node_modules/**'
  - 'dist/**'

rules:
  anchors: enable
50 changes: 50 additions & 0 deletions docs/MONITOR.md
@@ -0,0 +1,50 @@
# Monitoring / Metrics Scraping in UDS Core

UDS Core leverages Pepr to handle setup of Prometheus metrics scraping, with the particular configuration necessary to work in a STRICT mTLS (Istio) environment. We handle this with both mutations of existing service monitors and generation of service monitors via the `Package` CR.

## Mutations

All service monitors are mutated to set the scrape scheme to HTTPS and set the TLS Config to what is required for Istio mTLS scraping (see [this doc](https://istio.io/latest/docs/ops/integrations/prometheus/#tls-settings) for details). Beyond this, no other fields are mutated. Supporting existing service monitors is useful since some charts include service monitors by default with more advanced configurations, and it is in our best interest to enable those and use them where possible.

Assumptions are made about STRICT mTLS here for simplicity, based on the `istio-injection` namespace label. Without making these assumptions we would need to query `PeerAuthentication` resources or another resource to determine the exact workload mTLS posture.

Note: This mutation is the default behavior for all service monitors but can be skipped using the annotation key `uds/skip-sm-mutate` (with any value). Skipping this mutation should only be done if your service exposes metrics on a PERMISSIVE mTLS port.
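
For reference, a mutated endpoint ends up looking roughly like the sketch below. This is a minimal illustration rather than the literal output of the mutation; the certificate paths shown follow the Istio Prometheus integration doc linked above and are an assumption here:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-metrics # hypothetical name
  namespace: example
spec:
  selector:
    matchLabels:
      app: example
  endpoints:
    - port: metrics
      # Fields set by the mutation for Istio mTLS scraping:
      scheme: https
      tlsConfig:
        caFile: /etc/prom-certs/root-cert.pem
        certFile: /etc/prom-certs/cert-chain.pem
        keyFile: /etc/prom-certs/key.pem
        # Hostname verification is skipped; workload identity comes from the mTLS certs
        insecureSkipVerify: true
```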

## Package CR `monitor` field

UDS Core also supports generating service monitors from the `monitor` list in the `Package` spec. Charts do not always support service monitors, so generating them can be useful. This also provides a simplified way for other users to create service monitors, similar to the way we handle `VirtualServices` today. A full example of this can be seen below:

```yaml
...
spec:
  monitor:
    - selector: # Selector for the service to monitor
        app: foobar
      portName: metrics # Name of the port to monitor
      targetPort: 1234 # Corresponding target port on the pod/container (for network policy)
      # Optional properties depending on your application
      description: "Metrics" # Add to customize the service monitor name
      podSelector: # Add if pod labels are different than `selector` (for network policy)
        app: barfoo
      path: "/mymetrics" # Add if metrics are exposed on a different path than "/metrics"
```

This config is used to generate service monitors and corresponding network policies to set up scraping for your applications. The `ServiceMonitor`s will go through the mutation process to add `tlsConfig` and `scheme`, allowing them to work in an Istio environment.

This spec intentionally does not support all options available with a `ServiceMonitor`. While we may add additional fields in the future, we do not want to simply rebuild the `ServiceMonitor` spec since mutations are already available to handle Istio specifics. The current subset of spec options is based on the bare minimum necessary to craft resources.

NOTE: While this is a rather verbose spec, each of the above fields is strictly required to craft the necessary service monitor and network policy resources. An illustrative sketch of the generated network policy follows below.
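
For illustration, the `monitor` entry above would also result in a network policy roughly like the following. This is a hand-written approximation of what gets generated (the name and label keys are assumptions), intended only to show the shape of the allow rule from Prometheus to the target pods:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-foobar-metrics-scraping # hypothetical name
  namespace: example
spec:
  # podSelector from the monitor entry (falls back to `selector` when `podSelector` is not set)
  podSelector:
    matchLabels:
      app: barfoo
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
          podSelector:
            matchLabels:
              app: prometheus
      ports:
        - port: 1234 # targetPort from the monitor entry
          protocol: TCP
```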

## Notes on Alternative Approaches

In coming up with this feature, a few alternative approaches were considered but not chosen due to issues with each one. The current spec provides the best balance of a simplified interface (compared to the `ServiceMonitor` spec) and a faster/easier reconciliation loop.

### Generation based on service lookup

An alternative spec option would use the service name instead of selectors/port name. The service name could then be used to look up the corresponding service and get the necessary selectors/port name (based on the numerical port). There are, however, two issues with this route:
1. There is a timing issue if the `Package` CR is applied to the cluster before the app chart itself (which is the norm with our UDS Packages). The service would not exist at the time the `Package` is reconciled. We could lean into eventual consistency here if we implemented a retry mechanism for the `Package`, which would mitigate this issue.
2. We would need an "alert" mechanism (watch) to notify us when the service(s) are updated, to roll the corresponding updates to network policies and service monitors. While this is doable, it feels like unnecessary complexity compared to other options.

### Generation of service + monitor

Another alternative approach would be to use a pod selector and port only. We would then generate both a service and a service monitor, giving us full control of the port names and selectors. This seems like a viable path, but it does add an extra resource for us to generate and manage. There could be unknown side effects of generating services that could clash with other services (particularly with Istio endpoints). This would otherwise be a relatively straightforward approach and is worth evaluating again if we want to simplify the spec later on; a sketch of what it might look like follows.
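
As a thought experiment, this alternative would have generated a paired service and service monitor from just a pod selector and port, along these lines (purely hypothetical resources, not part of this design):

```yaml
# Hypothetical generated Service owned by the operator
apiVersion: v1
kind: Service
metadata:
  name: foobar-generated-metrics
  namespace: example
  labels:
    uds/generated-monitor: foobar # hypothetical label used to pair the two resources
spec:
  selector:
    app: foobar # pod selector from the spec
  ports:
    - name: metrics
      port: 1234
      targetPort: 1234
---
# Paired ServiceMonitor selecting the generated Service by its label
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: foobar-generated-metrics
  namespace: example
spec:
  selector:
    matchLabels:
      uds/generated-monitor: foobar
  endpoints:
    - port: metrics
```

The extra Service here is exactly the additional resource (and potential source of clashes with existing services and Istio endpoints) called out above.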
4 changes: 4 additions & 0 deletions pepr.ts
@@ -5,6 +5,7 @@ import cfg from "./package.json";
import { istio } from "./src/pepr/istio";
import { operator } from "./src/pepr/operator";
import { policies } from "./src/pepr/policies";
import { prometheus } from "./src/pepr/prometheus";

new PeprModule(cfg, [
  // UDS Core Operator
@@ -15,4 +16,7 @@ new PeprModule(cfg, [

  // Istio service mesh
  istio,

  // Prometheus monitoring stack
  prometheus,
]);
7 changes: 7 additions & 0 deletions src/grafana/chart/templates/uds-package.yaml
@@ -10,6 +10,13 @@ spec:
      redirectUris:
        - "https://grafana.admin.{{ .Values.domain }}/login/generic_oauth"

  monitor:
    - selector:
        app.kubernetes.io/name: grafana
      targetPort: 3000
      portName: service
      description: Metrics

  network:
    expose:
      - service: grafana
22 changes: 15 additions & 7 deletions src/grafana/tasks.yaml
@@ -8,10 +8,18 @@ tasks:
            name: "app.kubernetes.io/instance=grafana"
            namespace: grafana
            condition: Ready
      # todo: Fix single package validation checks in CI where Istio isn't installed
      # - description: Validate grafana interface
      #   wait:
      #     network:
      #       protocol: https
      #       address: grafana.admin.uds.dev
      #       code: 200
      - description: Validate grafana virtual service
        cmd: |
          if [ "$(curl -isS https://grafana.admin.uds.dev --output /dev/null -w '%{http_code}')" = "302" ]; then
            echo "Grafana is up and running."
          else
            echo "ERROR: Grafana returned a $(curl -isS https://grafana.admin.uds.dev --output /dev/null -w '%{http_code}') code."
            exit 1
          fi
          if curl -L -isS https://grafana.admin.uds.dev --output /dev/null -w '%{url_effective}' | grep "sso.uds.dev" 2>&1 1>/dev/null; then
            echo "Grafana is redirecting to SSO as expected."
          else
            echo "ERROR: Grafana is redirecting to $(curl -L -isS https://grafana.admin.uds.dev --output /dev/null -w '%{url_effective}')."
            exit 1
          fi
4 changes: 4 additions & 0 deletions src/grafana/values/values.yaml
@@ -2,6 +2,7 @@ sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard
    searchNamespace: ALL
  datasources:
    enabled: true
    label: grafana_datasource
@@ -37,3 +38,6 @@ grafana.ini:
    role_attribute_strict: true
    # Automatically redirect to the SSO login page
    auto_login: true

service:
  appProtocol: "http"
13 changes: 11 additions & 2 deletions src/keycloak/chart/templates/istio-admin.yaml
@@ -15,10 +15,19 @@ spec:
            paths:
              - "/admin*"
              - "/realms/master*"
              - "/metrics*"
      from:
        - source:
            notNamespaces: ["istio-admin-gateway"]
            notNamespaces:
              - istio-admin-gateway
    - to:
        - operation:
            paths:
              - /metrics*
      from:
        - source:
            notNamespaces:
              - istio-admin-gateway
              - monitoring
    - to:
        - operation:
            paths:
10 changes: 10 additions & 0 deletions src/keycloak/chart/templates/uds-package.yaml
@@ -4,6 +4,16 @@ metadata:
  name: keycloak
  namespace: {{ .Release.Namespace }}
spec:
  monitor:
    - selector:
        app.kubernetes.io/name: keycloak
        app.kubernetes.io/component: http
      podSelector:
        app.kubernetes.io/name: keycloak
      targetPort: 8080
      portName: http
      description: Metrics

  network:
    allow:
      - description: "UDS Operator"
28 changes: 12 additions & 16 deletions src/keycloak/tasks.yaml
@@ -2,24 +2,20 @@ includes:
  - config: https://raw.githubusercontent.com/defenseunicorns/uds-identity-config/v0.3.6/tasks.yaml

tasks:
  # These tests break single capability test checks
  - name: validate
    actions:
      - description: replace me
        cmd: echo "hello"
    # actions:
    #   - description: Validate admin interface
    #     wait:
    #       network:
    #         protocol: https
    #         address: keycloak.admin.uds.dev
    #         code: 200
    #   - description: Validate public interface
    #     wait:
    #       network:
    #         protocol: https
    #         address: sso.uds.dev
    #         code: 200
      - description: Validate admin interface
        wait:
          network:
            protocol: https
            address: keycloak.admin.uds.dev
            code: 200
      - description: Validate public interface
        wait:
          network:
            protocol: https
            address: sso.uds.dev
            code: 200

  - name: dev-theme
    actions:
20 changes: 20 additions & 0 deletions src/metrics-server/chart/templates/service-monitor.yaml
@@ -0,0 +1,20 @@
{{- if .Capabilities.APIVersions.Has "monitoring.coreos.com/v1" }}
# The serviceMonitor for metrics-server is unique due to permissive mTLS on its port, so it is created outside of the Package spec
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    uds/skip-sm-mutate: "true"
  name: metrics-server-metrics
  namespace: metrics-server
spec:
  endpoints:
    - path: /metrics
      port: https
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      app.kubernetes.io/name: metrics-server
{{- end }}
2 changes: 2 additions & 0 deletions src/metrics-server/values/values.yaml
@@ -6,3 +6,5 @@ readinessProbe:
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 5
metrics:
  enabled: true
7 changes: 0 additions & 7 deletions src/neuvector/chart/templates/uds-exemption.yaml
@@ -41,10 +41,3 @@ spec:
        `/proc`: monitoring of processes for malicious activity
        `/sys/fs/cgroup`: important files the controller wants to monitor for malicious content
        https://github.com/neuvector/neuvector-helm/blob/master/charts/core/templates/enforcer-daemonset.yaml#L108"

    - policies:
        - DropAllCapabilities
      matcher:
        namespace: neuvector
        name: "^neuvector-prometheus-exporter-pod.*"
      title: "neuvector-prometheus-exporter-pod"
20 changes: 11 additions & 9 deletions src/neuvector/chart/templates/uds-package.yaml
@@ -4,6 +4,16 @@ metadata:
  name: neuvector
  namespace: {{ .Release.Namespace }}
spec:
  # This is disabled pending further discussion/upstream changes to handle metrics with SSO setup
  # monitor:
  #   - selector:
  #       app: neuvector-prometheus-exporter
  #     podSelector:
  #       app: neuvector-prometheus-exporter-pod
  #     portName: metrics
  #     targetPort: 8068
  #     description: "Metrics"

  sso:
    - name: Neuvector
      clientId: uds-core-admin-neuvector
@@ -23,6 +33,7 @@ spec:
          global_role: admin
        - group: /UDS Core/Auditor
          global_role: reader
  network:
    expose:
      - service: neuvector-service-webui
@@ -69,15 +80,6 @@
        port: 30443
        description: "Webhook"

      - direction: Ingress
        remoteNamespace: monitoring
        remoteSelector:
          app: prometheus
        selector:
          app: neuvector-prometheus-exporter-pod
        port: 8068
        description: "Prometheus Metrics"

      - direction: Egress
        remoteNamespace: tempo
        remoteSelector:
13 changes: 6 additions & 7 deletions src/neuvector/tasks.yaml
@@ -29,10 +29,9 @@ tasks:
            name: app=neuvector-scanner-pod
            condition: Ready
            namespace: neuvector
      # todo: Fix single package validation checks in CI where Istio isn't installed
      # - description: Validate Neuvector Interface
      #   wait:
      #     network:
      #       protocol: https
      #       address: neuvector.admin.uds.dev
      #       code: 200
      - description: Validate Neuvector Interface
        wait:
          network:
            protocol: https
            address: neuvector.admin.uds.dev
            code: 200
7 changes: 4 additions & 3 deletions src/neuvector/values/monitor-values.yaml
@@ -1,12 +1,13 @@
leastPrivilege: false

exporter:
  # Temporarily disabled until we can get the monitor chart working
  # via https://github.com/neuvector/neuvector-helm/pull/355
  # This is disabled pending further discussion/upstream changes to handle metrics with SSO setup
  enabled: false
  # Disable serviceMonitor to handle standalone testing, handled by UDS Package
  serviceMonitor:
    enabled: false
  apiSvc: neuvector-svc-controller-api:10443
  svc:
    enabled: true
    # Temporary disabling of service to allow adding appProtocol via custom service in the config chart
    enabled: false
    type: ClusterIP
7 changes: 2 additions & 5 deletions src/neuvector/values/registry1-monitor-values.yaml
@@ -5,8 +5,5 @@ exporter:
    tag: 5.3.2

  containerSecurityContext:
    runAsUser: 1002
    runAsGroup: 1002
    capabilities:
      drop:
        - ALL
    runAsUser: 1001
    runAsGroup: 1001
40 changes: 40 additions & 0 deletions src/pepr/operator/controllers/monitoring/service-monitor.spec.ts
@@ -0,0 +1,40 @@
import { describe, expect, it } from "@jest/globals";
import { generateServiceMonitor } from "./service-monitor";

describe("test generate service monitor", () => {
  it("should return a valid Service Monitor object", () => {
    const pkg = {
      apiVersion: "uds.dev/v1alpha1",
      kind: "Package",
      metadata: {
        name: "test",
        uid: "f50120aa-2713-4502-9496-566b102b1174",
      },
    };
    const portName = "http-metrics";
    const metricsPath = "/test";
    const selectorApp = "test";
    const monitor = {
      portName: portName,
      path: metricsPath,
      targetPort: 1234,
      selector: {
        app: selectorApp,
      },
    };
    const namespace = "test";
    const pkgName = "test";
    const generation = "1";
    const payload = generateServiceMonitor(pkg, monitor, namespace, pkgName, generation);

    expect(payload).toBeDefined();
    expect(payload.metadata?.name).toEqual(`${pkgName}-${selectorApp}-${portName}`);
    expect(payload.metadata?.namespace).toEqual(namespace);
    expect(payload.spec?.endpoints).toBeDefined();
    if (payload.spec?.endpoints) {
      expect(payload.spec.endpoints[0].port).toEqual(portName);
      expect(payload.spec.endpoints[0].path).toEqual(metricsPath);
    }
    expect(payload.spec?.selector.matchLabels).toHaveProperty("app", "test");
  });
});
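
Based on those assertions, the `ServiceMonitor` generated for this test case would look roughly like the following sketch (reconstructed from the expectations above; any fields not asserted on are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: test-test-http-metrics # `${pkgName}-${selectorApp}-${portName}`
  namespace: test
spec:
  selector:
    matchLabels:
      app: test
  endpoints:
    - port: http-metrics
      path: /test
```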
